Yesterday about a hundred thousand people visited this blog due to my post on names, and the server it was on died several fiery deaths. This has been a persistent issue for me in dealing with Apache (the site dies nearly every time I get Reddited — with only about 10,000 visitors each time, which shouldn’t be a big number on the Internet), but no amount of enabling WordPress cache plugins, tweaking my Apache settings, upgrading the VPS’ RAM, or Googling led me to a solution.

However, necessity is the mother of invention, and I finally figured out what was up yesterday. The culprit: KeepAlive.

Setting up and tearing down HTTP connections is expensive for both servers and clients, so Apache keeps connections open for a configurable amount of time after it has finished a request.  This is an extraordinarily sensible default, since the vast majority of HTTP requests will be followed by another HTTP request — fetch dynamically generated HTML, then start fetching linked static assets like stylesheets and images, etc.  For example, of the 43 requests it takes to load a typical page on this site, 42 are followed by another request within 3 seconds.  It is a huge throughput win.  However, if you’re running a memory-constrained VPS and get hit by a huge wave of traffic, KeepAlive will kill you.
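For reference, the directives that control this behavior live in apache2.conf (or an included config file, depending on your distribution).  The values below are only illustrative; the 3 second timeout is what my server had:

# Whether to keep connections open after each request
KeepAlive On
# How many requests to serve over one connection before closing it
MaxKeepAliveRequests 100
# How many seconds to hold an idle connection open waiting for the next request
KeepAliveTimeout 3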

When I started getting hit by the wave yesterday, I had 512MB of RAM and a cap (ServerLimit = MaxClients) of 20 worker processes to deal with them.  Each worker was capable of processing a request in a fifth of a second, because everything was cached.  This implies that my throughput should have been close to 20 workers * 5 requests per second * 60 seconds = 6,000 satisfied clients a minute, enough to withstand even a mighty slashdotting.  (That is a bit of an overestimate, since there were also static assets being requested with each hit, but to fix an earlier Reddit attack I had manually hacked the heck out of my WordPress theme to load static assets from Bingo Card Creator’s Nginx, because there seems to be no power on Earth or under it that can take Nginx down.)

However, I had KeepAlive on, set to three seconds.  This meant that for every 200ms a worker spent streaming cached content to a client, it spent 3 seconds sucking its thumb waiting for that client to come back and ask for something else.  In the meantime, other clients were stacking up like planes over O’Hare.  The first twenty clients get in and, from the perspective of every other client, the site totally dies for three seconds.  Then the next twenty clients get served, and the site continues to be dead for everybody else.  Cycle, rinse, repeat.  The worst part was that people were joining the queue faster than their clients were either getting served or timing out, so it was essentially a denial of service attack caused by the default settings.  The throughput of the server went from about 6,000 requests per minute to about 380 requests per minute.  380 is, well, not quite enough.
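Back-of-the-envelope, using the figures above:

KeepAlive off:  20 workers / 0.2 seconds per request = 100 requests/second, about 6,000 a minute
KeepAlive on:   20 workers / 3.2 seconds per request ≈ 6 requests/second, about 380 a minute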

Thus the solution: turning KeepAlive off.  This caused CPU usage to spike quite a bit, but since the caching plugin was working, it immediately alleviated all of the user-visible problems.  Bingo, done.
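Concretely, the fix was one line in apache2.conf (that is where it lives on Ubuntu; the file may differ on your distribution), followed by a restart:

# In apache2.conf:
KeepAlive Off

# Then restart Apache so the change takes effect:
sudo /etc/init.d/apache2 restart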

Since I tried about a dozen things prior to hitting on this, I thought I’d quickly write them down in case you are an unlucky sod Googling for Apache settings for your VPS, possibly Ubuntu Apache settings, or that sort of thing:

  • Increase VPS RAM: Not really worth doing unless you’re on 256MB.  Apache should be able to handle the load with 20 processes.
  • Am I using pre-fork Apache or the worker MPM? If you’re on Ubuntu, you’re probably using pre-fork Apache, and worker MPM settings will be totally ignored.  You can check this by running apache2 -l.  (This is chosen at compile time and can’t be altered via the config files, so if — like me — you just apt-get your way around getting common programs installed, you’re likely stuck.)
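The output will look something like the following; prefork.c in the list means the pre-fork MPM, while worker.c would mean the worker MPM:

$ apache2 -l
Compiled in modules:
  core.c
  mod_log_config.c
  mod_logio.c
  prefork.c
  http_core.c
  mod_so.c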
  • What should my pre-fork settings be then?

Assuming 512 MB of RAM and you are only running Apache and MySQL on the box:

<IfModule mpm_prefork_module>
StartServers          2
MinSpareServers       2
MaxSpareServers       5
ServerLimit          20
MaxClients           20
MaxRequestsPerChild  10000
</IfModule>
You can bump ServerLimit and MaxClients to 48 or so if you have 1GB of RAM.  Note that this assumes you’re using a fairly typical WordPress installation and that you’ve tried to optimize Apache’s memory usage.  If you see your VPS swapping, move those numbers down (and restart Apache) until it stops swapping.  Apache being inaccessible is bad; swapping might slow your server down badly enough to kill even your SSH connection, and then you’ll have to reboot and pray you can get in fast enough to tweak settings before it happens again.
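When picking those numbers for your own box, a rough rule of thumb is MaxClients ≈ (RAM you can spare for Apache) / (per-process memory).  You can eyeball both like so:

# Resident memory (RSS, in KB) of each Apache process, largest last:
ps -ylC apache2 --sort=rss
# Check whether the box is dipping into swap:
free -m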
  • How do I tweak Apache’s memory usage? Turn off modules you don’t need.  Go to /etc/apache2/mods-enabled.  Take note of how many things there are that you’re not using.  Run sudo a2dismod (name of module) for them, then restart Apache.  This literally halved my per-process memory consumption last night, which let me run twice as many processes.  (That still won’t help you if KeepAlive is on, but it could majorly increase responsiveness if you’ve eliminated that bottleneck.)  Good choices for disabling are, probably, everything that starts with dav, everything that starts with auth (unless you’re securing wp-admin at the server layer — in that case, enable only the module you need for that), and userdir.
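For example, on an Ubuntu box that might look like the following (the module names are only examples of likely candidates; check your own list before disabling anything):

cd /etc/apache2/mods-enabled
ls                                  # everything listed here gets loaded at startup
sudo a2dismod dav dav_fs userdir    # disable the ones you are not using
sudo /etc/init.d/apache2 restart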
  • What cache to use? WordPress Super Cache.  Installs quickly (follow the directions to the letter, especially regarding permissions), works great.  Don’t try to survive a Slashdotting without it.
  • Any other tips?  Serve static files through Nginx.  Find a Rails developer to explain it to you if you haven’t done it before — it is easier than you’d think and will take major load off your server (Apache only serves like 3 requests of the 43 required to load a typical page on my site — and two of those are due to a plugin that I can’t be bothered to patch).
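A minimal sketch of what that can look like in Nginx, assuming you point a separate hostname at the directory holding your assets and edit your theme to reference it (static.example.com and the paths here are made-up placeholders, not my real setup):

# Illustrative only; adjust server_name and root to your setup.
server {
    listen 80;
    server_name static.example.com;
    root /var/www/blog/wp-content;
    access_log off;    # static hits do not need logging
    expires 30d;       # let browsers cache aggressively
}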
  • My server is slammed and I can’t get into the WordPress admin to enable the caching plugin I just installed:  Make sure Apache’s KeepAlive is off.  Then change the access control directives for the blog in your Apache configuration to
<Directory /var/www/blog-directory-getting-slammed-goes-here>
Options FollowSymLinks
AllowOverride All
Order deny,allow
Deny from all
# Put your own IP address on the next line:
Allow from your.ip.address.here
</Directory>
This will have Apache deny requests from every client other than yourself (although if KeepAlive is on, Apache will still hold each denied client’s connection open so it can deny their next request promptly, which won’t do you a lick of good — again, turn KeepAlive off).  That should let you get into the WordPress admin to enable and test caching.  After doing so, you can switch to Allow from all and then test to see whether your site is now surviving.
Sidenote: If you can possibly help it, I recommend Nginx over Apache.  I use Apache because a couple of years ago it was not simple to use Nginx with PHP.  This is no longer the case.  The default settings (or whatever you’ve copied from the My First Rails Nginx Configuration you just Googled) are _much_ more forgiving than Apache’s defaults.  It is extraordinarily difficult to kill Nginx unless you set out to do so.  Apache’s configuration, on the other hand, is a whole mess of black magic with subtle interactions that will kill you under plausible deployment scenarios, and the official documentation has copious explanations of what the settings do and almost nothing about why or how you should configure them.  Hopefully this will save you, brave Googling blog owner from the future, from having to figure it all out by trial and error while your server is down.  Godspeed.