Hrmm… As many of you no doubt noticed, the community site has been up and down over the past two weeks or so. Actually, it’s the whole server that runs community, wiki, tracker, and planet. There’s a problem somewhere (obviously) that we’ve had a hell of a time tracking down. We’ve moved some sites to other servers, switched OSes, upped the VM memory, etc. with no luck. It’s gotta be in the software, so we’ll have to spend even more time tracking the thing down. Until then, we’ve implemented a magic/hackish “autoboot” function to preserve our sanity and the general availability of the affected sites. Implemented with Nagios technology, of course…
I’ve never been a huge advocate of restarting or rebooting anything with an automated script. It reminds me of the bad old days when non-techies thought they could fix a slow web app by walking into the server room and pressing the power button. Ouch. However, many Nagios users have done this type of thing, using event handlers to automatically restart crashed servers. In what could be considered either a nice admin hack or a horrible half-baked substitute for a real solution, we’ve decided to implement this functionality on this particular web server.
We’re using a nice magic cron job that runs several Nagios plugins to ascertain the general health and well-being of the server. If enough of the server’s performance metrics are out of bounds, the server will initiate an automatic restart. Not an ideal solution, but a reboot means the server will only be down for a minute instead of a few hours, and the server should recover from memory exhaustion, etc. when it’s back up.
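For the curious, here is a rough sketch of what a cron-driven check like this could look like. It is not our actual script: the plugin paths, thresholds, the `CRITICAL_LIMIT` value, and the function names are all assumptions made up for illustration. The only part taken as given is the standard Nagios plugin convention that exit codes 0, 1, 2, and 3 mean OK, WARNING, CRITICAL, and UNKNOWN.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical check list: paths and thresholds are assumptions, not our real config.
CHECKS = [
    ["/usr/lib/nagios/plugins/check_load", "-w", "8,6,4", "-c", "16,12,8"],
    ["/usr/lib/nagios/plugins/check_swap", "-w", "30%", "-c", "10%"],
    ["/usr/lib/nagios/plugins/check_http", "-H", "localhost", "-t", "20"],
]
CRITICAL_LIMIT = 2  # assumed: reboot when at least this many checks are CRITICAL


def run_check(command, timeout=60):
    """Run one plugin and return its exit code; UNKNOWN if it cannot be run."""
    try:
        return subprocess.run(command, capture_output=True, timeout=timeout).returncode
    except (OSError, subprocess.TimeoutExpired):
        return UNKNOWN


def should_reboot(exit_codes, limit=CRITICAL_LIMIT):
    """True when enough checks report CRITICAL; WARNING and UNKNOWN are ignored."""
    return sum(1 for code in exit_codes if code == CRITICAL) >= limit


def main(dry_run=True):
    """Run every check once; reboot (or just report, in a dry run) if too many fail."""
    codes = [run_check(command) for command in CHECKS]
    if should_reboot(codes):
        if dry_run:
            print("would reboot, exit codes:", codes)
        else:
            subprocess.run(["/sbin/shutdown", "-r", "now"], check=False)
    return codes
```

A cron entry would call `main(dry_run=False)` every few minutes. Counting only CRITICAL results, and treating a plugin that cannot run as UNKNOWN, is a deliberate choice in this sketch so that a missing or broken plugin never triggers a reboot by itself.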
We’ll watch and see how this solution works over the next few days. The real solution will require deeper investigation, but it’s nice to have an elegant hack that gets 97% of the job done at 1% of the cost.