Complexity, Uptime and the End of the World

Michael Nugent

Issue #212, December 2011

Poorly implemented monitoring systems can drive an administrator crazy. At best, they are distracting. At worst, they'll keep whoever is on pager duty up for nights at a time. This article discusses best practices for designing monitoring that keeps your services up and stays quiet when nothing is wrong.

After being in the computer industry for 20-odd years, I've come to realize there is a single thing everyone can agree on: no matter how new, how stable or how awesome any piece of technology is, it will break.

Fortunately, system administrators plan for these things. Whether it's a redundant server in the data center or a second availability zone in EC2, the first and best way to ensure uptime is to decrease the number of single points of failure across the network. There are drawbacks to this approach, though. Increasing a Web cluster from one to ten boxes decreases the chance of a hardware failure taking down the entire site by a factor of ten, but it also dramatically increases the expense and complexity of the network. Instead of running a single server, there's now a series of boxes with a shared data store and load balancers. That complexity has costs of its own: it's ten times as likely that a piece of hardware will fail and a system administrator will wake up, and that only counts the actual Web servers. Whether you're in a data center or in the cloud, this kind of layering of services significantly increases the chances that a single device will go down and alert in the middle of the night.

Preventing this kind of thing is usually high on a system administrator's list of desires, even if it tends to be pushed lower on the priority list in practice. Waking up in the middle of the night to fix a server or piece of software is bad for productivity and bad for morale. Two steps can help make sure this doesn't happen. The first is to implement the necessary amount of redundancy without increasing the complexity of the system past what is required for it to run. The second is to implement a monitoring system that lets you watch exactly what you care about, rather than worrying about which individual box is using how much RAM.

The End of the World methodology is a thought experiment designed to help choose the level of redundancy and complexity required for the application. It helps determine acceptable scenarios for downtime. Often when you ask people when it's acceptable for their sites to be down, they'll say that it never is, but that's not exactly true. If an asteroid strikes Earth and destroys most of the human race, is it necessary for the site to stay up? If the application is NORAD, maybe it is; for Groupon, not so much. That kind of uptime requires massive infrastructure placed in strategic locations around the globe, along with the kind of capital investment and staffing to which only large governments usually have access.

Backing off step by step from this kind of over-the-top disaster, you can find where the acceptable level is. What if the disaster is localized to a single continent? Is it acceptable to be down then? If the site is focused on customers in that region, it may be. If the site is an international tool, such as Amazon or Google, possibly not. What if the outage is local to the data center or availability zone where your boxes are kept? Most shops would like to stay up even if a backhoe cuts the power to their data center.

When the problem is framed this way, it becomes obvious that there is an acceptable level of downtime, and administrators can find the sweet spot between uptime and complexity. Finding the outer bounds of acceptable downtime also uncovers the requirements for monitoring the service as a whole. Notice that this is a service and not a server. Although it's easy to monitor whether a network interface is available, it's far more interesting to monitor the health of an entire cluster. In our ten-server cluster, if www6 goes down on a cluster that sits at 40% utilization at night, it's probably not worth getting up for. If the entire Web service goes down, that's something that needs to be acted upon immediately.

A monitoring system is basically a scheduler and data-collection tool that executes checks against a service and reports the results back to a common dashboard. It seems like one of those innocuous pieces of software that just runs in the background, like network graphing or log analysis, but it has a hidden ability to hurt an entire engineering department. False positives can wake people up in the middle of the night and create an ongoing dread of pager duty. The result is that people put things into maintenance mode just to quiet the false positives, and real failures can then go unnoticed.

Dealing with false positives often is more of a policy problem than a design problem. Choosing what to monitor is far more important than choosing how to monitor it. Many companies have a history of monitoring things like CPU and RAM usage, on the theory that spikes are sometimes precursors to crashes, so alerting on them seems reasonable. The problem is that many things can cause a computer to use CPU and RAM, and most of them are within the normal bounds of an operating system. When the system administrator checks on the box, the resource is in use, but the application is functioning without a problem. Unless there is a clear, documented link between RAM use over a certain level and a crashing service, skipping alerts on this kind of resource use leads to far fewer false positives. Monitors should be tied to a defined good or bad value with respect to a particular production service.
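For example, instead of paging on RAM use, a check can exercise the application itself and compare what it sees to a value the team has agreed is good or bad. The following is a minimal sketch in the style of the scripts later in this article; the health URL and the response-time limit are assumptions you'd replace with your own documented values:

#!/bin/bash
# Call as:
# check_app_response.sh ${url} ${max_seconds}
#
# A sketch only: the health URL and the acceptable response time
# are assumptions that should come from the application's own
# documented requirements, not from raw CPU or RAM numbers.

URL=$1
MAX_SECONDS=$2

# Time a single request against the production service.
ELAPSED=$(curl -s -o /dev/null -w '%{time_total}' --max-time 30 "$URL")

if [ $? -ne 0 ] ; then
  echo "Service at $URL did not respond"
  exit 1
fi

# Compare the measured time against the documented limit;
# bc handles the fractional seconds that curl reports.
if [ $(echo "$ELAPSED > $MAX_SECONDS" | bc) -eq 1 ] ; then
  echo "Service at $URL took ${ELAPSED}s (limit ${MAX_SECONDS}s)"
  exit 1
else
  echo "OK"
  exit 0
fi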

Another path that leads to a large number of false positives is using percentages across differently equipped boxes. For example, if a system has a 137G drive that's 95% full, it has only around 7G free. On sites with heavy traffic, or sites with a lot of instrumentation in the code, 7G can go pretty quickly. Apply the same 95% monitor to a Web server with a 2TB disk, and it seems like much less of an emergency; leaving “only” 100G free on a system overnight is usually not a problem. If the average disk use for a day of work on a particular box is 5G, monitoring for 15G left and allowing alerts only during business hours gives three days' notice. Alerts this far ahead of time let the system administrator plan downtime for the system if it is required, so that the server can be maintained without taking the supported service down.
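Put concretely, a disk check built around an absolute floor and a business-hours window might look like the following sketch; the mountpoint, the threshold and the 8-to-6 window are all assumptions to tune per box:

#!/bin/bash
# Call as:
# check_disk_free.sh ${mountpoint} ${min_free_gb}
#
# A sketch only: alert when free space drops below an absolute
# number of gigabytes rather than a percentage, and only during
# business hours. The mountpoint, the threshold and the 8-to-6
# window are assumptions to tune for each box.

MOUNT=$1
MIN_FREE_GB=$2

# df -P keeps each filesystem on one line; the fourth column is
# the available space in 1K blocks.
FREE_KB=$(df -P "$MOUNT" | tail -1 | awk '{print $4}')
FREE_GB=$((FREE_KB / 1024 / 1024))

HOUR=$(date +%H)

if [ "$FREE_GB" -lt "$MIN_FREE_GB" ] ; then
  # Only page people during business hours; overnight there is
  # still enough space left to last until morning.
  if [ "$HOUR" -ge 8 ] && [ "$HOUR" -lt 18 ] ; then
    echo "Only ${FREE_GB}G free on $MOUNT (want ${MIN_FREE_GB}G)"
    exit 1
  fi
fi

echo "OK"
exit 0

Many monitoring systems also can restrict notifications to a configured time window, which keeps the time-of-day logic out of the script itself.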

The two most popular open-source monitoring systems are Zenoss and Nagios, and they offer similar monitoring capabilities. Zenoss provides more functionality and ease of use, incorporating basic auto-discovery of nodes, built-in RRD graphing, syslog management and the ability to deduplicate events. Nagios provides a larger community and a lighter install than Zenoss, which lets administrators use their own graphing solutions without duplicating software. The best part is that they share a common format for monitoring scripts: the processes that do the actual checking of services.

Although both systems come with basic templates for monitoring HTTP and other common services, much of their power comes from the ability to write custom scripts. This is a great way to check not only that a Web server is up, but also that the application itself is working. Below is an example of a script that monitors the success of Hudson jobs by calling the Hudson JSON API:

#!/usr/bin/env ruby
# Call as:
# check_hudson_job.rb ${jobname} ${hostname}

require 'rubygems'
require 'json'
require 'net/http'
require 'uri'

jobname = ARGV[0]
hostname = ARGV[1]

# Ask Hudson for the most recent build of this job.
url = URI.parse("http://#{hostname}/job/#{jobname}/lastBuild/api/json")
res = JSON.parse(Net::HTTP.get_response(url).body)
lastResult = res["result"]

if lastResult == "SUCCESS"
  # Success: exit 0 and hand back the standard "OK" string.
  puts "OK|Status=0"
  exit(0)
else
  # Failure: pull the job's health report so the alert carries
  # a human-readable reason.
  failurl = URI.parse("http://#{hostname}/job/#{jobname}/api/json")
  failres = JSON.parse(Net::HTTP.get_response(failurl).body)
  health = failres["healthReport"][0]["description"]
  puts "Job #{jobname} broke: #{health}"
  exit(1)
end

The monitoring system calls the code with command-line parameters of the name of the job and the name of the host. The code then looks for the result from the Hudson server and checks for success. The return value and exit code are how the monitoring script replies to the monitoring system. A nonzero exit code indicates a failure, and the return value is a string that the system displays as the reason for the failure. On Zenoss, this is also used in deduplication. On success, the monitoring script has an exit code of 0 with a string returned in a special form for the system to process (see code).

Using this structure, system administrators can work with developers to build custom URLs that the monitoring system can access to determine the health of the application without worrying about every system in the set.

It may seem hard to swallow that it's acceptable to leave a box down overnight. A downed box may be the first in a cascading series of failures that takes out multiple servers and eventually the service itself, but that risk can be addressed directly at the load balancer or front-end appliance instead of by looking indirectly at the boxes themselves. Using this method, the alert can be set to go off after a certain number of boxes fail at certain times of day, and there is no need to solve harder problems, such as requiring each box to know the state of the entire cluster.
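A sketch of such a cluster-level check follows. It runs against the whole pool from the monitoring host rather than on the individual Web servers; the /healthcheck path and the daytime and overnight minimums are placeholders to fill in for your own cluster:

#!/bin/bash
# Call as:
# check_web_cluster.sh www1 www2 www3 ... www10
#
# A sketch only: count how many Web servers answer a health URL
# and alert only when the pool drops below what the time of day
# requires. The /healthcheck path and the daytime and overnight
# minimums are assumptions.

DAY_MINIMUM=8      # boxes needed during business hours
NIGHT_MINIMUM=4    # boxes needed overnight at 40% utilization

UP=0
for HOST in "$@" ; do
  if curl -s -f --max-time 5 -o /dev/null "http://$HOST/healthcheck" ; then
    UP=$((UP + 1))
  fi
done

HOUR=$(date +%H)
if [ "$HOUR" -ge 8 ] && [ "$HOUR" -lt 18 ] ; then
  MINIMUM=$DAY_MINIMUM
else
  MINIMUM=$NIGHT_MINIMUM
fi

if [ "$UP" -lt "$MINIMUM" ] ; then
  echo "Only $UP of $# Web servers healthy (need $MINIMUM)"
  exit 1
fi

echo "OK|Status=0"
exit 0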

So far, the design has been fairly agnostic about geographies and cloud footprint. For most applications, this doesn't make a lot of difference. With multiple geographies, each data center usually has its own instance of the monitoring system, each one monitoring its siblings in the other locations. Operating in the cloud offers greater flexibility. Although it still is necessary to monitor the monitoring system, this can be done easily using Amazon's great, but far less configurable, system to monitor the Nagios or Zenoss EC2 instances.
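In the multiple-data-center case, the cross-check can be as simple as an HTTP probe of the sibling's console. Here is a minimal sketch, assuming the Nagios or Zenoss web interface answers plain HTTP on a known port:

#!/bin/bash
# Call as:
# check_sibling_monitor.sh ${hostname} ${port}
#
# A sketch only: each monitoring server watches the web console
# of its sibling in the other data center. The assumption is that
# the console answers plain HTTP on a known port.

HOST=$1
PORT=$2

if curl -s -f --max-time 10 -o /dev/null "http://$HOST:$PORT/" ; then
  echo "OK|Status=0"
  exit 0
else
  echo "Monitoring console on $HOST:$PORT is not responding"
  exit 1
fi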

What really stands out about Amazon's cloud is that it's elastic. Hooking up the EC2 command-line programs to the monitoring service will allow new boxes to be launched if some are experiencing problems due to resource starvation, load or programs crashing on the box. Of course, this needs to be kept in check, or the number of instances could spiral out of control, but within reasonable bounds, launching new instances in place of crashing or overloaded ones from inside of a monitoring script is relatively easy.

Here is an example of a script that monitors the load of a Hadoop cluster and adds more boxes as the number of jobs running increases:

#!/bin/bash
# Call as:
# increase_amazon_set.sh ${threshold} ${AMI}

THRESHOLD=$1
AMI=$2

# The first line of "hadoop job -list" reports how many jobs
# are currently running; grab that count.
NUM_JOBS=`/opt/hadoop/current/bin/hadoop job -list | head -1 | awk '{print $1}'`

if [[ $NUM_JOBS -gt $THRESHOLD ]] ; then
  # Too many jobs running: warn and launch three more nodes
  # from the given AMI.
  echo "Warning: $NUM_JOBS running, increasing cluster size by 3"
  ec2-run-instances $AMI -n 3 --availability-zone us-east-1a
  exit 1
else
  echo "OK|Status=0"
  exit 0
fi

This follows the same format as the previous script, passing in variables from the command line and returning values to the monitoring system through the exit code and the returned string. The big difference here is that you're not just observing a problem and passing it off to a system administrator to act on. This script acts as an orchestrator and attempts to fix the problem it sees. Care should be taken to place proper bounds on this behavior so the computer cannot run amok on the network, but within those bounds, this kind of intelligent scheduler can be a powerful tool for automating tasks.
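One way to place those bounds is to count what is already running before launching anything new. The following sketch uses the same EC2 command-line tools as the script above; the hard cap, the single-instance launch and the availability zone are arbitrary examples:

#!/bin/bash
# Call as:
# bounded_launch.sh ${AMI} ${max_instances}
#
# A sketch only: count how many instances of this AMI are already
# running and refuse to launch more once a hard cap is reached.
# The cap and the availability zone are examples, and the parsing
# assumes the classic EC2 API tools output, in which each INSTANCE
# line carries the AMI ID and the instance state.

AMI=$1
MAX=$2

RUNNING=$(ec2-describe-instances | grep "^INSTANCE" | \
          grep "$AMI" | grep -c "running")

if [ "$RUNNING" -ge "$MAX" ] ; then
  echo "Already $RUNNING instances of $AMI running; not launching more"
  exit 1
fi

ec2-run-instances "$AMI" -n 1 --availability-zone us-east-1a
echo "OK|Status=0"
exit 0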

Although setting up a new monitoring system from scratch with great alerting rules and intelligent orchestration is a fine idea, it's often just not possible. Most organizations already have a monitoring system in place, and often it's full of old alerts and boxes that have been placed in maintenance mode because they're more noisy than broken. If this is the case, it's time to cut out the cruft. Delete all the current alerts, and take everything out of maintenance mode that isn't actually undergoing maintenance. Take the ten noisiest and worst-behaved devices, and either stop monitoring the items that are provoking false positives or rewrite the scripts so they provide more meaningful data. When those first ten are under control, move on to the next group. It may take a few iterations over a few days, but in the end, you'll care about the messages coming from what can be a very powerful tool.

Monitoring systems often are written off as a necessary annoyance, but with a little bit of effort, they can be made to work for you. Monitoring services rather than servers, looking at clustered applications as a whole and alerting only on actionable errors provides real metrics for capacity planning and lets system administrators sleep through the night so that they can be more proactive from day to day.

Michael Nugent has spent a good deal of his time designing large-scale solutions to fit into a tiny budget and leveraging Linux to fulfill the roles that typically would be filled by large commercial appliances. Recently, Michael has been working to design map-reduce clusters and elastic cloud systems for growing startups in the Silicon Valley area. When not building systems, he likes sailing, cooking and making things out of other things. Michael can be reached at michael@michaelnugent.org.