If your server has high load when no sysadmin is logged in, use sar to find out what happened.
As someone who's been working as a system administrator for a number of years, I find it easy to take for granted the tools I've used for a long time and to assume everyone has heard of them. Of course, new sysadmins get into the field every day, and even seasoned sysadmins don't all use the same tools. With that in mind, I decided to write a few columns where I highlight some common-but-easy-to-overlook tools that make life as a sysadmin (and really, any Linux user) easier. I start the series with a classic troubleshooting tool: sar.
There's an old saying: “When the cat's away the mice will play.” The same is true for servers. It's as if servers wait until you aren't logged in (and usually in the middle of REM sleep) before they have problems. Logs can go a long way to help you isolate problems that happened in the past on a machine, but if the problem is due to high load, logs often don't tell the full story. In my March 2010 column “Linux Troubleshooting, Part I: High Load” (www.linuxjournal.com/article/10688), I discussed how to troubleshoot a system with high load using tools such as uptime and top. Those tools are great as long as the system still has high load when you are logged in, but if the system had high load while you were at lunch or asleep, you need some way to pull the same statistics top gives you, only from the past. That is where sar comes in.
sar is a classic Linux tool that is part of the sysstat package and should be available in just about any major distribution through your regular package manager. Once installed, sysstat is enabled by default on a Red Hat-based system, but on a Debian-based system (like Ubuntu), you might have to edit /etc/default/sysstat and make sure that ENABLED is set to true. On a Red Hat-based system, sar will log seven days of statistics by default. If you want to log more than that, you can edit /etc/sysconfig/sysstat and change the HISTORY option.
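If you want to confirm the Debian-side setting without opening an editor, here is a minimal sketch of one way to check it. It assumes the file contains a plain ENABLED="true" (or ENABLED=true) line; the function name sysstat_enabled is made up for this example and is not part of sysstat.

```shell
#!/bin/sh
# Report whether sysstat collection is switched on in a Debian-style
# defaults file. sysstat_enabled is a hypothetical helper for this sketch.
# With no argument it reads /etc/default/sysstat, the path named above.
sysstat_enabled() {
    conf="${1:-/etc/default/sysstat}"
    # Accept ENABLED=true or ENABLED="true", allowing surrounding whitespace.
    grep -Eq '^[[:space:]]*ENABLED="?true"?[[:space:]]*$' "$conf" 2>/dev/null
}

if sysstat_enabled; then
    echo "sysstat collection is enabled"
else
    echo "sysstat collection is disabled (or the file is missing)"
fi
```

On a Red Hat-based system the file will not exist, so the check simply reports the second message; pass a different path as the first argument if your distribution keeps the setting elsewhere.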
Once sysstat is configured and enabled, a cron job in /etc/cron.d/sysstat collects statistics about your system every ten minutes and stores them in a logfile under either /var/log/sysstat or /var/log/sa. A second cron job runs daily right before midnight and rotates out the day's statistics. By default, the logfiles are stamped with the current day of the month, so the logs rotate automatically, and each new file overwrites the one from a month ago.
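Because the files are named after the day of the month, it is easy to work out which file holds a given day's data. The following is only a sketch: sa_file_for_day and the SA_DIR variable are names I invented for illustration, and the default directory is the Debian-style /var/log/sysstat (set SA_DIR=/var/log/sa on a Red Hat-based system).

```shell
#!/bin/sh
# Print the path of the sar data file for a day of the month (1-31).
# sa_file_for_day and SA_DIR are hypothetical names used only in this sketch.
sa_file_for_day() {
    day="${1#0}"    # drop one leading zero so "08" is not parsed as octal
    dir="${SA_DIR:-/var/log/sysstat}"
    printf '%s/sa%02d\n' "$dir" "$day"
}

# Show the file that today's samples are being written to.
sa_file_for_day "$(date +%d)"
```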
After your system has had some time to collect statistics, you can use the sar tool to retrieve them. When run with no other arguments, sar displays the current day's CPU statistics:
$ sar
. . .
07:05:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
. . .
08:45:01 PM     all      4.62      0.00      1.82      0.44      0.00     93.12
08:55:01 PM     all      3.80      0.00      1.74      0.47      0.00     93.99
09:05:01 PM     all      5.85      0.00      2.01      0.66      0.00     91.48
09:15:01 PM     all      3.64      0.00      1.75      0.35      0.00     94.26
Average:        all      7.82      0.00      1.82      1.14      0.00     89.21
If you are familiar with the command-line tool top, the above CPU statistics should look familiar, as they are the same as you would get in real time from top. You can use these statistics just like you would with top, only in this case, you are able to see the state of the system back in time, along with an overall average at the bottom of the statistics, so you can get a sense of what is normal. Because I devoted an entire previous column to using these statistics to troubleshoot high load, I won't rehash all of that here, but essentially, sar provides you with all of the same statistics, just at ten-minute intervals in the past.
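When a day's worth of ten-minute samples scrolls by, it can help to show only the intervals where the CPU was busier than usual. Here is a rough sketch that filters sar's CPU output with awk. The function name busy_intervals is my own, the 90% default is an arbitrary choice, and it assumes %idle is the last column, as it is in the output above.

```shell
#!/bin/sh
# Print sar CPU samples whose %idle is below a limit (default 90).
# busy_intervals is a hypothetical helper; it reads sar output on stdin.
busy_intervals() {
    awk -v limit="${1:-90}" '
        $1 == "Average:" { next }    # skip the summary row
        # "all" is field 3 with AM/PM timestamps, field 2 with 24-hour ones
        ($2 == "all" || $3 == "all") && $NF + 0 < limit + 0 { print }
    '
}

# Demo on two of the samples shown above; only the 09:05:01 PM line is printed.
busy_intervals 92 <<'EOF'
07:05:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:45:01 PM     all      4.62      0.00      1.82      0.44      0.00     93.12
09:05:01 PM     all      5.85      0.00      2.01      0.66      0.00     91.48
Average:        all      7.82      0.00      1.82      1.14      0.00     89.21
EOF
```

On a live system you would pipe the real output through it instead, for example sar | busy_intervals 80.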
sar also supports a large number of different options you can use to pull out other statistics. For instance, with the -r option, you can see RAM statistics:
$ sar -r
. . .
07:05:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
. . .
08:45:01 PM    881280   2652840     75.06    355284   1028636   8336664    183.87
08:55:01 PM    881412   2652708     75.06    355872   1029024   8337908    183.89
09:05:01 PM    879164   2654956     75.12    356480   1029428   8337040    183.87
09:15:01 PM    886724   2647396     74.91    356960   1029592   8332344    183.77
Average:       851787   2682333     75.90    338612   1081838   8341742    183.98
Just like with the CPU statistics, here I can see RAM statistics from the past similar to what I could find in top.
Back in my load troubleshooting column, I referenced sysstat as the source for a great disk I/O troubleshooting tool called iostat. Although that provides real-time disk I/O statistics, you also can pass sar the -b option to get disk I/O data from the past:
$ sar -b
. . .
07:05:01 PM       tps      rtps      wtps   bread/s   bwrtn/s
. . .
08:45:01 PM      2.03      0.33      1.70      9.90     31.30
08:55:01 PM      1.93      0.03      1.90      1.04     31.95
09:05:01 PM      2.71      0.02      2.69      0.69     48.67
09:15:01 PM      1.52      0.02      1.50      0.20     27.08
Average:         5.92      3.42      2.50     77.41     49.97
I figure these columns need a little explanation:
tps: transfers (I/O requests issued to physical devices) per second.
rtps: read requests per second.
wtps: write requests per second.
bread/s: blocks read per second.
bwrtn/s: blocks written per second.
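Block counts are not the most intuitive unit, so it can be handy to convert them. The sketch below assumes a block is 512 bytes, which is what the sar man page states for these columns; check the man page for your sysstat version before relying on it. The function name blocks_to_kib is made up for this example.

```shell
#!/bin/sh
# Convert a blocks-per-second figure from sar -b into KiB/s.
# Assumption: one block is 512 bytes (per the sar man page).
# blocks_to_kib is a hypothetical helper name.
blocks_to_kib() {
    awk -v blocks="$1" 'BEGIN { printf "%.2f\n", blocks * 512 / 1024 }'
}

# bwrtn/s from the 08:45:01 PM sample above
blocks_to_kib 31.30    # prints 15.65
```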
sar can return a lot of other statistics beyond what I've mentioned. If you want to see everything it has to offer, simply pass the -A option, which returns a complete dump of all the statistics it has for the day, or just browse its man page.
So by default, sar returns statistics for the current day, but often you'll want to get information a few days in the past. This is especially useful if you want to see whether today's numbers are normal by comparing them to days in the past, or if you are troubleshooting a server that misbehaved over the weekend. For instance, say you noticed a problem on a server today between 5PM and 5:30PM. First, use the -s and -e options to tell sar to display data only between the start (-s) and end (-e) times you specify:
$ sar -s 17:00:00 -e 17:30:00
Linux 2.6.32-29-server (www.example.net)   02/06/2012   _x86_64_   (2 CPU)

05:05:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
05:15:01 PM     all      4.39      0.00      1.83      0.39      0.00     93.39
05:25:01 PM     all      5.76      0.00      2.23      0.41      0.00     91.60
Average:        all      5.08      0.00      2.03      0.40      0.00     92.50
To compare that data with the same time period from a different day, just use the -f option and point sar to the logfile under /var/log/sysstat or /var/log/sa that corresponds to that day. For instance, to pull statistics from the first of the month:
$ sar -s 17:00:00 -e 17:30:00 -f /var/log/sysstat/sa01
Linux 2.6.32-29-server (www.example.net)   02/01/2012   _x86_64_   (2 CPU)

05:05:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
05:15:01 PM     all      9.85      0.00      3.95      0.56      0.00     85.64
05:25:01 PM     all      5.32      0.00      1.81      0.44      0.00     92.43
Average:        all      7.59      0.00      2.88      0.50      0.00     89.04
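If you want to look at the same window across several days, you can generate the commands in a loop rather than typing each one. This is just a sketch: sar_window_cmd and SA_DIR are names I made up, the directory defaults to /var/log/sysstat, and the function only prints each command so you can review it before running anything.

```shell
#!/bin/sh
# Build the sar command line for one time window on one day of the month.
# sar_window_cmd and SA_DIR are hypothetical names used only in this sketch.
sar_window_cmd() {
    start="$1"
    end="$2"
    day="${3#0}"    # drop one leading zero so "08" is not parsed as octal
    printf 'sar -s %s -e %s -f %s/sa%02d\n' \
        "$start" "$end" "${SA_DIR:-/var/log/sysstat}" "$day"
}

# Print the commands for 5PM to 5:30PM on the first three days of the month.
for d in 1 2 3; do
    sar_window_cmd 17:00:00 17:30:00 "$d"
done
```

Once the printed commands look right, pipe the loop's output to sh to run them.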
You also can add all of the normal sar options when pulling from past logfiles, so you could run the same command and add the -r argument to get RAM statistics:
$ sar -s 17:00:00 -e 17:30:00 -f /var/log/sysstat/sa01 -r
Linux 2.6.32-29-server (www.example.net)   02/01/2012   _x86_64_   (2 CPU)

05:05:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
05:15:01 PM    766452   2767668     78.31    361964   1117696   8343936    184.03
05:25:01 PM    813744   2720376     76.97    362524   1118808   8329568    183.71
Average:       790098   2744022     77.64    362244   1118252   8336752    183.87
As you can see, sar is a relatively simple but very useful troubleshooting tool. Although plenty of other programs exist that can pull trending data from your servers and graph them (and I use them myself), sar is great in that it doesn't require a network connection, so if your server gets so heavily loaded it doesn't respond over the network anymore, there's still a chance you could get valuable troubleshooting data with sar.