Work the Shell: Breaking Up Apache Log Files for Analysis

Dave tackles analysis of the ugly Apache web server log. By Dave Taylor

I know, in my last article I promised I'd jump back into the mail merge program I started building a while back. Since I'm having some hiccups with my AskDaveTaylor.com web server, however, I'm going to claim editorial privilege and bump that yet again.

What I need to do is be able to process Apache log files and isolate specific problems and glitches that are being encountered—a perfect use for a shell script. In fact, I have a script of this nature that offers basic analytics in my book Wicked Cool Shell Scripts from O'Reilly, but this is a bit more specific.

Oh Those Ugly Log Files

To start, let's take a glance at a few lines out of the latest log file for the site:


$ head sslaccesslog_askdavetaylor.com_3_8_2019 
18.144.59.52 - - [08/Mar/2019:06:10:09 -0600] "GET /wp-content/
↪themes/jumpstart/framework/assets/js/nivo.min.js?ver=3.2
 ↪HTTP/1.1" 200 3074
"https://www.askdavetaylor.com/how-to-play-dvd-free-windows-
↪10-win10/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
 ↪AppleWebKit/537.36 (KHTML, like Gecko) Chrome/
 ↪64.0.3282.140 Safari/537.36 Edge/18.17763 X-Middleton/1"
 ↪52.53.151.37 - - [08/Mar/2019:06:10:09 -0600] "GET
 ↪/wp-includes/js/jquery/jquery.js?ver=1.12.4 HTTP/1.1" 
 ↪200 33766 "https://www.askdavetaylor.com/how-to-play
↪-dvd-free-windows-10-win10/" "Mozilla/5.0 (Windows NT 
 ↪10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) 
 ↪Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763 
 ↪X-Middleton/1" 18.144.59.52 - - [08/Mar/2019:06:10:09 
 ↪-0600] "GET /wp-content/plugins/google-analytics-for-
↪wordpress/assets/js/frontend.min.js?ver=7.4.2 HTTP/1.1" 
 ↪200 2544 "https://www.askdavetaylor.com/how-to-play
↪-dvd-free-windows-10-win10/"
 ↪"Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
 ↪AppleWebKit/537.36 (KHTML, like Gecko) 
 ↪Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763 
 ↪X-Middleton/1"

It's big and ugly, right? Okay, then let's just isolate a single entry to see how it's structured:


18.144.59.52 - - [08/Mar/2019:06:10:09 -0600] "GET
 ↪/wp-content/themes/jumpstart/framework/assets/js/
↪nivo.min.js?ver=3.2 HTTP/1.1" 200 3074
"https://www.askdavetaylor.com/how-to-play-dvd-free-windows-
↪10-win10/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 
 ↪Safari/537.36 Edge/18.17763 X-Middleton/1"

That's still obfuscated enough to kick off a migraine!

Fortunately, the Apache website has a somewhat clearer explanation of what's known as the custom log file format that's in use on my server. Of course, it's described in a way that only a programmer could love:


LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" 
 ↪'\"%{User-agent}i\""

That's enough info to help decode the log entry. I'll define each of the percent-format sequences as I go, so you can get a sense of how it's all tied together too:


%h = IP Address = 18.144.59.52
%l = ID of client = -
%u = UserID of client = -
%t = Time of request = [08/Mar/2019:06:10:09 -0600]
%r = Request = "GET /wp-content/themes/jumpstart/framework/
↪assets/js/nivo.min.js?ver=3.2 HTTP/1.1"
%>s = Status code = 200
%b = Size of request = 3074
Referrer = "https://www.askdavetaylor.com/how-to-play-dvd-
↪free-windows-10-win10/"
User Agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
 ↪AppleWebKit/537.36 (KHTML, like Gecko) 
 ↪Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763
 ↪X-Middleton/1"

Or, to make it a bit clearer yet:


IP - - TIMESTAMP REQUEST STATUS SIZE REFERRER USERAGENT

This becomes complicated to parse because there are two different types of field separator: a space for each of the major fields, but since some of the values can contain spaces, quotes to delimit the start/end of fields Request, Referrer and User Agent.

As a general rule, shell utilities aren't so good at these sort of mixed field separators, so it's time for a bit of out-of-the-box thinking!

Breaking Down Fields with Dissimilar Delimiters

It's true that the fields are divided up with dissimilar delimiters (say that ten times fast), but you can process the information in stages. You can examine the line by just processing the quote delimiter with this clumsy code block:


while read logentry
do
   echo "f1 = $(echo "$logentry" | cut -d\" -f1)"
   echo "f2 = $(echo "$logentry" | cut -d\" -f2)"
   echo "f3 = $(echo "$logentry" | cut -d\" -f3)"
   echo "f4 = $(echo "$logentry" | cut -d\" -f4)"
   echo "f5 = $(echo "$logentry" | cut -d\" -f5)"
   echo "f6 = $(echo "$logentry" | cut -d\" -f6)"
done < $accesslog

Since it's just an interim step on the development of the final shell script, I'm not going to bother cleaning it up or making it more efficient.

Running this against the first line of the accesslog, here's what's revealed:


f1 = 18.144.59.52 - - [08/Mar/2019:06:10:09 -0600] 
f2 = GET /wp-content/themes/jumpstart/framework/assets/
↪js/nivo.min.js?ver=3.2 HTTP/1.1
f3 =  200 3074 
f4 = https://www.askdavetaylor.com/how-to-play-dvd-free-
↪windows-10-win10/
f5 =  
f6 = Mozilla/5.0 (Windows NT 10.0; Win64; x64) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 
Safari/537.36 Edge/18.17763 X-Middleton/1

What's important to notice here is field 3. Field 3 (f3) has both the return code—200, in this case—and the total number of bytes in this transaction, 3074.

This means that if f3 is then divided by the space delimiter, you can identify both return code and bytes easily enough:


f3=$(echo "$logentry" | cut -d\" -f3)
   returncode="$(echo $f3 | cut -f1 -d\  )"
   bytes="$(echo $f3 | cut -f2 -d\  )"

Using a space as a delimiter makes for a weird-looking command line, as you can see, but the \ forces the very next character to be interpreted as the specified value, first a double quote, then a space character.

Extracting Just the Errors

Now, can you spin through the entire log file and just pull out error codes? Sure you can, with just a simplification and tweak of the while loop:


while read logentry
do
   f3=$(echo "$logentry" | cut -d\" -f3)
      returncode="$(echo $f3 | cut -f1 -d\  )"
      bytes="$(echo $f3 | cut -f2 -d\  )"

   echo "$entry: returncode = $returncode, bytes = $bytes"
   entry=$(( $entry + 1 ))
done < $accesslog

Since a return code of 200 is a success, it's easy to grep -v and see what other return codes show up in the log file:


$ sh breakdown.sh  | grep -v 200 
113: returncode = 405, bytes = 42
138: returncode = 405, bytes = 42
177: returncode = 301, bytes = -
183: returncode = 301, bytes = -
186: returncode = 405, bytes = 42
187: returncode = 404, bytes = 11787
220: returncode = 404, bytes = 11795
279: returncode = 405, bytes = 42
397: returncode = 301, bytes = -

Error 405 is (according to the W3 Web standards site) "Method Not Allowed", while 301 is "Moved Permanently", and 404 is a standard "Not Found" error when someone requests a resource that the server cannot find.

Useful, but let's take the next step. For every query where the return code is not a 200 "OK" status, let's show the original log file line in question. This time, let's modify the script to do the 200 filtering too:


while read logentry
do
   f3=$(echo "$logentry" | cut -d\" -f3)
      returncode="$(echo $f3 | cut -f1 -d\  )"
      bytes="$(echo $f3 | cut -f2 -d\  )"

   if [ $returncode != "200" ] ; then
     echo "$returncode ($entry): $logentry"
   fi

   entry=$(( $entry + 1 ))
done < $accesslog

The results then look like this:


$ sh breakdown.sh
405 (113): 3.122.179.106 - - [08/Mar/2019:06:10:11 -0600] 
"GET /xmlrpc.php HTTP/1.1" 405 42 "-" "Mozilla/5.0 (X11; 
Linux i686; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 
 ↪X-Middleton/1"
405 (138): 35.180.37.73 - - [08/Mar/2019:06:10:21 -0600] 
"GET /xmlrpc.php HTTP/1.1" 405 42 "-" "Mozilla/5.0 (X11; 
Linux i686; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 
 ↪X-Middleton/1"
301 (177): 34.239.180.102 - - [08/Mar/2019:06:10:30 -0600] 
"GET /how_do_i_restructure_my_wordpress_blog_without_losing_seo 
 ↪HTTP/1.1" 301 - "-" "Mozilla/5.0 (Windows NT 6.1; 
 ↪WOW64; rv:29.0) Gecko/20120101 Firefox/29.0 
 ↪X-Middleton/1"

It's useful to be able to see the log file entry line, the return error code and the full log file entry line. Is there a pattern? Do they all have the same user agent (for example, a bot)? Are they from the same IP address? A pattern based on time of day?

With a judicious use of wc, I also can ascertain that this particular log file encompasses 99,309 total hits, of which 4,393 (4.4%) are non-200 results.

Another useful feature for this script would be to create multiple output files automatically, one per error code. I shall leave that, however, as an exercise for you, dear reader!

And, for my next article, I'll jump back into that mail merge script!

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a really long time. He's the author of Learning Unix for Mac OS X and Wicked Cool Shell Scripts. You can find him on Twitter as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.

Dave Taylor