Dave explores the many ways to solve programming problems in Linux with wegrep.
A project I'm involved with has made me think about how there are always many solution paths for any given problem in the Linux universe. For this other project, I wanted to cobble together a version of grep that let me specify proper regular expressions without having to worry about the -E flag and get a context for the matches too.
These are both popular expansions to grep, of course: the former demonstrated by both grep -E and the egrep shortcut, while the latter task is done with grep -C and, on some UNIX and Linux systems, wgrep.
But, there are a lot of different ways to create that particular functionality that don't involve relying on a modern version of grep; older versions might have the -E flag, but don't include support for contextualization.
So in this article, I thought it would be interesting to look at different ways to produce what I shall call wegrep, a version of grep that includes both the -C contextual window and the -E regular expression pattern support.
If you have the modern GNU grep, which you can ascertain by simply trying to use the -C flag, this all becomes easy:
$ grep -C
grep: option requires an argument -- C
There's a pretty gnarly usage statement after this, but if your version can understand -C or its wordy sibling --context, you're in luck.
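If you'd rather have a script make that check than eyeball the error message, one approach is to feed grep a line of known input and look at the exit status. This is only a minimal sketch, and the function name has_context_flag is my own invention for illustration, not something that ships with grep:

```shell
#!/bin/sh
# has_context_flag: hypothetical helper, not part of any grep package.
# Succeeds (exit 0) if the grep on the PATH accepts -C with a numeric argument.
has_context_flag() {
  printf 'probe\n' | grep -C 1 'probe' >/dev/null 2>&1
}

if has_context_flag ; then
  echo "grep supports -C"
else
  echo "grep lacks -C; a wrapper will need to build context itself"
fi
```

A grep that doesn't recognize the flag exits with an error status, so the else branch runs.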
Enter a “wrapper”, a simple script that changes the default behavior of a program. At its simplest, it actually can be a system alias, so this:
alias ls="/bin/ls -F"
is a sort of wrapper, ensuring that whenever I run the ls command, the -F flag is specified.
For this smarter version of grep, I simply could tell the user what flags to use or set specific flags with GREP_OPTIONS, an environment variable, but let's build out wegrep, as discussed.
For usage, it's going to be as simple as possible: command, pattern, source file. Like this:
wegrep '^Alice' wonderland.txt
This would search the file wonderland.txt for the regex “Alice”, rooted to the beginning of a line.
Easily done:
grep=/usr/bin/grep
if [ $# -ne 2 ] ; then
  echo "Usage: wegrep [pattern] filename" ; exit 1
fi
$grep -C2 -n -E "$1" "$2"
I even added some error checking to ensure that the user specified the right number of parameters, with a simple error message to hide some of the complexity of the real grep command.
For a test file, I'm going to use the first four paragraphs of Lewis Carroll's immortal Alice in Wonderland, as downloaded from Project Gutenberg (www.gutenberg.org).
Here's the result of my first invocation:
$ sh wegrep '^Alice' wonderland.txt
11-Down the Rabbit-Hole
12-
13:Alice was beginning to get very tired of sitting by her
14-sister on the bank, and of having nothing to do: once
15-or twice she had peeped into the book her sister was
--
--
26-
27-There was nothing so very remarkable in that; nor did
28:Alice think it so very much out of the way to hear the
29-Rabbit say to itself, 'Oh dear! Oh dear! I shall be
30-late!' (when she thought it over afterwards, it
You can see that grep does a good job with this task, showing me two lines of context above and below each match, and denoting which line contains the match itself by having the : separate the line number from the content.
But what if your version of grep doesn't have support for the -C flag? What if you actually need to identify which lines match the pattern, then roll your own context display?
Since grep is still available, and all but the most ancient of grep implementations support the -E flag to allow the user to specify a regular expression, the task can be broken into two parts: identify which lines match, then figure out a way to list lines (n-2)..n..(n+2), as shown in the above output.
The first task can be done surprisingly easily because grep has a handy -n flag that prefixes each matching line with its line number. With that, getting a list of which lines match the specified pattern is straightforward.
But, let's see what's output first:
$ grep -n -E '^Alice' wonderland.txt
13:Alice was beginning to get very tired of sitting by her
28:Alice think it so very much out of the way to hear the
Now it's a job for Superman! I mean, um, cut:
grep -n -E "$pattern" "$file" | \
  cut -d: -f1
13
28
Let's switch to the other task of showing a range of lines centered on the specified line. You could do this with a tortured pairing of head and tail, but sed is a much better tool for the job this time.
In fact, sed makes it easy. Want to grab lines 12, 13 and 14? This'll do the trick:
sed '12,14p' wonderland.txt
Well, not quite. The problem is that the default behavior of sed is to echo every line it sees in addition to whatever the user specifies, so you'll end up with every line from wonderland.txt and additionally have lines 12–14 appear a second time as the statement is matched and executed (the p suffix means “print”).
That's why if you're going to do anything with sed, it's critical to know its -n flag, which suppresses its desire to output every line it reads. Now here's a working command:
$ sed -n '12,14p' wonderland.txt
Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
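To convince yourself of the difference -n makes, it helps to count lines rather than read them. Here's a small sketch that builds its own five-line sample file (the file contents and the variable names are just for illustration) and runs the same range both ways:

```shell
#!/bin/sh
# Build a five-line sample file so the example is self-contained.
sample=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$sample"

# Without -n: all five lines are echoed, and lines 2-3 are printed again.
# The $(( )) wrapper strips any padding that wc adds on some systems.
without_n=$(( $(sed '2,3p' "$sample" | wc -l) ))

# With -n: only the two lines in the range are printed.
with_n=$(( $(sed -n '2,3p' "$sample" | wc -l) ))

echo "without -n: $without_n lines"
echo "with -n: $with_n lines"

rm -f "$sample"
```

The first count comes out as 7 (five lines plus the two repeats) and the second as 2.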
Can you see how to chain these together? It all can be done in a simple for loop (particularly if you ignore error checking for now). But again, there's another small step required: the line count n prior and n subsequent to the matching line n need to be calculated. That's easy math:
before=$(( $match - $context ))
after=$(( $match + $context ))
Here context specifies whether you want 1, 2, 3 or more lines of context above and below the matching line.
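One wrinkle with this math is that a match on the first line or two of a file yields a before value of zero or less, which sed generally won't accept as a line address. A value of after that runs past the end of the file is harmless, since sed simply stops when it runs out of input. Here's a minimal sketch of clamping the lower bound; the helper name context_range is hypothetical and not part of the script in this article:

```shell
#!/bin/sh
# context_range: hypothetical helper that prints "before,after" for a match.
# $1 = matching line number, $2 = lines of context on each side.
context_range() {
  before=$(( $1 - $2 ))
  after=$(( $1 + $2 ))
  # sed line addresses start at 1, so clamp anything lower.
  if [ "$before" -lt 1 ] ; then
    before=1
  fi
  echo "${before},${after}"
}

context_range 13 2    # prints 11,15
context_range 1 2     # prints 1,3
```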
Let's give this a whirl:
#!/bin/sh
# wegrep - grep with context and regular expressions
grep=/usr/bin/grep
sed=/usr/bin/sed
if [ $# -ne 2 ] ; then
  echo "Usage: wegrep [pattern] filename" ; exit 1
fi
for match in $($grep -n -E "$1" "$2" | cut -d: -f1)
do
  before=$(( $match - $context ))
  after=$(( $match + $context ))
  $sed -n '${before},${after}p' "$2"
done
exit 0
Except it turns out that there are two critical bugs in the above code, as is immediately apparent when you run your first test:
$ sh wegrep '^Alice' wonderland.txt
wegrep: line 14: 13:Alice - : syntax error in expression (error token is ":Alice - ")
Can you see the first bug? Line 14 is the calculation for the variable before.
So what's wrong? You need to initialize context with a value. Without one, the variable expands to nothing, so the mathematical expression is essentially:
13 -
Which is correctly flagged as an error. Easily fixed.
The second bug is more subtle, but here's the clue you get when you run the script with context defined as 1 near the top:
$ sh wegrep '^Alice' wonderland.txt
sed: 1: "${before},${after}p": unexpected EOF (pending }'s)
sed: 1: "${before},${after}p": unexpected EOF (pending }'s)
That's definitely odd. It's sed that's complaining, but what's wrong with the line that invokes sed?
Let's have another look at that line:
$sed -n '${before},${after}p' "$2"
Now can you see the error? It's a subtle and common problem in shell scripts: I'm using the wrong quotation marks. Remember, in a shell script, single quotation marks prevent the interpretation of variables. Switch it to double quotation marks, and everything now works great:
$ sh wegrep '^Alice' wonderland.txt
Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
There was nothing so very remarkable in that; nor did
Alice think it so very much out of the way to hear the
Rabbit say to itself, 'Oh dear! Oh dear! I shall be
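The quoting difference is easy to see in isolation, without sed in the picture at all. This short sketch assigns the same text to two variables, once with each kind of quotation mark; the values 12 and 14 are just sample numbers:

```shell
#!/bin/sh
before=12
after=14

# Single quotes: the shell passes the text through untouched.
single='${before},${after}p'

# Double quotes: the shell substitutes the variable values first.
double="${before},${after}p"

echo "single: $single"
echo "double: $double"
```

The first echo shows the literal ${before},${after}p that sed choked on, while the second shows 12,14p, which is a command sed understands.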
Now another problem rears its head: how do you differentiate between blocks that have matched? Easy: print a line of dashes (-----) before and after each match by adding a couple of echo statements to the for loop:
for match in $($grep -n -E "$1" "$2" | cut -d: -f1)
do
  before=$(( $match - $context ))
  after=$(( $match + $context ))
  echo "-----"
  sed -n "${before},${after}p" "$2"
  echo "-----"
done
This works, but it's a bit clunky as output goes, although it pretty closely matches what modern grep does with the -C flag:
$ sh wegrep '^Alice' wonderland.txt
-----
Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
-----
-----
There was nothing so very remarkable in that; nor did
Alice think it so very much out of the way to hear the
Rabbit say to itself, 'Oh dear! Oh dear! I shall be
-----
As a purist, I'd much rather have one dashed line between output blocks, one before the first match and one after the last, with no doubling of lines.
That's not hard to do, and there's a second task of adding back line numbers and ideally denoting which line has the match to the regular expression. But I'm out of room, so those tasks will have to wait until next month.