Got a burning desire to cheat on your next crossword or Scrabble game? If so, you'll want to explore Dave's latest shell script.
It's been a while since I've delved into the world of games and game programming in this column, so I thought it'd be interesting to look at writing a word-finding or matching script from the perspective of crossword puzzles and Scrabble-like games.
You know the drill. You're playing and have encountered a six-letter word with the clue “cat, family”, and you know only a few letters: _ E _ I _ _. You've already figured out “feline”, but can you write a script that you could give that sort of partial match and have it figure out possibilities? Let's give it a whirl.
Linux includes the slick GNU aspell program, but the dictionary is compiled, and it's also not designed for word games but is instead based on common English language usage—not optimal.
Fortunately, you can grab some free word-list files off the Internet to use as the basis for this particular script. The one I work with here is the English Open Word List, which contains almost 129K words. Grab a copy for yourself at dreamsteep.com/projects/the-english-open-word-list.html.
The download is a ZIP archive, and it's a bit of a mess, sprawling across multiple directories and ultimately creating a set of 26 files called A Words.txt, B Words.txt and so on. I dumped it all into a single file called words.txt by using the following:
cat *Words.txt > words.txt
The resultant file is not tiny:
$ wc ~/words.txt 128985 128985 1123743 /Users/taylor/words.txt
That's okay though. On a modern Linux system, it'll still be lightning fast, even with 128,985 lines of data.
Let's start by looking up that partial match for “feline” from earlier. Fortunately, we don't need to look much further than grep, because we can express what we need with a regular expression:
$ grep -E '.e.i..' words.txt
The problem is, a quick glance at the output reveals that this pattern isn't good enough:
abbreviate aberdevine abiogenist abstemious academical
To root the search so that it's an exact match, not a portion within a larger word, we need to root the regular expression. This is done with ^ to denote the beginning of the line and $ to denote the end:
$ grep -E '^.e.i..$' words.txt aecium aedile aerial aerier
That's better, except there are a lot of words that match this pattern. In fact, it's surprisingly common. How common? Let's see:
$ grep -E '^.e.i..$' words.txt | wc -l 264
Sheesh, there are 264 possibilities. Drop in another letter though, and it's a bit more manageable:
$ grep -E '^.e.in.$' words.txt | wc -l 58
One more letter, and although theoretically we should be able to guess it anyway, at least the choices are reduced to a viewable list:
$ grep -E '^.elin.$' words.txt feline heling reline
That's actually the problem with crossword-puzzle-finder apps. There simply are too many words in the English language. Still, it's quite possible that for some letter combinations, even having most letters missing still produces a short list:
$ grep -E '^...cc.t.$' words.txt braccate placcate spiccato staccato stoccata
This should be motivation to learn how to solve as much of that darn crossword puzzle as possible before turning to our little shell solution!
Games like Scrabble and Words With Friends present an opposite problem. We know all the letters; we just want to figure out what words we can create with them.
On first blush, the solution could be as simple as searching the words database for all words that contain zero or one occurrence of each letter in the sequence. Not so fast though—look at this:
$ grep -E '^s*t*a*c*c*a*t*o*$' words.txt aa act at cat coo oo sac sat scat scatt so st staccato ta taco tact tao tat tatt tattoo to too
There are two problems here. The first is that the * actually is shorthand for zero or more matches, and the second is that the letters in the pattern are analyzed in order of occurrence. So, although “tao” is a valid word extracted from the letters “STACCATO”, the word “tots” never will be found, because the letter s appears before the others in the pattern.
The first problem can be solved by making a more complex regular expression: X{0,1} means zero or exactly one occurrence of the letter X in the pattern. It's definitely more complex:
$ grep -E '^s{0,1}t{0,1}a{0,1}c{0,1}c{0,1}a{0,1}t{0,1}o{0,1}$' ↪words.txt aa act at cat sac sat scat so st staccato ta taco tact tao tat to
The second problem of letter order is a lot harder to solve. We could generate random sequences, but some quick math shows that if we have ten characters, that's 10*9*8*7*6*5*4*3*2 possible combinations—a lot!
However, a more interesting way to try to address this is to take a complete left turn and lose this increasingly complicated regular-expression approach by using a chained set of simple one-letter grep patterns instead.
That's not quite what we'll do though, because the first thing we need to do is screen out all words that contain letters not in our acceptable set. That's done with the set notation:
$ grep -vE '[^staccato]' words.txt aa accoast accoasts accost accosts
That's still not quite right, because we're not taking into account the frequency of individual letters (note that “accoasts” has two occurrences of the letter s, but there's only one in our original letter set).
Let's wrap this up here for this month, however. Next month, I'll dig deeper into the problem and see if I can come up with some sort of smart solution.