Work the Shell

A Word Finder for Words With Friends—Continued

Dave Taylor

Issue #216, April 2012

Three months after starting his project, Dave completes his Scrabble-helper script with a rather ingenious algorithm and some observations about where to go from here.

For my last few articles, we've been looking at Scrabble and Scrabble-like games, exploring different ways to work with a dictionary and pattern matching. The first installment focused on a crossword-puzzle word finder, something that proved surprisingly easy once we identified the regex pattern of <letter> <questionmark>, as in “??AR??C?Sâ”, to match words like “Starbucks”.

My last article, however, opened up the far more complicated world of Scrabble, wherein it became obvious that there's not going to be a simple solution to “find all the words that I can make out of the letters S R C R A E M” or similar.

The real challenge was letter frequency: we can make the word “RACE” out of the letters above, but can we make the word “ERASE”? We can't, because there's only one occurrence of the letter e, but that's a tricky thing to ascertain in a shell script.

As a result, we came up with a shell function that calculates how many times a letter occurs in a word as a simple way to test for out-of-bounds results:

occurrences()
{
  # how many times does 'letter' occur in 'word'?
  local count=$( echo $2 | grep -o $1 | wc -l )
  echo $count
}

We were using that in conjunction with a script called findword that extracts words from our previously downloaded word dictionary that match the set of seven letters, constrained to just those that contain five or more letters.

With this set of letters, we'd have:

access is a possibility -- length = 6
accesses is a possibility -- length = 8
acers is a possibility -- length = 5
acmes is a possibility -- length = 5
acres is a possibility -- length = 5

It's immediately clear that these aren't in fact all valid possibilities because of that darn letter-frequency problem. We have one c and one s, how can “accesses” be possible? Let's fix it.

Screening Against Letter Frequency

Here's a straightforward way to calculate the frequency of each letter in our pattern:

while [ $idx -lt 8 ] ; do
  letter=$(echo $1 | cut -c$idx) ; occurrences $letter $1
  echo letter $letter occurs $freq times in $1
  idx=$(( $idx + 1 ))
done

Note that we've had to tweak the occurrences script to set the global variable $freq to be the frequency of the given letter in the pattern. It's sloppy, but as with most shell script programming, this is intended to be more of a quick prototype than a grand elegant solution.

Running this on our pattern produces:

letter s occurs 1 times in srcraem
letter r occurs 2 times in srcraem
letter c occurs 1 times in srcraem
letter r occurs 2 times in srcraem
letter a occurs 1 times in srcraem
letter e occurs 1 times in srcraem
letter m occurs 1 times in srcraem

We can add some code to make the script more efficient by removing duplicate tests (for example, r should be tested only once), but we can skip that concern because of how our final approach folds that into the solution. Plus, the next step is the interesting code portion, where we'll use this data to test possible words against letter frequency in the original pattern.

The basic idea is we're going to test each possible word (identified earlier and saved in a temp file) by stepping through each letter, calculating both the occurrences of the letter in that word and in the original set of letters. If the possible word has more, it's a fail. If it has less or the same, it's a continued possibility.

Here's how that looks in code:

for word in $(cat $possibilities)
do
  length=$(echo $word | wc -c)
  idx=1
  while [ $idx -lt $length ] ; do
    letter=$(echo $word | cut -c$idx)
    occurrences $letter $word
    wordfreq=$freq  # number of times letter occurs #1
    occurrences $letter $1  # and letter occurrences #2
    if [ $wordfreq -gt $freq ] ; then
      echo discarding $word because $letter occurs too \
         many times vs pattern
      break  # get out of the "nearest" loop
    else
      echo ... continuing to test $word, $letter ok
    fi
    idx=$(( $idx + 1 ))   # increment loop counter
  done
done

It's a rather complicated piece of code, but let's run it and see what happens, then I'll explain a bit more about what's going on:

testing word access from possibilities file
... continuing to test access, a freq is acceptable
discarding access because c occurs too many times vs pattern
testing word accesses from possibilities file
... continuing to test accesses, a freq is acceptable
discarding accesses because c occurs too many times vs pattern

To start out, it has correctly identified that neither ACCESS nor ACCESSES are actually possible matches to our original set of letters because the letter c occurs too many times in both cases. What about a word that is valid?

testing word acers from possibilities file
... continuing to test acers, a freq is acceptable
... continuing to test acers, c freq is acceptable
... continuing to test acers, e freq is acceptable
... continuing to test acers, r freq is acceptable
... continuing to test acers, s freq is acceptable

By not failing out after the last letter, we can conclude that ACERS is indeed a valid word that can be created from the original set of letters.

Great. So we're close to a solution. Let's add a bit of code logic to have it know when it's succeeded at testing each and every letter of a word without a fail, and get rid of these interim status messages. Here's the result:

-- word access was skipped (too many c)
-- word accesses was skipped (too many c)
++ word acers can be constructed from the letters srcraem
++ word acmes can be constructed from the letters srcraem
++ word acres can be constructed from the letters srcraem

Awesome. We're actually done with the algorithm and the problem is solved. We just need to clean things up. Here's what I did to the code for the output to look pretty:

for word in $(cat $possibilities)
do
  length=$(echo $word | wc -c); length="$(( $length - 1 ))"
  idx=1
  while [ $idx -lt $length ] ; do
    letter=$(echo $word | cut -c$idx)
    occurrences $letter $word
    wordfreq=$freq  # number of times letter occurs #1
    occurrences $letter $1 # and letter occurrences #2
    if [ $wordfreq -gt $freq ] ; then
      echo "-- word $word was skipped (too many $letter)"
      break # get out of the "nearest" loop
    else
      if [ $idx -eq $length ] ; then
        echo "++ word $word can be constructed from \
          the letters $1"
      fi
    fi
    idx=$(( $idx + 1 )) # increment loop counter
  done
done

I haven't changed a lot actually, other than the conditional test when the letter occurrence is acceptable to see if our index = the length of the word.

Want to see only the valid possibilities and not the words that were discarded because of letter frequency? That's easy enough, just add a #? before the appropriate echo statement to comment it out.

And, finally, here's a list of all the five or more letter words you could produce from the original letter list SRCRAEM: acers, acmes, acres, cames, carer, carers, cares, carrs, carse, crams, crare, crares, cream, creams, macer, macers, maces, marcs, mares, maser, racer, racers, races, reams, rearm, rearms, rears, scare, scarer, scrae, scram, scream, serac, serra, smear.

Now you know.

Me? I'd play “racers”. It's not as offbeat as the other words that the program produced.

In fact, it'd be interesting to sort the results by length or, better, by score, since different letters have different point values in Scrabble. Hmmm...