Tuesday, September 14th, 2010...18:11



About a month ago, I came across a post from Jon McLoone on the Wolfram Blog where he detailed his method for finding the 25 best Hangman words. His metric was the percentage of games his computer algorithm would fail to guess a given word. His results make sense: short words with low-frequency characters are harder for a letter-frequency-based algorithm, since more of the alphabet will result in an incorrect guess and low-frequency letters will be guessed less often.

I decided to try to improve on the guessing algorithm and see if that would change the results. Since I wasn't going to be using Mathematica, I first needed to replicate his experiment. I wrote a Python script to parse the English dictionary on my EeePC running Ubuntu.

# Build a map of word length -> list of words
file = open("/usr/share/dict/words", "r")
word_lengths = dict()
for line in file:
  word = line.strip().lower()
  # ignore words with punctuation
  if not word or not word.isalpha():
    continue
  word_length = len(word)
  words = []
  if word_length in word_lengths:
    words = word_lengths[word_length]
  words.append(word)
  word_lengths[word_length] = words
file.close()

# Output a file of words for each word length e.g.
# word_length_3, word_length_4, etc.
for word_length, value in word_lengths.items():
  # remove duplicates due to lowercasing
  words = list(set(value))
  if words:
    file = open('word_length_' + str(word_length), 'w')
    _ = [file.write(word + '\n') for word in words]
    file.close()

I ended up with 72,707 words of length 3-22 letters (no punctuation), slightly fewer than the 90k used in the Wolfram post, but good enough. Next I wrote a function which, given weighted character frequencies (a simple dictionary where the keys are characters and the values are their weights, e.g. {'a':10, 'b':3, 'c':24}), would randomly pick one.

import random

def ProbabilisticCharacterGuess(character_frequency):
  # pick a character at random, weighted by its frequency
  occurrences = sum(character_frequency.values())
  rand = random.randint(1, occurrences)
  total = 0
  for character, frequency in character_frequency.items():
    total += frequency
    if rand <= total:
      return character
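A quick sanity check (repeating the function so the snippet runs on its own) shows the picks tracking the weights:

```python
import random
from collections import Counter

def ProbabilisticCharacterGuess(character_frequency):
    # walk the cumulative weights until the random draw falls inside one
    occurrences = sum(character_frequency.values())
    rand = random.randint(1, occurrences)
    total = 0
    for character, frequency in character_frequency.items():
        total += frequency
        if rand <= total:
            return character

random.seed(0)
counts = Counter(ProbabilisticCharacterGuess({'a': 10, 'b': 3, 'c': 24})
                 for _ in range(10000))
# 'c' (weight 24 of 37) should be picked most often, roughly 65% of the time
```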

Next came functions to compute the overall character frequency from a list of words, filtering out characters which had already been guessed.

def FilterCharacterFrequency(character_frequency, characters):
  for character in characters:
    if character in character_frequency:
      del character_frequency[character]
  return character_frequency

def OverallFrequency(words, guessed):
  character_frequency = dict()
  for word in words:
    for character in word:
      if character in character_frequency:
        character_frequency[character] += 1
      else:
        character_frequency[character] = 1
  return FilterCharacterFrequency(character_frequency, guessed)
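For example (repeating both functions so the snippet is self-contained), counting characters over two candidate words and dropping an already-guessed letter:

```python
def FilterCharacterFrequency(character_frequency, characters):
    # remove characters which have already been guessed
    for character in characters:
        if character in character_frequency:
            del character_frequency[character]
    return character_frequency

def OverallFrequency(words, guessed):
    character_frequency = dict()
    for word in words:
        for character in word:
            if character in character_frequency:
                character_frequency[character] += 1
            else:
                character_frequency[character] = 1
    return FilterCharacterFrequency(character_frequency, guessed)

# 'j' has been guessed already, so it is filtered out of the counts
cf = OverallFrequency(['jazz', 'jinx'], ['j'])
```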

Initially the word list would be all the words of a given length and the guessed list would be empty. As characters are guessed, I would need to filter the word list. If a guessed character was incorrect, then all the words containing that character could be removed. If a character was guessed correctly, then I could filter out all the words which didn't have that character in the correct position(s). Here are those two functions. They are intended to be used with Python's built-in filter(), so they return functions which operate on one word at a time.

def WordWithoutFilter(c):
  return (lambda word: word.find(c) == -1)

def WordWithFilter(c, positions):
  def f(word):
    found = True
    for position in positions:
      if word[position] != c:
        found = False
    return found
  return f
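Here's how the two filters behave on a tiny candidate list (functions repeated for a standalone example; note that in Python 3 filter() returns an iterator, so the results are wrapped in list()):

```python
def WordWithoutFilter(c):
    return (lambda word: word.find(c) == -1)

def WordWithFilter(c, positions):
    def f(word):
        found = True
        for position in positions:
            if word[position] != c:
                found = False
        return found
    return f

words = ['jazz', 'jinx', 'lamb', 'ruby']
# an incorrect guess of 'b' eliminates every word containing 'b'
no_b = list(filter(WordWithoutFilter('b'), words))
# a correct guess of 'a' at position 1 keeps only words with 'a' there
a_at_1 = list(filter(WordWithFilter('a', [1]), words))
```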

At this point, I was ready to write the function which would actually play a game of Hangman. It takes a word to guess, a list of candidate words, a function to compute character frequencies and an initial character frequency dictionary. The last two parameters, while not absolutely necessary, help make the Play function more flexible and optimize the playing of many repeated games. The function returns the number of incorrect guesses made before all the characters in a given word are correctly guessed.

def Play(word, words, frequency_function, initial_cf):
  word_length = len(word)
  tries = 0
  guessed = []
  in_progress_word = ['' for x in word]
  remaining_letters = word_length
  while remaining_letters > 0:
    if len(guessed) == 0:
      character_frequency = initial_cf
    else:
      character_frequency = frequency_function(words, guessed, in_progress_word)
    guess = ProbabilisticCharacterGuess(character_frequency)
    guessed.append(guess)
    index = 0
    indexes = []
    while index < word_length:
      index = word.find(guess, index)
      if index == -1:
        break
      indexes.append(index)
      in_progress_word[index] = guess
      index += 1
    indexes_correct = len(indexes)
    if indexes_correct > 0:
      words = filter(WordWithFilter(guess, indexes), words)
      remaining_letters -= indexes_correct
    else:
      tries += 1
      words = filter(WordWithoutFilter(guess), words)
  return tries
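To see the whole loop in action, here is a standalone sketch of the pieces so far in Python 3 form (the post's code targets Python 2, where filter() returns a list, so here the filter() calls are wrapped in list() and the inner index-scanning loop is condensed with enumerate(); the tiny word list is made-up sample data):

```python
import random

def ProbabilisticCharacterGuess(character_frequency):
    occurrences = sum(character_frequency.values())
    rand = random.randint(1, occurrences)
    total = 0
    for character, frequency in character_frequency.items():
        total += frequency
        if rand <= total:
            return character

def FilterCharacterFrequency(character_frequency, characters):
    for character in characters:
        if character in character_frequency:
            del character_frequency[character]
    return character_frequency

def OverallFrequency(words, guessed, in_progress_word=None):
    character_frequency = dict()
    for word in words:
        for character in word:
            character_frequency[character] = character_frequency.get(character, 0) + 1
    return FilterCharacterFrequency(character_frequency, guessed)

def WordWithoutFilter(c):
    return lambda word: word.find(c) == -1

def WordWithFilter(c, positions):
    return lambda word: all(word[position] == c for position in positions)

def Play(word, words, frequency_function, initial_cf):
    tries = 0
    guessed = []
    in_progress_word = ['' for x in word]
    remaining_letters = len(word)
    while remaining_letters > 0:
        if len(guessed) == 0:
            character_frequency = initial_cf
        else:
            character_frequency = frequency_function(words, guessed, in_progress_word)
        guess = ProbabilisticCharacterGuess(character_frequency)
        guessed.append(guess)
        indexes = [i for i, c in enumerate(word) if c == guess]
        for i in indexes:
            in_progress_word[i] = guess
        if indexes:
            words = list(filter(WordWithFilter(guess, indexes), words))
            remaining_letters -= len(indexes)
        else:
            tries += 1
            words = list(filter(WordWithoutFilter(guess), words))
    return tries

random.seed(7)
words = ['jazz', 'jinx', 'lamb', 'ruby', 'oxen']
tries = Play('jazz', words, OverallFrequency, OverallFrequency(words, []))
```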

There are a few other little driver and analytic functions I wrote to play large numbers of games for one or more words, but I won't detail them here. After playing 50 games for a given word, the output looks like this:

overall|jazz|[13, 16, 16, 10, 16, 14, 13, 15, 14, 16, 15, 18, 7, 16, 14, 1, 7, 14, 15, 11, 17, 15, 7, 2, 11, 15, 14, 14, 15, 2, 14, 17, 13, 16, 4, 15, 16, 16, 11, 10, 15, 18, 14, 11, 15, 15, 13, 9, 17, 9]
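Each output line is easy to post-process; a small parser (using shortened sample data here, not a real run) pulls out the algorithm name, the word and the per-game counts:

```python
# Parse one line of game output: algorithm|word|list of incorrect-guess counts
line = 'overall|jazz|[13, 1, 7, 2, 16]'
algorithm, word, raw_scores = line.split('|')
scores = [int(n) for n in raw_scores.strip('[]').split(', ')]
# a game is won when the word is guessed with fewer than 6 incorrect guesses
wins = sum(1 for s in scores if s < 6)
```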

This lets me know the algorithm used for determining character frequencies was the "overall" one, the word played was "jazz", and finally gives a list of the number of incorrect guesses it took to guess the word correctly in each game. Next I set about trying to come up with an algorithm that would beat the naive "overall" character frequency algorithm. My first thought was to see how character frequency varied by position. Below are graphs, for words of 4 characters, of the overall character frequency as well as the character frequency for each position in the word (y axis is % of characters in that position).

It's interesting to note that while the character "e" has the highest overall frequency of all characters, it is not the highest-frequency character in any one position. Also, the shape of the graph for character positions 2 and 4 has a very sharp drop-off after the first few characters. In fact, with the character "s" occurring in the final position of nearly 20% of the 4 character words, it's likely 10% or more of the 4 character words are just plurals of 3 character words.

In light of this, I figured I could eke out some improvement by using the character frequency from the position with the sharpest drop-off. I roughly approximated that as the character frequency where the first half of the characters remaining accounted for the largest percentage of the overall characters in a given position. First I wrote a function, to be used with Python's reduce built-in, that would keep track of a position's character frequency:

def PositionalCharacterFrequencyReducer(position):
  def Reducer(d, word):
    c = word[position]
    if c in d:
      d[c] += 1
    else:
      d[c] = 1
    return d
  return Reducer
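Used with Python's reduce (imported from functools in Python 3), the reducer folds a word list into a per-position frequency table; the function is repeated here, with a made-up word list, so the example runs standalone:

```python
from functools import reduce

def PositionalCharacterFrequencyReducer(position):
    def Reducer(d, word):
        c = word[position]
        if c in d:
            d[c] += 1
        else:
            d[c] = 1
        return d
    return Reducer

# character frequency for the final position of some 4 character words
words = ['jazz', 'lamb', 'limb', 'ruby']
last_position = reduce(PositionalCharacterFrequencyReducer(3), words, dict())
```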

Next I wrote a function to calculate the skew of a position's character frequency:

def Skewed(values):
  # sort descending so the "top half" really is the most frequent characters
  values = sorted(values, reverse=True)
  total = sum(values)
  top = sum(values[:len(values) // 2])
  return top / float(total)

Finally I wrote a drop-in replacement for the OverallFrequency function defined earlier (with the addition of in_progress_word):

def TopHeavyFrequency(words, guessed, in_progress_word):
  max_skew = -1
  top_heaviest_index = 0
  indexes = []
  character_frequencies = [dict() for x in in_progress_word]
  # only consider positions which haven't been revealed yet
  for index in range(len(in_progress_word)):
    if in_progress_word[index] == '':
      indexes.append(index)
  for word in words:
    for index in indexes:
      character_frequency = character_frequencies[index]
      character = word[index]
      if character in character_frequency:
        character_frequency[character] += 1
      else:
        character_frequency[character] = 1
  for index in indexes:
    cf = FilterCharacterFrequency(character_frequencies[index], guessed)
    skew = Skewed(cf.values())
    if skew > max_skew:
      max_skew = skew
      top_heaviest_index = index
  return character_frequencies[top_heaviest_index]
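On a small made-up example (repeating the helpers in Python 3 form so the snippet runs standalone), the function picks the position whose frequencies are most top-heavy: position 1 (the 'b'/'t' slot) beats the flatter vowel position, and the all-'l' positions score zero because a single-character "top half" is empty.

```python
def FilterCharacterFrequency(character_frequency, characters):
    for character in characters:
        if character in character_frequency:
            del character_frequency[character]
    return character_frequency

def Skewed(values):
    # sort descending so the "top half" really is the most frequent characters
    values = sorted(values, reverse=True)
    total = sum(values)
    top = sum(values[:len(values) // 2])
    return top / float(total)

def TopHeavyFrequency(words, guessed, in_progress_word):
    max_skew = -1
    top_heaviest_index = 0
    indexes = []
    character_frequencies = [dict() for x in in_progress_word]
    # only consider positions which haven't been revealed yet
    for index in range(len(in_progress_word)):
        if in_progress_word[index] == '':
            indexes.append(index)
    for word in words:
        for index in indexes:
            cf = character_frequencies[index]
            cf[word[index]] = cf.get(word[index], 0) + 1
    for index in indexes:
        cf = FilterCharacterFrequency(character_frequencies[index], guessed)
        skew = Skewed(cf.values())
        if skew > max_skew:
            max_skew = skew
            top_heaviest_index = index
    return character_frequencies[top_heaviest_index]

words = ['ball', 'bell', 'bill', 'tall']
cf = TopHeavyFrequency(words, [], ['', '', '', ''])
```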

After some tweaks to the framework code so that it would play multiple algorithms per word, I was ready to pit TopHeavyFrequency against OverallFrequency. I played 50 games with each algorithm for each word (almost 7.3 million games total), then calculated each algorithm's winning percentage, where winning is defined as guessing all the characters in a word with fewer than 6 incorrect character guesses (that seemed to be the more popular style, i.e. drawing just the head, body, arms and legs, than the 10 and 13 guess versions where the gallows are also drawn). Some of the results are graphed below.

For words of 3-6 characters, the TopHeavyFrequency algorithm has a winning percentage consistently 4-7 points higher than the OverallFrequency algorithm. Once the algorithms encounter words of 8 characters, they both have a > 75% winning percentage.

And by 11 characters, the winning percentage is greater than 95%.

I then took all the words for which the algorithm won 10% of the time or less (with fewer than 6 incorrect guesses) and played them each 500 times. I culled the output again and played 2000 games with the "hardest" words. In my trials, the word jazzed was the most difficult for both algorithms to guess, with the TopHeavyFrequency algorithm winning only 87 out of 2000 games (4.35%) and the OverallFrequency algorithm winning 115 (5.75%). Jazzy was next, with the algorithms producing winning percentages of 4.95 and 5.95 respectively, followed by jazz at 5.25 and 6.25.

It makes sense that the OverallFrequency outperforms the TopHeavyFrequency for these "hard" words since the TopHeavyFrequency algorithm trades off a small decline in winning percentage for a few "hard" words for a large increase in winning percentage for the majority of words.
