Huffman 101


Let's play a game: your objective is to guess a word, but you can only ask yes or no questions. You should also aim to ask as few questions as possible.

You might have played a variant of this game before, guessing famous actors or musicians. You'd usually ask questions like "Are they alive?" or "Have they won an Oscar?". That would allow you to narrow down the possibilities, until you finally resort to a list of direct guesses ("Is it Amy Adams?") or simply give up.

However, my version of the game is slightly harder. Firstly: you cannot give up. You can use as many questions as you need, but eventually you have to find the answer. Secondly: the words can be anything. They can be completely made up. You may have to guess "Sesquipedalian" (which is a real word), or you may have to guess "Bendified" (which is a made up word).

By now, you may be starting to realize your usual strategy may not be applicable. You can't even fall back to asking about every word in the dictionary, because there isn't a dictionary of all the made-up words.

The simplest reasonable strategy

What can you do? Well, made up or not, all the possible words are made up of a finite number of letters. And, in English, there's only 26 of those.

So, one possible strategy is to ask "Is the first letter of the word A?" - if not, try the next letter of the alphabet: "Is the first letter of the word B?". Once you get a positive answer, write down that letter and move on to guessing the next one. If you ask about all twenty-six letters of the alphabet and none of them is it, then you know you've gotten all the letters - the word is over.

For comparison's sake, and since we're going to be comparing strategies, let's agree on a notation. Let's say for each negative answer, we write down a 0, and for each positive answer, we write down a 1.

If we were guessing the word "axe", our answer sheet would look like:

In total, we had to ask 56 questions (1 for 'a', 24 for 'x', 5 for 'e', plus 26 more to find out there's no fourth letter)! For a three-letter word. That's not very good. Let's think of ways to improve that.

Knowing when to stop

Right now, every game will end with us asking 26 questions, all of which will be met with a negative answer. This is a high price to pay, especially for short words. Instead, after each letter we guess, we could ask "is that all?". Let's guess "axe" again with that strategy and the same notation:

This brings the total number of questions from 56 down to 33 (2 for 'a', 25 for 'x', and 6 for 'e'). Pretty good! Of course, this is just for that one word, "axe". To be sure it's a good method, we'd have to test it against, say, a list of the most commonly used English words.

Comparing strategies

And this is exactly what we're going to do. All we need is:

  • A list of commonly used words (460 thousand should do)
  • For each word, to count how many questions we'd have to ask with each strategy
  • To export that data as CSV, import it somewhere and generate a graph for it.

We can make that happen with about a hundred lines of Go:

Go code
package main

import (
  "fmt"
  "io/ioutil"
  "strings"
)

type Measures struct {
  byLength map[int]ByLength
}

type ByLength struct {
  totalQuestions int64
  numWords       int64
}

func NewMeasures() *Measures {
  return &Measures{
    byLength: make(map[int]ByLength),
  }
}

func (m *Measures) Record(wordLength int, numQuestions int) {
  bl := m.byLength[wordLength]
  bl.numWords += 1
  bl.totalQuestions += int64(numQuestions)
  m.byLength[wordLength] = bl
}

type Strat struct {
  // Name of the strategy
  name string

  // Function that returns the number of questions
  // we need to ask to guess a given word
  f func(word string) int

  // Stats
  m *Measures
}

func main() {
  // (error handling is omitted here)
  payload, _ := ioutil.ReadFile("words.txt")

  // split the source file into an array of words.
  words := strings.Split(string(payload), "\n")
  for i := range words {
    words[i] = strings.TrimSpace(words[i])
  }

  strats := []Strat{
    Strat{name: "simplest", f: simplest, m: NewMeasures()},
    Strat{name: "whentostop", f: whentostop, m: NewMeasures()},
  }

  for _, w := range words {
    for i := range strats {
      numQuestions := strats[i].f(w)
      strats[i].m.Record(len(w), numQuestions)
    }
  }

  cols := []string{"word_length"}
  for _, strat := range strats {
    cols = append(cols, strat.name)
  }
  fmt.Printf("%s\n", strings.Join(cols, ","))

  // for words of length 1 to 20 (inclusive)
  for i := 1; i <= 20; i++ {
    // print word length
    fmt.Printf("%d", i)
    for _, strat := range strats {
      // ...and average number of questions for strategy
      if bl, ok := strat.m.byLength[i]; ok {
        fmt.Printf(",%.2f", float64(bl.totalQuestions)/float64(bl.numWords))
      } else {
        // except if we have no data, in which case, just print 0
        fmt.Printf(",0")
      }
    }
    fmt.Printf("\n")
  }
}

func simplest(word string) int {
  numQuestions := 0
  for _, c := range word {
    // index is the 0-based index of the letter
    // in the alphabet. A = 0, B = 1, C = 2, etc.
    index := int(c - 'a')
    // we have to ask 1 question for A, 2 questions for B, etc.
    numQuestions += index + 1
  }
  numQuestions += 26
  return numQuestions
}

func whentostop(word string) int {
  numQuestions := 0
  for _, c := range word {
    // same as 'simplest'
    index := int(c - 'a')
    numQuestions += index + 1
    // after each letter, we need to ask whether the word
    // is over
    numQuestions += 1
  }
  return numQuestions
}

The results are clear: using a question to determine whether the word has ended is better. It's much better for short words, and only marginally better for longer words:

The vertical bars indicate "average number of questions to ask in order to guess a word of this length".

Frequency

Now that we've run some tests against many words, I'm satisfied that this strategy change was indeed an improvement.

We can do better though! Some letters tend to appear more frequently than others in English words. Since we're asking about each letter one by one, we might be able to reduce the number of questions if we first ask about the most common letters!

Again, we can use our database of common English words, and a bit of Go code to measure that:

Go code
package main

import (
  "fmt"
  "io/ioutil"
  "strings"
)

func main() {
  // (error handling is omitted here)
  payload, _ := ioutil.ReadFile("words.txt")

  // split the source file into an array of words.
  words := strings.Split(string(payload), "\n")
  for i := range words {
    words[i] = strings.TrimSpace(words[i])
  }

  freqs := make([]int64, 26)
  var total int64
  for _, w := range words {
    for _, c := range w {
      freqs[int(c-'a')] += 1
      total += 1
    }
  }

  fmt.Printf("letter,frequency\n")
  for i := 0; i < 26; i++ {
    letter := 'a' + rune(i)
    freq := float64(freqs[i]) / float64(total)
    fmt.Printf("%v,%.5f\n", string([]rune{letter}), freq)
  }
}

And we end up with the following graph:

If our calculations are correct, this means the most common letters in English words are: 'e', 'i', 'a', and 'o'. No surprise there!

So what happens if we ask about letters starting with the most likely? Well, our answer sheet for the word "axe" now looks like this:

Notice that the top row is different! Instead of A, B, C, etc., we now have letters ordered from most to least frequent. It turns out 'x' is in the exact same place as before. However, when guessing 'e', we now only have to ask one question instead of five.

We went from 56, to 33, down to 31 questions (4 for 'a', 25 for 'x', and 2 for 'e', counting the "is that all?" questions). Well, it's shorter! Time for mass production - we'll adjust our previous program to add this new strategy, and run it against our common words database to see if it's an improvement.
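
On the comparison side, all we have to do is register the new strategy in the strats slice - a sketch of that one-line change, mirroring the two entries we already have (the frequency function itself is just below):

Go code
  strats := []Strat{
    Strat{name: "simplest", f: simplest, m: NewMeasures()},
    Strat{name: "whentostop", f: whentostop, m: NewMeasures()},
    Strat{name: "frequency", f: frequency, m: NewMeasures()},
  }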

Our new strategy function can be implemented like this (it's not the only way, but it's a simple way that works):

Go code
// the 26 letters, ordered from most to least frequent, per our measurements
var freqlet = "eiaonsrtlcupdmhgybfvkwzxqj"

func frequency(word string) int {
  numQuestions := 0
  for _, c := range word {
    for _, l := range freqlet {
      // ask about each letter, from most to least likely
      numQuestions += 1
      if c == l {
        break
      }
    }
    // after each letter, we need to ask whether the word
    // is over
    numQuestions += 1
  }
  return numQuestions
}

...and it is! It doesn't change much for very short words, but as words grow longer, it becomes a clear winner. We only need to ask 165 questions on average to guess twenty-letter words!

Partitioning

We've made great improvements to our initial strategy - however, we're still not making the most of our questioning powers. While we can stop early (when guessing "a", for example), we still only eliminate one letter per question.

If our word contains a 'j', we'll need 26 questions for that letter alone. And even though 'j' is the least common letter, words do contain the letter 'j' all the time!

Let's start with the basics - our alphabet has 26 letters, so any letter has to be either in the first half (from A to M) or the second half (from N to Z) of the alphabet, right? And "is the letter in the first half of the alphabet?" is definitely a yes or no question, so it's allowed!

Let's see what our answer sheet would look like if we were to guess, as usual, the word "axe".

Each question eliminates half of the remaining letters. The first question eliminates thirteen possible letters, the second eliminates six or seven, and so on until there's only one letter left.

With that strategy, guessing "axe" only requires 18 questions! In fact, each letter requires exactly five questions (+ 1 question per letter to know if the word is over).

Adding this to our little comparison plot is extremely easy, since we can now just multiply the length of the word by a constant:

Go code
func partition(word string) int {
  return (5 + 1) * len(word)
}

How well does it perform? Pretty freaking well:

We're down to 120 questions for 20-letter words.

But let's not stop there.

Tree, take 1

Our fourth attempt (partition) outperformed all the other strategies, but in some sense, it's a step back from the third. We did not use frequency at all there!

What we need is a way to:

  1. Eliminate a large number of letters with each question
  2. Ask fewer than five questions for the most frequent letters

But first, let's add a rule to our little guessing game: the guesser and the guessee can share information beforehand, as long as the word to be guessed cannot be deduced from that information.

So, for example, they could share this decision tree:

Now, I made up this tree - I don't know yet if it's going to outperform our "partition" strategy, so let's try it out at a larger scale.

Our function to compute the number of questions will simply be:

Go code
func tree1(word string) int {
  numQuestions := 0
  for _, c := range word {
    // if the letter is 'i' or 'e', we only need two questions
    if c == 'i' || c == 'e' {
      numQuestions += 2
    } else {
      // otherwise, we need 6
      numQuestions += 6
    }
    // ask whether word is over
    numQuestions += 1
  }
  return numQuestions
}

And the results are in:

Note: I've omitted "simplest" and "whentostop", since they were so much worse.

For every word length, "tree1" performs slightly worse than "partition". This is actually good news! We went out on a limb, and everything didn't go terribly wrong, only slightly.

In our defense, this is a pretty bad decision tree: the letter 'a' is almost as likely as the letter 'i', and yet we spend two questions on 'i' and six questions on 'a'. And even though we now spend six questions (instead of five) on most of the alphabet, the average number of questions comes out more or less the same as with "partition".

Tree, take 2

We could go all day coming up with trees and comparing their performance - but as it turns out, there is a way to come up with the optimal decision tree in our particular situation.

It goes like this:

  • Make a 'node' out of each letter and the probability of it appearing in a word.
  • Make a list of all our nodes.
  • Take the two least likely nodes and make a new node out of them:
    • Its probability is the sum of the two nodes' probabilities
    • It doesn't have a letter associated with it
  • Put the new node back in the list (in place of the two we just took), and repeat until only one node is left: that's the root of our decision tree.

Looking back at our letter frequency data, our two least likely letters are 'j' (0.16%) and 'q' (0.17%). So, let's make a new node out of them, which will have probability 0.32% (the percentages are rounded, which is why the sum looks slightly off).

This gives us part of our decision tree, like so:

Our next two nodes are 'x' (0.30%) and... the node we just created (0.32%).

Now, our partly-built decision tree looks like this:

(I've switched to circles so it's a bit easier to visualize).

Instead of doing everything by hand, let's write a bit of code to do it for us. We're going to adjust the code we used to find the frequency of letters in common English words in the first place:

Go code
package main

import (
  "fmt"
  "io/ioutil"
  "sort"
  "strings"

  "github.com/disiqueira/gotree"
)

type Node struct {
  letters  []rune
  prob     float64
  children []*Node
}

func (n *Node) String() string {
  return fmt.Sprintf("%q (%.2f%%)", string(n.letters), n.prob*100)
}

func main() {
  // (error handling is omitted here)
  payload, _ := ioutil.ReadFile("words.txt")

  // split the source file into an array of words.
  words := strings.Split(string(payload), "\n")
  for i := range words {
    words[i] = strings.TrimSpace(words[i])
  }

  freqs := make([]int64, 26)
  var total int64
  for _, w := range words {
    for _, c := range w {
      freqs[int(c-'a')]++
      total++
    }
  }

  var leafs []*Node
  for i := 0; i < 26; i++ {
    letter := 'a' + rune(i)
    freq := float64(freqs[i]) / float64(total)
    leafs = append(leafs, &Node{
      letters:  []rune{letter},
      prob:     freq,
      children: nil,
    })
  }

  for len(leafs) > 1 {
    sort.Sort(ByAscendingProb(leafs))
    l := leafs[0]
    r := leafs[1]
    newNode := &Node{
      letters:  append(l.letters, r.letters...),
      prob:     l.prob + r.prob,
      children: []*Node{l, r},
    }
    leafs = append([]*Node{newNode}, leafs[2:]...)
  }

  tree := gotree.New(leafs[0].String())
  mktree(tree, leafs[0])

  fmt.Println(tree.Print())
}

func mktree(tree gotree.Tree, node *Node) {
  for _, n := range node.children {
    t := tree.Add(n.String())
    mktree(t, n)
  }
}

type ByAscendingProb []*Node

func (a ByAscendingProb) Len() int           { return len(a) }
func (a ByAscendingProb) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a ByAscendingProb) Less(i, j int) bool { return a[i].prob < a[j].prob }

I've spent a few lines on visualization so that we could get this beautiful output:

"iygfwkhelmdtpursnobvzxjqca" (100.00%)
β”œβ”€β”€ "iygfwkhelmd" (41.11%)
β”‚   β”œβ”€β”€ "iygfwkh" (18.51%)
β”‚   β”‚   β”œβ”€β”€ "i" (8.96%)
β”‚   β”‚   └── "ygfwkh" (9.56%)
β”‚   β”‚       β”œβ”€β”€ "yg" (4.38%)
β”‚   β”‚       β”‚   β”œβ”€β”€ "y" (2.02%)
β”‚   β”‚       β”‚   └── "g" (2.36%)
β”‚   β”‚       └── "fwkh" (5.17%)
β”‚   β”‚           β”œβ”€β”€ "fwk" (2.53%)
β”‚   β”‚           β”‚   β”œβ”€β”€ "f" (1.12%)
β”‚   β”‚           β”‚   └── "wk" (1.41%)
β”‚   β”‚           β”‚       β”œβ”€β”€ "w" (0.64%)
β”‚   β”‚           β”‚       └── "k" (0.77%)
β”‚   β”‚           └── "h" (2.64%)
β”‚   └── "elmd" (22.60%)
β”‚       β”œβ”€β”€ "e" (10.77%)
β”‚       └── "lmd" (11.83%)
β”‚           β”œβ”€β”€ "l" (5.58%)
β”‚           └── "md" (6.25%)
β”‚               β”œβ”€β”€ "m" (3.01%)
β”‚               └── "d" (3.24%)
└── "tpursnobvzxjqca" (58.89%)
    β”œβ”€β”€ "tpurs" (27.83%)
    β”‚   β”œβ”€β”€ "tpu" (13.62%)
    β”‚   β”‚   β”œβ”€β”€ "t" (6.61%)
    β”‚   β”‚   └── "pu" (7.02%)
    β”‚   β”‚       β”œβ”€β”€ "p" (3.25%)
    β”‚   β”‚       └── "u" (3.76%)
    β”‚   └── "rs" (14.21%)
    β”‚       β”œβ”€β”€ "r" (7.04%)
    β”‚       └── "s" (7.16%)
    └── "nobvzxjqca" (31.06%)
        β”œβ”€β”€ "no" (14.39%)
        β”‚   β”œβ”€β”€ "n" (7.19%)
        β”‚   └── "o" (7.20%)
        └── "bvzxjqca" (16.66%)
            β”œβ”€β”€ "bvzxjqc" (8.20%)
            β”‚   β”œβ”€β”€ "bvzxjq" (3.82%)
            β”‚   β”‚   β”œβ”€β”€ "b" (1.83%)
            β”‚   β”‚   └── "vzxjq" (1.99%)
            β”‚   β”‚       β”œβ”€β”€ "v" (0.95%)
            β”‚   β”‚       └── "zxjq" (1.05%)
            β”‚   β”‚           β”œβ”€β”€ "z" (0.42%)
            β”‚   β”‚           └── "xjq" (0.62%)
            β”‚   β”‚               β”œβ”€β”€ "x" (0.30%)
            β”‚   β”‚               └── "jq" (0.32%)
            β”‚   β”‚                   β”œβ”€β”€ "j" (0.16%)
            β”‚   β”‚                   └── "q" (0.17%)
            β”‚   └── "c" (4.38%)
            └── "a" (8.46%)

How does it work? Well, looking at the tree, assuming we're guessing the letter 'a', our first question would be:

  • Is the letter one of i, y, g, f, w, k, h, e, l, m, or d?

We'd get a negative answer, so we can carry on to this part of the tree:

"tpursnobvzxjqca" (58.89%)
β”œβ”€β”€ "tpurs" (27.83%)
β”‚   β”œβ”€β”€ "tpu" (13.62%)
β”‚   β”‚   β”œβ”€β”€ "t" (6.61%)
β”‚   β”‚   └── "pu" (7.02%)
β”‚   β”‚       β”œβ”€β”€ "p" (3.25%)
β”‚   β”‚       └── "u" (3.76%)
β”‚   └── "rs" (14.21%)
β”‚       β”œβ”€β”€ "r" (7.04%)
β”‚       └── "s" (7.16%)
└── "nobvzxjqca" (31.06%)
    β”œβ”€β”€ "no" (14.39%)
    β”‚   β”œβ”€β”€ "n" (7.19%)
    β”‚   └── "o" (7.20%)
    └── "bvzxjqca" (16.66%)
        β”œβ”€β”€ "bvzxjqc" (8.20%)
        β”‚   β”œβ”€β”€ "bvzxjq" (3.82%)
        β”‚   β”‚   β”œβ”€β”€ "b" (1.83%)
        β”‚   β”‚   └── "vzxjq" (1.99%)
        β”‚   β”‚       β”œβ”€β”€ "v" (0.95%)
        β”‚   β”‚       └── "zxjq" (1.05%)
        β”‚   β”‚           β”œβ”€β”€ "z" (0.42%)
        β”‚   β”‚           └── "xjq" (0.62%)
        β”‚   β”‚               β”œβ”€β”€ "x" (0.30%)
        β”‚   β”‚               └── "jq" (0.32%)
        β”‚   β”‚                   β”œβ”€β”€ "j" (0.16%)
        β”‚   β”‚                   └── "q" (0.17%)
        β”‚   └── "c" (4.38%)
        └── "a" (8.46%)

Our next question:

  • Is the letter one of t, p, u, r, or s?

Again, we'd get a no, so we'd carry on to this part of the tree:

"nobvzxjqca" (31.06%)
β”œβ”€β”€ "no" (14.39%)
β”‚   β”œβ”€β”€ "n" (7.19%)
β”‚   └── "o" (7.20%)
└── "bvzxjqca" (16.66%)
    β”œβ”€β”€ "bvzxjqc" (8.20%)
    β”‚   β”œβ”€β”€ "bvzxjq" (3.82%)
    β”‚   β”‚   β”œβ”€β”€ "b" (1.83%)
    β”‚   β”‚   └── "vzxjq" (1.99%)
    β”‚   β”‚       β”œβ”€β”€ "v" (0.95%)
    β”‚   β”‚       └── "zxjq" (1.05%)
    β”‚   β”‚           β”œβ”€β”€ "z" (0.42%)
    β”‚   β”‚           └── "xjq" (0.62%)
    β”‚   β”‚               β”œβ”€β”€ "x" (0.30%)
    β”‚   β”‚               └── "jq" (0.32%)
    β”‚   β”‚                   β”œβ”€β”€ "j" (0.16%)
    β”‚   β”‚                   └── "q" (0.17%)
    β”‚   └── "c" (4.38%)
    └── "a" (8.46%)

To the question "is the letter one of n or o?", we'd also get a no (no pun intended):

"bvzxjqca" (16.66%)
β”œβ”€β”€ "bvzxjqc" (8.20%)
β”‚   β”œβ”€β”€ "bvzxjq" (3.82%)
β”‚   β”‚   β”œβ”€β”€ "b" (1.83%)
β”‚   β”‚   └── "vzxjq" (1.99%)
β”‚   β”‚       β”œβ”€β”€ "v" (0.95%)
β”‚   β”‚       └── "zxjq" (1.05%)
β”‚   β”‚           β”œβ”€β”€ "z" (0.42%)
β”‚   β”‚           └── "xjq" (0.62%)
β”‚   β”‚               β”œβ”€β”€ "x" (0.30%)
β”‚   β”‚               └── "jq" (0.32%)
β”‚   β”‚                   β”œβ”€β”€ "j" (0.16%)
β”‚   β”‚                   └── "q" (0.17%)
β”‚   └── "c" (4.38%)
└── "a" (8.46%)

Finally, we can ask if the letter is one of b, v, z, x, j, q, or c - and when we get a no, we know the letter is 'a'.

In total, we've spent four questions to guess the letter a - which is less than the five questions "partition" needs for any letter. This is good, because 'a' is in the top 3 most frequent letters.

If we were to guess 'x', by looking at the tree, we can see that we would be asking nine questions! However, for 'e', we'd only be asking three questions.

In total, to guess "axe", we'd be asking 4 + 9 + 3 = 16 questions, plus the three questions we need to know if the word is over, which adds up to 19 - one more than the 18 questions the "partition" method needs, because 'x' is one of the rarest letters and sits very deep in the tree. However, this decision tree takes frequency into account, so on average, it should perform better.

To test that out, we need to carry our decision tree from one program (the frequency measurement / tree building one) to the other (the "strategy efficiency comparator").

I'll simply use JSON for that. First, we add JSON tags to our Node data structure (this article is now a Go course):

Go code
// we need to make sure our fields are exported (start with an uppercase letter)
// otherwise they won't get marshalled.
type Node struct {
  Letters  []rune  `json:"letters"`
  Prob     float64 `json:"prob"`
  Children []*Node `json:"children"`
}

Then, once the tree is generated, we simply serialize it to a file, huff.json.

Go code
bs, _ := json.MarshalIndent(leafs[0], "", "  ")
_ = ioutil.WriteFile("huff.json", bs, 0644)

Here's a small excerpt from the resulting file:

JSON
{
  "letters": [
    121,
    103
  ],
  "prob": 0.04383930500372708,
  "children": [
    {
      "letters": [
        121
      ],
      "prob": 0.020195753849763715,
      "children": null
    },
    {
      "letters": [
        103
      ],
      "prob": 0.023643551153963365,
      "children": null
    }
  ]
}

Back to our benchmarking program, let's load it at startup, using an init() func:

Go code
var rootNode *Node

func init() {
  bs, _ := ioutil.ReadFile("huff.json")
  var node Node
  json.Unmarshal(bs, &node)
  // *not* a dangling pointer, all hail the garbage collecting gods!
  rootNode = &node
}

And finally, our "number of questions" function will look like this (again, the slow, but somewhat-easy-to-follow way):

Go code
func tree2(word string) int {
  numQuestions := 0
  for _, wordLetter := range word {
    node := rootNode
    for len(node.Letters) > 1 {
    searchChild:
      for _, child := range node.Children {
        for _, nodeLetter := range child.Letters {
          if nodeLetter == wordLetter {
            node = child
            break searchChild
          }
        }
      }
      // every time we travel to a new node, that's a question!
      numQuestions++
    }

    // ask if the word is over
    numQuestions++
  }
  return numQuestions
}

So, how does it perform? Were our efforts worth it?

They were! The results are great - we're almost at 100 questions on average to guess a 20-letter word - which is much, much better.
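
If you want to sanity-check that figure without re-running the whole benchmark, the average number of questions per letter is simply the probability-weighted depth of each leaf in our tree. Here's a possible sketch (not part of the original programs), reusing the Node type and rootNode from the benchmarking program:

Go code
// average number of questions needed to guess a single letter:
// the sum, over all leaves, of (probability of the letter) * (depth of the leaf)
func averageQuestions(node *Node, depth int) float64 {
  if len(node.Children) == 0 {
    // leaf: contributes its probability times its depth
    return node.Prob * float64(depth)
  }
  total := 0.0
  for _, child := range node.Children {
    total += averageQuestions(child, depth+1)
  }
  return total
}

Multiplying averageQuestions(rootNode, 0) by the word length, then adding one "is the word over?" question per letter, gives a rough estimate of the averages in the graph.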

What we have built here is a Huffman code for the Latin alphabet, suitable for compressing common English words. We know the tree we ended up with is the best tree possible (for those probabilities), because Huffman codes are optimal.
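
To make the "code" part concrete, here's a small sketch (again, not from the original programs) that walks the same tree and writes down one answer per question, using the same notation as our answer sheets - 1 for "yes, it's in the first branch", 0 for "no". It reuses the Node type and rootNode from the benchmarking program:

Go code
// returns the sequence of answers (bits) that identifies a letter in the tree
func encodeLetter(wordLetter rune) string {
  bits := ""
  node := rootNode
  for len(node.Letters) > 1 {
    // every internal node of a Huffman tree has exactly two children
    first := node.Children[0]
    if containsRune(first.Letters, wordLetter) {
      bits += "1" // yes: the letter is in the first branch
      node = first
    } else {
      bits += "0" // no: it must be in the second branch
      node = node.Children[1]
    }
  }
  return bits
}

func containsRune(rs []rune, r rune) bool {
  for _, c := range rs {
    if c == r {
      return true
    }
  }
  return false
}

The length of that bit string is exactly the number of questions tree2 counts for that letter - which is why counting questions and counting bits of compressed output amount to the same thing here.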

However, there is one more thing we can do.

Tree, take 3

We've learned how to construct a "decision tree" to minimize the number of questions to ask when guessing a word, and in particular, how to build the optimal tree for that.

But something's bothering me - after every letter, we need to ask whether the word is over, or if it keeps going. What if we decided that the nodes in our tree should not correspond only to "letters", but to symbols, which can be either:

  • One of the 26 letters, from a to z, OR
  • A signal for the end of the word

You can think of it as making up a new letter, so that we end up with a 27-letter alphabet - except the 27th letter is only ever used to mark the end of a word.

First of all, we're going to need to measure the frequency of this new symbol. We can easily adjust our freq program.

Go code
  freqs := make([]int64, 27)
  var total int64
  for _, w := range words {
    for _, c := range w {
      freqs[int(c-'a')]++
      total++
    }
    // we've encountered the end of a word!
    // its frequency is at the 27th position,
    // so, 26 for 0-based indexing.
    freqs[26]++
    total++
  }

For the sake of visualization, I've decided that we'll use the real-world symbol '$' (dollar sign) to represent the end of a word.
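
In the tree-building program, that also means adding a 27th leaf for '$' next to the 26 letter leaves - here's a sketch of what that could look like, assuming the exported field names from the JSON section:

Go code
  // one extra leaf for our end-of-word symbol, using the count stored at index 26
  leafs = append(leafs, &Node{
    Letters: []rune{'$'},
    Prob:    float64(freqs[26]) / float64(total),
  })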

Our new decision tree is as follows:

"$elmdtpursnobvzxjqcaiygfwkh" (100.00%)
β”œβ”€β”€ "$elmdtpu" (42.33%)
β”‚   β”œβ”€β”€ "$e" (19.32%)
β”‚   β”‚   β”œβ”€β”€ "$" (9.58%)
β”‚   β”‚   └── "e" (9.74%)
β”‚   └── "lmdtpu" (23.01%)
β”‚       β”œβ”€β”€ "lmd" (10.69%)
β”‚       β”‚   β”œβ”€β”€ "l" (5.04%)
β”‚       β”‚   └── "md" (5.65%)
β”‚       β”‚       β”œβ”€β”€ "m" (2.72%)
β”‚       β”‚       └── "d" (2.93%)
β”‚       └── "tpu" (12.32%)
β”‚           β”œβ”€β”€ "t" (5.97%)
β”‚           └── "pu" (6.34%)
β”‚               β”œβ”€β”€ "p" (2.94%)
β”‚               └── "u" (3.40%)
└── "rsnobvzxjqcaiygfwkh" (57.67%)
    β”œβ”€β”€ "rsno" (25.86%)
    β”‚   β”œβ”€β”€ "rs" (12.84%)
    β”‚   β”‚   β”œβ”€β”€ "r" (6.37%)
    β”‚   β”‚   └── "s" (6.48%)
    β”‚   └── "no" (13.02%)
    β”‚       β”œβ”€β”€ "n" (6.51%)
    β”‚       └── "o" (6.51%)
    └── "bvzxjqcaiygfwkh" (31.81%)
        β”œβ”€β”€ "bvzxjqca" (15.07%)
        β”‚   β”œβ”€β”€ "bvzxjqc" (7.42%)
        β”‚   β”‚   β”œβ”€β”€ "bvzxjq" (3.46%)
        β”‚   β”‚   β”‚   β”œβ”€β”€ "b" (1.65%)
        β”‚   β”‚   β”‚   └── "vzxjq" (1.80%)
        β”‚   β”‚   β”‚       β”œβ”€β”€ "v" (0.86%)
        β”‚   β”‚   β”‚       └── "zxjq" (0.95%)
        β”‚   β”‚   β”‚           β”œβ”€β”€ "z" (0.38%)
        β”‚   β”‚   β”‚           └── "xjq" (0.56%)
        β”‚   β”‚   β”‚               β”œβ”€β”€ "x" (0.27%)
        β”‚   β”‚   β”‚               └── "jq" (0.29%)
        β”‚   β”‚   β”‚                   β”œβ”€β”€ "j" (0.14%)
        β”‚   β”‚   β”‚                   └── "q" (0.15%)
        β”‚   β”‚   └── "c" (3.96%)
        β”‚   └── "a" (7.65%)
        └── "iygfwkh" (16.74%)
            β”œβ”€β”€ "i" (8.10%)
            └── "ygfwkh" (8.64%)
                β”œβ”€β”€ "yg" (3.96%)
                β”‚   β”œβ”€β”€ "y" (1.83%)
                β”‚   └── "g" (2.14%)
                └── "fwkh" (4.68%)
                    β”œβ”€β”€ "fwk" (2.29%)
                    β”‚   β”œβ”€β”€ "f" (1.02%)
                    β”‚   └── "wk" (1.27%)
                    β”‚       β”œβ”€β”€ "w" (0.58%)
                    β”‚       └── "k" (0.69%)
                    └── "h" (2.39%)

We need to adjust our "number of questions" measurement function a bit. First, let's extract "letter cost estimation" into its own function:

Go code
// Calculate the number of questions required to guess
// a letter (or '$') using our latest tree
func letterCost(wordLetter rune) int {
  numQuestions := 0
  node := rootNode
  for len(node.Letters) > 1 {
  searchChild:
    for _, child := range node.Children {
      for _, nodeLetter := range child.Letters {
        if nodeLetter == wordLetter {
          node = child
          break searchChild
        }
      }
    }
    // every time we travel to a new node, that's a question!
    numQuestions++
  }
  return numQuestions
}

And now our word cost computation function is trivial:

Go code
func tree3(word string) int {
  numQuestions := 0
  for _, wordLetter := range word {
    // add the cost of each letter
    numQuestions += letterCost(wordLetter)
  }
  // and finally, our end of word marker
  numQuestions += letterCost('$')
  return numQuestions
}

Was it a good idea? Let's pit our three trees against each other: the naive one from earlier, our Huffman code with one explicit "end of word" question per letter, and our Huffman code that has an "end of word" symbol in its alphabet.

This time, I used a logarithmic scale for the Y axis, to make the graph more readable.

I want to take a moment to point out that I did not know what the results would be exactly when I started writing this article. All I had was a blind faith in David A. Huffman and the word "optimal", and a feeling of what would work well and what wouldn't.

I'm really excited to see that, yes, indeed, having "end of word" as a symbol is worth it... for words of 4 letters and above. Below that, our eager "is it over yet?" strategy performs better.

The grand finale

Finally, let's pit all of our strategies against each other, from the very first one (asking about each letter of the alphabet in sequence) to the last one (a Huffman code with an "end of word" symbol).

And that's it for this post! I hope you enjoyed learning about Huffman codes in the most roundabout way possible. In the real world, they're used for data compression, in formats like DEFLATE, which I might do another article on since it has a lot of interesting tidbits of its own!

This is hopefully the first of many articles sponsored by my patrons on Patreon - I look forward to writing some more!

