Huffman Encoding
Huffman Encoding is one of the earliest schemes used in text compression. It
reduces the total number of bits required to transmit a message. The process
first requires analyzing the text to be compressed; then, based on the
statistics of the characters in the processed text, we encode the letters.
Consider a message with the following symbols: t, u, $, e, a, w.
Since we have 6 symbols, conventionally, we would need 3 bits for each symbol.
So, something like "tweet" would take up 15 bits of data. However, this method
of encoding doesn't take into account the frequency of the letters. Suppose the
probability of the occurrence of each symbol in our example is known to be the
following:

Building a Huffman Tree
1. Organize the Data: Start by arranging all the symbols in increasing
order based on their probabilities or frequencies. This means the symbols
with the smallest probabilities should be at the beginning of your list.
2. Select Minimum Probability Symbols: From your organized list, pick out
the two symbols that have the smallest probabilities.
3. Create Initial Trees: For each of these two symbols, build a binary
tree. These initial trees should each have just one node that represents one
of the selected symbols.
4. Combine into Sub-Tree: Construct a new binary sub-tree using these two
one-node trees. The root node of this new sub-tree should be a composite
node, holding the combined probability of the two initial symbols. The two
one-node trees will be its left and right children.
5. Update the List: Add this newly constructed sub-tree back into your
original list, ensuring to keep the list in increasing order of
probabilities.
6. Repeat the Process: Continue with this process, repeating steps 2
through 5. Each round, you'll combine the two smallest probabilities into a
new sub-tree and update your list accordingly.
7. Finalize the Huffman Tree: The process comes to an end when only one
tree remains in your list. This final tree is your completed Huffman tree
(see the sketch after this list).
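As a rough illustration of these steps, here is how the loop might look in
Java, using a PriorityQueue as the ordered list. This is only a sketch, not
the code from the repository; the Node class and its field names are
assumptions made for the example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative node type; field names are assumptions for this sketch.
class Node {
    char symbol;       // meaningful only for leaf nodes
    int frequency;     // combined frequency for internal nodes
    Node left, right;

    Node(char symbol, int frequency) {   // one-node tree (step 3)
        this.symbol = symbol;
        this.frequency = frequency;
    }

    Node(Node left, Node right) {        // composite node (step 4)
        this.frequency = left.frequency + right.frequency;
        this.left = left;
        this.right = right;
    }
}

class HuffmanTreeBuilder {
    // Builds the tree by repeatedly merging the two lowest-frequency
    // trees, following steps 1-7 above.
    static Node build(char[] symbols, int[] frequencies) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.frequency));
        for (int i = 0; i < symbols.length; i++) {
            queue.add(new Node(symbols[i], frequencies[i])); // one-node trees
        }
        while (queue.size() > 1) {
            Node a = queue.poll();      // smallest frequency (step 2)
            Node b = queue.poll();      // second smallest
            queue.add(new Node(a, b));  // merge and re-insert (steps 4-5)
        }
        return queue.poll();            // the completed Huffman tree (step 7)
    }
}
```

The priority queue simply stands in for the "list kept in increasing order":
polling twice gives the two minimum-probability trees, and adding the merged
sub-tree keeps the collection ordered for the next round.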
Here is a great animation from Tom Scott on YouTube on this topic: Huffman Encoding by Tom Scott.
Note, he uses frequency rather than probability, but it's the same idea. He
also starts at the bottom of the list rather than the top, but again, that's
the same idea.
Implementation of the Huffman Encoding
In the implementation, I decided to use frequency rather than probability,
since I feel comparing integers would be more precise and efficient.
Here's the Github link: Huffman Encoding Repository
HuffmanTree Class
The first thing we create is the HuffmanTree class, which is a version of the
binary tree. We store the left, right, character, and frequency in each node.
To serialize the tree, I've decided to use pre-order traversal and write it to
an ASCII file so that the tree can be deciphered by other programs as well.
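As a sketch of what such a pre-order serialization could look like (reusing
the illustrative Node class from the earlier sketch; the '1'/'0' markers are
an assumption, and the actual format in the repository may differ):

```java
// Sketch: pre-order traversal where '1' marks a leaf followed by its
// character and '0' marks an internal node. Writing this string to an
// ASCII file lets another program rebuild the tree.
static void serialize(Node node, StringBuilder out) {
    if (node.left == null && node.right == null) {
        out.append('1').append(node.symbol);  // leaf: marker + character
    } else {
        out.append('0');                      // internal node marker
        serialize(node.left, out);
        serialize(node.right, out);
    }
}
```

Deserialization then just reads the stream back in the same pre-order
fashion, creating a leaf on '1' and recursing into two children on '0'.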
HuffmanEncoder Class
This is the main program that does all the math. It has a bunch of methods
that do different things, but it doesn't handle actually invoking them. In
HuffmanCompressionDriver, the encodeFileBinary method, for example, uses the
methods in this class to:
- Create the frequency table.
- Build the Huffman tree from the frequency table.
- Generate the Huffman codes from the Huffman tree (both of these steps are sketched below).
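To make those steps concrete, here is a hedged sketch of the frequency-table
and code-generation steps. These are illustrative helpers, not the
repository's actual method names, and they reuse the Node class from the
earlier sketch.

```java
import java.util.HashMap;
import java.util.Map;

class HuffmanCodingSketch {
    // Count how often each character occurs in the text.
    static Map<Character, Integer> frequencyTable(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    // Walk the Huffman tree, appending '0' for left edges and '1' for
    // right edges; each leaf's path becomes its code.
    static Map<Character, String> generateCodes(Node root) {
        Map<Character, String> codes = new HashMap<>();
        collect(root, "", codes);
        return codes;
    }

    private static void collect(Node node, String prefix,
                                Map<Character, String> codes) {
        if (node.left == null && node.right == null) {
            codes.put(node.symbol, prefix);  // leaf: the path is the code
            return;
        }
        collect(node.left, prefix + "0", codes);
        collect(node.right, prefix + "1", codes);
    }
}
```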
Then the HuffmanCompressionDriver handles things like reading from and writing
to a particular file, etc.
HuffmanCompressionDriver class
This class is the user interface and decides which methods are to be called
and in what order. It also handles all the I/O operations, converting from
string to binary and vice versa, etc.
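One part of that string-to-binary conversion is worth sketching, since Java's
file APIs only write whole bytes. This is a hedged example of one way to pack
a string of '0'/'1' characters into a byte array, not the repository's actual
code:

```java
// Packs a string of '0'/'1' characters into bytes, most significant bit
// first. The final byte is zero-padded, so a real implementation must
// also record the total bit count to know where the data ends.
static byte[] packBits(String bits) {
    byte[] out = new byte[(bits.length() + 7) / 8];
    for (int i = 0; i < bits.length(); i++) {
        if (bits.charAt(i) == '1') {
            out[i / 8] |= (byte) (1 << (7 - (i % 8)));
        }
    }
    return out;
}
```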
Conclusion
In my particular version of the compression algorithm, I found it saved about
25% of the space on disk; however, it's more useful for larger bodies of text,
since the serialized tree that must also be stored is rather large.
Things I learned:
- Building a binary tree from useful data.
- Serializing and De-serializing a binary tree.
- Writing binary data to a file using Java (especially non-byte-sized data).
- Parsing user input in Java from CLI options.