C# Huffman 实现

好久没有认真写一点东西了，今天来实现一个 Huffman 编码的算法。

原理分析

先来看一下 Huffman 编码和解码的基本原理：

编码

统计频率

遍历待编码的字符串 / 字节数组，统计每个字符 / 字节出现的频率

构建 Huffman 树

将每个字符和其频率作为节点
使用优先队列 (最小堆) 来构建 Huffman 树。
每次取出频率最小的两个节点，合并成一个新节点，直到只剩下一个节点

生成编码

从根节点开始，遍历树，左子节点为 0，右子节点为 1，直到叶节点
记录每个字符的编码

替换

使用生成的编码替换原字符串中的字符

解码

构建 Huffman 树
- 使用编码时生成的树结构
遍历编码
- 从根节点开始，根据编码的 0 和 1 走左子节点或右子节点
- 到达叶节点时，将对应字符 / 字节添加到结果中，并返回根节点继续遍历
重复直到遍历完整个编码

具体实现

先列举我们需要的几个类：

BitStream：用于处理位流的类
HuffmanNode：表示 Huffman 树的节点
PriorityQueue：优先队列

其中，PriorityQueue 在 .NET 6 及以上版本中已经有了，我们可以直接使用。另外两个类需要我们自己实现。

BitStream

BitStream 的主要功能是处理位流的读写，通过维护一个 underlying Stream，记录当前缓存的字节、当前位的位置和移位实现。

Huffman 需要的是 MSB (Most Significant Bit) 顺序的位流，我们这里把二者 (MSB、LSB) 都实现了。

▶

BitStream.cs

public sealed class BitStream : IDisposable
{
    /// <summary>
    /// Base stream that the BitStream operates on.
    /// </summary>
    public Stream BaseStream { get; }

    /// <summary>
    /// Endianness used for reading and writing data for the <see cref="BitStream" />.
    /// </summary>
    /// <remarks>Endianness determines the byte order in which data is processed. This field is used
    /// internally to specify whether the bit stream operates in big-endian or little-endian mode.</remarks>
    /// <example>
    /// Read mode example:
    /// 0b10100001
    /// In big-endian (MSB), the most significant bit 10100001 is processed first.
    ///                                               ^
    /// In little-endian (LSB), the least significant bit 10100001 is processed first.
    ///                                                          ^
    /// Write mode example:
    /// If we keep writing bit 1 into the stream:
    /// In big-endian (MSB), we get: 0b10000000, 0b11000000, 0b11100000, 0b11110000, 0b11111000, 0b11111100, 0b11111110, 0b11111111
    /// In little-endian (LSB), we get: 0b00000001, 0b00000011, 0b00000111, 0b00001111, 0b00011111, 0b00111111, 0b01111111, 0b11111111
    /// </example>
    private readonly BitStreamEndianness endianness;

    /// <summary>
    /// Mode of operation for the <see cref="BitStream" />.
    /// </summary>
    /// <remarks>Specifies whether the <see cref="BitStream" /> operates in a read or write
    /// mode. It is intended for internal use to control the behavior of the stream.</remarks>
    private readonly BitStreamMode mode;

    /// <summary>
    /// Current byte being processed.
    /// </summary>
    private byte currentByte;

    /// <summary>
    /// Bit position in the current byte.
    /// </summary>
    private int bitPosition;

    /// <summary>
    /// Original byte position in the stream when the BitStream was created.
    /// </summary>
    /// Used for resetting the stream.
    private readonly int originalPos;

    public BitStream(Stream stream, BitStreamEndianness order = BitStreamEndianness.Msb, BitStreamMode mode = BitStreamMode.Read)
    {
        BaseStream = stream ?? throw new ArgumentNullException(nameof(stream));
        this.endianness = order;
        this.mode = mode;
        this.currentByte = 0;
        this.bitPosition = mode is BitStreamMode.Read ? 8 : 0;
        this.originalPos = (int)stream.Position;
        switch (mode)
        {
            case BitStreamMode.Write when !stream.CanWrite:
                throw new InvalidOperationException("Stream must be writable in write mode.");
            case BitStreamMode.Read when !stream.CanRead:
                throw new InvalidOperationException("Stream must be readable in read mode.");
        }
    }

    public int ReadBits(int bitCount)
    {
        if (mode != BitStreamMode.Read)
            throw new InvalidOperationException("Stream is not in Read mode.");
        if (bitCount < 1 || bitCount > 32)
            throw new ArgumentOutOfRangeException(nameof(bitCount), "Bit count must be between 1 and 32.");

        int result = 0;

        for (int i = 0; i < bitCount; i++)
        {
            if (bitPosition == 8)
            {
                int readByte = BaseStream.ReadByte();
                if (readByte == -1)
                    throw new EndOfStreamException();

                currentByte = (byte)readByte;
                bitPosition = 0;
            }

            int shift = endianness is BitStreamEndianness.Msb
                ? 7 - bitPosition
                : bitPosition;
            int bit = (currentByte >> shift) & 1;

            result = (result << 1) | bit;
            bitPosition++;
        }

        return result;
    }

    public int ReadBit()
    {
        return ReadBits(1);
    }

    public void WriteBit(int bit)
    {
        if (mode != BitStreamMode.Write)
            throw new InvalidOperationException("Stream is not in Write mode.");
        if (bit != 0 && bit != 1)
            throw new ArgumentOutOfRangeException(nameof(bit), "Bit must be 0 or 1.");

        int shift = endianness == BitStreamEndianness.Msb
            ? 7 - bitPosition
            : bitPosition;
        currentByte = (byte)((currentByte & ~(1 << shift)) | (bit << shift));
        bitPosition++;

        if (bitPosition == 8)
        {
            BaseStream.WriteByte(currentByte);
            currentByte = 0;
            bitPosition = 0;
        }
    }

    public void WriteBit(bool bit)
    {
        WriteBit(bit ? 1 : 0);
    }

    public void WriteBits(byte[] bits, int bitCount)
    {
        if (mode != BitStreamMode.Write)
            throw new InvalidOperationException("Stream is not in Write mode.");
        if (bits == null || bits.Length == 0)
            return;
        if (bitCount < 1 || bitCount > bits.Length * 8)
            throw new ArgumentOutOfRangeException(nameof(bitCount), "Bit count must be between 1 and the number of bits in the byte array.");

        int bitCountWritten = 0;
        int currentByteIndex = 0;
        int currentBitIndex = 0;

        while (bitCountWritten < bitCount)
        {
            int shift = endianness == BitStreamEndianness.Msb
                ? 7 - currentBitIndex
                : currentBitIndex;
            int bit = (bits[currentByteIndex] >> shift) & 1;
            WriteBit(bit);
            bitCountWritten++;
            currentBitIndex++;

            if (currentBitIndex == 8)
            {
                currentBitIndex = 0;
                currentByteIndex++;
            }
        }
    }

    public void WriteBits(byte[] bits)
    {
        WriteBits(bits, bits.Length * 8);
    }

    public void WriteBits(byte bits)
    {
        WriteBits([bits], 8);
    }

    public void WriteBits(bool[] bits)
    {
        Debug.WriteLine("You are using bool[] as input. The BitStream will just traverse the array and write 1 for true and 0 for false regardless of the endianness.");
        if (bits == null || bits.Length == 0)
            return;
        foreach (bool bit in bits)
        {
            WriteBit(bit ? 1 : 0);
        }
    }

    public void Reset()
    {
        if (mode != BitStreamMode.Read)
            throw new InvalidOperationException("Stream not in Read mode cannot be reset.");

        BaseStream.Position = originalPos;
        currentByte = 0;
        bitPosition = 8;
    }

    #region IDisposable Members
    private bool is_disposed;

    private void Dispose(bool disposing)
    {
        if (!is_disposed)
        {
            if (disposing)
            {
            }

            if (mode is BitStreamMode.Write && bitPosition > 0)
            {
                BaseStream.WriteByte(currentByte);
            }
            currentByte = 0;
            is_disposed = true;
        }
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }
    #endregion
}

HuffmanNode

HuffmanNode 包含以下属性：父节点、左右节点、该节点的值 (字符 / 字节)、该节点的频率、对应的编码 (可选)。

对于编码的表示，思来想去还是 bool[] 最合适。

同时注意到，构造节点有两种情况：

通过频率和字符 / 字节构造叶节点
通过左右子节点构造非叶节点

因此需要两个构造函数。而且显然，构造时即可确定该节点是否为叶节点，后续无需变动。

▶

HuffmanNode.cs

internal sealed class HuffmanNode
{
    /// <summary>
    /// Creates a new huffman leaf node.
    /// </summary>
    /// <param name="freq"> Frequency of this leaf node. </param>
    /// <param name="val"> Leaf node value. </param>
    public HuffmanNode(int freq, byte val)
    {
        Frequency = freq;
        Value = val;
        IsLeaf = true;
    }

    /// <summary>
    /// Creates a new huffman node with two children.
    /// </summary>
    /// Used for building the huffman tree.
    /// <param name="leftChild"> Left side child node. </param>
    /// <param name="rightChild"> Right side child node. </param>
    public HuffmanNode(HuffmanNode leftChild, HuffmanNode rightChild)
    {
        LeftChild = leftChild;
        RightChild = rightChild;
        leftChild.Bit = false;
        rightChild.Bit = true;
        Frequency = leftChild.Frequency + rightChild.Frequency;
        leftChild.Parent = rightChild.Parent = this;
        IsLeaf = false;
    }

    /// <summary>
    /// Parent node of this Huffman node.
    /// </summary>
    public HuffmanNode Parent { get; private set; }

    /// <summary>
    /// Left child node of this Huffman node.
    /// </summary>
    public HuffmanNode LeftChild { get; }

    /// <summary>
    /// Right child node of this Huffman node.
    /// </summary>
    public HuffmanNode RightChild { get; }

    /// <summary>
    /// Represents whether this node is a leaf node.
    /// </summary>
    /// This property is set when the node is created.
    public bool IsLeaf { get; }

    /// <summary>
    /// Represents whether this node is the root of the Huffman tree.
    /// </summary>
    public bool IsRoot => Parent == null;

    /// <summary>
    /// Represents the value of the edge leading to this node.
    /// </summary>
    /// 0 for left child, 1 for right child.
    /// Part of the Huffman Code.
    public bool Bit { get; private set; }

    /// <summary>
    /// Leaf node value. Represents the byte value of the leaf node.
    /// </summary>
    public byte Value { get; }

    /// <summary>
    /// Node frequency.
    /// </summary>
    public int Frequency { get; }
}

PriorityQueue

在高版本 .NET 中，可以直接使用 System.Collections.Generic.PriorityQueue<TElement, TPriority>。

在低版本 .NET 中，我们可以自行用最小堆实现一个简易的优先队列。

此处实现遵循 .NET Framework 4.8 的语法。

▶

PriorityQueue.cs

internal sealed class PriorityQueue<T>
{
    private readonly List<T> _heap;
    private readonly IComparer<T> _comparer;

    public PriorityQueue(IComparer<T> comparer)
    {
        _heap = new List<T>();
        _comparer = comparer ?? throw new ArgumentNullException(nameof(comparer));
    }

    public int Count => _heap.Count;

    public void Enqueue(T item)
    {
        _heap.Add(item);
        HeapifyUp(_heap.Count - 1);
    }

    public T Dequeue()
    {
        if (_heap.Count == 0)
        {
            throw new InvalidOperationException("Queue is empty");
        }

        T firstItem = _heap[0];
        int lastIndex = _heap.Count - 1;
        _heap[0] = _heap[lastIndex];
        _heap.RemoveAt(lastIndex);

        if (_heap.Count > 0)
        {
            HeapifyDown(0);
        }
        return firstItem;
    }

    public T Peek()
    {
        if (_heap.Count == 0)
        {
            throw new InvalidOperationException("Queue is empty");
        }
        return _heap[0];
    }

    public void Clear()
    {
        _heap.Clear();
    }

    private void HeapifyUp(int index)
    {
        while (index > 0)
        {
            int parentIndex = (index - 1) / 2;
            if (_comparer.Compare(_heap[index], _heap[parentIndex]) >= 0)
            {
                break;
            }
            Swap(index, parentIndex);
            index = parentIndex;
        }
    }

    private void HeapifyDown(int index)
    {
        int current = index;
        while (true)
        {
            int smallest = current;
            int leftChild = 2 * current + 1;
            int rightChild = 2 * current + 2;

            if (leftChild < _heap.Count &&
                _comparer.Compare(_heap[leftChild], _heap[smallest]) < 0)
            {
                smallest = leftChild;
            }
            if (rightChild < _heap.Count &&
                _comparer.Compare(_heap[rightChild], _heap[smallest]) < 0)
            {
                smallest = rightChild;
            }

            if (smallest == current)
                break;

            Swap(current, smallest);
            current = smallest;
        }
    }

    private void Swap(int i, int j)
    {
        (_heap[i], _heap[j]) = (_heap[j], _heap[i]);
    }
}

构造函数中需要传入一个比较器 IComparer<T>，用于比较节点的优先级：

▶

HuffmanNodeComparer.cs

internal sealed class HuffmanNodeComparer : IComparer<HuffmanNode>
{
    public int Compare(HuffmanNode x, HuffmanNode y)
    {
        if (x == null || y == null)
        {
            throw new ArgumentNullException("HuffmanNode cannot be null");
        }
        return x.Frequency.CompareTo(y.Frequency);
    }
}

使用独立的比较器而不是让 HuffmanNode 实现 IComparable<HuffmanNode> ，是为了保证 HuffmanNode 在不同版本间的一致性。

HuffmanCompression

接下来是 Huffman 编码和解码的核心类 HuffmanCompression。

编码

统计频率

int[] freqs = new int[256];
foreach (byte b in data)
{
    freqs[b]++;
}

freqs 数组的索引是字节值，值是该字节出现的频率。

构建 Huffman 树

使用内置的 PriorityQueue 来构建 Huffman 树：

private HuffmanNode BuildTree(int[] freqs)
{
    PriorityQueue<HuffmanNode, int> queue = new();
    for (int i = 0; i < freqs.Length; i++)
    {
        if (freqs[i] > 0)
        {
            queue.Enqueue(new(freqs[i], (byte)i), freqs[i]);
        }
    }
    while (queue.Count > 1)
    {
        HuffmanNode left = queue.Dequeue();
        HuffmanNode right = queue.Dequeue();
        HuffmanNode parent = new(left, right);
        queue.Enqueue(parent, parent.Frequency);
    }
    return queue.Count > 0 ? queue.Dequeue() : null;
}

for 循环中构造所有叶节点，并将它们加入优先队列。然后通过不断取出频率最小的两个节点，合并成一个新节点，直到只剩下一个节点。频率的加和与叶节点的判断均在 HuffmanNode 的构造函数中完成。

使用自定义的 PriorityQueue 大同小异：

private HuffmanNode BuildTree(int[] freqs)
{
    PriorityQueue<HuffmanNode> queue = new PriorityQueue<HuffmanNode>(new HuffmanNodeComparer());
    for (int i = 0; i < freqs.Length; i++)
    {
        if (freqs[i] > 0)
        {
            queue.Enqueue(new HuffmanNode(freqs[i], (byte)i));
        }
    }
    while (queue.Count > 1)
    {
        HuffmanNode left = queue.Dequeue();
        HuffmanNode right = queue.Dequeue();
        HuffmanNode parent = new HuffmanNode(left, right);
        queue.Enqueue(parent);
    }
    return queue.Count > 0 ? queue.Dequeue() : null;
}

生成编码

用一个字典存储字节值和对应的编码。

如果当前节点是叶节点，则从该节点开始向上遍历，记录路径上的 0 和 1，直到到达根节点。将路径反转即为该字节的编码。

private void GenerateCodes(HuffmanNode node, Dictionary<byte, bool[]> codeTable)
{
    if (node == null)
    {
        return;
    }

    HuffmanNode current;
    if (node.IsLeaf)
    {
        List<bool> code = [];
        current = node;
        while (!current.IsRoot)
        {
            code.Add(current.Bit);
            current = current.Parent;
        }
        code.Reverse();
        codeTable[node.Value] = [.. code];
    }
    else
    {
        if (node.LeftChild != null)
        {
            GenerateCodes(node.LeftChild, codeTable);
        }
        if (node.RightChild != null)
        {
            GenerateCodes(node.RightChild, codeTable);
        }
    }
}

写入树

为了在解码时能够重建 Huffman 树，我们需要将树结构序列化到输出流中。可以使用前序遍历的方式来写入树。

private void WriteTree(HuffmanNode node)
{
    if (node.IsLeaf)
    {
        input.WriteBit(0);
        input.WriteBits(node.Value);
    }
    else
    {
        input.WriteBit(1);
        WriteTree(node.LeftChild);
        WriteTree(node.RightChild);
    }
}

替换

这一步就简单了，遍历原数据，将每个字节替换为对应的编码。

foreach (byte b in data)
{
    if (codeTable.TryGetValue(b, out bool[] code))
    {
        input.WriteBits(code);
    }
    else
    {
        throw new InvalidDataException($"No code found for byte {b}.");
    }
}

解码

有一个很巧的思路：字节的取值是 0~255，因此我们可以直接拿一个二维数组 ushort[512, 2]：

第一位表示字节值（<256）或父节点的索引（>=256），第二位表示左右子节点。

读取树

private const ushort MAX = 512;
private ushort index = 256;
private readonly ushort[,] children = new ushort[MAX, 2];

private ushort RebuildTree()
{
    switch (input.ReadBit())
    {
        case 0:
            return (ushort)input.ReadBits(8);
        case 1:
            ushort parent = index++;
            if (parent >= MAX)
            {
                throw new InvalidDataException("Exceeded huffman tree.");
            }
            children[parent, 0] = RebuildTree();
            children[parent, 1] = RebuildTree();
            return parent;
        default:
            throw new InvalidDataException("Invalid bit.");
    }
}

解码

这种表示的好处在于判断是否为叶节点非常简单：如果索引小于 256，则是叶节点，且该值即为字节值；如果索引大于等于 256，则是非叶节点，且可以通过 children 数组获取左右子节点。

public MemoryStream Decompress()
{
    if (mode != CompressionMode.Decompress)
    {
        throw new InvalidOperationException("Not in decompression mode.");
    }

    ushort root = RebuildTree();
    while (output.Length != decompressedLength)
    {
        ushort node = root;
        while (node >= 256)                 // Keep reading bits until we reach a leaf node
        {
            int bit = input.ReadBit();
            if (bit != -1)
            {
                node = children[node, bit]; // Traverse the tree based on the bit read
            }
        }
        output.WriteByte((byte)node);
    }
    return output;
}

完整代码

见 GitHub。

技术

#Csharp #算法

C# Huffman 实现

https://blog.refrain69.isgre.at/posts/65470/

发布于

2025年6月28日

许可协议

Python 虚拟环境配置下一篇

C# Huffman 实现

原理分析

编码

解码

具体实现

BitStream

HuffmanNode

PriorityQueue

HuffmanCompression

编码

解码

完整代码

Waline.init({el:"#waline",serverURL:"https://my-blog-comment.vercel.app/"})