C# Huffman 实现

好久没有认真写一点东西了,今天来实现一个 Huffman 编码的算法。

原理分析

先来看一下 Huffman 编码和解码的基本原理:

编码

  1. 统计频率
  • 遍历待编码的字符串 / 字节数组,统计每个字符 / 字节出现的频率
  1. 构建 Huffman 树
  • 将每个字符和其频率作为节点
  • 使用优先队列 (最小堆) 来构建 Huffman 树。
  • 每次取出频率最小的两个节点,合并成一个新节点,直到只剩下一个节点
  1. 生成编码
  • 从根节点开始,遍历树,左子节点为 0,右子节点为 1,直到叶节点
  • 记录每个字符的编码
  1. 替换
  • 使用生成的编码替换原字符串中的字符

解码

  1. 构建 Huffman 树
    • 使用编码时生成的树结构
  2. 遍历编码
    • 从根节点开始,根据编码的 01 走左子节点或右子节点
    • 到达叶节点时,将对应字符 / 字节添加到结果中,并返回根节点继续遍历
  3. 重复直到遍历完整个编码

具体实现

先列举我们需要的几个类:

  • BitStream:用于处理位流的类
  • HuffmanNode:表示 Huffman 树的节点
  • PriorityQueue:优先队列

其中,PriorityQueue 在 .NET 6 及以上版本中已经有了,我们可以直接使用。另外两个类需要我们自己实现。

BitStream

BitStream 的主要功能是处理位流的读写,通过维护一个 underlying Stream,记录当前缓存的字节、当前位的位置和移位实现。

Huffman 需要的是 MSB (Most Significant Bit) 顺序的位流,我们这里把二者 (MSB、LSB) 都实现了。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
public sealed class BitStream : IDisposable
{
/// <summary>
/// Base stream that the BitStream operates on.
/// </summary>
public Stream BaseStream { get; }

/// <summary>
/// Endianness used for reading and writing data for the <see cref="BitStream" />.
/// </summary>
/// <remarks>Endianness determines the byte order in which data is processed. This field is used
/// internally to specify whether the bit stream operates in big-endian or little-endian mode.</remarks>
/// <example>
/// Read mode example:
/// 0b10100001
/// In big-endian (MSB), the most significant bit 10100001 is processed first.
/// ^
/// In little-endian (LSB), the least significant bit 10100001 is processed first.
/// ^
/// Write mode example:
/// If we keep writing bit 1 into the stream:
/// In big-endian (MSB), we get: 0b10000000, 0b11000000, 0b11100000, 0b11110000, 0b11111000, 0b11111100, 0b11111110, 0b11111111
/// In little-endian (LSB), we get: 0b00000001, 0b00000011, 0b00000111, 0b00001111, 0b00011111, 0b00111111, 0b01111111, 0b11111111
/// </example>
private readonly BitStreamEndianness endianness;

/// <summary>
/// Mode of operation for the <see cref="BitStream" />.
/// </summary>
/// <remarks>Specifies whether the <see cref="BitStream" /> operates in a read or write
/// mode. It is intended for internal use to control the behavior of the stream.</remarks>
private readonly BitStreamMode mode;

/// <summary>
/// Current byte being processed.
/// </summary>
private byte currentByte;

/// <summary>
/// Bit position in the current byte.
/// </summary>
private int bitPosition;

/// <summary>
/// Original byte position in the stream when the BitStream was created.
/// </summary>
/// Used for resetting the stream.
private readonly int originalPos;

public BitStream(Stream stream, BitStreamEndianness order = BitStreamEndianness.Msb, BitStreamMode mode = BitStreamMode.Read)
{
BaseStream = stream ?? throw new ArgumentNullException(nameof(stream));
this.endianness = order;
this.mode = mode;
this.currentByte = 0;
this.bitPosition = mode is BitStreamMode.Read ? 8 : 0;
this.originalPos = (int)stream.Position;
switch (mode)
{
case BitStreamMode.Write when !stream.CanWrite:
throw new InvalidOperationException("Stream must be writable in write mode.");
case BitStreamMode.Read when !stream.CanRead:
throw new InvalidOperationException("Stream must be readable in read mode.");
}
}

public int ReadBits(int bitCount)
{
if (mode != BitStreamMode.Read)
throw new InvalidOperationException("Stream is not in Read mode.");
if (bitCount < 1 || bitCount > 32)
throw new ArgumentOutOfRangeException(nameof(bitCount), "Bit count must be between 1 and 32.");

int result = 0;

for (int i = 0; i < bitCount; i++)
{
if (bitPosition == 8)
{
int readByte = BaseStream.ReadByte();
if (readByte == -1)
throw new EndOfStreamException();

currentByte = (byte)readByte;
bitPosition = 0;
}

int shift = endianness is BitStreamEndianness.Msb
? 7 - bitPosition
: bitPosition;
int bit = (currentByte >> shift) & 1;

result = (result << 1) | bit;
bitPosition++;
}

return result;
}

public int ReadBit()
{
return ReadBits(1);
}

public void WriteBit(int bit)
{
if (mode != BitStreamMode.Write)
throw new InvalidOperationException("Stream is not in Write mode.");
if (bit != 0 && bit != 1)
throw new ArgumentOutOfRangeException(nameof(bit), "Bit must be 0 or 1.");

int shift = endianness == BitStreamEndianness.Msb
? 7 - bitPosition
: bitPosition;
currentByte = (byte)((currentByte & ~(1 << shift)) | (bit << shift));
bitPosition++;

if (bitPosition == 8)
{
BaseStream.WriteByte(currentByte);
currentByte = 0;
bitPosition = 0;
}
}

public void WriteBit(bool bit)
{
WriteBit(bit ? 1 : 0);
}

public void WriteBits(byte[] bits, int bitCount)
{
if (mode != BitStreamMode.Write)
throw new InvalidOperationException("Stream is not in Write mode.");
if (bits == null || bits.Length == 0)
return;
if (bitCount < 1 || bitCount > bits.Length * 8)
throw new ArgumentOutOfRangeException(nameof(bitCount), "Bit count must be between 1 and the number of bits in the byte array.");

int bitCountWritten = 0;
int currentByteIndex = 0;
int currentBitIndex = 0;

while (bitCountWritten < bitCount)
{
int shift = endianness == BitStreamEndianness.Msb
? 7 - currentBitIndex
: currentBitIndex;
int bit = (bits[currentByteIndex] >> shift) & 1;
WriteBit(bit);
bitCountWritten++;
currentBitIndex++;

if (currentBitIndex == 8)
{
currentBitIndex = 0;
currentByteIndex++;
}
}
}

public void WriteBits(byte[] bits)
{
WriteBits(bits, bits.Length * 8);
}

public void WriteBits(byte bits)
{
WriteBits([bits], 8);
}

public void WriteBits(bool[] bits)
{
Debug.WriteLine("You are using bool[] as input. The BitStream will just traverse the array and write 1 for true and 0 for false regardless of the endianness.");
if (bits == null || bits.Length == 0)
return;
foreach (bool bit in bits)
{
WriteBit(bit ? 1 : 0);
}
}

public void Reset()
{
if (mode != BitStreamMode.Read)
throw new InvalidOperationException("Stream not in Read mode cannot be reset.");

BaseStream.Position = originalPos;
currentByte = 0;
bitPosition = 8;
}

#region IDisposable Members
private bool is_disposed;

private void Dispose(bool disposing)
{
if (!is_disposed)
{
if (disposing)
{
}

if (mode is BitStreamMode.Write && bitPosition > 0)
{
BaseStream.WriteByte(currentByte);
}
currentByte = 0;
is_disposed = true;
}
}

public void Dispose()
{
Dispose(true);
GC.SuppressFinalize(this);
}
#endregion
}

HuffmanNode

HuffmanNode 包含以下属性:父节点、左右节点、该节点的值 (字符 / 字节)、该节点的频率、对应的编码 (可选)。

对于编码的表示,思来想去还是 bool[] 最合适。

同时注意到,构造节点有两种情况:

  • 通过频率和字符 / 字节构造叶节点
  • 通过左右子节点构造非叶节点

因此需要两个构造函数。而且显然,构造时即可确定该节点是否为叶节点,后续无需变动。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
internal sealed class HuffmanNode
{
/// <summary>
/// Creates a new huffman leaf node.
/// </summary>
/// <param name="freq"> Frequency of this leaf node. </param>
/// <param name="val"> Leaf node value. </param>
public HuffmanNode(int freq, byte val)
{
Frequency = freq;
Value = val;
IsLeaf = true;
}

/// <summary>
/// Creates a new huffman node with two children.
/// </summary>
/// Used for building the huffman tree.
/// <param name="leftChild"> Left side child node. </param>
/// <param name="rightChild"> Right side child node. </param>
public HuffmanNode(HuffmanNode leftChild, HuffmanNode rightChild)
{
LeftChild = leftChild;
RightChild = rightChild;
leftChild.Bit = false;
rightChild.Bit = true;
Frequency = leftChild.Frequency + rightChild.Frequency;
leftChild.Parent = rightChild.Parent = this;
IsLeaf = false;
}

/// <summary>
/// Parent node of this Huffman node.
/// </summary>
public HuffmanNode Parent { get; private set; }

/// <summary>
/// Left child node of this Huffman node.
/// </summary>
public HuffmanNode LeftChild { get; }

/// <summary>
/// Right child node of this Huffman node.
/// </summary>
public HuffmanNode RightChild { get; }

/// <summary>
/// Represents whether this node is a leaf node.
/// </summary>
/// This property is set when the node is created.
public bool IsLeaf { get; }

/// <summary>
/// Represents whether this node is the root of the Huffman tree.
/// </summary>
public bool IsRoot => Parent == null;

/// <summary>
/// Represents the value of the edge leading to this node.
/// </summary>
/// 0 for left child, 1 for right child.
/// Part of the Huffman Code.
public bool Bit { get; private set; }

/// <summary>
/// Leaf node value. Represents the byte value of the leaf node.
/// </summary>
public byte Value { get; }

/// <summary>
/// Node frequency.
/// </summary>
public int Frequency { get; }
}

PriorityQueue

在高版本 .NET 中,可以直接使用 System.Collections.Generic.PriorityQueue<TElement, TPriority>

在低版本 .NET 中,我们可以自行用最小堆实现一个简易的优先队列。

此处实现遵循 .NET Framework 4.8 的语法。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
internal sealed class PriorityQueue<T>
{
private readonly List<T> _heap;
private readonly IComparer<T> _comparer;

public PriorityQueue(IComparer<T> comparer)
{
_heap = new List<T>();
_comparer = comparer ?? throw new ArgumentNullException(nameof(comparer));
}

public int Count => _heap.Count;

public void Enqueue(T item)
{
_heap.Add(item);
HeapifyUp(_heap.Count - 1);
}

public T Dequeue()
{
if (_heap.Count == 0)
{
throw new InvalidOperationException("Queue is empty");
}

T firstItem = _heap[0];
int lastIndex = _heap.Count - 1;
_heap[0] = _heap[lastIndex];
_heap.RemoveAt(lastIndex);

if (_heap.Count > 0)
{
HeapifyDown(0);
}
return firstItem;
}

public T Peek()
{
if (_heap.Count == 0)
{
throw new InvalidOperationException("Queue is empty");
}
return _heap[0];
}

public void Clear()
{
_heap.Clear();
}

private void HeapifyUp(int index)
{
while (index > 0)
{
int parentIndex = (index - 1) / 2;
if (_comparer.Compare(_heap[index], _heap[parentIndex]) >= 0)
{
break;
}
Swap(index, parentIndex);
index = parentIndex;
}
}

private void HeapifyDown(int index)
{
int current = index;
while (true)
{
int smallest = current;
int leftChild = 2 * current + 1;
int rightChild = 2 * current + 2;

if (leftChild < _heap.Count &&
_comparer.Compare(_heap[leftChild], _heap[smallest]) < 0)
{
smallest = leftChild;
}
if (rightChild < _heap.Count &&
_comparer.Compare(_heap[rightChild], _heap[smallest]) < 0)
{
smallest = rightChild;
}

if (smallest == current)
break;

Swap(current, smallest);
current = smallest;
}
}

private void Swap(int i, int j)
{
(_heap[i], _heap[j]) = (_heap[j], _heap[i]);
}
}

构造函数中需要传入一个比较器 IComparer<T>,用于比较节点的优先级:

1
2
3
4
5
6
7
8
9
10
11
internal sealed class HuffmanNodeComparer : IComparer<HuffmanNode>
{
public int Compare(HuffmanNode x, HuffmanNode y)
{
if (x == null || y == null)
{
throw new ArgumentNullException("HuffmanNode cannot be null");
}
return x.Frequency.CompareTo(y.Frequency);
}
}

使用独立的比较器而不是让 HuffmanNode 实现 IComparable<HuffmanNode> ,是为了保证 HuffmanNode 在不同版本间的一致性。

HuffmanCompression

接下来是 Huffman 编码和解码的核心类 HuffmanCompression

编码

  1. 统计频率
1
2
3
4
5
int[] freqs = new int[256];
foreach (byte b in data)
{
freqs[b]++;
}

freqs 数组的索引是字节值,值是该字节出现的频率。

  1. 构建 Huffman 树

使用内置的 PriorityQueue 来构建 Huffman 树:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
private HuffmanNode BuildTree(int[] freqs)
{
PriorityQueue<HuffmanNode, int> queue = new();
for (int i = 0; i < freqs.Length; i++)
{
if (freqs[i] > 0)
{
queue.Enqueue(new(freqs[i], (byte)i), freqs[i]);
}
}
while (queue.Count > 1)
{
HuffmanNode left = queue.Dequeue();
HuffmanNode right = queue.Dequeue();
HuffmanNode parent = new(left, right);
queue.Enqueue(parent, parent.Frequency);
}
return queue.Count > 0 ? queue.Dequeue() : null;
}

for 循环中构造所有叶节点,并将它们加入优先队列。然后通过不断取出频率最小的两个节点,合并成一个新节点,直到只剩下一个节点。频率的加和与叶节点的判断均在 HuffmanNode 的构造函数中完成。

使用自定义的 PriorityQueue 大同小异:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
private HuffmanNode BuildTree(int[] freqs)
{
PriorityQueue<HuffmanNode> queue = new PriorityQueue<HuffmanNode>(new HuffmanNodeComparer());
for (int i = 0; i < freqs.Length; i++)
{
if (freqs[i] > 0)
{
queue.Enqueue(new HuffmanNode(freqs[i], (byte)i));
}
}
while (queue.Count > 1)
{
HuffmanNode left = queue.Dequeue();
HuffmanNode right = queue.Dequeue();
HuffmanNode parent = new HuffmanNode(left, right);
queue.Enqueue(parent);
}
return queue.Count > 0 ? queue.Dequeue() : null;
}
  1. 生成编码

用一个字典存储字节值和对应的编码。

如果当前节点是叶节点,则从该节点开始向上遍历,记录路径上的 01,直到到达根节点。将路径反转即为该字节的编码。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
private void GenerateCodes(HuffmanNode node, Dictionary<byte, bool[]> codeTable)
{
if (node == null)
{
return;
}

HuffmanNode current;
if (node.IsLeaf)
{
List<bool> code = [];
current = node;
while (!current.IsRoot)
{
code.Add(current.Bit);
current = current.Parent;
}
code.Reverse();
codeTable[node.Value] = [.. code];
}
else
{
if (node.LeftChild != null)
{
GenerateCodes(node.LeftChild, codeTable);
}
if (node.RightChild != null)
{
GenerateCodes(node.RightChild, codeTable);
}
}
}
  1. 写入树

为了在解码时能够重建 Huffman 树,我们需要将树结构序列化到输出流中。可以使用前序遍历的方式来写入树。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
private void WriteTree(HuffmanNode node)
{
if (node.IsLeaf)
{
input.WriteBit(0);
input.WriteBits(node.Value);
}
else
{
input.WriteBit(1);
WriteTree(node.LeftChild);
WriteTree(node.RightChild);
}
}
  1. 替换

这一步就简单了,遍历原数据,将每个字节替换为对应的编码。

1
2
3
4
5
6
7
8
9
10
11
foreach (byte b in data)
{
if (codeTable.TryGetValue(b, out bool[] code))
{
input.WriteBits(code);
}
else
{
throw new InvalidDataException($"No code found for byte {b}.");
}
}

解码

有一个很巧的思路:字节的取值是 0~255,因此我们可以直接拿一个二维数组 ushort[512, 2]

第一位表示字节值(<256)或父节点的索引(>=256),第二位表示左右子节点。

  1. 读取树
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
private const ushort MAX = 512;
private ushort index = 256;
private readonly ushort[,] children = new ushort[MAX, 2];

private ushort RebuildTree()
{
switch (input.ReadBit())
{
case 0:
return (ushort)input.ReadBits(8);
case 1:
ushort parent = index++;
if (parent >= MAX)
{
throw new InvalidDataException("Exceeded huffman tree.");
}
children[parent, 0] = RebuildTree();
children[parent, 1] = RebuildTree();
return parent;
default:
throw new InvalidDataException("Invalid bit.");
}
}
  1. 解码

这种表示的好处在于判断是否为叶节点非常简单:如果索引小于 256,则是叶节点,且该值即为字节值;如果索引大于等于 256,则是非叶节点,且可以通过 children 数组获取左右子节点。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
public MemoryStream Decompress()
{
if (mode != CompressionMode.Decompress)
{
throw new InvalidOperationException("Not in decompression mode.");
}

ushort root = RebuildTree();
while (output.Length != decompressedLength)
{
ushort node = root;
while (node >= 256) // Keep reading bits until we reach a leaf node
{
int bit = input.ReadBit();
if (bit != -1)
{
node = children[node, bit]; // Traverse the tree based on the bit read
}
}
output.WriteByte((byte)node);
}
return output;
}

完整代码

GitHub


C# Huffman 实现
https://blog.refrain69.isgre.at/posts/65470/
发布于
2025年6月28日
许可协议