03: Huffman Coding

CSCI 6990.002: Data Compression
University of New Orleans, Department of Computer Science
Vassil Roussev

Shannon-Fano Coding

- The first code based on Shannon's theory
  - Suboptimal (it took a graduate student to fix it!)
- Algorithm
  1. Start with empty codes
  2. Compute frequency statistics for all symbols
  3. Order the symbols in the set by frequency
  4. Split the set so as to minimize the difference in total frequency between the two halves
  5. Add '0' to the codes in the first set and '1' to the rest
  6. Recursively assign the rest of the code bits for the two subsets, until sets cannot be split
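The steps above can be sketched in Python (a sketch of mine, not the course's code; when a split point is ambiguous, the tie-break here may differ from the worked example that follows, yielding different but equally good codewords):

```python
def shannon_fano(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> code string."""
    codes = {s: "" for s in freqs}

    def split(symbols):
        if len(symbols) < 2:
            return
        total = sum(freqs[s] for s in symbols)
        running, cut, best = 0, 1, total
        # Find the cut that minimizes the difference between the two halves.
        for i, s in enumerate(symbols[:-1], start=1):
            running += freqs[s]
            diff = abs(total - 2 * running)
            if diff < best:
                best, cut = diff, i
        first, rest = symbols[:cut], symbols[cut:]
        for s in first:
            codes[s] += "0"       # '0' to the first set
        for s in rest:
            codes[s] += "1"       # '1' to the rest
        split(first)              # recurse until sets cannot be split
        split(rest)

    ordered = sorted(freqs, key=freqs.get, reverse=True)  # highest frequency first
    split(ordered)
    return codes
```

On the frequencies used in the next slides (a–f with 9, 8, 6, 5, 4, 2) this produces a prefix-free code with the same total cost as the slides' code, even where the tie-breaking picks a different split.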


Shannon-Fano Coding (2)–(10): Worked Example

[Figure series: building the code for symbols a–f with frequencies 9, 8, 6, 5, 4, 2.]

- Split {a, b, c, d, e, f} into {a, b} (9 + 8 = 17) and {c, d, e, f} (6 + 5 + 4 + 2 = 17); the first set gets prefix '0', the second '1'
- Split {a, b} into {a} and {b}: a = 00, b = 01
- Split {c, d, e, f} into {c, d} and {e, f}: c and d get prefix '10', e and f get '11'
- Split {c, d}: c = 100, d = 101
- Split {e, f}: e = 110, f = 111

Final code: a = 00, b = 01, c = 100, d = 101, e = 110, f = 111
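As a sanity check on the example, the final code can be weighed against the frequencies (a quick computation of mine, not from the slides):

```python
# Average length of the Shannon-Fano code built in the example above.
freqs = {"a": 9, "b": 8, "c": 6, "d": 5, "e": 4, "f": 2}
codes = {"a": "00", "b": "01", "c": "100", "d": "101", "e": "110", "f": "111"}

total_symbols = sum(freqs.values())                        # 34
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)  # 85
avg_len = total_bits / total_symbols                       # 2.5 bits/symbol
```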

Optimum Prefix Codes

- Key observations on optimal codes:
  1. Symbols that occur more frequently have shorter codewords
  2. The two least frequent symbols have codewords of the same length
- Proofs:
  1. Assume the opposite: swapping the codewords of the two symbols would shorten the average length, so the code is clearly sub-optimal
  2. Assume the opposite:
     - Let X, Y be the least frequent symbols, with |code(X)| = k and |code(Y)| = k + 1
     - By unique decodability (UD), code(X) cannot be a prefix of code(Y); also, all other codes are shorter
     - Therefore dropping the last bit of code(Y) would generate a new, shorter, uniquely decodable code
     - This contradicts the optimality assumption


Huffman Coding

- David Huffman (1951)
  - Grad student of Robert M. Fano (MIT)
  - Devised the algorithm for a term paper(!)
- Explained by example: letters a, b, c, d, e with probabilities 0.2, 0.4, 0.2, 0.1, 0.1

Huffman Coding by Example

Init: create a singleton set out of each letter. Then repeat four steps until a single set remains:

1. Sort the sets by probability (lowest first)
2. Insert prefix '1' into the codes of the letters in the lowest-probability set
3. Insert prefix '0' into the codes of the letters in the second-lowest set
4. Merge the top two sets

Round 1: sets {d} (0.1), {e} (0.1), {a} (0.2), {c} (0.2), {b} (0.4)
  - d gets '1', e gets '0'; merge into {d, e} (0.2)

Round 2: sets {d, e} (0.2), {a} (0.2), {c} (0.2), {b} (0.4)
  - d = 11, e = 10; a gets '0'; merge into {d, e, a} (0.4)

Round 3: sets {c} (0.2), {d, e, a} (0.4), {b} (0.4)
  - c gets '1'; a = 00, d = 011, e = 010; merge into {c, d, e, a} (0.6)

Round 4: sets {b} (0.4), {c, d, e, a} (0.6)
  - b = 1; a = 000, c = 01, d = 0011, e = 0010; merge into {a, b, c, d, e} (1.0)

Final code:

Letter | Code | Probability
a      | 000  | 0.2
b      | 1    | 0.4
c      | 01   | 0.2
d      | 0011 | 0.1
e      | 0010 | 0.1
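The merge procedure above maps naturally onto a priority queue. A sketch of mine, not the slides' table walkthrough; the heap breaks ties differently from the slides' sort order, so the exact codewords can differ while the optimal 2.2 bits/symbol average is preserved:

```python
import heapq

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> code string."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial code}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, set1 = heapq.heappop(heap)   # lowest-probability set: prefix '1'
        p2, _, set2 = heapq.heappop(heap)   # second-lowest set: prefix '0'
        merged = {s: "1" + c for s, c in set1.items()}
        merged.update({s: "0" + c for s, c in set2.items()})
        count += 1
        heapq.heappush(heap, (p1 + p2, count, merged))
    return heap[0][2]
```

The integer tie-breaker keeps tuple comparison away from the (non-comparable) dicts when two sets have equal probability.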


Example Summary

- Average code length: l = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2 bits/symbol
- Entropy: H = −Σ_{s=a..e} P(s) log2 P(s) ≈ 2.122 bits/symbol
- Redundancy: l − H ≈ 0.078 bits/symbol
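The three numbers above can be checked directly (a quick verification of mine, using the codeword lengths from the example: a=000, b=1, c=01, d=0011, e=0010):

```python
from math import log2

probs = {"a": 0.2, "b": 0.4, "c": 0.2, "d": 0.1, "e": 0.1}
lengths = {"a": 3, "b": 1, "c": 2, "d": 4, "e": 4}

avg_len = sum(probs[s] * lengths[s] for s in probs)   # 2.2 bits/symbol
entropy = -sum(p * log2(p) for p in probs.values())   # ~2.122 bits/symbol
redundancy = avg_len - entropy                        # ~0.078 bits/symbol
```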

Huffman Tree

[Figure: the code tree for the example. The root's '1' branch is the leaf b (0.4); under '0', the '1' branch is c (0.2, code 01); under '00', the '0' branch is a (0.2, code 000), and under '001' sit d (0.1, code 0011) and e (0.1, code 0010).]

Letter | Code
a      | 000
b      | 1
c      | 01
d      | 0011
e      | 0010
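Decoding walks the tree bit by bit; the same walk can be sketched over a plain code table by accumulating bits until a codeword matches (a hypothetical helper of mine, not from the slides; it relies on the code being prefix-free):

```python
def decode(bits, codes):
    """Decode a bit string with a prefix-free code table {symbol: codeword}."""
    inverse = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:   # prefix-free: the first full match is the symbol
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

codes = {"a": "000", "b": "1", "c": "01", "d": "0011", "e": "0010"}
print(decode("100001", codes))  # bac
```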


Building a Huffman Tree

[Figure series: the same merges drawn as a tree, built bottom-up.]

- Step 1: join the two least probable leaves, d (0.1) and e (0.1), under a node of weight 0.2; d takes branch '1', e takes '0'
- Step 2: join that node with a (0.2) under a node of weight 0.4; d = 11, e = 10, a = 0
- Step 3: join with c (0.2) under a node of weight 0.6; c = 1, a = 00, d = 011, e = 010
- Step 4: join with b (0.4) under the root (weight 1.0); b = 1, a = 000, c = 01, d = 0011, e = 0010


An Alternative Huffman Tree

[Figure series: a different, equally valid tree for the same example, obtained by attaching the merged sets differently.]

- Final codes: a = 000, b = 1, c = 001, d = 011, e = 010
- Average code length: l = 0.4×1 + (0.2 + 0.2 + 0.1 + 0.1)×3 = 2.2 bits/symbol


Yet Another Tree

[Figure: a third tree for the same example.]

Letter | Code
a      | 00
b      | 11
c      | 01
d      | 101
e      | 100

- Average code length: l = 0.4×2 + (0.2 + 0.2)×2 + (0.1 + 0.1)×3 = 2.2 bits/symbol

Min Variance Huffman Trees

- Huffman codes are not unique
  - All versions yield the same average length
- Which one should we choose?
  - The one with the minimum variance in codeword lengths, i.e., the minimum-height tree
- Why?
  - It ensures the least amount of variability in the encoded stream
- How to achieve it?
  - During sorting, break ties by placing smaller sets higher
  - Alternatively, place newly merged sets as low as possible
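The variance difference is easy to quantify for the two trees seen earlier (a quick computation of mine, using the codeword lengths of the first tree, a=000/b=1/c=01/d=0011/e=0010, versus the balanced "yet another" tree, a=00/b=11/c=01/d=101/e=100):

```python
probs   = [0.2, 0.4, 0.2, 0.1, 0.1]   # a, b, c, d, e
tall    = [3, 1, 2, 4, 4]             # codeword lengths, first tree
shallow = [2, 2, 2, 3, 3]             # codeword lengths, balanced tree

def stats(lengths):
    avg = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - avg) ** 2 for p, l in zip(probs, lengths))
    return avg, var
```

Both trees average 2.2 bits/symbol, but the variance drops from 1.36 to 0.16 for the balanced tree.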


Extended Huffman Codes

- Consider the source: A = {a, b, c}, P(a) = 0.8, P(b) = 0.02, P(c) = 0.18
  - H = 0.816 bits/symbol
- Huffman code: a = 0, b = 11, c = 10
  - l = 1.2 bits/symbol
  - Redundancy = 0.384 bits/symbol (47%!)
- Q: Could we do better?

Extended Huffman Codes (2)

- Idea: encode sequences of two letters as opposed to single letters

Letter | Probability | Code
aa     | 0.6400      | 0
ab     | 0.0160      | 10101
ac     | 0.1440      | 11
ba     | 0.0160      | 101000
bb     | 0.0004      | 10100101
bc     | 0.0036      | 1010011
ca     | 0.1440      | 100
cb     | 0.0036      | 10100100
cc     | 0.0324      | 1011

- l = 1.7228/2 = 0.8614 bits/symbol; redundancy = 0.0045 bits/symbol
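The table's numbers can be reproduced from the single-letter probabilities (a quick check of mine, not from the slides):

```python
probs = {"a": 0.8, "b": 0.02, "c": 0.18}
codes = {"aa": "0", "ab": "10101", "ac": "11",
         "ba": "101000", "bb": "10100101", "bc": "1010011",
         "ca": "100", "cb": "10100100", "cc": "1011"}

# Pair probabilities are products of the letter probabilities.
pair_probs = {x + y: probs[x] * probs[y] for x in probs for y in probs}
bits_per_pair = sum(pair_probs[p] * len(c) for p, c in codes.items())  # 1.7228
bits_per_symbol = bits_per_pair / 2                                    # 0.8614
```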


Extended Huffman Codes (3)

- The idea can be extended further: consider all possible n^m sequences (we did 3^2)
- In theory, considering longer sequences improves the coding
- In reality, the exponential growth of the alphabet makes this impractical
  - E.g., for length-3 ASCII sequences: 256^3 = 2^24 = 16M
  - Most sequences would have zero frequency
  - ⇒ Other methods are needed

Adaptive Huffman Coding

- Problem
  - Huffman requires probability estimates
  - This could turn it into a two-pass procedure:
    1. Collect statistics, generate codewords
    2. Perform the actual encoding
  - Not practical in many situations, e.g., compressing network transmissions
- Theoretical solution
  - Start with equal probabilities
  - Based on the statistics of the first k symbols (k = 1, 2, …), regenerate the codewords and encode the (k+1)st symbol
  - Too expensive in practice
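The "theoretical solution" can be sketched naively by regenerating a Huffman code from the running counts before every symbol (my sketch, with all counts initialized to 1 to stand in for equal probabilities; rebuilding the whole code per symbol is exactly why the slide calls it too expensive):

```python
import heapq

def codes_from_counts(counts):
    """Huffman codes for the current counts (deterministic tie-breaking)."""
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in s1.items()}
        merged.update({s: "1" + w for s, w in s2.items()})
        tie += 1
        heapq.heappush(heap, (c1 + c2, tie, merged))
    return heap[0][2]

def adaptive_encode(text, alphabet):
    counts = {s: 1 for s in alphabet}   # equal "probabilities" to start
    out = []
    for ch in text:
        out.append(codes_from_counts(counts)[ch])  # code before seeing ch
        counts[ch] += 1                            # then update statistics
    return "".join(out)

def adaptive_decode(bits, alphabet, n):
    counts = {s: 1 for s in alphabet}   # decoder mirrors the encoder exactly
    out, pos = [], 0
    for _ in range(n):
        inverse = {c: s for s, c in codes_from_counts(counts).items()}
        buf = ""
        while buf not in inverse:
            buf += bits[pos]
            pos += 1
        out.append(inverse[buf])
        counts[inverse[buf]] += 1
    return "".join(out)
```

Because encoder and decoder apply the same deterministic procedure to the same counts, they stay in sync without transmitting any probability table.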


Adaptive Huffman Coding (2)

- Basic idea
  - Alphabet A = {a1, …, an}
  - Pick fixed default binary codes for all symbols
  - Start with an empty Huffman tree
  - Repeat until done: read symbol s from the source
    - If NYT(s):          // Not Yet Transmitted
      - Send NYT, default(s)
      - Update the tree (and keep it a Huffman tree)
    - Else:
      - Send the codeword for s
      - Update the tree
- Notes:
  - Codewords change as a function of symbol frequencies
  - Encoder and decoder follow the same procedure, so they stay in sync

Adaptive Huffman Tree

- The tree has at most 2n − 1 nodes
- Node attributes
  - symbol, left, right, parent, sibling, leaf
  - weight:
    - If x_k is a leaf, then weight(x_k) = frequency of symbol(x_k)
    - Else weight(x_k) = weight(left(x_k)) + weight(right(x_k))
  - id, assigned as follows:
    - If weight(x_1) ≤ weight(x_2) ≤ … ≤ weight(x_{2n−1}), then id(x_1) ≤ id(x_2) ≤ … ≤ id(x_{2n−1})
    - Also, parent(x_{2k−1}) = parent(x_{2k}) for 1 ≤ k ≤ n (the sibling property)


Updating the Tree

- Assign id(root) = 2n − 1, weight(NYT) = 0
- Start with an NYT node
- Whenever a new symbol is seen, a new node is formed by splitting the NYT node
- Maintaining the sibling property: whenever node x is updated
  - Repeat
    - If weight(x)