
Arithmetic Coding Algorithm and Implementation Issues

You’ve probably seen that there are a lot of articles and textbooks that cover this algorithm in
detail. Unfortunately, not all of them explain it clearly, and it is sometimes hard to understand what
they are really talking about, since they give only the basic theory without considering how to
actually implement it in the real world. However, there is still great material that I think is very
good for any beginner learning this algorithm. You can visit Eric Bodden, Dipperstein, or Arturo
Campos, who explain the algorithm conceptually, follow up with the implementation issues, and
provide easy-to-follow source code. But, based on my experience, I had trouble implementing it
even though it isn’t too complicated in theory. I didn’t really understand how this algorithm works
and then achieves compression, and I can say that even now I don’t fully understand the exact
concept behind it. You may have played with other compression algorithms like Huffman, LZW,
or Shannon-Fano, and I believe you can easily follow their flow in achieving compression and
implement them in any programming language in a straightforward way, simply by following their
steps. Not so for Arithmetic Coding: mastering the theory is no guarantee that you can implement
it directly, and many students (me included) have struggled with this problem. It turns out that
there are some issues that should be understood before implementing it (I’ll get to them later).
Basically, AC is usually introduced with a real-number implementation, when in fact it should be
implemented using integer values. I found a great paper on this issue by Eric Bodden; you can
download it freely here. It was a huge help in understanding Arithmetic Coding, and I highly
recommend you read it.

Algorithm Overview

Generally, data compression algorithms work by using certain codes to represent symbols that
occur repeatedly in a file, whatever technique a given algorithm uses to generate the codes.
Arithmetic Coding (AC) is slightly different: the main idea of AC is to encode the whole message
of a file as a single floating-point number that falls in the interval [0, 1), meaning “zero” is included
and “one” is excluded. Initially, when I first learned it, I wondered how that could be; I didn’t believe
it could be done. It turns out that there are infinitely many numbers between 0 and 1, namely all
real numbers in the interval 0 <= x < 1. These numbers are used to encode all symbols from an
input stream, which means the final result will always be represented by some number from that
interval. So, how do we convert this number to binary so we can estimate the ratio of this
algorithm? (Oh, that was a stupid question I asked when I first learned AC.)

Actually AC has a long history, and since its discovery it has been beating the Huffman algorithm.
Please correct me if I am wrong, or refer to a textbook by Mengyi Pu or Mark Nelson for more
details on this algorithm.

To perform compression or decompression, AC needs two phases. The first phase builds the
probability and range of each symbol; the second performs the encoding and decoding based on
those probabilities.

To keep everything simple, let’s assume the set of symbols S in message M is {S1, S2, S3, ….
SN} with probabilities P = {P1, P2, P3,….PN} where ∑P = 1. This produces a unique sub-interval
for each symbol, falling in the interval [0, 1):
S1 = [0, P1)
S2 = [P1, P1+P2)
S3 = [P1+P2, P1+P2+P3)
…
SN = [P1+P2+P3+…+PN-1, P1+P2+P3+…+PN)

For example, the string “HELLO” gives the probability table below:

Table 1.0
Symbol Probability Range
H 1/5 = 0.2 [0, 0.2)
E 1/5 = 0.2 [0.2, 0.4)
L 2/5 = 0.4 [0.4, 0.8)
O 1/5 = 0.2 [0.8, 1)

To encode the string, information like the table above is required by both the encoder and the
decoder, either to generate the codeword or to reverse the encoded stream. So, when you
compress a string, you need to include that information in your file header so the decoder has
what it needs to decode the encoded stream. The order of symbols in the probability table is
arbitrary, but the encoder and decoder must use the same order (so far so good).
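This table-building phase fits in a few lines. The article doesn’t pin down a language, so here is a small Python sketch (the function name `build_ranges` is my own); it reproduces Table 1.0 for “HELLO”, assigning ranges in order of first appearance:

```python
def build_ranges(message):
    """Give each symbol a [low, high) sub-interval of [0, 1)
    whose width equals the symbol's probability."""
    counts = {}
    for ch in message:                      # count in order of first appearance,
        counts[ch] = counts.get(ch, 0) + 1  # matching Table 1.0
    total = len(message)
    ranges = {}
    cumulative = 0.0
    for symbol, n in counts.items():
        p = n / total
        ranges[symbol] = (cumulative, cumulative + p)
        cumulative += p
    return ranges

print(build_ranges("HELLO"))
# {'H': (0.0, 0.2), 'E': (0.2, 0.4), 'L': (0.4, 0.8), 'O': (0.8, 1.0)}
```

As noted above, the order is arbitrary, as long as the encoder and decoder build the table identically.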

Encoding
The encoding process of Arithmetic Coding starts by reading a symbol at a time and narrowing a
unique floating-point interval, whose lower and upper bounds are derived by arithmetic from the
symbol’s range, and the encoder keeps track of these bounds to encode the next symbol. The
final result is any value that falls between the final lower and upper bounds. Many references use
the lower bound as the final output of the encoding process, and that’s fine. Okay, the Arithmetic
Coding encoding algorithm is given in the pseudo code below.

Low = 0.0
High = 1.0
while there are symbols in the input stream do:
    Read the next Symbol
    CurrentSymbol = Symbol
    CodeRange = High - Low
    High = Low + CodeRange * HighRange of CurrentSymbol
    Low = Low + CodeRange * LowRange of CurrentSymbol
end while
EncodedStream = any value in [Low, High)
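The pseudo code above translates almost line for line into Python. This is only a sketch of the idealized floating-point encoder (a hypothetical `encode` helper, not production code, since real implementations use integers, as discussed later):

```python
def encode(message, ranges):
    """Idealized floating-point arithmetic encoder: narrow the
    interval [low, high) once per symbol."""
    low, high = 0.0, 1.0
    for symbol in message:
        code_range = high - low
        sym_low, sym_high = ranges[symbol]
        # both updates must read the OLD low, so update high first
        high = low + code_range * sym_high
        low = low + code_range * sym_low
    return low                  # any value in [low, high) would do

ranges = {"H": (0.0, 0.2), "E": (0.2, 0.4), "L": (0.4, 0.8), "O": (0.8, 1.0)}
print(encode("HELLO", ranges))   # ≈ 0.06752
```

Note the update order: if `low` were updated before `high`, the `high` update would read the already-narrowed `low` and produce a wrong interval.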

Using the probability table in Table 1.0, let’s encode the previous string (“HELLO”).

// init : Low = 0.0 & High = 1.0


Step 1:
Read Symbol “H”
CodeRange = 1.0 – 0.0 = 1.0
High = 0.0 + 1.0 * 0.2 = 0.2
Low = 0.0 + 1.0 * 0 = 0
Step 2:
Read Symbol “E”
CodeRange = 0.2 – 0 = 0.2
// Update the bounds and keep track of them
High = 0 + 0.2 * 0.4 = 0.08
Low = 0 + 0.2 * 0.2 = 0.04
Step 3:
Read Symbol “L”
CodeRange = 0.08 – 0.04 = 0.04
High = 0.04 + 0.04 * 0.8 = 0.072
Low = 0.04 + 0.04 * 0.4 = 0.056
Step 4:
Read Symbol “L”
CodeRange = 0.072 – 0.056 = 0.016
High = 0.056 + 0.016 * 0.8 = 0.0688
Low = 0.056 + 0.016 * 0.4 = 0.0624
Step 5:
Read Symbol “O”
CodeRange = 0.0688 – 0.0624 = 0.0064
High = 0.0624 + 0.0064 * 1 = 0.0688
Low = 0.0624 + 0.0064 * 0.8 = 0.06752

So, the final result is any value that falls in the interval [0.06752, 0.0688), but to keep everything
simple we just choose the lower bound, 0.06752, which indeed falls in the fraction [0, 1). This is
the value that represents the string “HELLO”. Recall that we can’t round this value, otherwise the
decoding will fail.

Decoding

In decoding, the decoder simply looks up the symbol whose range contains the current code. For
instance, if the current code is 0.3, the selected symbol is “E”; if the code is 0.4, the symbol is “L”;
and so on. Once the symbol is found, the code is rescaled by that symbol’s range to remove the
effect of the symbol just decoded. Here is the pseudo code for decoding:

Code = the encoded value
Do
    Search for the symbol whose range contains Code
    Output that symbol
    SymbolRange = HighRange of symbol - LowRange of symbol
    Code = Code - LowRange of symbol
    Code = Code / SymbolRange
Until symbol = EOF

Using the probability table from Table 1.0, let’s decode the code 0.06752 back to the original
string.

// Init : Code = 0.06752


Step 1:
Code = 0.06752
Since ‘Code’ falls in the interval [0, 0.2), we choose symbol “H”, and then we update some
variables:
SymbolRange = 0.2 – 0.0 = 0.2
Code = Code – Low = 0.06752 – 0 = 0.06752
Code = Code/SymbolRange = 0.06752/0.2 = 0.3376
Step 2:
The rest of the decoding is left as a homework assignment for the students…. :-D
(Hint: Code = 0.3376 falls in [0.2, 0.4), so the next symbol is “E”.)

Implementation Issues
As you can see from the previous example, the encoder and decoder are assumed to be able to
hold numbers with infinite precision. In practice this is not feasible, since no computer today (not
even a supercomputer) can handle infinite precision: the more symbols to be encoded in the input
stream, the more precision is needed, and the bad news is that we aren’t allowed to lose even a
bit of precision; in other words, no rounding can be performed, otherwise the decoder will fail
during decoding. Fortunately, there is a neat way to handle this problem: instead of real numbers,
we can use 16- or 32-bit integer values, changing the interval from 0 – 1 to 0000h – FFFFh (in
the case of 16-bit integers).

So, the question is how compression can be achieved with that modification. In fact, the output is
generated from the MSBs (Most Significant Bits) of the lower and upper bounds when they
converge (please refer to Michael’s discussion of this issue). By the nature of the AC algorithm,
once the MSBs of the lower and upper bounds converge, they will never change again, which
means they carry no further information, and without removing them no further encoding could
be performed. To handle this, we shift such bits out to the output and shift 0s into the rightmost
end of the lower bound and 1s into the rightmost end of the upper bound. This way it is possible
to generate arbitrarily long codewords from the interval 0000h – FFFFh.

Example of Converged MSBs:


Lower = 0110 1100 0001 1101
Upper = 0111 1110 1010 1100

In the example above, the leading MSBs of the lower and upper bounds converge (the first bit
matches, and after each shift another bit matches, three in total), so we shift those bits out to the
output and shift new bits into the bounds. The new lower and upper bounds would be something
like this:

Lower = 0110 0000 1110 1000


Upper = 1111 0101 0110 0111

Output = 011 (we can simply pack these output bits into bytes/ASCII characters to estimate the
compression ratio).
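This shifting can be checked mechanically. A small Python sketch (the function name is mine), applied to the bounds above, shifts out exactly the bits “011” and leaves the bounds shown:

```python
BITS = 16
MASK = (1 << BITS) - 1          # keep results 16 bits wide
MSB  = 1 << (BITS - 1)

def shift_converged(low, high):
    """Shift out every leading bit on which low and high agree;
    those bits are settled and belong to the output stream."""
    out = []
    while (low & MSB) == (high & MSB):
        out.append((low >> (BITS - 1)) & 1)
        low  = (low << 1) & MASK            # shift a 0 in on the right
        high = ((high << 1) & MASK) | 1     # shift a 1 in on the right
    return low, high, out

low, high, out = shift_converged(0b0110110000011101, 0b0111111010101100)
print(out)              # [0, 1, 1]
print(f"{low:016b}")    # 0110000011101000
print(f"{high:016b}")   # 1111010101100111
```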

On the other hand, there is one more serious problem for AC during encoding. Again, by the
nature of this algorithm, once the 2nd MSB of the lower bound is ‘1’ and that of the upper bound
is ‘0’, the bounds can keep getting closer without their MSBs ever matching. If this happens, the
encoder generates nothing on the output, since no bit is ever shifted out. This problem is called
underflow. Fortunately, there is a simple way to handle it: shift those 2nd bits out as well (in
practice they are remembered and emitted right after the next MSB is known). As those bits are
shifted out, we must shift bits into the rightmost positions of the lower and upper bounds, just as
we did in the previous MSB shifting.
Example of the Underflow Condition:
Lower = 0110 0000 1110 1011
Upper = 1001 0101 0110 1110

The 2nd MSBs (a ‘1’ in the lower bound, a ‘0’ in the upper bound) are the cause of the underflow;
in this example the condition holds twice in a row, so two bits are removed from each bound.
Remove them, and our bounds become something like this:

The Fresh Bits:


Lower = 0000 0011 1010 1100
Upper = 1101 0101 1011 1011
The newly shifted-in bits are the ones on the right, so our bounds still consist of exactly 16 bits in
the case of a 16-bit integer implementation. Complete pseudo code of encoding and decoding for
the integer implementation can be found here.
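Putting both rules together, here is a sketch of the renormalization loop in the common textbook form (the E1/E2/E3 conditions, as in Bodden’s treatment): matching MSBs are emitted immediately, while underflow bits are counted as “pending” and emitted, inverted, right after the next MSB. The function and constant names are mine, but applied to the two examples above it reproduces the article’s numbers exactly, with the underflow example absorbing two bits (the condition holds twice in a row):

```python
BITS    = 16
MASK    = (1 << BITS) - 1
HALF    = 1 << (BITS - 1)    # 8000h: MSB boundary
QUARTER = 1 << (BITS - 2)    # 4000h: 2nd-MSB boundary

def renormalize(low, high, pending, out):
    """Emit settled MSBs and absorb underflow bits until no rule applies."""
    while True:
        if high < HALF:                     # both MSBs 0: emit 0 ...
            out.append(0)
            out.extend([1] * pending)       # ... then the deferred bits, inverted
            pending = 0
        elif low >= HALF:                   # both MSBs 1: emit 1
            out.append(1)
            out.extend([0] * pending)
            pending = 0
            low -= HALF
            high -= HALF
        elif low >= QUARTER and high < HALF + QUARTER:
            pending += 1                    # underflow: defer the bit
            low -= QUARTER
            high -= QUARTER
        else:
            break
        low  = (low << 1) & MASK            # shift a 0 in on the right
        high = ((high << 1) & MASK) | 1     # shift a 1 in on the right
    return low, high, pending

out = []
low, high, pending = renormalize(0b0110000011101011, 0b1001010101101110, 0, out)
print(f"{low:016b}")    # 0000001110101100  (the "fresh" lower bound above)
print(f"{high:016b}")   # 1101010110111011  (the "fresh" upper bound above)
print(pending)          # 2
```

Fed the converged-MSB example instead, the same loop emits [0, 1, 1] with nothing pending, matching the earlier output “011”.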

Conclusions (Issues for AC Implementation)

Here are the most important issues I think you should know before implementing Arithmetic
Coding. (Hey, please don’t forget them! I spent 3 months summarizing these issues….:-D Just
kidding.)

1. In the “real” implementation of AC using floating-point numbers, the encoder/decoder are
assumed to handle numbers with infinite precision. Practically, no computer can handle such
numbers: the encoder and decoder use only the interval 0 – 1 to encode the symbols, and the
bad news is that as the input stream gets larger, the required precision grows too, and no rounding
can be performed.
2. Because of the 1st issue, we need to rescale the symbol probabilities using integer values
instead of floating-point values, for instance using the interval 0000h – FFFFh. To make the output
bits unbounded within that interval, we must perform some “shifting” on the MSBs of the bounds.
3. Shifting rules handle these issues:
– Once the MSBs of Low and High match, they will never change, so without shifting we could
not output any bits after a few iterations. To handle this, we shift those bits out. As we shift them
out, we append 0 bits at the rightmost end of the lower bound and 1 bits at the rightmost end of
the upper bound.
– Underflow occurs when the 2nd MSB of the lower bound is ‘1’ and that of the upper bound is
‘0’. We have to shift those 2nd bits out (deferring them until the next MSB is known), appending
a 0 to the lower bound and a 1 to the upper bound at the rightmost end. Underflow can be detected
by looking at the first two MSBs of Lower and Upper.
4. The “scaling” referred to here is exactly what the 2nd point describes.
That’s all..
(What do you say, it’s crazy, right? Please correct me if I was wrong about some of these issues.)

Thx…:-)
