Arithmetic Coding
A seminar on Data Compression took place in March 2002. For it, Malte Clasen, Joachim Kneis and I gave a lecture and wrote an article about Arithmetic Coding. You can download the papers below.
The article:
The presentation (German):
- Powerpoint, 1.02 MB
C++ source code developed and used in the article:
- With English comments (ZIP, 5KB)
- For Unix (German comments, including makefile) (ZIP, 3KB)
- For Win32 (German comments, including Visual C++ 6 workspace) (ZIP, 5KB)
Executable:
- Coder-Standalone for Win32 systems (EXE, 68KB)
- Arithmetic Coding visualization – a tool which shows how coding works (Win32) (ZIP, 330KB)
The Visualization Tool – this is what it looks like
Main view:
C# implementation by Sina Momken
In August 2010, Sina Momken developed a C# implementation of AC, which is based on our original C++ implementation. The encoder/decoder was tested on .NET for Windows and Mono for MacOS X. The user interface was tested on .NET for Windows only.
Download the VisualStudio Project here (also works with MonoDevelop)
Download binaries (requires .NET runtime)
Further downloads and links:
Test corpus used during the presentation:
- Corpus (ZIP, 514 KB)
Arithmetic Coding Links:
- DataCompression.info (formerly known as: Data Compression Library): Famous and extensive link list about data compression.
- Data-Compression.info (not the same site as the one above!)
- Calgary Corpus (Benchmark): A famous corpus of standard files used to benchmark the newest compression algorithms.
- Calgary Corpus Compression Challenge: Compress the corpus down to size X and win 33.33 + (666,666.00 – X) / 222 $!
- A Mathematical Theory of Communication (Shannon): This is surely one of the most famous articles about lossless data compression. It contains a proof that you cannot compress a file to a size smaller than its entropy. We rely on this proof in our article.
- Range encoding: an algorithm for removing redundancy from a digitized message (Martin): Shows the mathematical background of the famous (arithmetic) Range Coder.
- Range Coder Implementation (Campos): A fast implementation of a Range Coder.
- The Art of lossless data compression: Here you can find current benchmarks of the newest compression algorithms.
- Zeosync press release presentation: A very funny presentation by the Zeosync Corporation, which seems to believe it can compress the whole world losslessly into one bit.







Dear Mr.Bodden,
Your paper was well researched and strikingly simple. I’d say it’s the best material available for understanding Arithmetic Coding.
Anish (India)
Thanks a lot for the kudos. We are glad that people like it, especially given that it was our first research paper ever written (and certainly, at the same time, the most successful one so far).
Eric
Dear Mr. Bodden,
Your source code is very good and I have learned a lot from it. But I still have some trouble and would like to ask for help. Actually, I am doing some image compression work, and arithmetic coding is the last step of my compression. The images are in raw format (without any header information) with 16 bits per pixel, but your code only handles files with one “char” per symbol. I have tried some modifications, including changing the “char” type to “short” and “257” to “65537”. The resulting code compiles, but the result is not OK: the compressed file cannot be recovered properly. Would you please send me a copy of a 16-bit-per-symbol version of the code, or a list of the places where modifications are needed? Thank you in advance.
lixiang(China)
If you are trying to read a third party file format, there’s no way around getting the proper specs for this format. Arithmetic coding can be implemented in many incompatible ways.
If this is your own image format, the simplest solution would be to decompose your pixels into bytes, for example by separating the most significant 8 bits from the least significant ones and ordering them either big endian or little endian. This should give the same compression ratio with higher-order models (prediction from at least two predecessors instead of one), and you don’t have to touch the compression routines at all. Note that the higher order should not result in memory overhead, since the additional memory would otherwise be consumed by the larger number of allowed symbols. It could affect performance, but I would not optimize this without measurement.
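The byte decomposition described above could be sketched as follows. This is only an illustration under assumptions: the function names `pixelsToBytes` and `bytesToPixels` are made up here, the pixels are assumed to be already loaded as unsigned 16-bit values, and big-endian order (high byte first) is chosen arbitrarily.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split each 16-bit pixel into two bytes, most significant byte first,
// so that a byte-oriented coder can consume the result unchanged.
std::vector<std::uint8_t> pixelsToBytes(const std::vector<std::uint16_t>& pixels) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(pixels.size() * 2);
    for (std::uint16_t p : pixels) {
        bytes.push_back(static_cast<std::uint8_t>(p >> 8));    // high 8 bits
        bytes.push_back(static_cast<std::uint8_t>(p & 0xFF));  // low 8 bits
    }
    return bytes;
}

// Inverse step, applied after decoding: reassemble the 16-bit pixels.
std::vector<std::uint16_t> bytesToPixels(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0);  // two bytes per pixel
    std::vector<std::uint16_t> pixels;
    pixels.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        pixels.push_back(static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]));
    }
    return pixels;
}
```

The output of `pixelsToBytes` can be written to a file and fed to the unmodified byte-based coder; after decoding, `bytesToPixels` restores the image data.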
Hello,
First, I found your research report to be convenient enough both in terms of time as well as in terms of learning AC in a fast and simplistic way. Second, your C++ implementation is quite succinct and up to the point. I found it very useful.
However, the following problem haunts me, and I would appreciate anybody’s input, especially your technical judgment. I have a file of 7000 symbols, not necessarily letters of any alphabet, but just strings of ones and zeros, and all 7000 symbols have the same length. In this file, I have a frequency count for each symbol, and the counts sum up to 100000. Now, given this file, how can I modify and use your code to create an Arithmetic Coding dictionary out of this file?
To be more specific, here’s what I’ve done. Initially I had 10 files that contained binary data. I merged these files into one large file and performed a frequency analysis on symbols of a predefined length, say on bit strings of length 20. In the end, I got 7000 such symbols and their respective frequencies. Now, how could I possibly employ Arithmetic Coding on the symbol frequencies, so that I derive a dictionary? That is, when new files containing binary data come in, rather than using AC on the new file, I could as well employ the already constructed dictionary to attempt to compress the symbols in the new file. What do you think of this?
Thanks,
Andrew
Hi Andrew. Yes, it’s right that you could use a dictionary approach in your setting. Once you know what your symbols are and how frequent they will occur, that’s a good way to go. However, the other option you have is to use AC nevertheless. Simply modify the model in such a way that it partitions its range into the 7000 different classes that you determined. Then, when reading a file, make sure to read enough bits at a time to determine which of the 7000 classes to expand into.
That’s a great input! But, I am quite confused on how to build the initial dictionary using the AC. Meaning, given the frequencies that I have, should I build an interval for each of the 7000 symbols? Then, whenever I get a new file, and I partition the file into the required symbols which I will use to look up in the dictionary, how shall I compress? That is, if the file starts with symbol X, and symbol X contains the interval [0.2, 0.233) in the dictionary, then how do I compress that symbol?
Thanks again Eric! More than helpful!
Andrew
Yes, all you do is partition your [0,1) interval into 7000 different sub-intervals, with the size of each sub-interval proportional to the symbol’s relative frequency. That’s basically your “model”. Then, whenever you read a symbol, let’s say X, you determine the correct sub-interval for X and “home in” on that interval using E1/E2/E3 scaling. During scaling you will automatically be shifting bits to the output stream. Because, as you mention, you already know all the right frequencies beforehand, you will achieve compression automatically using this model, perfect compression actually.
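A fixed partition like the one described above is often stored as a table of cumulative counts. The following is a minimal sketch under assumptions: `StaticModel` and its member names are hypothetical and are not the interface of the Model class in the article's source code, and symbols are assumed to be identified by an index from 0 to n-1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical static model: symbol i owns the count range
// [lowCount(i), highCount(i)) out of total(), which corresponds to the
// sub-interval [lowCount/total, highCount/total) of [0,1).
struct StaticModel {
    std::vector<std::uint32_t> cumulative;  // cumulative[i] = freq[0] + ... + freq[i-1]

    explicit StaticModel(const std::vector<std::uint32_t>& freq)
        : cumulative(freq.size() + 1, 0) {
        for (std::size_t i = 0; i < freq.size(); ++i) {
            assert(freq[i] > 0);  // a symbol with count 0 could not be encoded
            cumulative[i + 1] = cumulative[i] + freq[i];
        }
    }

    std::uint32_t total() const { return cumulative.back(); }
    std::uint32_t lowCount(std::size_t symbol) const { return cumulative[symbol]; }
    std::uint32_t highCount(std::size_t symbol) const { return cumulative[symbol + 1]; }
};
```

With 7000 symbols whose counts sum to 100000, `total()` would return 100000 and each symbol's sub-interval width would be its count divided by that total.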
Dear Mr. Bodden,
I found your research report very important. I work in the domain of compression with the AC method, and I would like the French version if possible.
Thank you Mr. Eric for your help.
B.Benabbs
Hi and thanks for your interest. I am afraid it was already a lot of work translating the original German manuscript into English, and doing the same for French would be even harder for me, as my French is far from fluent. But maybe somebody else would like to volunteer?
HI again Eric, and thanks for the previous response. Just clarify a point… When you write in the previous post, “Then whenever reading a symbol, let’s say X, then you determine the correct sub-interval for X and then “home in” on that interval using E1/E2/E3 scaling. “, do you mean that when I read X from an input file and I find it in the dictionary, then I get X’s interval from the dictionary and pass it as an argument to E1/E2/E3. Is this what you meant, or should I create a new interval for X?
Also, how could I possibly modify your algorithm such that the output stream is in bits, i.e. a text file containing just the bits of the encoded X.
Thanks a lot!
Andrew
@Andrew: You read X from an input file, find it in the dictionary, get the interval from the dictionary, *scale the current interval with it* and pass it to E1/E2/E3. Since your probabilities (frequencies) do not change, there’s no need to create new symbol intervals during the process. However, AC is still based on the successive downsizing of a single output interval, so don’t pass the symbols’ probability intervals directly to the E1/E2/E3-scaler.
The output stream actually contains bits. If you want to convert it to a text file consisting of 0 and 1 characters, just replace the part where the bits are shifted. Alternatively you could post-process the resulting file.
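A minimal sketch of what “scale the current interval with it” can look like in integer arithmetic, with the output collected as ‘0’ and ‘1’ characters. This is not the code from the article: the 31-bit state, the names `BitStringEncoder`, `encode`, `emit` and `finish`, and the 64-bit intermediate products are assumptions made for illustration. The caller passes the symbol's cumulative counts from the fixed dictionary; the encoder keeps a single current interval and narrows it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr std::uint32_t kFirstQuarter = 0x20000000u;
constexpr std::uint32_t kHalf         = 0x40000000u;
constexpr std::uint32_t kThirdQuarter = 0x60000000u;
constexpr std::uint32_t kMax          = 0x7FFFFFFFu;  // 31-bit state

struct BitStringEncoder {
    std::uint32_t low = 0;       // current interval, inclusive bounds
    std::uint32_t high = kMax;
    std::uint32_t pending = 0;   // E3 scalings whose bit is not yet known
    std::string out;             // output as '0'/'1' characters

    void emit(char bit) {
        out.push_back(bit);
        // Every pending E3 scaling contributes one bit opposite to this one.
        for (; pending > 0; --pending) out.push_back(bit == '0' ? '1' : '0');
    }

    // Narrow the current interval to the symbol's share, then rescale.
    void encode(std::uint32_t lowCount, std::uint32_t highCount, std::uint32_t total) {
        assert(lowCount < highCount && highCount <= total && total <= kFirstQuarter);
        const std::uint64_t range = std::uint64_t(high) - low + 1;
        high = low + std::uint32_t(range * highCount / total) - 1;
        low  = low + std::uint32_t(range * lowCount / total);
        for (;;) {
            if (high < kHalf) {                     // E1: interval in lower half
                emit('0');
            } else if (low >= kHalf) {              // E2: interval in upper half
                emit('1');
                low -= kHalf;
                high -= kHalf;
            } else if (low >= kFirstQuarter && high < kThirdQuarter) {  // E3
                ++pending;
                low -= kFirstQuarter;
                high -= kFirstQuarter;
            } else {
                break;
            }
            low = 2 * low;        // shift a 0 into low
            high = 2 * high + 1;  // shift a 1 into high
        }
    }

    // Flush enough bits to identify a value inside the final interval.
    void finish() {
        ++pending;
        emit(low < kFirstQuarter ? '0' : '1');
    }
};
```

Encoding a file then amounts to one `encode` call per symbol followed by `finish()`; `out` holds the bit string, which can be written to a text file as is. Note that the symbol's counts are only used to narrow `low` and `high`; they are never handed to the scaling loop directly.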
Thank you, Malte.
What do you exactly mean by “*scale the current interval with it*”? Are you referring to the initial interval [0,1], which is the current, and so forth?
Thanks very much, Mr.Bodden. Your paper inspires me
Hi Eric & Malte,
I followed your tips and correctly implemented my model using part of your code, which proved to be quite at hand.
I have one final question: say that for an input file F I calculated the overall final interval. How can I measure the compression ratio? That is, do I encode (in bits) the low and high values of the interval and divide this by the initial size of file F (in bits)?
Thanks
Hi Andrew. Yes that sounds reasonable to me. It’s what I would have done too to compute compression ratio.
Say my file contains the string aaabcc to encode. Assume that for whatever model the very final interval results in [0.25, 0.295379112). What would the compression ratio be in this case? The initial file size is 6 bytes or 48 bits. Say that the value I choose from the final interval is the low bound 0.25, which in binary translates as 01. Then, is my compression ratio equal to 2/48 bits per character? What do you think?
Andrew, I think the problem here is that in general you won’t be able to encode the resulting interval by just using the lower bound because this would not give the decoder enough information to know when the encoded file has been completely decoded. Here is what people usually do to encode the “end of file”: add to your model an artificial symbol EOF, and assign to this symbol a very low probability in your model. Then, when the encoder reaches the end of file, it encodes the EOF symbol, just as usual. When this symbol is encoded, then you can use *any* value in the final interval to encode the input – of course it makes sense to use the shortest number you can find in this interval. The decoder then simply decodes the encoded input until it happens to decode the EOF symbol. When the decoder decodes this symbol it knows that it is done.
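To illustrate “the shortest number you can find in this interval”, here is a small sketch that searches for the shortest binary fraction inside a final interval [low, high). It uses `double` and is therefore only suitable for toy examples; the function name `shortestBinaryFraction` is made up for this illustration, and real coders produce these bits incrementally during scaling instead.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Returns the shortest bit string b1 b2 ... bk such that the binary fraction
// 0.b1b2...bk lies in [low, high). Assumes 0 <= low < high <= 1.
std::string shortestBinaryFraction(double low, double high) {
    assert(0.0 <= low && low < high && high <= 1.0);
    for (int k = 1; k <= 52; ++k) {
        const double scale = std::ldexp(1.0, k);          // 2^k
        const double numerator = std::ceil(low * scale);  // smallest n with n / 2^k >= low
        if (numerator / scale < high) {
            std::string bits(k, '0');
            unsigned long long n = static_cast<unsigned long long>(numerator);
            for (int i = k - 1; i >= 0; --i, n >>= 1) {
                if (n & 1ULL) bits[i] = '1';
            }
            return bits;
        }
    }
    return std::string();  // interval too narrow for double precision
}
```

For the interval [0.25, 0.295379112) from the question above, the function returns the two bits `01`. The encoded length in bits divided by the input length in bits then gives a compression ratio, provided the final interval already accounts for the EOF symbol.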
Eric, what if my integer intervals are *really* large, say the low and high of the first symbol happens to be [19’000’000, 54’000’0000), what data types should I use? I attempt to use unsigned long, but the program enters an infinite loop when attempting E-scaling… Any suggestion?
Andrew you probably should be using some floating-point values at this point but that may require more changes to the code than just changing the data type.
Dear Mr. Bodden!
I have read your paper and it is very amazing. But until now, i’m still confused about real implementation of AC using integer value (ex. 16-bits integer).
So, for this time I have 2 questions:
1. Let’s assume we have a file with 800.000 symbols/characters in it. After it is implemented using integer value say:
LOW = 0000
UPPER = 9999
we will get final result as follows:
.4521245124154545465656465656421989894549865646565646…. (and many more digits, it might be 500, 7000, or more digits). This result is derived from shifting out the MSBs of LOW and UPPER. Is it possible to convert this value to binary?
If I am wrong up to this point, how does the real implementation of AC actually work? Is it something like this:
LOW = $0000 (0000 0000 0000 0000)
HIGH = $FFFF (1111 1111 1111 1111)
So, for all encoding process we only work for that binary.
Example:
(Initial state)
LOWER = 0000 0000 0000 0000
UPPER = 1111 1111 1111 1111
(Symbol 1)
LOWER = 0100 0001 0001 1100
UPPER = 0111 0111 0111 0111
(LOWER’s MSB = UPPER’s MSB, so shift out ‘0’ and append ‘0’ at the right end of LOWER and ‘1’ at the right end of UPPER)
Current Output = .0
(Symbol 2)
LOWER = 1000 0000 1110 1110
UPPER = 1001 1110 0010 1000
(Shift out ‘1’ and append ‘0’ at the right end of LOWER and ‘1’ at the right end of UPPER)
Current Output = .01
(Symbol 3)
LOWER = 0111 0001 1110 1110
UPPER = 0111 1100 1111 1111
(Shift out ‘0’ and append ‘0’ at the right end of LOWER and ‘1’ at the right end of UPPER)
Current Output = .010
and so on, so the final output is ‘Current Output’ as one large binary number.
If you don’t mind, please confirm to this email:
philipstel@gmail.com
Thank you in advance!
Hi “Philips T”. Yes, I think the binary encoding that you wrote is correct. Generally, the binary output number is just generated by shifting bits during scaling. Hope that helps,
Eric
Eric, what is the input format of your Arithmetic Coding example code?
Is a .txt file with content “abcdefagth” ok?
Thanks.
Hi Jiang.
Do you refer to our C++ example implementation? With this you can use any file as input.
Eric
Thanks, Eric, for the response!
How about the probabilities of each symbol? Must I rescale them for my integer implementation?
Suppose a file contains the string “aaabbbcccadeeee” to be encoded, with the following probabilities:
a = 4/15
b = 3/15
c = 3/15
d = 1/15
e = 4/15
and the following ranges:
a = [0, 4/15)
b = [4/15, 7/15)
c = [7/15, 10/15)
d = [10/15, 11/15)
e = [11/15, 1)
So, I just want to use 16 bits calculation in my encoder calculation:
LOW = $0000;
HIGH = $FFFF;
How can I rescale my probabilities so that they require just 16 bits?
Does this make sense: 7/15 x $FFFF = (integer value)?
Hi again. Have you read Section 4 of our paper? There we actually describe how to use integers for encoding. In your case, you would just round the value to the next whole number.
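As an illustration of mapping the fractions above onto a 16-bit range, here is a sketch that narrows an integer interval using cumulative counts. It uses truncating integer division, which is one common convention and an assumption here; Section 4 of the paper remains the reference for the exact rounding. The names `Interval16` and `narrow` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Interval16 {
    std::uint16_t low;
    std::uint16_t high;  // inclusive upper bound
};

// Narrow cur to the share [lowCount/total, highCount/total) of its range.
// The products are computed in 32 bits, so total must stay below 65536.
Interval16 narrow(Interval16 cur, std::uint32_t lowCount,
                  std::uint32_t highCount, std::uint32_t total) {
    assert(lowCount < highCount && highCount <= total && total <= 0xFFFFu);
    const std::uint32_t range = std::uint32_t(cur.high) - cur.low + 1;  // up to 65536
    Interval16 next;
    next.low  = static_cast<std::uint16_t>(cur.low + range * lowCount / total);
    next.high = static_cast<std::uint16_t>(cur.low + range * highCount / total - 1);
    return next;
}
```

Starting from LOW = $0000 and HIGH = $FFFF, the symbol b with counts [4, 7) out of 15 is mapped to the integer interval 17476 to 30582, and e with counts [11, 15) is mapped to 48059 to 65535. The counts themselves stay small integers; only the interval bounds live in 16 bits.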
Eric
Hi Eric!
Thank you Eric,
I will read it again…
Hi,
The paper, presentation and C++ source code have been the most useful resources for my understanding of Arithmetic Coding. Thank you!
I’m trying to expand the application by implementing a static model based on your code, in addition to the adaptive order-0 model already provided. Can you please give me some pointers in this regard?
Regards,
Shup
Hi Shup.
A static model should be even simpler. Just implement an instance of the Model class that always returns constant ranges for given symbols.
Dear Mr Bodden!
Nice Research!
I’m sorry for this stupid question. How do I actually implement AC encoding? Do I have to convert every decimal number to binary first so that the shifting can be done?
I found a simple algorithm elsewhere; here it is:
* Range = (high - low) + 1
* High = low + ((range * high_values[symbol]) / scale) - 1
* Low = low + (range * high_values[symbol - 1]) / scale
* Loop (exits when no more bits can be output or shifted):
  * Msb of high = msb of low?
  * Yes:
    o Output msb of low
    o Loop while underflow_bits > 0 (output the underflow bits pending for output):
      + Output Not(msb of low)
    o Go to shift
  * No:
    o Second msb of low = 1 and second msb of high = 0? (check for underflow)
    o Yes:
      + Underflow_bits += 1 (here we shift to avoid underflow)
      + Low = low & 3FFFh
      + High = high | 4000h
      + Go to shift
    o No:
      + The routine for encoding a symbol ends here.
But I don’t know how to use this algorithm. What do Low and High mean?
Suppose I had Low = 45623 and High = 34561. Do I have to convert those values to binary first, which would give me Low = 1011001000110111 and High = 1000011100000001?
Thank you, Bodden, I would very much appreciate your help.
Dear Mr Bodden
Hi Bodden, I just want to thank you for your research on this topic. I’ve read your paper and it’s really a huge help for me in understanding Arithmetic Coding. Your paper is very good, well written and easy to understand, since it also comes with a good example. The understanding of E1, E2, and E3 scaling is the most valuable thing that I got from your paper. In summary: E1 and E2 handle the case where Low and High get close together by shifting out the MSBs, and E3 handles underflow.
Thank again Bodden, I really appreciate it.
~Philips Tel
@Matthew…
Hmm, I’m sorry Bodden, let me answer Matthew’s question (please correct me if I am wrong).
Yeah, of course you have to convert every decimal number to binary. Please read the first section of Bodden’s paper again. I think he wrote a nice example that covers your question.
Thank You..
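In a C++ implementation this conversion happens implicitly: integer variables are already held in binary, so the MSB test and the shifts can be applied to them directly, and a binary string is only needed for display. The sketch below assumes a 16-bit state; the helper names `msb16`, `shiftIn` and `toBinary16` are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Most significant bit of a 16-bit value (0 or 1).
unsigned msb16(std::uint16_t v) { return (v >> 15) & 1u; }

// Shift left by one and append a bit on the right, as done after E1/E2
// scaling (0 is appended to Low, 1 is appended to High).
std::uint16_t shiftIn(std::uint16_t v, unsigned bit) {
    assert(bit <= 1u);
    return static_cast<std::uint16_t>((v << 1) | bit);
}

// For display only: the 16 binary digits of v, most significant first.
std::string toBinary16(std::uint16_t v) {
    std::string s(16, '0');
    for (int i = 15; i >= 0; --i, v >>= 1) {
        if (v & 1u) s[i] = '1';
    }
    return s;
}
```

For example, `toBinary16(45623)` yields `1011001000110111`, but the encoder itself never needs that string: it only compares `msb16(low)` with `msb16(high)` and calls `shiftIn`.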
Dear Mr. Bodden,
My compliments on publishing this readable paper and providing a useful implementation.
I ran into a snag with the implementation in that it always reads 31 bits in DecodeStart(). That may be more bits than are part of the actual compressed bit-stream, thus either exceeding the input or consuming bits that do not belong to the compressed portion of the input.
Is there a way to bootstrap the decoder such that it only reads 1 bit at a time, or alternatively to determine after decoding how many bits to “put back” into the input stream?
Thank you!
Hello Bengt.
I don’t quite understand. Are you sending both encoded and then unencoded data over the same stream? That would definitely not be recommended. In particular, I don’t think there’s any way to “put back” data into a stream, and I also don’t know any means to read a stream bit by bit (but then I am also not a C++ expert).
My recommendation would be to use the stream *just* for reading a single encoded file and then send other data through another stream.
Eric
Hi Eric,
Thank you for the fast reply. A bit more background:
First of all my “stream” is really a buffer, so putting back data is possible, in fact trivial. The data contain AC compressed data interleaved with other data and control information, so I need to be careful not to advance the read pointer beyond the first bit of the following section.
I looked at the code some more and I may have figured out how to address the issue: In ArithmeticCoderC::DecodeStart() I will fill mBuffer one bit at a time. Initially, the model’s Process() method will call DecodeTarget() in a loop until the decoded symbol no longer changes, then the last bit is “put back”. This way, only the bits for the first symbol are read, so there will be no over-fetching. After all bits in mBuffer are valid, everything proceeds as before. Of course, mBuffer needs to be filled from the MSBs down, but that should be straight-forward. Hopefully, I haven’t overlooked anything.
Thanks again for publishing the code and the continued support!
Bengt.
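When the input is an in-memory buffer as described above, reading single bits and “putting back” bits can be done by tracking a bit position. The following is a sketch under assumptions: `BitReader` is a hypothetical helper, not part of the article's code, and it assumes the most significant bit of each byte comes first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads bits from a byte buffer, MSB of each byte first, and allows
// stepping back. The buffer must outlive the reader.
class BitReader {
public:
    explicit BitReader(const std::vector<std::uint8_t>& buffer) : buf_(buffer) {}

    bool atEnd() const { return pos_ >= buf_.size() * 8; }

    // Returns the next bit; past the end of the buffer it returns 0.
    unsigned readBit() {
        unsigned bit = 0;
        if (!atEnd()) bit = (buf_[pos_ / 8] >> (7 - pos_ % 8)) & 1u;
        ++pos_;
        return bit;
    }

    // "Put back" the last count bits by moving the read position.
    void unread(std::size_t count) {
        assert(count <= pos_);
        pos_ -= count;
    }

    std::size_t position() const { return pos_; }  // in bits

private:
    const std::vector<std::uint8_t>& buf_;
    std::size_t pos_ = 0;
};
```

How many bits to put back after decoding still depends on the coder's internal state, which is the part that turned out to be complicated in practice.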
Just an update: Changing the decoder to only read the necessary number of bits to decode a single symbol was possible but more complicated than expected. In particular, it blurred the currently clear boundary between Model and Decoder. I ended up keeping the existing AC code and instead changed the way my data are organized, separating all AC-encoded bits into a single section of the input to work around this problem.
Bengt
Thanks for the feedback, that’s useful to know!
Hi all,
I found another resource about this algorithm, and I think it’s good for beginners who want to learn AC.
http://philipstel.web.id/data-compression/arithmetic-coding-algorithm-and-implementation-issues/
Hope this helps.