Eric Bodden


Arithmetic Coding

In March 2002 there was a seminar on data compression, for which Malte Clasen, Joachim Kneis and I gave a lecture and wrote an article about arithmetic coding. You can download the papers below.

The article:
  • German
  • English (now available as a citeable Sable Technical Report)
The presentation (German):
  • Powerpoint, 1.02 MB
C++ source code developed and used in the article:
  • With English comments (ZIP, 5KB)
  • For Unix (German comments, including makefile) (ZIP, 3KB)
  • For Win32 (German comments, including Visual C++ 6 workspace) (ZIP, 5KB)
Executable:
  • Standalone coder for Win32 systems (EXE, 68KB)
  • Arithmetic Coding visualization: a tool that shows how the coding works (Win32) (ZIP, 330KB)
The Visualization Tool: this is what it looks like

Main view:

Arithmetic Coder - Visualization Extension, Screenshot of main view

  • Enter here a text string or binary string that should be encoded. Enter here e.g.
  • Enter here the path of a text file that holds the data to encode.
  • Select here the model. The first option reads any strings and adapts the model dynamically. The second option reads strings formed of 0 and 1 and adapts the model. The third option uses a fixed, custom model.
  • Click here to edit the custom model. Do not enter 0 for a letter that occurs in the input!
  • Select what kinds of scaling are allowed.
  • Click to encode the next symbol. Right-click to select batch encoding.
  • Start from the beginning.
  • Dump the current model.
  • Here the output bit string is shown. * symbols represent characters that were read.
  • Here the scalings are displayed. Refer to the paper for details.
  • Shows the last encoded symbol [ANSI code] and the number of bits / scalings used by that symbol.
  • Select how many symbols should be encoded.
  • Shows the current compression ratio.
Further downloads and links:
Test corpus used during the presentation:
  • Corpus (ZIP, 514 KB)
Arithmetic Coding Links:
  • DataCompression.info (formerly known as: Data Compression Library): Famous and extensive link list about data compression.
  • Data-Compression.info (not the same site as the one above!)
  • Calgary Corpus (Benchmark): A famous corpus of standard files used to benchmark the newest compression algorithms.
  • Calgary Corpus Compression Challenge: Compress the corpus down to size X and win 33.33 + (666,666.00 – X) / 222 dollars!
  • A Mathematical Theory of Communication (Shannon): This is surely one of the most famous articles about lossless data compression. It contains a proof that you cannot compress a file to less than its entropy. We presuppose this proof in our article.
  • Range encoding: an algorithm for removing redundancy from a digitized message (Martin): Shows the mathematical background of the famous (arithmetic) Range Coder.
  • Range Coder Implementation (Campos): A fast implementation of a Range Coder.
  • The Art of lossless data compression: Here you can find current benchmarks of the newest compression algorithms.
  • Zeosync press release presentation: A very funny presentation by the Zeosync Corporation, which seems to believe it can losslessly compress the whole world into one bit.

20 Responses to “Arithmetic Coding”

  1. Anish says:
    April 26, 2009 at 6:12 am

    Dear Mr. Bodden,

    Your paper was well researched and strikingly simple. I’d say it’s the best material available for understanding Arithmetic Coding.

    Anish (India)

  2. eric says:
    April 26, 2009 at 10:30 am

    Thanks a lot for the kudos. We are glad that people like it, especially given that it was our first research paper ever written (and certainly, at the same time, the most successful one so far).

    Eric

  3. Lixiang says:
    July 30, 2009 at 7:36 am

    Dear Mr. Bodden,
    Your source code is very good and I have learned a lot from it. But I still have some trouble and would like to ask for help. Actually, I am doing some image compression work, and arithmetic coding is the last step of my compression. The images are in raw format (without any header information) with 16 bits per pixel, but your code only handles files with one “char” per symbol. I have tried some modifications, including changing the “char” type to “short” and “257” to “65537”. The resulting code compiles, but the result is not OK: the compressed file cannot be recovered properly. Would you please send me a copy of a 16-bit-per-symbol version of the code, or a list of where to make the proper modifications? Thank you in advance.
    Lixiang (China)

  4. Malte Clasen says:
    August 4, 2009 at 1:55 am

    If you are trying to read a third party file format, there’s no way around getting the proper specs for this format. Arithmetic coding can be implemented in many incompatible ways.

    If this is your own image format, the simplest solution would be to decompose your pixels into bytes, for example by separating the most significant 8 bits from the least significant ones and ordering them either big endian or little endian. This should give the same compression ratio in higher order models (prediction by at least two predecessors instead of one), and you don’t have to touch the compression routines at all. Note that the higher order should not result in memory overhead, since the additional memory would otherwise be consumed by the larger number of allowed symbols. It could affect performance, but I would not optimize this without measurement.
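A minimal sketch of the byte decomposition described above, assuming the pixels are already in memory as 16-bit values and that big-endian order (high byte first) is chosen. The function names `toBytesBigEndian` and `fromBytesBigEndian` are made up for illustration and are not part of the article's code. The resulting byte stream can be fed to an unmodified one-byte-per-symbol coder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turn 16-bit pixels into a byte stream,
// most significant byte first (big endian).
std::vector<std::uint8_t> toBytesBigEndian(const std::vector<std::uint16_t>& pixels) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(pixels.size() * 2);
    for (std::uint16_t p : pixels) {
        bytes.push_back(static_cast<std::uint8_t>(p >> 8));    // upper 8 bits
        bytes.push_back(static_cast<std::uint8_t>(p & 0xFF));  // lower 8 bits
    }
    return bytes;
}

// Hypothetical inverse: rebuild the 16-bit pixels after decoding.
std::vector<std::uint16_t> fromBytesBigEndian(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0);  // every pixel needs exactly two bytes
    std::vector<std::uint16_t> pixels;
    pixels.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        pixels.push_back(static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]));
    }
    return pixels;
}
```

Whether this matches the compression ratio of a true 16-bit alphabet depends on the model order, as noted above, so it is worth measuring on real images.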

  5. Andrew says:
    October 13, 2009 at 9:18 am

    Hello,

    First, I found your research report a convenient way to learn AC quickly and simply. Second, your C++ implementation is quite succinct and to the point. I found it very useful.

    However, the following problem haunts me, and I would appreciate anybody’s input, especially your technical judgment. I have a file of 7000 symbols, not necessarily letters of any alphabet, but just strings of ones and zeros, all of the same length. In this file, I have a frequency count for each symbol, and the counts sum up to 100000. Now, given this file, how can I modify and use your code to create an Arithmetic Coding dictionary out of this file?

    To be more specific, here’s what I’ve done. Initially I had 10 files that contained binary data. I merged these files into one large file and performed a frequency analysis on symbols of a predefined length, say on bit strings of length 20. In the end, I got 7000 such symbols and their respective frequencies. Now, how could I possibly employ Arithmetic Coding on the symbol frequencies, so that I derive a dictionary? That is, when new files containing binary data come in, rather than using AC on the new file, I could as well employ the already constructed dictionary to attempt to compress the symbols in the new file. What do you think of this?

    Thanks,

    Andrew

  6. eric says:
    October 13, 2009 at 9:27 am

    Hi Andrew. Yes, it’s right that you could use a dictionary approach in your setting. Once you know what your symbols are and how frequently they occur, that’s a good way to go. However, the other option you have is to use AC nevertheless. Simply modify the model in such a way that it partitions its range into the 7000 different classes that you determined. Then, when reading a file, make sure to read enough bits at a time to determine which of the 7000 classes to expand into.

  7. Andrew says:
    October 13, 2009 at 9:41 am

    That’s great input! But I am quite confused about how to build the initial dictionary using AC. Meaning, given the frequencies that I have, should I build an interval for each of the 7000 symbols? Then, whenever I get a new file, and I partition the file into the required symbols which I will use to look up in the dictionary, how shall I compress? That is, if the file starts with symbol X, and symbol X has the interval [0.2, 0.233) in the dictionary, then how do I compress that symbol?

    Thanks again Eric! More than helpful!

    Andrew

  8. eric says:
    October 13, 2009 at 9:51 am

    Yes, all you do is partition your [0,1) interval into 7000 different sub-intervals, with the size of each interval proportional to the symbol’s relative frequency. That’s basically your “model”. Then, whenever you read a symbol, let’s say X, you determine the correct sub-interval for X and “home in” on that interval using E1/E2/E3 scaling. During scaling you will automatically be shifting bits to the output stream. Because, as you mention, you already know all the right frequencies beforehand, you will achieve compression automatically using this model, perfect compression actually.
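As an illustration of such a fixed model, here is a minimal sketch that stores cumulative counts, so that symbol number s owns the sub-interval [low(s)/total, high(s)/total). The struct name `StaticModel` and the idea of addressing symbols by an index 0..n-1 are assumptions made for this example, not taken from the article's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed model built once from known frequency counts.
struct StaticModel {
    // cumulative[s] is the sum of the counts of all symbols with index < s,
    // so the vector has one more entry than there are symbols.
    std::vector<std::uint32_t> cumulative;

    explicit StaticModel(const std::vector<std::uint32_t>& counts)
        : cumulative(counts.size() + 1, 0) {
        for (std::size_t s = 0; s < counts.size(); ++s) {
            assert(counts[s] > 0);  // a zero count would give an empty sub-interval
            cumulative[s + 1] = cumulative[s] + counts[s];
        }
    }

    std::uint32_t total() const { return cumulative.back(); }

    // Integer bounds of the sub-interval of symbol s, in units of 1/total().
    std::uint32_t low(std::size_t s) const {
        assert(s + 1 < cumulative.size());
        return cumulative[s];
    }
    std::uint32_t high(std::size_t s) const {
        assert(s + 1 < cumulative.size());
        return cumulative[s + 1];
    }
};
```

With counts 20000, 3300 and 76700 (total 100000), the second symbol gets the integer bounds 20000 and 23300, which corresponds to an interval like [0.2, 0.233). The encoder never needs the fractions themselves, only the three integers low, high and total.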

  9. Benabbas says:
    October 13, 2009 at 12:43 pm

    Dear Mr. Bodden,

    I found your research report very important. I work in the domain of compression with the AC method, and I would like the French version if possible.

    Thank you Mr. Eric for your help.

    B.Benabbs

  10. eric says:
    October 13, 2009 at 12:48 pm

    Hi and thanks for your interest. I am afraid it was already a lot of work translating the original German manuscript into English, and doing the same for French would be even harder for me, as my French is far from fluent. But maybe somebody else would like to volunteer?

  11. Andrew says:
    October 15, 2009 at 8:12 pm

    Hi again Eric, and thanks for the previous response. Just to clarify a point: when you write in the previous post, “Then whenever reading a symbol, let’s say X, then you determine the correct sub-interval for X and then ‘home in’ on that interval using E1/E2/E3 scaling”, do you mean that when I read X from an input file and find it in the dictionary, I get X’s interval from the dictionary and pass it as an argument to E1/E2/E3? Is this what you meant, or should I create a new interval for X?

    Also, how could I modify your algorithm such that the output stream is in bits, i.e. a text file containing just the bits of the encoded X?

    Thanks a lot!

    Andrew

  12. Malte Clasen says:
    October 27, 2009 at 7:20 pm

    @Andrew: You read X from an input file, find it in the dictionary, get the interval from the dictionary, *scale the current interval with it* and pass it to E1/E2/E3. Since your probabilities (frequencies) do not change, there’s no need to create new symbol intervals during the process. However, AC is still based on the successive downsizing of a single output interval, so don’t pass the symbols’ probability intervals directly to the E1/E2/E3-scaler.

    The output stream actually contains bits. If you want to convert it to a text file consisting of 0 and 1 characters, just replace the part where the bits are shifted. Alternatively you could post-process the resulting file.
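To make the step “scale the current interval with it” concrete, here is a compact sketch of an integer encoder in the spirit of the E1/E2/E3 scheme. It is not the code from the article: the class name `BitStringEncoder`, the 31-bit code range and the two-bit termination are choices made for this example. It writes '0' and '1' characters into a string instead of packing bits into bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 31-bit code range (an assumption of this sketch).
constexpr std::uint32_t kTop          = 0x7FFFFFFFu;
constexpr std::uint32_t kQuarter      = 0x20000000u;
constexpr std::uint32_t kHalf         = 0x40000000u;
constexpr std::uint32_t kThreeQuarter = 0x60000000u;

class BitStringEncoder {
public:
    // Narrow the CURRENT interval [low_, high_] to the part that belongs to
    // the symbol whose model interval is [symLow, symHigh) out of total.
    void encode(std::uint32_t symLow, std::uint32_t symHigh, std::uint32_t total) {
        assert(symLow < symHigh && symHigh <= total && total <= kQuarter);
        const std::uint64_t range = static_cast<std::uint64_t>(high_) - low_ + 1;
        high_ = low_ + static_cast<std::uint32_t>(range * symHigh / total) - 1;
        low_  = low_ + static_cast<std::uint32_t>(range * symLow / total);
        for (;;) {
            if (high_ < kHalf) {                                    // E1: lower half
                emit('0');
            } else if (low_ >= kHalf) {                             // E2: upper half
                emit('1');
                low_ -= kHalf;
                high_ -= kHalf;
            } else if (low_ >= kQuarter && high_ < kThreeQuarter) { // E3: middle
                ++pending_;
                low_ -= kQuarter;
                high_ -= kQuarter;
            } else {
                break;
            }
            low_ = 2 * low_;        // stretch the interval back to full width
            high_ = 2 * high_ + 1;
        }
    }

    // Write two more bits (plus pending E3 bits) that select a value
    // inside the final interval, then return the whole bit string.
    std::string finish() {
        ++pending_;
        emit(low_ < kQuarter ? '0' : '1');
        return bits_;
    }

private:
    // Append one bit followed by the opposite bits owed by E3 scalings.
    void emit(char bit) {
        bits_ += bit;
        bits_.append(pending_, bit == '0' ? '1' : '0');
        pending_ = 0;
    }

    std::uint32_t low_ = 0;
    std::uint32_t high_ = kTop;
    std::size_t pending_ = 0;
    std::string bits_;
};
```

Note that `encode` receives the symbol's model interval but applies it to the encoder's own running interval, which is the distinction made above: the model intervals stay fixed, and only the single output interval shrinks.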

  13. Andrew says:
    October 31, 2009 at 5:56 am

    Thank you, Malte.

    What exactly do you mean by “*scale the current interval with it*”? Are you referring to the initial interval [0,1], which is the current one at first, and so forth?

  14. anhnht says:
    February 2, 2010 at 8:28 am

    Thanks very much, Mr. Bodden. Your paper inspires me.

  15. Andrew says:
    February 7, 2010 at 7:38 am

    Hi Eric & Malte,

    I followed your tips and correctly implemented my model using part of your code, which proved to be quite handy.

    I have one final question: say that for an input file F I calculated the overall final interval. How can I measure the compression ratio? That is, do I encode (in bits) the low and high values of the interval and divide this by the initial size of file F (in bits)?

    Thanks

  16. eric says:
    February 7, 2010 at 10:19 am

    Hi Andrew. Yes that sounds reasonable to me. It’s what I would have done too to compute compression ratio.

  17. Andrew says:
    February 7, 2010 at 1:27 pm

    Say my file contains the string aaabcc to encode. Assume that for whatever model the very final interval results in [0.25, 0.295379112). What would the compression ratio be in this case? The initial file size is 6 bytes or 48 bits. Say that the value I choose from the final interval is the low bound 0.25, which in binary translates as 01. Then, is my compression ratio equal to 2/48 bits per character? What do you think?

  18. eric says:
    February 7, 2010 at 2:34 pm

    Andrew, I think the problem here is that in general you won’t be able to encode the resulting interval by just using the lower bound because this would not give the decoder enough information to know when the encoded file has been completely decoded. Here is what people usually do to encode the “end of file”: add to your model an artificial symbol EOF, and assign to this symbol a very low probability in your model. Then, when the encoder reaches the end of file, it encodes the EOF symbol, just as usual. When this symbol is encoded, then you can use *any* value in the final interval to encode the input – of course it makes sense to use the shortest number you can find in this interval. The decoder then simply decodes the encoded input until it happens to decode the EOF symbol. When the decoder decodes this symbol it knows that it is done.
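A small sketch of “use the shortest number you can find in this interval”, assuming the final interval is available as two `double` values and that the decoder treats any missing trailing bits as zeros. The function name `shortestBinaryFraction` is made up for this example. It only illustrates the idea on a toy scale; a real coder works on integers and emits bits incrementally.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Returns the shortest bit string b1 b2 ... bk such that the binary
// fraction 0.b1b2...bk lies in [low, high). Returns an empty string if
// no fraction with at most 52 bits fits.
std::string shortestBinaryFraction(double low, double high) {
    assert(0.0 <= low && low < high && high <= 1.0);
    for (int k = 1; k <= 52; ++k) {
        const double scale = std::ldexp(1.0, k);   // 2^k
        const double m = std::ceil(low * scale);   // smallest m with m / 2^k >= low
        if (m / scale < high) {
            std::string bits(static_cast<std::string::size_type>(k), '0');
            unsigned long long value = static_cast<unsigned long long>(m);
            for (int i = k - 1; i >= 0; --i) {     // write m as k binary digits
                bits[static_cast<std::string::size_type>(i)] =
                    static_cast<char>('0' + (value & 1ULL));
                value >>= 1;
            }
            return bits;
        }
    }
    return std::string();
}
```

For the interval [0.25, 0.295379112) from the question above, this yields the two bits 01, i.e. the value 0.25. Those two bits are only decodable if the EOF symbol was part of what got encoded, so a fair compression ratio has to count the cost of EOF as well.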

  19. Andrew says:
    March 9, 2010 at 10:37 pm

    Eric, what if my integer intervals are *really* large, say the low and high of the first symbol happen to be [19′000′000, 54′000′0000), what data types should I use? I tried to use unsigned long, but the program enters an infinite loop when attempting E-scaling… Any suggestions?

  20. Eric says:
    March 10, 2010 at 7:48 am

    Andrew, you probably should be using some floating-point values at this point, but that may require more changes to the code than just changing the data type.
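Besides floating point, another route that integer coders commonly take is to keep low and high in 32 bits but compute the product of range and cumulative count in 64 bits. The sketch below only shows why that matters; whether overflow is really what caused the infinite loop is an assumption, since the failing code is not shown. Note that `unsigned long` is only 32 bits wide with Visual C++ on Win32. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// range * cum computed in 32 bits: silently wraps around modulo 2^32
// as soon as the product no longer fits.
std::uint32_t scaleNarrow32(std::uint32_t range, std::uint32_t cum, std::uint32_t total) {
    assert(total > 0 && cum <= total);
    return range * cum / total;
}

// Same formula with a 64-bit intermediate: the product of two 32-bit
// values always fits, and the quotient is at most range again.
std::uint32_t scaleWide64(std::uint32_t range, std::uint32_t cum, std::uint32_t total) {
    assert(total > 0 && cum <= total);
    const std::uint64_t product = static_cast<std::uint64_t>(range) * cum;
    return static_cast<std::uint32_t>(product / total);
}
```

With range = 2^31, cum = 3 and total = 4, the wide version returns three quarters of the range, while the narrow version returns a much smaller, wrong value. Wrong bounds of this kind can leave low and high in an order that the scaling loop never resolves.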
