Prof. Eric Bodden, Ph.D. » ISSTA 2008 recap

ISSTA 2008 recap

Eric | August 11, 2008

This is a small recap of ISSTA 2008. It’s a little belated but I thought better late than never.

Defects workshop

On Sunday I attended Defects’08, which was the first (!) International Workshop on Defects in Large Software Systems. The group of people was kind of interesting in the sense that I knew nobody in person when I got there. The primary reason may have been that there is a subtle difference between detecting software “defects” (which they focus on) and “API violations” (which I focus on): when looking for program points where API specifications are violated, you are given a specification. Therefore any usage of the API that does not conform to that specification is –by definition– an error, even if nothing bad actually happens.

The detection of “defects” in the sense of the above workshop is different. By “defect” these people mean some piece of code that actually triggers a fault in the system (although there was some disagreement about whether a program location that merely contains an error is already a defect, even if that error does not actually lead to a fault). Anyway, these are generic programming errors, independent of any specification. In Java, for instance, invoking a method on a definitely-null value makes no sense. Even when the value is in fact non-null, code where this is really hard to determine (and why) may still be considered “bad style”. Tools like FindBugs help programmers detect such errors, and indeed FindBugs seemed to be one of the most sophisticated tools mentioned at Defects. It also seems to be the only tool with industrial relevance.
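To make the kind of generic, specification-free defect concrete, here is a minimal hand-made example (the class and method names are mine, not from any workshop paper) of the null-dereference pattern that FindBugs-style tools flag:

```java
public class NullDerefDemo {
    // May return null for unknown ids -- a fact that is easy to overlook
    // at the call site.
    static String findUser(String id) {
        return "admin".equals(id) ? "Administrator" : null;
    }

    public static void main(String[] args) {
        String name = findUser("guest"); // null on this run
        try {
            // Defect: calling a method on a (here: actually) null value.
            // Static checkers report this dereference without running the code.
            System.out.println(name.length());
        } catch (NullPointerException e) {
            System.out.println("NullPointerException: the defect triggered a fault");
        }
    }
}
```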

One interesting point that came out of the discussion was that apparently virtually all such tools have a false-positive rate of around 30–40%. This may seem high, and so many researchers have spent years and years on improving metrics for determining likely defects in code. Interestingly, there seems to be a (natural!?) barrier around this 30–40% mark, and therefore most approaches were not able to improve much on earlier ones.

One very powerful technique, and one amongst the best, seems to be the mining of software repositories. There is now an entire conference dedicated to this topic. Code churn, for example, is a metric that one can determine using such an approach: code that is changed often is surprisingly likely to have been hard to understand in the first place and is therefore likely to contain defects.

Defects was the first workshop of its kind and I am interested to see whether it will continue to exist in its current form, given that most new approaches did not seem to advance the state of the art significantly. But that’s only my two cents.

ISSTA conference

Keynote by Jim Larus

ISSTA itself started on Tuesday with a keynote talk by Jim Larus from MSR. Jim gave a broad historical overview of the various efforts that Microsoft undertakes to improve code quality and to find programming errors as early as possible. Interestingly, one of the most effective tools is one that did not come out of research at all: Watson. Watson is the little tool that anonymously sends information to Microsoft when one of your programs crashes. Prioritizing bugs, i.e. answering the question “Which bug should I fix first?”, was really hard before Watson. With Watson, however, people could find out that 80% of the crashes are usually caused by only the top 10 or so bugs. You can easily imagine how big of a help this can be. Another interesting story was the one on buffer overflows. Jim confirmed that Microsoft indeed stopped all coding work for some months in order to re-educate programmers on the problem of buffer overflows and how to avoid them. During that time they also threw all sorts of tools at the existing code base and eliminated a large set of potential problems that way.

Much of ISSTA’s main program was about abstract interpretation; some of the talks were very interesting, however. Others actually showed some synergy with tracematches and our static analysis of those. I will list some in the remainder of this post.

Finding Errors in .NET with Feedback-Directed Random Testing

Carlos Pacheco (Massachusetts Institute of Technology), Shuvendu K. Lahiri (Microsoft Research Laboratory), Thomas Ball (Microsoft Research Laboratory)

The authors presented a tool called Randoop (a version of the tool also exists for Java) which generates unit tests that execute certain pieces of code randomly. During the execution, the tool recognizes a “faulty” run by looking for certain kinds of exceptions and/or violated assertions. Despite all the randomness, the tool showed astonishingly good results, revealing more errors per time than even experienced testers managed to reveal in a similar code base. The connection to tracematches? In our last paper we proposed a method to evaluate tracematches ahead-of-time, as far as possible. In many cases, however, it is not possible to evaluate tracematches completely. In those cases Randoop could come in handy. Just imagine: right now we tell you, for example, that “there is a collection created at line x which may be iterated over at line y after having been modified at line z, and this may lead to a ConcurrentModificationException”. If there were a way to direct Randoop to only generate random executions that are guaranteed to go through these locations, then we could instead have Randoop (1) check whether this is indeed the case and (2) if there is a violation, emit a test case that triggers it. What else could you ask for?
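For readers unfamiliar with the property, a witness test for the collection scenario above would look roughly like the following. This is a hand-written sketch of the kind of failing test one would want Randoop to emit, not Randoop's actual output format:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class CmeWitness {
    public static void main(String[] args) {
        List<String> c = new ArrayList<>();  // collection created ("line x")
        c.add("a");
        Iterator<String> it = c.iterator();  // iteration begins ("line y")
        c.add("b");                          // collection modified ("line z")
        try {
            // ArrayList's fail-fast iterator detects the intervening
            // modification and throws.
            it.next();
            System.out.println("no exception: property not violated on this run");
        } catch (ConcurrentModificationException e) {
            System.out.println("ConcurrentModificationException: violation witnessed");
        }
    }
}
```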

Finding Bugs in Java Native Interface Programs

Goh Kondoh, Tamiya Onodera (IBM Research)

This paper’s title was IMHO actually a misnomer; it should rather have been called “Applying typestate checking to the real world”. The concepts that were presented had really nothing to do with JNI. The only connection to JNI was that Kondoh and Onodera looked for usage errors in the JNI API. How did they do it? With typestate analysis. (yawn) And even worse, the typestate properties were really simple to decide because they were not parametrized: in terms of tracematches, all their tracematches had zero variables. I had expected more, judging from the title.
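To make “zero variables” concrete: such a property is just a finite automaton over program events, with no objects bound to it. A minimal sketch of one plausible JNI-flavoured property (the event names and the property itself are illustrative, not taken from the paper): after a call that may leave a pending exception, the code must check for it before making the next call.

```java
import java.util.List;

public class ZeroVarTypestate {
    enum State { SAFE, MUST_CHECK, ERROR }

    // Transition function of the property automaton. Note there are no
    // bound variables: the state depends only on the event sequence.
    static State step(State s, String event) {
        switch (event) {
            case "call":  return s == State.MUST_CHECK ? State.ERROR : State.MUST_CHECK;
            case "check": return s == State.ERROR ? State.ERROR : State.SAFE;
            default:      return s;
        }
    }

    static State run(List<String> trace) {
        State s = State.SAFE;
        for (String e : trace) s = step(s, e);
        return s;
    }

    public static void main(String[] args) {
        // Correct usage: every call is followed by an exception check.
        System.out.println(run(List.of("call", "check", "call", "check"))); // SAFE
        // Violation: two calls in a row without a check in between.
        System.out.println(run(List.of("call", "call")));                   // ERROR
    }
}
```

A tracematch, by contrast, would typically bind one or more objects (an iterator, a collection) and track a separate automaton state per binding, which is what makes the analysis harder.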

More interesting to me were the following.

Practical Pluggable Types for Java

Matthew M. Papi, Mahmood Ali, Telmo Luis Correa Jr., Jeff H. Perkins, Michael D. Ernst (Massachusetts Institute of Technology)

Michael Ernst presented an interesting tool-based approach to integrating (“plugging”) type systems into a real programming language like Java. Their system is based on Java 5 annotations, or (as I understood it) an extension thereof which allows annotations in more places than usual. Programmers can, for example, define an annotation @NonNull and then, in a relatively simple way, extend a type-checking framework to implement the static semantics of this annotation. They had some promising examples and I think it’s a great tool for bringing the PL community a little closer to the “real world”.
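The flavour of the idea can be sketched in plain Java 5 annotations. The declaration below is my own illustration, not the framework's actual API; the point is that the qualifier itself is ordinary programmer-written code, and a pluggable checker gives it compile-time meaning:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class PluggableTypesDemo {
    // A programmer-defined type qualifier. On its own it does nothing;
    // a plugged-in checker would enforce its static semantics.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.FIELD, ElementType.METHOD, ElementType.PARAMETER})
    @interface NonNull {}

    static int length(@NonNull String s) {
        // With a nullness checker plugged in, a caller passing a
        // possibly-null argument would be rejected at compile time.
        return s.length();
    }

    public static void main(String[] args) {
        System.out.println(length("hello")); // 5
    }
}
```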

Are Your Votes Really Counted? Testing the Security of Real-world Electronic Voting Systems

Davide Balzarotti, Greg Banks, Marco Cova, Viktoria Felmetsger, Richard Kemmerer, William Robertson, Fredrik Valeur, Giovanni Vigna (University of California at Santa Barbara)

This talk was kind of fun. I mean, by now really every sensible person should not only believe but know that automated voting machines are just inherently unsafe and insecure, and that there is a non-negligible chance that the POTUS elected in 2000 would have been a different one, had voting machines not been used. Of course the American election system is fundamentally flawed anyway (as is the German :-( ): after all, how can it be considered democratic if the person who gets the most direct votes (without any doubt Al Gore in 2000) is not elected president? But that was not the point of this paper. The point of this paper was really voting machines and why they should not be trusted.

In this paper the researchers presented results of the “California Top-To-Bottom Review” (TTBR) of July 2007. Of course they found several security holes in the machines, but it was surprising how easily they could be exploited. The most “funny” ones (if one may consider this at all funny) I can remember were:

  1. All machines had an “auto-run” feature enabled: just insert a memory stick with a (potentially malicious) program; you don’t even need to hit a key: it’s automatically executed!
  2. The machines had a flag that told the machine whether or not it is in test mode. This flag was global and globally accessible. Therefore every malicious piece of code could simply check the flag and perform “business as usual” when being in test mode but then steal votes when being in the actual voting mode.
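The second flaw takes only a few lines to exploit. A hypothetical sketch (the field and method names are mine; the talk described the mechanism, not this code): malicious tallying code simply branches on the globally readable flag.

```java
public class TestModeExploit {
    // The machines exposed a single, globally readable flag like this one.
    static boolean testMode = true;
    static int cast = 0; // votes cast so far

    // Malicious code: behave correctly whenever the machine is being
    // tested, silently drop every tenth vote during a real election.
    static int recordVote(int tally) {
        cast++;
        if (testMode) {
            return tally + 1;                       // "business as usual"
        }
        return (cast % 10 == 0) ? tally : tally + 1;
    }

    public static void main(String[] args) {
        int t = 0;
        for (int i = 0; i < 100; i++) t = recordVote(t);
        System.out.println("test mode tally: " + t); // 100 -- looks correct

        testMode = false;
        cast = 0;
        t = 0;
        for (int i = 0; i < 100; i++) t = recordVote(t);
        System.out.println("election tally: " + t);  // 90 -- votes went missing
    }
}
```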

Sometimes I am really not sure whether I should laugh or cry.