SUMMARY
This paper describes Cream, an optimizer for Java bytecode using side-effect analysis to im-
prove the optimizations. Dead-code elimination and loop-invariant removal are implemented
and tested, as well as several variations of the side-effect analysis. The optimizer is tested on
real-world applications such as itself and JavaSoft's Java compiler. Results show that the
optimizations benefit substantially from the side-effect analysis. The best side-effect analysis
yields five to ten times as many optimizations as no analysis at all, and, in one case, makes a
speed increase of 25% possible. © 1997 John Wiley & Sons, Ltd.
1. INTRODUCTION
Java, being a relatively young language, does not yet have as sophisticated optimizing
compilers as other, more mature, languages like C and Fortran. When designing the opti-
mizers needed to improve Java’s speed towards a more acceptable level, one must carefully
consider the special traits of the language. In contrast with C programs, Java programs have
many small method invocations, many structure references, extensive exception handling
and dynamically loaded classes. A successful optimizer must deal correctly and effectively
with such programs.
Normally, optimization of Java code is done in two places: (i) in the compiler; and
(ii) in the runtime system. The compiler from Java source to bytecode can perform some
large-scale optimizations, but access to the source code is a necessity for doing this. The
runtime system, whether it does JIT compilation or interpreting, is limited by having little
time to spare for optimizing, and thus can only do some local optimizations.
A third possibility is to optimize on the Java bytecode, as shown in Figure 1. This has
several advantages:
1. Access to source code is not required. In fact, it will be possible to optimize programs
written in other languages if they can be compiled to bytecode.
2. Machine independency is maintained.
3. Inter-procedural analysis is possible for the user, so library variations are recognized.
The user will be able to optimize with respect to his own libraries and still get valid
results.
∗ Correspondence to: L. R. Clausen, Department of Computer Science, Aarhus University, Aarhus, Denmark.
(e-mail: [email protected])
In any object-oriented language, massive use is made of object fields. Since objects
generally live through several method invocations, it is difficult to know much about their
state locally. On the other hand, inter-procedural analysis is complicated by the existence
of virtual methods and interfaces, which are very common in Java. Good methods for
optimizing field accesses interprocedurally are essential when working on object-oriented
programs.
To address this problem, we have implemented a simple side-effect analysis. It does
inter-procedural analysis where possible, determining if and how side-effects are made.
Using this information, many more optimizations of field accesses and method invocation
become possible.
In this paper, we describe a working implementation of Cream, an optimizer for Java
bytecode, which performs dead-code elimination and loop-invariant removal using standard
optimization techniques from [1], as well as the side-effect analysis. It handles native code,
exceptions and monitors, and optimizes large class collections successfully.
In Section 2 we describe how we analyze Java bytecode. Section 3 describes the side-
effect analysis and its effectiveness in our implementation, and Section 4 shows how it is
used in Cream, as well as the resulting optimizations. Section 5 briefly describes related
work, and Section 6 gives conclusions and future work.
[Figure 1. Optimizing at the bytecode level: Java source (.java) is compiled to bytecode, the optimizer transforms the bytecode, and the result is executed by the runtime system.]
Concurrency: Pract. Exper., Vol. 9, 1031–1045 (1997) 1997 John Wiley & Sons, Ltd.
JAVA BYTECODE OPTIMIZER 1033
interprocedural, while the remaining phases are performed for one method at a time. A
brief description of each phase is given below.
∗ It is necessary for good optimizations to be able to optimize on field accesses. For this reason, we use the
option where null pointer and class cast exceptions do not cause branches. Programming to catch such exceptions
is not very nice, anyway.
3. SIDE-EFFECT ANALYSIS
The side-effect analysis is an inter-procedural analysis to determine when computations
may alter or inspect object fields or array elements. We have called it ‘purity analysis’ after
the usage in functional languages, where ‘purely’ functional languages are those without
the possibility of side-effects.
An instruction or collection of instructions is said to have a purity for all object fields
and the five basic types of array elements.∗ Purity can be inferred for methods, loops and
instructions. The inferred purity of an instruction for an object field or array element can
be one of four:
1. pure: An instruction is pure for a field or element if it neither alters nor inspects
that field or element. Pure instructions include allocations and array-length queries.
2. read-only: An instruction is read-only for a field or element if it reads that field
or element, but does not write to it.
3. write-only: An instruction is write-only for a field or element if it changes the
field or element.
4. read/write: These instructions both alter and inspect the object field or array
element. The only single instruction that can do this is a method invocation.
These four purities form a partial ordering, as shown in Figure 2. With this, finding the
purity of a block, loop or method for a field or element becomes a matter of finding the
least upper bound of the corresponding purities of the instructions involved.
∗ At bytecode level, determining the exact type of the elements of an array involves a great deal of analysis.
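The lattice and its least-upper-bound operation can be sketched as a small Java enum. This is our illustration, not Cream's actual representation; the names are ours:

```java
// A sketch of the four-point purity lattice and its least-upper-bound (LUB)
// operation: pure below read-only and write-only, read/write on top.
public class PurityLattice {
    enum Purity {
        PURE, READ_ONLY, WRITE_ONLY, READ_WRITE;

        boolean reads()  { return this == READ_ONLY  || this == READ_WRITE; }
        boolean writes() { return this == WRITE_ONLY || this == READ_WRITE; }

        // The LUB of two purities combines what either of them may do.
        Purity lub(Purity other) {
            boolean r = reads()  || other.reads();
            boolean w = writes() || other.writes();
            return r && w ? READ_WRITE : r ? READ_ONLY : w ? WRITE_ONLY : PURE;
        }
    }

    public static void main(String[] args) {
        // The purity of a block is the LUB over its instructions' purities:
        Purity block = Purity.PURE.lub(Purity.READ_ONLY).lub(Purity.WRITE_ONLY);
        System.out.println(block);   // READ_WRITE
    }
}
```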
              read/write
             /          \
     read-only          write-only
             \          /
                 pure

Figure 2. The partial ordering of purities
Figure 3 shows examples of purities of methods. This ordering can be extended to allow
for more complex information, as demonstrated in Section 3.2.
class Purities {
int a;
int b[];
int pure1(int x) { return(x+2); }
int pure2(int x) { return(x+b.length); }
int readonly1(int x) { return(x+a); }
int readonly2(int x) { return(x+b[0]); }
void writeonly1(int x) { a = x; }
void writeonly2(int x) { b[x] = 0; }
}
Figure 3. Examples of pure, read-only and write-only methods
[Figure: an example call graph, in which main() calls foo() and bar(), which reach qux() and baz() (baz() being write-only); cycles in the graph are collapsed into virtual nodes before propagation.]
Once the strongly connected components are found and replaced by virtual nodes, a
depth-first traversal will take only O(ek) time, where e is the number of edges in the
reduced call graph and k is the time needed to find the least upper bound of two purities.
This holds because each edge in the graph is only visited once, and thus cannot be involved
in more than one LUB. Updating the m nodes represented by virtual nodes takes no more
than O(m) time. Thus the whole purity analysis takes time linear in the size of the call
graph, times the height of the purity ordering.
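As a sketch of the propagation pass just described (ours, with illustrative method names and plain maps rather than Cream's data structures): once cycles are collapsed, each node's purity is its own instructions' purity LUBbed with the purities of everything it calls, memoized so each edge is used in exactly one LUB.

```java
import java.util.*;

// Sketch of purity propagation over the reduced (acyclic) call graph.
public class PurityPropagation {
    enum Purity {
        PURE, READ_ONLY, WRITE_ONLY, READ_WRITE;
        Purity lub(Purity o) {
            boolean r = this == READ_ONLY  || this == READ_WRITE || o == READ_ONLY  || o == READ_WRITE;
            boolean w = this == WRITE_ONLY || this == READ_WRITE || o == WRITE_ONLY || o == READ_WRITE;
            return r && w ? READ_WRITE : r ? READ_ONLY : w ? WRITE_ONLY : PURE;
        }
    }

    // callees: edges of the reduced call graph
    // local:   purity of each method's own field-accessing instructions
    static Purity visit(String m, Map<String, List<String>> callees,
                        Map<String, Purity> local, Map<String, Purity> memo) {
        Purity cached = memo.get(m);
        if (cached != null) return cached;          // each edge feeds one LUB only
        Purity p = local.getOrDefault(m, Purity.PURE);
        for (String callee : callees.getOrDefault(m, List.of()))
            p = p.lub(visit(callee, callees, local, memo));
        memo.put(m, p);
        return p;
    }

    public static void main(String[] args) {
        Map<String, List<String>> callees = Map.of(
                "main", List.of("foo", "bar"),
                "bar",  List.of("baz"));
        Map<String, Purity> local = Map.of("baz", Purity.WRITE_ONLY);
        // main is pure itself, but reaches the write-only baz through bar:
        System.out.println(visit("main", callees, local, new HashMap<>()));  // WRITE_ONLY
    }
}
```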
Native methods cannot, of course, be analyzed, so they must be taken as being both
read-only and write-only, that is, read/write. The naive approach also considers
invocations of interface methods and of virtual, non-final methods to be read/write, as
the actual code they execute is undetermined. A more precise analysis of virtual and
interface methods is given in Section 3.3.
To find the interesting loop invariants in such a loop, more information is needed about
which objects have which fields manipulated.
Our first approximation was to find the least common superclass of all objects whose
fields are affected, and store that. Arrays are kept separately, with purity information
kept for each kind of array distinguished in the bytecode (Integer, Long, Float, Double and
Object). This does not take much space, but each LUB operation can take time proportional
to the height of the inheritance tree. We call this subclassing purity.
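The merge step of subclassing purity can be sketched as follows. This is our illustration using plain reflection on loaded Class objects; Cream works on its own class graph:

```java
// Sketch of the LUB used by subclassing purity: information attached to two
// classes is joined at their least common superclass. Walking the superclass
// chain makes each merge cost proportional to the height of the inheritance
// tree, as noted in the text.
public class SubclassLub {
    static Class<?> leastCommonSuperclass(Class<?> a, Class<?> b) {
        for (Class<?> c = a; c != null; c = c.getSuperclass())
            if (c.isAssignableFrom(b)) return c;  // first ancestor of a covering b
        return Object.class;                      // unreachable for class types
    }

    public static void main(String[] args) {
        // A read of an Integer field merged with a read of a Long field becomes
        // "some field of a Number may be read":
        System.out.println(leastCommonSuperclass(Integer.class, Long.class));
        // prints "class java.lang.Number"
    }
}
```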
A better approximation is to store the set of affected classes. This avoids the problem
with subclassing purity, where two simple reads in different classes can cause a total loss
of information, but it takes up much more space. We call this class-based purity.
The optimal representation is, of course, to store exactly which fields have been altered
or inspected. This will take even more space than the class-based purity, but gives as good
information as can be hoped for without using other kinds of analysis to improve the code
first. This is what we call field-based purity.
∗A class being used in several programs could, of course, be optimized separately for each program.
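Field-based purity, the most precise of the three representations, could be sketched as explicit read and write sets with set union as the LUB. This is our illustration; the string field names are purely for exposition:

```java
import java.util.*;

// Sketch of field-based purity: track exactly which fields are read and
// which are written; the LUB of two purities is elementwise set union.
public class FieldPurity {
    final Set<String> read = new HashSet<>();
    final Set<String> written = new HashSet<>();

    FieldPurity lub(FieldPurity other) {
        FieldPurity out = new FieldPurity();
        out.read.addAll(read);       out.read.addAll(other.read);
        out.written.addAll(written); out.written.addAll(other.written);
        return out;
    }

    boolean pureFor(String field)     { return !read.contains(field) && !written.contains(field); }
    boolean readOnlyFor(String field) { return read.contains(field)  && !written.contains(field); }

    public static void main(String[] args) {
        FieldPurity getter = new FieldPurity();
        getter.read.add("Purities.a");        // like readonly1 in Figure 3
        FieldPurity setter = new FieldPurity();
        setter.written.add("Purities.a");     // like writeonly1 in Figure 3
        FieldPurity both = getter.lub(setter);
        System.out.println(both.readOnlyFor("Purities.a"));  // false: now read/write
        System.out.println(both.pureFor("Purities.b"));      // true: b is untouched
    }
}
```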
to Interface1.h() would combine purities of the h() methods in the following classes:
B and H, because they implement Interface1; H, I and G, because these are subclasses
of classes implementing Interface1; and J, because it implements a subinterface of
Interface1.
[Figure: the example class hierarchy, with Object at the root; class A declares g(), class B declares h() and implements Interface1, C is a further class, Interface2 is a second interface, and the classes G, H, I and J referred to above lie lower in the hierarchy.]
First, we consider the non-virtual purity analyses from Sections 3.1 and 3.2. To compare
the quality of the variants, we define a common measurement, the average number of
read-only or write-only fields per method. This is found for a method by counting
through all fields in all known classes and collecting their purity. We then take the average
of these numbers over all the methods that we want to optimize. The results are found in
Table 2.
It may seem odd that the number of write-only or read-only fields can rise when
going to a better analysis. This is because some methods in the better analysis are found to
only read or write that field, but not both. The actual average number of fields that are being
written to is the sum of the fields that are only written to and those that are both written to
and read from. As can be seen, this sum never rises.
There is a big difference in the purity of these programs. Javac, for some reason, gives
poor results, while Cream, having a large subclass structure, has many pure and read-only
methods. The analysis of virtual methods does not give any substantial improvement here,
but it is worth noting that what improvement there is lies in the number of read-only
methods. This can be the effect of small encapsulation methods that only read a single field.
It turned out that Object.<init>, the initialization method for Object, does not
contain any code. This means that any call to it is pure. Since classes that do not
explicitly define an <init> method automatically get one that just calls super.<init>,
there are many pure <init> methods. These account for up to one-quarter of the pure
fields in some cases.
Native methods are also an important consideration. Since we do not know anything about
what they do, they should be considered to be read/write for all fields. Unfortunately,
this makes quite a lot of methods read/write, for instance any that may eventually call
System.out.println. If we could annotate the native code with information about its
purity, then we could avoid this totally. As an intermediate solution, we have used the
option of letting native methods be read/write on all fields in their own object, and pure
on everything else, which gave a tremendous improvement in the purity analysis.
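The intermediate rule for native methods could be sketched as follows. This is our illustration using reflection on ordinary classes; Cream operates on its own class graph, and the Example class here is a hypothetical stand-in:

```java
import java.lang.reflect.Field;
import java.util.*;

// Sketch of the intermediate native-method rule: assume read/write on every
// field declared in the method's own class, and pure on all other fields.
public class NativeApproximation {
    static class Example { int a; int[] b; }   // stand-in for a class with a native method

    // The qualified field names a native method of `owner` is assumed to
    // read and write; everything outside this set is treated as pure.
    static Set<String> assumedReadWrite(Class<?> owner) {
        Set<String> fields = new HashSet<>();
        for (Field f : owner.getDeclaredFields())
            fields.add(owner.getSimpleName() + "." + f.getName());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(assumedReadWrite(Example.class));
    }
}
```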
Table 3 shows the purity when using virtual purity. We see that the virtual purity analysis
gives better results than the non-virtual case. This was not the case when native methods
were considered to be read/write for all fields, where the amount of information levelled
out when reaching subclassing purity. This effect is still visible in the non-virtual purity.
4. USING PURITY
When the purity information has been found, what is it good for? It tells us how safe it is
to transform the code. Most common optimizations can benefit from it.
tions that are write-only for the same field. Line 6 in Figure 7 does not depend on i in any
way, and can be moved outside the loop. We have also implemented this optimization, the
results of which are in Section 4.2. This optimization benefits the most from the subclassing
purity analysis, as it is common practice to make several field accesses in one assignment,
as shown in Figure 5.
[Table (column headings): for each program, the total number of instructions, and the counts under each purity type — none, non-virtual simple, and virtual subclassing, class-based and field-based.]
    Original loop:               After partial unrolling:

    i = startval;                i = startval;
    j = stopval;                 j = stopval;
    loop:                        if (i == j)        (copied exit test)
      if (i == j)                    break;
        break;                   k = x*10+y;        (moved code goes here)
      k = x*10+y;                loop:
      i++;                         if (i == j)
      l = z << y;                      break;
      continue;                    i++;
                                   l = z << y;
                                   continue;

Figure 9. Partial unrolling of a loop: the exit test above the invariant assignment is copied before the loop, and the moved code is placed between the copied test and the loop entry
The dead-code optimization does not benefit at all from the better kinds of analysis,
because it uses no more information than whether an instruction is write-only or not.
For loop-invariant removal the better analyses make much more of a difference, because a
field access can be moved outside a loop only if no instruction in the loop writes exactly
that field.
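The hoisting query can be sketched directly against a field-based result. A sketch of ours; the string field names are illustrative:

```java
import java.util.*;

// Sketch of the loop-invariant query just described: a getfield can be moved
// out of a loop only if no instruction in the loop writes that exact field.
public class PurityQueries {
    // writtenInLoop: fields written anywhere in the loop body, per a
    // field-based purity analysis of the loop
    static boolean canHoistLoad(String loadedField, Set<String> writtenInLoop) {
        return !writtenInLoop.contains(loadedField);
    }

    public static void main(String[] args) {
        Set<String> writtenInLoop = Set.of("Counter.value");
        System.out.println(canHoistLoad("Table.size", writtenInLoop));    // true
        System.out.println(canHoistLoad("Counter.value", writtenInLoop)); // false
    }
}
```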
To get maximum benefit from loop-invariant removal, it is necessary to know whether
the loop body is executed at least once. As most loops appear as while loops at the
bytecode level, whose bodies may execute zero times, it was not possible to move
instructions whose values may be used outside the loop.
To overcome this problem, we added a partial loop unroller. It finds all exits from the
loop above the instruction in question and puts a copy of those before the loop. Instructions
that are moved outside the loop are then placed after the copied part, but before the entry to
the loop. This is shown in Figure 9, where the assignment to k can be moved after unrolling.
The assignment to l can never be moved, as we cannot be sure that it will be executed at all.
Unrolling code turns out to give a factor of five to ten increase in the number of
instructions that can be moved.
On the other hand, this could decrease performance if large parts of code are being
duplicated in order to perform relatively small optimizations. To avoid this, it is necessary
to add some heuristics to judge how much of a loop it is worthwhile to unroll. Any invariant
code in the unrolled part has automatically been moved outside the loop when unrolling,
so those are even cheaper to optimize. Exactly how to get the best results out of this is a
separate subject; in our implementation we have just unrolled all necessary code.
Table 6 shows the improvement in execution speed for the four programs described in
Section 3.4. The table shows raw execution times in seconds and relative execution times
for unoptimized code, optimization without purity analysis, optimization with simple purity
analysis, and optimization with the virtual subclassing, class-based and field-based purity
analyses. The programs have been optimized with both dead-code elimination and loop-invariant
removal, using full unrolling and treating native methods as being read/write only for their
own object. The test programs were run on a PentiumPro with Linux 2.0.28 using JDK 1.02.
The greatest speed improvements are in the BYTEMarks, caused by the many tight,
imperative-style loops in that program. The other programs, though they have been quite
heavily optimized, too, do not show any interesting increase in speed. This seems to indicate
Table 6. Execution times in seconds (relative times in parentheses)

Program     Unoptimized  No purity     Simple        Subclassing   Class-based   Field-based
Kem            2.83      2.80 (0.99)   2.74 (0.97)   2.78 (0.98)   2.77 (0.98)   2.78 (0.98)
BYTEMarks     30.2       30.7 (1.02)   30.7 (1.02)   24.8 (0.82)   24.6 (0.82)   24.1 (0.80)
Javac         17.75      17.56 (0.99)  17.45 (0.98)  17.49 (0.99)  17.55 (0.99)  17.45 (0.98)
Cream          3.8        4.2 (1.09)    4.2 (1.09)    4.1 (1.07)    4.1 (1.07)    4.2 (1.08)
that other kinds of optimizations are necessary for object-oriented programs. The decreased
speed seen for Cream is caused by a poor reconstruction of the abstracted instruction order.
5. RELATED WORK
Little optimization has been done on Java so far, but with the popularity of the language,
we will surely see more work in this area. Sun’s compiler has some inlining, though in
versions before 1.1 it could break the privacy of classes[7]. The only optimization they
suggest is for the runtime system to calculate certain constants ahead of time.
Cierniak and Li[8] have made a flexible compiler/optimizer using an intermediate repre-
sentation. It is also able to optimize bytecode to bytecode, but their article does not mention
any interprocedural analysis or actual results. This is the only optimizer we know of that is
akin to Cream.
There are several bytecode-to-C compilers[9,10], which then use the C compiler for
standard optimizations. While this gives many of the well-known optimizations, it does not
take the special structure of an object-oriented language into account.
Side-effect analysis was first thoroughly covered by Banning in [11], and the methods were
improved upon by Cooper and Kennedy in [3]. These papers considered optimizing
imperative languages, and thus did not consider classes and virtual functions.
are often long chains of field accesses occurring more than once in a loop without being
invariant.
The results obtained here should be compared to what can be done using the Harissa
bytecode-to-C translator[9], both alone and together. Using an optimizing C compiler as
the back-end gives local optimizations such as strength reduction and register allocation
for free, and the large-scale optimizations suggested partially in this paper will still be
available.
ACKNOWLEDGEMENTS
The optimizer is based on Clark Verbrugge’s Coffi program[12]. Although some rewriting
was necessary, it has turned out to be a good starting point for the program. Ulrik Pagh
Schultz co-implemented the first version of the optimizer, primarily the virtual purity
analysis and the class graph, but also acted as a sounding board for ideas. Laurie Hendren
has been helpful with support, advice and lots of corrections.
REFERENCES
1. Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, Compilers: Principles, Techniques and Tools,
Addison-Wesley, 1986.
2. Tim Lindholm and Frank Yellin, The Java Virtual Machine Specification, The Java Series,
Addison-Wesley, Reading, MA, USA, 1996.
3. Keith D. Cooper and Ken Kennedy, ‘Interprocedural side-effect analysis in linear time’, PLDI,
1988, pp. 57–66.
4. Robert Endre Tarjan, ‘Fast algorithms for solving path problems’, J. ACM, 28(3), 594–614
(1981).
5. Jens Palsberg and Michael I. Schwartzbach, Object-Oriented Type Systems, Wiley, 1993.
6. Fritz Henglein, ‘Breaking through the n3 barrier: Faster object type inference’, in Benjamin
Pierce (Ed.), Proc. 4th Int’l Workshop on Foundations of Object-Oriented Languages (FOOL),
Paris, France, January 1997.
7. Doug Bell, ‘Make Java fast: Optimize’, JavaWorld Magazine, April 1996.
8. Michael Cierniak and Wei Li, ‘Briki, A flexible Java compiler’, Technical Report TR 621,
University of Rochester, Computer Science Department, May 1996.
9. G. Muller, B. Moura, F. Bellard and C. Consel, ‘Harissa: a flexible and efficient Java environment
mixing bytecode and compiled code’, to appear in Proceedings of COOTS’97, June 1997.
10. Todd Proebsting, John Hartman, Gregg Townsend, Patrick Bridges, Tim Newsham and Scott
Watterson, Toba: A Java-to-C Translator, Technical Report, University of Arizona, 1997.
11. John Banning, ‘An efficient way to find side effects of procedure calls and aliases of variables’,
POPL, 1979, pp. 29–41.
12. Clark Verbrugge, Using Coffi, Technical report, McGill University, October 1996.