Two-Level Just-in-Time Compilation With One Interpreter and One Engine




Yusuke Izawa, Tokyo Institute of Technology, Tokyo, Japan ([email protected])
Hidehiko Masuhara, Tokyo Institute of Technology, Tokyo, Japan ([email protected])
Carl Friedrich Bolz-Tereick, Heinrich-Heine-Universität Düsseldorf, North Rhine-Westphalia, Germany ([email protected])

arXiv:2201.09268v1 [cs.PL] 23 Jan 2022
PEPM ’22, January 17–18, 2022, Philadelphia, Pennsylvania, United States. 2022.

Abstract

Modern, powerful virtual machines such as those running Java or JavaScript support multi-tier JIT compilation and optimization features to achieve their high performance. However, implementing and maintaining several compilers/optimizers that interact with each other places a heavy burden on VM developers. In this paper, we propose a technique to realize two-level JIT compilation in RPython without implementing several interpreters or compilers from scratch. As a preliminary realization, we created adaptive RPython, which performs both baseline JIT compilation based on threaded code and tracing JIT compilation. We also implemented a small programming language with it. Furthermore, we preliminarily evaluated the performance of that small language: our baseline JIT compilation ran 1.77x faster than the interpreter-only execution. We also observed that when we apply the optimal JIT compilation to each target method, the performance is mostly the same as that of the single optimizing JIT compilation strategy, while saving about 40 % of the compiled code size.

CCS Concepts: • Software and its engineering → Just-in-time compilers.

Keywords: JIT compiler, adaptive compilation, tracing compilation, RPython, language implementation framework

1 Introduction

Language implementation frameworks, e.g. RPython [6] and Truffle/Graal [24], are tools used to build a virtual machine (VM) with a highly efficient just-in-time (JIT) compiler simply by providing an interpreter of the language. For example, PyPy [5], a fully compatible Python implementation, achieved a 4.5x speedup over the original CPython 3.7 interpreter [23]. Other successful examples include Topaz [9], HippyVM [8], TruffleRuby [20], and GraalPython [21].

One limitation of RPython and Truffle/Graal, in contrast to current language-specific VMs such as those for Java or JavaScript, is that they do not support a multitier JIT compilation strategy. One naïve approach to multitier compilation is to create a compiler for each optimization level. However, this approach makes existing implementations more complex and requires more development effort to implement and maintain: at the least, we need to consider how to share components, along with how to exchange profiling information and compiled code between the compilers. To avoid this situation, we have to find a more efficient and reasonable way to support multitier JIT compilation in a framework. Several works in RPython [2, 4, 14] found that placing special annotations in an interpreter definition can influence how RPython’s meta-tracing JIT compiler works. In other words, we can view the interpreter definition as a specification of a compiler. We believe those approaches can be extended to achieve an adaptive optimization system at the meta-level.

As a proof of concept of language-agnostic multitier JIT compilation, we propose adaptive RPython. Adaptive RPython can generate a VM with two-level JIT compilation. We do not create two separate compilers in adaptive RPython but instead generate two different interpreters: one for the first level of compilation and another for the second level. As the first-level compilation, we support threaded code [3, 13] in a meta-tracing compiler (we call this technique threaded code generation [15]). The second level is RPython’s original tracing JIT compilation. We can switch between those compilation tiers by moving between the two interpreters. Because adaptive RPython generates the two interpreters from one definition, it reduces the implementation cost that language developers would otherwise pay in a traditional development style. In the future, we plan to extend this system to realize a nonstop compilation-tier-changing mechanism in a language implementation framework, so that an appropriate and efficient optimization level or compilation strategy can be used depending on the executed program.

The contributions of the current paper can be summarized as follows:
• An approach to generate multiple interpreter implementations from one common interpreter definition to obtain JIT compilers with different optimization levels.
• The technical details of enabling threaded code generation as a first-level JIT compilation by driving an existing meta-tracing JIT compiler engine.

• The preliminary evaluation of our two-level JIT compilation on a simple programming language.

Figure 1. An overview of adaptive RPython: two different interpreters are generated from a single generic interpreter by adaptive RPython at VM generation time. At run-time, the two interpreters, that is, the baseline JIT and tracing JIT interpreters, behave as tier 1 and tier 2, respectively.

    jittierdriver = JitTierDriver(pc='pc')

    class Frame:
        def interp(self):
            pc = 0
            bytecode = self.bytecode
            jitdriver.jit_merge_point(pc=pc,
                bytecode=bytecode, self=self)
            opcode = ord(bytecode[pc])
            pc += 1
            if opcode == JUMP_IF:
                target = bytecode[pc]
                jittierdriver.can_enter_tier1_branch(
                    true_path=target, false_path=pc+1,
                    cond=self.is_true)
                if we_are_in_tier2(kind='branch'):
                    pass  # do stuff

            elif opcode == JUMP:
                target = bytecode[pc]
                jittierdriver.can_enter_tier1_jump(target=target)
                if we_are_in_tier2(kind='jump'):
                    pass  # do stuff

            elif opcode == RET:
                w_x = self.pop()
                jittierdriver.can_enter_tier1_ret(ret_value=w_x)
                if we_are_in_tier2(kind='ret'):
                    pass  # do stuff

Listing 1. A skeleton of the generic interpreter definition.
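At run-time, hint functions such as these need no behavior of their own; they only mark positions and carry information for the transformer. The following sketch shows one way such hints could be modelled as no-op markers. Only the names JitTierDriver, can_enter_tier1_*, and we_are_in_tier2 come from Listing 1; the class body, the marks list, and the return value of we_are_in_tier2 are our assumptions, not the actual adaptive RPython implementation.

```python
# A sketch of the hint machinery from Listing 1, assuming the hints are
# plain no-op markers that a source-to-source transformer later rewrites.
class JitTierDriver:
    """Records the interpreter's program-counter variable name and collects
    the branch/jump/ret positions marked in the generic interpreter."""

    def __init__(self, pc):
        self.pc = pc        # name of the pc variable, e.g. 'pc'
        self.marks = []     # (kind, info) pairs for the transformer

    def can_enter_tier1_branch(self, true_path, false_path, cond):
        # No-op at run-time; the transformer uses the recorded paths to
        # drive both sides of the branch (Section 2.2).
        self.marks.append(('branch', (true_path, false_path, cond)))

    def can_enter_tier1_jump(self, target):
        self.marks.append(('jump', target))

    def can_enter_tier1_ret(self, ret_value):
        self.marks.append(('ret', ret_value))

def we_are_in_tier2(kind):
    # Placeholder: the transformer keeps the guarded block only in the
    # generated tracing-JIT (tier-2) interpreter.
    return False
```

In this reading, the generated baseline JIT interpreter would drop the we_are_in_tier2 blocks entirely, while the tracing JIT interpreter would keep them and discard the tier-1 hints.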
The rest of the current paper is organized as follows: Section 2 proposes an idea and technique to support two-level JIT compilation with one interpreter in RPython without crafting a compiler from scratch. We evaluate our preliminary implementation in Section 3. Section 4 discusses related work, and we conclude the paper and outline future work in Section 5.

2 Two-level Just-in-Time Compilation with RPython

In this section, we propose a multitier meta-tracing JIT compiler framework called adaptive RPython, along with its technique to support two different compilation levels with one interpreter and one engine. Section 2.1 introduces adaptive RPython. Then, Sections 2.2 and 2.3 explain the technical details to realize two-level JIT compilation in RPython; they correspond to “one interpreter” and “one engine,” respectively.

2.1 Adaptive RPython

Adaptive RPython lets the existing meta-tracing engine behave in two ways: as a baseline JIT compiler and as a tracing JIT compiler. To realize this behavior, the most obvious approach would be to implement two different compilers or interpreters. However, this approach increases the amount of implementation work necessary. Thus, we do not craft compilers from scratch but derive two different compilation strategies by providing a specializing interpreter, called the generic interpreter, to adaptive RPython. Figure 1 shows an overview of how adaptive RPython and the generic interpreter work. At VM generation time (the upper half of Figure 1), the developer writes the generic interpreter. Adaptive RPython generates two different interpreters that support different JIT compilation tiers; the baseline JIT and tracing JIT interpreters work as tier 1 and tier 2, respectively. At run-time (the bottom half of Figure 1), a generated baseline JIT interpreter first accepts and runs a base-program. While running the program, the execution switches to a generated tracing JIT interpreter if a hot spot is suitable for tracing JIT compilation.

2.2 Generic Interpreter

Adaptive RPython takes the generic interpreter as its input. The generic interpreter is converted into interpreters with two definitions: one for baseline JIT compilation and another for tracing JIT compilation. To remove the task of manually writing redundant definitions in the method-traversal interpreter [15], we internally generate both the method-traversal interpreter (for baseline JIT compilation) and a normal interpreter (for tracing JIT compilation) from the generic interpreter definition.

To convert the generic interpreter into interpreters with two different definitions, we need to give adaptive RPython some necessary information. For this reason, we implement several hint functions for the generic interpreter, along with RPython’s original hints. The skeleton

is shown in Listing 1. When developers write the generic interpreter, they first declare an instance of the JitTierDriver class that has a field pc. It tells adaptive RPython fundamental information such as the variable name of the program counter. Furthermore, the transforming hints should be placed in specific handlers. can_enter_tier1_branch, can_enter_tier1_jump, and can_enter_tier1_ret tell adaptive RPython’s transformer the information necessary to generate the method-traversal interpreter. The method-traversal interpreter requires a particular kind of code in the JUMP_IF, JUMP, and RET bytecode handlers so that the hints can be called in those handlers. The requisite information in each handler and hint is the following:

can_enter_tier1_branch. The method-traversal interpreter needs to drive the engine to trace both sides of a branch instruction. Thus, we have to pass the following variables/expressions to the transformer:
• the program counters of the true and false paths: to manage them in the traverse stack
• the conditional expression: to generate the if expression.
In the handler of JUMP_IF in Listing 1, we pass the following statements/expressions: target as true_path, pc+1 as false_path, and self.is_true as cond. We also remember and mark the cond for use at trace-stitching inside adaptive RPython (details are explained in Section 2.3.1).

can_enter_tier1_jump. A jump instruction is possibly placed at the end of a function/method. In this case, the engine does not follow the destination but takes a program counter from the top of the traverse_stack. Otherwise, the jump instruction behaves as defined in the original interpreter. Thus, the method-traversal interpreter requires the program counter of the jump target to manage it in the traverse_stack. As shown in the handler of JUMP in Listing 1, we pass target to the transform_jump function.

can_enter_tier1_ret. A return instruction is invoked at the end of a function/method. The method-traversal interpreter requires a return value, so we have to pass w_x via transform_ret, as illustrated in the handler of RET in Listing 1.

we_are_in_tier2. This is a hint function that tells the transformer where a definition for the tracing JIT compilation is written. The one thing we need to do is specify the kind of the handler function defined immediately above; for example, we add the keyword argument kind='branch' in the handler of JUMP_IF in Listing 1.

Figure 2. An overview of method-traversing in a program with nested branches. The left-hand side shows how we drive the meta-tracing engine by using the traverse stack; the snake line represents the trail of tracing. The right-hand side is the resulting trace.

2.3 Just-in-Time Trace Stitching

The just-in-time trace-stitching technique rebuilds the original control flow of a trace yielded from the method-traversal interpreter in order to emit executable JITted code. This is the essential component in the baseline JIT compilation of adaptive RPython.

To reconstruct the original control flow, we need to connect the correct guard operation and bridging trace. Therefore, we should treat the destination of a “false” branch in a special manner while doing trace-stitching. In tracing JIT compilers, guards are inserted at branching instructions.1 After a guard operation fails numerous times, a tracing JIT compiler will trace the destination path starting from the failing guard, that is, trace the other branch. The resulting trace from a guard failure is called a bridge, which will be connected to the loop body. In trace-stitching, on the other hand, we perform the sequence of generating and connecting a bridge in one go. For the sake of simplicity and to reduce the code size, we generate a trace that essentially consists of call operations produced by threaded code generation.2

The left-hand side of Figure 2 illustrates the process of traversing an example target method that has nested branches. In tracing a branch instruction, we record the other destination in the traverse stack; for example, in node B we record the program counter from B to D. When tracing B and C, we push the program counters from B to D and from C to F onto the traverse stack. Then, we pop a program counter from the traverse stack and continue tracing at E and F. After traversing all paths, we obtain a single trace that does not keep the structure of the target, as shown on the right-hand side of Figure 2.

1 Technically speaking, type-checking guards are inserted to optimize the obtained trace based on the observed run-time type information. However, we currently handle branching guards only and leave type optimization to the last JIT tier.
2 Note that threaded code generation [15] yields call and guard operations. In contrast to normal tracing JIT compilation, it can reduce the code size and compilation time by 80 % and 60 %, respectively.

Figure 3. An overview of just-in-time trace-stitching. This shows how we resolve the relations between guard failures and bridges, taking Figure 2 as an example.

2.3.1 Resolving Guard Failures and Bridges. To resolve the relations between guards and bridges, we utilize the nature of the method-traversal interpreter: it manages the base-program’s branches as a stack data structure, so the connections are first-in, last-out. Thus, we implement a guard failure stack to manage each guard’s failure. The guard failure stack saves the guard failure of each guard operation and pops one at the start of a bridge, that is, right after the emit_jump or emit_ret operations that indicate a cut-here line.

Before explaining the details of the algorithm, we give a high-level overview of how we resolve the connections between guard failures and bridges, using Figure 3 as an example. The necessary information is which guard failure goes to which bridge. We resolve it by sequentially reading the trace and utilizing the guard failure stack. When we start to stitch the trace shown in Figure 3, we sequentially read the operations from the trace. In node B, when it turns out that the guard operation was already marked as cond in the generic interpreter, we take its guard failure (g1) and push it onto the guard failure stack. In node C, we do the same as in node B. Next, in node E, we finish tracing an operation and cut the current trace. In addition, we pop a guard failure and connect it to the bridge that we are going to retrieve, because the bridges are lined up below in depth-first order. In node F, we do the same as in node E. In node D, finally, we finish reading and produce one trace and two bridges. The connections are illustrated as red arrows in Figure 3.

2.3.2 The Mechanism of Just-in-Time Trace-Stitching. How JIT trace-stitching works is explained through Algorithm 1. First, the trace-stitcher DoTraceStitching prepares an associative array called token_map, whose (key, value) pair is (program counter, target_token).3 Next, it declares guard_failure_stack, trace, and result. guard_failure_stack is the key data structure to resolve the relation between guard failures and bridges. trace temporarily stores handled operations, and result memorizes pairs of a trace and its corresponding guard_failure. After that, the stitcher processes each operation in the given ops. We specifically handle the following operations: (1) a guard operation marked by the generic interpreter, (2) the pseudo call operations CallOp(emit_ret) and CallOp(emit_jump), and (3) RPython’s return and jump operations. Note that almost all operations except guards are represented as call operations because the resulting trace is produced by our threaded code generator.

Marked Guard Operation. To resolve the guard–bridge relation, we first collect the guard operations that we have marked in the generic interpreter. The reason we mark some guards is to distinguish branching guards from others. When we encounter such a guard operation, we take its guard_failure and append it to the guard_failure_stack. This algorithm is shown in the first branching block of Algorithm 1.

Pseudo call operations. The pseudo functions emit_jump and emit_ret are used as signs to cut the trace at their position and to start recording a bridge. They are represented as Call(emit_jump) and Call(emit_ret) in a trace.
In the case of CallOp(emit_jump):
1. Look up a target_token from the token_map, using the target program counter as a key.
2. Try to retrieve a guard_failure from the guard_failure_stack. If the stack is empty, we do nothing because the currently recorded trace is a body (not a bridge). Otherwise, pop a guard failure from the guard_failure_stack.
3. Create a jump operation with the inputargs and target_token and append it to the trace.
4. Append the pair of the trace and guard_failure to the result.
In the case of CallOp(emit_ret):
1. Take the return value (retval) from the operation.
2. Take a guard_failure if it is not the first time.
3. Create a return operation with the retval and append it to the trace.
4. Append the pair of the trace and guard_failure to the result.

RPython’s Jump and Return. These operations are placed at the end of a trace. When we read them, we append the pair of the operation and the retrieved guard failure to the result; then, we have finished reading.

3 target_token is an identifier for a trace/bridge. When we encounter emit_jump or emit_ret operations, we take the program counter (key) passed as an argument to them, hence creating a new target_token (value).
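Algorithm 1 itself is not reproduced in this excerpt, so the following is only a sketch of the stack discipline described above. Operations are modelled as plain tuples, and the encoding is our assumption: each marked guard pushes its failure, each emit_jump/emit_ret cuts the trace, and the failure popped at a cut tells which guard the next recorded segment (a bridge) belongs to.

```python
# A sketch of the trace-stitching bookkeeping (Section 2.3.2).
def stitch(ops):
    guard_failure_stack = []
    trace, result = [], []
    pending_failure = None          # failure owning the current segment;
                                    # None means the segment is the body
    for op in ops:
        kind = op[0]
        if kind == 'marked_guard':
            # (1) a branching guard marked as cond: remember its failure
            guard_failure_stack.append(op[1])
            trace.append(op)
        elif kind in ('emit_jump', 'emit_ret'):
            # (2) pseudo call: cut here and finish the current segment
            trace.append(('jump' if kind == 'emit_jump' else 'ret', op[1]))
            result.append((tuple(trace), pending_failure))
            trace = []
            # the next segment is the bridge for the most recently
            # recorded unresolved guard failure (first-in, last-out)
            pending_failure = (guard_failure_stack.pop()
                               if guard_failure_stack else None)
        else:
            trace.append(op)        # ordinary call operation
    return result

# The trace of Figure 3, flattened: body through E, then bridges F and D.
ops = [('call', 'op_A'),
       ('marked_guard', 'g1'),     # node B
       ('marked_guard', 'g2'),     # node C
       ('call', 'op_E'), ('emit_ret', None),
       ('call', 'op_F'), ('emit_ret', None),
       ('call', 'op_D'), ('emit_jump', 'loop_header')]
```

Running stitch(ops) yields three segments whose failures are None (the loop body), then g2, then g1, matching the depth-first order of the bridges in Figure 3.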

3 Preliminary Evaluation

In this section, we evaluate whether our implementations of JIT trace-stitching and baseline JIT compilation actually work. Because our work is not finished, these results can only be preliminary.

3.1 Implementation

We wrote a small interpreter called tla in adaptive RPython and executed several micro-benchmark programs on it. The tla language has both primitive and object types, and they are dynamically typed.

Simulating Multitier Compilation. To conduct this preliminary evaluation, we partially implemented two-level JIT compilation by separately defining interpreters for each execution tier: baseline JIT compilation (tier 1) and tracing JIT compilation (tier 2). In other words, to support multitier JIT compilation, we prepared two interpreters with different jitdrivers in tla; one is for baseline JIT compilation while the other is for tracing JIT compilation. During the execution of a base-program, we call the separate interpreters for the different JIT compilation tiers that are manually specified in the base-program.

Accessibility. The implementations of the proof-of-concept adaptive RPython and the tla interpreter are hosted on Heptapod.4 Moreover, the generic interpreter implementations are hosted on our GitHub organization.5

3.2 Targets

To verify that our JIT trace-stitching mechanism works on a program with some complex structure, we wrote a single-loop program loop and a nested-loop program loopabit in tla for the experiments.

Furthermore, to confirm the effectiveness of shifting between different JIT compilation levels, we wrote callabit, which has two different methods; each method has a single loop. According to the specified JIT compilation strategy, callabit has the following variants:
(a) callabit_baseline_interp: the main method is compiled by baseline JIT compilation, but the other is interpreted.
(b) callabit_baseline_only: the two methods are compiled by baseline JIT compilation.
(c) callabit_baseline_tracing: the main method is compiled by baseline JIT compilation, and the other is compiled by tracing JIT compilation.
(d) callabit_tracing_baseline: the main method is compiled by tracing JIT compilation, and the other by baseline JIT compilation.
(e) callabit_tracing_only: all methods are compiled by tracing JIT compilation.
All bytecode programs used for this evaluation are shown in Appendix C.2.

3.3 Methodology

We took two kinds of data: stable and startup speeds. When we measured the stable speed, we discarded the first iteration and accumulated the elapsed time of 100 iterations. To measure the startup speed, we iterated the spawning of an execution process 100 times. In addition, for the callabit programs, we recorded how many operations were emitted and how much time was consumed in tracing and compiling. Note that in every benchmark, we did not change the original RPython’s default threshold for entering JIT compilation.

We conducted the preliminary evaluation in the following environment: CPU: Ryzen 9 5950X, Mem: 32 GB DDR4-3200, OS: Ubuntu 20.04.3 LTS with a 64-bit Linux kernel 5.11.0-34-generic.

3.4 Result

Compiling Single and Nested Loops. Figure 4 visualizes the resulting traces from loopabit.tla under baseline JIT compilation. Looking at this figure, we can initially confirm that our baseline JIT compilation works correctly. Figures 5a and 5b show the stable and startup times of the loop and loopabit programs, respectively. In stable speed, on average, baseline JIT and tracing JIT are 1.7x and 3.25x faster than the interpreter-only execution; in other words, baseline JIT compilation is about 2x slower than tracing JIT compilation. In startup time (Figure 5b), baseline JIT compilation is about 1.9x faster and tracing JIT compilation is about 5x faster. The loop and loopabit programs are well suited to tracing JIT compilation, so we consider that tracing JIT compilation should be dominant in executing such programs. Furthermore, when applying baseline JIT compilation to such a program, we should reduce the threshold value for entering JIT compilation. Tuning this value is left for future work.

Simulating Multitier JIT Compilation. Figure 6 shows how the simulated multitier JIT compilation works on a program with two different methods. The callabit program has a main method that repeatedly calls sub_loop, which reduces the given number one by one. The method call is implemented not by a jump but by invoking an interpreter, so the effectiveness of inlining by tracing JIT compilation is limited in this case. In other words, main is relatively suitable for baseline JIT compilation, and sub_loop works well with tracing JIT compilation.
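For intuition, the described shape of callabit can be pictured with the following plain-Python analogue. The real benchmark is tla bytecode, and these bodies and counts are our own reconstruction from the description above: main is dominated by repeated method calls that cross the interpreter boundary, while sub_loop is the kind of tight counting loop that tracing JIT compilation handles well.

```python
# A Python analogue of the assumed callabit structure: `main` is dominated
# by call overhead (suits baseline JIT), while `sub_loop` is a tight
# counting loop (suits tracing JIT). The counts are illustrative only.
def sub_loop(n):
    # reduce the given number one by one
    while n > 0:
        n -= 1
    return n

def main(iterations, n):
    # repeatedly call sub_loop; each call crosses the interpreter boundary
    acc = 0
    for _ in range(iterations):
        acc += sub_loop(n)
    return acc
```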
4 https://foss.heptapod.net/pypy/pypy/-/tree/branch/threaded-code-generation/rpython/jit/tl/threadedcode
5 https://github.com/prg-titech/mti_transformer

In the stable and startup speeds (Figures 6a and 6b), callabit_baseline_interp is about 3 % slower than the interpreter-only execution. This means that repeating back

and forth between native code and an interpreter execution leads to run-time overhead. Meanwhile, the combination of baseline JIT and tracing JIT compilation (callabit_baseline_tracing) is as fast as the tracing-JIT-only strategy (callabit_tracing_only). Additionally, looking at Figure 7, the baseline-tracing JIT strategy’s trace size is about 40 % smaller than that of the tracing-JIT-only strategy. The trace sizes are the same between the baseline-tracing and tracing-baseline JIT strategies, but the tracing-baseline JIT strategy is about 45 % slower than the baseline-tracing JIT strategy, and the baseline-only strategy is about 5 % faster than the tracing-baseline JIT strategy. From these results, we can deduce that there is a ceiling to using only a single JIT strategy. To leverage different levels of JIT compilation, we have to apply an appropriate compilation according to the structure or nature of the target program.

In summary, our baseline JIT compilation is about 1.77x faster than the interpreter-only execution in both the stable and startup speeds.6 Moreover, our baseline JIT compilation is only about 43 % slower than the tracing JIT compilation, even though it has very few optimizations, such as inlining and type specialization. This means that our approach of enabling baseline JIT compilation alongside tracing JIT compilation has enough potential to work as a startup compilation if we carefully adjust the threshold for entering baseline JIT compilation. This is left as future work.

4 Related Work

Both well-developed VMs, such as the Java VM or JavaScript VMs, and research-oriented VMs of a certain size support multitier JIT compilation to balance startup speed, compilation time, and memory footprint. As far as the authors know, such VMs build at least two different compilers to realize multitier optimization. In contrast, our approach realizes it in one engine within a language implementation framework.

The Java HotSpot™ VM has two different compilers, C1 [16] and C2 [22], and four optimization levels. The typical path moves through levels 0, 3, and 4. Level 0 means interpreting. On level 3, the C1 compiler compiles a target with profiling information gathered by the interpreter. If C2’s compilation queue is not full and the target turns out to be hot, C2 starts to optimize the method aggressively (level 4). Levels 1 and 2 are used when C2’s compilation queue is full or level 3 optimization cannot work.

The Firefox JavaScript VM, SpiderMonkey [17], has several interpreters and compilers to enable multitier optimization. For interpreters, it has normal and baseline interpreters [18]. The baseline interpreter supports inline caches [7, 12] to improve its performance. The baseline JIT compiler uses the same inline caching mechanism, but it translates the entire bytecode into machine code. In addition, a full-fledged compiler, WarpMonkey [19], compiles a hot spot into fast machine code. Besides the JavaScript engine, the SpiderMonkey VM has an interpreter and a compiler for WebAssembly, called WASM-Baseline and WASM-Ion.

Google’s JavaScript engine V8, which is included in the Chrome browser, also supports a multitier compilation mechanism [10]. V8 sees it as a problem that JIT-compiled code can consume a large amount of memory even when it runs only once. Its baseline interpreter is called Ignition, and it is highly optimized to collaborate with V8’s JIT compiler engine, Turbofan. It can reduce the code size by up to 50 %.

Google’s V8 has another compiler, Liftoff [11]. The Liftoff compiler is designed as a startup compiler for WebAssembly and works alongside Turbofan. Turbofan is based on its intermediate representation (IR), so it needs to translate WebAssembly code into the IR, which reduces the startup performance of the Chrome browser. Liftoff instead directly compiles WebAssembly code into machine code; it is tuned to quickly generate memory-efficient code to reduce the memory footprint at startup time.

The Jikes Java Research VM (originally called Jalapeño) [1], which was developed by IBM Research, is a research-oriented VM written in Java. It has baseline and optimizing JIT compilers and supports an optimization strategy with three tiers.

5 Conclusion and Future Work

In the current paper, we proposed the concept and an initial-stage implementation of adaptive RPython, which can generate a VM that supports two-tier compilation. In realizing adaptive RPython, we did not implement another compiler from scratch but drove the existing meta-tracing JIT compilation engine with a specially instrumented interpreter called the generic interpreter. The generic interpreter supports a fluent API that can be easily integrated with RPython’s original hint functions. The adaptive RPython compiler generates different interpreters, each supporting a different compilation tier. JIT trace-stitching reconstructs the initial control flow of a trace generated from the baseline JIT interpreter to emit executable native code. In our preliminary evaluation, when we manually applied a suitable compilation depending on the control flow of a target method, we confirmed that the baseline-tracing JIT compilation runs as fast as the tracing-JIT-only compilation and reduces the trace size by 50 %. From this result, selecting an appropriate compilation strategy according to a target program’s control flow or nature is essential in multitier compilation.

6 We calculated the geometric mean of loop, loopabit, and callabit_baseline_only in both stable and startup speeds.

To implement an internal graph-to-graph conversion of the generic interpreter in RPython is something we plan

to work on next. We currently implement the generic interpreter transformer as a source-to-source tool because it is a proof of concept. For a smoother integration with RPython, we need to switch implementation strategies in the future.

To realize the technique of automatically shifting JIT compilation tiers in adaptive RPython, we also need to investigate a compilation scheme, including suitable heuristics for when to go from one tier to the next.

Finally, we would like to apply our adaptive RPython techniques to the PyPy programming language because it brings many benefits. For example, we can obtain a lot of data by running adaptive RPython on existing polished benchmark programs to determine a suitable threshold for switching JIT compilation. Furthermore, we could potentially bring our research results to many Python programmers.

Acknowledgments

We would like to thank the reviewers of the PEPM 2022 workshop for their valuable comments. This work was supported by JSPS KAKENHI grant number 21J10682 and JST ACT-X grant number JPMJAX2003.

References

Machinery, New York, NY, USA, 297–302. https://doi.org/10.1145/800017.800542
[1] Bowen Alpern, C. R. Attanasio, Anthony Cocchi, Derek Lieber, Stephen Smith, Ton Ngo, John J. Barton, Susan Flynn Hummel, Janice C. Sheperd, and Mark Mergen. 1999. Implementing Jalapeño in Java. In Proceedings of the 14th ACM SIGPLAN Conference on Object-Oriented Pro-
[8] Maciej Fijałkowski, Armin Rigo, Rafał Gałczyński, Ronan Lamy, Sebastian Pawluś, Ashwini Oruganti, and Edd Barrett. 2014. HippyVM - an implementation of the PHP language in RPython. Retrieved 2021-10-07 from http://hippyvm.baroquesoftware.com
[9] Alex Gaynor, Tim Felgentreff, Charles Nutter, Evan Phoenix, Brian Ford, and the PyPy development team. 2013. A high performance ruby, written in RPython. Retrieved 2021-10-07 from http://docs.topazruby.com/en/latest/
[10] Google. 2016. Firing up the Ignition interpreter. Retrieved 2021-10-07 from https://v8.dev/blog/ignition-interpreter
[11] Google. 2018. Liftoff: a new baseline compiler for WebAssembly in V8. https://v8.dev/blog/liftoff
[12] Urs Hölzle, Craig Chambers, and David Ungar. 1991. Optimizing dynamically-typed object-oriented languages with polymorphic inline caches. In ECOOP’91 European Conference on Object-Oriented Programming, Pierre America (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 21–38.
[13] P. Joseph Hong. 1992. Threaded Code Designs for Forth Interpreters. SIGFORTH Newsl. 4, 2 (Oct. 1992), 11–16. https://doi.org/10.1145/146559.146561
[14] Ruochen Huang, Hidehiko Masuhara, and Tomoyuki Aotani. 2016. Improving Sequential Performance of Erlang Based on a Meta-tracing Just-In-Time Compiler. In International Symposium on Trends in Functional Programming. Springer, 44–58.
[15] Yusuke Izawa, Hidehiko Masuhara, Carl Friedrich Bolz-Tereick, and Youyou Cong. 2021. Threaded Code Generation with a Meta-tracing JIT Compiler. (Sept. 2021). arXiv:2106.12496. Submitted for publication.
[16] Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Kenneth Russell, and David Cox. 2008. Design of the Java HotSpot™ Client Compiler for Java 6. ACM Trans.
Archit. Code Optim. 5, 1, Article 7 (May 2008), 32 pages. https:
gramming, Systems, Languages, and Applications (Denver, Colorado,
//doi.org/10.1145/1369396.1370017
USA) (OOPSLA ’99). Association for Computing Machinery, New York,
[17] Mozilla. 2019. Spider Monkey: Mozilla’s JavaScript and WebAssembly
NY, USA, 314–324. https://doi.org/10.1145/320384.320418
Engine. Retrieved 2021-10-07 from https://spidermonkey.dev
[2] Spenser Bauman, Carl Friedrich Bolz, Robert Hirschfeld, Vasily Kir-
[18] Mozilla. 2019. SpiderMonkey’s JavaScript Interpreter and Compiler.
ilichev, Tobias Pape, Jeremy G. Siek, and Sam Tobin-Hochstadt. 2015.
Retrieved 2021-09-27 from https://firefox-source-docs.mozilla.org/js
Pycket: A Tracing JIT for a Functional Language. In Proceedings of the
[19] Mozilla. 2020. Warp: Improved JS performance in Firefox 83. Retrieved
20th ACM SIGPLAN International Conference on Functional Program-
2021-10-07 from https://hacks.mozilla.org/2020/11/warp-improved-js-
ming (Vancouver, BC, Canada) (ICFP 2015). ACM, New York, NY, USA,
performance-in-firefox-83/
22–34. https://doi.org/10.1145/2784731.2784740
[20] Oracle Lab. 2013. A high performance implementation of the Ruby
[3] James R. Bell. 1973. Threaded Code. Commun. ACM 16, 6 (June 1973),
programming language. https://github.com/oracle/truffleruby
370–372. https://doi.org/10.1145/362248.362270
[21] Oracle Labs. 2018. Graal/Truffle-based implementation of Python. Re-
[4] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijałkowski, Michael
trieved 2021-10-07 from https://github.com/graalvm/graalpython
Leuschel, Samuele Pedroni, and Armin Rigo. 2011. Runtime Feedback
[22] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java
in a Meta-tracing JIT for Efficient Dynamic Languages. In Proceedings
Hotspot™ Server Compiler. In Proceedings of the 2001 Symposium on
of the 6th Workshop on Implementation, Compilation, Optimization of
JavaTM Virtual Machine Research and Technology Symposium - Volume
Object-Oriented Languages, Programs and Systems (Lancaster, United
1 (Monterey, California) (JVM ’01). USENIX Association, USA, 1.
Kingdom) (ICOOOLPS ’11). ACM, New York, NY, USA, Article 9, 8 pages.
[23] PyPy development team. 2009. PyPy Speed Center. Retrieved 2021-09-
https://doi.org/10.1145/2069172.2069181
27 from https://speed.pypy.org
[5] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijałkowski, and Armin
[24] Thomas Würthinger, Christian Wimmer, Christian Humer, Andreas
Rigo. 2009. Tracing the Meta-level: PyPy’s Tracing JIT Compiler. In
Wöß, Lukas Stadler, Chris Seaton, Gilles Duboscq, Doug Simon,
Proceedings of the 4th Workshop on the Implementation, Compilation,
and Matthias Grimmer. 2017. Practical Partial Evaluation for High-
Optimization of Object-Oriented Languages and Programming Systems
performance Dynamic Language Runtimes. In Proceedings of the 38th
(Genova, Italy). ACM, New York, NY, USA, 18–25. https://doi.org/10.
ACM SIGPLAN Conference on Programming Language Design and Im-
1145/1565824.1565827
plementation (Barcelona, Spain) (PLDI 2017). ACM, New York, NY, USA,
[6] Carl Friedrich Bolz and Laurence Tratt. 2015. The Impact of Meta-
662–676. https://doi.org/10.1145/3062341.3062381
tracing on VM Design and Implementation. Science of Computer Pro-
gramming 98 (2015), 408 – 421. https://doi.org/10.1016/j.scico.2013.02.
001 Special Issue on Advances in Dynamic Languages.
[7] L. Peter Deutsch and Allan M. Schiffman. 1984. Efficient Implemen-
tation of the Smalltalk-80 System. In Proceedings of the 11th ACM
SIGACT-SIGPLAN Symposium on Principles of Programming Languages
(Salt Lake City, Utah, USA) (POPL ’84). Association for Computing
PEPM ’22, January 17–18, 2022, Philadelphia, Pennsylvania, United States Izawa, Masuhara, and Bolz-Tereick.
A The Algorithm of Just-in-Time Trace-Stitching

Algorithm 1: DoTraceStitching(inputargs, ops)

    input: red variables inputargs of the given trace
    input: a list of operations ops taken from the given trace
    /* Note that token_map and guard_failure_stack are global variables */
    token_map ← CreateTokenMap(ops);
    guard_failure_stack, trace, result ← [], [], [];
    for op in ops do
        if op is guard and marked then
            guard_failure ← GetGuardFailure(op);
            append guard_failure to guard_failure_stack;
        else if op is call then
            if op is CallOp(emit_jump) then
                trace, guard_failure ← HandleEmitJump(op, inputargs);
                append (trace, guard_failure) to result;
                trace ← [];
            else if op is CallOp(emit_ret) then
                trace, guard_failure ← HandleEmitRet(op);
                append (trace, guard_failure) to result;
                trace ← [];
            else
                append op to trace;
        else if op is JumpOp then
            append op to trace;
            guard_failure ← PopGuardFailure();
            append (trace, guard_failure) to result;
            break;
        else if op is RetOp then
            append op to trace;
            guard_failure ← PopGuardFailure();
            append (trace, guard_failure) to result;
            break;
        else
            append op to trace;
    return result;

    Function PopGuardFailure():
        if first pop? then
            return None;
        else
            failure ← pop the element from guard_failure_stack;
            return failure;

    Function HandleEmitJump(op, inputargs):
        target ← GetProgramCounter(op);
        token ← token_map[target];
        guard_failure ← PopGuardFailure();
        append JumpOp(args, token) to trace;
        return trace, guard_failure;

    Function HandleEmitRet(op):
        retval ← GetRetVal(op);
        guard_failure ← PopGuardFailure();
        append RetOp(retval) to trace;
        return trace, guard_failure;

[Figure 4: The visualization of the resulting traces from loopabit.tla compiled by baseline JIT compilation. Note that each trace was joined at one compile time.]

B Results of the Preliminary Evaluation

[Figure 5: The results of loop and loopabit ("TLA w/ Adaptive RPython", executing with the baseline JIT vs. with the tracing JIT, over loop, loopabit, and their geometric means). (a) Stable speeds. (b) Startup speeds. The y-axis is the speed-up ratio normalized to the interpreter-only execution. Higher is better.]
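As a companion to Algorithm 1 above, the following plain-Python sketch shows the same control flow in runnable form. It is our simplification, not the RPython implementation: operations are modeled as tuples, the token map is elided (stitched jumps carry the target pc directly), and guard-failure descriptors are plain tuples instead of RPython trace-IR objects.

```python
def do_trace_stitching(ops):
    """Split one long trace into stitched (trace, guard_failure) pairs.

    Each op is a tuple whose first element names its kind:
      ("guard", marked)          -- a guard; marked guards record a failure
      ("call", "emit_jump", pc)  -- a call that ends a sub-trace with a jump
      ("call", "emit_ret", val)  -- a call that ends a sub-trace with a return
      ("jump",) / ("ret",)       -- ops that terminate the whole trace
      ("op", payload)            -- any other operation
    """
    guard_failure_stack = []
    first_pop = True
    trace, result = [], []

    def pop_guard_failure():
        # The first pop returns None: the first stitched trace is entered
        # from the interpreter rather than from a guard failure.
        nonlocal first_pop
        if first_pop:
            first_pop = False
            return None
        return guard_failure_stack.pop()

    for op in ops:
        kind = op[0]
        if kind == "guard" and op[1]:
            # A marked guard becomes a failure descriptor that a later
            # stitched trace is attached to.
            guard_failure_stack.append(("guard_failure", op))
        elif kind == "call" and op[1] == "emit_jump":
            # Close the current sub-trace with a stitched jump to op's pc.
            trace.append(("jump_to", op[2]))
            result.append((trace, pop_guard_failure()))
            trace = []
        elif kind == "call" and op[1] == "emit_ret":
            trace.append(("ret_val", op[2]))
            result.append((trace, pop_guard_failure()))
            trace = []
        elif kind in ("jump", "ret"):
            trace.append(op)
            result.append((trace, pop_guard_failure()))
            break
        else:
            trace.append(op)
    return result
```

For example, a trace containing one marked guard and one emit_jump call is split into two stitched sub-traces, the second paired with the failure descriptor of the guard that precedes it.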
[Figure 6: The results of callabit programs (callabit_baseline_interp, callabit_baseline_only, callabit_baseline_tracing, callabit_tracing_baseline, callabit_tracing_only) with simulated multi-tier JIT compilation. (a) Stable speeds. (b) Startup speeds. The y-axis is the speed-up ratio normalized to the interpreter-only execution. Higher is better.]

[Figure 7: The trace sizes and compilation times (ms) in callabit programs. The program is so small that the compilation time is at most 3 % of the total.]

C Programs

C.1 The Definition of Traverse Stack

    class TraverseStack:
        _immutable_fields_ = ['pc', 'next']

        def __init__(self, pc, next):
            self.pc = pc
            self.next = next

        def t_pop(self):
            return self.pc, self.next

        @elidable
        def t_is_empty(self):
            return self is _T_EMPTY

    _T_EMPTY = None

    @elidable
    def t_empty():
        return _T_EMPTY

    memoization = {}

    @elidable
    def t_push(pc, next):
        key = pc, next
        if key in memoization:
            return memoization[key]
        result = TraverseStack(pc, next)
        memoization[key] = result
        return result

Listing 2. The definition of traverse_stack.

C.2 Bytecode Programs Used for Preliminary Evaluation

    # loop.tla
    tla.DUP,
    tla.CONST_INT, 1,
    tla.LT,
    tla.JUMP_IF, 11,
    tla.CONST_INT, 1,
    tla.SUB,
    tla.JUMP, 0,
    tla.CONST_INT, 10,
    tla.SUB,
    tla.EXIT,

Listing 3. The definition of loop.

    # loopabit.tla
    tla.DUP,
    tla.CONST_INT, 1,
    tla.SUB,
    tla.DUP,
    tla.CONST_INT, 1,
    tla.LT,
    tla.JUMP_IF, 12,
    tla.JUMP, 1,
    tla.POP,
    tla.CONST_INT, 1,
    tla.SUB,
    tla.DUP,
    tla.DUP,
    tla.CONST_INT, 1,
    tla.LT,
    tla.JUMP_IF, 25,
    tla.JUMP, 1,
    tla.EXIT

Listing 4. The definition of loopabit.
PEPM ’22, January 17–18, 2022, Philadelphia, Pennsylvania, United States Izawa, Masuhara, and Bolz-Tereick.

1 # callabit . tla 14 t l a . JUMP_IF , 1 5 ,


2 # - callabit_baseline_interp replaces XXX with tla . CALL_NORMAL 15 t l a . JUMP , 0 ,
, 16 16 t l a . EXIT ,
3 # - callabit_baseline_tracing and callabit_traing_only replace 17 # sub_loop (n)
XXX 18 t l a . CONST_INT , 1 ,
4 # with tla . CALL_JIT , 16 19 t l a . SUB ,
5 # main (n) 20 t l a . DUP ,
6 t l a . DUP , 21 t l a . CONST_INT , 1 ,
7 t l a . CALL , 1 6 , # XXX 22 t l a . LT ,
8 t l a . POP , 23 t l a . JUMP_IF , 2 7 ,
9 t l a . CONST_INT , 1 , 24 t l a . JUMP , 1 6 ,
10 t l a . SUB , 25 t l a . RET , 1

Listing 5. The definition of callabit.


11 t l a . DUP ,
12 t l a . CONST_INT , 1 ,
13 t l a . LT ,

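To make the flat bytecode format of the listings concrete, here is a toy stack-machine evaluator for loop.tla. The opcode semantics are our assumptions inferred from the listings (operand order of LT and SUB, EXIT returning the top of the stack); the real tla module is not reproduced in the paper, so symbolic opcodes stand in for the tla.* constants.

```python
# Symbolic opcodes standing in for the tla.* constants (assumed encoding).
DUP, CONST_INT, LT, JUMP_IF, SUB, JUMP, EXIT = range(7)

# loop.tla from Listing 3: count n down to 0, then subtract 10 and exit.
LOOP = [
    DUP,
    CONST_INT, 1,
    LT,
    JUMP_IF, 11,   # when top-of-stack < 1, jump to offset 11
    CONST_INT, 1,
    SUB,
    JUMP, 0,
    CONST_INT, 10,
    SUB,
    EXIT,
]

def run(bytecode, n):
    # A naive evaluator; n is the initial top of the operand stack.
    stack, pc = [n], 0
    while True:
        op = bytecode[pc]
        if op == DUP:
            stack.append(stack[-1])
            pc += 1
        elif op == CONST_INT:
            stack.append(bytecode[pc + 1])
            pc += 2
        elif op == LT:            # push (second-from-top < top)
            b, a = stack.pop(), stack.pop()
            stack.append(a < b)
            pc += 1
        elif op == JUMP_IF:       # jump when the popped flag is true
            pc = bytecode[pc + 1] if stack.pop() else pc + 2
        elif op == SUB:           # push (second-from-top - top)
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
            pc += 1
        elif op == JUMP:
            pc = bytecode[pc + 1]
        elif op == EXIT:
            return stack.pop()
```

Under these assumed semantics, the loop body decrements the counter until it falls below 1, making the program a pure counting loop of the kind the baseline and tracing tiers are compared on.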