Mambo – A Full System Simulator for the PowerPC Architecture

Patrick Bohrer Mootaz Elnozahy Ahmed Gheith Charles Lefurgy Tarun Nakra
James Peterson Ram Rajamony Ron Rockhold Hazim Shafi Rick Simpson
Evan Speight Kartik Sudeep Eric Van Hensbergen
Lixin Zhang

IBM Austin Research Lab


Austin, TX 78758
[email protected]

Abstract

Mambo is a full-system simulator for modeling PowerPC-based systems. It provides building
blocks for creating simulators that range from purely functional to timing-accurate.
Functional versions support fast emulation of individual PowerPC instructions and the
devices necessary for executing operating systems. Timing-accurate versions add the ability
to account for device timing delays, and support the modeling of the PowerPC processor
microarchitecture. We describe our experience in implementing the simulator and its uses
within IBM to model future systems, support early software development, and design new
system software.

1 Introduction

Full system simulators have emerged during the past decade as viable tools for low-level
system software development and performance evaluation. Earlier, our team adapted the SimOS
simulator platform [8] to support the PowerPC architecture [5]. While our experience was
successful, it also showed the need for an industry-strength implementation that is more
configurable and amenable to the rigors of the software engineering life cycle. Therefore
we started Mambo, a modular full system simulator that is designed from the ground up to
simulate the PowerPC line of processors [6]. The implementation supports different
simulation modes, ranging from functional simulation of the PowerPC instructions, to
cycle-accurate simulation of an entire system. Mambo also includes trace collection and
debugging interfaces to allow detailed analysis of the simulated hardware and software.
Seven processors of the PowerPC line are supported, including the 32-bit embedded 405GP [7]
and the 64-bit 970 PowerPC used in Apple’s new G5 system [1]. The processor support
includes interrupts, debugging controls, caches, busses, and a large number of
architectural features. In addition, Mambo models memory-mapped I/O devices, consoles,
disks, and networks that allow the simulated operating systems to boot and run programs.

To fill our needs, our design stresses modularity and configurability. Modularity is
achieved by an internal structure that features a modern, multithreaded simulation core.
This in turn is enhanced with various programming constructs that support a modular and
highly maintainable design. The constructs implement higher-level abstractions to express
the usual characteristics of simulated systems, such as pipelined execution units, and
programmers use these abstractions to quickly model different system behaviors. Due to this
modularity, our team is able to experiment with simulator enhancements and performance
improvement, and quickly introduce them into the simulator with minimal perturbation to the
production mode operation.

The second feature stressed in the Mambo design is configurability. Mambo is designed as a
collection of configuration features that can be selected to easily define a variety of
processors and devices. Compile time and runtime parameters allow users to configure nearly
every feature of the system being simulated. Compile time options define major features
(such as 32-bit or 64-bit support), while runtime options set fine-grained parameters such
as amount of memory, number of processors, cache geometry, etc. A partial list of
selectable features includes:

- 32-bit or 64-bit processor design.
- Floating point registers and instructions.
- Vector Multimedia Extension (VMX) registers and instructions.
- Hardware Multi-Threading (SMT) [12].
- PCI bus.
- IDE disks.
- Network.
- Caches (L1, L2, L3, and victim).
- Bus.
- Memory.
- UART and console support.
- Hypervisor support.
- Address translation (ERAT, SLB, TLB) [6].
- Uniprocessor or multiprocessor.

The simulator runs on the x86 and PowerPC platforms running a range of operating systems
including Linux, AIX, OS/X, and Windows. It uses Tcl/Tk to provide a command language and
graphical user interface and DiskSim [3] to provide timing-accurate disk models.

We have used Mambo successfully for a variety of purposes, including support of operating
system development, system bringup, characterization of application performance and power
consumption, performance tuning, and pre-hardware application development. In Section 2 we
describe our experience with Mambo in more detail. We then describe in Section 3 the
implementation of the simulation and conclude the paper in Section 4.

2 Experience with Mambo

Like other full system simulators, Mambo has proved useful in software development and
application characterization. In some cases, the simulator served as a platform to enable
software development before the hardware is available. As an example, a team of researchers
at IBM was able to develop the software for Blue Gene/L [11] [4] [2] so that when the
hardware became available, programs were running on the first day, and the system was
usable within a week. Similar uses are also underway for several architectures and systems
under development.

It is noteworthy that Mambo is useful for software development even if the hardware is
available. For example, developing low-level system software such as operating systems on
the bare hardware is time consuming. Mambo includes an interface to gdb, allowing
source-level debugging from the very first instruction of the operating system. gdb
attaches to Mambo so that developers can use the normal gdb interface to debug the
simulated operating system. The simulator can single step through code that cannot normally
be traced in this way, such as an operating system’s first level interrupt handler. A team
of researchers at IBM has used the simulator to support the development of the K42
operating system [10]. In their experience, the simulator has advanced their development
schedule by about a year.

Mambo also can enhance the software-hardware co-design process. For example, new hardware
features such as SMT or hypervisor support can be modeled and low-level system software can
be developed to examine the use of such features before they are finalized into hardware.
Our experience shows that this approach has several benefits that straddle software and
hardware. For example, our experience shows that using Mambo early in the hardware design
process to model the new feature forces the designers to define the feature well enough to
be programmed. The feedback from the model implementation and the software experience with
the feature may uncover errors, missing functionality or areas that were not well
understood. Traditionally, such problems are not uncovered until a detailed VHDL model of
the hardware is built, or even after system software has been implemented on the finalized
hardware platform. For instance, in the early design of a PowerPC processor, Mambo revealed
a race condition that required changing the semantics of several bits in a control
register. Also, the hardware features of a hypervisor design had to be updated based on the
implementation of the operating system on the modeled hypervisor.

The second category of using Mambo is in application characterization. Mambo produces a
variety of statistics, both in summary and detailed form, allowing the performance and
operation of a program to be understood and evaluated for a new hardware architecture. By
associating performance-affecting hardware events (e.g., cache misses, TLB shootdowns, and
memory references) with the program instruction stream, it is possible to identify
under-performing portions of a program and correlate the performance problems with resource
usage. This may allow significant performance improvement by changing a data structure or
the position of an inner loop to reflect the cache architecture. These features provide an
infrastructure for characterizing application and system behavior and performance.

We have extended the characterization to the emerging field of power-aware computing [9].
With the help of power estimates for the various tasks associated with execution of
instructions, an analysis of the total power consumed in the core and memory subsystem can
be carried out. Then, one can use this information to identify opportunities for reducing
processor speed (e.g., during memory-intensive instructions) or modifying the application
structure to reduce power consumption.

3 Implementation

3.1 Operating System Adaptation

While Mambo is capable of booting unmodified operating systems such as Linux, detailed
simulation of peripherals is time intensive to implement and slows down simulation. To
improve run-time when detailed device simulation is not necessary, several changes are made
to the simulated operating system to allow more direct interaction with Mambo. A direct
block driver interface allows disk images on the simulation host to be used by the
simulated operating system, and a virtual Ethernet interface is added that can either
communicate to other simulated hosts or to real networks. Other changes include process
tracking hooks, which interact with Mambo statistics gathering infrastructure.

Figure 1 shows a screenshot of Mambo booting Linux on a PowerPC 750 system. The UART0
window shows the simulated console and the xterm window shows the Mambo command line. Other
windows show the GUI interface and a statistic gathering tool. The GUI ensures ease of use and
quick identification of performance bottlenecks.

Figure 1: Mambo Graphical User Interface during a Linux boot.

3.2 Timing Models

Mambo provides a variety of timing models for software development and for hardware and
software performance evaluation. The simplest timing model assumes each instruction
requires one cycle to execute. Memory accesses are synchronous and instantaneous. This is a
purely functional model, and is useful for software development and debugging when a
precise measure of execution time is not important. Even in this mode, some system features
require timing support. For example, I/O interrupts and timer interrupts are scheduled to
provide at least a crude sense of the passing of time. These inaccuracies are tolerated
given the intended use of the functional model. This use trades accuracy for increased
processing speed. For instance, a functional model of the 405GP processor executing on a
3.2GHz, x86 system can simulate an average of 4 million PowerPC instructions per second.

For accurate performance evaluation, Mambo provides a cycle-accurate timing model. A
cycle-accurate timing model requires a complete modeling of the operation of the processor
including its pipeline and functional units. Each operation takes a number of cycles to
complete and must consider both processing time (the time to search a cache, for example)
and resource constraints (e.g., an instruction cannot be issued to an add unit if that add
unit is already in use). This mode of operation provides good accuracy at the expense of
longer simulation time. A cycle accurate model of the 405GP processor was validated to be
within 0.6% of real hardware, but ran four times longer than the functional model, which
was off by 26% against the real hardware [9]. For more complex processors, the slowdown of
the cycle accurate model compared to the functional model can be 10 times or more.

A compromise between the fast, but inaccurate, functional model and the slower, but
accurate, cycle-accurate model is the cycle-approximate model. This model uses
probabilistic measures to improve timing estimates. For example, a memory reference may (or
may not) hit in the cache. A cache hit takes a different amount of time than a cache miss.
In the cycle-accurate model, it is necessary to model the cache, allowing Mambo to
determine exactly if a particular reference is in the cache. The cycle-accurate model knows
if there is a cache hit or miss. The cycle-approximate model does not model the cache
(hence providing a faster simulation), but probabilistically determines the time for the
access from user-supplied cache hit ratios as well as a predetermined time for a cache hit
and cache miss. We are currently adding this model to the infrastructure.

3.3 Multithreaded Simulator Structure

To simplify the development effort while still accurately modeling hardware events, we
structured Mambo as an internal thread programming model, allowing instruction execution
code to simply pause in place (delay) as necessary. For instance, the main function of a
cache refill request simply looks as follows:

    Cache_Refill(MemAccessStruct *ma)
    {
        DO_DELAY(cache_to_bus_delay);
        Pass_It_To_Bus(ma);
        DO_DELAY(bus_to_memory_controller_delay);
        Pass_It_To_Memory_Controllers(ma);
        DO_DELAY(memory_controller_to_dram_delay);
        Pass_It_To_Dram(ma);
        DO_DELAY(dram_delay);
        Pass_Back_To_Memory_Controller(ma);
        DO_DELAY(memory_controller_to_bus_delay);
        Pass_Back_To_Bus(ma);
        DO_DELAY(bus_to_cache_delay);
        Pass_Back_To_Cache(ma);
    }

DO_DELAY() is a call to inform the thread scheduler that this thread needs to be delayed
for the given number of cycles. Because Mambo’s thread model allows simulators to be coded
in the way that preserves hardware behavior, it can drastically reduce programming effort
and the resulting code is very easy to understand.

Mambo’s thread model is implemented completely at user level. Switching between Mambo
threads is almost as efficient as switching between events, but should occur much less
frequently compared to a pure event-driven simulator.

The thread programming model has also introduced a host of other programming constructs to
express the constraints of, for example, a pipeline processor model or a multi-level cache.
All abstractions are built on a small set of low-level (familiar) mechanisms, most notably,
gates, counters, avals, and ports.

A gate is used to express resource limitations by restricting concurrency. A gate is
created with a given width. A thread can enter then leave a gate. The gate only allows
width threads simultaneously (in simulated time).

A counter is used as an event signaling mechanism. Threads can set, increment, decrement
the value of a counter. A thread can block for the counter to have a zero count.

An aval (active value) is used to implement constructs like register renaming. A thread can
declare ownership of an active value. Any thread that attempts to access this value can
provide a counter that gets decremented when the owner sets the final value.

A port is the preferred mechanism for forking concurrent activities. A port is associated
with a handler function and a dynamic pool (fixed or variable size) of worker threads.
Sending a message to a port schedules it for processing by a worker thread and returns
asynchronously to the caller. Counters can be used to synchronize different aspects of the
interaction between the caller and the worker.

Since simulation models always use those constructs to express their dynamic behavior and
timing characteristics, we have the freedom to vary the implementation to achieve different
objectives. Indeed, we are currently exploring different models of multi-threaded (using
one way messages) and distributed (using backwards recovery techniques) implementations.

3.4 Performance Evaluation Infrastructure

The performance of a program on a PowerPC system, even one that does not yet exist, can be
determined by running it on Mambo. By booting an operating system, such as Linux, the
application can be executed and timed using standard timing tools running on the simulated
system, including operating system interactions.

Alternatively, applications can be run in "standalone" mode, where all operating system
functions are supplied by Mambo, and normal OS effects, such as paging and scheduling do
not occur. This provides information that is more directly a result of the intrinsic
program design and implementation. This is useful for application’s performance and power
characterization.

Mambo also provides its own timing measures. In addition to simple cycle count information,
and summary statistics, Mambo provides an "emitter" data stream. To enhance modularity and
usability, we decoupled the performance analysis toolset from the simulator implementation.
Rather than build into Mambo a large set of performance analysis tools, Mambo has been
designed to generate a stream of events. The specific events, such as instruction
execution, memory reference addresses and contents, TLB hits and misses, cache hits and
misses, and so on, are emitted into a circular queue in shared memory where they can be
read by other programs, called "emitter readers."

The events that Mambo puts in the emitter data stream are selectable by the user at
run-time. In addition, events deemed "uninteresting" for a particular run or purpose can be
ignored by emitter readers. Thus, a user can select a large set of events to be emitted and
simultaneously run several emitter readers that process the emitter data stream looking for
the events of interest to them. Emitter readers can compute summary information (range,
average, standard deviation, and histogram of execution times), or can display the events
in real time.

Another approach is to define an emitter reader that converts the Mambo emitter data stream
(or some subset of it) to a pre-existing trace format. This new trace can then be fed into
existing analysis tools. Multiple trace formats can be supported simply by writing new
emitter readers, a relatively simple task. Mambo itself requires no changes for the
additional formats.
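
The emitter arrangement described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the record layout, the EV_* event masks, and the names Event, EmitterQueue, EmitterReader, emit, and reader_drain are hypothetical and are not Mambo's actual interface. An ordinary in-process array stands in for the shared-memory segment, and the cross-process synchronization a real implementation would need is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event kinds, one bit each so they can be combined into masks. */
enum {
    EV_INSTR      = 1u << 0,
    EV_MEM_REF    = 1u << 1,
    EV_TLB_MISS   = 1u << 2,
    EV_CACHE_MISS = 1u << 3
};

/* Hypothetical event record; the text does not give the real layout. */
typedef struct {
    uint32_t kind;   /* exactly one EV_* bit */
    uint64_t pc;     /* program counter of the simulated instruction */
    uint64_t cycle;  /* simulated cycle of the event */
} Event;

#define QUEUE_SLOTS 8  /* deliberately tiny so wrap-around is easy to exercise */

/* Circular queue written by the simulator (stand-in for shared memory). */
typedef struct {
    Event    slot[QUEUE_SLOTS];
    uint64_t head;     /* total number of events emitted so far */
    uint32_t enabled;  /* event kinds selected by the user at run time */
} EmitterQueue;

/* Per-reader state: each reader keeps its own cursor and its own filter. */
typedef struct {
    uint64_t tail;      /* sequence number of the next event to read */
    uint32_t interest;  /* event kinds this reader wants */
    uint64_t seen;      /* matching events processed */
    uint64_t lost;      /* events overwritten before this reader got to them */
    uint64_t last_pc;   /* pc of the most recent matching event */
} EmitterReader;

/* Simulator side: append an event if its kind is enabled for this run. */
void emit(EmitterQueue *q, uint32_t kind, uint64_t pc, uint64_t cycle)
{
    if (!(q->enabled & kind))
        return;  /* kind not selected, nothing enters the stream */
    Event *e = &q->slot[q->head % QUEUE_SLOTS];
    e->kind = kind;
    e->pc = pc;
    e->cycle = cycle;
    q->head++;
}

/* Reader side: consume everything new, ignoring uninteresting kinds. */
void reader_drain(const EmitterQueue *q, EmitterReader *r)
{
    /* If the writer lapped this reader, skip to the oldest surviving event. */
    if (q->head - r->tail > QUEUE_SLOTS) {
        r->lost += q->head - r->tail - QUEUE_SLOTS;
        r->tail = q->head - QUEUE_SLOTS;
    }
    assert(r->tail <= q->head);
    while (r->tail < q->head) {
        const Event *e = &q->slot[r->tail % QUEUE_SLOTS];
        r->tail++;
        if (!(e->kind & r->interest))
            continue;  /* "uninteresting" for this reader */
        r->seen++;
        r->last_pc = e->pc;
    }
}
```

In this sketch a summary tool and a trace converter would simply be two EmitterReader instances draining the same queue with different interest masks; the converter would write each matching record out in the target format instead of counting it, and the emitting side would not change.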
For enhanced usability, performance analysis can be provided by the GUI based emitter
readers. These provide graphs of memory access, cache misses, processor resource usage, and
even power usage [9], displayed against time. Since the emitter stream includes the program
counter, it is possible to trace interesting performance events, such as high cache miss
rates, back to the specific instruction of the simulated program and even to the specific
lines of source code.

4 Conclusions

Mambo is a full system simulator that has proved useful in supporting low-level system
software development and characterizing applications’ performance and power consumption. We
have used the simulator successfully within IBM to support various projects, including the
Blue Gene supercomputer and the K42 operating system, among several others. The simulator
features a modern core based on multithreading and high-level abstractions that support a
high degree of modularity, configurability, and ease of use. Users of the simulator report
substantial reduction in development times, increased insight into the hardware design
process, and successful characterization of application performance and power consumption.

While Mambo is not open source software, it is freely available through a special license
to parties outside IBM on an as-is basis. At the time of this publication, it has been
licensed to 8 companies and over 25 academic institutions.

Acknowledgements

This work was partially supported by the Defense Advanced Research Projects Agency,
Department of Defense under contracts F33615-00-C-1736, F33615-03-C-4106, NBCH30390004, and
NBCHC020056. Further support was provided by various IBM divisions. PowerPC is a trademark
of IBM. X86 is a trademark of Intel Corp. Windows is a trademark of Microsoft Corp. We
acknowledge all trademarks referenced herein to be the property of their owners.

References

[1] Apple Computer Inc. Apple Power Mac G5, 2004.
[2] L. R. Bachega, J. R. Brunheroto, L. DeRose, P. Mindlin, and J. E. Moreira. The
BlueGene/L Pseudo Cycle-accurate Simulator. In Proceedings of the IEEE International
Symposium on Performance Analysis of Systems and Software (ISPASS), March 2004.
[3] J. S. Bucy and G. R. Ganger. The disksim simulation environment version 3.0 reference
manual. Technical Report CMU-CS-03-102, Carnegie Mellon University, 2003.
[4] L. Ceze, K. Strauss, G. Almasi, P. J. Bohrer, J. R. Brunheroto, C. Caşcaval,
J. G. Castaños, D. Lieber, X. Martorell, J. E. Moreira, A. Sanomiya, and E. Schenfeld. Full
Circle: Simulating Linux Clusters on Linux Clusters. In Proceedings of the Fourth LCI
International Conference on Linux Clusters: The HPC Revolution 2003. Springer-Verlag, June
2003.
[5] IBM Austin Research Lab. SimOS-PowerPC web page. Available at
http://www.research.ibm.com/arl/projects/SimOSppc.html, 2000.
[6] IBM Corporation. The PowerPC Architecture: A Specification for a New Family of
Processors. Morgan Kaufmann Publishers, Inc., 1994.
[7] IBM Corporation. PowerPC 405GP Embedded Processor User’s Manual. IBM Corporation, 2000.
[8] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete Computer Simulation: The
SimOS Approach. In IEEE Parallel and Distributed Technology, Fall 1995.
[9] H. Shafi, P. Bohrer, J. Phelan, C. Rusu, and J. Peterson. Design and validation of a
performance and power simulator for PowerPC systems. IBM Journal of Research and
Development, 47(5/6):641–652, 2003.
[10] C. A. N. Soules, J. Appavoo, K. Hui, R. W. Wisniewski, D. Da Silva, G. R. Ganger,
O. Krieger, M. Stumm, M. Auslander, M. Ostrowski, B. Rosenburg, and J. Xenidis. System
support for online reconfiguration. In USENIX Annual Technical Conference, pages 141–154,
2003.
[11] The BlueGene/L Team. An Overview of the BlueGene/L Supercomputer. In Proceedings of the
2002 ACM/IEEE conference on Supercomputing, Nov 2002.
[12] R. Thekkath and S. Eggers. The effectiveness of multiple hardware contexts. In
International Conference on Architectural Support for Programming Languages and Operating
Systems, 1994.
