An FPGA-Based Pentium in A Complete Desktop System: Shih-Lien L. Lu Peter Yiannacouras Taeweon Suh
Rolf Kassa,
Michael Konow
Intel Corp.
In this work we emulate a version of a commercial x86 desktop processor on an FPGA to run real operating systems on stock hardware. To be precise, we have replaced a Pentium® microprocessor in its standard socket on a stock motherboard with a single Xilinx Virtex-4 LX200 FPGA which implements the Pentium® core. The stock motherboard with a standard socket is underclocked at 25 MHz, and all system components such as memory, graphics card, CDROM, hard disk, USB devices, mouse and keyboard can be operated at the same relative speeds as in an original system. Most importantly, our FPGA-based Pentium® emulation system gives us the ability to run real operating systems, such as Fedora Core 4, Red Hat 9, and Windows XP, on the FPGA while interacting with real hardware components.

The FPGA-based Pentium® desktop system provides a powerful tool for the exploration and customization of future microprocessors. Although the system being emulated does not contain a state-of-the-art microprocessor, its applicability to modern architectural research has recently spiked due to the successful arrival of chip multi-processors (CMPs). As the number of cores in a CMP increases, system-level architectural decisions are becoming more important. Our emulation system has already been expanded to a multiprocessor system by using available dual-processor motherboards, though that work is still in progress.

In this work we make the following contributions: (i) we analyze the Pentium® core implementation on the Virtex-4 FPGA and crudely contrast it to its implementation using the silicon technology of its commercial debut; (ii) we perform preliminary architectural enhancements which demonstrate the emulator's ability to measure the effect of microarchitectural changes on the complete system using the SPEC2000 integer benchmarks (specifically, we parameterize the branch target buffer and the L1 cache); and (iii) we experiment with adding hardware accelerators such as AES and DES.

The ability to place desktop microprocessors on an FPGA device and have it execute consumer applications has significant ramifications for the FPGA community. It may not be feasible for desktop processors to be hosted on FPGAs commercially, but with academia and industry embracing the concept as a research vehicle, at the very least, researchers will discover innovative ways to use the programmable FPGA fabric (for example by adding custom instructions or parameterizing parts of the architecture), which may then pave the way for FPGA fabric to be tightly integrated into ordinary desktop processor devices. Also, it provides an interesting point of comparison, allowing us to benchmark modern FPGA technology against twelve-year-old transistor-based silicon technology.

The remaining sections of this document summarize related work and relevant background in Section 2, describe the Pentium® emulation system in more detail in Section 3, outline the implementation of our architectural enhancements in Section 4, discuss the area/speed effects of the architectural implementations in Section 5, and then conclude in Section 6.

2. BACKGROUND
The concept of using FPGAs to more quickly and more accurately explore the microprocessor design space has recently gained traction, causing publications on the topic to multiply [14]. Some of this work focusses on accelerating simulation times by offloading highly detailed resource modelling into the FPGA while a software simulator remains the core of the emulation environment [7]. Other research often focusses on a single architectural novelty (for example transactional parallel systems [17], caching [19], vector-thread processors [16]) and builds FPGA-based models of the relevant hardware. Contrary to both these approaches, we implement the complete microprocessor on an FPGA, making the entire processor architecture flexible.

Complete RTL models of microprocessors have already become available for the SPARC V8 [1], Niagara [3], and PowerPC [4]. These cores can be synthesized to FPGA and are designed to facilitate design space exploration as seen by Jones et al. [15]. However, to the best of our knowledge, we are the first to employ such a core in a real desktop system with real hardware peripherals capable of hosting real and modern operating systems. Our emulation platform also provides several orders of magnitude of simulation time speedup over software emulators such as Simics [8] and SimOS [20].

An abundance of research already exists in the embedded domain which applies customization to an FPGA-based core. The fruitfulness of application-specific microarchitectural variation was seen in [22] and its automatic navigation in [21]. In addition, the effect of including custom instructions into such cores was explored [6]. While our work is similar in spirit to these works, we differentiate ourselves by focussing on the desktop domain and emphasizing peripheral and operating system interaction.

3. THE FPGA-BASED PENTIUM® EMULATION SYSTEM
The complete emulation environment consists of four main components: (i) the FPGA which hosts the Pentium® processor; (ii) the hardware including motherboard and peripherals; (iii) the software/operating system; and (iv) the necessary FPGA CAD software required to implement the FPGA design. We discuss each of these four items in further detail.

3.1 The Processor
The processor used in our emulation system is the original Pentium®, which is the desktop processor released after the 486 and before the Pentium Pro®. The 3.3 million transistor processor was released in 1994 in a 0.6 micron technology and was originally clocked at 75 MHz [13]. It is a 32-bit in-order 5-stage dual-pipeline processor supporting the IA32 instruction set including floating point instructions using an on-chip pipelined floating-point module. It is equipped with two separate on-chip 8 KB 2-way set associative level 1 caches for data and instructions and implements the MESI protocol for use in multiprocessor environments. It also includes dynamic branch prediction using a 256 entry predictor table and branch target buffer.

A 3-level stacked board houses the FPGA and necessary circuitry. The first level contains the pin/power conversion between the motherboard and FPGA, allowing it to be plugged directly into the motherboard. The second level contains the FPGA itself, and the top level contains the programming circuitry for the FPGA. The FPGA used to host the Pentium® is a Xilinx Virtex-4 LX200 90
3.3 The Operating Systems
The most powerful feature of our FPGA-based system is its ability to boot real operating systems. We successfully installed unmodified versions of Fedora Core 4, Red Hat 9, and Windows XP on the Pentium®; the installation procedure was no different than on any typical desktop system. In terms of performance and usability, it takes approximately 10 minutes to boot Fedora Core 4 without a GUI. Command shells and text editors such as vim operate just as expected on a modern computer system, and GCC can compile small programs in seconds. Typing is done at full speed, and searches through normal-sized text files complete with unnoticeable latency. In summary, the system is perfectly usable as a desktop computer for very simple non-graphical applications.
Figure 1: Image of the FPGA-based processor emulator system equipped with standard hardware peripherals, a Xilinx Virtex-4 device in place of a microprocessor chip, all running Windows XP.

3.4 FPGA Development
To synthesize the Pentium® we use Synplify Pro 8.5.1 for high-level synthesis of the VHDL and then use Xilinx ISE 8.1i for placement and routing onto the Virtex-4 device. The entire process takes between 10 and 20 hours to synthesize, map, place, route and generate a bitstream, followed by an additional 20 seconds to download the bitstream to the device. This turnaround time is orders of magnitude quicker than the fabrication time for a silicon implementation of the processor which could be inserted directly on the motherboard. In terms of debugging, Modelsim 6.1 is used to simulate the VHDL in lockstep with a software simulator which models the original behaviour of the processor. A suite of regression tests is used to ensure the processor is still a functional x86 machine. The regression tests are a subset of those used to verify the original Pentium®.
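The lockstep debugging flow described above can be pictured as a step-by-step comparison of two execution traces. The following is a minimal sketch, not the authors' tooling: it assumes that both the RTL simulation and the reference software model can be reduced to a sequence of per-instruction records, and the names `TraceStep` and `first_divergence`, as well as the sample traces, are hypothetical.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class TraceStep:
    """One retired instruction: program counter plus a register snapshot."""
    pc: int
    regs: Tuple[int, ...]


def first_divergence(rtl: Iterable[TraceStep],
                     ref: Iterable[TraceStep]) -> Optional[int]:
    """Return the index of the first step where the traces differ, or None."""
    for i, (a, b) in enumerate(zip_longest(rtl, ref)):
        # zip_longest pads the shorter trace with None, so a length
        # mismatch is reported as a divergence at the first missing step.
        if a != b:
            return i
    return None


# Hypothetical traces, for illustration only.
reference = [
    TraceStep(0x100, (0, 0)),
    TraceStep(0x102, (1, 0)),
    TraceStep(0x105, (1, 2)),
]
rtl_ok = list(reference)
rtl_bad = [reference[0], reference[1], TraceStep(0x105, (1, 3))]
```

In a regression setting, a run would pass when `first_divergence` returns `None`; otherwise the returned index points at the first instruction worth inspecting in the waveform.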
32 KB 8-way set associative caches. The LRU replacement policy, which determines which line gets evicted within a full set, was also expanded to handle the sets of 8 cache lines. Both instruction and data caches can be individually configured to either the 8KB or 32KB versions, but in this work we always keep them the same size.

4.3 Integrating AES and DES Crypto Engine
We integrated two crypto-engines into the Pentium®: Advanced Encryption Standard (AES) and Data Encryption Standard (DES). Security has more recently become a critical requirement in many computing areas such as network security and digital rights management. To support such security requirements and maximize system performance, security-enhanced processors are preferred and becoming available in the market [12]. In our approach we integrate custom instructions for accelerating encryption and decryption directly into the processor.

We retrieved AES and DES intellectual property (IP) cores from Opencores [2]. The AES core implements the Rijndael algorithm and takes a 128-bit key and a 128-bit plaintext/cyphertext for encryption and decryption, respectively. The DES core takes a 56-bit key and 64-bit plaintext/cyphertext for encryption and decryption, respectively. In our implementation, we extended the x86 ISA to integrate AES and DES engines by creating new Model-Specific Registers (MSRs), a set of hidden registers usually used to capture debug/performance information, which are accessible only by two privileged instructions called rdmsr and wrmsr, for reading and writing respectively. We can use the MSRs to provide communication with the crypto-engines. That is, the encryption/decryption is executed by sending data to the appropriate crypto-engine by writing to our newly created MSR(s) via the wrmsr instruction; the corresponding cyphertext or plaintext result can then be read from the crypto-engine via the rdmsr instruction. Similarly, control information is sent to the crypto-engines using another MSR. For example, users can choose the configuration such as AES or DES, encryption or decryption, and key or input data. This approach reduces the access latency by avoiding the comparably expensive bus accesses that would be needed had the engine been a co-processor connected through the bus.

Implementing the new MSRs involved several changes. First, the actual MSRs and the necessary logic to access them were inserted into the VHDL design. Second, the privilege protection checks were removed from rdmsr and wrmsr, allowing us to access the crypto-engines from user space rather than through the operating system. Finally, many optimizations were required to improve the execution speed of these instructions, since rdmsr and wrmsr are generally very slow instructions. With all these modifications we achieved a communication overhead of only 6 cycles between the processor and the crypto-engines (the engines were clocked at the same CPU frequency though capable of much higher clock rates). The entire design time was less than two weeks for this change and involved modifications to the microcode in addition to VHDL changes to only one isolated component.

Table 1: Virtex-4 resource utilization by the unmodified Pentium®.

Resource     Number used    Percent used
4-LUTs       65615          37%
Registers    26859          15%
Slices       41438          46%
DSP48s       29             30%
BRAMs        118            35%

5. EXPERIMENTING WITH THE PENTIUM® SYSTEM
In this section we analyze and benchmark the FPGA-based Pentium® system to extract the following results: (i) an area breakdown of the Pentium® as reported by the CAD flow; (ii) a comparison between the original branch target buffer and our expanded version; (iii) a comparison between the original 8KB L1 cache and our expanded 32KB L1 cache; (iv) an analysis of the crypto-engine hardware accelerator. We examine each of these in more detail. Note that we report on area in terms of Virtex-4 resources but are cognizant that these results may not predictably map to a real silicon implementation. Nonetheless the area analysis can be used for first-order approximations.

5.1 Area Breakdown of the Pentium®
We synthesized the Pentium® VHDL to the Virtex-4 LX200 and noticed that less than half of the device resources were used; the corresponding data, taken after high-level synthesis and technology mapping were completed, is shown in Table 1. Only 37% of the LUTs were used to implement all the logic for the Pentium®; however, they were distributed through 46% of the slices. Also, 35% of the block RAMs were utilized (distributed RAMs are counted as 4-LUTs). With more than half of the resources still available, there exists sufficient space on the device for expanding and augmenting the Pentium®.

Figure 3 shows the breakdown of each Virtex-4 resource used by different units in the processor; the data was collected from the synthesis results reported by Synplify Pro. All of the DSP48s (multipliers) were used by the floating point unit, and nearly all of the block RAMs were divided amongst the instruction cache, data cache, and microcode units. The Virtex-4 LUTs were used mostly by the FPU, ALU, address generation, and caches. The entire memory hierarchy (including the caches and bus interface) claimed approximately 45% of the LUTs used, suggesting that even when considering only logic, almost half of the chip is devoted to communication, leaving the other half for control and actual computation.

Figure 3: Breakdown of FPGA resources used by different parts of the Pentium® architecture. [Stacked bar chart over RAMs, DSP48s, registers and LUTs; units shown: floating-point, address generation, ALU, pipeline control, microcode, decode, bus unit, I-cache, D-cache.]

Figure 4: Performance increase of the doubled branch target buffer on SPEC2000 integer benchmarks. [Bar chart of % speed improvement per benchmark; overall 5.35%.]

Although synthesizable, the Pentium® VHDL was not designed for mapping to an FPGA. Recent work [10] suggested that a processor designed specifically for synthesis to an FPGA can be more than an order of magnitude smaller than a generically written mostly-behavioural VHDL processor. While our processor has had some manual tweaking to guide its mapping to some FPGA resources, we also believe that the resource usage of the Pentium® can be significantly reduced by more carefully mapping structures to the resources in the FPGA. Of particular note is the mapping to block RAMs. The interconnection between large
numbers of under-utilized BRAMs is a major contributor to both the speed and area overhead. Multiple BRAMs are often required due to limitations on the number of ports or the width of the ports. Re-architecting the processor to better utilize the block RAMs may be of great benefit to the FPGA design.

In spite of the core's ill-suitedness for FPGA design, it still provides an interesting point of comparison for FPGAs as a platform. Recent work [18] has measured FPGAs to be 3x slower in speed and 35x larger in area compared to a standard cell ASIC flow with both using 90nm technology. With some simple and crude calculations we can attempt to do the same with the 12-year-old Pentium®. The FPGA-based core is clocked at 25 MHz compared to the 75 MHz it originally ran at 12 years ago, meaning the 90nm FPGA is already 3x slower than the older 600nm silicon technology. Accounting for the generation gap can only be crudely estimated: assuming modern 90nm desktop processors run up to 3.8GHz and have 5x the number of pipeline stages (and hence 5x the clock rate), we extrapolate and say that our Pentium® core would be clocked at 760 MHz in a 90nm process (3.8 GHz / 5), approximately 30x faster than its 90nm FPGA counterpart. Although crude, the above analysis suggests that highly optimized transistor designs can outperform an FPGA by far more than the 3x expected of a push-button FPGA flow.

With respect to area, we estimate that the number of transistors on the Virtex-4 LX200 is greater than 500 million. Since the 3.3 million transistor Pentium® used about 35% of these (we assume the number of transistors used is proportional to the LUT and BRAM usage in Table 1), the FPGA required 53x more transistors than the actual processor (0.35 × 500 million / 3.3 million ≈ 53). Although this is also very crude, and even coupled with the fact that transistor count is not an accurate measurement of area, the outcome agrees with our expectation of seeing higher overheads, since the previously published results used a synthesis-based standard cell flow without manual optimization.

5.2 Comparing Branch Target Buffer Sizes
Doubling the branch target buffer should give the processor twice the accuracy in predicting taken indirect jumps. This modification was a simple warm-up exercise requiring only an extra block RAM and a small amount of logic, most likely a side-effect of the randomness in the CAD algorithms. The performance of the expanded BTB was measured across all SPEC2000 integer benchmarks. Since the system is in fact real, the time to complete a single benchmark run is non-deterministic and takes almost a day, making it difficult to average out the non-determinism. As such, some of the real speed improvements remain hidden in the noise inherent in the real system.

Figure 4 shows significant speed improvements of up to 11% for parser. vpr and perlbmk also benefit largely from the increased predictor accuracy. On average the expanded BTB provides a 5.35% speed improvement, which is quite significant for such a small change.

5.3 Comparing Level 1 Cache Size
Figure 5 shows the additional FPGA resources consumed from growing the L1 caches from 8KB (2-way) to 32KB (8-way). Almost 25% more logic was necessary for the expansion as well as more than 50% more block RAMs, making this growth in L1 cache very expensive with respect to area. In addition to the area cost, the place and route time is more than doubled. Nonetheless the performance benefit is quite substantial.

Figure 6 plots the performance improvement of the expanded L1 cache for each SPEC2000 integer benchmark. An average of 16% performance improvement is achieved, with benchmarks such as crafty reaching as high as 40%. Although there are a myriad of cache studies, we believe this work is unique in capturing operating system effects such as cache flushes and preemption while sustaining high simulation speeds.

5.4 Evaluating the Crypto-Engine
The AES takes only 12 CPU cycles to finish its computation for encryption/decryption, and the DES takes 16 CPU cycles, both significantly faster than a software implementation. The best known software implementation for AES written specifically for the same Pentium® executes in 320 cycles [11]. This results in an execution speedup of 27x (320 / 12 ≈ 27) for our custom crypto-engine versus the best software implementation. Table 2 summarizes the resource utilization of the AES and DES engines on the Virtex-4 FPGA
and shows that the logic requirement is very small but a substantial number of BRAMs were required. Nonetheless, for secure environments the extra resources would be well worth the performance improvement.

Table 2: Virtex-4 resource utilization of the AES and DES IP cores.

Resource     Number used
4-LUTs       2347
Registers    1319
DSP48s       0
BRAMs        72

[Figure 5: bar chart of area increase (%); one bar labelled 50.85%.]

Figure 6: Performance increase of the 32KB 8-way L1 caches versus the 8KB 2-way L1 caches. [Bar chart per SPEC2000 integer benchmark.]

6. CONCLUSION
The FPGA-based Pentium® emulator is a powerful tool for researching desktop processor architectural enhancements. Its ability to quickly prototype architectural changes and measure their effects at the application level in the presence of a real operating system provides a more realistic research tool without the high costs and long design times associated with actually creating a silicon implementation. Such a system can be used to achieve newer heights of efficiency by optimizing across the entire system stack: architecture, instruction set, device drivers, operating systems, and applications, without the prohibitive simulation times of a software simulator.

7. REFERENCES
[1] LEON SPARC. http://www.gaisler.com.
[2] Opencores.org. http://www.opencores.org.
[3] OpenSPARC. http://opensparc.sunsource.net/.
[4] PowerPC. http://www.power.org.
[5] T. Austin and D. Burger. The SimpleScalar Tool Set Version 3.0, 1998.
[6] P. Biswas, S. Banerjee, N. Dutt, P. Ienne, and L. Pozzi. Performance and Energy Benefits of Instruction Set Extensions in an FPGA Soft Core. In IEEE International Conference on VLSI Design (VLSID). IEEE, 2006.
[7] D. Chiou, H. Sunjeliwala, D. Sunwoo, J. Xu, and N. Patil. FPGA-based Fast, Cycle-Accurate, Full-System Simulators. In Workshop on Architecture Research using FPGA Platforms in the 12th International Symposium on High-Performance Computer Architecture, 2006.
[8] P. S. M. et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50-58, 2002.
[9] G. Gibeling, A. Schultz, and K. Asanovic. RAMP: The RAMP Architecture and Description Language. Technical Report, 2006.
[10] G. Gibeling and J. Wawrzynek. A Universal Processor for RAMP. Technical Report, 2006.
[11] L. Granboulan. AES Timings of the Best Known Implementations. http://www.di.ens.fr/~granboul/recherche/AES/timings.html, 2000.
[12] Hifn. 4450 HIPP III Storage Security Processor, 2006.
[13] Intel. The Pentium Datasheet, 1997.
[14] International Symposium on High-Performance Computer Architecture. Workshop on Architecture Research using FPGA Platforms, San Francisco, California, 2005.
[15] P. Jones, S. Padmanabhan, D. Rymarz, J. Maschmeyer, D. V. Schuehler, J. W. Lockwood, and R. K. Cytron. Liquid Architecture. In International Parallel and Distributed Processing Symposium: Workshop on Next Generation Software, 2004.
[16] J. Kasper, R. Krashinsky, C. Batten, and K. Asanovic. A Parameterizable FPGA Prototype of a Vector-Thread Processor. In Workshop on Architecture Research using FPGA Platforms in the 11th International Symposium on High-Performance Computer Architecture, 2005.
[17] C. Kozyrakis and K. Olukotun. ATLAS: A Scalable Emulator for Transactional Parallel Systems. In Workshop on Architecture Research using FPGA Platforms in the 11th International Symposium on High-Performance Computer Architecture, 2005.
[18] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. In FPGA '06: Proceedings of the 2006 international symposium on Field-programmable gate arrays. ACM Press, 2006.
[19] S.-L. Lu, E. Nurvitadhi, J. Hong, and S. Larsen. Memory Subsystem Performance Evaluation with FPGA based Emulators. In Workshop on Architecture Research using FPGA Platforms in the 11th International Symposium on High-Performance Computer Architecture, 2005.
[20] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete Computer System Simulation: The SimOS Approach. IEEE parallel and distributed technology: systems and applications, 3(4):34-43, Winter 1995.
[21] D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, and D. Tullsen. Application-Specific Customization of Parameterized FPGA Soft-Core Processors. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD). ACM Press, 2006.
[22] P. Yiannacouras, J. G. Steffan, and J. Rose. Application-Specific Customization of Soft Processor Microarchitecture. In FPGA '06: Proceedings of the 2006 international symposium on Field-programmable gate arrays. ACM Press, 2006.