
Parallel Compilation for a Parallel Machine

Thomas Gross, Angelika Zobel, and Markus Zolg


School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

An application for a parallel computer with multiple, independent processors often includes different programs (functions) for the individual processors; compilation of such functions can proceed independently. We implemented a compiler that exploits this parallelism by partitioning the input program for parallel translation. The host system for the parallel compiler is an Ethernet-based network of workstations, and different functions of the application program are compiled in parallel on different workstations. For typical programs in our environment, we observe a speedup ranging from 3 to 6 using not more than 9 processors. The paper includes detailed measurements for this parallel compiler; we report the system overhead, implementation overhead, as well as the speedup obtained when compared with sequential compilation.

This research was supported in part by the Defense Advanced Research Projects Agency (DOD), monitored by the Air Force Avionics Laboratory under Contract F33615-81-K-1539, and Naval Electronic Systems Command under Contract N00039-85-C-0134, and in part by the Office of Naval Research under Contracts N00014-80-C-0236, NR 048-659, and N00014-85-K-0152, NR SDRJ-007. Markus Zolg is now with Siemens AG, Central Technology Division, Munich, West Germany.

1. Introduction
Research in compiler optimization has led to better methods to translate programs for high-performance systems and has expanded the class of machines for which code can be generated automatically. Optimizing compilers can now produce highly efficient code for multi-processors, vector-processors, and wide-instruction-word processors, but these compilers can take a long time to compile a program. In our environment, compilation times measured in hours are not unusual. This is not surprising considering the number of hard problems that have to be solved in the process of compiling programs for supercomputers with multiple pipelined functional units, complicated memory systems, and unusual register organizations. These architectural features give a compiler an opportunity to produce good (and sometimes even optimal) code, but determining the appropriate code sequence can be expensive. Furthermore, various important optimizations (like loop unrolling, procedure inlining, or trace scheduling) increase the size of the program to be compiled and thereby make a bad situation even worse.

Our need to speed up compilation, combined with the desire to use a parallel system for an application outside of the domain of scientific computing, led us to investigate parallel compilation. Compilers are non-trivial programs, and mapping a compiler onto a parallel system provides a non-trivial test case for the problems encountered when developing realistic applications for a parallel system.

In this paper, we start with a brief survey of parallel computing to motivate our algorithm and implementation strategy. We then sketch the actual implementation of a parallel compiler. The bulk of the paper is devoted to a discussion of the results: we measured the performance of the system for various example programs and discuss in detail which factors determine the speedup in practice.

2. Parallel programming
There are two aspects of parallel programming that we have to consider if we want to implement a parallel compiler. First, a suitable algorithm has to be found that allows compilation to proceed in parallel. That is, we must find a suitable model to structure the compiler, and this task can benefit from previous work on models for parallel computation.

The second aspect is concerned with achieving a real speedup in practice. Although we might find an elegant algorithm that partitions the compilation process, this algorithm might not result in an observable speedup. We have to investigate how well the parallel algorithm maps onto existing parallel systems, and which factors limit the measured speedup.

2.1. Models for parallel programming
There are two different sources of models for parallel programming. One set of models has been used primarily by researchers in theoretical computer science to formulate algorithms for parallel machines. A second set, more appropriately called usage models, is based on experience with parallel machines and attempts to capture the mapping strategies that worked well in the past.

2.1.1. Theoretical models
The most common model of parallel computation in parallel algorithms theory is the parallel random-access machine (PRAM), in which it is assumed that each processor has random access in unit time to any cell of a global memory. Variants of the PRAM model differ in the types of global memory access supported (that is, to what extent concurrent read and/or concurrent write operations involving the same memory cell are allowed). Other models of parallel computation are boolean circuits with varying fan-in/fan-out conditions and alternating Turing machines. There exist numerous algorithms for these models; for example, see [8] for an overview of parallel algorithms for shared memory machines.

While theoretical models of parallel computation are a tool for thinking about parallelism, some of the underlying assumptions are unrealistic. For instance, it is assumed that I/O happens in unit time and that the number of processors is unlimited. No existing parallel system exhibits characteristics that are close to those assumptions, and so while these models may be adequate to reason about parallel algorithms, they cannot provide a basis for the actual implementation of a parallel application system.

2.1.2. Usage models
Users of parallel computers have discovered a number of computational models that can guide in the selection of a mapping strategy or programming style [10, 7]. Three of the most successful usage models are data partitioning, computation partitioning and pipelining.

The idea behind data partitioning is that the input data is distributed to multiple processors. Each processor performs the same computation on its corresponding portion of data in parallel, using only the data local to this processor, and each processor produces a corresponding portion of the output data set. Although all processors perform the same computation, this model is not restricted to the case that all processors execute the same instruction at the same time. If data partitioning is implemented on a system with multiple independent processors, then the implementor is given large freedom in deciding at what level to partition the data set. One advantage of the data partitioning model is that it is often easy to develop a parallel algorithm from a sequential one, because data partitioning exploits the parallelism inherent in an application.

Another option is to partition the computation into separate independent phases. An example is domain decomposition, where each processor first computes a solution for a subproblem; these solutions are then used to compute the final result.

Pipelining divides a computation into k phases P_i such that the output of phase P_i serves as input to phase P_{i+1}. Each processor performs one of the k phases. Parallelism is obtained by working on k instances of a problem simultaneously.

2.2. Effective speedup
There are several ways to evaluate an effort to implement a parallel compiler. The metric of success that we wish to employ is the speedup achieved: how much faster does a program compile when using the parallel compiler, compared to the sequential version that is commonly in use. This metric is most appropriate for real world situations because it relates directly to the performance observed by a user. Using this metric requires measurements on a real system; it is not sufficient to analyze the cost of an algorithm for an abstract machine model.
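
Written as a formula (the notation here is ours, not the paper's), this effective speedup on N workstations is simply the ratio of elapsed, wall-clock compilation times:

    S(N) = T_elapsed(sequential) / T_elapsed(parallel, N)

A value of S(N) greater than 1 means the user actually waits less for the compilation to finish.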

Even if an algorithm has been carefully designed, it is still not clear that parallelism yields better performance in practice. Factors that influence the performance of a parallel algorithm are:

- Cost of inter-process communication and of synchronization; this cost depends on the underlying system architecture as well as on the actual use of the communication and synchronization primitives supported by the architecture.

- Overhead of creating and managing parallel processes, and other costs imposed by the host operating system.

- Hardware limitations that result in bottlenecks during program execution, e.g. bus contention, performance of the I/O subsystem (e.g. the paging devices), realized network bandwidth and latency, etc.

3. Parallel compilation
Our compiler for Warp, a systolic array, is a cross compiler that executes on workstations and produces optimized code for each of the processing elements of the systolic array [6, 11]. The target of the compiler is a multi-computer; each of the processors in the array contains local program and data memories. Quite frequently an application program for the Warp array contains different programs for different processing elements. Due to its high communication bandwidth, Warp is a good host for pipelined computations where different phases of the computation are mapped onto different processors. This usage model encourages the design of algorithms which result in different programs for the individual processors, and this observation initiated our investigation into parallel compilation, since the program for each processor can be compiled independently. One of our goals was to assess the practical speedup; since we were not certain of the amount of speedup that can be realized, the project was subject to the constraint that the parallelization had to be easily implemented: it is not advisable to reimplement the entire compiler before it is even clear that parallelism can be exploited.

3.1. Source programs
The structure of the programming language for Warp reflects the underlying architecture of the machine: A Warp program is a module that consists of one or more section programs which describe the computation for a section (group) of processing elements. Section programs contain one or more functions and execute in parallel. Because section programs execute independently, optimizations performed within one section program are independent of those performed within another. Figure 1 depicts the structure of program S that contains 2 section programs. Section 1 contains function 1.1; Section 2 contains functions 2.1, 2.2, and 2.3.

The original plan was to parallelize only the compilation of programs for different sections, but then we realized that since the compiler performs only minimal inter-procedural optimizations, the scheme could be extended to handle the parallel compilation of multiple functions in the same section as well.

Figure 1. Structure of Warp program S

3.2. Architecture of the parallel compiler
The compiler is implemented in Common Lisp, and the sequential compiler runs as a Common Lisp process on a single SUN workstation under the UNIX operating system. The compiler consists of four phases:

- Phase 1: parsing and semantic checking
- Phase 2: construction of the flowgraph, local optimization, and computation of global dependencies
- Phase 3: software pipelining and code generation
- Phase 4: generation of I/O driver code, assembly and post-processing (linking, format conversion for download modules, etc.)

Both the first and last phase are cheap compared to the optimization and code generation phases. Furthermore, for the current implementation, these phases require global information that depends on all functions in a section. For example, to discover a type mismatch between a function return value and its use at a call site, the semantic checker has to process the complete section program. We therefore decided to parallelize only optimization and code generation. Parsing, semantic checking and assembly are performed sequentially.

The structure of the parallel compiler then reflects the structure of the source programs. There exists a hierarchy of processes; the levels of the hierarchy are master level, section level and function level. The tasks of each level are as follows:

Master level
The master level consists of exactly one process, the master, that controls the entire compilation. The master process is a C process; it invokes a Common Lisp process that parses the Warp program to obtain enough information to set up the parallel compilation. Thus, the master knows the structure of the program and therefore the total number of processes involved in one compilation. As a side effect, if there are any syntax or semantic errors in the program, they are discovered at this time and the compilation is aborted. Once the master has finished, the compilation is complete. The master is invoked by the user.

Section level
Processes on the section level are called section masters, and there is one section master for each section in the source program. Section masters are C processes. Section masters are invoked by the master, which also supplies parse information so that the section masters know how many functions are in the section. A section master controls the processes that perform code generation for the functions within its section. When code has been generated for each function of the section, the section master combines the results so that the parallel compiler produces the same input for the assembly phase as the sequential compiler. Furthermore, the section master process is responsible for combining the diagnostic output that was generated during the compilation of the functions. After this is done, the section master terminates.

Function level
The number of processes on the function level, called function masters, is equal to the total number of functions in the program. Function masters are Common Lisp processes. The task of a function master is to implement phases 2 and 3 of the compiler, that is, to optimize and generate code for one function. The function master for function foo that appears in section k is invoked by the k-th section master.

Figure 2. Call graph for compilation of program S

Figure 2 depicts the level of parallelism during compilation of program S shown in Figure 1. The master process forks two section master processes and waits for them to finish. Independently of each other, both section master processes fork one function master for each function in the corresponding section. The only communication required is between a parent process and its children; processes on the same level of the hierarchy introduced above operate completely independent of each other.
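
To make the call graph of Figure 2 concrete, the following C sketch shows the general fork/wait shape of such a process hierarchy. The function names, the hard-coded two-section program S, and the printf placeholders are our own illustration, not the actual master implementation (which, in addition, starts its children on different workstations and exchanges messages with them).

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Stand-in for a section master: fork one function master per
       function in the section and wait for all of them to finish. */
    static void run_section_master(int section, int nfunctions)
    {
        for (int f = 1; f <= nfunctions; f++) {
            if (fork() == 0) {              /* function master (child) */
                printf("compiling function %d.%d\n", section, f);
                fflush(stdout);             /* would start the Lisp code generator */
                _exit(0);
            }
        }
        while (wait(NULL) > 0)              /* parent-child communication only */
            ;
    }

    int main(void)                          /* the master process */
    {
        int functions_per_section[] = { 1, 3 };   /* program S: sections 1 and 2 */

        for (int s = 0; s < 2; s++) {
            if (fork() == 0) {              /* section master (child) */
                run_section_master(s + 1, functions_per_section[s]);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)              /* master waits for the section masters */
            ;
        printf("compilation complete\n");
        return 0;
    }

In the real system each of these children is a heavy-weight UNIX process, and the Lisp function masters are far more expensive to start than this sketch suggests (see Section 4.2.3).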

3.3. Host system
The Warp machine is integrated into an Ethernet-based network of SUN workstations, which are also used for compilation and program preparation. These workstations are in individual offices, but not all workstations are in use at all times. Therefore, the host architecture for our parallel compiler is this network of about 40 diskless SUN workstations that share the same file system. In practice, the number of processors that can be used in parallel is limited to 10-15 since not all machines are free to be used by the compiler. The section masters attempt to distribute the function masters to different workstations, thereby achieving load balancing. Synchronization between the master processes and their children processes occurs via messages, since there is no global shared memory.

Currently no parallel version of Common Lisp is available, so we had to use UNIX primitives to express parallelism. Specifically, each master is a separate, heavy-weight UNIX process. The use of operating system primitives for inter-process communication (instead of communication and synchronization primitives integrated into the Common Lisp system) has significant performance implications and limits the strategy choices available, since only large-grain communication can achieve a balance between communication and computation on such a system. (The reason that the master and section master processes are C processes is pragmatic: these processes start up much faster and require fewer resources than a Common Lisp process. Furthermore, the tasks performed by these masters are easily expressed in C and not much would be gained by coding these programs in Common Lisp.)

The model used to parallelize the compiler can be characterized as data partitioning, using the structure of the target machine (multiple independent processors) and the structure of the source language as the boundaries to partition the input data. Although this parallel compiler implements rather coarse grain parallelism (the unit of work is the compilation of a function or section), we were nevertheless concerned that the cost of managing the sub-processes might cancel out any improvement gained from the parallel implementation. On the other hand, it was clear to us that any other implementation of a parallel compiler that required finer grain communication had to be postponed until we were convinced that this simple method produced favorable results.

The decision to use a number of autonomous workstations was motivated as much by their availability as by our desire to demonstrate that the compiler could be parallelized with moderate effort. Since the sequential compiler executed on the individual hosts of the network, all the effort spent on the implementation of this compiler could be carried over easily. In our research we concentrated more on parallelizing compilation than on scheduling and load balancing, and we adopt a simple first-come-first-served strategy that distributes the tasks over the available processors. Other researchers have observed that such a simple strategy works well in practice [3], and we would reconsider this aspect of our system if we attempted to scale it to employ a larger number of processors.
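
As a rough illustration of such a first-come-first-served distribution, the C sketch below hands each function-master task to the next workstation in a fixed list. The host names, the task names, and the idea of starting the remote compile via rsh are our own assumptions for the example, not details taken from the paper.

    #include <stdio.h>

    #define NHOSTS 4

    /* Hypothetical workstation names; the real network had about 40. */
    static const char *hosts[NHOSTS] = { "ws1", "ws2", "ws3", "ws4" };

    /* Assign tasks to hosts in the order in which they come up. */
    static void distribute(const char *tasks[], int ntasks)
    {
        char cmd[256];

        for (int i = 0; i < ntasks; i++) {
            const char *host = hosts[i % NHOSTS];   /* first come, first served */
            snprintf(cmd, sizeof cmd,
                     "rsh %s compile-function %s &", host, tasks[i]);
            printf("assign %-8s -> %s\n", tasks[i], host);
            /* system(cmd) would start the remote function master here. */
        }
    }

    int main(void)
    {
        const char *tasks[] = { "f2.1", "f2.2", "f2.3" };
        distribute(tasks, 3);
        return 0;
    }

A production version would also have to notice when a workstation is busy or unavailable and skip it; that bookkeeping is exactly the scheduling work the paper defers to a larger-scale system.
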
3.4. Previous work
Parallel compilation has been investigated in previous studies. Several earlier research projects focussed on parsing, the part of the compilation that is best understood [4, 5, 2]. However, we do not know of any compiler in day-to-day use which includes a parallel parser. This is not surprising when one considers the small fraction of time that an optimizing compiler spends on parsing. Our own measurements indicate that a sequential compiler spends less than 5% of its time on parsing, and this time includes file input and scanning as well. Furthermore, to be effective, parallel parsing requires frequent inter-processor communication while only exchanging small data sets, and this communication style is penalized with high overheads by existing networks.

Data partitioning has been used in other parallel compilation systems, for example in a parallel assembler [9] and in the concurrent compiler developed at the University of Toronto [13], and it has been proposed even earlier [12].

The concurrent compiler developed at the University of Toronto addresses the issue of building the symbol table in parallel. In this model, the program to be translated is partitioned into multiple scopes (begin-end blocks, with-statements, functions, etc.). However, this partitioning introduces two problems: first, error reporting is potentially complicated, and second, since an optimizing compiler performs numerous global optimizations, these scopes (if they are smaller than a complete procedure) must be recombined before the most expensive phase of the compilation process. Nevertheless, this partitioning has the potential to achieve a significant speedup in practice; unfortunately, no performance data are available to assess its effectiveness.

A different approach to parallel compilation is taken by parallel versions of the make utility [1, 3]. These programs allow separate compilations to proceed concurrently. The input to parallel make is a UNIX makefile in which the user explicitly specifies dependencies between modules. Each module is a program that can be compiled separately after the objects (files) on the dependency list have been generated. The compiler invoked by parallel make is the default sequential compiler, and all potential parallelism has been identified by the creator of the makefile. While in parallel make several modules are compiled concurrently with a sequential compiler, our system compiles a single module with a parallel compiler. Thus, the level of parallelism is finer, and the parallel compiler has to analyze how tasks can be parallelized. Parallel make, on the other hand, must accept the dependencies of the makefile and has no knowledge about the individual compilations: a makefile with superfluous dependencies might not invoke any compilations in parallel.

Since makefiles for large systems give rise to numerous independent compilations, a parallel make system offers a good framework to address scheduling and load balancing issues. In practice, both approaches could coexist, with the parallel compiler speeding up the individual translations, and the parallel make system organizing the system generation effort.

In the compiler that forms the basis of our system, the time spent in the assembly stage is short compared to the time spent on code generation. Furthermore, a number of production compilers integrate the assembly phase into code emission, making it hard to parallelize assembly without parallelizing code generation. We will compare our results with the results reported in [9] in Section 4.

4. Results
This section summarizes the results that we have collected for the parallel compiler described in the previous section. We first describe the framework for our measurements and then present the data.

4.1. Test programs
In addition to measuring the speedup of the parallel over the sequential compiler, we were interested in evaluating the distributed network of workstations as a host for coarse grain parallelism. The central question is: how big must the parallel tasks be so that we can observe a speedup? To answer this question, we varied the size and the number of functions in the source program, since most of the time will be spent by the function masters. To have a controlled environment, it is desirable that the parallel tasks be of equal size, because this allows optimal processor utilization and therefore more accurate estimates of system behavior. To obtain tasks of equal size, we derived synthetic programs from one of our largest application programs, a Monte Carlo style simulation.

We used 5 functions of increasing size that required increasing amounts of time to compile. The functions consisted of 4, 35, 100, 280 and 360 lines of code and were selected to require different amounts of compilation time. These benchmark programs are clearly specific to our environment, but the same methodology can be applied to other compilers. Each of these programs consists of a loop nest (with deeply nested loop bodies in the case of the larger programs) that is representative with regard to compilation speed of a computation kernel for the Warp array. The challenge for the compiler, which was never tuned for compilation speed, is to use the multiple functional units of the processing elements and to keep the pipeline of these units busy. Purely sequential code compiles much faster.

In the following, we call those functions f_tiny, f_small, f_medium, f_large and f_huge. We varied the number of functions in each program between 1, 2, 4 and 8. (We also compiled selected examples with an intermediate number of functions but these compilations did not generate any different results, so we did not complete the measurement series for those programs.) Thus, our test data consisted of a set of Warp programs: S1 containing one f_tiny function, S2 containing two f_tiny functions, and so on. While these programs provide a good vehicle to evaluate the performance and limitations of the parallel compiler, actual user programs will consist of a mixture of functions with different sizes. To assess the parallel compiler for this setting, we have included measurements for a representative user program as well.

4.2. Measurements
Each measurement was done for all programs with both the parallel and the sequential compiler. Each test was run multiple times. The numbers presented in this paper are the arithmetic mean of those measurements. Since the deviations of the individual measurements are within 10% of the average, we consider the arithmetic mean of all measurements a fair approximation of the compilers' performance.

4.2.1. Total execution times
The total execution time of a compilation with the parallel compiler is the time it takes the master process to finish execution. We distinguish between elapsed time and CPU time. The elapsed time is the wall clock time the user spends waiting for a compilation to finish, and this is the metric of importance to the user. For this reason, we also use the term user time to refer to the elapsed time. CPU time is actual processor time. We measured the total elapsed time and the total CPU time for both compilers. Note that in all figures, the elapsed time reported is the total time taken, whereas the CPU time is reported on a per-processor basis. This presentation depicts nicely the processor utilization; we found the cumulative CPU time (i.e. added for all N processors) not nearly as informative.

Figure 3. Execution times for f_tiny

Figure 3 depicts user and CPU times for f_tiny, the smallest function. The parallel elapsed time is considerably larger than the sequential elapsed time. This indicates that for small functions, parallel compilation is of no use. The measurements for f_small and f_medium show continually better results for parallel compilation. The interested reader can find the results in Figures 12 and 13 in the appendix.

The best results were obtained for function f_large, as depicted in Figure 4. Parallel elapsed time is considerably smaller than sequential elapsed time. As the number of functions increases, the resulting increase in parallel compilation time is only marginal. In other words, adding more tasks does not increase execution time - a parallel programmer's dream!

Figure 5 shows the results for f_huge. Still, the parallel compiler is much faster than the sequential compiler. However, compared to f_large, the speedup obtained by the parallel compilation decreases. This suggests that for functions that are about the size of f_large, the behavior of the parallel compiler is optimal.

4.2.2. Speedup over the sequential compiler
The speedup metric presents the measurements in a form that depicts directly the performance improvement compared to sequential compilation. Figure 6 shows the speedup for elapsed time over sequential compilation for all programs. Except for f_tiny, the speedup is always greater than 1 and increases as the level of parallelism (that is, the number of functions) increases.

Performance of the parallel compiler increases as the size of the functions increases, and decreases again for function f_huge. Another way to look at the parallel compiler's performance is depicted in Figure 7, which shows speedup versus the size of the functions. (We use the number of lines as a rough indication of the size, although we are aware of the problematic nature of this metric.) If the number of functions is small, the size of the function does not influence speedup. This changes for 4 and 8 functions: the parallel speedup is significantly smaller for the largest function (f_huge). Both Figure 6 and Figure 7 show how the performance of the parallel compiler is influenced both by the number and the size of the parallel tasks.

Figure 4. Execution times for f_large

Figure 5. Execution times for f_huge

Figure 6. Speedup over sequential compiler

Figure 7. Speedup versus function size

It is interesting to note that these data correlate with the measurements reported in [9]. There, the speedup reported is about 6 for a large program and 4 for a small one; adding processors past 8 for the large program (5 for the small one) yields no further decrease in elapsed time. Since the amount of computation per processor is larger in our system, we are able to use more processors but also observe the dependence on the input size.

4.2.3. Overhead of the parallel compiler
Ideally, the speedup of parallel compilation should be k, if k is the number of processors that execute in parallel. In practice there is a relationship between system behavior, synchronization overhead and implementation overhead (such as task management) that prevents the system from achieving the ideal speedup.

The total overhead incurred by the parallel compiler is composed of system overhead and implementation overhead. The implementation overhead consists of the additional work that the parallel compiler performs (compared to the sequential one); it consists of:

master time: the time spent by the master process, consisting of
- setup time: the time for one extra parse of the program to determine the partitioning, and
- scheduling time: the coordination of the section masters based on the partitioning derived from the previous step.

section time: the time spent by the section masters to interpret the directives from the master and to start up the appropriate number of function masters. Also, the time required to combine the results and the diagnostic compiler output is included in this figure.

The system overhead includes additional system activity incurred by the parallel compiler. Some of its major contributors are:

- Startup time for lisp processes (a portion of the large core image must be downloaded, and each lisp process has to interpret initializing information)

- Network load (multiple processors attempt to access the network, increasing the chance of a collision)

- Garbage collection for lisp processes

- File server load.

For example, the parallel compiler increases the system load, since more lisp processes are started (one for each function master). As a consequence, the network load increases since multiple lisp images are downloaded and multiple processes swap off the same file server.

In our discussion, we focus on the total overhead and the system overhead of the parallel compiler. The system overhead is obtained by subtracting the implementation overhead (that is, the CPU times for master and section masters and the cost of parsing each program once) from the total overhead.
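
Written out (the symbols are ours, not the paper's), the decomposition used in the following figures is

    T_implementation = T_parse + T_master + sum over i of T_section(i)      (all CPU times)
    T_system         = T_total_overhead - T_implementation

where T_parse is the cost of the one extra parse and T_section(i) is the CPU time of the i-th section master.
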
Figure 8. Overheads as percentage of total time for f_tiny and f_small

Figure 8 shows both overheads as a percentage of parallel elapsed time for f_tiny and f_small. For f_tiny, the overhead contributes up to 70% of the parallel elapsed time. The system overhead is almost as big as the total overhead. For f_small, the overhead is less than for f_tiny but still substantial. The system overhead is only half of the total overhead. This indicates that for small function sizes, both implementation overhead and system overhead are equally responsible for the bad performance of parallel compilation.

Figure 9. Overheads as percentage of total time for f_medium and f_large

Figure 9 depicts the results for f_medium and shows an interesting result: the system overhead is negative if the number of functions is small. This indicates that the system overhead of the parallel compiler is less than the system overhead of the sequential compiler. The reason for the negative system overhead is that the sequential compiler processes a program that does not fit into the local memory and system space of a single workstation. Extensive garbage collection and swapping are the result. The multiple processes of the parallel compiler do not have that problem, since each works on a smaller subproblem. For small numbers of functions, the extra system load created by the parallel compiler (multiple lisp core images, increased network and server load) is alleviated by the savings due to working on smaller problems.

Figure 10. Overheads as percentage of total time for f_huge

Figure 10 shows the overhead results for f_huge. The system overhead is a significant portion of the total overhead. For eight functions, 50% of the total execution time is contributed by the overhead. Of all functions, f_large shows the smallest relative overhead (<= 25%). Consequently, the relationship between system overhead and function size is optimal if the size of the functions is between f_medium and f_large.

It is interesting to note that in all tests the relative overhead increases with the number of functions, regardless of their size. That is, compiling multiple functions in parallel becomes more expensive as the number of functions is increased. So the system behavior is directly correlated to the number of parallel tasks. For the absolute overhead times, see Figures 14, 15 and 16 in the appendix.

4.3. User program speedup
To validate that this approach works well in practice, we measured the compilation speed for a mechanical engineering application implemented on Warp. The program consists of three section programs with three functions each, i.e. a total of nine functions - an example of a large application program. The sequential compilation times of three functions ranged between 19 and 22 minutes (about 300 lines of code each); the compilation times for the other six functions are in the 2 to 6 minutes range (between 5 and 45 lines of code). In our first measurement, we used one workstation per function (that is, nine processors) and observed a speedup of 4.5 over sequential compilation. We also observed that each processor compiling one of the small functions was idle for at least 15 minutes during the entire compilation.

This result is even more encouraging when one considers that the processor assignment can easily be improved: instead of scheduling one function per processor, smaller functions can be grouped and compiled on the same processor, so the same speedup can be observed using fewer processors. A combination of lines of code and loop nesting can serve as an approximation of the compilation time that is the basis for the scheduler to perform load balancing, and since the master process parses the program to determine the partitioning, this information is readily available.
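
A scheduler built on this idea might look roughly like the following C sketch; the cost formula (lines weighted by loop-nesting depth), the function sizes, and all names are our own illustration of the heuristic, not the code that was actually used.

    #include <stdio.h>

    #define NPROC 5

    struct fn { const char *name; int lines; int max_nest; };

    /* Very rough cost estimate: lines of code weighted by loop nesting
       (this weighting is a made-up instance of the "lines + nesting" idea). */
    static int cost(const struct fn *f)
    {
        return f->lines * (1 + f->max_nest);
    }

    int main(void)
    {
        /* Sizes loosely modeled on the user program of Section 4.3. */
        struct fn fns[] = {
            { "a1", 300, 4 }, { "a2", 290, 4 }, { "a3", 310, 4 },
            { "b1",  45, 2 }, { "b2",  30, 2 }, { "b3",  20, 1 },
            { "c1",  15, 1 }, { "c2",  10, 1 }, { "c3",   5, 1 },
        };
        int nfns = sizeof fns / sizeof fns[0];
        int load[NPROC] = { 0 };

        /* Greedy load balancing: give each function (in the given order)
           to the processor with the smallest estimated load so far. */
        for (int i = 0; i < nfns; i++) {
            int best = 0;
            for (int p = 1; p < NPROC; p++)
                if (load[p] < load[best])
                    best = p;
            load[best] += cost(&fns[i]);
            printf("%-3s (cost %4d) -> processor %d\n",
                   fns[i].name, cost(&fns[i]), best);
        }
        for (int p = 0; p < NPROC; p++)
            printf("processor %d estimated load %d\n", p, load[p]);
        return 0;
    }

For these made-up numbers the three large functions each get a processor of their own and the six small ones are packed onto the remaining two, which is the kind of grouping that lets a handful of processors approach the nine-processor speedup reported below.
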
We used the above heuristic to balance the load and measured the speedup for this application program using 2, 3 and 5 processors. The result of our measurements is depicted in Figure 11. The speedup for 2 processors is 2.16, which indicates that the system overhead of the sequential compiler is greater than the system overhead of the parallel compiler due to swapping (and possibly garbage collection). Scheduling the parallel compilation on fewer processors yields excellent results: the speedup for 5 processors is almost as good as the speedup for 9 processors.

Figure 11. Speedup for a user program

5. Discussion
We consider this experiment with parallel compilation a success. If the functions to be compiled are large enough, the speedup over sequential compilation is considerable, and this speedup can be realized even when using UNIX user-level synchronization. This compiler has also given us an opportunity to evaluate the architecture of its underlying host system, and we notice that general purpose systems such as workstations connected by local networks can serve as efficient parallel hosts. These results were obtained with a relatively modest investment in time.

However, since the relative overhead increases with the number of functions compiled in parallel, there is a limit on how far this approach can be scaled up. Furthermore, the number of functions in a compilation unit is limited, and we expect that further advances have to explore finer grain parallelism if the goal is to speed up a single compilation. For those environments that attempt to speed up the compilation of large systems, the parallel compilation of modules that can be compiled separately provides an additional way to improve compilation time.

5.1. Compiler implications
Our experiments show that exploiting large grain parallelism is a successful step towards parallelizing compilers. This style of parallelism cannot be exploited for uniprocessor languages when a module must be compiled together; even in our case, we had to curtail inter-procedural optimization. However, parallel languages for parallel target architectures are a rewarding source of large-grain parallelism.

Optimizing compilers for supercomputers are particularly slow. Here, parallelism not only speeds up the compilation process, but can also improve the quality of the generated code. For example, more sophisticated optimization algorithms can be used that would make compilation on a uniprocessor too slow. As seen in our experiments, multiple processors can process a program that does not fit into the system space of a single processor implementation. This is particularly important for Lisp systems, which have to use garbage collection extensively if the data object space grows too large.

The observation that parallel compilation is of marginal value when compiling small functions supports our view that procedure inlining is an important optimization that should be included in the compiler if the source program consists of many small functions. Not only will procedure inlining allow the code generator to perform a better job, the increase in size of each function operated upon will also improve the speedup obtained by the parallel compiler. We still have to investigate the tradeoff between inter-procedural optimizations and the combination of inlining and parallel compilation.

5.2. Host observations
A system of autonomous nodes that run under the UNIX operating system causes problems for parallel applications. In that environment it is hard to make a parallel program reliable and comfortable to use. UNIX does not support centralized resource allocation, and consequently, each parallel application must handle processor allocation and load balancing with minimal operating system support.

Process control in UNIX is uncomfortable; in particular, managing process hierarchies is a real problem. One of the reasons is that UNIX does not have an explicit join construct that allows the forking process to resume execution after the child process has finished. As a result, the application code becomes unwieldy as it tries to account for all possible failures in the child processes and their host processors.

An efficient parallel language or a library for parallel process management could make the task of programming a distributed system of autonomous nodes much easier; our results show that there is great potential for parallelism in general purpose hardware.

6. Concluding remarks
Improving the speed of compilations is important since programs for multi-computers can be quite large. Furthermore, continued research in code optimization should not be bound by compile time constraints. Any strategy that reduces the compilation time benefits the users in two ways: the actual compilation time is reduced, or the compiler can employ more time consuming optimizations and thereby improve the quality of the code generated.

We have also seen that the overhead associated with a parallel compilation can be responsible for a significant portion of the total execution time. Compiling small functions in parallel is unlikely to yield any speedup. But for medium and large size functions, the benefits obtained from using multiple computers in parallel offset the overhead even for a small number of functions. For the style of parallelism exploited by this compiler, on the order of 8 to 16 processors can be used comfortably. For our domain of application programs, extending the number of processors beyond this range is unlikely to yield any additional speedup.

In summary, we have demonstrated that the parallelism inherent in the source programs for a parallel computer can easily and directly be exploited by a parallel compiler based on data partitioning. The strategy to compile each function separately on a different machine proved effective and led to a straightforward modification of the sequential compiler. For typical programs in our environment, we observe a speedup ranging from 3 to 6 using not more than 9 processors.

Acknowledgements
We appreciate the comments and contributions by the other members of the Warp and iWarp projects at Carnegie Mellon. C. H. Chang, Monica Lam, Peter Lieu, Abu Noaman and David Yam contributed to the sequential compiler on which this parallel compiler is based. Todd Mummert helped us with the setup for the measurements on our network.

References
1. Baalbergen, E. H. "Design and Implementation of Parallel Make". Computing Systems 1, 2 (Spring 1988), 135-158.
2. Boehm, H. J., and Zwaenepoel, W. Parallel Attribute Grammar Evaluation. 7th Intl. Conf. on Distributed Computing Systems, IEEE, Berlin, September, 1987. Earlier published as Rice Tech. Report 86-39.
3. Bubenik, R. and Zwaenepoel, W. Performance of Optimistic Make. Proc. Intl. Conf. on Measurement and Modeling of Computer Systems, May, 1989.
4. Cohen, J. and Kolodner, S. "Estimating the Speedup in Parallel Parsing". IEEE Trans. Softw. Engineering 11, 1 (Jan 1985), 114-124.
5. Fischer, C. N. On Parsing Context Free Languages in Parallel Environments. Ph.D. Th., Cornell University, 1975.
6. Gross, T. and Lam, M. Compilation for a High-performance Systolic Array. Proceedings of the ACM SIGPLAN '86 Symposium on Compiler Construction, ACM SIGPLAN, June, 1986, pp. 27-38.
7. Hillis, W. D., and Steele, G. L. Jr. "Data Parallel Algorithms". Comm. ACM 29, 12 (Dec 1986), 1170-1183.
8. Karp, R. M. and Ramachandran, V. A Survey of Parallel Algorithms for Shared-Memory Machines. Computer Science Division, University of California at Berkeley, March, 1988.
9. Katseff, H. P. Using Data Partitioning to Implement a Parallel Assembler. Parallel Programming: Experience with Applications, Languages and Systems, New Haven, July, 1988, pp. 66-76. Also published as SIGPLAN Notices Vol. 23, 9.
10. Kung, H. T. Warp Experience: We Can Map Computations onto a Parallel Computer Efficiently. Conference proceedings of the 1988 International Conference on Supercomputing, St. Malo, France, July, 1988, pp. 668-675.
11. Lam, M. S. A Systolic Array Optimizing Compiler. Kluwer Academic Publishers, 1988.
12. Lipkie, D. E. A Compiler Design for a Multiple Independent Processor Computer. Ph.D. Th., University of Washington, Seattle, 1979.
13. Seshadri, V., Wortman, D. B., Junkin, M. D., Weber, S., Yu, C. P., and Small, I. Semantic Analysis in a Concurrent Compiler. Proceedings of the ACM SIGPLAN '88 Conference on Programming Language Design and Implementation, ACM SIGPLAN, June, 1988, pp. 233-239.

I. Appendix

Figure 12. Execution times for f_small

Figure 13. Execution times for f_medium

Figure 14. Absolute overhead for f_tiny and f_small

Figure 15. Absolute overhead for f_medium and f_large

Figure 16. Absolute overhead for f_huge

