Retargetable Compilers and Architecture Exploration For Embedded Processors
R. Leupers, M. Hohenauer, J. Ceng, H. Scharwaechter, H. Meyr, G. Ascheid and G. Braun

Abstract: Retargetable compilers can generate assembly code for a variety of different target processor architectures. Owing to their use in the design of application-specific embedded processors, they bridge the gap between the traditionally separate disciplines of compiler construction and electronic design automation. In particular, they assist in architecture exploration for tailoring processors towards a certain application domain. The paper reviews the state of the art in retargetable compilers for embedded processors. Based on some essential compiler background, several representative retargetable compiler systems are discussed, while also outlining their use in iterative, profiling-based architecture exploration. The LISATek C compiler is presented as a detailed case study, and promising areas of future work are proposed.
1 Introduction
Compilers translate high-level language source code into machine-specific assembly code. For this task, any compiler uses a model of the target processor. This model captures the compiler-relevant machine resources, including the instruction set, register files and instruction scheduling constraints. While in traditional target-specific compilers this model is built-in (i.e. it is hard-coded and probably distributed within the compiler source code), a retargetable compiler uses an external processor model as an additional input that can be edited without the need to modify the compiler source code itself (Fig. 1). This concept provides retargetable compilers with high flexibility with respect to the target processor. Retargetable compilers have been recognised as important tools in the context of embedded system-on-chip (SoC) design for several years. One reason is the trend towards increasing use of programmable processor cores as SoC platform building blocks, which provide the necessary flexibility for fast adoption, e.g. of new media encoding or protocol standards, and easy (software-based) product upgrading and debugging. While assembly language used to be predominant in embedded processor programming for quite some time, the increasing complexity of embedded application code now makes the use of high-level languages like C and C++ just as inevitable as in desktop application programming. In contrast to desktop computers, embedded SoCs have to meet very high efficiency requirements in terms of MIPS per Watt, which makes the use of power-hungry,
© IEE, 2005
IEE Proceedings online no. 20045075
doi: 10.1049/ip-cdt:20045075
Paper first received 7th July and in revised form 27th October 2004
R. Leupers, M. Hohenauer, J. Ceng, H. Scharwaechter, H. Meyr, and G. Ascheid are with the Institute for Integrated Signal Processing Systems, RWTH Aachen University of Technology, SSS-611920, Templergraben 55, D-52056, Aachen, Germany
G. Braun is with CoWare Inc., CA 95131, USA
E-mail: [email protected]
IEE Proc.-Comput. Digit. Tech., Vol. 152, No. 2, March 2005
high-performance off-the-shelf processors from the desktop computer domain (together with their well developed compiler technology) impossible for many applications. As a consequence, hundreds of different domain-specific or even application-specific programmable processors have appeared in the semiconductor market, and this trend is expected to continue. Prominent examples include low-cost/low-energy microcontrollers (e.g. for wireless sensor networks), number-crunching digital signal processors (e.g. for audio and video codecs), as well as network processors (e.g. for internet traffic management). All these devices demand their own programming environment, obviously including a high-level language (mostly ANSI C) compiler. This requires the capability of quickly designing compilers for new processors, or variations of existing ones, without the need to start from scratch each time. While compiler design traditionally has been considered a very tedious and manpower-intensive task, contemporary retargetable compiler technology makes it possible to build operational (not heavily optimising) C compilers within a few weeks, and more decent ones within approximately a single man-year. Naturally, the exact effort heavily depends on the complexity of the target processor, the required code optimisation and robustness level, and the engineering skills. However, compiler construction for new embedded processors is now certainly much more feasible than a decade ago. This permits us to employ compilers not only for application code development, but also for optimising an embedded processor architecture itself, leading to a true compiler/architecture codesign technology that helps to avoid hardware/software mismatches long before silicon fabrication. This paper summarises the state of the art in retargetable compilers for embedded processors and outlines their design and use by means of examples and case studies.
We provide some compiler construction background needed to understand the different retargeting technologies and give an overview of some existing retargetable compiler systems. We describe how the above-mentioned compiler/architecture codesign concept can be implemented in a processor architecture exploration environment. A detailed example of an industrial retargetable C compiler system is discussed.
Fig. 1 Classical against retargetable compiler structure: a classical compiler takes only source code, with the processor model built in, whereas a retargetable compiler takes source code plus an external processor model; both emit assembly code
The general structure of retargetable compilers follows that of well-proven classical compiler technology, which is described in textbooks such as [1-4]. First, there is a language frontend for source code analysis. The frontend produces an intermediate representation (IR), on which a number of machine-independent code optimisations are performed. Finally, the backend translates the intermediate representation into assembly code, while performing additional machine-specific code optimisations.
so that the parser can complete its job in linear time in the input size. However, the same also holds for LR(k) parsers, which can process a broader range of context-free grammars. They work bottom-up, i.e. the input token stream is reduced step-by-step until finally reaching the start symbol. Instead of making a reduction step solely based on the knowledge of the k lookahead symbols, the parser additionally stores input symbols temporarily on a stack until enough symbols for an entire right-hand side of a grammar rule have been read. Due to this, the implementation of an LR(k) parser is less intuitive and requires more effort than for LL(k). Constructing LL(k) and LR(k) parsers manually provides some advantage in parsing speed. However, in most practical cases tools like yacc [5] (which generates a variant of LR(k) parsers) are employed for semi-automatic parser implementation. Finally, the semantic analyser performs correctness checks not covered by syntax analysis, e.g. forward declaration of identifiers and type compatibility of operands. It also builds up a symbol table that stores identifier information and visibility scopes. In contrast to scanners and parsers, there are no widespread standard tools like lex and yacc for generating semantic analysers. Frequently, though, attribute grammars [1] are used for capturing the semantic actions in a syntax-directed fashion, and special tools like ox [6] can extend lex and yacc to handle attribute grammars.
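To make the LL(k) idea concrete, the following recursive-descent evaluator sketches an LL(1) parser for a toy expression grammar (one hand-written function per nonterminal, one token of lookahead; the grammar and function names are our own and not taken from any of the cited tools):

```python
import re

# Toy grammar:  expr -> term (('+'|'-') term)*    term -> NUM | '(' expr ')'
# LL(1) recursive descent: one function per nonterminal; the single
# lookahead token toks[i] suffices to pick the next grammar rule.

def tokenize(s):
    return re.findall(r"\d+|[()+\-]", s)

def parse_expr(toks, i):
    val, i = parse_term(toks, i)
    while i < len(toks) and toks[i] in "+-":   # lookahead decides to continue
        op = toks[i]
        rhs, i = parse_term(toks, i + 1)
        val = val + rhs if op == "+" else val - rhs
    return val, i

def parse_term(toks, i):
    if toks[i] == "(":                         # lookahead selects the rule
        val, i = parse_expr(toks, i + 1)
        return val, i + 1                      # consume the closing ')'
    return int(toks[i]), i + 1                 # NUM

def evaluate(s):
    val, _ = parse_expr(tokenize(s), 0)
    return val
```

An LR(1) parser generated by yacc would accept the same language bottom-up, shifting tokens onto a stack and reducing once a complete right-hand side has been recognised.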
Fig. 2 Frontend phases (scanner, parser producing a syntax tree, semantic analyser producing the IR) for an example C function:

int fib (int m) {
  int f0 = 0, f1 = 1, f2, i;
  if (m <= 1)
    return m;
  else
    for (i = 2; i <= m; i++) {
      f2 = f0 + f1; f0 = f1; f1 = f2;
    }
  return f2;
}
Fig. 3 Three-address code IR generated for the fib example: each statement contains at most one operation; compiler-generated temporaries (t1..t6), renamed variables (f0_4, f1_5, f2_6, i_7, m_2) and labels (LL1..LL5) with conditional gotos replace the structured control flow
For instance, the data flow graph (DFG) from Fig. 6 shows relations of the form 'statement X computes a value used as an argument in statement Y'. Both the CFG and the DFG form the basis for many code optimisation passes at the IR level. These include common subexpression elimination, jump optimisation, loop-invariant code motion, dead code elimination and other 'Dragon Book' [1] techniques. Owing to their target machine independence, these IR optimisations are generally considered complementary to machine code generation in the backend and are supposed to be useful on average for any type of target. However, care must be taken to select an appropriate IR optimisation sequence or script for each particular target, since certain (sometimes quite subtle) machine dependencies do exist. For instance, common subexpression elimination removes redundant computations to save execution time and code size. At the assembly level, however, this effect might be over-compensated by the higher register pressure that increases the amount of spill code. Moreover, there are many interdependencies between the IR optimisations themselves. For instance, constant propagation generally creates new optimisation opportunities for constant folding, and vice versa, and dead code elimination is frequently required as a cleanup phase between other IR optimisations. A poor choice of IR optimisations can have a dramatic effect on final code quality. Thus, it is important that IR optimisations be organised in a modular fashion, so as to permit enabling and disabling of particular passes during fine-tuning of a new compiler.
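As an illustration of this interplay, the following sketch combines constant propagation and constant folding in one forward pass over a toy three-address IR (the tuple representation and operator set are invented for illustration):

```python
# Toy linear IR: (dest, op, args). Constant propagation replaces argument
# names by known values; folding evaluates operations whose operands are
# then all constant.
def fold_constants(ir):
    env, out = {}, []                 # env: names proved to hold a constant
    for dest, op, args in ir:
        vals = [env.get(a, a) for a in args]         # propagate known constants
        if op == "const":
            env[dest] = vals[0]
        elif all(isinstance(v, int) for v in vals):  # fold fully-constant ops
            env[dest] = vals[0] + vals[1] if op == "add" else vals[0] * vals[1]
        else:
            out.append((dest, op, vals))             # keep, with folded operands
            env.pop(dest, None)                      # dest no longer constant
    return out, env

# c = a + b folds to 5; only d = c + x survives, with c's value propagated:
prog = [("a", "const", [2]), ("b", "const", [3]),
        ("c", "add", ["a", "b"]), ("d", "add", ["c", "x"])]
```

A subsequent dead code elimination pass would then remove the definitions of a, b and c if they have no remaining uses, which is exactly the cleanup interplay described above.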
Fig. 6 Data flow graph (DFG) of the IR statements

Fig. 7 Instruction patterns for code selection, e.g. ADDREG: REG + REG -> REG; ADDCONST: REG + CONST -> REG; and memory access patterns such as load(An) and load d(An)

Fig. 8 Two possible coverings of a DFG using the instruction patterns from Fig. 7
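The covering problem of Figs. 7 and 8 can be sketched as a small dynamic-programming code selector in the spirit of tree pattern matchers; the pattern names follow Fig. 7, while the tree encoding and cost values are invented for illustration:

```python
# Expression trees are tuples (op, child...). Each pattern maps a root op and
# operand shapes to (cost, instruction); 'REG' operands are obtained by
# recursively covering the subtree, 'CONST' must match a constant leaf.
PATTERNS = {
    ("+", "REG", "REG"):   (1, "ADDREG"),
    ("+", "REG", "CONST"): (1, "ADDCONST"),
    ("CONST",):            (1, "LDC"),    # materialise a constant in a register
}

def cover(node):
    """Return (min_cost, instructions) reducing `node` to a value in a REG."""
    op, *kids = node
    if op == "REG":
        return 0, []                      # already in a register
    best = None
    for shape, (cost, insn) in PATTERNS.items():
        if shape[0] != op or len(shape) - 1 != len(kids):
            continue
        total, insns = cost, []
        for want, kid in zip(shape[1:], kids):
            if want == "CONST":
                if kid[0] != "CONST":     # pattern inapplicable here
                    total = None
                    break
            else:                         # operand must be covered to a REG
                c, i = cover(kid)
                total, insns = total + c, insns + i
        # bottom-up costs let each subtree pick its cheapest covering
        if total is not None and (best is None or total < best[0]):
            best = (total, insns + [insn])
    return best
```

For REG + CONST, the single ADDCONST pattern (cost 1) beats first loading the constant and then using ADDREG (cost 2), mirroring the cheaper of the two coverings in Fig. 8.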
spill code to temporarily store register variables in main memory. Obviously, spill code needs to be minimised in order to optimise program performance and code size. Many register allocators use a graph colouring approach [12, 13] to accomplish this. For target processors with instruction pipeline hazards and/or instruction-level parallelism, such as VLIW machines, instruction scheduling is required to optimise code execution speed. The scheduler requires two sets of inputs to represent the interinstruction constraints: instruction latencies and reservation tables. Both constraints refer to the schedulability of some pair (X, Y) of instructions. If instruction Y depends on instruction X, the latency values determine a lower bound on the number of cycles between execution of X and Y. If there is no dependence between X and Y, the occupation of exclusive resources represented by the reservation tables determines whether the execution of X may overlap the execution of Y in time, e.g. whether X and Y may be executed in parallel on a VLIW processor. As for other backend phases, optimal scheduling is an intractable problem. However, there exist a number of powerful scheduling heuristics, such as (local) list scheduling and (global) trace scheduling [3]. Besides the above three standard code generation phases, a backend frequently also incorporates different target-specific code optimisation passes. For instance, address code optimisation [14] is useful for a class of DSPs, so as to fully utilise dedicated address generation hardware for pointer arithmetic. Many VLIW compilers employ loop unrolling and software pipelining [15] for increasing instruction-level parallelism in the hot spots of application code. Loop unrolling generates larger basic blocks inside loop bodies, and hence provides better opportunities for keeping the VLIW functional units busy most of the time. Software pipelining rearranges the loop iterations so as to remove intra-loop data dependencies that otherwise would obstruct instruction-level parallelism. Finally, NPU architectures for efficient protocol processing require yet a different set of machine-specific techniques [16] that exploit bit-level manipulation instructions. The separation of the backend into multiple phases is frequently needed to achieve sufficient compilation speed, but tends to compromise code quality due to interdependencies between the phases.
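A minimal single-issue list scheduler may clarify how latency values constrain the schedule; the data structures here are invented for illustration, and a production scheduler would additionally rank the ready set by heuristics such as critical-path length and consult reservation tables:

```python
# ops: instruction ids in source order; deps: {op: set of predecessors};
# latency[(x, y)]: minimum cycles between dependent x and y (default 1).
def list_schedule(ops, deps, latency):
    schedule, done, cycle = {}, set(), 0
    while len(done) < len(ops):
        ready = [o for o in ops if o not in done
                 and deps.get(o, set()) <= done          # predecessors placed
                 and all(schedule[p] + latency.get((p, o), 1) <= cycle
                         for p in deps.get(o, set()))]   # latencies elapsed
        if ready:
            schedule[ready[0]] = cycle   # naive heuristic: first ready op
            done.add(ready[0])
        cycle += 1                       # single-issue: one op per cycle
    return schedule
```

With a two-cycle latency between a producer and its consumer, the consumer is held back until the latency has elapsed, even if it is otherwise ready.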
In particular, this holds for irregular non-RISC instruction sets, where the phase interdependencies are sometimes very tight. Although there have been attempts to solve the code generation problem in its entirety, e.g. based on integer linear programming [17], such phase-coupled code generation techniques are still not in widespread use in real-world compilers.
From the above discussion it is obvious that compiler retargeting mainly requires adaptations of the backend, even though IR optimisation issues certainly should not be neglected. In order to provide a retargetable compiler with a processor model, as sketched in Fig. 1, a formal machine description language is required. For this purpose, dozens of different approaches exist. These can be classified with respect to the intended target processor class (e.g. RISC versus VLIW) and the modelling abstraction level, e.g. purely behavioural, compiler-oriented versus more structural, architecture-oriented modelling styles. Behavioural modelling languages make the task of retargeting easier, because they explicitly capture compiler-related information about the target machine, i.e. instruction set, register architecture and scheduling constraints. However, they usually require a good understanding of compiler technology. In contrast, architectural modelling languages follow a more hardware-design-oriented approach and describe the target machine in more detail. This is convenient for users not so familiar with compiler technology. However, automatic retargeting gets more difficult, because a compiler view needs to be extracted from the architecture model, while eliminating unnecessary details. In the following, we briefly discuss a few representative examples of retargetable compiler systems. For a comprehensive overview of existing systems see [18].
3.1 MIMOLA
MIMOLA denotes both a mixed programming and hardware description language (HDL) and a hardware design system. As the MIMOLA HDL serves multiple purposes, e.g. register-transfer level (RTL) simulation and synthesis, the retargetable compiler MSSQ [19, 20] within the MIMOLA design system follows the above-mentioned architecture-oriented approach. The target processor is described as an RTL netlist, consisting of components and interconnect. Figure 9 gives an example of such an RTL model. Since the HDL model comprises all RTL information about the target machine's controller and data path, it is clear that all information relevant for the compiler backend of MSSQ is present too. However, this information is only implicitly available, and consequently the lookup of this information is more complicated than in a behavioural model.
MSSQ compiles an extended subset of the PASCAL programming language directly into binary machine code. Due to its early introduction, MSSQ employs only few advanced code optimisation techniques (e.g. there is no graph-based global register allocation), but performs the source-to-architecture mapping in a straightforward fashion, on a statement-by-statement basis. Each statement is represented by a data flow graph (DFG), for which an isomorphic subgraph is searched in the target data path. If this matching fails, the DFG is partitioned into simpler components, for which graph matching is invoked recursively. In spite of this simple approach, MSSQ is capable of exploiting instruction-level parallelism in VLIW-like architectures very well, due to the use of a flexible instruction scheduler. However, code quality is generally not acceptable in the case of complex instruction sets and load/store data paths. In addition, it shows comparatively high compilation times, due to the need for exhaustive graph matching. The MIMOLA approach shows very high flexibility in compiler retargeting, since in principle any target processor can be represented as an RTL HDL model. In addition, it avoids the need to consistently maintain multiple different models of the same machine for different design phases, e.g. simulation and synthesis, as all phases can use the same 'golden' reference model. MSSQ demonstrates that retargetable compilation is possible with such unified models, even though it does not handle well architectures with complex instruction pipelining constraints (which is a limitation of the tool, though, rather than of the approach itself). The disadvantage, however, is that the comparatively detailed modelling level makes it more difficult to develop the model and to understand its interaction with the retargetable compiler, since e.g. the instruction set is hidden in the model.
Some of the limitations have been removed in RECORD, another MIMOLA HDL based retargetable compiler that comprises dedicated code optimisations for DSPs. In order to optimise compilation speed, RECORD uses an instruction set extraction technique [21] that bridges the gap between RTL models and behavioural processor models. Key ideas of MSSQ, e.g. the representation of scheduling constraints by binary partial instructions, have also been adopted in the CHESS compiler [22, 23], one of the first commercial tool offerings in that area. In the Expression compiler [24], the concept of structural architecture modelling has been further refined to increase the reuse opportunities for model components.
embedded processor architectures like ARM, ARC, MIPS and Xtensa. However, it is very complex and hard to customise. It is primarily designed for compiler-friendly 32-bit RISC-like load-store architectures. While porting to more irregular architectures, such as DSPs, is possible as well, this generally results in huge retargeting effort and/or insufficient code quality.
3.4 CoSy
The CoSy system from ACE [28] is a retargetable compiler for multiple source languages, including C and C++. Like gcc, it includes several Dragon Book optimisations, but shows a more modular, extensible software architecture, which permits the addition of IR optimisation passes through well-defined interfaces. For retargeting, CoSy comprises a backend generator that is driven by the CGD machine description format. Similar to gcc and lcc, this format is full-custom and only designed for use in compilation. Hence, retargeting CoSy requires significant compiler know-how, particularly with respect to code selection and scheduling. Although it generates the backend automatically from the CGD specification, including standard algorithms for code selection, register
Fig. 10 CGD register specification (excerpt): machine registers (general-purpose registers R0..R15, floating-point registers F0..F15, double-precision register pairs D0<F0,F1>..D14<F14,F15>, miscellaneous registers such as PC and FPSR), combined registers, aliases and register classes (e.g. REGI, REGF, REGD, RET, FP, SP, TEMP)

Fig. 11 CGD mapping rule: a named rule matches an IR operation under a condition and emits assembly code:

RULE [mirPlus_regi_regi__regi]
  o:mirPlus(rs1:regi0, rs2:regi0) -> rd:regi;
CONDITION { IS_POINTER_OR_INT(o.Type) }
COST 1;
PRODUCER ALU_Out;
CONSUMER ALU_In;
TEMPLATE ALU_op;
EMIT { printf("ADD %s,%s,%s\n", REGNAME(rd), REGNAME(rs1), REGNAME(rs2)); }
allocation and scheduling, the designer has to fully understand the IR-to-assembly mapping and how the architecture constrains the instruction scheduler. The CGD format follows the classical backend organisation. It includes mapping rules, a register specification, as well as scheduler tables. The register specification is a straightforward listing of the different register classes and their availability for the register allocator (Fig. 10). Mapping rules are the key element of CGD (Fig. 11). Each rule describes the assembly code to be emitted for a certain C/IR operation, depending on matching conditions and cost metric attributes. Similar to gcc and lcc, the register allocator later replaces symbolic registers with physical registers in the generated code. Mapping rules also contain a link to the CGD scheduler description. By means of the keywords PRODUCER and CONSUMER, the instructions can be classified into groups, so as to make the scheduler description more compact. For instance, arithmetic instructions performed on a certain ALU generally have the same latency values. In the scheduler description itself (Fig. 12), the latencies for pairs of instruction groups are listed as a table of numerical values. As explained in Section 2.3, these values instruct the scheduler to keep instructions a minimum number of cycles apart from each other. Different types of interinstruction dependencies are permitted. Here the keyword TRUE denotes data dependence. (Data dependencies are sometimes called 'true', since they are induced by the source program itself. Hence, they cannot be removed by the
Fig. 12 CGD scheduler description: instruction classes (PRODUCER ALU_Out, MUL_Out; CONSUMER ALU_In, MUL_In), a TRUE-dependence latency table (ALU_Out: 1 cycle, MUL_Out: 2 cycles to either consumer class) and reservation tables:

ALU_op := () & () & EX_alu & ();
MUL_op := () & () & EX_mul & EX_mul & ();
compiler. In contrast, there are 'false', or antidependencies, that are only introduced by code generation via reuse of registers for different variables. The compiler should aim at minimising the number of false dependencies, in order to maximise the instruction scheduling freedom.) Via the TEMPLATE keyword, a reservation table entry is referenced. The '&' symbol separates the resource use of an instruction group over the different cycles during its processing in the pipeline. For instance, in the last line of Fig. 12, instruction group MUL_op occupies resource EX_mul for two subsequent cycles. The CGD processor modelling formalism makes CoSy a quite versatile retargetable compiler. Case studies for RISC architectures show that the code quality produced by CoSy
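The semantics of such reservation tables can be sketched as follows (an illustrative model of our own, not CoSy code): each instruction group occupies a set of exclusive resources per pipeline cycle, mirroring the '&'-separated cycles in CGD, and two instructions may be placed a given number of cycles apart only if their occupied resources never overlap in any common cycle:

```python
# Illustrative reservation tables (names mirror Fig. 12, the representation
# is our own): one set of exclusive resources per pipeline cycle.
ALU_op = [set(), set(), {"EX_alu"}, set()]              # EX_alu busy one cycle
MUL_op = [set(), set(), {"EX_mul"}, {"EX_mul"}, set()]  # EX_mul busy two cycles

def can_issue(table_x, table_y, delay):
    """May an instruction with table_y issue `delay` cycles after one with
    table_x without a structural hazard (delay=0 means in parallel)?"""
    for cy, res_y in enumerate(table_y):
        cx = cy + delay              # X's table index at Y's cycle cy
        if 0 <= cx < len(table_x) and table_x[cx] & res_y:
            return False             # both need the same exclusive resource
    return True
```

Here two MUL_op instructions cannot issue back-to-back, since the second would need EX_mul while the first still holds it, but a two-cycle gap resolves the conflict.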
compilers is comparable to that of gcc. However, the complexity of the CoSy system and the need for compiler background knowledge make retargeting more tedious than, for example, in the case of lcc.

4 Processor architecture exploration
and to remove them by corresponding modifications, without overlooking their potential side effects. If the HW/SW match is initially not satisfactory, the ASIP architecture is further optimised, dependent on the bottlenecks detected during simulation and profiling. This optimisation naturally requires hardware design knowledge, and may comprise, for example, addition of application-specific custom machine instructions, varying register file sizes, modifying the pipeline architecture, adding more function units to the data path, or simply removing unused instructions. The exact consequences of such modifications are hard to predict, so that usually multiple iterations are required in order to arrive at an optimal ASIP architecture that can be handed over to synthesis and fabrication. With the research foundations of this methodology laid in the 1980s and 1990s (see [18] for a summary of early tools), several commercial offerings are available now in the EDA industry, and more and more start-up companies are entering the market in that area. While ASIPs offer many advantages over off-the-shelf processor cores (e.g. higher efficiency, reduced royalty payments and better product differentiation), a major obstacle is still the potentially costly design and verification process, particularly concerning the software tools shown in Fig. 13. In order to minimise these costs and to make the exploration loop efficient, all approaches to processor architecture exploration aim at automating the retargeting of these tools as much as possible. In addition, a link to hardware design has to be available in order to accurately estimate area, cycle time and power consumption of a new ASIP. In most cases this is enabled by automatic HDL generation capabilities for processor models, which provide a direct entry to gate-true estimations via traditional synthesis flows and tools. One of the most prominent examples of an industrial ASIP is the Tensilica Xtensa processor [30].
It provides a basic RISC core that can be extended and customised by adding new machine instructions and adjusting parameters, e.g. for the memory and register file sizes. Software development tools and an HDL synthesis model can be automatically generated. Application programming is supported via the gcc compiler and a more optimising in-house C compiler variant. The Tensilica Xtensa, together with its design environment, completely implements the exploration methodology from Fig. 13. However, the use of a largely predefined RISC core as the basic component poses limitations on the flexibility and the permissible design space. An important new entry to the ASIP market is Stretch [31]. Their configurable S5000 processor is based on the Xtensa core, but includes an embedded field programmable gate array (FPGA) for processor customisation. While FPGA vendors have combined processors and configurable logic on a single chip for some time, the S5000 'instruction set extension fabric' is optimised for implementation of custom instructions, thus providing a closer coupling between processor and FPGA. In this way, the ASIP becomes purely software-configurable and field-programmable, which reduces the design effort, but at the expense of reduced flexibility.
Fig. 13 Iterative, profiling-based architecture exploration loop
the case of gcc or CoSy's CGD, an ADL such as nML also captures information relevant for the generation of other software development tools, e.g. simulator, assembler and debugger, and hence covers a greater level of detail. However, in contrast to HDL-based approaches to retargetable compilation, such as MIMOLA, the abstraction level is still higher than RTL and usually allows for a concise explicit modelling of the instruction set. The transition to RTL only takes place once the ADL model is refined to an HDL model for synthesis. LISATek is another ASIP design tool suite that originated at Aachen University [32]. It was first produced by LISATek Inc. and is now available as a part of CoWare's SoC design tool suite [33]. LISATek uses the LISA 2.0 (language for instruction set architectures) ADL for processor modelling. A LISA model captures the processor resources like registers, memories and instruction pipelines, as well as the machine's instruction set architecture (ISA). The ISA model is composed of operations (Fig. 14), consisting of sections that describe the binary coding, timing, assembly syntax and behaviour of machine operations at different abstraction levels. In an instruction-accurate model (typically used for early architecture exploration), no pipeline information is present, and each operation corresponds to one instruction. In a more fine-grained, cycle-accurate model, each operation represents a single pipeline stage of one instruction. LISATek permits the generation of software development tools (compiler, simulator, assembler, linker, debugger, etc.) from a LISA model, and embeds all tools into an integrated GUI environment for application and architecture profiling. In addition, it supports the translation of LISA models to synthesisable VHDL and Verilog RTL models. Figure 15 shows the intended ASIP design flow with LISATek. In addition to an implementation of the exploration loop from Fig. 13, the flow also comprises the synthesis path via HDL models, which enables back-annotation of gate-level hardware metrics.
Fig. 15 ASIP design flow with LISATek: generated software tools (compiler, linker, simulator) and a synthesis path to a gate-level model, yielding evaluation results such as chip size, clock speed and power consumption
In [34] it has been exemplified how the LISATek architecture exploration methodology can be used to optimise the performance of an ASIP for an IPv6 security application. In this case study, the goal was to enhance a given processor architecture (MIPS32) by means of dedicated machine instructions and a microarchitecture for fast execution of the compute-intensive Blowfish encryption algorithm (Fig. 16) in IPsec. Based on initial application C code profiling, hot spots were identified that provided first hints on appropriate custom instructions. The custom instructions were implemented as a coprocessor (Fig. 17) that communicates with the MIPS main processor via shared memory. The coprocessor instructions were accessed from the C compiler generated from the LISA model via compiler intrinsics. This approach was feasible due to the small
Fig. 16 Blowfish encryption algorithm: a 64-bit plaintext block is split into two 32-bit halves and processed in 16 Feistel iterations using subkeys P1..P18 and round function F
Fig. 14 LISA operation example: execute stage of an ADD instruction in a cycle-true model with forwarding hardware modelling
Fig. 17 MIPS32 core and IPsec coprocessor, each with its own simulator, coupled via shared data and program buses (data_bus, prog_bus) and memory APIs (core, kseg_rom, useg_rom, data_ram, prog_rom)
number of custom instructions required, which could easily be utilised with small modifications of the initial Blowfish C source code. LISATek-generated instruction-set simulators embedded into a SystemC based cosimulation environment were used to evaluate candidate instructions and to optimise the coprocessor's pipeline microarchitecture at a cycle-accurate level. Finally, the architecture implementation path via LISA-to-VHDL model translation and gate-level synthesis was used for further architecture fine-tuning. The net result was a 5x speedup of Blowfish execution over the original MIPS at the expense of an additional coprocessor area of 22 k gates. This case study demonstrates that ASIPs can provide excellent efficiency combined with IP reuse opportunities for similar applications from the same domain. Simultaneously, the iterative, profiling-based exploration methodology permits us to achieve such results quickly, i.e. typically within a few man-weeks. The capability of modelling the ISA behaviour in arbitrary C/C++ code makes LISA very flexible with respect to different target architectures and enables the generation of high-speed ISA simulators based on the JITCC technology [35]. As in the MIMOLA and Target approaches, LISATek follows the 'single golden model' paradigm, i.e. only one ADL model (or automatically generated variants of it) is used throughout the design flow in order to avoid consistency problems and to guarantee correct-by-construction software tools during architecture exploration. Under this paradigm, the construction of retargetable compilers is a challenging problem, since in contrast to special-purpose languages like CGD, the ADL model is not tailored towards compiler support only. Instead, similar to MIMOLA/MSSQ (see Section 3.1), the compiler-relevant information needs to be extracted with special techniques. This is discussed in more detail in the following Section.

5 C compiler retargeting in the LISATek platform
5.1 Concept

The design goals for the retargetable C compiler within the LISATek environment were to achieve high flexibility and good code quality at the same time. Normally, these goals are contradictory, since the more the compiler can exploit knowledge of the range of target machines, the better is the code quality, and vice versa. In fact, this inherent trade-off has been a major obstacle for the successful introduction of retargetable compilers for quite some time. However, a closer look reveals that this only holds for 'push-button' approaches to retargetable compilers, where the compiler is expected to be retargeted fully automatically once the ADL model is available. If compiler retargeting follows a more pragmatic user-guided approach (naturally at the cost of a slightly longer design time), then one can escape from the above dilemma. In the case of the LISA ADL, an additional constraint is the unrestricted use of C/C++ for operation behaviour descriptions. Due to the need for flexibility and high simulation speed, it is impossible to sacrifice this description vehicle. However, this makes it very difficult to automatically derive the compiler semantics of operations, due to large syntactic variances in operation descriptions. In addition, hardware-oriented languages like ADLs do not at all contain certain types of compiler-related information, such as C type bit widths, function calling conventions etc., which makes an interactive GUI-based retargeting environment useful. In order to maximise the reuse of existing, well-tried compiler technology and to achieve robustness for real-life applications, the LISATek C compiler builds on the CoSy system (Section 3.4) as a backbone. Since CoSy is capable of generating the major backend components (code selector, register allocator, scheduler) automatically, it is sufficient to generate the corresponding CGD fragments (see Figs. 10-12) from a LISA model in order to implement an entire retargetable compiler tool chain.
Figs. 18 and 19 Two instruction encoding formats (compositions) of a 16-bit instruction word: composition 0 packs two parallel 8-bit instructions (e.g. reg8_insn1) with register arguments into bits 15-8 and 7-0, while composition 1 holds a single 16-bit instruction (imm16_insn) with register and immediate arguments
special-purpose registers such as the stack pointer) can be covered by a few one-time user interactions in the GUI. As explained in Section 2.3, generation of the instruction scheduler is driven by two types of tables: latency tables and reservation tables. Both are only implicit in the ADL model. Reservation tables model inter-instruction conflicts. Similar to the MSSQ compiler (Section 3.1), it is assumed that all such conflicts are represented by instruction encoding conflicts. (This means that parallel scheduling of instructions with conflicting resource usage is already prohibited by the instruction encoding itself. Architectures for which this assumption is not valid appear to be rare in practice, and if necessary there are still simple workarounds via user interaction, e.g. through manual addition of artificial resources to the generated reservation tables.) Therefore, reservation tables can be generated by examining the instruction encoding formats in a LISA model. Figures 18 and 19 exemplify the approach for two possible instruction formats or compositions. Composition 0 is VLIW-like and allows two parallel 8-bit instructions to be encoded. In composition 1, the entire 16 instruction bits are required, owing to an 8-bit immediate constant that needs to be encoded. In the corresponding LISA model, these two formats are modelled by means of a switch/case language construct. The consequence of this instruction format for the scheduler is that instructions that fit into one of the 8-bit slots of composition 0 can be scheduled in either of the two, while an instruction that requires an immediate operand blocks other instructions from being scheduled in parallel. The scheduler generator analyses these constraints and constructs virtual resources to represent the inter-instruction conflicts. Naturally, this concept can be generalised to handle more complex, realistic cases for wider VLIW instruction formats. Finally, a reservation table in CGD format (see Fig. 12) is emitted for further processing with CoSy. (We also generate a custom scheduler as an optional bypass of the CoSy scheduler; the custom one achieves better scheduling results for certain architectures [36].) The second class of scheduler tables, latency tables, depends on the resource accesses of instructions as they run through the different instruction pipeline stages. In LISA, cycle-accurate instruction timing is described via activation sections inside the LISA operations. One operation can invoke the simulation of other operations downstream in the pipeline during subsequent cycles, e.g. an instruction fetch stage would typically be followed by a decode stage, and so forth. This explicit modelling of pipeline stages makes it possible to analyse the reads and writes to registers at a cycle-true level. In turn, this information permits the extraction of different types of latencies, e.g. due to a data dependency. An example is shown in Fig. 20, where there is a typical four-stage pipeline (fetch, decode, execute, writeback) for a load/store architecture with a central register file. (The first stage in any LISA model is the main operation that is called for every new simulation cycle, similar to the built-in semantics of the main function in ANSI C.) By tracing the operation activation chain, one can see that a given instruction makes two read accesses in stage decode and a write access in stage writeback.
For arithmetic instructions executed on the ALU, for instance, this implies a latency value of two (cycles) in the case of a data dependency. This information can again be translated into CGD scheduler latency tables (Fig. 12). The current version of the scheduler generator, however, is not capable of automatically analysing forwarding/bypassing hardware, which is frequently used to minimise latencies due to pipelining. Hence, the fine-tuning of latency tables is performed via user interaction in the compiler retargeting GUI, which allows the user to add specific knowledge in order to override potentially over-conservative scheduling constraints, so as to improve code quality.
As sketched in Section 2.3, retargeting the code selector requires the specification of instruction patterns (or mapping rules) used to cover a data flow graph representation of
Fig. 20 Register file accesses of an instruction during its processing over different pipeline stages
the intermediate representation (IR). Since it is difficult to extract instruction semantics from an arbitrary C/C++ specification as in the LISA behaviour models, this part of backend retargeting is the least automated. Instead, the GUI offers the designer a mapping dialogue (Fig. 21, see also [37]) that allows for manual specification of mapping rules. This dialogue enables the drag-and-drop composition of mapping rules, based on (a) the IR operations that need to be covered for a minimal operational compiler and (b) the available LISA operations. In the example from Fig. 21, an address computation at the IR level is implemented with two target-specific instructions (LDI and ADDI) at the assembly level. Although significant manual retargeting effort is required with this approach, it is much more comfortable than working with a plain compiler generator such as CoSy (Section 3.4), since the GUI hides many compiler internals from the user and takes the underlying LISA processor model explicitly into account, e.g. concerning the correct assembly syntax of instructions.

Fig. 21 Mapping dialogue for the specification of code selector mapping rules

Moreover, it ensures very high flexibility with respect to different target processor classes, and the user gets immediate feedback on the consistency and completeness of the mapping specification. The major drawback of the above approach, however, is a potential model consistency problem, since the LISA model is essentially overlaid with a (partially independent) code selector specification. In order to eliminate this problem, yet retain flexibility, the LISA language has recently been enhanced with semantic sections [38]. These describe the behaviour of operations from a pure compiler perspective and in a canonical fashion. In this way, semantic sections eliminate syntactic variances and abstract from details such as internal pipeline register accesses or certain side effects that are only important for synthesis and simulation. Figure 22 shows a LISA code fragment for an ADD instruction that generates a carry flag. The core operation (dst = src1 + src2) could be analysed easily in this example, but in reality more C code lines might be required to capture this behaviour precisely in a pipelined model. The carry flag computation is modelled with a separate if-statement, but the detailed modelling style might obviously vary. However, the compiler only needs to know that the operation adds two registers and that it generates a carry, independent of the concrete implementation. The corresponding semantics model (Fig. 23) makes this information explicit. Semantics models rely on a small set of precisely defined micro-operations (_ADDI for integer add in this example) and capture compiler-relevant side effects with special attributes (e.g. _C for carry generation). This is feasible, since the meaning of generating
a carry flag (and similarly for other flags like zero or sign) in instructions like ADD does not vary between different target processors. Frequently, there is no one-to-one correspondence between IR operations (compiler-dependent) and micro-operations (processor-dependent). Therefore, the code selector generator that works with the semantic sections must be capable of implementing complex IR patterns by sequences of micro-operations. For instance, it might be necessary to implement a 32-bit ADD on a 16-bit processor by a sequence of an ADD followed by an ADD-with-carry. For this lowering, the code selector generator relies on an extensible default library of transformation rules. Conversely, some LISA operations may have complex semantics (e.g. a DSP-like multiply-accumulate) that cover multiple IR operations at a time. These complex instructions are normally not needed for an operational compiler but should be utilised in order to optimise code quality. Therefore, the code selector generator analyses the LISA processor model for such instructions and automatically emits mapping rules for them. The use of semantic sections in LISA enables a much higher degree of automation in code selector retargeting, since the user only has to provide the semantics per LISA operation, while mapping rule generation is completely automated (except for user interactions possibly required to extend the transformation rule library for a new target processor). The semantics approach eliminates the above-mentioned model consistency problem at the expense of introducing a potential redundancy problem. This redundancy is due to the coexistence of separate behaviour (C/C++) and semantics (micro-operation) descriptions. The user has to ensure that behaviour and semantics do not contradict each other. However, this redundancy is easy to deal with in practice, since behaviour and semantics are local to each single LISA operation.
Moreover, as outlined in [39], the coexistence of both descriptions can even be avoided in some cases, since one can generate the behaviour from the semantics for certain applications.
5.4 Results
The retargetable LISATek C compiler has been applied to numerous different processor architectures, including RISC, VLIW and network processors. Most importantly, it has been possible to generate compilers for all architectures with limited effort, of the order of some man-weeks, depending on the processor complexity. This indicates that the semi-automatic approach outlined in Section 5 works for a large variety of processor architectures commonly found in the domain of embedded systems. While this flexibility is a must for retargetable compilers, code quality is an equally important goal. Experimental results confirm that the code quality is generally acceptable. Figure 24 shows a comparison between the gcc compiler (Section 3.2) and the CoSy-based LISATek C compiler for a MIPS32 core and some benchmark programs. The latter is an out-of-the-box compiler that was designed within two man-weeks, while the gcc compiler, owing to its wide use, probably incorporates significantly more manpower. On average, the LISATek compiler shows an overhead of 10% in performance and 17% in code size. With specific compiler optimisations added to the generated backend, this gap could certainly be narrowed further. Further results for a different target (Infineon PP32 network processor) show that the LISATek compiler generates better code (40% in performance, 10% in code
Fig. 22 Modelling of an ADD operation in LISA with carry flag generation as a side effect
Fig. 23 Compiler semantics modelling of the ADD operation from Fig. 22 and a micro-operation for register file access (micro-operation _REGI)
IEE Proc.-Comput. Digit. Tech., Vol. 152, No. 2, March 2005
Fig. 24 data: execution cycles and code size, each measured for gcc -O4, gcc -O2, cosy -O4 and cosy -O2
Fig. 24 Comparison between the gcc compiler and the CoSy-based LISATek C compiler
size) than a retargeted lcc compiler (Section 3.3), owing to more built-in code optimisation techniques in the CoSy platform. Another data point is the ST200 VLIW processor, where the LISATek compiler has been compared to the ST Multiflow, a heavily optimising target-specific compiler. In this case, the measured overhead has been 73% in performance and 90% in code size, which is acceptable for an out-of-the-box compiler that was designed with at least an order of magnitude less time than the Multiflow. Closing this code quality gap would require adding special optimisation techniques, e.g. in order to utilise predicated instructions, which are currently ignored during automatic compiler retargeting. Additional optimisation techniques are also expected to be required for highly irregular DSP architectures, where the classical backend techniques (Section 2.3) tend to produce unsatisfactory results. From our experience we conclude that such irregular architectures can hardly be handled in a completely retargetable fashion, but will mostly require custom optimisation engines for the highest code quality. The LISATek/CoSy approach enables this by means of a modular, extensible compiler software architecture, naturally at the expense of an increased design effort.

6 Summary and outlook
Motivated by the growing use of ASIPs in embedded SoCs, retargetable compilers have made their way from academic research to the EDA industry and to application by system and semiconductor houses. While still being far from perfect, they increase design productivity and help to obtain better quality of results. The flexibility of today's retargetable
compilers for embedded systems can be considered satisfactory, but more research is required on how to make code optimisation more retargetable. We envision a pragmatic solution where optimisation techniques are coarsely classified with respect to different target processor families, e.g. RISCs, DSPs, NPUs and VLIWs, each of which shows typical hardware characteristics and optimisation requirements. For instance, software pipelining and utilisation of SIMD (single instruction multiple data) instructions are mostly useful for VLIW architectures, while DSPs require address code optimisation and a closer coupling of different backend phases. Based on a target processor classification given by the user with respect to the above categories, an appropriate subset of optimisation techniques would be selected, each of which is retargetable only within its family of processors. Apart from this, we expect growing research interest in the following areas of compiler-related EDA technology:

Compilation for low power and energy: Low power and/or low energy consumption have become primary design goals for embedded systems. As such systems are more and more dominated by software executed by programmable embedded processors, it is obvious that compilers, too, may play an important role, since they control the code efficiency. At first glance, it appears that program energy minimisation is identical to performance optimisation, assuming that power consumption is approximately constant over the execution time. However, this is only a rule of thumb, and the use of fine-grained instruction-level energy models [40, 41] shows that there can be a trade-off between the two optimisation goals, which can be explored with special code generation techniques. The effect is somewhat limited, though, when neglecting the memory subsystem, which is a major source of energy consumption in SoCs.
More optimisation potential is offered by the exploitation of small on-chip (scratchpad) memories, which can be treated as entirely compiler-controlled, energy-efficient caches. Dedicated compiler techniques, such as [42, 43], are required to ensure an optimum use of scratchpads for program code and/or data segments.

Source-level code optimisation: In spite of powerful optimising code transformations at the IR or assembly level, the resulting code can only be as efficient as the source code passed to the compiler. For a given application algorithm, an infinite number of C code implementations exist, possibly each resulting in different code quality after compilation. For instance, downloadable reference C implementations of new algorithms are mostly optimised for readability rather than performance, and high-level design tools that generate C as an output format usually do not pay much attention to code quality. This motivates the need for code optimisations at the source level, e.g. C-to-C transformations, that complement the optimisations performed by the compiler, while retaining the program semantics. Moreover, such C-to-C transformations are inherently retargetable, since the entire compiler is used as a backend in this case. Techniques like [44-46] exploit the implementation space at the source level to significantly optimise code quality for certain applications, while tools like PowerEscape [47] focus on efficient exploitation of the memory hierarchy in order to minimise the power consumption of C programs.

Complex application-specific machine instructions: Recent results in ASIP design automation indicate that a high performance gain is best achieved with complex application-specific instructions that go well beyond classical custom instructions like multiply-accumulate for DSPs. While there are approaches to synthesising such custom
instructions based on application code analysis [48-50], the interaction with compilers is not yet well understood. In particular, tedious manual rewriting of the source code is still required in many cases to make the compiler aware of new instructions. This slows down the ASIP design process considerably; in an ideal environment the compiler would automatically exploit custom instruction-set extensions. This will require a generalisation of classical code selection techniques to cover more complex constructs like directed acyclic graphs or even entire program loops.

7 References
1 Aho, A.V., Sethi, R., and Ullman, J.D.: 'Compilers: principles, techniques, and tools' (Addison-Wesley, 1986)
2 Appel, A.W.: 'Modern compiler implementation in C' (Cambridge University Press, 1998)
3 Muchnick, S.S.: 'Advanced compiler design & implementation' (Morgan Kaufmann Publishers, 1997)
4 Wilhelm, R., and Maurer, D.: 'Compiler design' (Addison-Wesley, 1995)
5 Mason, T., and Brown, D.: 'lex & yacc' (O'Reilly & Associates, 1991)
6 Bischoff, K.M.: 'Design, implementation, use, and evaluation of Ox: an attribute-grammar compiling system based on Yacc, Lex, and C'. Technical Report 92-31, Dept. of Computer Science, Iowa State University, USA, 1992
7 Leupers, R., Wahlen, O., Hohenauer, M., Kogel, T., and Marwedel, P.: 'An executable intermediate representation for retargetable compilation and high-level code optimization'. Int. Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS), Samos, Greece, July 2003
8 Aho, A., Johnson, S., and Ullman, J.: 'Code generation for expressions with common subexpressions', J. ACM, 1977, 24, (1)
9 Fraser, C.W., Hanson, D.R., and Proebsting, T.A.: 'Engineering a simple, efficient code-generator generator', ACM Lett. Program. Lang. Syst., 1992, 1, (3), pp. 213-226
10 Bashford, S., and Leupers, R.: 'Constraint driven code selection for fixed-point DSPs'. 36th Design Automation Conf. (DAC), 1999
11 Ertl, M.A.: 'Optimal code selection in DAGs'. ACM Symp. on Principles of Programming Languages (POPL), 1999
12 Chow, F., and Hennessy, J.: 'Register allocation by priority-based coloring', SIGPLAN Not., 1984, 19, (6)
13 Briggs, P.: 'Register allocation via graph coloring'. PhD thesis, Dept. of Computer Science, Rice University, Houston, TX, USA, 1992
14 Liao, S., Devadas, S., Keutzer, K., Tjiang, S., and Wang, A.: 'Storage assignment to decrease code size'. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 1995
15 Lam, M.: 'Software pipelining: an effective scheduling technique for VLIW machines'. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 1988
16 Wagner, J., and Leupers, R.: 'C compiler design for network processors', IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2001, 20, (11)
17 Wilson, T., Grewal, G., Halley, B., and Banerji, D.: 'An integrated approach to retargetable code generation'. 7th Int. Symp. on High-Level Synthesis (HLSS), 1994
18 Leupers, R., and Marwedel, P.: 'Retargetable compiler technology for embedded systems: tools and applications' (Kluwer Academic Publishers, 2001)
19 Marwedel, P., and Nowak, L.: 'Verification of hardware descriptions by retargetable code generation'. 26th Design Automation Conf., 1989
20 Leupers, R., and Marwedel, P.: 'Retargetable code generation based on structural processor descriptions', Design Automation for Embedded Systems, 1998, 3, (1) (Kluwer Academic Publishers)
21 Leupers, R., and Marwedel, P.: 'A BDD-based frontend for retargetable compilers'. European Design & Test Conf. (ED&TC), 1995
22 Van Praet, J., Lanneer, D., Goossens, G., Geurts, W., and De Man, H.: 'A graph based processor model for retargetable code generation'. European Design and Test Conf. (ED&TC), 1996
23 Target Compiler Technologies: http://www.retarget.com
24 Mishra, P., Dutt, N., and Nicolau, A.: 'Functional abstraction driven design space exploration of heterogeneous programmable architectures'. Int. Symp. on System Synthesis (ISSS), 2001
25 Free Software Foundation/EGCS: http://gcc.gnu.org
26 Fraser, C., and Hanson, D.: 'A retargetable C compiler: design and implementation' (Benjamin/Cummings, 1995)
27 Fraser, C., and Hanson, D.: LCC home page, http://www.cs.princeton.edu/software/lcc
28 Associated Compiler Experts: http://www.ace.nl
29 Orailoglu, A., and Veidenbaum, A.: 'Application specific microprocessors (guest editors' introduction)', IEEE Des. Test Comput., 2003, 20, pp. 6-7
30 Tensilica Inc.: http://www.tensilica.com
31 Stretch Inc.: http://www.stretchinc.com
32 Hoffman, A., Meyr, H., and Leupers, R.: 'Architecture exploration for embedded processors with LISA' (Kluwer Academic Publishers, 2002), ISBN 1-4020-7338-0
33 CoWare Inc.: http://www.coware.com
34 Scharwaechter, H., Kammler, D., Wieferink, A., Hohenauer, M., Karuri, K., Ceng, J., Leupers, R., Ascheid, G., and Meyr, H.: 'ASIP architecture exploration for efficient IPSec encryption: a case study'. Int. Workshop on Software and Compilers for Embedded Systems (SCOPES), 2004
35 Nohl, A., Braun, G., Schliebusch, O., Hoffman, A., Leupers, R., and Meyr, H.: 'A universal technique for fast and flexible instruction set simulation'. 39th Design Automation Conf. (DAC), New Orleans, LA, USA, 2002
36 Wahlen, O., Hohenauer, M., Braun, G., Leupers, R., Ascheid, G., Meyr, H., and Nie, X.: 'Extraction of efficient instruction schedulers from cycle-true processor models'. 7th Int. Workshop on Software and Compilers for Embedded Systems (SCOPES), 2003
37 Hohenauer, M., Wahlen, O., Karuri, K., Scharwaechter, H., Kogel, T., Leupers, R., Ascheid, G., Meyr, H., Braun, G., and van Someren, H.: 'A methodology and tool suite for C compiler generation from ADL processor models'. Design Automation & Test in Europe (DATE), Paris, France, 2004
38 Ceng, J., Sheng, W., Hohenauer, M., Leupers, R., Ascheid, G., Meyr, H., and Braun, G.: 'Modeling instruction semantics in ADL processor descriptions for C compiler retargeting'. Int. Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS), 2004
39 Braun, G., Nohl, A., Sheng, W., Ceng, J., Hohenauer, M., Scharwaechter, H., Leupers, R., and Meyr, H.: 'A novel approach for flexible and consistent ADL-driven ASIP design'. 41st Design Automation Conf. (DAC), 2004
40 Lee, M., Tiwari, V., Malik, S., and Fujita, M.: 'Power analysis and minimization techniques for embedded DSP software', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 1997, 5, (2)
41 Steinke, S., Knauer, M., Wehmeyer, L., and Marwedel, P.: 'An accurate and fine grain instruction-level energy model supporting software optimizations'. Proc. PATMOS, 2001
42 Steinke, S., Grunwald, N., Wehmeyer, L., Banakar, R., Balakrishnan, M., and Marwedel, P.: 'Reducing energy consumption by dynamic copying of instructions onto onchip memory'. ISSS, 2002
43 Kandemir, M., Irwin, M.J., Chen, G., and Kolcu, I.: 'Banked scratchpad memory management for reducing leakage energy consumption'. ICCAD, 2004
44 Falk, H., and Marwedel, P.: 'Control flow driven splitting of loop nests at the source code level'. Design Automation and Test in Europe (DATE), 2003
45 Liem, C., Paulin, P., and Jerraya, A.: 'Address calculation for retargetable compilation and exploration of instruction-set architectures'. 33rd Design Automation Conf. (DAC), 1996
46 Franke, B., and O'Boyle, M.: 'Compiler transformation of pointers to explicit array accesses in DSP applications'. Int. Conf. on Compiler Construction (CC), 2001
47 PowerEscape Inc.: http://www.powerescape.com
48 Sun, F., Ravi, S., et al.: 'Synthesis of custom processors based on extensible platforms'. ICCAD, 2002
49 Goodwin, D., and Petkov, D.: 'Automatic generation of application specific processors'. CASES, 2003
50 Atasu, K., Pozzi, L., and Ienne, P.: 'Automatic application-specific instruction-set extensions under microarchitectural constraints'. DAC, 2003