IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 1, JANUARY 2011

Exploring Area and Delay Tradeoffs in FPGAs With Architecture and Automated Transistor Design
Ian Kuon, Member, IEEE, and Jonathan Rose, Fellow, IEEE
Abstract: Field-programmable gate arrays (FPGAs) are used in a variety of markets that have differing cost, performance and power consumption requirements. While it would be ideal to serve all these markets with a single FPGA family, the diversity in the needs of these markets means that generally more than one family is appropriate. Consequently, FPGA vendors have moved to provide a diverse set of families that sit at different points in the area-speed-power design space. This paper aims to understand the circuit and architectural design attributes of FPGAs that enable tradeoffs between area and speed, and to determine the magnitude of the possible tradeoffs. This will be useful for architects seeking to determine the number of device families in a suite of offerings, as well as the changes to make between families. We explore a broad range of architectures and circuit designs and have developed a transistor sizing tool that automatically optimizes each design. In this paper, we describe this tool and demonstrate that it achieves results that are comparable to past work but with vastly less effort. We then use the designs produced by the tool to explore the range of tradeoffs possible. We find that through architecture and transistor sizing changes it is possible to usefully vary the area of an FPGA by a factor of 2.0 and the performance of an FPGA by a factor of 2.1. We also observe that the range of area and delay tradeoffs possible by varying only the transistor sizing of a single architecture is larger than the ranges observed in past architectural experiments. In addition to transistor size, we note that LUT size is one of the most useful parameters for trading off area and delay.

Index Terms: Architecture, area-delay tradeoffs, field-programmable gate arrays (FPGAs), transistor sizing.

I. INTRODUCTION

FIELD-PROGRAMMABLE gate arrays (FPGAs) have evolved to the point that they are now used in a wide range of markets including communications, consumer electronics, automotive, industrial and high-performance computing. The needs of these markets can be very different, with some requiring the best performance while others are more focused on minimizing cost. These differing requirements make it difficult for a single FPGA family to adequately serve these varied market needs. As a result, industry practice has moved to provide different FPGA families to cater to these different needs. It is now common for FPGA manufacturers to offer a high-end, high-performance family [1]-[3] and a lower cost, lower performance family [4]-[6]. This trend is almost certain to continue as new processes require FPGA architects to make increasingly difficult design choices between cost, performance and power. These choices can dramatically affect the gap between FPGAs and both the fully and partially fabricated application-specific integrated circuits (ASICs) with which they compete. This gap was previously measured for a high-performance FPGA and it was found that a pure soft-logic FPGA (with only routing and lookup table (LUT)-based logic and without hard memory or other blocks) is 35 times larger, 3 to 4 times slower and consumes 14 times more dynamic power than the equivalent standard-cell ASIC implementation [7]. For different markets, area, performance and power are of varying importance and closing one of the gaps could be essential. This makes the ability to trade off one attribute for another particularly important. If fixed-function chips begin embedding programmable logic, there will be an even greater need for FPGAs that make varied cost, performance and power tradeoffs.

Manuscript received January 24, 2009; revised June 24, 2009. First published October 09, 2009; current version published December 27, 2010. I. Kuon was with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada. He is now with the Altera Toronto Technology Center, Toronto, ON M5S 1S4, Canada (e-mail: [email protected]). J. Rose is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TVLSI.2009.2031318

One technique frequently considered for improving the cost, performance and power of an FPGA is the use of hard memory and multiplier blocks. This was investigated in [7] and hard blocks were found to be useful in narrowing the area gap down to potentially as low as 4.7 times larger than an ASIC. However, these blocks had only a moderate impact on the power gap and had virtually no impact on the delay gap. Clearly, the soft logic of the FPGA remains an important factor in determining the area, performance and power of an FPGA, and the current work will focus exclusively on the soft logic of FPGAs.

To date, there has been little exploration of the tradeoffs possible through circuit design and transistor sizing between area, speed and power within FPGAs.
Past studies have focused almost exclusively on high-level logical architecture such as changes to the routing [8], logic block [9], [10] or both [11], [12]. However, the high-level architecture is only one variable in the design of an FPGA that can be adjusted since, for every architecture, there is a range of possible electrical implementations. These different implementations can trade off cost, performance and power through the use of different circuit structures or transistor sizings, and this has been largely ignored in past studies [8], [9], [11]. Some electrical design issues such as supply voltage or threshold voltage optimization have been explored [12] but, again, little attention has been paid to the issues of transistor sizing and circuit structure. Instead, it has generally been assumed that there is only a single circuit structure and transistor sizing of interest, such as that which minimizes the circuit's area-delay product. However, as is often seen in the custom design world, there is a range of logically equivalent but electrically distinct implementations [13]. We believe the same holds true for programmable circuits and, in this paper, we explore the range of area and delay tradeoffs that are possible in the design of an FPGA by varying both its logical architecture and its electrical implementation. This exploration can inform architects of the extent to which area and delay can be improved for FPGAs and is a step towards understanding how many different FPGA families are necessary. To help this analysis, we demonstrate quantitatively the impact of these tradeoffs for soft logic on the area and delay gap with ASICs.

1063-8210/$26.00 © 2009 IEEE

Exploring these tradeoffs requires the optimization of a multitude of different combinations of logical architecture and electrical implementation. Performing such optimization manually is not feasible and, therefore, we have developed a tool to perform the optimization of the soft-logic fabric of an FPGA. A custom tool is necessary because, while transistor sizing for custom circuits is a well-studied problem [14]-[17], the programmable nature of FPGAs adds unique challenges and opportunities which we will describe. While the main goal of our optimizer is to enable the architecture exploration process, our tool could also be useful to new entrants to the market trying to quickly create new FPGA or FPGA-like devices. The optimization tool will be described in subsequent sections.

It is important to note that we aim to observe the size of the design space for general-purpose FPGAs. Another degree of freedom in optimization would be to create FPGAs that are tailored towards specific application domains [18]. However, we believe there is a need for different general-purpose FPGAs that occupy different points within the design space.

The remainder of the paper is organized as follows. A brief background reviewing the basic architectural, experimental methodology and circuit assumptions on which this work is based is presented in Section II.
Section III describes the FPGA-specific transistor-level optimization tool that was developed to assist in exploring the design space, and the quality of results from this optimizer is measured in Section IV. Section V outlines the procedure used to measure the performance and area of the FPGAs designed using the optimizer. Then, Section VI examines the range of tradeoffs possible with transistor sizing. The impact of transistor sizing in conjunction with architectural changes is then explored in Section VII to determine both the magnitude of changes possible and the parameters that provide the most leverage when making area and delay tradeoffs. This understanding of the tradeoffs is then applied in Section VIII to the FPGA-to-ASIC gap to determine the potential impact of these tradeoffs on the gap. Finally, Section IX concludes.

Preliminary versions of this work were published in [19], [20]. This final version improves on that past analysis with an updated measurement methodology presented in Section VI, and new measurement results are included in Section VII. As well, more detailed analysis of the results and their impact is performed in Sections VII and VIII.

II. BACKGROUND

We wish to explore the tradeoffs possible in the design of an FPGA. (The design of an FPGA should not be confused with the design using an FPGA.) This exploration requires accurate measurements of the area and performance of an FPGA design and these measurements are typically obtained using an experimental process as shown in Fig. 1. This assessment process takes benchmark circuits and synthesizes, packs, places and routes them onto the candidate FPGA design. This process is necessary because the programmable nature of FPGAs means delay, area and power can vary between applications implemented on the same FPGA design. Each design is defined by two attributes: its logical architecture and its transistor-level design. The logical architecture defines the logical behavior of the FPGA including the number and size of the LUTs grouped together and the structure of the routing segments that connect those clusters of LUTs. The transistor-level design defines the area, delay and power of the FPGA's components. With these inputs, the effective area, speed and power of the FPGA can be determined from the experimental procedure.

Fig. 1. FPGA architecture assessment.

One of the significant challenges in this process is the transistor-level design, as this has historically required months of manual effort [8], [9], [11]. Before describing the automated transistor sizing tool we developed to address this issue, the logical architecture and transistor-level design assumptions will be reviewed.

A. Logical Architecture

We focus exclusively on the classic island-style FPGAs consisting of a soft logic cluster-based logic block (CLB) surrounded by programmable routing [21]. This structure and the main architectural parameters are shown in Fig. 2. We further limit ourselves to a homogeneous routing topology in which all the routing tracks are unidirectional as described in [8], [22], [23] and all the segments within each track have the same length. The length of a segment, $L$, is defined as the number of logic blocks it reaches. We assume that there are an equal number of tracks in the horizontal and vertical directions and we refer to this quantity as the channel width, $W$. A fraction of the tracks in a channel, $F_{c,in}$, connect to each of the logic block's input pins and an output of the logic block can connect to a fraction, $F_{c,out}$, of the tracks.
Each CLB is composed of one or more Basic Logic Elements (BLEs) and each BLE is made up of a LUT with $K$ inputs and a flip-flop. The number of BLEs in a logic block is defined as the cluster size, $N$. When logic blocks contain more than one BLE, programmable intra-cluster routing connects the logic block inputs and outputs to the inputs of each BLE. The intra-cluster routing is assumed to be fully populated as each BLE input is able to connect to all the logic block inputs and all the BLE outputs. All these assumptions and parameters define the logical architecture of the FPGA.

Fig. 2. FPGA logical architecture. (a) Island-style architecture. (b) Cluster-based logic block.

While the architecture of modern FPGAs is significantly more complex with multiple types of routing segments, logic blocks with more features such as adders [3], [24] and various different types of logic blocks such as multiplier [3], [24], memory [3], [24] and processor [3] blocks, focusing on the comparatively simple architecture described above is still useful for exploring the area-delay tradeoffs we will consider in this work. Despite the new features, the basic LUT and flip-flop is still crucially important as it gives an FPGA its general-purpose capabilities, and the tradeoffs made in the design of that basic logic and routing continue to have a significant impact on the overall area and performance of FPGAs. If anything, the additional architectural features found in modern devices have the potential to further expand the range of tradeoffs possible when designing an FPGA since with each new component comes the capability to adjust its performance or area.

B. FPGA Tiles

A single logic block and its neighboring routing channels must be instantiated thousands of times to create a complete FPGA. It is not practical to individually optimize each logic block and routing track. Instead, in this work we assume that only a single tile consisting of a logic block and the neighboring routing tracks is designed and then instantiated repeatedly. This is a standard method for creating FPGAs [25], [26]. This use of a single tile places restrictions on the logical architectures that can be explored. In particular, the channel width is restricted to multiples of twice the segment length [8]. (For bidirectional routing tracks, the channel width must be a multiple of the segment length.) This quantization of the channel width ensures that every tile is identical with an equal number of routing tracks starting and stopping in each tile. When determining the architectural parameters for our experiments we ensure this quantization is maintained.

C. Circuit Assumptions

At the circuit level, the FPGA architectures we will consider consist purely of multiplexers, inverters, configuration memory, and user-circuit flip-flops. The flip-flops are a relatively small part of the design that does not significantly affect an FPGA's performance or area. (They typically consume less than 5% of the FPGA's area.) Therefore, we do not investigate the range of possible flip-flop implementations. The remaining structures make up the majority of the FPGA's area, with the configuration memory consuming upwards of 30% of the area. The design of the configuration memory and the inverters is straightforward. For the configuration memory, a standard 6-transistor SRAM cell is assumed. However, multiplexers have a range of possible electrical implementations. We assume multiplexers are constructed using nMOS pass transistors. To restore signals to the full rail, a level-restoring pMOS is added to the inverters connected to the multiplexer output. A one-level nMOS pass transistor tree is assumed for multiplexers with a width less than 4 (i.e., a one-hot encoding) and a two-level nMOS tree is assumed for all larger multiplexers as described in [27], [28]. (The benefits and disadvantages of these different multiplexer implementations were investigated in [29].)

The high-performance 90 nm CMOS process from ST Microelectronics [30] will be used exclusively in this work. We use only standard transistors and assume a supply voltage of 1.2 V. While 90 nm CMOS is no longer the most advanced technology, it remains a useful medium for exploration as it is a well understood process with mature device models. As well, in other work [31], we have investigated the impact of the latest process technologies on FPGA design choices and we found that the latest technologies do not significantly alter design decisions.

D. Further Restrictions

To keep this work tractable, we place some limitations on the design changes we will explore. First, we will focus only on area and delay tradeoffs. Power consumption tradeoffs will not be considered. This is reasonable since we have confirmed, as has been previously reported [10], that power consumption is closely related to area for many architectural changes. Techniques such as power gating [12] can alter the relationship between area and power consumption but these techniques are not supported by our current CAD tools. As well, such power management techniques do not reduce the need for a thorough understanding of area and delay tradeoffs because power management is generally applied to maintain performance while reducing power consumption. Therefore, we will only consider changes in transistor sizes within the previously described circuit structures and we will not consider threshold voltage or supply voltage changes such as those in [12] since those changes are only useful for power tradeoffs.

III. TRANSISTOR-LEVEL OPTIMIZATION TOOL

As described previously, exploring area and delay tradeoffs for FPGAs requires the transistor-level optimization of a wide range of FPGA designs. It is not feasible to manually size the circuitry for each possibility as has been done in past architectural experiments because such studies required months of work


to produce a single FPGA design [9], [11]. To address this difficulty we have developed a custom transistor-level optimization tool and, in this section, we will describe this tool.

A. Problem Definition

Before describing our automated transistor sizing tool for FPGAs, we examine the unique features of this problem and describe how we handle these issues. The broad architecture explorations we aim to enable require a great deal of flexibility in the optimizer and, in this section, we will also review the inputs necessary to provide this flexibility.

The transistor sizing optimization problem for FPGAs is on the surface similar to the problem faced by any custom circuit designer. It involves minimizing some objective function such as area, delay or a product of area and delay, subject to a number of constraints, including that all the transistors are at least minimum size and, if appropriate, an area or delay constraint. While this is a standard optimization problem, the unique features of programmable circuit design change both the problem and its complexity.

1) Differences in FPGA Transistor Sizing: The transistor-level optimization of custom (non-programmable) integrated circuits has been well studied and a variety of approaches have been proposed to automate this process [14], [16], [17]. Programmable circuits, and FPGAs in particular, present unique optimization challenges. The most significant is that, due to the programmability, it is not known what end-user circuit will be implemented on the FPGA. This means that the critical path is not known at design time (of the FPGA itself) and, therefore, there is no clear definition for the delay of an FPGA. As a result, improving the performance of the circuit is no longer straightforward because different circuits may place different demands on the various elements within the FPGA. This is the reason the experimental process shown in Fig. 1 is necessary and, in Section III-C, we describe how we handle this issue.
The second difference that must be considered in the design of FPGAs is the large number of logically equivalent components. All these equivalent components must be sized identically to maintain their equivalence. As a result, the designer no longer has the freedom to increase the size of one component to improve performance and, instead, all similar components must be increased in size. An example of this is a routing track and the multiplexers that select the signal driving this track as shown in Fig. 3. For performance reasons, increasing the size of the transistors in the multiplexer (up to a point) would be advantageous but, because this same multiplexer also loads the routing channel in more cases than it is used, increasing the transistor sizes may not provide the desired reduction in overall delay. We refer to this effect as logical self-loading and we note that similar issues are faced in other custom designs that involve repeated circuit structures such as memories. It is possible to alter these logical equivalency requirements by creating architectures with distinct classes of switches [32], [33]; however, for transistor-level optimization, we treat logical equivalency as a fixed constraint.

Fig. 3. Repeated equivalent parameters.

These repeated structures within the FPGA also simplify the optimization problem since there are many fewer independent variables compared to the total number of transistors. This reduction is significant since an FPGA has hundreds of millions of different transistors that can be on a user's critical path but there are only on the order of a hundred unique transistor sizes that can be varied and must be optimized.

2) Inputs: The optimizer requires the following three inputs.

Logical Architecture: The logical architecture defines the behavior of the FPGA visible to a user and includes all the parameters defined in Section II-A.

Electrical Architecture: While the logical architecture parameters define the functionality of the circuitry, such as the size of multiplexers that may be required, there are a multitude of different possible electrical implementations for each logical architecture. This includes buffer placement and multiplexer implementations, which can significantly alter the performance and area of an FPGA. Despite this importance there has been little consensus regarding the most appropriate structures to use. For multiplexers, the approaches used include fully encoded multiplexers [11], three-level partially-decoded structures [34] or two-level partially-decoded structures [27], [28], [35]. Similarly, the placement of buffers has also varied between placement at the input to multiplexers (in addition to the output) [9] or simply at the output [27], [28]. These implementation choices are left as inputs to allow architects to explore their impact.

Optimization Objective: Finally, it is also necessary to provide an objective function that will determine the tradeoffs to make between performance and area. While past architectural studies focused exclusively on minimizing the area-delay product of the FPGA [8], [11], we believe a wider range of possibilities must be considered. Different tradeoffs are made by altering the function that the optimizer aims to minimize. We focus on functions of the form $\text{Area}^b \cdot \text{Delay}^c$, with $b$ and $c$ greater than or equal to zero. By varying $b$ or $c$, the area and performance of a design can be varied significantly.
Minimizing a function of this form is more intuitive than simply minimizing delay or area given an arbitrary area or delay constraint. This optimization objective requires quantitative measures of the area and delay of an FPGA. The following sections describe the handling of these metrics given the programmability of the FPGA.

B. Area Modeling

Simple area models such as the sum of all the transistor widths have been used in the past for circuit optimization; however, since we aim to explore a wide range of architectures and area and delay tradeoffs, more accuracy is needed because unrealistic area modeling could cause the scope of these tradeoffs to be measured incorrectly.
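The optimization objective introduced in Section III-A, a product of area and delay raised to non-negative exponents, can be illustrated with a small sketch. The exponent names `b` and `c`, the `Design` record and the three candidate designs are hypothetical values chosen only for illustration; they are not results from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """A candidate sizing, summarized by its area and delay (arbitrary units)."""
    name: str
    area: float
    delay: float


def objective(area: float, delay: float, b: float, c: float) -> float:
    """Return area**b * delay**c; both exponents must be non-negative."""
    if b < 0 or c < 0:
        raise ValueError("exponents must be greater than or equal to zero")
    return area ** b * delay ** c


def best_design(designs, b: float, c: float) -> Design:
    """Pick the candidate that minimizes the chosen objective."""
    return min(designs, key=lambda d: objective(d.area, d.delay, b, c))


# Hypothetical candidates: small and slow, balanced, large and fast.
candidates = [
    Design("small", area=1.0, delay=2.0),
    Design("balanced", area=1.5, delay=1.2),
    Design("fast", area=3.0, delay=1.0),
]
```

With these made-up numbers, `best_design(candidates, 1, 0)` selects the smallest design, `best_design(candidates, 0, 1)` the fastest, and `best_design(candidates, 1, 1)` the balanced one, which shows how shifting the exponents moves the selected point along the area-delay curve.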


The most accurate area estimate would be to create complete layouts of each design but that is clearly not feasible for the large number of designs that will be created. Instead, we developed a modified version of the minimum-width transistor area model typically used in FPGA architectural studies [11]. The area of each transistor is modeled as a function of its width, $w$, using the following:

$$\text{Area}(w) = A_{\min}\left(c_1 + c_2\,\frac{w}{w_{\min}}\right) \tag{1}$$

where $A_{\min}$, $c_1$ and $c_2$ are process-specific constants and $w_{\min}$ is the minimum transistor width. $c_1$ and $c_2$ are constants that were conventionally both 0.5 [11] but, in this work, we adjusted them to more closely match our process rules. $A_{\min}$ is defined as the area required for a minimum-width transistor and the spacing around it to satisfy the process design rules. The total area of a design is obtained by summing up the area for each transistor except for the transistors used to make the configuration SRAM bits as their area is counted separately.

The transistors that make up the SRAM cells that give FPGAs their programmability are laid out efficiently as they are instantiated millions of times; therefore, the area of these transistors is modeled differently to account for the compact layout and the significant diffusion sharing that is possible between neighboring bits. Specifically, an SRAM cell was designed and laid out in our target technology. The area of this cell assuming diffusion sharing with neighboring bits was then measured. The total area for a design is taken as the sum of the SRAM transistors and the regular transistors.

Finally, to calibrate the model, the predicted area was compared to the actual layout area for a few large cells, such as a 32-input multiplexer, and a scale factor that minimizes the error between the areas was determined. Applying the scale factor to the estimated area gives the final predicted area for the design.

FPGAs are generally created using a single tile, containing the logic block and the neighboring routing tracks, that is instantiated repeatedly, so only the area of a single tile need be computed.
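As a rough illustration of the area model described above, the following sketch assumes the linear form of the conventional minimum-width transistor model, in which a transistor's area is a minimum-width area multiplied by `c1 + c2 * w / w_min`, with both coefficients defaulting to the conventional value of 0.5 [11]. All numeric values, the function names, and the choice to apply the calibration scale factor to the combined total are assumptions made for illustration; the process-specific constants used in this work are not reproduced here.

```python
def transistor_area(width, w_min=1.0, a_min=1.0, c1=0.5, c2=0.5):
    """Area of one transistor in units of a_min (assumed linear model)."""
    if width < w_min:
        raise ValueError("transistors cannot be smaller than minimum width")
    return a_min * (c1 + c2 * width / w_min)


def tile_area(widths, n_sram_bits, sram_bit_area, scale=1.0, **model):
    """Estimated tile area.

    widths        -- widths of all non-SRAM transistors in the tile
    n_sram_bits   -- number of configuration bits in the tile
    sram_bit_area -- measured area of one laid-out SRAM cell
    scale         -- calibration factor fitted against real layouts
                     (applied to the total here, which is an assumption)
    """
    regular = sum(transistor_area(w, **model) for w in widths)
    sram = n_sram_bits * sram_bit_area
    return scale * (regular + sram)
```

For example, with the default coefficients a minimum-width transistor has area 1.0 and a transistor three times as wide has area 2.0, so area grows more slowly than width, unlike a plain sum of transistor widths.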
This area of a single tile will serve as the area metric for the optimizer. (The tile area will also be used to estimate the interconnect lengths for delay measurements.) In architectural studies, to account for the varying amounts of logic in a tile, an effective area will be determined by multiplying the tile area by the number of tiles necessary to implement a set of circuits.

C. Delay Modeling

As described previously, one challenge in the optimization of programmable circuitry is that the eventual critical path is not known at design time and, in fact, the path will vary between application circuits. To handle this issue, we create a representative path which will be used during optimization. Unlike traditional circuit optimization, in which all the paths through a combinational block must be optimized to achieve a desired delay, we instead create a single path containing all the unique components within the FPGA. This path is the shortest register-to-register path within an FPGA that uses every unique resource.

While the path contains every component in the FPGA, it is unlikely to be equivalent to any user's critical path. Typical critical paths might use some resources repeatedly while never using other resources. Therefore, we do not directly optimize the delay of the path. Instead, we measure the delay for each component in the path and combine these delay measurements to produce a representative delay measurement in which the delay for each component is weighted according to the frequency with which it was encountered in the critical paths of a set of benchmark circuits. The specific weights used for the different circuit components are summarized in Table I. The final design produced by the optimizer was not extremely sensitive to the specific weights used for the optimization and only when extreme weights were used, such as putting negligible weight on routing resources, did the quality of the design suffer. Other possibilities, such as averaging the delay for a number of small critical paths, were also considered; however, the weighted path average was selected as it was the most efficient in terms of computation time. This is described in greater detail in [29].

TABLE I. OPTIMIZATION PATH COMPONENTS AND WEIGHTS

D. Algorithm

In this section, we give an overview of the optimization algorithm. For the first phase of optimization, delay estimation using simple linear device models is used to obtain a near-optimal sizing given those models. Linear device models have long been known to be inaccurate [36] and, therefore, to account for the true behavior of transistors, a second phase refines the sizes using a simple algorithm combined with simulation using accurate device models. The process is shown in Fig. 4 and described below.

Fig. 4. FPGA electrical optimization methodology.

1) Phase 1: Linear-Based Models: In this first phase of optimization, transistors are treated as simple linear resistors and capacitors. The delay of the circuit is computed using the standard Penfield-Rubinstein RC model [37]. The optimization of this delay given the linear components is a well-studied problem [15]-[17], [38] and it has been recognized that, for the objective functions we use in this work, the optimization problem can be mapped to a convex one [16], which implies that any local minimum obtained is in fact the global minimum. Since


getting trapped in sub-optimal local minima is not a concern, a relatively simple algorithm can be used and, for this work, a modified version of the TILOS algorithm was used [16]. We describe the algorithm as changing parameter values, not specific transistor sizes, to emphasize that transistors are not sized independently since preserving logical equivalency requires groups of transistors to have the same sizes.

The algorithm starts with all the transistors set to minimum size. For each parameter, the improvement in the objective function per change in area is measured. This improvement per area increase is termed the parameter's sensitivity. With only a single representative path to optimize, the sensitivity of every parameter must be measured since they can all affect the delay. As in the original TILOS algorithm, the parameter with the greatest sensitivity is increased and the process repeats. The TILOS algorithm was modified to also decrease the size of the parameter with the most negative sensitivity. Negative sensitivity means that increasing the parameter increases the objective function. Therefore, decreasing the parameter improves (reduces) the objective function. This eliminates one of the limitations of TILOS which can prevent it from achieving optimal results.

This phase of the optimization terminates when no parameters require adjustment. Due to the ability to both increase and decrease transistor sizes, the possibility of oscillatory changes (i.e., alternating between increasing and decreasing a parameter value) exists. A number of strategies, which will not be described due to space constraints (full details can be found in [29]), are used to avoid such oscillations; however, this leads to the possibility that the sizing when the algorithm terminates is not optimal. We do not believe this is a concern because, regardless of the solution obtained, refinement of the sizings is necessary to account for the non-linear behavior of the devices.
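The sensitivity-driven loop of this first phase can be sketched on a toy problem. Everything concrete below is an assumption made for illustration: the `Stage` fields, the simple chain RC delay with a fixed source resistance `r_src`, the area proxy, the fixed step size, and the rule of accepting only strictly improving moves, which stands in for the oscillation-avoidance strategies of [29]. The sketch also applies a single best move (increase or decrease) per iteration rather than reproducing the exact update order of the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One sizable parameter on the representative path (hypothetical model)."""
    r: float        # drive resistance at unit size
    c_in: float     # input capacitance per unit size
    c_par: float    # parasitic output capacitance per unit size
    c_fixed: float  # size-independent load, e.g. wire capacitance
    weight: float   # relative frequency on benchmark critical paths
    count: int      # logically equivalent copies sharing this size (>= 1)


def path_delay(sizes, stages, r_src=1.0):
    """Weighted sum of per-stage linear RC delays for a simple chain."""
    delay = r_src * stages[0].c_in * sizes[0]  # fixed source drives stage 0
    for i, (x, s) in enumerate(zip(sizes, stages)):
        nxt = stages[i + 1].c_in * sizes[i + 1] if i + 1 < len(stages) else 0.0
        delay += s.weight * (s.r / x) * (s.c_par * x + s.c_fixed + nxt)
    return delay


def total_area(sizes, stages):
    """Area proxy: every copy of a parameter grows with its size."""
    return sum(s.count * x for x, s in zip(sizes, stages))


def objective(sizes, stages, b, c):
    return total_area(sizes, stages) ** b * path_delay(sizes, stages) ** c


def size_parameters(stages, b, c, step=0.25, min_size=1.0, max_iters=100_000):
    """Greedy sizing: repeatedly apply the move with the best gain per area."""
    sizes = [min_size] * len(stages)  # start with everything at minimum size
    for _ in range(max_iters):
        base_obj = objective(sizes, stages, b, c)
        base_area = total_area(sizes, stages)
        best = None  # (improvement per unit area change, index, new size)
        for i in range(len(sizes)):
            for delta in (step, -step):  # try both growing and shrinking
                new = sizes[i] + delta
                if new < min_size:
                    continue
                trial = sizes[:i] + [new] + sizes[i + 1:]
                gain = base_obj - objective(trial, stages, b, c)
                if gain <= 1e-12:  # only strictly improving moves
                    continue
                score = gain / abs(total_area(trial, stages) - base_area)
                if best is None or score > best[0]:
                    best = (score, i, new)
        if best is None:  # no parameter requires adjustment
            break
        sizes[best[1]] = best[2]
    return sizes
```

Because every accepted move strictly lowers the objective, this sketch cannot alternate between growing and shrinking the same parameter, at the cost of possibly stopping at a sizing that is only optimal up to the step size. The `count` field is what models logical equivalence: one size change is charged against every copy of the component.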
The refinement is done in the next phase of the algorithm.

2) Phase 2 - Sizing With Accurate Models: The sizes determined using the linear models in the previous phase are now adjusted using an algorithm that relies on accurate device models and full simulation with Synopsys HSPICE. This ensures that any delay measurements will be accurate, but it means that it generally will not be possible to obtain a provably optimal result. A variety of approaches have been previously suggested; however, the most powerful approaches such as [14] require access to internal simulator calculations and, as they are designed for the optimization of large circuits, are relatively complex. The use of a single representative path as described in Section III-C means that interrelated circuit paths do not have to be considered, which simplifies the problem considerably. Furthermore, even though FPGAs contain on the order of millions of transistors, the representative path contains only thousands of transistors, making optimization using HSPICE feasible.

We employ a simple iterative greedy algorithm for optimization. Every parameter is considered in turn and is simulated across a range of values. The value that gives the best result is selected and the process then repeats for the next parameter. Once all the parameters have been considered, the algorithm terminates if no parameter values required adjustment or repeats over all the parameters if any values were changed. Such a greedy algorithm is not well-suited to adjusting the sizes of transistors individually because closely connected transistors must typically be sized together to realize the full benefit

TABLE II INTERCONNECT BUFFER SIZING OPTIMIZATION

of size increases. We address this by operating on small groups of related transistors. For example, in a two-stage buffer, one adjustment considered by the optimizer is the adjustment of all four transistors together (i.e., increasing the sizes of all the transistors by a common factor). However, we retain the freedom to adjust each transistor size individually as well, to enable improvements such as those possible by skewing the p-n ratios to offset the slow rise times of multiplexers with nMOS pass transistors.

IV. VALIDATION AND QUALITY OF RESULTS

The goal of our tool is to produce transistor sizings that can enable architectural explorations of FPGAs. In this section, we compare the quality of results we obtain to past work that used manual or partially automated approaches to achieve a similar objective. Our tool handles much larger problems than were previously considered, as we aim to optimize all the soft logic of an FPGA and not just small pieces of it as was done in the past [9], [11], [28]. With our automated approach we gain the ability to search a much larger design space, and we will see that this enables significant performance improvements. To provide further confidence in our results, we demonstrate that for a simplified problem in which exhaustive sizing is possible, we can obtain similar results.

A. Past Interconnect Optimizers

The transistor sizing of buffers for the routing interconnect, similar to that shown in Fig. 3, was considered in [28]. For 180 nm CMOS, the buffer sizes for a range of different interconnect segment lengths were optimized to minimize delay, with an exhaustive search to determine the overall buffer size and the number of inverter stages to use. The size ratio between the inverters within the buffer was then determined analytically. The sizes and test circuit structure used in [28] were simulated¹ and the delay results were compared to those obtained when our optimizer was used.
Table II summarizes this comparison across the full range of interconnect lengths considered in [28]. The results derived from [28] are given in the second column. The third column lists the results obtained using our sizer. Clearly, we are able to achieve performance matching that obtained in [28]. The area of these designs was within of the estimated area of [28]. In [28], the size of the transistors within the routing multiplexers was fixed at minimum size. A benefit of our tool is that more thorough optimization and design space exploration is possible. When we allow the sizes of the multiplexer transistors
¹ Due to different interconnect modeling methods and HSPICE versions, slightly different delay results were obtained when we replicated the simulation from [28]. This difference is minor and less than 11 ps at worst.

KUON AND ROSE: EXPLORING AREA AND DELAY TRADEOFFS IN FPGAS WITH ARCHITECTURE AND AUTOMATED TRANSISTOR DESIGN


TABLE III COMBINED INTRA-CLUSTER ROUTING AND LUT DELAYS

TABLE IV EXHAUSTIVE SEARCH COMPARISON

to be optimized, significant delay improvements were possible, as shown in column 4. For the shortest interconnect segment, an improvement of 36% in delay was obtained at the cost of increased area.

B. Past Manual Transistor Sizing

While the work in [28] focused on routing circuit design, [9] performed detailed sizing of the logic block. Delays for various segments within the logic block were optimized for a range of cluster sizes and LUT sizes using 180 nm CMOS. We used the same technology and the same optimization objective (area-delay) with our optimization tool.² The results obtained are compared with the results from [9] in Table III. The table lists the delay from the input of the logic block to the output across a range of cluster sizes. The results obtained using our optimizer closely match the previously reported delays, but the present results were obtained with a fraction of the effort. While one can see that the present work produces designs that are slower by 6% in some cases, these differences may be due to the vastly different area models used in these works. As well, the work in [9] considered the sizing of the routing and the logic block independently, while our approach sized the routing and logic together. This combined approach ensures that the area and delay are appropriately balanced throughout the design, such that if more area is needed for logic/routing then the area for routing/logic can be reduced appropriately to ensure that the overall area-delay of the design is optimized.

C. Comparison to Exhaustive Search

We also compared the results obtained from our greedy-algorithm-based optimization described in Section III-D to the best results obtained from an exhaustive search, with both optimizing for delay. It is only possible to optimize the sizes of a small number of transistors for this comparison because the number of test cases quickly grows unreasonably large for the exhaustive search.
Furthermore, to make the simulation time reasonable, the path under optimization was simplified to be a properly loaded routing segment similar to that shown in Fig. 3. With this simplified path, a comparison involving the optimization of three transistor sizes was possible. We varied the specific transistors whose size could be adjusted. The delay for the routing segment from our sizer and the exhaustive search were compared for four different combinations of adjustable transistors. The two results were consistently within 1.2% of each other. Table IV lists all the comparison results. For these cases, our optimization tool was 30.6 times faster than the exhaustive search. For a larger number of adjustable sizes, the exhaustive search quickly becomes infeasible.

² There are differences in circuit structure between these works, as [9] assumed that the selection signals to the multiplexers were driven at voltages above the nominal supply voltage (gate boosting) and [9] used a fully-encoded multiplexer structure. These two differences likely partially cancel each other out.

V. AREA AND PERFORMANCE MEASUREMENT METHODOLOGY

The results from the previous section demonstrate that the optimization tool can produce designs that are comparable to past work. However, those results only measured the performance of specific portions of an FPGA. The inherent programmability of FPGAs means that until an FPGA is programmed with an end-user's design there is no definitive measure of the performance or area of the FPGA. Only after a circuit is implemented on an FPGA is it possible to measure the performance of the FPGA in a meaningful manner. Similarly, determining the effective area of an FPGA also requires the implementation of end-user circuits on the FPGA to accurately determine the resources required. In this section, the specific methodology used to measure the performance and the area of an FPGA implementation is defined.

A. Performance

The performance of a particular FPGA implementation is measured experimentally using the 20 largest MCNC benchmarks [39]. Each benchmark circuit is implemented through a complete CAD flow on the input FPGA fabric and a final delay measurement is generated as an output. The geometric mean delay of all the circuits is then used as the figure of merit for the performance of the FPGA implementation. The steps involved in this process are illustrated in Fig. 5. Note that this fully implemented metric is significantly different from the simple one used inside the transistor sizing tool.

Synthesis, packing, placement and routing of the benchmark circuit onto the FPGA are done using SIS with FlowMap [40], T-VPack [41] and VPR [42] (an updated version of VPR that handles unidirectional routing is used [31]). Placement and routing are repeated with 10 different seeds for placement. The placement and routing with the best performance is used.
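As a small illustration of the figure of merit just described, the sketch below keeps the best (lowest) delay over the placement seeds for each benchmark and then takes the geometric mean across benchmarks. The function names and the delay values are hypothetical placeholders, not measurements from this work.

```python
import math
from typing import Dict, List


def geometric_mean(values: List[float]) -> float:
    """Geometric mean computed in the log domain to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def fpga_performance(delays_by_benchmark: Dict[str, List[float]]) -> float:
    """Figure of merit: geometric mean of each benchmark's best-seed delay."""
    best_per_benchmark = [min(seed_delays)
                          for seed_delays in delays_by_benchmark.values()]
    return geometric_mean(best_per_benchmark)


# Hypothetical critical-path delays (ns), one entry per placement seed.
example = {
    "bench_a": [4.0, 2.0, 3.0],
    "bench_b": [8.0, 9.0, 8.5],
}
print(fpga_performance(example))
```

Using the geometric mean rather than the arithmetic mean keeps a single slow benchmark from dominating the comparison between two FPGA implementations.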
The tools cannot directly make use of the transistor size definitions of the FPGA fabric and, instead, a simplified timing model must be provided. This timing model is encapsulated in VPR's architecture file and includes fixed delays for both the routing tracks and the paths within the logic block. We generate this file automatically from the transistor size definition. After placement and routing is complete, VPR performs timing analysis to determine the critical path of the design implemented on the FPGA. While this provides an approximate measure of the FPGA's performance, it is not sufficiently accurate for our purposes since the relatively simple delay model


Fig. 6. Area delay space for the architecture in Table V.

TABLE V ARCHITECTURE PARAMETERS

Fig. 5. Performance measurement methodology.

does not accurately capture the complex behavior of transistors in current technologies. To address this, we have created a modified version of VPR that emits the transistor-level circuitry of the critical path. This circuit is then simulated, with the appropriate transistor sizes and structures, using HSPICE. The delay as measured by HSPICE is used to define the performance of this benchmark implemented on this particular FPGA implementation.

One concern with this method is that routing and timing analysis in VPR are performed with the inaccurate timing model and, as a result, poor routing choices may be made or timing analysis may incorrectly predict the design's critical path. Adequately addressing this problem would likely require simulation of the k-most critical paths in a design to verify the timing analysis results, and this quickly becomes computationally infeasible for a reasonable k and a large number of benchmark circuits. Furthermore, if significant discrepancies are observed, it suggests that inappropriate routing decisions may also have been made, and addressing this challenge would require overhauling the entire timing analysis engine within VPR.

However, we do not believe this to be a concern in this work for two reasons. First, we observed for any individual sizing a high degree of correlation between the critical path delays as reported by VPR and the simulated delay in HSPICE of those same critical paths for the full set of benchmarks. For different sizings, however, the delay measurements between VPR and HSPICE are not as consistent; hence the need for HSPICE simulation in the first place (more details can be found in [29]). Second, in this work we restrict ourselves to relatively simple routing architectures that are homogeneous. With only a single type of routing interconnect, the fidelity of VPR's timing analysis is relatively good.

B. Area

To allow for comparisons across different logical architectures (e.g., with different LUT or cluster sizes), the final area metric is the product of the area of an individual tile, estimated using the method described in Section III-B, and the number of tiles (or equivalently clusters) required for all the benchmark circuits. This metric ensures that changes that alter the amount

of usable logic are reflected in the area measurement. This is particularly important when the number of inputs to a cluster is reduced or when the size of the LUT is changed, since those changes alter the amount of usable logic within a tile/cluster.

VI. AREA AND DELAY TRADE-OFFS

We now employ the methodology described above to first examine the magnitude of the area and delay changes possible with transistor sizing. Not all tradeoffs are useful and, therefore, we discuss and define what we believe to be the boundaries of useful tradeoffs.

A. Transistor Sizing for a Single Architecture

For any specific architecture, transistor sizing enables a range of implementations between the two extremes of minimum delay and minimum area solutions. The different implementations occupy different points in the area-delay design space. Fig. 6 plots these different points that form the delay versus area curve for the cluster size 10, 4-LUT architecture fully described in Table V. As described in Section V-A, delay is measured as the geometric mean of the critical paths of the 20 largest MCNC benchmarks [39] when placed and routed on an FPGA with the particular transistor sizing. Area is measured as described in Section V-B using the model detailed in Section III-B to estimate the size of the tile. Transistor sizing clearly enables a large range of area and delay possibilities, with a range of 2.2 in area from the smallest to largest design and 8.0 from the fastest to slowest design. The imbalance between the area and delay ranges as well as the sharp slopes seen in the figure both suggest that not all these tradeoffs are useful. This issue is examined in the


TABLE VI AREA AND DELAY IMPACT OF TRANSISTOR SIZING AND PAST ARCHITECTURAL CHANGES

following section. However, it is first informative to compare the full area-delay range from transistor sizing to the full range seen when architectural parameters have been varied in past studies. This comparison is shown in Table VI. In each case, the range is measured as the largest area or delay relative to the smallest area or delay observed for the architectures considered. Previously, the largest range was achieved when cluster size and LUT sizes were both varied. In that case, ranges of 3.2 and 1.7 were observed in delay and area respectively [9]. While the area range is of a similar magnitude to that seen from transistor sizing, the delay range from architectural changes is considerably smaller than that from transistor sizing. This indicates the significant effect transistor sizing can have on performance.

B. Interesting Trade-Offs

The goal in exploring the area and performance tradeoffs is to understand how the gap between FPGAs and ASICs can be selectively narrowed by exploiting these tradeoffs. However, the tradeoffs considered must be useful and, as seen in the previous section, an imbalance between the area and delay tradeoffs occurs at the extremes of the transistor sizing tradeoff curve shown earlier. Selecting the regions in which the tradeoffs are useful is a somewhat arbitrary decision. Intuitively, this region is where the elasticity [43], defined as

Fig. 7. Determining designs that offer interesting tradeoffs. (a) Design B is not interesting. (b) Design B may be interesting. (c) Design B is not interesting. (d) Design B may be interesting.

$$E = \frac{dD}{dA} \cdot \frac{A}{D} \tag{2}$$

(with $D$ the delay and $A$ the area) is neither too small nor too large. Since we do not have a differentiable function relating the delay and area for an architecture, we approximate the elasticity as:

$$E \approx \frac{\%\,\Delta \text{Delay}}{\%\,\Delta \text{Area}} \tag{3}$$

An elasticity of -1 means that a 1% area increase achieves a 1% performance improvement. Clearly, a 1-for-1 tradeoff between area and delay is useful. However, based on conversations with a commercial FPGA architect, Trevor Bauer from Xilinx, we will view the tradeoffs as useful and interesting when at most a 3% area increase is required for a 1% delay reduction (an elasticity of -1/3) and when a 1% area decrease increases delay by at most 3% (an elasticity of -3). This factor of three that determines the degree to which area and delay tradeoffs can be imbalanced will be called the elasticity threshold factor. All points within the range of elasticities set by the threshold factor will make up what we call the interesting range of tradeoffs. While this restriction only explicitly considers delay and area, it has the effect of eliminating designs with excessive power consumption

because those designs would generally also have significant area demands.

This approach is appropriate for considering the interesting regions of a single area-delay curve. A more involved approach is necessary when considering discrete designs, such as those from [11], [44], or multiple different tradeoff curves. In such cases, the process for determining the interesting designs is as follows. First, the set of potentially interesting designs is determined by examining the designs ordered by their area. Starting from the minimum area design, each design is considered in turn. A design is added to the set of potentially interesting designs if its delay is lower than that of all the other designs currently in the potentially interesting set. This first step eliminates all designs that cannot be interesting because other designs provide better performance for less area. The next step applies the area-delay tradeoff criterion to determine which designs are interesting.

Two possibilities must be considered when evaluating whether a design is interesting. These two possibilities are illustrated in Fig. 7 through four examples. In these examples, we will determine if the three designs labeled A, B and C are in the interesting set. Design B is first compared to design A using the -1/3 elasticity requirement, as shown in Figs. 7(a) and 7(b). If the delay improvement in B relative to A is too small compared to the additional area required, as it is in Fig. 7(a), then design B would be rejected. In Fig. 7(b), the delay improvement is sufficiently large and design B could be accepted as interesting. However, the design must also be compared to design C. In this case, the -3 elasticity requirement is used. If the delay of B relative to C is too large compared to the area savings of B relative to C, the design would not be included in the interesting set. Such a case is shown in Fig. 7(c). An example in which design B is interesting based on this test is illustrated in Fig. 7(d).
A design whose delay satisfies both the -1/3 and the -3 requirements is included in the interesting set. At the boundaries of minimum area or minimum delay (i.e., design A and design C respectively, if these were the only three designs being considered) only the one applicable elasticity threshold must be satisfied.
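The three-design test described above can be sketched as follows, under stated assumptions. The `Design` tuple, the function names, and the example numbers are hypothetical. The percentage changes are taken relative to the comparison design (A or C), which is one reading of the approximation and is not spelled out in the text.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    name: str
    area: float
    delay: float


def potentially_interesting(designs: List[Design]) -> List[Design]:
    """Step 1: walk designs by increasing area, keeping each one whose
    delay is lower than every design kept so far."""
    kept: List[Design] = []
    for d in sorted(designs, key=lambda d: (d.area, d.delay)):
        if not kept or d.delay < min(k.delay for k in kept):
            kept.append(d)
    return kept


def elasticity(base: Design, other: Design) -> float:
    """Approximate elasticity: % change in delay over % change in area,
    both measured relative to `base` (an assumption of this sketch)."""
    pct_delay = (other.delay - base.delay) / base.delay
    pct_area = (other.area - base.area) / base.area
    return pct_delay / pct_area


def is_interesting(a: Design, b: Design, c: Design,
                   threshold: float = 3.0) -> bool:
    """Step 2 for a middle design b, with a smaller/slower and c larger/faster.

    b must buy enough delay reduction for its extra area relative to a
    (elasticity <= -1/threshold) and must not give up too much delay for
    its area savings relative to c (elasticity >= -threshold).
    """
    gains_enough_vs_a = elasticity(a, b) <= -1.0 / threshold
    loses_little_vs_c = elasticity(c, b) >= -threshold
    return gains_enough_vs_a and loses_little_vs_c


# Hypothetical designs: area in arbitrary units, delay in ns.
a = Design("A", 100.0, 10.0)
b = Design("B", 110.0, 9.5)
c = Design("C", 120.0, 9.4)
print(is_interesting(a, b, c))
```

In this example B costs 10% more area than A for a 5% delay reduction (an elasticity of -0.5), which passes the first test. Relative to C it saves about 8% area for about 1% more delay, which passes the second.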


Fig. 8. Full area delay space.

When examining more than three designs, the process is the same except the comparison designs A and C need not be actual designs. Instead, those two points represent the minimum and maximum interesting delays possible for the areas required for designs A and C respectively. Equivalently, designs A and C are the largest or smallest designs respectively that satisfied the -1/3 or -3 elasticity threshold. If no such designs exist, then the minimum area or delay of actual designs, respectively, would be used.

With this restriction to the interesting region, the range of tradeoffs for the results in Fig. 6 is decreased to a range of 1.4 in delay from slowest to fastest and 1.5 in area from largest to smallest. Clearly, there is a significant reduction in the effective design space, but the range is still appreciable and it demonstrates that there is a range of designs for a specific architecture that can be useful. Applying these same criteria to the past investigation of LUT size and cluster size [44], we find that the range of useful tradeoffs is 1.2 for delay from fastest to slowest and 1.1 for area from largest to smallest. This space is smaller than the range observed for transistor sizing changes of a single architecture. From the perspective of designing FPGAs for different points in the design space, transistor sizing appears to be the more powerful tool. However, architecture and transistor sizing need not be considered independently and, in the following section, we examine the size of the design space when these attributes are varied in tandem.

VII. TRADEOFFS WITH TRANSISTOR SIZING AND ARCHITECTURE

For each logical architecture, a whole range of different transistor sizings, each with different performance and area, is possible. In the previous section, only a single architecture was considered, but now we explore varying the transistor sizes for a range of architectures. We considered a range of architectures with varied routing track lengths (L), cluster sizes (N) and LUT sizes (K).
A comparison between architectures is most useful if the architectures present the same ease of routing. Therefore, as each parameter is varied, it is necessary to adjust other related architectural parameters such as the channel width (W) and the input/output pin flexibilities. We determine appropriate values for the channel width experimentally by finding the minimum width needed to route our benchmark circuits. The minimum channel width is increased by 20% and rounded to the nearest multiple of twice the routing segment length to get the final width which, as described earlier, is necessary to ensure a tileable design. The input pin flexibility is determined experimentally as the minimum flexibility which does not increase the channel width requirements. The output flexibility is set as a function of the cluster size.

For each architecture, the full range of transistor sizing optimization objectives was considered, and the results for all these architectures and sizes are plotted in Fig. 8. In total, 60 logical architectures were considered. With the different sizings for each architecture, this gives a total of 1331 distinctly sized architectures. Each point in the figure is a different combination of architecture and transistor sizing. Again, it is necessary to consider which points within this design space are interesting in the manner defined above. Based on the criteria with an elasticity threshold factor of 3, the smallest/slowest and fastest/largest interesting designs are highlighted in the figure. These designs and the design that achieved the minimum area-delay product are listed in Table VII along with the area, delay and area-delay product for each design. Compared to conventional experiments, which would have only considered the minimum area-delay point useful, we see that in fact there is a wide range of designs that are interesting when different design objectives are considered. The span of these designs is of particular interest and is summarized in Table VII.
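The channel-width rule given earlier in this section (add 20% to the minimum routable width, then round to the nearest multiple of twice the segment length) can be written as a short helper. This is a sketch: the function name is hypothetical, and the behavior on exact ties and the lower bound of one multiple are assumptions, since the text does not specify them.

```python
def final_channel_width(min_width: int, segment_length: int,
                        margin: float = 0.20) -> int:
    """Pad the minimum routable channel width by `margin` and round to the
    nearest multiple of twice the routing segment length (tileable width).

    Exact ties follow Python's round(), which is an assumption here.
    """
    unit = 2 * segment_length            # width must be a multiple of 2L
    target = min_width * (1.0 + margin)  # 20% above the minimum by default
    multiples = max(1, round(target / unit))
    return unit * multiples


# Hypothetical example: length-4 segments, minimum routable width of 33 tracks.
print(final_channel_width(33, 4))
```

For the example call, the padded target is 39.6 tracks and the nearest multiple of 8 is 40.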
In terms of area, we see that there is a range of 2.0 from the largest design to the smallest design and, in terms of delay, the range is 2.1 from the slowest design relative to the


TABLE VII SPAN OF DIFFERENT SIZINGS/ARCHITECTURE

Fig. 10. Area delay space with varied cluster sizes.

These results highlight the importance of considering transistor sizing during pure architectural investigations. When optimizing for delay, the best cluster sizes were, in order, 10, 12, 8, 6, 4, and 2, but when optimizing for area a completely different ordering of 4, 6, 8, 2, 10, 12 results. Clearly, a specific design objective must be considered during architectural studies and any architectural conclusions may only be valid for that specific objective.
Fig. 9. Area delay space with varied routing segment lengths.

C. LUT Size

Finally, for a cluster size of 8 and routing segments of length 4, a range of LUT sizes from 3 to 7 was examined across a full range of transistor sizings. The results are shown in Fig. 11. In terms of area-delay tradeoffs, we see that LUT size is clearly the most useful architectural parameter to vary of the parameters considered. At different points in the design space, different LUT sizes are clearly best. For minimum area-delay product, a LUT size of 4 is best but, for better performance, larger LUT sizes are advantageous. Similarly, smaller LUT sizes are useful when area is a more important concern. In comparison to the other parameters, LUT size provides the most significant leverage for trading off area and delay. The exploration of the complete design space also showed this, as the LUT sizes of the fastest, smallest, and area-delay optimal designs in Table VII were all different.

D. Multiplexer Buffer Placement

The previous sections have explored the tradeoffs possible through transistor sizing and logical architecture changes. It is also possible to vary the structure of the circuits within the FPGA. While there is consensus as to the best implementations for many portions of the design, as discussed in Section III-A, one circuit question that has not been fully resolved is whether to use buffers prior to multiplexers in the routing structures. For example, in Fig. 3, a buffer could be placed at positions a and b to isolate the routing track from the multiplexers. In terms of delay, the potential advantage of the pre-multiplexer buffer is that it reduces the load on the routing tracks, because only a single buffer can be used to drive the multiple multiplexers that connect to the track in a given region (for example, at position a in Fig. 3). The disadvantage is the addition of another stage of delay. Both logical architecture, which affects the number of multiplexers connecting to a track, and electrical design, which

fastest design. It is clear that when creating new FPGAs there is a great deal of freedom in the area-delay tradeoffs that can be made and, as can be seen in Table VII, both transistor sizing and architecture are key to achieving this full range. To gain a deeper understanding of this space and how to best make these tradeoffs, we now explore a number of parameters independently.

A. Segment Length

Fig. 9 plots the transistor sizing curves for architectures with 4-LUT clusters of size 10 with the routing segment lengths varying from 1 to 8. It is immediately clear that the length-1 and length-2 architectures are not useful in terms of area and delay tradeoffs. Similar conclusions have been made in past investigations [11]. From the tradeoff perspective, the remaining segment lengths are all very similar. Clearly, segment length is not a powerful tool for adjusting area and delay, as a single segment length generally offers universally improved performance.

B. Cluster Size

A range of cluster sizes from 2 to 12 was examined and the results across the full range of transistor sizings are shown in Fig. 10. The routing segment length is 4 and the clusters were composed of 4-LUT BLEs. This is a more promising parameter for making area and delay tradeoffs because at different points in the design space, different cluster sizes are best. For high performance (at the cost of high area), the larger cluster sizes are best but, for smaller area (with worse performance), smaller cluster sizes are best. However, the differences are not extremely large, and we conclude that cluster size is only of limited use when making these design tradeoffs.


design, the delays and areas are provided both in a tabular format and as a file that can be directly used with an academic FPGA placement and routing tool, VPR [31], that is commonly used for architecture exploration. We believe this is the first time that accurate and complete component areas and delays have been made available publicly.

VIII. TRADEOFFS AND THE FPGA:ASIC GAP

The previous sections have demonstrated that there is a wide range of interesting area and delay tradeoffs that can be made through varied architecture and transistor sizing. One goal in examining the tradeoffs was to understand how these tradeoffs could be used to selectively narrow the area and delay gaps between FPGAs and ASICs and, in this section, we investigate the impact of the tradeoff ranges observed on the gap measurements from [7].

The preceding work has demonstrated that the design space for FPGAs is large and that, by simply varying the transistor sizing of a design, the area and delay of an FPGA can be altered dramatically. This presents a challenge to exploring the impact of the tradeoff ranges because the area and delay gap measurements were performed for a single commercial FPGA family, the Stratix II, and the tradeoff decisions made by the Stratix II's designers and architects to conserve area or improve performance are not known. As a result, the specific point occupied by this family within the large design space is unknown.

To address this issue, we consider a range of circumstances for the possible tradeoffs that could be applied to the area-delay gap. For example, in one case, it is assumed that the Stratix II was designed to be at the performance extreme of the interesting region. Based on that assumption, it could be possible to narrow the area gap by trading off performance for area savings and create a design that is instead positioned at the area extreme of the region. We compute the possible narrowed gap by applying the tradeoff range factors determined previously.
We compute the possible narrowed gap by applying the tradeoff range factors determined previously. This is done as follows

Fig. 11. Area delay space with varied LUT sizing (N = 8).

Fig. 12. Area delay tradeoffs with varied pre-multiplexer inverter usage.

determines the size (and hence load) of the transistors in the multiplexers relative to the size of the transistors in the buffer, may impact the decision to use the pre-multiplexer buffers.

We investigated this issue for the architecture described in Table V using 90 nm CMOS. Using the procedure described above, the effective area and delay were determined using the full experimental flow for a range of varied transistor sizings without a buffer, with a single inverter, and with a two-inverter buffer. Fig. 12 plots the area-delay curves for each of these cases. It is interesting to consider the full area-delay space because it might be possible that for different transistor sizings the buffers might become useful. However, in Fig. 12, we see that across the range of the design space the fastest delay for any given area is obtained without using the buffers. For this architecture, no pre-multiplexer buffering is appropriate. Similar results were obtained for other cluster sizes as well.

E. Enabling Architecture Exploration

The automated transistor sizing tool underlying this work enabled the exploration of a wide range of different FPGA designs. To assist future FPGA architecture researchers, we have published, at http://www.eecg.utoronto.ca/vpr/architectures, the areas and delays for these varied designs. Hundreds of different combinations of logical architecture, optimization objective and process technology are included and, for each

$$\text{Area Gap}_{\text{after tradeoff}} = \frac{\text{Area Gap}}{\text{Area Range}} \tag{4}$$

$$\text{Delay Gap}_{\text{after tradeoff}} = \text{Delay Gap} \times \text{Delay Range} \tag{5}$$

These tradeoffs clearly narrow the gap in only one dimension and in the other dimension the gap grows larger. If we were to assume that the Stratix II was designed with a greater focus on area, then the tradeoffs would be applied in the opposite manner, with the area gap growing and the delay gap narrowing by the area range and delay range factors respectively.

The results for a variety of cases are summarized in Table VIII. The row labeled "Baseline" repeats the area and delay gap measurements for soft logic only (i.e., not including the effect of the hard logic blocks such as memories and multipliers) as reported in [7]. The subsequent rows list the area and delay gaps when the area and delay tradeoffs are used. The "Starting Point" column refers to the position within the design space that the FPGA occupies before making any tradeoffs and the "Ending Point" column describes the position in the design space after making the tradeoffs. Three positions within the design space are considered: Area, Delay and Area-Delay. The Area

KUON AND ROSE: EXPLORING AREA AND DELAY TRADEOFFS IN FPGAS WITH ARCHITECTURE AND AUTOMATED TRANSISTOR DESIGN

83

TABLE VIII POTENTIAL IMPACT OF AREA AND DELAY TRADEOFFS ON SOFT LOGIC FPGA TO ASIC GAP

TABLE IX AREA AND DELAY TRADEOFF RANGES COMPARED TO COMMERCIAL DEVICES

IX. CONCLUSION and Delay points refer to the smallest and fastest positions (that still satisfy the interesting tradeoff requirements) in the design space respectively and the Area-Delay point refers to the point within the design space with minimal Area-Delay. For the example described above, the starting point would be the Delay point and the ending point would be the Area point. When making tradeoffs from the Area point to the Delay point or vice versa, the full area and delay range factors would be applied. For tradeoffs involving the Area-Delay point then only the range to or from that point would be considered. For example, if starting at the Delay point and ending at the Area-Delay point the partial ranges would be calculated as (6) (7) and these ranges would be applied as per (4) and (5) to determine the gap after making the tradeoffs. From the data in the table, it is clear that leveraging the tradeoffs can allow the area and delay gap to vary signicantly. In particular, it is most interesting to consider starting from the delay optimized point in the design space because the Stratix II is Alteras higher performance/higher cost FPGA family [45] at the 90 nm technology node. In that case, the area gap can be shrunk to 18 for soft logic and, if such tradeoffs were combined with the appropriate use of heterogeneous blocks, the overall area gap would shrink even lower. The row in the table with an Area starting point and a Delay ending point suggests that the delay gap could be narrowed (at the expense of area); however, this is unlikely to be possible as the Stratix II is sold as a high performance part which suggests its designers were not focused primarily on conserving area. A. Comparison With Commercial Families While the reduction in the area gap is useful, the impact on the delay gap is also signicant. It is useful to compare these tradeoffs to those found in commercial FPGA families. 
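Before turning to the commercial data, the gap arithmetic discussed above (dividing the area gap by the area range and multiplying the delay gap by the delay range) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DesignPoint`, `Gap`, `range_factors`, and `apply_tradeoff` are hypothetical, the two design points are normalized placeholders chosen so that the full ranges equal the factors of 2.0 and 2.1 reported in this paper, and the baseline gap values (35 and 3.4) are placeholders rather than figures taken from Table VIII.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DesignPoint:
    """One FPGA design, with area and delay in arbitrary normalized units."""
    area: float
    delay: float


@dataclass(frozen=True)
class Gap:
    """FPGA-to-ASIC gap: how many times larger and slower the FPGA is."""
    area: float
    delay: float


def range_factors(start: DesignPoint, end: DesignPoint) -> Tuple[float, float]:
    """Return (area_range, delay_range) for a move from start to end.

    Both factors exceed 1 when the move trades speed for area, i.e. when
    start is larger and faster than end.
    """
    return start.area / end.area, end.delay / start.delay


def apply_tradeoff(gap: Gap, area_range: float, delay_range: float) -> Gap:
    """Divide the area gap by the area range; multiply the delay gap by the delay range."""
    return Gap(area=gap.area / area_range, delay=gap.delay * delay_range)


if __name__ == "__main__":
    # Assumed, normalized design points (not measured data).
    delay_point = DesignPoint(area=2.0, delay=1.0)  # largest and fastest
    area_point = DesignPoint(area=1.0, delay=2.1)   # smallest and slowest

    # Placeholder baseline gap; the values actually used are in Table VIII.
    baseline = Gap(area=35.0, delay=3.4)

    area_range, delay_range = range_factors(delay_point, area_point)
    adjusted = apply_tradeoff(baseline, area_range, delay_range)
    print(f"area gap {adjusted.area:.1f}, delay gap {adjusted.delay:.2f}")
```

With these placeholder inputs the adjusted area gap is 35 / 2.0 = 17.5, in line with the value of about 18 quoted above. Swapping the arguments to `range_factors` yields factors below 1, so the same function also covers the opposite case in which the area gap grows and the delay gap narrows; treating both directions with one formula is a convenience of this sketch, not something the paper prescribes.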
Altera has two 90 nm FPGA families: the high performance/high cost Stratix II [45] and the lower cost/lower performance Cyclone II [46]. For the benchmarks used in [7], we measured the Stratix II to be on average 40% faster than the Cyclone II, which means that the delay range between these families is 1.40. This closely matches the delay range of 1.41 that we observed between the largest/fastest design and the minimal area-delay design. This result is summarized in Table IX. Unfortunately, the core area for the Cyclone II is not known and, therefore, a direct area comparison is not possible.

IX. CONCLUSION

This paper has explored the tradeoffs between area and delay that are possible in the design of FPGAs when both architecture and transistor sizing are varied. An automated transistor design tool was used to create a range of different circuit implementations for every architecture investigated. Compared to past pure architecture studies, we find that varying the transistor sizing of a single architecture offers a greater range of possible tradeoffs between area and delay than was possible by only varying the architecture. By varying the architecture along with the transistor sizings, we see that performance could be usefully varied by a factor of 2.1 and area by a factor of 2.0. We observe that LUT size is the most useful architectural parameter for making tradeoffs between area and delay. As well, we see that such architectural conclusions could not be properly made if transistor sizing were not explicitly considered. Finally, we speculated on how these tradeoffs could be used to adjust the performance or area of a commercial device family.

REFERENCES
[1] Stratix IV Device Handbook, Volumes 1–4, SIV5V1-1.0, Altera Corporation, May 2008 [Online]. Available: http://www.altera.com/literature/hb/stratix-iv/stratix4_handbook.pdf
[2] Lattice SC/M Family Handbook, Version 02.1, DS 1004, Lattice Semiconductor Corporation, Jun. 2008 [Online]. Available: http://www.latticesemi.com/dynamic/viewdocument.cfm?document_id=19028
[3] Virtex-5 User Guide, UG190 (v4.0), Xilinx, Mar. 2008 [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug190.pdf
[4] Cyclone III Device Handbook, CIII5V1-1.2, Altera Corporation, Sep. 2007 [Online]. Available: http://www.altera.com/literature/hb/cyc3/cyclone3_handbook.pdf
[5] Lattice ECP2/M Family Handbook, Version 02.9, HB 1003, Lattice Semiconductor Corporation, Jul. 2007 [Online]. Available: http://www.latticesemi.com/dynamic/viewdocument.cfm?document_id=21733
[6] Spartan-3E, ver. 3.4, Xilinx, Nov. 2006 [Online]. Available: http://direct.xilinx.com/bvdocs/publications/ds312.pdf
[7] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.
[8] G. Lemieux, E. Lee, M. Tom, and A. Yu, "Directional and single-driver wires in FPGA interconnect," in Proc. IEEE Int. Conf. Field-Programmable Technol., Dec. 2004, pp. 41–48.
[9] E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 3, pp. 288–298, Mar. 2004.
[10] F. Li, Y. Lin, L. He, D. Chen, and J. Cong, "Power modeling and characteristics of field programmable gate arrays," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 24, no. 11, pp. 1712–1724, Nov. 2005.
[11] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA: Kluwer, 1999.
[12] L. Cheng et al., "Device and architecture cooptimization for FPGA power reduction," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 7, pp. 1211–1221, Jul. 2007.
[13] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Englewood Cliffs, NJ: Prentice Hall, 1996.


[14] A. R. Conn, I. M. Elfadel, J. W. W. Molzen, P. R. O'Brien, P. N. Strenski, C. Visweswariah, and C. B. Whan, "Gradient-based optimization of custom circuits using a static-timing formulation," in Proc. DAC, New York, 1999, pp. 452–459.
[15] C.-P. Chen et al., "Fast and exact simultaneous gate and wire sizing by Lagrangian relaxation," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 18, no. 7, pp. 1014–1025, Jul. 1999.
[16] J. P. Fishburn and A. Dunlop, "TILOS: A posynomial programming approach to transistor sizing," in Proc. Int. Conf. Comput.-Aided Des., Nov. 1985, pp. 326–328.
[17] S. S. Sapatnekar et al., "An exact solution to the transistor sizing problem for CMOS circuits using convex optimization," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 12, no. 11, pp. 1621–1634, Nov. 1993.
[18] K. Compton et al., "Flexible routing architecture generation for domain-specific reconfigurable subsystems," in Proc. Int. Conf. Field Programmable Logic Appl., 2002, pp. 59–68.
[19] I. Kuon and J. Rose, "Area and delay tradeoffs in the circuit and architecture design of FPGAs," in Proc. ISFPGA, New York, 2008, pp. 149–158.
[20] I. Kuon and J. Rose, "Automated transistor sizing for FPGA architecture exploration," in Proc. DAC, New York, 2008, pp. 792–795.
[21] S. D. Brown, R. Francis, J. Rose, and Z. Vranesic, Field-Programmable Gate Arrays. Norwell, MA: Kluwer, 1992.
[22] S. P. Young, T. J. Bauer, K. Chaudhary, and S. Krishnamurthy, "FPGA Repeatable Interconnect Structure With Bidirectional and Unidirectional Interconnect Lines," U.S. 5 942 913, Aug. 24, 1999.
[23] D. Lewis et al., "The Stratix routing and logic architecture," in Proc. ISFPGA, 2003, pp. 12–20.
[24] Stratix III Device Handbook, SIII5V1-1.4, Altera Corporation, Nov. 2007 [Online]. Available: http://www.altera.com/literature/hb/stx3/stratix3_handbook.pdf
[25] D. Tavana et al., "FPGA Architecture With Repeatable Tiles Including Routing Matrices and Logic Matrices," U.S. 5 682 107, Oct. 28, 1997.
[26] P. Chow et al., "The design of a SRAM-based field-programmable gate array-part II: Circuit design and layout," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp. 321–330, Sep. 1999.
[27] D. Lewis et al., "The Stratix II logic and routing architecture," in Proc. ISFPGA, New York, 2005, pp. 14–20.
[28] E. Lee et al., "Interconnect driver design for long wires in field-programmable gate arrays," in Proc. FPT, Dec. 2006, pp. 89–96.
[29] I. Kuon, "Measuring and navigating the gap between FPGAs and ASICs," Ph.D. dissertation, Univ. Toronto, ON, Canada, 2008.
[30] STMicroelectronics, 90 nm CMOS090 Design Platform, 2005 [Online]. Available: http://www.st.com/stonline/products/technologies/soc/90plat.htm
[31] J. Luu et al., "VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling," in Proc. ISFPGA, New York, 2009, pp. 133–142.
[32] V. Betz and J. Rose, "Circuit design, transistor sizing and wire layout of FPGA interconnect," in Proc. CICC, 1999, pp. 171–174.
[33] M. Hutton et al., "Interconnect enhancements for a high-speed PLD architecture," in Proc. ISFPGA, New York, 2002, pp. 3–10.
[34] J. H. Anderson and F. N. Najm, "Low-power programmable routing circuitry for FPGAs," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., Washington, DC, 2004, pp. 602–609.
[35] S. P. Young, "Six-Input Multiplexer With Two Gate Levels and Three Memory Cells," U.S. 5 744 995, Apr. 17, 1998.
[36] J. K. Ousterhout, "Switch-level delay models for digital MOS VLSI," in Proc. DAC, 1984, pp. 542–548.
[37] J. Rubinstein, P. Penfield, and M. A. Horowitz, "Signal delay in RC tree networks," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 2, no. 3, pp. 202–211, Jul. 1983.
[38] I. Sutherland, R. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Diego, CA: Morgan Kaufmann, 1999.
[39] S. Yang, Logic Synthesis and Optimization Benchmarks User Guide Version 3.0, Microelectronics Center of North Carolina, 1991.

[40] J. Cong, J. Peck, and Y. Ding, "RASP: A general logic synthesis system for SRAM-based FPGAs," in Proc. ISFPGA, New York, 1996, pp. 137–143.
[41] A. Marquardt, V. Betz, and J. Rose, "Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density," in Proc. FPGA, 1999, pp. 37–46.
[42] V. Betz and J. Rose, "VPR: A new packing, placement and routing tool for FPGA research," in Proc. 7th Int. Workshop Field-Programmable Logic, 1997, pp. 213–222.
[43] J. E. Weber, Mathematical Analysis: Business and Economic Applications, 3rd ed. New York: Harper & Row, 1976.
[44] E. Ahmed, "The effect of logic block granularity on deep-submicron FPGA performance and density," Master's thesis, Univ. Toronto, ON, Canada, 2001.
[45] Stratix II Device Handbook, SII5V1-4.3, Altera Corporation, May 2007 [Online]. Available: http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf
[46] Cyclone II Device Handbook, CII5V1-3.3, Altera Corporation, Feb. 2008 [Online]. Available: http://www.altera.com/literature/hb/cyc2/cyc2_cii5v1.pdf

Ian Kuon (M'02) received the B.Sc. degree in electrical engineering from the University of Alberta, AB, Canada, in 2002 and the M.A.Sc. and Ph.D. degrees in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 2004 and 2008, respectively. He is currently with the Altera Toronto Technology Center, Altera Corporation, Toronto, ON, Canada, working in the area of timing modeling. He previously held a number of co-op and intern positions at PMC Sierra in product engineering, Research in Motion in their digital ASIC design group, Altera in the area of power modeling, and Altera in their IC Design department. Dr. Kuon was the recipient of the Natural Sciences and Engineering Research Council of Canada (NSERC) PGS A and CGS D post-graduate scholarships.

Jonathan Rose (F'03) received the Ph.D. degree in electrical engineering in 1986 from the University of Toronto, Toronto, ON, Canada. He is a Professor in the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto. From 1986 to 1989, he was a Post-Doctoral Scholar and then Research Associate in the Computer Systems Laboratory at Stanford University. He spent the 1995–1996 year as a Senior Research Scientist at Xilinx, San Jose, CA, working on the Virtex FPGA. In October 1998, he co-founded Right Track CAD Corporation, which delivered architecture for FPGAs and packing, placement and routing software for FPGAs to FPGA device vendors. He was President and CEO of Right Track until May 1, 2000. Right Track was purchased by Altera, and became part of the Altera Toronto Technology Centre, where Rose was Senior Director until April 30, 2003. His group at Altera Toronto shared responsibility for the development of the architecture for the Altera Stratix, Stratix II, Stratix GX and Cyclone FPGAs and associated software. His research covers all aspects of FPGAs including their architecture, computer-aided design (CAD), field-programmable systems, soft processors, and graphics, vision and bio-informatic applications of programmable hardware. Dr. Rose is the co-founder of the ACM FPGA Symposium, and remains part of that Symposium on its steering committee. He served as Chair of the Edward S. Rogers Sr. Department of Electrical and Computer Engineering from January 2004 through June 2009. He is a Senior Fellow of Massey College in the University of Toronto, a Fellow of the ACM, and a Fellow of the Canadian Academy of Engineering.
