Yield-Oriented Evaluation Methodology of Network-on-Chip Routing Implementations
S. Medardoni, D. Bertozzi
ENDIF University of Ferrara 44100 Ferrara, Italy Email: [email protected]
A. Mejía, D. Dai
Intel Corporation Email: [email protected]
Abstract: Network-on-Chip technology is gaining wide popularity for the interconnection of an increasing number of processor cores on the same silicon die. However, growing process variations cause interconnect malfunction or prevent the network from working at the intended frequency, directly impacting yield and manufacturing cost. Topology-agnostic routing algorithms have the potential to tolerate process variations without degrading performance. We propose a three-step methodology for evaluating routing algorithms in their ability to deal with variability. Using yield enhancement and operation speed preservation as the criteria, we demonstrate how this methodology can be used to select the best design choice among several plausible combinations of routing algorithms and implementations. We also show how an efficient table-less routing implementation can be used to minimise the impact of variability on manufacturing cost and operating frequency.
I. INTRODUCTION
As technology scaling brings the intricacies of nanoscale designs to the forefront, reliability is rapidly becoming a major design concern. Manufacturing faults may appear in the form of defective cores, links, or switches. For example, due to the small feature size, imprecise impurity deposition or non-uniformity in the lithographic exposure field can cause transistors to malfunction completely or deviate from nominal performance/power figures. As technology scales down further, variability is expected to increase. A report by leading semiconductor manufacturers [5] predicts that variability will rise to 18% for 22nm technology from 5% in current 45-65nm technologies. At the same time, multi-core designs have become the dominant organization for high-performance microprocessors and even in the embedded computing domain. Networks-on-chip are today advocated as a scalable interconnect fabric for multi-core processors, overcoming the limitations of shared bus structures and even of multi-layer interconnects. Therefore, not only is fault-tolerance becoming a critical requirement in designing modern chips, but the on-chip networking scenario is also raising new challenges for fault-tolerance. Because of variability-induced performance asymmetry, the post-silicon NoC topology at the target nominal frequency may differ from the one projected at design time, and in particular it might turn out to be highly irregular in spite of the inherent
regularity of the original design. All modules and links could then be slowed down to the frequency of the slowest element, but this is not acceptable for all application scenarios. This example points out the ultimate challenge: support for unexpected and unpredictable topology irregularity must be built into the architecture of the NoC building blocks. The routing framework is particularly affected by this new requirement, which places a new burden on the NoC designer. On the one hand, among all fault-tolerant routing algorithms that can still provide complete post-silicon connectivity (a vast literature covers this topic in different environments [12], [11], [8]), the designer has to select the most performance-efficient one with respect to application traffic patterns. On the other hand, the designer faces the choice of the routing implementation, which has deep implications for the final architecture and for overall performance and complexity figures. In fact, many fault-tolerant routing algorithms lend themselves to a table-based implementation. Unfortunately, routing tables are expensive in terms of access time and resources, and scale poorly. This paper advocates the use of logic-based distributed routing as an efficient routing implementation mechanism for NoCs. However, different kinds of logic routing implementations cover topology irregularities to different degrees. As an example, the LBDR mechanism illustrated in [9] can implement many distributed routing algorithms even on irregular topologies, provided the communication between two end nodes in the post-silicon topology can still go through a minimal path of the nominal regular topology. When the irregularity pattern violates this constraint, more complex logic and more configuration bits are required in the routing mechanism to still route traffic successfully.
In this case, the designer needs to know whether the increase in complexity of the routing implementation is adequate or disproportionate with respect to the achieved coverage of topology irregularities, similar to the KILL rule for multi-core design [6]. To the best of our knowledge, there is no generally accepted methodology for quantitatively evaluating the trade-off between complexity (area, power, impact on critical path) of the routing implementation and coverage of irregularity patterns. This paper takes a first step in this direction and proposes a yield-oriented methodology for the evaluation of NoC routing implementations other than look-up tables. The proposed framework also considers the common practice of decreasing the post-silicon operating frequency of the network to make variability-affected switches and links operational again. Although any combination of variability model, NoC architecture, routing algorithm and routing implementation can be evaluated with respect to yield enhancement and fabrication cost, this paper presents a case study which proves the practical viability of the proposed methodology.
II. EVALUATION METHODOLOGY
The methodology allows the joint evaluation of a routing algorithm together with its implementation mechanism (hereafter denoted as the routing framework), assuming the existence of a variability pattern in the chip manufacturing process. The methodology helps in assessing the effectiveness of the routing framework in sustaining yield in the face of within-die process variations, while relating its benefits to its cost. To this end, the methodology estimates the percentage of cases in which alternative routing frameworks succeed in tolerating variability patterns, and compares these percentages with the resources used by each solution. Ultimately, the designer has a clear view of the complexity-coverage trade-off that the routing frameworks under test span.
Figure 1 shows the three steps of the methodology. Each step performs a different task independently of the others. This property makes the methodology easily extensible to several types of variability models, NoC architectures and routing frameworks.
Fig. 1. [The three steps of the methodology, panels (a) to (c); labels: chip, topology instance.]
The design entry is the top-level layout of the system, comprising networked IP cores (functional units, memories, ...) and the interconnect topology. The first step of the methodology is to build a statistically significant database of chip instances resulting from variability patterns, in a similar way to what is done in a Monte Carlo analysis.
To do this, a process variation model needs to be fed to the methodology, which is then used to statistically inject deviations from the nominal delay values of links and switches. The methodology can support first-order models as well as more accurate ones. For instance, abstract models projecting switch delay variations based on the number of critical paths and on the logic depth can be used [4]. Alternatively, the true switch gate-level netlist can be used for the injection of Gaussian noise to the gate delays,
and static timing analysis tools can then be used to measure the critical path variation (similar to [2]). As regards links, variability models can capture interconnect-related effects such as dishing, combined with the delay variability of drivers and repeaters (see [3] for an example). Our methodology does not pose any limitation on the kind of variability model used, as long as it results in delay variations of NoC switches and switch-to-switch links with respect to nominal values. Each chip instance features a topology with a given irregularity pattern. However, this pattern changes as a function of the target operating speed. As the operating speed is lowered, progressively more switches and links become operational again, and the critical path may move from links to switches or vice versa. The lowest frequency for the analysis is the one that makes all switches and links fully operational. As the target frequency is raised, the network becomes increasingly irregular until some designer-defined boundary is reached (e.g., a maximum number of disconnected nodes, or full connectivity retained between predefined nodes). Each topology with its associated speed is denoted as a topology instance.
The second step deals with the definition of a suitable routing algorithm for every chip/topology instance. In this step care must be taken to guarantee the necessary conditions of a deadlock-free network and of a connected chip. Also, the resources devoted (at design time) to routing purposes (e.g., virtual channels) must be considered at this step. The methodology has been designed to accept any routing algorithm at this step, so the designer is free to test any routing algorithm from the wide range of options available. However, the designer must be aware of the irregularities the chip is facing (due to the variability).
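The derivation of topology instances from a single chip instance can be illustrated with a minimal sketch. It assumes that only link delays matter, that a link is operational when its delay fits within the clock period, and that the designer-defined boundary is loss of full connectivity; the function names `is_connected` and `topology_instances` are hypothetical and not part of the original methodology.

```python
from collections import deque


def is_connected(nodes, links):
    """Breadth-first search over undirected links; True if every node is reachable."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return len(seen) == len(nodes)


def topology_instances(nodes, link_delays):
    """Return (clock_period, enabled_links) pairs, slowest clock first.

    link_delays maps an undirected link (a, b) to its post-silicon delay.
    Assumption: a link is enabled when its delay does not exceed the period.
    Assumption: instances that disconnect the network are discarded.
    """
    instances = []
    for period in sorted(set(link_delays.values()), reverse=True):
        enabled = sorted(l for l, d in link_delays.items() if d <= period)
        if is_connected(nodes, enabled):
            instances.append((period, enabled))
    return instances
```

Each candidate clock period is taken from the set of observed link delays, since the set of enabled links can only change at those values.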
So, at this step, it seems more interesting to test either topology-agnostic routing algorithms (able to work on any topology) or fault-tolerant routing algorithms. In both cases the implementation of the routing engine will need to be assessed, and this is done in the next step of the methodology.
In the third step, a range of alternative routing implementations is tested for every successful routing algorithm obtained in the previous step. Notice that fault-tolerant routing algorithms must be evaluated together with their particular implementations. Also, topology-agnostic routing algorithms can be evaluated with a wider set of implementations ranging from look-up tables to minimal, specific logic-based implementations. In other words, not all routing algorithms can be implemented by a given mechanism, and therefore, not all possible combinations are feasible. There are several ways of defining test success at this stage. For instance, communication flows as extrapolated from an annotated task graph can be applied to the topology instance and the feasibility of those flows can be assessed. A large number of evaluation tools can be used at this stage, from custom connectivity verification tools all the way to functional simulation tools. The methodology will list the set of topology instances that succeeded with a particular routing algorithm and routing implementation. Therefore, percentages of successful chips will be exposed to the designer in order to assess the tolerance of a particular routing framework to the effects of variability. We define a coverage metric as the percentage of chips that are usable for a given frequency and routing framework. However, the implementation cost of the routing mechanism that achieves a good coverage result might be disproportionate with respect to the coverage itself, or in contrast might be well justified. With the coverage/routing area metric we intend to capture whether the benefits are worth the increase in complexity, which can be understood by comparing this metric for competing solutions.
III. CASE STUDY
Many routing algorithms and routing implementations can be assessed with the yield-oriented evaluation methodology. In order to prove the usability of the method with a real case study, we selected a variability model, a set of routing algorithms (derived from a routing methodology) and several different NoC routing implementations. Our aim is to compare the routing implementations in terms of their achieved coverage (percentage of supported topologies/chips) once the variability model is applied. At each stage of the methodology we will customize the definition of successful tests for our needs.
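As a rough illustration of the two metrics, the sketch below computes coverage as a percentage of usable chips and coverage/routing area using the definition given in the evaluation section (coverage divided by the area of the mechanism multiplied by the number of routers). The function names and the area unit are assumptions made for this example, and any numbers passed in are illustrative rather than results from the paper.

```python
def coverage(results):
    """Percentage of chips usable at one frequency with one routing framework.

    results holds one boolean per chip instance: True when at least one
    routing-algorithm instance could be implemented on that chip.
    """
    if not results:
        raise ValueError("no chip instances")
    return 100.0 * sum(results) / len(results)


def coverage_per_routing_area(coverage_pct, mechanism_area, num_routers):
    """Coverage divided by the total area devoted to routing logic.

    mechanism_area is the area of one routing mechanism in an arbitrary
    (hypothetical) unit; the total is that area times the number of routers.
    """
    return coverage_pct / (mechanism_area * num_routers)
```

Because the metric is a ratio, it is only meaningful when the competing routing implementations are measured in the same area unit.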
Without loss of generality, we selected the variability model presented in [7] as the starting point for generating the pool of chips and topologies. In particular, this model takes into consideration the effect of variability on switch-to-switch link delay. Although we are assuming here that link delays determine network operating speed (and therefore that the switch critical path can be tuned with compensation techniques or is simply shorter than the link delay), this is an interesting case study since recent implementation works on NoCs have shown that the critical path of the network is rapidly moving from logic to inter-switch links as technology scales below 65nm [1]. Table I shows the link-delay variability and variability spatial correlation parameters used by the variability model (as predicted by the ITRS roadmap). After applying the variability model with the different parameters we obtain 50 different chip instances per configuration. In each case all the links are labeled with their post-silicon delay. The methodology obtains different topology instances from each chip (each instance corresponding to a different network speed) by setting the link delay threshold of the chip and enabling only those links below the threshold. The first topology for a chip is set by the maximum frequency the network can tolerate (connectivity among all nodes is ensured but the topology is probably highly irregular) and the last topology is set by the minimum frequency considered for the network (maximum connectivity is achieved). In each case all the links work at the same frequency, and a varying number of topologies is obtained from each chip.
TABLE I. PARAMETERS USED IN THE EVALUATION.
Routing instances: 16, 16, 8, 8
We feed the second step with a topology-agnostic routing algorithm that can be used for irregular topologies, Segment-based Routing (SR) [8]. SR allows multiple instances of the routing algorithm for the same topology. Thus, for each topology obtained in the previous step, we compute a set of 16 SR instances for the 4x4 layouts and a set of 8 SR instances for the 8x8 layouts (note that 8x8 topologies allow for 64 instances by starting the computation of the routing algorithm from a different node, but to reduce the amount of analyzed data we only compute 8 of them, starting at the diagonal nodes). The set of routing instances is used in the next step.
Finally, for the third step we selected four different routing implementations: Logic-based Distributed Routing (LBDR) [9], LBDR with de-routes (LBDRdr; LBDR with a deroute output port set on every switch where LBDR fails) and Region-based Routing (RbR) [10] (two different versions of RbR are used, RbR8r with eight regions per input port and RbR12r with 12 regions per input port). These mechanisms have been selected since they represent efficient implementations of routing mechanisms for on-chip networks without the need for routing tables and virtual channels.
The LBDR implementation [9] is based on a small logic block and uses two sets of bits on every switch: the Rxy routing bits and the Cx connectivity bits. All bits are set before normal operation and are computed from the current topology and the routing algorithm in use (which is totally independent of this implementation and has to be deadlock-free and ensure full connectivity).
A routing bit, Rxy, at a given switch indicates whether a packet is allowed (by the applied routing algorithm) to leave the switch through output port x and, at the next switch, to turn to direction y. Connectivity bits, Cx, are used to define the current topology being used. Each output port has a connectivity bit that indicates whether a neighbour switch is attached through that output port. With all these bits, a small logic block is able to route packets in a deadlock-free and connected manner (Figure 2). As noted by its authors, LBDR does not allow the use of non-minimal paths. This limits its applicability to highly irregular topologies, thus potentially affecting yield.
In an effort to mitigate this shortcoming, LBDR was extended with an additional feature. The extension is referred to as LBDRdr and provides non-minimal paths at strategic switches. To do so, we add a small logic block and two bits per switch to LBDR. The addition is shown in Figure 2. The logic (a NOR gate) is enabled when LBDR is not able to select a proper output port for a packet. Then, a multiplexer is enabled to provide a deroute output port for the packet. Two bits are used per switch to configure the multiplexer according to the routing algorithm being used. Notice that when computing the routing bits and the deroute bits we need to guarantee that no routing restrictions are crossed by packets.
Fig. 2. [LBDR routing logic, including the LBDRdr addition.]
Fig. 4. [RbR hardware per input port: for each routing region, comparisons of RowDst against Row1 and Row2 and of ColDst against Col1 and Col2, the OP register, and an output port selector driving N, E, W, S.]
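The behaviour of the routing and connectivity bits, together with the deroute extension, can be sketched in software. This is only an illustrative model under stated assumptions: the Boolean equations follow the general form of the LBDR logic as we understand it from [9], coordinates are assumed to grow eastwards in x and southwards in y, and the function name `lbdr_route` and its dictionary-based bit encoding are hypothetical. The U-turn filter is not modelled.

```python
def lbdr_route(cur, dst, routing_bits, connectivity_bits, deroute_port=None):
    """Return the set of candidate output ports at switch `cur` for `dst`.

    cur, dst: (x, y) coordinates; y grows towards the south (assumption).
    routing_bits: dict such as {"ne": 1, ...}; bit "xy" allows leaving through
        port x and turning to direction y at the next switch.
    connectivity_bits: dict {"n": 0/1, "e": 0/1, "w": 0/1, "s": 0/1}.
    deroute_port: port selected by the two LBDRdr bits, or None for plain LBDR.
    """
    if cur == dst:
        return {"local"}
    # Relative position of the destination with respect to the current switch.
    n, s = dst[1] < cur[1], dst[1] > cur[1]
    e, w = dst[0] > cur[0], dst[0] < cur[0]
    r, c = routing_bits, connectivity_bits
    candidates = {
        "n": n and ((not e and not w) or (e and r["ne"]) or (w and r["nw"])),
        "e": e and ((not n and not s) or (n and r["en"]) or (s and r["es"])),
        "w": w and ((not n and not s) or (n and r["wn"]) or (s and r["ws"])),
        "s": s and ((not e and not w) or (e and r["se"]) or (w and r["sw"])),
    }
    # A port is usable only if the routing logic allows it and a neighbour is attached.
    ports = {p for p, ok in candidates.items() if ok and c[p]}
    # LBDRdr: when no port was selected (the NOR condition), use the deroute port.
    if not ports and deroute_port is not None:
        ports = {deroute_port}
    return ports
```

With all routing and connectivity bits set, a destination to the south-east yields both the east and south ports; with the east link disabled and a destination due east, plain LBDR returns no port while the deroute variant returns the configured port.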
Figure 3 shows an example where the deroute is used. The figure shows a case not supported by LBDR (the non-minimal path shown). At switch A the packet should be sent through the east output port; however, the link is not operational. With LBDRdr, the example is now supported. At switch A, the packet needs to be derouted. In the example we select the north port as the deroute port for the packet. The packet will leave the switch through the north port and will reach switch X. At that switch the packet is evaluated; in particular, the direction is computed. In this case the direction is southeast. With LBDR the packet would select either east or south. Notice that in this case the packet should not take south, since it arrived through the south port. To prevent this we filter U-turns with a small logic block. From that point on, the packet is routed with the basic LBDR mechanism. There is no need for new deroute ports.
Fig. 3. [Example of a deroute at switch A through switch X; switches S0 to S7.]
The Region-based Routing (RbR) framework routes messages by using simple and efficient blocks of logic referred to as regions. A different set of regions is implemented in logic at every input port of a given switch. Figure 4 shows the hardware required per input port. Each region is a rectangular box of destinations that is identified by the possible output ports (OP register), the top-left switch (Row1 and Col1 registers) and the bottom-right switch (Row2 and Col2 registers) of the region. When a message header arrives at the input port, all regions are inspected in parallel in order to find out which output ports the message can take. Regions are computed and identified by the RbR mechanism based on the routing restrictions and the network topology. They are used to set up hardware registers located at every input port of every switch. These registers must be programmed at network boot time, before routing any packet. A nice feature of RbR is that it allows the implementation of topology-agnostic routing algorithms, providing appropriate support for routing in the presence of link and node failures. As pointed out later, the LBDR and RbR mechanisms feature different levels of complexity.
In addition, we use routing tables for comparison purposes. In the figures, the MAX label represents the maximum coverage that can be reached with routing tables. For every computed routing algorithm, we test whether every routing implementation works or not. An implementation works if it is able to route packets from every source to every destination. If a routing implementation works for at least one of the routing algorithms for a topology with a given operating speed, then the routing implementation is said to cover the current chip at that frequency. We rank all the chips by the maximum supported link frequency for every routing implementation.
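The region lookup performed by RbR at an input port can be modelled with a short sketch. It assumes that a destination inside several regions may use the union of their output ports and that row and column indices grow from the top-left corner; the hardware inspects all regions in parallel, whereas this model simply loops over them. The names `Region` and `rbr_output_ports` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    row1: int  # row of the top-left switch of the region
    col1: int  # column of the top-left switch
    row2: int  # row of the bottom-right switch
    col2: int  # column of the bottom-right switch
    ports: frozenset  # contents of the OP register, e.g. frozenset({"N", "E"})


def rbr_output_ports(regions, row_dst, col_dst):
    """Return the output ports allowed for a destination at one input port."""
    allowed = set()
    for region in regions:
        inside = (region.row1 <= row_dst <= region.row2
                  and region.col1 <= col_dst <= region.col2)
        if inside:
            # Assumption: ports of all matching regions are combined.
            allowed |= region.ports
    return allowed
```

An empty result means that no region at this input port contains the destination, which in this model corresponds to a topology instance the configured regions cannot route.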
IV. EVALUATION
A. Coverage Results
In this first evaluation we analyze the coverage that can be achieved for each routing implementation. To do so, we classify the chips according to the frequency they can work at. Figure 5 shows the distribution along the link frequency of all the 4x4 chips (link-delay variability of 0.05) analyzed for two values of the spatial correlation parameter, 0.4 and 1.2, respectively. The figure includes results for LBDR, LBDRdr, RbR8r, RbR12r, and MAX (routing tables). The MAX curve represents the percentage of chip instances that have a connected network (with respect to the total number of instances). The same baseline frequency f0 is assumed for all experiments, selected in the high-performance range of the network (e.g., 1GHz) where the reliability-performance trade-off appears more clearly. As a first observation (see the MAX curve) we can see how the spatial correlation of the variability affects the connectivity of the network. With a low spatial correlation (parameter value 1.2) more chips can be operated at a higher frequency (the topology gets connected at higher frequencies). However, the most interesting results are related to LBDR and RbR coverage. We can see that the best option for coverage purposes is RbR with 8 regions. It achieves 100% coverage for all connected chips (RbR8r equals MAX) in all the frequency ranges. LBDR achieves a decent although lower coverage: in most cases, 50% of the connected chips for the 0.4 case, and less for the 1.2 case. Differences are larger for 8x8 chips (not shown). LBDRdr, however, increases LBDR coverage to a large extent in the 1.2 case, and fully covers all the connected chips in the 0.4 case. These results are also obtained for the 8x8 chip case (not shown).
B. Performance and Cost Results
In the previous section we have analyzed the coverage each routing implementation provides for different values of the link-delay variability and spatial correlation parameters.
The analysis has been performed for the same design frequency f0. In this section we analyze the maximum operating frequency reachable by each switch with the different routing implementations, and the area overhead of their logic and registers. The intention is to characterize the reliability-performance/area trade-off, and to ultimately assess whether the implementation complexity of a routing mechanism largely exceeds its coverage improvements or whether the latter are worth the cost.
We synthesized the NoC system using the xpipesLite design platform [14] to obtain the switch layout under investigation. In order to make the study more complete, we used two different chip layouts (4x4 and 8x8 mesh). The xpipesLite design platform is used to refine RTL descriptions into actual switch layouts. An STMicroelectronics 65nm low-power technology library is used with commercial backend synthesis tools.
Figure 6 shows the maximum achievable switch frequency when using each possible routing mechanism. Results are normalized to the fastest solution (LBDR). The logic depth of each scheme is important, since it resides on the switch critical path. Clearly, LBDRdr incurs only a minor degradation of the maximum speed (which is slightly more than 1 GHz for baseline LBDR). In contrast, RbR suffers from about 30% frequency reduction, with a further degradation when moving from RbR8r to RbR12r. The main cause of this long critical path lies in the larger number of cascaded logic stages (local-port matching, region matching, output port selection) and in the use of high-fanout nets at switch inputs. The low performance of RbR can be hidden by pipelining switch operation. To reach a 1 GHz switch frequency, the single-cycle switch architecture can be retained only for the LBDR solutions. RbR requires 2-cycle switches to keep up with that frequency.
Fig. 6. Maximum speed achievable by the switch with different routing mechanisms inside. Post-layout results.
As regards area, Figure 7 shows the results. The left-hand plot shows post-layout area at maximum performance (the frequency just reported), while the right-hand plot reports an area estimate for the case where all switches have to be operated at 1 GHz. Therefore, we end up having two mechanisms with opposite benefits,
LBDR having low coverage but small area requirements and RbR having excellent coverage but large area requirements. In order to quantify this trade-off, we define the coverage/routing area metric, which expresses how effectively network area is used with respect to the reliability objective. Coverage/routing area is measured by dividing the coverage by the chip area devoted to the routing mechanism (area of the mechanism multiplied by the number of routers). Results are shown in Figure 8. When real implementation costs are considered, the picture is reversed with respect to the coverage results. LBDR achieves a high coverage/routing area value since its implementation costs are very modest. Values higher than 300 are obtained (300 times more coverage than area required). RbR, however, achieves a low coverage/routing area value because, although it achieved a high overall coverage, its area requirements are too high. Even more interesting, we can see that the addition of a de-route port to LBDR is both coverage-efficient (previous section) and area-efficient (Figure 8). So, the designer is exposed to the real benefits of an incremental addition (LBDRdr) to a previous routing implementation (LBDR).
Fig. 7. Switch area. Area at 1 GHz is estimated accounting for the switch pipelining overhead (best and typical case area ratios).
Fig. 8. [Coverage/routing area results.]
V. CONCLUSIONS
Within-die variability is becoming a first-order concern for proper chip function and manufacturing cost. Yield-oriented design methods and techniques are critical to minimize the negative impact of variability. In this paper, we propose a methodology for quantitatively evaluating the effectiveness of different NoC routing frameworks in the presence of process variations. Using yield enhancement and operation speed preservation as the criteria, we have demonstrated how this methodology can be used to select the best design choice among several plausible combinations of routing algorithms and implementations, pointing out the cost-performance trade-off for each of them. Also, by means of a case study we showed how an efficient table-less routing implementation can be used to minimise the impact of variability on manufacturing cost and operating frequency. As future research, we will extend the methodology to GALS systems where each link can run at its own independent frequency.
ACKNOWLEDGMENT
This work was supported by the Spanish MEC and European Commission FEDER funds under grants CONSOLIDER-INGENIO 2010 CSD2006-00046 and TIN 2006-15516-C04-01, and by Junta de Comunidades de Castilla-La Mancha under Grants PCC08-0078.
REFERENCES
[1] D. Ludovici et al., Assessing Fat-Tree Topologies for Regular Network-on-Chip Design under Nanoscale Technology Constraints, DATE 2009.
[2] D. Bertozzi et al., Process Variation Tolerant Pipeline Design Through a Placement-Aware Multiple Voltage Island Design Style, DATE 2008.
[3] M. Mondal et al., Provisioning On-Chip Networks under Buffered RC Interconnect Delay Variations, ISQED 2007.
[4] K.A. Bowman et al., Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration, IEEE Journal of Solid-State Circuits, 2002.
[5] International Technology Roadmap for Semiconductors, 2007 Edition, available online at http://www.itrs.net/Links/2007ITRS/Home2007.htm
[6] A. Agarwal et al., The Kill Rule for Multicore, DAC 2007.
[7] C. Hernandez et al., A Model for With-in Die Variability in NoCs, Technical University of Valencia, 2009. Technical report available online at http://www.disca.upv.es/articulos/docs/informes/I001 09.PDF
[8] A. Mejía et al., Segment-Based Routing: An Efficient Fault-Tolerant Routing Algorithm for Meshes and Tori, IPDPS 2006.
[9] J. Flich et al., An Efficient Implementation of Distributed Routing Algorithms for NoCs, NOCS 2008.
[10] J. Flich et al., Region-Based Routing: An Efficient Routing Mechanism to Tackle Unreliable Hardware in Network on Chip, NOCS 2007.
[11] D. Braginsky and D. Estrin, Rumor Routing Algorithm For Sensor Networks, WSNA 2002.
[12] B. Karp and H. T. Kung, Greedy Perimeter Stateless Routing for Wireless Networks, MobiCom 2000.
[13] G. Paci et al., Effectiveness of Adaptive Supply Voltage and Body Bias as Post-Silicon Variability Compensation Techniques for Full-Swing and Low-Swing On-Chip Communication Channels, DATE 2009.
[14] S. Stergiou et al., XPipes Lite: a Synthesis Oriented Design Library for Networks on Chips, DATE 2005.