
A SEMINAR REPORT

ON

“ASYNCHRONOUS TECHNOLOGIES FOR


SYSTEM ON CHIP DESIGN”

B.TECH- IV (ELECTRONICS & COMMUNICATION)


SUBMITTED BY:

AKASHDEEP

(Roll No.: U12EC003)

GUIDED BY:

PROF. ZUBER M. PATEL

ECED, SVNIT

DEPARTMENT OF ELECTRONICS ENGINEERING


Year: 2015-16

SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY


(SVNIT)
SURAT-395007



Sardar Vallabhbhai National Institute of Technology, Surat-07

Electronics Engineering Department

CERTIFICATE

This is to certify that Mr. AKASHDEEP, bearing Roll No. U12EC003, of B.Tech. IV, 7th Semester, has successfully and satisfactorily presented the UG Seminar and submitted the report on the topic entitled "ASYNCHRONOUS TECHNOLOGIES FOR SYSTEM ON CHIP DESIGN" in partial fulfillment of the degree of Bachelor of Technology (B.Tech.) in Dec. 2015.

Guide: PROF. ZUBER M. PATEL

Examiner 1 Sign: ______________ Name: ______________

Examiner 2 Sign: ______________ Name: ______________

Examiner 3 Sign: ______________ Name: ______________

Head,
ECED, SVNIT.

(Seal of the Department)



Acknowledgements
I would like to express my profound gratitude and deep regards to my guide, Prof. Z. M. Patel, for his valuable guidance. I am heartily thankful for his suggestions and for the clarity he brought to the concepts of the topic, which helped me greatly in this work. I would also like to thank Dr. (Mrs.) U. D. Dalal, Head of the Electronics and Communication Engineering Department, SVNIT, and all the faculty members of ECED for their co-operation and suggestions. I am very grateful to all my classmates for their support.



Abstract
It is now generally agreed that the sizable very large scale integration (VLSI) systems of the nanoscale era will not operate under the control of a single clock and will therefore require asynchronous techniques, because the large parameter variations across a chip will make it impossible to control delays in clock networks and other global signals efficiently.

This report introduces the main design principles, methods, and building blocks for asynchronous very large scale integration systems, with an emphasis on communication and synchronization. At first, systems on chip (SoCs) will be globally asynchronous and locally synchronous (GALS), but the complexity of the various asynchronous/synchronous interfaces required in a GALS system will eventually lead to totally asynchronous solutions.

The report covers four main areas. First, it gives an overview of system-on-chip design, including descriptions of the major approaches, the challenges that commonly arise in SoC design, and the solutions used to overcome them; the motivating factors behind this development are also discussed. Next, it briefly summarizes key design methodologies, processes, and flows, and describes some of the advanced, next-generation concepts that are emerging for SoCs.

Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-insensitive (QDI). QDI is used as the basis for asynchronous logic. We will discuss asynchronous handshake protocols for communication and the notion of validity/neutrality tests, as well as basic building blocks for sequencing, storage, function evaluation, and buses.



Table of Contents
Acknowledgements ...................................................................................................................... iii

Abstract ......................................................................................................................................... iv

Table of Contents .......................................................................................................................... v

List of Figures ............................................................................................................................. viii

CHAPTER 1 INTRODUCTION ........................................................................................... 1

1.1 Overview of system on chip design:............................................................................. 1

1.2 System on chip block diagram: .................................................................................... 1

1.3 System on chip design flow: ......................................................................................... 2

1.4 System on chip evolution: ............................................................................................ 3

1.5 System on chip architecture:......................................................................................... 3

1.6 Reason for adoption of asynchronous techniques: ....................................................... 4

1.7 Some important definitions: ......................................................................................... 5

1.7.1 Asynchronous circuit: ........................................................................................ 5

1.7.2 Quasi delay insensitive (QDI): .......................................................................... 5

1.7.3 Speed independent circuits: ............................................................................... 6

1.7.4 Self timed circuits: ............................................................................................. 6

1.7.5 Delay insensitive (DI):....................................................................................... 6

CHAPTER 2 SYSTEM ON CHIP DESIGN CHALLENGES ............................................ 7

2.1 IP quality, complexities of IP integration: .................................................................... 8

2.2 SoC testability: ............................................................................................................. 8

2.3 RC delay management and optimization: ..................................................................... 8

2.4 Power optimization:...................................................................................................... 9

2.5 Hierarchical data management, constraint verification: ............................................... 9



2.6 Functional verification:............................................................................................... 10

2.7 Variation-aware analysis: ........................................................................................... 10

2.8 EM .............................................................................................................................. 10

2.9 Lithography dependencies: ......................................................................................... 10

CHAPTER 3 SoCs AS DISTRIBUTED SYSTEMS .......................................................... 12

3.1 Modeling Systems: Communicating Processes: ......................................................... 12

3.1.1 Communication, Ports, and Channels: ............................................................ 12

3.1.2 Assignment: ..................................................................................................... 13

3.1.3 Sequential and Parallel Compositions: ............................................................ 13

3.1.4 Selection, Wait, and Repetition: ...................................................................... 13

3.1.5 Pipeline Slack and Slack Matching: ................................................................ 13

3.2 Modeling System Components: HSE ......................................................................... 14

3.3 Stability and Noninterference: .................................................................................... 14

3.4 Isochronic Forks: ........................................................................................................ 15

CHAPTER 4 ASYNCHRONOUS COMMUNICATION PROTOCOLS ....................... 17

4.1 Bare Handshake Protocol: .......................................................................................... 17

4.1.1 Two-Phase Handshake: ................................................................................... 18

4.1.2 Four-Phase Handshake: ................................................................................... 19

4.2 Handshake Protocols With Data: Bundled Data......................................................... 19

4.3 DI Data Codes: ........................................................................................................... 20

4.4 Dual-Rail Code: .......................................................................................................... 20

4.5 1-of-N Codes: ............................................................................................................. 21

4.6 k-out-of-N Codes: ....................................................................................................... 21

4.7 Which DI Code? ......................................................................................................... 22

4.8 Validity and Neutrality Tests:..................................................................................... 22



CHAPTER 5 BASIC BUILDING BLOCKS ...................................................................... 23

5.1 Sequencer: .................................................................................................................. 23

5.1.1 Reshuffling and Half-Buffers: ......................................................................... 25

5.1.2 Simple Half-Buffer: ......................................................................................... 25

5.1.3 C-Element Full-Buffer:.................................................................................... 25

5.2 Reshuffling and Slack:................................................................................................ 26

5.3 Single-Bit Register: .................................................................................................... 26

5.4 N-bit Register and Completion Tree: ......................................................................... 27

5.5 Completion Trees versus Bundled Data: .................................................................... 27

5.6 Two Design styles for Asynchronous Pipelines: ........................................................ 27

5.6.1 First Approach: Control-Data Decomposition: ............................................... 28

5.6.2 Second Approach: Integrated Pipelines:.......................................................... 28

CHAPTER 6 CONCLUSION .............................................................................................. 30

References:................................................................................................................................... 31

Acronyms ..................................................................................................................................... 32



List of Figures
Figure 1.1 Basic block diagram of System on chip [2] .............................................................. 2
Figure 1.2. System on chip design flow [2] ............................................................................... 2
Figure 1.3. System on chip evolution stages [4] ........................................................................ 3
Figure 1.4. System on chip architecture [4] ............................................................................... 4
Figure 2.1. System on chip design reuse[2] ............................................................................... 7
Figure 3.1: Isochronic forks [5] ............................................................................................... 16
Figure 4.1: Implementation of a "bare" channel (L, R) with two handshake wires [1] ....................... 17
Figure 4.2. A bundled-data communication protocol [1] ......................................................... 20
Figure 4.3. A dual-rail coding of a Boolean data-channel [1] ................................................. 20
Figure 4.4. A 1-of-4 coding of a four-valued integer data channel [1] .................................... 21
Figure 5.1. Implementation of an active–active buffer (sequencer) with a C-element
implementation of the state bit [1] ........................................................................................... 24
Figure 5.2. Implementation of an active–active buffer (sequencer): with a cross-coupled
NOR-gate implementation of the state bit [1] .......................................................................... 24
Figure 5.3. A passive–active buffer implemented as an active–active buffer [1] .................... 24
Figure 5.4. A simple half-buffer using Bare handshake [1] ..................................................... 25
Figure 5.5. A full-buffer FIFO stage using Bare handshake [1] .............................................. 25
Figure 5.6. Handshake wires for the single-bit register[1]....................................................... 26



CHAPTER 1
INTRODUCTION

A system on chip (SoC) is a system on an IC that integrates software and hardware IP using more than one design methodology. SoC design includes embedded processor cores and a significant software component, which leads to additional design challenges. In addition to the IC, an SoC consists of software and an interconnection structure for integration.

1.1 Overview of system on chip design:

System-on-chip design is significantly more complex than traditional chip design. Chip designs have reused design elements for the last 20 years; SoC design involves the reuse of more complex elements at higher levels of abstraction. Block-based design, which involves partitioning, designing, and assembling SoCs using a hierarchical block-based approach, uses the Intellectual Property (IP) block as the basic reusable element [4].

1.2 System on chip block diagram:

An SoC is more of a system than a chip. In addition to the IC, an SoC consists of software and an interconnection structure for integration. An SoC may consist of all or some of the following:

 Processor/CPUs (cores)
 On-chip interconnection (busses, network, etc.)
 Analog circuits
 Accelerators or application specific hardware modules
 ASIC logic
 Software – OS, Application, etc.
 Firmware [2]



Figure 1.1 Basic block diagram of System on chip [2]

1.3 System on chip design flow:

 Due to chip complexity and smaller IC area, it is difficult to shorten the placement, layout, and fabrication steps.
 There is a need to reduce the time taken by the steps before placement, layout, and fabrication.
 One should consider chip layout issues up front.

Figure 1.2. System on chip design flow [2]



1.4 System on chip evolution:

Figure 1.3. System on chip evolution stages [4]

1.5 System on chip architecture:

SOC covers many topics –

 Processor: pipelined, superscalar, VLIW, array, vector
 Storage: cache, embedded and external memory
 Interconnect: buses, network-on-chip
 Impact: time, area, power, reliability, configurability
 Customizability: specialized processors, reconfiguration
 Productivity/tools: model, explore, re-use, synthesize, verify
 Future: autonomous SoC, self-optimizing/verifying design



Figure 1.4. System on chip architecture [4]

1.6 Reason for adoption of asynchronous techniques:

The large parameter variations across a chip will make it prohibitively expensive to control
delays in clocks and other global signals. Also, issues of modularity and energy consumption
plead in favor of asynchronous solutions at the system level. It is now generally agreed that the
sizable very large scale integration (VLSI) systems [systems-on-chip] of the nanoscale era will
not operate under the control of a single clock and will require asynchronous techniques.

Whether those future systems will be entirely asynchronous, as we predict, or globally asynchronous and locally synchronous (GALS), as more conservative practitioners would have it, we anticipate that the use of asynchronous methods will be extensive and limited only by the traditional designers' relative lack of familiarity with the approach.



Fortunately, the past two decades have witnessed spectacular progress in developing methods and prototypes for asynchronous (clockless) VLSI. Today, a complete catalogue of mature techniques and standard components, as well as some computer-aided design (CAD) tools, is available for the design of complex asynchronous digital systems.

This report introduces the main design principles, methods, and building blocks for asynchronous VLSI systems, with an emphasis on communication and synchronization. Such systems will be organized as distributed systems on a chip consisting of a large collection of components communicating by message exchange. Therefore, the report places a strong emphasis on network and communication issues, for which asynchronous techniques are particularly well suited.

We hope that, after reading this report, the designer of an SoC will be familiar enough with these techniques that he or she will no longer hesitate to use them. Even those adepts of GALS who are adamant not to let asynchrony penetrate further than the network part of their SoC must realize that network architectures for SoCs are rapidly becoming so complex as to require the mobilization of the complete armory of asynchronous techniques [1].

1.7 Some important definitions:

Here we define some important terms that will be used frequently in the rest of the report:

1.7.1 Asynchronous circuit:

A digital circuit is asynchronous when no clock is used to implement sequencing. Such circuits are also called clockless. The various asynchronous approaches differ in their use of delay assumptions to implement sequencing.

1.7.2 Quasi delay insensitive (QDI):

Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-insensitive (QDI). We use QDI as the basis for asynchronous logic. All other forms of the technology can be viewed as transformations of a QDI approach obtained by adding some delay assumption [5].

1.7.3 Speed independent circuits:

An asynchronous circuit in which all forks are assumed isochronic corresponds to what has been
called a speed independent circuit, which is a circuit in which the delays in the interconnects
(wires and forks) are negligible compared to the delays in the gates.

1.7.4 Self timed circuits:

Self-timed circuits are asynchronous circuits in which all forks that fit inside a chosen physical area, called an equipotential region, are isochronic.

1.7.5 Delay insensitive (DI):

A circuit is delay-insensitive (DI) when its correct operation is independent of any assumption on
delays in operators and wires except that the delays are finite and positive.



CHAPTER 2
SYSTEM ON CHIP DESIGN CHALLENGES

A SoC is a system on an IC that integrates software and hardware IP using more than one design
methodology. SoC design includes embedded processor cores and a significant software
component which leads to additional design challenges.

SoC design is significantly more complex for the following reasons:

 There is a need for cross-domain optimizations.
 IP reuse will increase productivity, but not enough.
 Even with extensive IP reuse, many of the ASIC design problems will remain [4].

Figure 2.1. System on chip design reuse[2]

There are many challenges in system-on-chip design; some of the most important ones are described below [3].



2.1 IP quality, complexities of IP integration:

 Quality of IP functional models.


 Quality of IP electrical models (at PVT corners and packaging technology of
interest).
 IP to be “proven in silicon” (as part of a shuttle)
 The problem is aggravated by the greater diversity of IP functions: memory, AMS, SerDes, PLLs and clock generators, GPIOs, on-chip power management.

2.2 SoC testability:

 SoC IP complexity necessitates greater test overhead:


 Embedded core wrap test structures
 BIST, compression circuitry on-chip
 AMS, SerDes chip test (loopback test)
 Power supply sequencing
 Power mode testing

 Tester limitations (e.g. reduced pin count probing, power delivery)


 New failure mechanisms
 renewed interest in circuit-level faults

2.3 RC delay management and optimization:

 Foundries offer a variety of BEOL metallization stack options


 Metal pitch
 RC delay characteristics
 Tradeoffs between routing track density and metal layers
 Thicker metals
 Wider (1.5X, 2X, 3X) routing wires
 Needs to support clock distribution



 Needs to support power distribution
 Needs to support any hard IP
 Chosen metal stack requires full PDK support

2.4 Power optimization:


 Power optimization is becoming more “routine”
 Clock gating optimization
 RTL analysis methods for reducing switching activity
 Algorithms for calculating switching activity power:
 Simulation trace-based
 Logic estimation-based (probabilistic)
 “Deep sleep” power domain design + analysis is still tricky
 Design, sizing, positioning of “sleep FET” circuitry
 Controlling on resistance, on/off transitions (di/dt)
 The “power format” description supported by EDA tools has helped enable
additional power domain verification.

2.5 Hierarchical data management, constraint verification:


 SoCs have a large volume of functional and electrical constraints to manage for hierarchical design blocks and IP
 SoCs have a greater number of modes of functional operation and power management
 Block optimization requires constraints that are accurate and consistent across the
design hierarchy:
 Timing don’t cares, MCP’s
 Multi-mode, multi-corner definitions
 Clock domain identification, CDC’s

• EDA companies are focusing on constraint verification.



2.6 Functional verification:
 What mix of event-driven sim., cycle-driven sim., hardware acceleration, and
FPGA emulation is appropriate?
 What test case generation method(s) should be used?
 What compute server resources are needed?
 Improved algorithms for formal assertion and property verification (over a larger
sequential state space)
 Improved (language) methods for defining properties and measuring coverage.

2.7 Variation-aware analysis:


 The % variation of electrical parameters is increasing – both for devices and
interconnect.
 SRAM’s may operate on a unique Vmin domain to reduce leakage power.
 New sources of variation are arising:
 20nm: Double-patterned lithography overlay tolerance
 FinFET: new device topology
 EDA tools are increasingly supporting multi-corner and statistical analysis
(including “high-sigma” statistical analysis for SRAM “weak bit” robustness
verification).

2.8 EM:
 “In the future, more designs will be EM-limited.”
 The “self-heating” thermal profile of high-switching activity, high-frequency
devices requires detailed modeling, to determine the (local) increase in metal temp.
 EM reliability analysis will become increasingly more complex.

2.9 Lithography dependencies:


 The role of the design engineer must now include direct involvement in all
aspects of the layout composition.
 DSM process technologies require greater attention to physical proximity effects.



 n-well proximity effect
 dual stress liner topology
 STI stress effect
 pre-layout physical parasitic estimation is difficult
 post-layout extraction requires heuristics to embed the IP in a suitable
surrounding “environment” for analysis
 Chip design projects must incorporate resources for Lithography Process
Checking and layout “scoring” for pattern sensitivities that impact yield.
 With double-patterned lithography, all layout design flows need to accommodate
DPL compliance requirements.
 stdcell and IP abstracts need DPL verification
 routing on DPL layers needs to avoid loops
 pre-coloring for matching shapes to reduce overlay tolerance

 The delay in EUV lithography requires new techniques for decomposition and patterning (e.g., TPL, sidewall-based).
 Expect MUCH more engineering participation in layout strategies to be
multi-patterning compliant.
 Additional restricted design rules
 Complex coloring requirements and verification



CHAPTER 3
SoCs AS DISTRIBUTED SYSTEMS

SoCs are complex distributed systems in which a large number of parallel components
communicate with one another and synchronize their activities by message exchange.
Synchronous (clocked) logic brings a simple solution to the problem by partially ordering
transitions with respect to a succession of global events (clock signals) so as to order conflicting
read/write actions. In the absence of a global time reference, asynchronous logic has to deal with
concurrency in all its generality, and asynchronous logic synthesis relies on the methods and
notations of concurrent computing.

There exist many languages for distributed computing. The high-level language used in this report is called Communicating Hardware Processes (CHP). It is widely used, in one form or another, in the design of asynchronous systems. We introduce only those constructs of the language that are needed for describing the method and the examples, and that are common to most computational models based on communication.

3.1 Modeling Systems: Communicating Processes:

A system is composed of concurrent modules called processes. Processes do not share variables
but communicate only by send and receive actions on ports.

3.1.1 Communication, Ports, and Channels:

A send port of a process, say port R of process p1, is connected to a receive port of another process, say port L of process p2, to form a channel. A receive command on port L is denoted L?y; it assigns to local variable y the value received on L. A send command on port R, denoted R!x, assigns to port R the value of local variable x. The data item transferred during a communication is called a message. The net effect of the combined send R!x and receive L?y is the assignment y := x together with the synchronization of the send and receive actions.
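
To make these semantics concrete, the following Python sketch (our own illustration; CHP itself is a hardware description notation, and the names Channel, send, and recv are ours) models a slack-zero channel: a send R!x and a receive L?y complete together, and their combined effect is the assignment y := x.

import threading

class Channel:
    """A slack-zero CHP-style channel: send and receive complete together."""
    def __init__(self):
        self._cv = threading.Condition()
        self._value = None
        self._has_value = False
        self._taken = False

    def send(self, x):                      # models R!x
        with self._cv:
            while self._has_value:          # wait until any previous message is gone
                self._cv.wait()
            self._value, self._has_value, self._taken = x, True, False
            self._cv.notify_all()
            while not self._taken:          # block until the receiver has taken it
                self._cv.wait()

    def recv(self):                         # models L?y
        with self._cv:
            while not self._has_value:
                self._cv.wait()
            y = self._value
            self._has_value, self._taken = False, True
            self._cv.notify_all()
            return y

def p1(ch):                                 # sender process
    for x in range(3):
        ch.send(x)

def p2(ch, out):                            # receiver process
    for _ in range(3):
        out.append(ch.recv())               # net effect of R!x and L?y is y := x

ch, out = Channel(), []
t1 = threading.Thread(target=p1, args=(ch,))
t2 = threading.Thread(target=p2, args=(ch, out))
t1.start(); t2.start(); t1.join(); t2.join()
print(out)                                  # [0, 1, 2]

Hardware channels have no such shared-memory monitor, of course; the sketch only captures the synchronization semantics of the combined send and receive.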



3.1.2 Assignment:

The value of a variable is changed by an explicit assignment to the variable, as in x := expr. For a Boolean b, b↑ and b↓ stand for b := true and b := false, respectively.

3.1.3 Sequential and Parallel Compositions:

CHP and HSE provide two composition operators: the sequential operator S1; S2 and the parallel
operator. Unrestricted use of parallel composition would cause read/write conflicts on shared
variables. CHP restricts the use of concurrency in two ways. The parallel bar ║, as in S1║S2,
denotes the parallel composition of processes. CHP also allows a limited form of concurrency
inside a process, denoted by the comma, as in S1, S2. The comma is restricted to program parts
that are noninterfering: if S1 writes x, then S2 neither reads x nor writes x.

3.1.4 Selection, Wait, and Repetition:

The selection command is a generalization of the if statement. It has an arbitrary number (at least one) of clauses, called "guarded commands," Bi → Si, where Bi is a Boolean condition and Si is a program part. The execution of the selection consists of: 1) evaluating all guards and 2) executing the command Si whose guard Bi is true. In this version of the selection, at most one guard can be true at any time. There is also an arbitrated version in which several guards can be true; in that case, an arbitrary true guard is selected.

In both versions, when no guard is true, the execution is suspended: the execution of the
selection reduces to a wait for a guard to become true. Hence, waiting for a condition to be true
can be implemented with the selection [B → skip], where skip is the command that does nothing
but terminates. A shorthand notation for this selection is [B].
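
The following minimal Python sketch (our own illustration; the function names select and wait are ours) mimics these semantics in software: all guards are evaluated, execution suspends while no guard is true, and the command of a true guard is then executed.

import time

def select(clauses, poll=0.001):
    """Execute a CHP-style selection [B1 -> S1 [] B2 -> S2 ...].

    clauses is a list of (guard, command) pairs, where guard is a
    zero-argument function returning a Boolean and command is a
    zero-argument function.  While no guard is true, the call suspends
    (busy-waits here) until one becomes true -- this is the wait.
    """
    while True:
        ready = [cmd for guard, cmd in clauses if guard()]
        if ready:
            return ready[0]()       # unarbitrated version: at most one guard is true
        time.sleep(poll)            # no guard true: wait

def wait(b):
    """The wait [B] is the selection [B -> skip] with a single clause."""
    select([(b, lambda: None)])

# Example: exactly one guard is true, so its command runs.
x = 5
select([(lambda: x > 0, lambda: print("positive")),
        (lambda: x < 0, lambda: print("negative"))])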

3.1.5 Pipeline Slack and Slack Matching:

Slack matching is an optimization by which simple buffers are added to a system of distributed processes to increase the throughput. A pipeline is a connected subgraph of the process graph with one input port and one output port. The static slack of a pipeline is the maximal number of messages the pipeline can hold. A pipeline consisting of a chain of n simple buffers has a static slack of n, since each simple buffer can hold at most one message and the channels have slack zero, unless, as we shall see, the buffer implementations are subjected to a transformation called reshuffling, which can reduce their slack.

3.2 Modeling System Components: HSE

Each CHP process is refined into a partial order of signal transitions, i.e., transitions on Boolean
variables. The HSE notation is not different from CHP except that it allows only Boolean
variables, and send and receive communications have been replaced with their handshaking
expansion in terms of the Boolean variables modeling the communication wires. The modeling
of wires introduces a restricted form of shared variables between processes (the variables
implementing channels).

The input variables li and ri can only be read. The output variables lo and ro can be read and
written.

3.3 Stability and Noninterference:

Stability and noninterference are the two properties of a production-rule set (PRS) that guarantee that the circuits operate correctly, i.e., without logic hazards. A hazard is the possibility of an incomplete transition.

How do we guarantee the proper execution of a production rule G → t? In other words, what can go wrong and how do we avoid it? Two types of malfunction may take place: 1) G may cease to hold before transition t has completed, as the result of a concurrent transition invalidating G, and 2) the complementary transition t' of t is executed while the execution of t is in progress, leading to an undefined state. We introduce two requirements, stability and noninterference, that eliminate these two sources of malfunction.

Definition 1: A production rule G → t is said to be stable in a computation if and only if G can change from true to false only in those states of the computation in which R(t) holds. A production-rule set is said to be stable if and only if all production rules in the set are stable.



Definition 2: Two production rules Bu → x↑ and Bd → x↓ are said to be noninterfering in a computation if and only if ¬Bu ∨ ¬Bd is an invariant of the computation. A production-rule set is noninterfering if every pair of complementary production rules in the set is noninterfering.

Any concurrent execution of a stable and noninterfering PRS is equivalent to the sequential
execution model in which, at each step of the computation, a PR with a true guard is selected and
executed. The selection of the PR should be weakly fair, i.e., any enabled PR is eventually
selected for execution.

The existence of a sequential execution model for QDI computations greatly simplifies reasoning about, and simulating, those computations. Properties similar to stability are used in other theories of asynchronous computation, in particular semi-modularity and persistency. At the logical level, a production rule whose own transition invalidates its guard, for example x → x↓, is called self-invalidating. We exclude self-invalidating production rules, since, in most implementations, they would violate the stability condition.
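
The sequential execution model is easy to mimic in software. The sketch below (our own illustration in Python, not an actual PRS tool) repeatedly fires one enabled, effective production rule until none remains; the example rule set describes the behaviour of a C-element.

import random

def run_prs(state, rules, steps=20):
    """Sequential execution model of a production-rule set (PRS).

    state is a dict of Boolean signal values; each rule is a pair
    (guard, (signal, value)) meaning  guard -> signal up / signal down.
    At every step one enabled, effective rule is chosen and executed.
    """
    for _ in range(steps):
        enabled = [(sig, val) for guard, (sig, val) in rules
                   if guard(state) and state[sig] != val]
        if not enabled:
            break                           # no effective rule: the computation has settled
        sig, val = random.choice(enabled)   # approximates weak fairness
        state[sig] = val
    return state

# Production rules of a C-element:  a & b -> c up   and   !a & !b -> c down
c_element = [
    (lambda s: s["a"] and s["b"],         ("c", True)),
    (lambda s: not s["a"] and not s["b"], ("c", False)),
]

print(run_prs({"a": True, "b": True, "c": False}, c_element))   # c becomes True
print(run_prs({"a": False, "b": True, "c": True}, c_element))   # c holds its value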

3.4 Isochronic Forks:

A computation implements a partial order of transitions. In the absence of timing assumptions, this partial order is based on a causality relation. For example, transition x↑ causes transition y↓ in state S if and only if x↑ makes the guard By of y↓ true in S. Transition y↓ is then said to acknowledge transition x↑. We do not have to be more specific about the precise ordering in time of transitions x↑ and y↓: the acknowledgment relation is enough to introduce the desired partial order among transitions and to conclude that x↑ precedes y↓. In an implementation of the circuit, gate Gx with output x is directly connected to gate Gy with output y, i.e., x is an input of Gy.

In the example of Fig. 3.1, the fork (x, x1, x2) is isochronic: a transition on x1 causes a transition on y only when c is true, and a transition on x2 causes a transition on z only when c is false. Hence, certain transitions on x1 and on x2 are not acknowledged, and therefore a timing assumption must be used to guarantee the proper completion of those unacknowledged transitions.



Figure 3.1: Isochronic forks [5]

Hence, a necessary condition for an asynchronous circuit to be delay-insensitive is that all transitions are acknowledged.

Unfortunately, the class of computations in which all transitions are acknowledged is very limited. Consider the example of Fig. 3.1. Signal x is forked to x1, an input of gate Gy with output y, and to x2, an input of gate Gz with output z. A transition x↑ when c holds is followed by a transition y↑, but not by a transition z↑, i.e., transition x1↑ is acknowledged but transition x2↑ is not, and vice versa when ¬c holds. Hence, in either case, a transition on one output of the fork is not acknowledged. In order to guarantee that the unacknowledged transition completes without violating the specified order, a timing assumption called the isochronicity assumption has to be introduced, and the forks that require that assumption are called isochronic forks.



CHAPTER 4
ASYNCHRONOUS COMMUNICATION
PROTOCOLS

The implementation of send/receive communication is central to the methods of asynchronous logic, since this form of communication is used at all levels of system design, from communication between, say, a processor and a cache down to the interaction between the control part and the data path of an ALU. Communication across a channel connecting two asynchronous components p1 and p2 is implemented as a handshake protocol. In a later section, we will describe how to implement communication between a synchronous (clocked) component and an asynchronous one. Such interfaces are needed in a GALS SoC.

4.1 Bare Handshake Protocol:

Let us first implement a "bare" communication between processes p1 and p2, in which no data is transmitted. (Bare communications are used as a synchronization point between two processes.) In that case, channel (R, L) can be implemented with two wires: wire (ro, li) and wire (lo, ri). (The wires that implement a channel are also called rails.) This is shown in Fig. 4.1.

Figure 4.1: Implementation of a "bare" channel (L, R) with two handshake wires [1]



Wire (ro, li) is written by p1 and read by p2. Wire (lo, ri) is written by p2 and read by p1. An assignment ro↑ or ro↓ in p1 is eventually followed by the corresponding assignment li↑ or li↓ in p2, due to the behavior of wire (ro, li), and symmetrically for variables lo and ri and wire (lo, ri). By convention, and unless specified otherwise, all variables are initialized to false.

4.1.1 Two-Phase Handshake:

The simplest handshake protocol implementing the slack-zero communication between R and L is the so-called two-phase handshake protocol, also called non-return-to-zero (NRZ). The protocol is defined by the following handshake sequence Ru for R and Lu for L:

Ru : ro↑; [ri]

Lu : [li]; lo↑

Given the behavior of the two wires (ro, li) and (lo, ri), the only possible interleaving of the elementary transitions of Ru and Lu is ro↑; li↑; lo↑; ri↑.

This interleaving is a valid implementation of a slack zero execution of R and L, since there is no
state in the system where one handshake has terminated and the other has not started. But now all
handshake variables are true, and therefore the next handshake protocol for R and L has to be

Rd : ro↓; [¬ri]

Ld : [¬li]; lo↓

The use of the two different protocols is possible if it can be statically determined (i.e., by
inspection of the CHP code) which are the even (up-going) and odd (down-going) phases of the
communication sequence on each channel. But if, for instance, the CHP program contains a
selection command, it may be impossible to determine whether a given communication is an
even or odd one.

U12EC003, Odd Semester 2015-16 (18)


4.1.2 Four-Phase Handshake:

A straightforward solution is to always reset all variables to their initial value (zero). Such a
protocol is called four-phase or return-to-zero (RZ). R is implemented as Ru; Rd and L as Lu; Ld
as follows:

R : ro↑; [ri]; ro↓; [¬ri]

L : [li]; lo↑; [¬li]; lo↓

In this case, the only possible interleaving of transitions for a concurrent execution of R and L is ro↑; li↑; lo↑; ri↑; ro↓; li↓; lo↓; ri↓.

Again, it can be shown that this interleaving implements a slack-zero communication between R
and L. It can even be argued that this implementation is in fact the sequencing of two slack-zero
communications: the first one between Ru and Lu, the second one between Rd and Ld. This
observation will be used later to optimize the protocols by a transformation called reshuffling.
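
As a sanity check, the four-phase protocol can be simulated at the transition level. The Python sketch below (our own illustration; wire delays are idealized and the helper names are ours) interleaves the sender and receiver handshake sequences over the two wires and reproduces the transition order given above.

def sender():
    # R: ro up ; [ri] ; ro down ; [not ri]
    yield ("set", "ro", True)
    yield ("wait", "ri", True)
    yield ("set", "ro", False)
    yield ("wait", "ri", False)

def receiver():
    # L: [li] ; lo up ; [not li] ; lo down
    yield ("wait", "li", True)
    yield ("set", "lo", True)
    yield ("wait", "li", False)
    yield ("set", "lo", False)

def run(procs):
    """Interleave the handshake sequences; wires (ro,li) and (lo,ri) are ideal."""
    wires = {"ro": False, "ri": False, "lo": False, "li": False}
    pending = {p: next(p) for p in procs}
    trace = []
    while pending:
        progressed = False
        for p, (kind, wire, value) in list(pending.items()):
            if kind == "wait" and wires[wire] != value:
                continue                          # wait not yet satisfied: stay blocked
            if kind == "set":
                alias = {"ro": "li", "lo": "ri"}[wire]   # far end of the ideal wire
                wires[wire] = wires[alias] = value
                arrow = "+" if value else "-"
                trace += [wire + arrow, alias + arrow]
            try:
                pending[p] = next(p)
            except StopIteration:
                del pending[p]
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock")
    return trace

print(run([sender(), receiver()]))
# ['ro+', 'li+', 'lo+', 'ri+', 'ro-', 'li-', 'lo-', 'ri-']

The single possible trace is exactly the interleaving stated above, which is why the protocol implements a slack-zero communication.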

4.2 Handshake Protocols With Data: Bundled Data

Let us now deal with the case when the communication also entails transmitting data, for instance, by sending on R (R!x) and receiving on L (L?y). A solution immediately comes to mind: add a collection of data wires next to the handshake wires. The data wire (rd, ld) is indicated by a double arrow in Fig. 4.2. The protocols are as follows:

R!x : rd := x; ro↑; [ri]; ro↓; [¬ri]

L?y : [li]; y := ld; lo↑; [¬li]; lo↓

This protocol relies on the timing assumption that the ordering between rd := x and ro↑ in the sender is preserved at the receiver: when the receiver has observed li to be true, it can assume that ld has been set to the right value, which amounts to assuming that the delay on wire (ro, li) is always safely longer than the delay on wire (rd, ld). Such a protocol is called bundled data. The relative efficiency of bundled data versus DI codes is a hotly debated issue.



Figure 4.2. A bundled-data communication protocol [1]

4.3 DI Data Codes:

In the absence of timing assumptions, the protocol cannot rely on a single wire to indicate when the data wires have been assigned a valid value by the sender. The validity of the data has to be encoded with the data itself. A DI data code is one in which the validity and neutrality of the data are encoded within the data. Furthermore, the code is chosen such that when the data changes from neutral to valid, no intermediate value is valid, and when the data changes from valid to neutral, no intermediate value is neutral. Such codes are also called separable. There are many DI codes, but two are used almost exclusively on chip: the dual-rail and 1-of-N codes.

4.4 Dual-Rail Code:

In a dual-rail code, two wires, bit.0 and bit.1, are used for each bit of the binary representation of
the data.

Figure 4.3. A dual-rail coding of a Boolean data-channel [1]
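
A small Python sketch of the dual-rail convention (our own illustration): each data bit b is carried on a rail pair (bit.0, bit.1), the all-zero pair is the neutral value, and exactly one rail per bit is raised to signal a valid value.

def dual_rail_encode(bits):
    """Encode a list of Booleans; each bit b becomes the rail pair (bit.0, bit.1)."""
    return [(0, 1) if b else (1, 0) for b in bits]

def dual_rail_neutral(n_bits):
    """The neutral (spacer) value: both rails low for every bit."""
    return [(0, 0)] * n_bits

def dual_rail_decode(pairs):
    """Decode a valid dual-rail word back to Booleans."""
    assert all(r0 + r1 == 1 for r0, r1 in pairs), "word is not a valid code word"
    return [bool(r1) for r0, r1 in pairs]

word = dual_rail_encode([True, False, True])
print(word)                       # [(0, 1), (1, 0), (0, 1)]
print(dual_rail_decode(word))     # [True, False, True]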



4.5 1-of-N Codes:

In a 1-of-N code, one wire is used for each value of the data; a two-bit data word is thus encoded as shown in Fig. 4.4.

For a Boolean data word, dual-rail and 1-of-N are obviously identical. For a 2-bit data word, both dual-rail and 1-of-4 codes require four wires. For an N-bit data word, dual-rail requires 2N wires. If the bits of the original word are paired and each pair is 1-of-4 encoded, this coding also requires 2N wires. An assignment of a valid value to a dual-rail-coded word requires 2N transitions, but only N transitions in the case of a 1-of-4 code.

Figure 4.4. A 1-of-4 coding of a four-valued integer data channel [1]
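
The wire and transition counts quoted above can be checked with a few lines of arithmetic (a sketch with our own function names; transitions are counted over a full validity/neutrality cycle):

def dual_rail_cost(n_bits):
    # 2 wires per bit; one up-going transition per bit to assert validity and
    # one down-going transition per bit to return to neutral.
    return {"wires": 2 * n_bits, "transitions_per_cycle": 2 * n_bits}

def one_of_four_cost(n_bits):
    # Bits are paired; each pair uses 4 wires but only one of them fires,
    # so a full validity/neutrality cycle costs n_bits transitions in total.
    assert n_bits % 2 == 0
    pairs = n_bits // 2
    return {"wires": 4 * pairs, "transitions_per_cycle": 2 * pairs}

print(dual_rail_cost(8))     # {'wires': 16, 'transitions_per_cycle': 16}
print(one_of_four_cost(8))   # {'wires': 16, 'transitions_per_cycle': 8}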

4.6 k-out-of-N Codes:

The 1-of-N code, also called one-hot, is a special case of a larger class of codes called k-out-of-N. Instead of using just one true bit out of N code bits, as is done in the 1-of-N code, we may use k, with 0 < k < N, true bits to represent a valid code value. Hence, the maximal number of valid values for a given N is obtained by choosing k = N/2. Sperner has proved that this code is not only the optimal k-out-of-N code, but also the optimal DI code in terms of the size of the code set for a given N.
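
The size of the code set is simply a binomial coefficient, maximal at k = N/2, which is Sperner's bound. A quick check in Python:

from math import comb

def code_size(n, k):
    """Number of valid code words of a k-out-of-N code."""
    return comb(n, k)

n = 8
print([code_size(n, k) for k in range(n + 1)])
# [1, 8, 28, 56, 70, 56, 28, 8, 1]  -> maximal at k = n/2 = 4 (70 code values)
print(code_size(n, 1))   # 8: a 1-of-8 code encodes only eight values on the same wires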

4.7 Which DI Code?

The choice of a DI code in the design of a system on a chip is dictated by a number of practical
requirements. First, the tests for validity and neutrality must be simple. The neutrality test is
simple: as in all codes, the unique neutral value is the set of all zeroes or the set of all ones. But
the validity test may vary greatly with the code. Second, the coding and decoding of a data word
must be simple. Third, the overhead in terms of the number of bits used for a code word
compared to the number of bits used for a data word should be kept reasonably small.

Finally, the code should be easy to split: a coded word is often split into portions that are distributed among a number of processes; for example, a processor instruction may be decomposed into an opcode and several register fields. It is very convenient if the portions of a code word are themselves a valid code word. This is the case for the dual-rail code for all partitions, and for the 1-of-4 code for partitions down to a quarter-byte.

4.8 Validity and Neutrality Tests:

The combination of four-phase handshake protocol and DI code for the data gives the following
general implementation for communication on a channel. In this generic description, we use
global names for both the sender and receiver variables. A collection of data wires called data
encodes the message being sent. A single acknowledge wire ack is used by the receiver to notify
the sender that the message has been received. This wire is called the enable wire when it is
initialized high.
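
For a dual-rail-coded channel, the validity and neutrality tests are particularly simple. The Python sketch below (our own illustration) tests a word represented as a list of rail pairs: the word is neutral when every rail is low, valid when every bit has exactly one rail high, and in transit otherwise. In a circuit, the per-bit tests are typically OR gates and the word-level conjunction is performed by the completion tree discussed in Chapter 5.

def is_neutral(pairs):
    """True when every rail of every bit is low (the unique neutral value)."""
    return all(r0 == 0 and r1 == 0 for r0, r1 in pairs)

def is_valid(pairs):
    """True when every bit has exactly one rail high (a complete code word)."""
    return all(r0 + r1 == 1 for r0, r1 in pairs)

# Because the code is separable, a word moving from neutral to valid passes
# only through states that are neither neutral nor valid:
print(is_neutral([(0, 0), (0, 0)]))            # True
print(is_valid([(0, 1), (1, 0)]))              # True
print(is_neutral([(0, 1), (0, 0)]),            # False
      is_valid([(0, 1), (0, 0)]))              # False (intermediate state)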



CHAPTER 5
BASIC BUILDING BLOCKS

The three basic building blocks are: 1) a circuit that sequences two bare communication actions (the sequencing of any two arbitrary actions can be reduced to the sequencing of two bare communications); 2) a circuit that reads and writes a single-bit register; and 3) a circuit that computes a Boolean function of a small number of bits.

5.1 Sequencer:

The basic sequencing building block is the sequencer process, also called the left–right buffer, p1 : *[L; R], which repeatedly performs a bare communication on its left port L followed by a bare communication on its right port R. The two ports are connected to an environment that imposes no restriction on the two communications. The simplest implementation is obtained when both ports are active. (The reason is that a handshake on a passive port is initiated by the environment and therefore requires extra effort to be synchronized.)

Now, all the states that need to be distinguished are uniquely determined, and we can generate a PR set that implements the HSE. This leads to the two solutions shown in Figs. 5.1 and 5.2. In the first solution, the state variable x is implemented with a C-element; in the second, with cross-coupled NOR gates.

All other forms of the left–right buffer are derived from the active–active buffer by changing an active port into a passive one. The conversion is done by a simple C-element. The passive–active buffer is shown in Fig. 5.3.



Figure 5.1. Implementation of an active–active buffer (sequencer) with a C-element implementation of the
state bit [1]

Figure 5.2. Implementation of an active–active buffer (sequencer) with a cross-coupled NOR-gate implementation of the state bit [1]

Figure 5.3. A passive–active buffer implemented as an active–active buffer [1]
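
The Muller C-element used above as the state bit is a state-holding gate: its output goes high only when both inputs are high, goes low only when both inputs are low, and otherwise holds its previous value. A behavioural Python sketch (our own illustration, not a transistor-level model of Fig. 5.1):

class CElement:
    """Muller C-element: the output follows the inputs only when they agree."""
    def __init__(self, out=False):
        self.out = out

    def update(self, a, b):
        if a and b:
            self.out = True      # both inputs high -> output high
        elif not a and not b:
            self.out = False     # both inputs low  -> output low
        return self.out          # otherwise the previous value is held

c = CElement()
print(c.update(True, False))   # False (holds the initial value)
print(c.update(True, True))    # True
print(c.update(False, True))   # True  (still holds)
print(c.update(False, False))  # False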



5.1.1 Reshuffling and Half-Buffers:

We have already mentioned that the down-going phase of a four-phase handshake is solely for
the purpose of resetting all variables to their initial (neutral state) values, usually false. The
designer therefore has some leeway in the sequencing of the down-going actions of a
communication with respect to other actions of an HSE. The transformation that moves a part of
a handshake sequence in an HSE is called reshuffling. It is an important transformation in
asynchronous system synthesis as many alternative implementations of the same specification
can be understood as being different reshufflings of the same initial HSE.

5.1.2 Simple Half-Buffer:

Its interest is that it leads to a very simple implementation: a simple C-element with the output
replicated to be both lo and ro, as shown in Fig. 5.4

Figure 5.4. A simple half-buffer using Bare handshake [1]

5.1.3 C-Element Full-Buffer:

Another (less drastic) reshuffling of the original HSE is shown in Fig. 5.5; it admits a two-C-element implementation.

Figure 5.5. A full-buffer FIFO stage using Bare handshake [1]



5.2 Reshuffling and Slack:

Reshuffling is used to simplify the implementation. By overlapping two or more handshaking sequences, reshuffling reduces the number of states the system has to step through, often eliminating the need for additional state variables. Reshuffling also makes it possible to pass data directly from an input port, say L, to an output port, say R, without using an internal register x. In such a case, we write R!(L?) instead of L?x; R!x.

But reshuffling may also reduce the slack of a pipeline stage when it is applied to an input port
and an output port, for instance, L and R in the simple buffer. Hence, reshuffling a buffer HSE is
usually a tradeoff between reducing the circuit complexity on the one hand, and reducing the
slack on the other hand, thereby reducing the throughput.

5.3 Single-Bit Register:

Next, we implement a register process that provides read and write access to a single Boolean
variable, x. The environment can write a new value into x through port P, and read the current
value of x through port Q. Read and write requests from the environment are mutually exclusive.

As shown in Fig. 5.6, input port P is implemented with two input wires, p.1 for receiving the value true and p.0 for receiving the value false, and one acknowledge wire, po. Output port Q is implemented with two output wires, q.1 for sending the value true and q.0 for sending the value false, and one request wire, qi. Variable x is also dual-rail encoded, as the pair of variables (xt, xf).

Figure 5.6. Handshake wires for the single-bit register[1]



5.4 N-bit Register and Completion Tree:

An n-bit register R is built as the parallel composition of n one-bit registers ri. Each register ri
produces a single write-acknowledge signal wack. All the acknowledge signals are combined by
an n-input C-element to produce a single write-acknowledge for R.

The completion tree puts a delay proportional to log n elementary transitions on the critical cycle. Combined with the write-acknowledge circuit itself, the completion tree constitutes the completion detection circuit, which is the main source of inefficiency in QDI design. Numerous efficient implementations of completion detection have been proposed. The read part of the n-bit register is straightforward: the read-request signal is forked to all bits of the register.
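
Behaviourally, the completion tree acts as one large C-element over the n acknowledge signals, built as a tree of two-input C-elements of depth proportional to log n. The Python sketch below (our own illustration; class and method names are ours) captures this output behaviour:

import math

class CompletionTree:
    """Detects completion of n acknowledge signals with C-element behaviour:
    the output rises only after all inputs have risen, falls only after all
    inputs have fallen, and holds its previous value in between."""
    def __init__(self, n):
        self.n = n
        self.out = False
        self.depth = math.ceil(math.log2(n))   # levels of two-input C-elements

    def update(self, acks):
        assert len(acks) == self.n
        if all(acks):
            self.out = True
        elif not any(acks):
            self.out = False
        return self.out

tree = CompletionTree(8)
print(tree.depth)                      # 3 levels, i.e. delay proportional to log n
print(tree.update([1] * 8))            # True : all one-bit registers acknowledged
print(tree.update([1] * 4 + [0] * 4))  # True : holds until every ack has dropped
print(tree.update([0] * 8))            # False: back to neutral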

5.5 Completion Trees versus Bundled Data:

It is because of completion-tree delays that bundled data is believed by some designers to be more efficient than DI codes for the data path. The completion tree is then replaced with a delay line mirroring the delays required to write data into the registers.

However, the increasing variability of modern technology requires increasing delay margins for safety. It is the authors' experience that, after accounting for all margins, the total delay of bundled data is usually longer than the completion-tree delay, and bundled data gives up the robustness of QDI [1].

5.6 Two Design styles for Asynchronous Pipelines:

In systems where throughput is important, computation is usually pipelined. A pipeline stage is a component that receives data on several input ports, computes a function of the data, and sends the result on an output port. The stage may simultaneously compute several functions and send the results on several output ports. Both input and output may be used conditionally.

In order to pipeline successive computations of the function, the stage must have slack between
input ports and output ports. In this section, we present two different approaches to the design of
asynchronous pipelines.



In the first approach, each stage can be complex (coarse-grain); the control and data path of a stage are separated and implemented independently. The decomposition is syntax-directed. (This style was introduced early and was used in the design of the first asynchronous microprocessor [7].)

The second approach is aimed at fine-grain, high-throughput pipelines. The data path is decomposed into small portions in order to reduce the cost of completion detection, and for each portion, control and data path are integrated in a single component, usually a precharge half-buffer. The implementation of a pipeline as a collection of fine-grain buffers is based on data-driven decomposition [6].

5.6.1 First Approach: Control-Data Decomposition:

In its simplest form, a pipeline stage receives a value x on port L and sends the result of a
computation, f (x), on port R. The design of a pipeline stage combines all three basic operations:
sequencing between L and R, storage of parameters, and function evaluation. A simple and
systematic approach consists of separating the three functions.

• A control part implements the sequencing between the bare ports of the process and provides a slack of one in the pipeline stage.

• A register stores the parameter x received on L.

• A function component computes f(x) and assigns the result to R.

5.6.2 Second Approach: Integrated Pipelines:

Simplicity and generality are the strengths of the previous approach to pipeline design; it allows
quick circuit design and synthesis. However, the approach puts high lower bounds on the cycle
time, forward latency, and energy per cycle.

First, the inputs on L and the outputs on R are not interleaved in the control, putting all eight
synchronizing transitions in sequence. Second, the completion-tree delay, which is proportional
to the logarithm of the number of bits in the data path, is included twice in the handshake cycle
between two adjacent pipeline stages.



The fine-grain integrated approach we describe next is targeted at high-throughput designs. It eliminates the performance drawbacks of the previous approach by two means:

 The handshake sequence of L and the handshake sequence of R are reshuffled with respect to each other so as to overlap some of the transitions and eliminate the need for explicit registers for input data, and
 The data path is decomposed into independent slices so as to reduce the size of the completion trees and improve the cycle time.



CHAPTER 6
CONCLUSION

The purpose of this report was to expose the SoC architect to a comprehensive set of standard asynchronous techniques and building blocks for SoC interconnects and on-chip communication. Although the field of asynchronous VLSI is still in development, the techniques and solutions presented here have been extensively studied, scrutinized, and tested in the field; several microprocessors and communication networks have been successfully designed and fabricated.

The techniques are here to stay. The basic building blocks for sequencing, storage, and function
evaluation are universal and should be thoroughly understood. At the pipeline level, we have
presented two different approaches: one with a strict separation of control and data path, and an
integrated one for high throughput. Different versions of both approaches are used.

At the system level, issues of slack, choice of handshake, and reshuffling affect the system performance in a profound way. We have tried to make the designer aware of their importance. At the more fundamental level, issues of stability, isochronic forks, validity and neutrality tests, and state encoding must be understood in order for the designer to avoid the recurrence of the hazard malfunctions that plagued early attempts at asynchrony.

While we realize very well that the engineer of an SoC has the freedom and duty to make all
timing assumptions necessary to get the job done correctly, we also believe that, from a didactic
point of view, starting with the design style from which all others can be derived is the most
effective way of teaching this still vastly misunderstood but beautiful VLSI design method.



References:
[1] A. J. Martin and M. Nyström, “Asynchronous Techniques for System-on-Chip Design”, Proceedings of the IEEE, vol. 94, no. 6, pp. 1089–1120, June 2006.
[2] Michael J. Flynn, Wayne Luk, “Computer System Design: System on Chip”, John Wiley and
Sons Inc. 2011.
[3] Tom Dillinger, “Top 10 Challenges for SoC Chip Design”, IEEE CS chapter technical
meeting, April 8, 2014.
[4] H. Chang, L. Cooke, M. Hunt, G. Martin, A. McNelly, and L. Todd,” Surviving the SOC
Revolution: A Guide to Platform-Based Design”, Kluwer Academic Publishers, Boston,
1999.
[5] D. E. Muller and W. S. Bartky, “A Theory of Asynchronous Circuits”, Proc. Int. Symp.
Theory of Switching, pp. 204–243, 1959.
[6] C. G. Wong and A. J. Martin, “High-level synthesis of asynchronous systems by data-driven decomposition”, in Proceedings of the 40th Annual Design Automation Conference, pp. 508–513, ACM, 2003.
[7] A. J. Martin et al., “The design of an asynchronous microprocessor”, Proc. Decennial Caltech
Conf. Advanced Research in VLSI, C. L. Seitz, Ed., pp. 351–373,1991.
[8] J. Sparsø and S. Furber, Eds., “Principles of Asynchronous Circuit Design: A Systems
Perspective” Boston, MA: Kluwer, 2001.
[9] Cortadella, Jordi, Michael Kishinevsky, Alex Kondratyev, Luciano Lavagno, and Alex
Yakovlev, “Logic synthesis for asynchronous controllers and interfaces”, Vol. 8. Springer
Science & Business Media, 2012.
[10] A. M. Lines, “Pipelined asynchronous Circuits”, M.S. thesis, California Inst. Technol.,
Pasadena, 1997.



Acronyms
SOC System on chip

IP Intellectual property

VLSI Very large scale integration

GALS Globally asynchronous and locally synchronous

CAD Computer aided design

QDI Quasi delay insensitive

DI Delay insensitive

CHP Communicating hardware processes

