Wen-Long Chin - Principles of Verilog Digital Design-CRC Press (2022)
Principles of Verilog
Digital Design
Wen-Long Chin
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
DOI: 10.1201/9781003187196
Preface
Modern digital circuits are described using a hardware description language, following
the semi-custom design methodology. A logic gate schematic is then synthesized
using a standard cell library, and the physical layout can subsequently be
implemented. Therefore, several electronic design automation tools, especially the
synthesizer, should be learned early as a counterpart to the design itself.
Key components of computer organization, such as interconnect, memory sys-
tem, arbiter, I/O controller, embedded processor, first-in-first-out, and accelerator,
together with their register-transfer level (RTL) codes, are presented in this book.
An embedded co-processor for the Advanced Encryption Standard algorithm is il-
lustrated. Moreover, assembly codes to drive the co-processor are introduced so that
readers can fully understand the way every instruction is performed in a processor.
Several application-specific integrated circuit (ASIC) designs are provided in full
in this book. Major digital signal processing (DSP) techniques, such as digital
filters, the fast Fourier transform, source coding, and image processing, are
implemented in RTL designs as well. We also demonstrate step-by-step
instructions for a fixed-point DSP design.
In addition to the theoretical background, such as the probability of a synchro-
nizer entering an illegal state, the system-level design for the synchronization of sig-
nals across different clock domains is comprehensively presented via three sections:
single-bit synchronizer, deterministic multi-bit synchronizer, and nondeterministic
multi-bit synchronizer (with and without flow control).
There are ten chapters and five appendices in this book. For your easy reference,
they are listed below:
Chapter 1 Introduction
Chapter 2 Fundamentals of Verilog
Chapter 3 Advanced Verilog Topics
Chapter 4 Number Representation
Chapter 5 Combinational Circuits
Chapter 6 Sequential Circuits
Chapter 7 Digital System Designs
Chapter 8 Advanced System Designs
Chapter 9 I/O Interface
Chapter 10 Logic Synthesis with Design Compiler
Appendix A Basic Logic Gates and User Defined Primitives
Appendix B Non-Synthesizable Constructs
Appendix C Advanced Net Data Types
Appendix D Signed Multipliers
Appendix E Design Principles and Guidelines
Wen-Long Chin
Acknowledgments
I would like to express my gratitude to my students who contributed to this book,
particularly to David Chen and Vivian Pan, who drew illustrations and verified RTL
codes used in this book; to Gabriella Williams, my Editor at Taylor & Francis Group;
and to the Taylor & Francis publishing staff for their support during this publication
project.
1 Introduction
The design methodologies for digital and analog circuits, with an emphasis on the
application-specific integrated circuit (ASIC) design flow, are introduced. You will
gain a clear picture of what modern register-transfer level (RTL) design is and of
the requirements of a workable chip. The timing constraints of setup time
and hold time are briefly presented. This chapter also explains the terminology of
ASIC design, such as functional verification, logic synthesis, timing verification, and
physical implementation.
DOI: 10.1201/9781003187196-1
as conversion from analog to digital and vice versa, voltage regulator, phase-locked
loop (PLL), and processing of ultra high-speed signals.
• Performance: they have higher accuracy, cost efficiency, and lower power
consumption.
• Reliability: they are less affected by ageing, noise, and variations in tem-
perature and environment.
• Flexibility: they have memory and are easier to design. Information and data
can be easily stored, processed, and communicated. These systems are more
versatile and can achieve highly complicated functions. Moreover, system
operation can be changed by interacting with software.
Digital electronics are the main foundation for the digitized world. To understand
the function of each digital module, it is necessary to have fundamental knowledge of
digital circuits and their logical operation. Almost every electronic device in which
transistors are used as switches applies the basic concepts of digital techniques. The first family of
digital logic gaining widespread use was the transistor–transistor logic (TTL) family
within which the logic gates are formed by bipolar junction transistors (BJTs). The
electrical properties of these devices led to design standards that still influence logic
design practice today.
In more recent times, TTL components have been largely replaced by those us-
ing complementary metal-oxide semiconductor (CMOS) circuits, which are based
on both n-channel and p-channel field-effect transistors (FETs). For example, the
simplest CMOS logic gate is the inverter (or NOT gate in logic design) consisting of
one n-channel and one p-channel metal-oxide semiconductor field-effect transistors
(MOSFETs), as shown in Figure 1.3. The FETs can be viewed as switches controlled by
the input A. An n-channel MOSFET (NMOS) turns on when input A is logic 1 and turns off
when input A is logic 0, while a p-channel MOSFET (PMOS) behaves in the opposite way.
For example, as displayed in Figure 1.3(c), when input A is logic 1, Q2 turns off and
Q1 turns on, and hence Y becomes logic 0. In Figure 1.3(d), when input A is logic 0,
Q2 turns on and Q1 turns off, which drives Y to logic 1. Because the two transistors
are never on at the same time in steady state, ideally no static power is consumed
from VDD to ground in CMOS logic gates apart from leakage power. Dynamic power,
however, is unavoidable; it is consumed during signal transitions.
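The switch view of the inverter can be captured in a short Python sketch. It is only an illustration of the description above, not a circuit simulator: we assume that Q1 is the n-channel (pull-down) device and Q2 the p-channel (pull-up) device, as the on/off behavior in the text implies, and the function name cmos_inverter is hypothetical.

```python
def cmos_inverter(a: int) -> dict:
    """Switch-level view of a CMOS inverter.

    Assumption: Q1 is the NMOS pull-down and Q2 is the PMOS pull-up,
    inferred from the on/off behavior described in the text.
    """
    if a not in (0, 1):
        raise ValueError("input A must be logic 0 or logic 1")
    q1_on = (a == 1)  # NMOS conducts when its gate is logic 1
    q2_on = (a == 0)  # PMOS conducts when its gate is logic 0
    # Y is pulled to ground through Q1, or to VDD through Q2.
    y = 0 if q1_on else 1
    # A static VDD-to-ground path would require both switches to be on.
    static_path = q1_on and q2_on
    return {"Q1_on": q1_on, "Q2_on": q2_on, "Y": y, "static_path": static_path}


for a in (0, 1):
    print(a, cmos_inverter(a))
```

For both input values exactly one switch conducts, so the modeled output is always the complement of A and no static path from VDD to ground exists.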
The most commonly used logic is positive, or active-high, logic, in which a
low logic level represents the false condition and a high logic level represents the
true condition. In contrast, negative, or active-low, logic uses the reverse
convention. It is preferred particularly where digital circuits are able to sink more
current than they can source. Many control signals in electronics are active-low, such
as the reset signal of flip-flops and the chip-select signal. Logic families such as TTL
can sink more current than they can source, so fanout and noise immunity increase
when such signals are active-low.
Active-high and active-low logic are typically mixed: for example, a memory IC may
have chip-select and output-enable signals that are active-low, while its data and
address signals are typically active-high.
One of the salient features of digital logic is the noise margin, which tolerates
fluctuations or disturbances induced on the input signals, as shown in Figure 1.4. The
output low voltage is lower than the input low voltage, and the output high voltage is
higher than the input high voltage, so that a disturbance induced on the input signal of
a logic gate does not affect its logic function. The
symbols for the voltage thresholds are listed below.
• VOL : output low voltage – a component must drive a signal with a voltage
below this threshold to establish a low level at the output.
• VOH : output high voltage – a component must drive a signal with a voltage
above this threshold to establish a high level at the output.
• VIL : input low voltage – a component receiving a signal with a voltage below
this threshold will be represented as a logic low at the input.
• VIH : input high voltage – a component receiving a signal with a voltage
above this threshold will be represented as a logic high at the input.
Figure 1.3: CMOS inverter: (a) symbol, (b) transistor-level schematic, (c) operation
when the input A is logic 1, and (d) operation when the input A is logic 0.
With these thresholds, a noise margin exists and signals are not misinterpreted.
For example, in Figure 1.4, the signal with noise is assumed to be the output signal
of a logic gate. When the output is logic 0, its voltage level under the effects of noise
is lower than VOL, and therefore certainly lower than VIL of the input of the logic gate
it drives. Therefore, the disturbance does not cause misinterpretation at the
input of the driven gate.
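The role of the four thresholds can be made concrete with a small Python sketch. The low and high noise margins are computed here as VIL − VOL and VOH − VIH, which is the usual definition rather than a formula stated in the text, and the threshold values (in millivolts) are illustrative TTL-like numbers, not values taken from Figure 1.4. The function names are hypothetical.

```python
def noise_margins(v_ol, v_oh, v_il, v_ih):
    """Return (low, high) noise margins in the same unit as the inputs."""
    nm_low = v_il - v_ol    # disturbance a low output can absorb
    nm_high = v_oh - v_ih   # disturbance a high output can absorb
    return nm_low, nm_high


def reads_as_low(voltage, v_il):
    """A receiver interprets a voltage below VIL as logic low."""
    return voltage < v_il


# Illustrative thresholds in millivolts (assumed, TTL-like values).
V_OL, V_OH, V_IL, V_IH = 400, 2400, 800, 2000

nm_low, nm_high = noise_margins(V_OL, V_OH, V_IL, V_IH)
print("low noise margin :", nm_low, "mV")
print("high noise margin:", nm_high, "mV")

# A low output sitting at VOL with 300 mV of noise is still read as low,
# whereas 500 mV of noise exceeds the low noise margin.
print(reads_as_low(V_OL + 300, V_IL))
print(reads_as_low(V_OL + 500, V_IL))
```

Working in integer millivolts avoids floating-point rounding when the margins are compared.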
Y = A · B · C̄ + Ā · B · C̄ + C (1.1)
where the overbar, ·, and + denote the logical NOT, AND, and OR operations,
respectively. Sometimes, the AND operator is omitted for brevity. The Boolean equation
of Y can be visualized using a truth table, as shown in Table 1.1.
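A truth table such as Table 1.1 can be generated by enumerating all input combinations. The Python sketch below takes Equation (1.1) exactly as it is written above; the function name eq_1_1 is hypothetical. As a sanity check, it also confirms that the expression as written reduces to B + C.

```python
from itertools import product


def eq_1_1(a, b, c):
    """Evaluate Y = A·B·C' + A'·B·C' + C for single-bit inputs."""
    not_a, not_c = 1 - a, 1 - c
    return (a & b & not_c) | (not_a & b & not_c) | c


# One row per input combination, in the order A, B, C, Y.
rows = [(a, b, c, eq_1_1(a, b, c)) for a, b, c in product((0, 1), repeat=3)]

print("A B C | Y")
for a, b, c, y in rows:
    print(a, b, c, "|", y)

# The expression as written simplifies to B + C.
assert all(y == (b | c) for _, b, c, y in rows)
```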
Figure 1.5: Moore’s law and the trend in transistor count of Intel processors.
was done on paper or by hand on a graphic computer terminal. However, the physi-
cal properties of the IC are determined by many important operating characteristics,
including the switching speed between low and high voltages, which is affected by the
current sourced and sunk, and the minimum size of each transistor, i.e., the minimum
feature size. In addition, in keeping with Moore's law, the rapid and continuous
development of IC technology called for effective electronic design automation (EDA)
techniques.
Computer-aided design (CAD) is a paradigm shift toward design automation that increases
the productivity of the designer, improves the quality of the design, eases communication
through documentation, and creates a database for manufacturing. CAD
tools are computer programs that help manage one or more aspects of the design
process. Modern systems with high complexity are usually impossible to develop
and verify without the aid of CAD. CAD is the use of computers
(or workstations) to aid in the creation, modification, analysis, or optimization of a
design. Its output is often in the form of electronic files for printing, machining, or other
manufacturing operations. Nowadays, chip designers use CAD tools, such as
simulation, verification, and physical implementation tools, to handle the complexity
of their circuits and speed up the design process.
analog and digital circuits are designed using totally different methodologies and
tools. Care should be taken when integrating them into a single chip.
There are two different IC design methodologies, that is, full-custom and semi-
custom designs, as shown in Figure 1.7. Full-custom design fully specifies the
transistor size, placement of each transistor, and their interconnections manually.
Full-custom designs offer the highest performance and smallest die size with the
disadvantages of increased design time, complexity, and risk. Small or high-speed
analog and digital circuits requiring custom optimization adopt the full-custom
design. Traditional microprocessors were exclusively full-custom designs, but en-
gineers are turning to semi-custom designs in this field too.
Semi-custom or cell-based design describes the behavior of digital circuits us-
ing a high-level language, i.e., the hardware description language (HDL), which is
widely adopted for modern digital ASIC designs. Using suitable software
tools, a logic gate schematic is then synthesized from a standard cell library,
and the physical layout can subsequently be implemented. A standard cell library is a
collection of characterized logic gates that the logic synthesis tool can use to
realize the design described in a hardware description language. The library needs
to be updated when the technology advances.
Digital design based on cells or logic gates thus becomes an easier task, since
the designer need not consider the detailed physical information of semiconductor
devices, although the cells themselves are developed using full-custom design. All mask
layers are still customized and optimized for the placement of logic gates and their
interconnects, without regard to the physical details within the cells themselves.
Owing to these advantages, cell-based design has become the de facto design method
for digital circuits.
We introduce the ASIC design flow here. In general, as the design flow progresses
toward a physically realizable form, the design database becomes progressively more
laden with technology-specific information. After the stage of system specification,
where the leader confirms the design feasibility, decides which components take the
in-house or outsourcing solutions, and then partitions the (digital and analog) designs
into several blocks including interface definition, the design stage is kicked off.
As presented in Figure 1.8, the ASIC design flow mainly has three stages: ASIC
design, synthesis, and layout. The frontend phase of ASIC design (or cell-based design)
using standard cells generally ends at the synthesis stage. Once the synthesis tool
has mapped the HDL description into a gate-level netlist, the netlist is handed off to the
backend phase, where HDLs do not play a significant role.
In the design stage, the RTL simulation uses behavioral models of the analog circuits
and silicon intellectual properties (IPs) to verify the functions of the digital designs.
In the synthesis stage, the synthesizer reads in the timing models of the analog circuits
and silicon IPs, ignoring any timing constructs in the RTL code; these models, together
with the synthesis constraints, are then applied to optimize the digital circuits. In the
layout stage, the physical layouts of the sub-blocks, including digital, analog, and
outsourced IPs, are merged and verified against the design rules of the semiconductor
process. Digital and analog circuits are developed separately by digital and analog teams.
Owing to the sensitivity of analog circuits to disturbances, new analog circuits are
verified through test-chip measurements in addition to SPICE simulation.
To cut down the design time, the traditional pre-layout simulation (pre-sim) with
delay annotation, which is used to validate the constraints used for synthesis, can
be skipped if designers have enough confidence in their synthesis constraints. This
is the case, for instance, when the clock scheme of the design is simple, or when the
design is unchanged or only slightly modified for maintenance and its constraints were
verified in a previous version. Annotated with an exact standard delay format (SDF) file,
the post-layout simulation (post-sim) is very time consuming for a large design. Therefore,
designers select and simulate only a few normal patterns. Simulation with SDF annotation is
a form of timing analysis. To differentiate it from the timing analysis based on the
design constraints, the post-sim is also called dynamic timing analysis. By contrast,
the timing analysis based on the design constraints is very fast and operates directly
on the characterized delays of logic gates. This analysis is called static timing
analysis (STA) to emphasize that it relies on the design constraints and does
not need time-consuming simulations or dynamic simulation patterns. Finally, the
design database is taped out for mask making and fabrication.
An ASIC can be manufactured using standard cells or a gate array. A gate array is a
prefabricated silicon chip in which most transistors have no predetermined function. The
components in a gate array are later interconnected to fulfill the desired functionality.
Shared masks can save cost in an advanced semiconductor process.
To give an idea of what will fit on a typical ASIC, Table 1.2 lists the gate
counts of typical digital building blocks. The gate count of a specific component is
assessed based on its area relative to that of a 2-input NAND gate. The area
depends on the process technology, while the gate count does not. To evaluate the size of
a circuit, we therefore often convert its area to an equivalent gate count, where one
equivalent gate is a 2-input NAND gate composed of four transistors.
Example 1.1. Estimate the total gate count of an eight-tap
finite impulse response (FIR) filter. The output y(n), where n denotes the sampling
index, is calculated as follows:
$$y(n) = \sum_{m=0}^{7} h_m \, x(n - m).$$
We assume that all 8 inputs x(n − m) and 8 weights h_m, m = 0, 1, ..., 7, are 32 bits
wide. The operands of the multipliers and adders are all 32 bits as well.
Solution: There are 8 multipliers and 7 adders in the circuit. Therefore, based on
the gate counts of multipliers and adders listed in Table 1.2,
The total gate count of multipliers (A_m) = 8 × 7500 = 6 × 10^4 gate counts,
The total gate count of adders (A_a) = 7 × 750 = 5.25 × 10^3 gate counts.
To get the total gate count, A_FIR, we sum the gate counts of the components:
$$A_{FIR} = A_m + A_a = 6 \times 10^4 + 5.25 \times 10^3 = 65.25\text{K gate counts},$$
where K = 10^3. The gate count of an FIR filter is dominated by the multipliers.
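The arithmetic of Example 1.1 can be checked with a short Python sketch. The per-block gate counts (7500 for a 32-bit multiplier and 750 for a 32-bit adder) are the values used in the example; the function and constant names are hypothetical.

```python
MULTIPLIER_GATES = 7500  # 32-bit multiplier, value used in Example 1.1
ADDER_GATES = 750        # 32-bit adder, value used in Example 1.1


def fir_gate_count(taps, mult_gates=MULTIPLIER_GATES, add_gates=ADDER_GATES):
    """Estimate the gate count of a direct-form FIR filter.

    An N-tap filter needs N multipliers and N - 1 adders.
    """
    multipliers = taps
    adders = taps - 1
    return multipliers * mult_gates + adders * add_gates


total = fir_gate_count(8)
print(total)                                    # total gate count
print(round(8 * MULTIPLIER_GATES / total, 3))   # share due to multipliers
```

For the eight-tap filter the multipliers account for roughly 92% of the total, which is why they dominate the estimate.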
Example 1.2. Estimate how many FIR filters of the previous example will, in 2019,
fit into the area of a single FIR filter implemented in 2015.
Solution: According to Moore's law, the number of transistors in an IC doubles
about every two years, which leads to the following increase N in FIR filter density:
$$N = 2^{(2019 - 2015)/2} = 2^2 = 4.$$
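The same estimate can be written as a one-line function, which makes it easy to try other years. This is a sketch under the doubling-every-two-years assumption used in Example 1.2; the function name density_gain is hypothetical.

```python
def density_gain(year_from, year_to, doubling_period_years=2.0):
    """Density increase predicted by doubling every `doubling_period_years`."""
    return 2.0 ** ((year_to - year_from) / doubling_period_years)


print(density_gain(2015, 2019))  # Example 1.2
print(density_gain(2015, 2025))  # ten years means five doublings
```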
Figure 1.9: From design concept to physical layout: (a) circuit interface, (b) block
diagram and its implementation described by HDL, (c) a portion of synthesized logic
gates in the circuit, and (d) layout of the inverter.
used by system architects, ASIC and field programmable gate array (FPGA) design-
ers, verification engineers, and model developers.
Prototype ICs are too expensive and time consuming to build, so all modern de-
signs rely on HDL to describe, design, and test a circuit in software before it finally
goes into the manufacturing stage. An HDL enables a precise, formal description
of an electronic circuit, which allows for its automated analysis and simulation.
The Verilog HDL has become a popular HDL standard and the choice of
many design teams. The Verilog HDL simulator, Verilog-XL, developed by Cadence
Design Systems quickly gained acceptance from design engineers. The advent of
logic synthesis in the late 1980s also radically changed the methodology of digi-
tal designs to cell-based design. The Verilog HDL description is synthesized into a
netlist (a specification of physical electronic components and how they are connected
together), as shown in Figure 1.10, which can then be placed and routed to produce
the set of masks used to create an IC. The gate-level netlist describes the logic gate
schematic.
As a documentation language, an HDL is used to represent digital systems in
a self-documenting form that can be read by both humans and computers. The
language content can be stored, retrieved, edited, exchanged, and transmitted easily. An
HDL need not be tied to a specific semiconductor technology, such as CMOS or
BJT, in an early stage. Therefore, the designs are usually engineered at a higher
level of abstraction than transistor or logic gate levels. However, HDL still supports
four kinds of descriptions: behavioral, dataflow, gate (or structural), and transistor
(or switch) descriptions.
In contrast to the sequential nature in software languages (like C language), the
HDL allows the designers to model the concurrency of processes found in hard-
ware elements, such as flip-flops (FFs) and adders, without needing to consider their
electrical characteristics. The HDL can be used to represent truth tables, Boolean
expressions, and even complex behavioral abstractions of a digital design. One way
to view an HDL is to understand the relationship between input and output signals
of a circuit that it describes.
HDLs can be processed efficiently by different software tools and can also
be used in the major steps of the traditional design flow, such as design entry, functional
simulation, logic synthesis, timing verification, and fault simulation. Timing
verification is performed by both timing analysis and timing simulation. Fault simulation
is used to confirm the testability of an IC for mass production. These steps are
introduced in the following sections.
An HDL looks much like a high-level programming language, such as C: it is a textual
description consisting of expressions, statements, and control structures.
Besides concurrency, another vital difference between most programming languages and
HDLs is that HDLs explicitly include the notion of time, which is a distinctive attribute of
hardware. An HDL supports the co-simulation of digital and analog circuits. In a
cell-based design, digital circuits can be designed at different levels of abstraction, while
analog circuits are designed and simulated at the transistor level because the detailed
electrical characteristics of the circuit are essential for an analog circuit. Behavioral
models of analog circuits are mainly used to verify the digital circuits. Owing to the
complexity of analog circuits, digital and analog circuits are designed and verified
separately. The physical analog circuits are merged into the whole chip at a later
design stage.
a few digital designs. The operations of modern digital circuits are driven by
clocks and synchronized to their edges. The term RTL refers to the fact that
the description focuses on the flow of signals between registers, as presented in Figure
1.11. RTL describes the model in terms of cycles of a defined clock. In each cycle,
the contents of one register pass through combinational circuits, and the result
replaces the contents of the destination register at the clock edge. An RTL model must be
accurate at the boundary of every clocked element. Therefore, the timing requirements of
the registers (or sequential D flip-flops), such as setup time and hold time, must be
guaranteed.
The binary information in a digital circuit must have a physical component for
storing individual bits. A binary sequential cell is a device capable of
storing one bit of information, i.e., one of two stable states, 0 or 1. A register is
a group of binary sequential cells. Verilog HDL simulation enables engineers
to work at a higher level of abstraction than simulation at the schematic level, and
thus tremendously increases design capacity. Verilog is mainly used at the RTL
abstraction, which models a synchronous digital circuit functionally in terms of the
data (signal) flows between hardware registers and the logical operations performed
on those data (signals), and it can model the timing of a circuit as well.
There are several salient features of RTL designs compared to traditional
schematic designs.
Figure 1.12: Half adder with registered output: (a) a glance at the RTL design (b)
schematic.
quality of the code written. It can also check the different paths through the HDL
source and confirm whether those paths have been tested. However,
it does not give a reliable indication that all of the required functionality has
been implemented correctly.
• Functional coverage: besides code coverage, we can use the functional
coverage, even though it is more difficult to quantify. The functional cov-
erage should identify the operations and the sequences of operations that
have been verified, the range of data values that has been applied, and the
proportion of states of registers and state machines that have been visited.
• Direct testing: particular test cases are applied to the design under test
(DUT), and the outputs of each case are then validated. This approach is very
effective for small sub-blocks which implement fairly simple functions.
• Constrained random testing: for a complex system, achieving significant
functional coverage is not feasible by direct testing because it is
impossible to simulate all the possible scenarios using traditional
testbenches. Hence, constrained random testing is gaining traction.
It involves a test pattern generator that randomly generates input data,
subject to constraints specified for the inputs. Specialized verification
languages, such as Vera and SystemVerilog, include features for specifying
constraints and for the random generation of data values to be used as stimulus.
Constrained random testing allows the user to generate random test
patterns that exercise the DUT with more combinations of inputs in
less simulation time.
The verification plan also requires a testbench to generate the test patterns or
vectors for each applied test case. Then, the outputs of DUT are compared to the
golden results, which could be produced offline by other behavioral models written in
different programming languages, or produced online by a behavioral model written
in Verilog HDL, as shown in Figure 1.14. The checker might make any adjustments
for timing differences if necessary.
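The structure of such a self-checking testbench, combined with constrained random stimulus, can be sketched in Python. Everything in this sketch is an assumption made for illustration: the function under test (an 8-bit saturating adder), the constraint on the inputs, and all names are hypothetical, and dut_model is only a software stand-in for an RTL implementation, which would normally run in a Verilog simulator.

```python
import random


def golden_model(a, b):
    """Behavioral reference: 8-bit saturating addition (assumed function)."""
    return min(a + b, 255)


def dut_model(a, b):
    """Stand-in for the RTL: 9-bit sum, saturate when the carry bit is set."""
    total = a + b
    carry = (total >> 8) & 1
    return 0xFF if carry else total & 0xFF


def buggy_dut(a, b):
    """A faulty implementation that wraps around instead of saturating."""
    return (a + b) & 0xFF


def run_testbench(dut, n_tests=1000, seed=1):
    """Apply constrained random stimulus and collect checker mismatches."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    mismatches = []
    for _ in range(n_tests):
        a = rng.randint(200, 255)  # constraint: stress the saturation corner
        b = rng.randint(0, 255)
        expected = golden_model(a, b)
        actual = dut(a, b)
        if actual != expected:     # the checker
            mismatches.append((a, b, expected, actual))
    return mismatches


print(len(run_testbench(dut_model)))      # number of mismatches
print(len(run_testbench(buggy_dut)) > 0)  # the checker flags the faulty DUT
```

Constraining the first operand to large values steers the random patterns toward the overflow case, which uniformly random inputs would exercise less often.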
Figure 1.14: A testbench that can automatically compare outputs of behavioral model
and its RTL implementation.
Example 1.3. A timing diagram is crucial for functional verification. In Figure
1.15, a pipelined design with 3 pipeline stages implements y = f(a), where the
combinational function f(·) can be decomposed into 2 functions f1(·) and f2(·), i.e.,
f(·) = f2(f1(·)). As can be seen, the pipelined design suits the RTL design style well.
Without considering the gate delays of combinational circuits and the timing constraints of
flip-flops, plot the timing diagram of the pipelined design.
Solution: In a synchronous digital circuit, data are supposed to advance
one stage at each synchronous clock edge. This is achieved by sequential
elements such as flip-flops, which simply copy their input to their
output under the control of a clock. For RTL behavioral modeling, the gate delays of
combinational circuits are all assumed to be 0. Consequently, the timing diagram is
as shown in Figure 1.16. As displayed, when stage 3 is processing a1
in clock cycle 3, leading to the data value f2(f1(a1)) in signal c, stage 2 can
concurrently process a2, leading to the data value f1(a2) in signal b. A pipelined design
can give an output every clock cycle. Consequently, the throughput is enhanced
by the pipelining technique even though the design has a latency of 3 cycles.
Figure 1.16: Timing diagram of the 3-stage pipelined design without considering the
logic gate delay.
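The cycle-by-cycle behavior of Example 1.3 can be mimicked in a few lines of Python. This is only a sketch: Figure 1.15 is not reproduced here, so the register arrangement (an input register, a register after f1, and an output register after f2) is an assumption chosen to match the description above, and the names simulate_pipeline, r1, and r2 are hypothetical. Gate delays are taken to be zero, as in the example.

```python
def simulate_pipeline(inputs, f1, f2):
    """Cycle-accurate model of a 3-stage pipeline computing y = f2(f1(a)).

    Assumed structure: r1 <= a, r2 <= f1(r1), y <= f2(r2) at each clock edge.
    None marks a register that has not been loaded yet.
    """
    r1 = r2 = y = None
    trace = []
    for cycle, a in enumerate(inputs, start=1):
        # Combinational signals settle within the cycle (zero gate delay).
        b = f1(r1) if r1 is not None else None
        c = f2(r2) if r2 is not None else None
        trace.append({"cycle": cycle, "a": a, "b": b, "c": c, "y": y})
        # Clock edge: all registers update at once from the old values.
        r1, r2, y = a, b, c
    return trace


trace = simulate_pipeline([1, 2, 3, 4, 5], f1=lambda x: x + 1, f2=lambda x: 2 * x)
for row in trace:
    print(row)
```

With a1 = 1 and a2 = 2, clock cycle 3 shows c = f2(f1(a1)) and b = f1(a2) at the same time, and from cycle 4 onward a new result appears at y in every cycle.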
too large for exhaustive simulation. On the other hand, formal verification does not
require simulation and allows complete verification of whether a design meets its
specification. System-level verification is more difficult for embedded systems. Efficient
co-simulation of software and hardware remains a challenge. Before the hardware design
is completed, the software team can use an instruction set simulator (ISS) and
the hardware behavioral model to start software development. System performance
can therefore be evaluated at an early stage.
Simulating a circuit in the presence of faults is known as fault simulation. Fault
simulation is used to verify the fault coverage and effectiveness of the test patterns,
and to guide the test pattern generator program. The test patterns are generated
automatically by a tool called the automatic test pattern generator (ATPG). To reduce the
use of expensive test equipment, the testing time should be kept short. For a digital
circuit, the scan chain is a popular test technique that makes input signals “controllable”
and output signals “observable”, including the internal input and output signals of
combinational and sequential circuits.
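The idea of fault simulation and fault coverage can be illustrated with a minimal Python sketch. It assumes the single stuck-at fault model, which is commonly used but is not named in the text, and a hypothetical three-input circuit y = (a AND b) OR c with an internal net n1. A fault counts as detected when at least one test pattern makes the faulty output differ from the fault-free output.

```python
NETS = ("a", "b", "c", "n1", "y")


def circuit(a, b, c, fault=None):
    """Evaluate y = (a AND b) OR c with an optional stuck-at fault.

    `fault` is a (net, value) pair, e.g. ("n1", 0) for n1 stuck-at-0.
    """
    def force(net, value):
        # Override the net's value if the injected fault sits on this net.
        return fault[1] if fault and fault[0] == net else value

    a, b, c = force("a", a), force("b", b), force("c", c)
    n1 = force("n1", a & b)
    return force("y", n1 | c)


def fault_coverage(patterns):
    """Return (coverage, undetected faults) for a list of (a, b, c) patterns."""
    faults = [(net, v) for net in NETS for v in (0, 1)]
    detected = {
        f for f in faults for p in patterns
        if circuit(*p) != circuit(*p, fault=f)
    }
    undetected = sorted(set(faults) - detected)
    return len(detected) / len(faults), undetected


print(fault_coverage([(1, 1, 0), (0, 0, 1)]))
print(fault_coverage([(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]))
```

The two-pattern set only produces a fault-free output of 1, so it cannot expose any stuck-at-1 fault; adding two patterns whose expected output is 0 brings the coverage of this small circuit to 100%.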
another essential factor for a workable design. That is, it must be guaranteed. Therefore,
the synthesis tool gives priority to optimizing the delay of the design. Area
recovery is then performed to reduce the area while still meeting the timing specification.
A synthesis tool starts by analyzing the model and checking its conformance to
the tool's style requirements. This includes checking design rules, such as detecting
unconnected outputs, undriven inputs, and multiple drivers that lead to unresolved
signals. Undriven inputs and multiple drivers are errors that must be resolved, while
unconnected outputs are warnings that can be waived by designers. At this stage, the
tool uses a simple wire load model to estimate the average wire length and its loading,
since the actual layout and wiring have not yet been done.
Using an EDA tool for synthesis, an HDL description can usually be translated directly
into an equivalent hardware netlist for an ASIC or FPGA. Compared to
an FPGA, an ASIC requires a longer design cycle and is less flexible, but it suits
large-scale production, which reduces its unit price. From the high-level representation
of a circuit, the actual wiring and components can eventually be derived.
the abstract behavioral level. Gate-level simulation can also validate the timing
constraints and reveal mismatches between the RTL and the gate-level netlist or layout.
For example, when a signal is wrongly omitted from an always block for a combinational
circuit (for instance, from its sensitivity list), the simulation results of the RTL
design differ from those of the gate-level or layout design.
However, such an omission can easily be found by inspecting the synthesis report. Therefore,
in a modern design flow, a design with high confidence can skip the pre-layout
gate-level simulation to shorten the design cycle.
By contrast, the post-layout gate-level simulation must be performed as a
final verification of the functional and timing requirements of a physical design,
including confirmation of high-fanout networks, such as the clock and reset
nets. A modern design flow also adopts formal equivalence checking between the RTL
and gate-level designs, and timing verification is done mainly by the STA introduced
below to shorten the design cycle.
• Max time violation: when a signal arrives too late, i.e., too close to the next clock
edge, a setup time violation occurs.
• Min time violation: when an input signal changes too soon after a clock
edge, a hold time violation occurs.
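The two checks can be expressed as slack calculations, as in the following Python sketch. It is a simplified model under stated assumptions: the clock-to-output delay of the launching flip-flop, clock skew, and clock jitter are all ignored, a negative slack is treated as a violation, and the function and parameter names are hypothetical. All times are in the same unit (here, picoseconds).

```python
def check_timing(clock_period, max_path_delay, min_path_delay,
                 setup_time, hold_time):
    """Check setup and hold constraints for one register-to-register path.

    Simplifying assumptions: clock-to-Q delay, skew, and jitter are ignored.
    """
    # Setup (max time): the slowest signal must settle at least
    # `setup_time` before the capturing clock edge.
    setup_slack = clock_period - max_path_delay - setup_time
    # Hold (min time): the fastest signal must not change until
    # `hold_time` after the clock edge.
    hold_slack = min_path_delay - hold_time
    return {
        "setup_slack": setup_slack,
        "hold_slack": hold_slack,
        "setup_violation": setup_slack < 0,
        "hold_violation": hold_slack < 0,
    }


# Illustrative numbers in picoseconds (assumed, not taken from the text).
print(check_timing(1000, 800, 50, 100, 30))  # both constraints met
print(check_timing(850, 800, 50, 100, 30))   # clock too fast: setup violation
print(check_timing(1000, 800, 20, 100, 30))  # path too fast: hold violation
```

Note that lengthening the clock period repairs a setup violation but has no effect on a hold violation, since the hold slack does not depend on the clock period in this model.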
The computational efficiency of STA has resulted in its widespread use: its run time
is linear in the number of paths in a circuit. Therefore, the worst-case or best-case
delay of the combinational circuit over all possible input combinations can be
quickly identified using STA. Owing to process, voltage, and temperature (PVT)
variations, the propagation delay of a signal can vary. Three corners, namely the worst,
typical, and best corners, are commonly investigated. STA can account for the clock
skew during synthesis and identify the clock skew after layout. Clock jitter
can be considered as well.
Example 1.4. Considering gate delays of combinational circuits, and setup time and
hold time constraints of flip-flops, plot the timing diagram of the pipelined design
in Figure 1.15 and explain the relations between gate delay, timing constraints, and
clock period.
Figure 1.18: Timing diagram of the 3-stage pipelined design considering the logic
gate delay.
Solution: The timing diagram is displayed in Figure 1.18, where T , T1 , and T2 are
assumed to denote the clock period, delay of combinational circuit f1 (·), and delay
of combinational circuit f2 (·), respectively, and we assume that T2 > T1 . Compared
to Figure 1.16, where gate delays of combinational circuits and timing constraints
of flip-flops are ignored, the outputs in Figure 1.18 synchronized by flip-flops are
the same. Therefore, delays are commonly neglected for functional verification. A
pipelined design considering timing information can still give an output every clock
cycle. Besides, the throughput can be enhanced by the pipelining technique even
though the design has a latency of 3 cycles.
Assume that the setup time and hold time constraints are TS and TH , respectively.
To guarantee that the setup time constraints of all flip-flops are satisfied, since T2 >
T1 , the requirement of the setup time constraint is
TS < T − T2 .
Or, equivalently,
T > T2 + TS .
That is, the clock period must be larger than the max delay (or critical path delay) of
combinational circuits plus the setup time.
To guarantee that the hold time constraint of the 2nd flip-flop is satisfied, the
requirement of its hold time constraint is
T1 > TH .
That is, the delay of combinational circuit must be larger than the hold time.
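As a numerical illustration of the two requirements, the following sketch plugs in hypothetical values that are not taken from the text: T1 = 1 ns, T2 = 4 ns, TS = 0.5 ns, and TH = 0.2 ns.

```latex
\begin{align*}
T &> T_2 + T_S = 4\,\text{ns} + 0.5\,\text{ns} = 4.5\,\text{ns}
  \quad\Rightarrow\quad f < \frac{1}{4.5\,\text{ns}} \approx 222\,\text{MHz}, \\
T_1 &= 1\,\text{ns} > T_H = 0.2\,\text{ns}.
\end{align*}
```

Under these assumed numbers, the setup requirement limits the clock frequency, whereas the hold requirement is met regardless of the chosen clock period.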
The STA calculates the delays and verifies whether the setup time constraint is
satisfied under the user-specified clock period requirement. The hold time constraint can
also be verified by the STA. If a setup time violation happens in
a real chip, the clock period can be extended to solve the issue, although the
system performance drops accordingly. If a hold time violation occurs, however,
extending the clock period does not help: the requirement T1 > TH does not depend on T,
and the logic delay and timing constraint are fixed in a given operating condition.
It is therefore generally a fatal error that cannot be fixed after fabrication.
easier for floorplanning and packaging than rectangular ones. The first step, floorplanning,
decides the locations on the chip for each block in the partitioned design,
particularly for the hard macro cells. Intuitively, to reduce wiring congestion and
wire length, connected blocks should be placed closely, and blocks that sink/drive
external signals should be placed near the I/O pins of the chip. Pin assignment, in-
cluding the arrangement of power supply and ground pins, is therefore also affected
by the block location, and vice versa.
The second step, placement and detailed routing, decides the location of each
component, i.e., placement, and the routing channel for each interconnection wire,
i.e., detailed routing. Owing to the large amount of detail involved, this step is automated
by EDA tools, which consider critical paths, area minimization, and other signal
integrity issues. If no feasible solution is found, the floorplan might be adjusted or we might even go
back to the frontend, such as synthesis or even design exploration. Based on the physical
layout, detailed timing information can be produced for post-layout gate-level
simulation. Finally, the chip can be taped out to the foundry for fabrication.
Physical design for FPGAs consists of synthesis, floorplanning, placement, and
routing as well. It implements the design using the resources of a programmable chip
which was prefabricated. For FPGAs, the synthesis tool maps the design into logic
blocks, look-up tables, or input/output (I/O) blocks in an FPGA. By contrast, the
synthesis tool maps the design into cells in a technology library for ASICs. Given
a large amount of details involved, the physical design of FPGA is automatically
generated by the vendor’s EDA tools. Many design issues for FPGA implementation
are similar to those of ASICs. However, the problem is much more constrained for
the FPGA because its implementation is not customized like an ASIC. The same RTL
design implemented by an FPGA typically runs slower than ASIC implementation. If
the timing constraints of FPGA implementations are not satisfied, we might guide the
floorplanning, specify constraints on placement and routing, or use a larger FPGA to
reduce the resource utilization. Finally, a bit stream file is generated to program the
FPGA. Based on the netlist and timing information derived from the physical design
of an FPGA, we can also verify the timing constraints through post-layout gate-level
simulation.
whole system, including the RTL codes and embedded software, can be debugged
and designed with real data through the use of emulators.
The more wafers are integrated in the same package, the more complex
its structure becomes. The difficulty in finding individual bad wafers as well as the
compatibility and interconnection between devices can impact the reliability of IC
products. In addition, the cost issue and a shortened time-to-market cycle have urged
the evolution of conventional test methods. As such, the importance of the final test
has been reduced by the wafer level test.
Improving the controllability and observability of signals from the design point
of view, i.e., design for test, will also be enhanced by the concept of test for design.
Test for design emphasizes the collection of data generated by the test process, and
analysis and feedback of them to the design side to adjust design specifications. In
the future, design, fabrication, packaging, and testing will no longer be a step-by-step
procedure, but a continuously optimized loop.
PROBLEMS
1. Assume that four accidents (A, B, C, D) might occur, where logic 1 and logic 0
denote that an accident occurs and does not occur, respectively. The alarm is activated
when (1) more than three accidents occur or (2) the fourth accident D comes
together with other accidents. Please design and optimize the Boolean equation
of output alarm.
2. A seven-segment decoder that can display the binary-coded decimal (BCD) is a
combinational circuit with a binary-coded decimal digit as its input and seven
outputs a, b, c, d, e, f , and g for displaying the decimal digit. The seven out-
puts of the decoder select and light up the corresponding segments in the display
as shown in Figure 1.19. Design the BCD-to-seven-segment decoder circuit and
derive the Boolean equations for the outputs a, b, c, d, e, f , and g.
3. Digital circuits are easier to fabricate using NAND and NOR gates than AND
and OR gates, which require NAND and NOR gates together with NOT gates.
Convert the function
Y = A · B + C · D (1.4)
DOI: 10.1201/9781003187196-2
Keywords, which are marked in boldface in this chapter, are reserved identi-
fiers by Verilog that a user cannot use, such as module, endmodule, input, output,
inout, etc. Keywords define the language constructs and all are in lower case. Verilog
keywords are listed in Table 2.1.
/* Multiple lines
   containing comments */
You can use white space to enhance the readability and the organization of codes.
The Verilog language ignores these characters.
the RTL and transistor level increases detailed circuit characteristics and complexity,
and going up to the system-level concept tends toward a more abstract behavioral model.
Figure 2.2: (a) Partition A: an embedded system. (b) Partition B: a system consisting
of a number of processing steps.
Ports are I/O signals or interconnections of a module. Module ports are equivalent
to pins in hardware. To connect to other modules, all modules have ports except the
testbench. The testbench, which will be introduced later, is the top module used to
test the device under test and it will not be instantiated in another module. A module
hides details in it and enables reusability by instantiation. You can create a larger
system or component by listing instances of other modules and connecting those
instances by their ports. This allows designers to modify internal functionality of
a module without affecting the rest of the design provided that the I/O ports are not
changed. Instantiating a module is not the same as calling a subroutine. Each instance
is a complete, independent, and concurrently active physical realization of the module.
// Module definition
module module_name (port_name);
  port declaration
A module needs a module (or design) name and is defined between the keyword
pair module and endmodule. The module name is an identifier chosen by the designer.
Modules communicate with the outside world through ports. All port names are listed
in parentheses. A port is declared with one of the keywords input, output, or inout. The
inout port declaration is used to model a bidirectional port. The basic data types in Verilog,
which will be introduced later, are wire and reg: wire is a net used to
connect components, and reg is an “abstract” data storage element. The design of
the circuit is described in the functionality description area.
To illustrate how to use a module to describe a circuit, we design a 2-to-1 multiplexer.
The multiplexer is basically a selector. Its truth table is given in Table
2.2.
endmodule
Verilog allows a multi-bit wide (or 1-D array) declaration, which is called a vector
or bus, for both port and data type declarations. While the 2-D array declaration is
permitted in data type declaration, it is not allowed for the port declaration.
Verilog primitives are basic logic elements (or gates), which are synthesizable.
They provide a structural description used for low-level RTL (or gate-level) design. A
structural model in Verilog represents a schematic that is created using existing
components. Since Verilog primitives are synthesizable, the actual realization of the
multiplexer is determined and optimized by the synthesis tool. You can create a larger
system by listing instances (instantiation) of other modules or primitives, and
connecting those instances by their ports. There are one not, two and, and one or
instances in this module. The first port on the port list of a Verilog primitive is its
output port. In this example, the ports of Verilog primitives are connected in order
(or by position association). For example, the wires sel_inv and sel connect to the
output and input ports of the not gate, respectively.
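A structural 2-to-1 multiplexer of this kind might look like the sketch below. The port names mux_out, mux_in, and sel and the wire sel_inv follow the text; the gate instance names g0 to g3 and the internal nets a0 and a1 are assumptions made for illustration, so the listing is not necessarily identical to the book's.

```verilog
// 2-to-1 multiplexer built from gate primitives (structural sketch)
module mux2to1 (mux_out, mux_in, sel);
  output       mux_out;
  input  [1:0] mux_in;
  input        sel;
  wire         sel_inv, a0, a1;     // a0 and a1 are illustrative net names

  // For every primitive, the first port is the output
  not g0 (sel_inv, sel);            // sel_inv = ~sel
  and g1 (a0, mux_in[0], sel_inv);  // passes mux_in[0] when sel = 0
  and g2 (a1, mux_in[1], sel);      // passes mux_in[1] when sel = 1
  or  g3 (mux_out, a0, a1);         // merges the two gated inputs
endmodule
```

Because the ports are connected by position, swapping two arguments of a primitive would silently change the function, which is why the output-first ordering matters.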
To build up a 4-to-1 multiplexer based on the 2-to-1 multiplexer through the
bottom-up design, we introduce its function table first. When sel[1 : 0] is “00” (i.e.,
sel[0] is 0 and sel[1] is 0), mux_in[0] is selected as the output mux_out; when
sel[1 : 0] is “01” (i.e., sel[0] is 1 and sel[1] is 0), mux_in[1] is selected as the output
mux_out, and so on.
The circuit symbol of the 4-to-1 multiplexer is presented in Figure 2.6(a). Its hierarchical
modeling is realized by the bottom-up design strategy using three 2-to-1 multiplexers
named u0, u1, and u2, as shown in Figure 2.6(b). The same instance name
(and signal name) can be used in different hierarchies of a design.
The Verilog codes are written below. The 4-to-1 multiplexer has three submodules of the
2-to-1 multiplexer, which demonstrates the hierarchical design. The module mux2to1 is
instantiated three times as u0, u1, and u2. Notice that instance names (u0, u1, and u2) in
different hierarchies can be the same. When sel[1:0] is 2'b00, which denotes that the 2
bits in binary format are 00, i.e., sel[0] is 0 and sel[1] is 0, mux_in[0] is selected;
when sel[1:0] is 2'b01, i.e., sel[0] is 1 and sel[1] is 0, mux_in[1] is selected, and
so on. Notice that the wires mux1_out and mux2_out are concatenated using the {}
operator to form a 2-bit wire connecting to the input port, mux_in, of u2.
There are two kinds of port mapping: by name or by position (or in order). In the
above example, the ports of the design, mux2to1, are connected by name association.
It is good practice to adopt port mapping by name, so that you need not worry
about the actual port positions (which may change in different design versions) in the
instantiated modules. Otherwise, functional errors may occur when ports are wrongly
connected. Unfortunately, these errors might not be pointed out by a simulator unless
a test pattern fails.
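The hierarchical 4-to-1 multiplexer with port mapping by name can be sketched as follows. The names mux2to1, u0, u1, u2, mux1_out, mux2_out, mux_in, mux_out, and sel come from the text; the assignment of inputs to u0 and u1 and the concatenation order {mux2_out, mux1_out} are assumptions chosen to match the function table. A behavioral mux2to1 is repeated only so that the sketch compiles on its own.

```verilog
// Leaf cell: behavioral 2-to-1 multiplexer (stand-in so the sketch is self-contained)
module mux2to1 (mux_out, mux_in, sel);
  output       mux_out;
  input  [1:0] mux_in;
  input        sel;
  assign mux_out = sel ? mux_in[1] : mux_in[0];
endmodule

// 4-to-1 multiplexer assembled bottom-up from three mux2to1 instances
module mux4to1 (mux_out, mux_in, sel);
  output       mux_out;
  input  [3:0] mux_in;
  input  [1:0] sel;
  wire         mux1_out, mux2_out;

  // First level: sel[0] picks within each pair of inputs
  mux2to1 u0 (.mux_out(mux1_out), .mux_in(mux_in[1:0]), .sel(sel[0]));
  mux2to1 u1 (.mux_out(mux2_out), .mux_in(mux_in[3:2]), .sel(sel[0]));
  // Second level: sel[1] picks between the pairs; {} builds the 2-bit input
  mux2to1 u2 (.mux_out(mux_out), .mux_in({mux2_out, mux1_out}), .sel(sel[1]));
endmodule
```

With named association, the three connections of each instance can be listed in any order, so a later reordering of the mux2to1 port list would not break mux4to1.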
A module written in RTL is usually saved in a Verilog file with the filename extension
“.v”. Though a file can contain more than one module definition,
it is strongly recommended that a file contain a single module definition and that the
filename be the same as the module name in it. Doing so makes the management of
your designs easier.
an unknown X may be neither logic 0 nor logic 1 depending on the driving strength
in physical circuits or the convergence time owing to timing violation. Either way,
an uncertain logic value will most likely cause the circuit to fail and should draw the
engineer's attention so that it is resolved at an early design stage. Unknown logic is displayed
in red color on a waveform.
In Verilog, numbers are integer or real constants. Integer constants are expressed
as
<size>'<base><value>
Figure 2.8: Examples of (a) sized and (b) unsized number representations.
Notice that, in Figure 2.8, an underscore can be used to separate the digits of a number
for readability, and it is ignored by Verilog. More examples are presented here: the
most significant bits (MSBs) of the number 6'hCA are truncated, leaving the binary string
001010 instead of the original number represented by the binary string 11001010;
6'hA becomes 001010, which is filled (or padded) with two 0's on the MSBs; 16'bz
becomes zzzzzzzzzzzzzzzz, filled with 15 z's on the MSBs. To represent a negative
number, the negative sign should be put before the <size>. For example, to express
-3 in decimal, we should use -8'd3 instead of 8'd-3. Finally, a real number can be
represented in decimal or scientific format.
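The truncation and padding rules above can be checked with a small simulation-only sketch. The module name number_literals and the variable names are hypothetical; some tools may issue a truncation warning for 6'hCA.

```verilog
// Simulation-only demonstration of sized number literals
module number_literals;
  reg [5:0]  a, b;
  reg [15:0] c;
  reg [7:0]  d;
  initial begin
    a = 6'hCA;   // 1100_1010 truncated to its 6 LSBs
    b = 6'hA;    // 1010 padded with two 0's on the MSBs
    c = 16'bz;   // a single z extended to all 16 bits
    d = -8'd3;   // negative sign placed before the size
    $display("a = %b", a);   // a = 001010
    $display("b = %b", b);   // b = 001010
    $display("c = %b", c);   // c = zzzzzzzzzzzzzzzz
    $display("d = %b", d);   // d = 11111101
  end
endmodule
```

The last line shows that -8'd3 is stored as the 8-bit two's complement of 3.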
Solution: There are 7 × 24 × 60 × 60 = 604800 seconds per week. So, the largest
number we need to represent is
$$\frac{604800\ \text{sec}}{10^{-9}\ \text{sec}} = 6.048 \times 10^{14}.$$
Therefore, the required number of bits is
$$\lceil \log_2 (6.048 \times 10^{14}) \rceil = 50.$$
2.4.1 NETS
Nets are connections between structural components, as shown in Figure 2.9.
Figure 2.10: Truth tables of y driven by a and b, when y is declared as: (a) wire/tri,
(b) wand/triand, and (c) wor/trior.
2.4.2 REGISTERS
There are four register types: reg, integer, real, and time, as displayed in Table 2.5.
Notice that integer is most commonly used for the index of a for loop (introduced
later), while real and time are non-synthesizable.
Registers represent abstract data storage elements. A register holds its value until a new
value is assigned to it. Registers are used extensively in behavioral modeling and
in applying stimuli. We assign values to registers from procedural blocks. The default
initial value for a register is “X”. Verilog allows you to assign the value of a reg data
type only within an initial block, an always block, or a function. In Verilog, a time
variable is a 64-bit variable used to represent the simulation time, typically used with
the $time system task, which will be introduced later.
Since reg is the most commonly used register type in Verilog, below are some declarations for it. Though
the 2-D array declaration is permitted in data type declaration, it is not allowed for
the port declaration.
Originally, you could access an individual register (word) of a 2-D memory, but you could not
access individual bits of that register directly. To do that, you had to read a word from the
2-D memory first. Then, you could access the desired bits in that word. For example,
Since Verilog-2001, we can access individual bits of a 2-D memory directly. Besides, a
high-dimensional array with variable addresses is synthesizable as shown below.
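Both access styles can be compared in the following simulation-only sketch. The module name mem_access, the memory size, and the stored value are assumptions for illustration.

```verilog
// Two ways to reach one bit of a 2-D memory
module mem_access;
  reg [7:0] mem [0:3];   // 2-D memory: four 8-bit words
  reg [7:0] word;
  reg       msb_two_step, msb_direct;
  initial begin
    mem[2] = 8'b1010_0101;
    // Original style: read the word first, then select the bit
    word         = mem[2];
    msb_two_step = word[7];
    // Verilog-2001 style: select the bit of the memory word directly
    msb_direct   = mem[2][7];
    $display("msb_two_step = %b, msb_direct = %b", msb_two_step, msb_direct);
  end
endmodule
```

Both variables end up holding the MSB of word 2, which is 1 for the value chosen here.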
2.4.3 PARAMETERS
You can use a parameter anywhere that you would otherwise use a literal (constant). For
example, the following module demonstrates a parameterized design.
// Parameterized design
module param_test (port_name);
  parameter m1 = 8;   // Parameter declaration
  wire [m1:0] w1;     // Wire with m1+1 bits
  ...
endmodule
Parameters are local, known only to the module in which they are defined, and they
can be signed and sized. If you need a global parameter, you can use the `define
compiler directive.
You can define a parameter, m3, by other parameters, m1 and m2, as follows.
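A minimal sketch of a derived parameter is given below. The relation m3 = m1 + m2, the values of m1 and m2, and the module name param_expr are assumptions; the text only states that m3 is defined by m1 and m2.

```verilog
// A parameter defined in terms of other parameters
module param_expr;
  parameter m1 = 8;
  parameter m2 = 4;
  parameter m3 = m1 + m2;   // assumed relation; m3 tracks m1 and m2
  wire [m3-1:0] w;          // 12-bit wire for the default values
  initial
    $display("m1 = %0d, m2 = %0d, m3 = %0d", m1, m2, m3);  // m1 = 8, m2 = 4, m3 = 12
endmodule
```

If m1 or m2 is changed, m3 and the width of w follow automatically, which is the main benefit of deriving one parameter from others.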
Finally, before leaving this section, we exemplify some common mistakes of dec-
laration below.
input in;
// blocks as reg.
wire op2;  // Wrong! Declare variables in procedural
           // blocks as reg.
wire out;  // Correct! Declare variables in continuous
           // assignments as wire.
adder add0 (.s(sum), .c(cout), .a(op1), .b(op2));
...
endmodule
input a, b;
...
endmodule
In addition to using gates and interconnect nets, you can model a combinational
circuit with continuous assignments. A continuous assignment is used to represent
a combinational logic circuit that can be conveniently described by an equation
or Boolean equation. During simulations, continuous assignments execute whenever
the expressions on their right-hand sides change. As its name implies, the execution is
immediate and its effect is that the output, y, on the left-hand side of the expression is
updated promptly once inputs, a and/or b, change. The y on the left-hand side of the
continuous assignment is the output of the combinational circuit, and hence, it must
be declared as wire. The variables, a and b, are inputs of the combinational circuit.
Because the assignment is continuous, the output updates simultaneously with any
input change; therefore, combinational logic is implied.
Alternatively, the continuous assignment can be implicitly put into the declaration
for shorthand as follows.
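The two forms can be sketched side by side as below. The AND function, the module names, and the internal net t are assumptions for illustration; the text only fixes the names y, a, and b.

```verilog
// Explicit form: declare the net, then drive it with assign
module and_explicit (y, a, b);
  output y;
  input  a, b;
  wire   y;
  assign y = a & b;
endmodule

// Implicit form: the assignment is folded into the wire declaration
module and_implicit (y, a, b);
  output y;
  input  a, b;
  wire   t = a & b;   // t is an illustrative internal net
  assign y = t;
endmodule
```

Both modules describe the same combinational logic; the implicit form simply saves one statement per net.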
• Procedural assignment statements to describe the data flow within the block.
• High-level constructs, such as loops and conditional statements, to describe
the functional operation of the block.
• Timing controls to control the execution of the block and the statements in
it.
  b = a;
  c = b;
end
where “=” is the blocking assignment. The assignments are performed sequentially.
Eventually, c will have the same value as a. Notably, those variables on the left side
of an assignment in procedural blocks should be declared as reg, representing abstract
data storage elements. The initial block is non-synthesizable and is only used in testbenches.
If you write codes like this.
initial
  a = 0;
initial
  a = 1;
initial
  a = 0;
initial
  #0 a = 1;
The pound sign character, #, is used for delay control, and denotes the delay
specification for procedural statements and gate instances, but not for module instances.
Adding zero delay control ensures that the statement is placed last
in the event queue of Verilog and is therefore executed last. If you write code like this, a will eventually be 1.
However, “#0” is a bad coding style and is discouraged.
The functional verification is used to ensure that the design performs the required
operations correctly. To do this, we can develop a testbench model that generates input
signals to the DUT and checks its outputs, as shown below. You can put all Verilog
constructs into a testbench whether they are synthesizable or not. The simplest way
to produce the input signal, input_a, is using the initial block, which specifies input
signals sequentially after the reset activation (rst_n from 0 to 1) and the edges of a
clock signal, clk. The wait statement halts the execution until its argument becomes
true.
// A simple testbench
module testbench;
  reg sel;
  reg [1:0] mux_in;
The DUT, u0, is an instance of the Verilog module, mux2to1, that describes the whole
design. The testbench is also a Verilog model that, owing to its purpose, has no
input and output signals. The testbench is not intended to describe the circuit that we
are developing. Rather, its purpose is to apply a sequence of signal values, called test
cases or test patterns, to the inputs of the DUT, and to monitor its outputs (by the $monitor
system task) to ensure that correct output signals are produced. In the testbench,
the $monitor system task displays the values of the monitored signals, in the binary format
indicated by %b, whenever they change. The $stime system task returns the current
time as a 32-bit unsigned integer value to report the time at which the event occurred. A
simulator executes the DUT and testbench models that assign values to nets and
variables indicated by the testbench and DUT with a notion of the progress of time,
as presented in Figure 2.12.
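A testbench along these lines might be sketched as follows. The names testbench, u0, sel, mux_in, and mux2to1 follow the text; the wire mux_out, the stimulus values, the delays, and the behavioral mux2to1 stand-in (included only so that the sketch compiles on its own) are assumptions.

```verilog
`timescale 1ns/1ps
// Behavioral stand-in for the DUT so that the sketch is self-contained
module mux2to1 (mux_out, mux_in, sel);
  output       mux_out;
  input  [1:0] mux_in;
  input        sel;
  assign mux_out = sel ? mux_in[1] : mux_in[0];
endmodule

module testbench;              // no ports: this is the top module
  reg        sel;
  reg  [1:0] mux_in;
  wire       mux_out;

  // Device under test
  mux2to1 u0 (.mux_out(mux_out), .mux_in(mux_in), .sel(sel));

  // Print a line whenever any monitored signal changes
  initial
    $monitor("t=%0d sel=%b mux_in=%b mux_out=%b", $stime, sel, mux_in, mux_out);

  // Test patterns applied sequentially (values are illustrative)
  initial begin
    sel = 0; mux_in = 2'b01;
    #10 sel = 1;
    #10 mux_in = 2'b10;
    #10 sel = 0;
    #10 $finish;
  end
endmodule
```

A more thorough testbench would also compare mux_out against an expected value instead of relying on visual inspection of the $monitor output.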
The sensitivity list controls the way an always block executes. To describe a com-
binational circuit, the sensitivity list consists of one or more signals. When at least
one of these signals changes, the always block executes through to the end keyword.
Then, it can be triggered again when the condition meets what you specified in the
sensitivity list. For example, a combinational circuit output, x, has three inputs, a, b,
and c. When at least one of a, b, or c changes, the output x must respond to it. That’s
our sensitivity list. The statements inside the always block describe the functionality
of the design.
Completely specify the sensitivity list to avoid mismatches between RTL and
gate-level netlists, as shown below. If the sensitivity list is incomplete without signal
c, the synthesized netlist will still be a 3-input OR gate but the result of gate-level
simulation will differ from that of its RTL model.
always @(a or b or c)
  x = a | b | c;
Notably, those variables on the left side of an assignment in procedural blocks should
be declared as reg representing abstract data storage elements. An always block is
evaluated whenever the sensitivity list is triggered. In this example, x is recalculated
as soon as any input (a or b or c) has a level transition (0 to 1 or 1 to 0). This behavior
equals the functionality of a combinational circuit. So, x will be synthesized to be the
output of a combinational circuit.
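The RTL simulation mismatch caused by an incomplete sensitivity list can be observed with the following simulation-only sketch. The module name sens_list_demo and the signal names x_incomplete and x_complete are hypothetical; the first always block deliberately omits c.

```verilog
// Incomplete versus complete sensitivity list in RTL simulation
module sens_list_demo;
  reg a, b, c;
  reg x_incomplete, x_complete;

  always @(a or b)          // c is missing: a change on c alone is ignored
    x_incomplete = a | b | c;

  always @(a or b or c)     // complete list: behaves like a 3-input OR gate
    x_complete = a | b | c;

  initial begin
    #1 a = 0; b = 0; c = 0; // both blocks evaluate to 0
    #1 c = 1;               // only the complete block reacts
    #1 $display("x_incomplete = %b, x_complete = %b", x_incomplete, x_complete);
  end
endmodule
```

In this run x_incomplete stays at 0 while x_complete becomes 1, whereas a synthesized 3-input OR gate would output 1 in both cases.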
Verilog-2001 introduced additional syntax for describing sensitivity lists.
Some synthesis tools consider an incomplete sensitivity list illegal. You can use @(*)
(or @*) to represent all inputs in Verilog-2001. That is, all variables on the right-hand side
of all expressions in an always block are treated as signals in the sensitivity list. For
example, the above example can be rewritten as
  x = a | b | c;
end
The always block can be used to describe a complex expression that needs multi-
ple assignments. For example,
  y1 = a + b;
  y2 = c * d + e;
  y = y1 * y2;
end
Sequential circuits are storage units. Rising edge (or positive edge) and falling edge
(or negative edge) are represented by posedge and negedge, respectively, as shown
in Figure 2.14. It must be emphasized that the posedge and negedge sensitivity list
of sequential circuits cannot be replaced with @(*).
For example,
  x <= a + b;
Notably, the non-blocking assignment “<=” will be introduced in the next chapter.
For the time being, you can treat it as the blocking assignment “=”. In this example, x
is recalculated as soon as clk changes from 0 to 1 (or at every positive edge of signal
clk). This behavior equals the functionality of a hardware storage element. So, x will
be synthesized to a true hardware storage element. Using both posedge and negedge
in the same sensitivity list, as shown below, is not permitted for synthesizable codes.
  x <= a + b;
If you need this behavior, the same functionality can be achieved by using a clock, clk2x,
with double the frequency of clk and rewriting the codes as follows.
  x <= a + b;
  if (b == 3'd4) a <= a + b;
  if (!rst_n) b <= 3'd0;
  else b <= b + 1;
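The edge-triggered statements above can be placed in a complete module as sketched below. The module name edge_demo, the port list, the synchronous reset, and the reset of a are assumptions, since the surrounding listing is not fully shown; only the statements on b and a are taken from the fragments.

```verilog
// Two registers updated at the positive edge of clk (synchronous, active-low reset assumed)
module edge_demo (clk, rst_n, a, b);
  input        clk, rst_n;
  output [2:0] a, b;
  reg    [2:0] a, b;

  // b: free-running 3-bit counter
  always @(posedge clk)
    if (!rst_n) b <= 3'd0;
    else        b <= b + 1;

  // a: accumulates b whenever b equals 4
  always @(posedge clk)
    if (!rst_n)         a <= 3'd0;
    else if (b == 3'd4) a <= a + b;
endmodule
```

Since both blocks are sensitive only to posedge clk, each variable is synthesized to flip-flops whose inputs are sampled once per clock cycle.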
For example, each of the following two always blocks contains a for loop in it.
The index i can be used for both for loops by the named blocks.
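A minimal sketch of two named blocks that each declare their own index i is shown below. The module name named_block_demo, the block names reverse_bits and invert_bits, and the loop bodies are assumptions for illustration.

```verilog
// Each named block owns a local integer i, so the two loops do not interfere
module named_block_demo (a, y1, y2);
  input  [3:0] a;
  output [3:0] y1, y2;
  reg    [3:0] y1, y2;

  always @(a) begin : reverse_bits
    integer i;                      // local to reverse_bits
    for (i = 0; i < 4; i = i + 1)
      y1[i] = a[3 - i];             // bit-reversed copy of a
  end

  always @(a) begin : invert_bits
    integer i;                      // local to invert_bits
    for (i = 0; i < 4; i = i + 1)
      y2[i] = ~a[i];                // bit-wise inverted copy of a
  end
endmodule
```

Without the block names, a single module-level integer i would be shared by both always blocks, which can lead to interference when both are triggered by the same event.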
always @(...)
input [7:0] a, b;
// Array of instantiations
endmodule
To instantiate them quickly, Verilog supports an array of instances, declared with a
range much like a vector, as follows.
input [7:0] a, b;
// Instantiation using vectorized instance name
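A vectorized instance name can be sketched as below. The module name and8, the output y, and the instance name u are assumptions; only the 8-bit inputs a and b appear in the fragment.

```verilog
// Eight and gates created by one statement with a vectorized instance name
module and8 (y, a, b);
  output [7:0] y;
  input  [7:0] a, b;
  // Instances u[7] ... u[0]; bit i of y, a, and b connects to instance u[i]
  and u [7:0] (y, a, b);
endmodule
```

Because each port expression is as wide as the instance range, the simulator and synthesis tool distribute one bit to every instance.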
Verilog has four types of conditional primitives: bufif0, bufif1, notif0, and notif1,
as shown in Figure 2.16. They all have three pins: output y, input x, and enable e.
For example, bufif0 is a conditional buffer. It is enabled so that output y is driven
by input x when enable pin, e, is 0. Otherwise, y has high impedance because it is
isolated (or disconnected) from x and has no drivers.
Figure 2.16: Conditional buffers: (a) bufif0, (b) bufif1, (c) notif0, and (d) notif1.
2.8 EXPRESSION
An expression comprises operators and operands. For example, the expression
w = x+y−z (2.2)
x + y = 2^n. (2.3)
For example, the 2's complement number y of the 3-bit binary number $x = 001_2$ is
$y = 111_2$ because $001_2\ (1_{10}) + 111_2\ (7_{10}) = 2^3$ $(1000_2 = 8_{10})$. Alternatively, you can
obtain the 2's complement number of x by “inverting every bit of x and then adding
1 to the result”. For example, after inverting every bit of x, you get $110_2$. Then, adding
1 to the result, you get $111_2$ as well.
The two's-complement number system encodes the 2's complement of x as its negative
number in a binary number representation. Therefore, in a two's-complement
number system, the 2's complement number $y = 111_2$ of $x = 001_2$ is treated as $-1_{10}$
instead of $+7_{10}$.
From yet another viewpoint, an n-bit unsigned binary number, $x = \{x_{n-1}, x_{n-2}, \ldots, x_0\}$,
is represented as a weighted sum of powers of two by
$$x = \sum_{i=0}^{n-1} x_i 2^i. \qquad (2.4)$$
When $x_{n-1} = x_{n-2} = \ldots = x_0 = 0$, the smallest number that can be
represented is 0. Likewise, when $x_{n-1} = x_{n-2} = \ldots = x_0 = 1$, the largest positive number
is $2^n - 1$. Hence, the dynamic range of an n-bit unsigned binary number is
$$0 \le x \le 2^n - 1. \qquad (2.5)$$
In a two's-complement signed number, the MSB $x_{n-1}$ carries the weight $-2^{n-1}$.
If $x_{n-1}$ is 1, the number represented is negative, since the sum of all the powers of 2
with positive weights is less than $2^{n-1}$. Likewise, if $x_{n-1}$ is 0, the represented number
is positive. Thus, $x_{n-1}$ is well known to serve as a sign bit.
Example 2.2. What values are represented by the 8-bit unsigned
numbers 00100101 and 10110001?
Solution: The first number is
$$1 \times 2^5 + 1 \times 2^2 + 1 \times 2^0 = 37.$$
The second number is
$$1 \times 2^7 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^0 = 177.$$
Example 2.3. What values are represented by the 8-bit 2's complement signed
numbers 00100101 and 10110001 in the previous example?
Solution: The first number is
$$1 \times 2^5 + 1 \times 2^2 + 1 \times 2^0 = 37.$$
The second number is
$$-1 \times 2^7 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^0 = -79.$$
From the above examples, the same binary number may be interpreted differently by
different number representations. Therefore, it is important that one should know the
number representations of all variables when they are described using Verilog RTL
models.
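The two interpretations of the bit pattern 10110001 can be reproduced in simulation with the sketch below. The module and variable names are hypothetical, and the signed keyword assumes a Verilog-2001 compliant tool.

```verilog
// Same 8-bit pattern, read as unsigned and as two's-complement signed
module signed_vs_unsigned;
  reg        [7:0] u;   // unsigned view
  reg signed [7:0] s;   // two's-complement view (Verilog-2001)
  initial begin
    u = 8'b1011_0001;
    s = 8'b1011_0001;
    $display("unsigned: %0d, signed: %0d", u, s);  // unsigned: 177, signed: -79
  end
endmodule
```

The stored bits are identical; only the declared type decides how the value is interpreted in expressions and when printed.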
2.8.2 OPERAND
There are four data objects that can form the operands of an expression, as shown in
Table 2.7.
output [4:0] y1, y2;
input [1:0] sel1;
input sel2;
reg [4:0] y1, y2;
identifiers */
always @(sel1)
The following example illustrates index and slice operands. Index operand spec-
ifies a single element of a vector and slice operand specifies a segment of elements
within a vector.
input [2:0] a;
reg [3:0] y;
always @(a) begin
endmodule
The function is delimited by the keywords function and endfunction. The fol-
lowing example illustrates function call operands. Function calls, which must reside
in an expression, are operands. The single value returned from a function is the
operand value used in the expression.
input [2:0] a, b, c;
reg [3:0] y;
always @(a or b or c)
input [2:0] i1, i2;
add_func = i1 + i2;
endfunction
endmodule
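A complete module using a function call as an operand might look like the sketch below. The names add_func, i1, i2, a, b, c, and y follow the fragments; the module name func_call, the 4-bit return range, and the expression in the always block are assumptions, because those lines are not shown.

```verilog
// Function call used as an operand inside an expression
module func_call (y, a, b, c);
  output [3:0] y;
  input  [2:0] a, b, c;
  reg    [3:0] y;

  always @(a or b or c)
    y = add_func(a, b) + c;   // assumed use: the returned value is an operand of "+"

  function [3:0] add_func;    // assumed 4-bit return value
    input [2:0] i1, i2;
    add_func = i1 + i2;
  endfunction
endmodule
```

Note that y has only 4 bits, so the sum of three 3-bit operands is truncated when it exceeds 15.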
conditions on the results: 1) a carry out is generated. In this condition, the result is
positive. Dropping the carry out, the remaining part is the final result; 2) a carry out is not
generated. In this condition, the result is negative. The remaining part is the final result.
For example, $2_{10} - 1_{10} = 1_{10}$. In binary representation, $010_2 + 111_2 = 1001_2$. A carry
out is generated. So, the final result is $001_2 = 1_{10}$. Another case, $1_{10} - 3_{10} = -2_{10}$.
In binary representation, $001_2 + 101_2 = 110_2$. A carry out is not generated. So, the
final result is $110_2 = -2_{10}$.
2.8.3 OPERATORS
Operators, shown in Table 2.8, perform an operation on one or more operands within
an expression. An expression combines operands with appropriate operators to pro-
duce the desired function expression.
binary XNOR                ~^ or ^~
Shift Operators
logical shift left         <<
logical shift right        >>
arithmetic shift left      <<<
arithmetic shift right     >>>
Concatenation & Replication Operators
concatenation              {}
replication                {{}}
Reduction Operators
AND                        &
OR                         |
NAND                       ~&
NOR                        ~|
XOR                        ^
XNOR                       ~^ or ^~
Conditional Operators
conditional                ?:
output [3:0] y1, y2;
output [5:0] y3;   // 6 bits, matching the reg declaration of y3
output [2:0] y4, y5;
input [2:0] a, b;
reg [3:0] y1, y2;
reg [5:0] y3;
reg [2:0] y4, y5;
always @(a or b) begin
  y1 = a + b;  // Synthesizable
  y2 = a - b;  // Synthesizable
  y3 = a * b;  // Synthesizable
  y4 = a / b;  // Synthesizable
  y5 = a % b;  // Synthesizable
end
endmodule
To prevent overflow, the result of an addition or subtraction requires one more bit than
the operands. This can be proved using the dynamic ranges of the n-bit operands
and the (n + 1)-bit result: the sum of two n-bit unsigned operands is at most
$2(2^n - 1) = 2^{n+1} - 2$, which lies within the dynamic range $2^{n+1} - 1$ of an (n + 1)-bit result.
Similarly, the product of two n-bit operands should have 2n bits to prevent over-
flow. The complexity of a multiplier is much larger than that of an adder because the
multiplication needs several adders. As shown in Figure 2.17, the multiplication of
two 4-bit operands, M and Q, requires three 4-bit adders, and the product P has 8 bits.
One vacant bit exists in the MSB of the first adder. It is padded with 0 because of the
unsigned multiplication.
In the above module, if y1 is declared as [5:0], two more bits than necessary, the
operands a and b are zero-extended so that the result carries 2'b00 in its two left-most MSBs.
On the contrary, if y1 is declared as [1:0], two fewer bits than necessary, the 4-bit
result of a + b will be truncated (or discarded) on the left-most two MSBs.
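The padding and truncation of a + b can be observed with the following simulation-only sketch. The module name width_demo, the variable names, and the operand values 7 and 6 are assumptions for illustration.

```verilog
// The same sum assigned to targets of different widths
module width_demo;
  reg [2:0] a, b;
  reg [5:0] y_wide;     // two more bits than necessary
  reg [3:0] y_exact;    // exactly one bit more than the operands
  reg [1:0] y_narrow;   // two fewer bits than necessary
  initial begin
    a = 3'd7; b = 3'd6; // 7 + 6 = 13 = 1101 in binary
    y_wide   = a + b;
    y_exact  = a + b;
    y_narrow = a + b;
    $display("y_wide = %b",   y_wide);    // y_wide = 001101
    $display("y_exact = %b",  y_exact);   // y_exact = 1101
    $display("y_narrow = %b", y_narrow);  // y_narrow = 01
  end
endmodule
```

Only y_exact and y_wide hold the correct sum; y_narrow silently loses the two MSBs, and such width mismatches are typically reported only as warnings, if at all.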
Verilog-2001 introduced the power operator, **. The expression x ** y, for example, means x to
the power of y. The most common use of the power operator is 2 to the power
of N, i.e., 2^N. The power operator is non-synthesizable except for x = 2 or y = 2.
When x = 2, it represents the shift operator introduced later. When y = 2, it equals
a multiplication.
output [2:0] y2;
input [2:0] a, b;
reg [3:0] y1;
reg [2:0] y2;
always @(a or b)
  y1 = a + -b;  // The same as a - b
always @(a)
  y2 = -a;      // Negate a
endmodule
input [2:0] a;
input [2:0] b;
reg [3:0] y;
always @(a or b) begin
  y[0] = a > b;
  y[1] = a >= b;
  y[2] = a < b;
  y[3] = a <= b;
end
endmodule
input [2:0] a;
input [2:0] b;
reg [3:0] y;
always @(a or b) begin
  y[0] = a != b;
  y[1] = a == b;
  y[2] = a !== 3'b1X0;  // Non-synthesizable
  y[3] = b === 3'bZZZ;  // Non-synthesizable
end
endmodule
Notably, “===” and “!==” can compare the high-impedance value Z and the unknown
value X, but they are applicable only to simulation. Digital hardware cannot be built
to operate on unknown or high-impedance values. Thus, the comparisons
“a!==3'b1X0” and “b===3'bZZZ” are non-synthesizable.
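The difference between logical equality and case equality can be observed with a small simulation-only sketch. The module name eq_demo is hypothetical and the operand value is chosen only for illustration.

```verilog
// Simulation-only sketch; not intended for synthesis.
module eq_demo;
  reg [2:0] a;
  initial begin
    a = 3'b1x0;
    // Logical equality: an X bit in an operand makes the result unknown.
    $display("a ==  3'b1x0 gives %b", a == 3'b1x0);
    // Case equality: X and Z are compared literally, so the result is 0 or 1.
    $display("a === 3'b1x0 gives %b", a === 3'b1x0);
    $display("a !== 3'b1x0 gives %b", a !== 3'b1x0);
  end
endmodule
```

In simulation, the first comparison evaluates to x, while the second and third evaluate to 1 and 0, respectively.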
4 input sel ;
5 input [2 : 0] a ;
6 reg [1 : 0] y ;
7 always @ ( sel or a )
// ...
input [2:0] a;
input [2:0] b;
reg [2:0] y1, y2, y3, y4, y5;
always @(a or b) begin
  y1 = ~a;
  y2 = a & b;
  y3 = a | b;
  y4 = a ^ b;
  y5 = a ^~ b;
end
endmodule
For the bitwise operator “~”, the above statement y1 = ~a is shorthand for the
following code.
// ...
y1[2] = ~a[2];
4 input sel ;
5 input [2 : 0] a ;
6 reg [2 : 0] y ;
7 parameter B =1;
8 always @ ( sel or a )
In the above module, when sel is true, y is the result of shifting a left by one bit.
Vacant bits are padded with 0. Therefore, it is shorthand for the following code.
// ...
y[2] = a[1];
When sel is false, y is the result of shifting a right by two bits. Vacant bits are
padded with 0. Therefore, it is shorthand for the following code.
// ...
y[2] = 1'b0;
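The two shifts described above can be written out bit by bit with concatenations. The following is a minimal sketch under the assumption of a 3-bit input a; the module name shift_demo and the outputs y_left and y_right are hypothetical.

```verilog
// Hypothetical module: bit-level equivalents of the two shifts on a 3-bit vector.
module shift_demo (y_left, y_right, a);
  output [2:0] y_left, y_right;
  input  [2:0] a;
  // Same as a << 1: each bit moves up one position, 0 enters at the LSB.
  assign y_left  = {a[1], a[0], 1'b0};
  // Same as a >> 2: each bit moves down two positions, 0s enter at the MSBs.
  assign y_right = {1'b0, 1'b0, a[2]};
endmodule
```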
In the following module, b, {2{a}} (which is the same as {a, a}), and C are
concatenated to form a vector, which is then assigned to y. The evaluation of the
Verilog code is left as an exercise.
// ...
output [10:0] y;
input [2:0] a, b;
reg [10:0] y;
parameter C = 2'b01;
always @(a or b)
  y = {b, {2{a}}, C};
endmodule
// ...
input [2:0] a;
reg [5:0] y;
always @(a) begin
  y[0] = &a;
  y[1] = |a;
  y[2] = ~&a;
  y[3] = ~|a;
  y[4] = ^a;  // XOR, odd parity
  y[5] = ~^a; // XNOR, even parity
end
endmodule
// ...
input sel;
input [2:0] a, b;
reg [3:0] y;
always @(sel or a or b)
  y = (sel == 1) ? a + b : a - b;
endmodule
We need to design a 3-to-8 decoder with enable control and its testbench. The
truth table of the 3-to-8 decoder with enable control is displayed in Table 2.10.
Figure 2.19: Gate-level schematic of the 3-to-8 decoder with enable control.
A testbench for the decoder can be written as below. The `timescale directive declares
the time unit and its precision, which are significant for delay modeling. The syntax
of `timescale is
`timescale <time_unit> / <time_precision>
For example, `timescale 10 ns/1 ns declares that the time unit is 10 ns and the time
precision is 1 ns. Use `timescale only in the top (testbench) module; it is inherited
by all submodules. If an upper module has no `timescale but its submodule has one, an
error occurs. A submodule with its own `timescale overrides the timescale of its upper
module.
4 reg enable ;
5 reg [2 : 0] in ;
6 wire [7 : 0] out ;
7 // Module instantiation
9 // Stimulus
10 initial begin
22 initial begin
26 endmodule
The time unit is 10 ns, so #10 delays 10 time units, i.e., 100 ns. As another example,
#10.75 delays 10.75 time units, i.e., 107.5 ns, which is rounded to 108 ns at the 1 ns
precision. The testbench is simple in that it generates the stimulus exhaustively.
However, when the system is large, an exhaustive testing strategy is typically not
achievable. In this case, a partial testing strategy is adopted, and some randomness
must be introduced into the verification process. Assertion checking then raises the
confidence level of the test patterns.
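The effect of the time unit and precision on delays can be checked with a small sketch such as the one below. The module name timescale_demo is hypothetical, and $timeformat is used only to print times in nanoseconds.

```verilog
`timescale 10 ns / 1 ns
// Hypothetical testbench showing how delays scale with `timescale.
module timescale_demo;
  initial begin
    // Print %t values in ns (10^-9 s) with no fractional digits.
    $timeformat(-9, 0, " ns", 10);
    // 10 time units = 100 ns
    #10    $display("after #10:    %t", $realtime);
    // Another 10.75 time units = 107.5 ns, rounded to 108 ns at 1 ns precision
    #10.75 $display("after #10.75: %t", $realtime);
    $finish;
  end
endmodule
```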
Functional simulation of the 3-to-8 decoder can dump the waveform for functional
verification using the $dumpfile and $dumpvars system tasks, as shown in Figure
2.20. Verilog provides a set of system tasks to record signal value changes in the
standard value change dump (VCD) format, which most waveform display tools can
read. The $dumpfile system task specifies the file to which the values are dumped,
and the $dumpvars system task specifies which variables are to be dumped.
You can specify the levels and scope arguments to $dumpvars using the following
syntax.
The scope arguments are given by the hierarchical names of the modules. Assume
that the instance top contains three instances, u1, u2, and u3. When the level
is 0, as in $dumpvars(0, top), all signals in top and below are dumped.
This is the most commonly used case. When the level is 1, $dumpvars(1, top) dumps
the signals in top itself, not including those in u1, u2, u3, and below. When the
level n is neither 0 nor 1, the signals in each scope argument and in the n − 1
levels below it are dumped. For example, $dumpvars(3, top.u1, top.u2) dumps the
signals in top.u1 and top.u2 and in the 2 levels below each of them.
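A minimal sketch of how these system tasks are typically called from a testbench is given below. The file name wave.vcd, the clock generator, and the 100-unit run time are assumptions made only for illustration; the testbench here contains no submodules.

```verilog
// Hypothetical testbench named top.
module top;
  reg clk;
  initial clk = 1'b0;
  always #5 clk = ~clk;
  initial begin
    $dumpfile("wave.vcd");  // VCD file that receives the value changes
    $dumpvars(0, top);      // level 0: all signals in top and every level below it
    // $dumpvars(1, top);   // level 1: only the signals declared directly in top
    #100 $finish;
  end
endmodule
```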
Timing simulation (pre-sim or post-sim) of gate-level netlist with SDF back an-
notation for the 3-to-8 decoder can dump the waveform for both function and timing
verification, as shown in Figure 2.21. Owing to different gate delays, the path delays
from different inputs to output are generally different.
PROBLEMS
1. Let a =4’b1011, b = 4’b0010, c = 8’b00000100, d = 8’b00001111, and
e =4’b0000, find the results of following expressions.
5 input [7 : 0] din ;
6 reg [7 : 0] mem [0 : 65535];
9 if ( ce & we )
10 mem [ addr ] <= din ;
11 always @ ( oe or ce )
12 if ( ce & oe )
13 tempQ = mem [ addr ];
14 else
15 tempQ =8 ’ hzz ;
16 always @ ( posedge clk )
where +, an overbar, and · denote the bitwise OR, NOT, and AND operations,
respectively.
b. Out2 = C · (A · D + B) + B · Ā, (2.9)
Figure 2.22: (a) Block diagram of packet generator and receiver. (b) Timing diagram
of the interface.
b. Write a behavior model of the packet receiver. Check CRC of received pack-
ets and obtain the data statistics for the numbers of high and low packets, and
CRC error packets.
3 Advanced Verilog Topics
In this chapter, advanced Verilog statements, such as if-else, case, for loop, function,
and task, are introduced. We emphasize design guidelines for inferring combinational
and sequential logic. The timing of a circuit is as important as its functionality.
Therefore, accurate delay modeling is the key to the success of an ASIC. The
differences between inter-assignment and intra-assignment delays, and between
blocking and non-blocking assignments, are identified. Finally, system tasks, the
approach to timing simulation, and several advanced Verilog features are presented
as well. Notably, keywords will be marked in boldface in this chapter.
5 [ else begin
6 statements
7 end ]
If the value of the expression is nonzero, then the expression is true and the state-
ment block that follows if will be executed. If the value of the expression is zero, the
expression is false and the statement block following else is executed. If-else state-
ments can cause synthesis of latches, which are not suitable for synchronous design
and static timing analysis. In the following example, if the signal en is logic one,
DOI: 10.1201/9781003187196-3 75
the output signal, out, will be assigned the result of in1 + in2. Otherwise, out will be
assigned in1.
5 reg [1 : 0] out ;
11 endmodule
The (output) port declaration and the data type (reg) declaration can be combined
as follows. Since an output port has the wire data type by default, it is not necessary
to declare an output port as wire explicitly. Likewise, all input ports have the wire
data type by default, so an input port need not be declared as wire either.
10 endmodule
To prevent latches, the outputs MUST be fully specified. This is a basic rule of
combinational circuits. If the outputs are not fully specified, latches such as those
in Figure 3.2 will be inferred because the outputs must keep their original states
under all unspecified conditions, which coincides with the functionality of latches.
Latches are sequential circuits and can be used as memory elements.
In the following example, the else condition is not specified, so latches are in-
ferred.
4 output [3 : 0] out ;
5 input enable ;
6 input [3 : 0] in ;
7 always @ ( in or enable )
8 begin
13 endmodule
In the following example, signal out is implicitly assigned a default value. If the
signal enable is false, the default value will be assigned to out. Therefore, out is fully
specified, and no latches are inferred.
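One possible way to write such a default assignment is sketched below. The port names follow the preceding latch example, while the module name no_latch and the default value 4'b0000 are assumptions.

```verilog
// Hypothetical module: a default assignment makes out fully specified.
module no_latch (out, enable, in);
  output [3:0] out;
  input        enable;
  input  [3:0] in;
  reg    [3:0] out;
  always @(in or enable) begin
    out = 4'b0000;   // default value (assumed), used when enable is false
    if (enable)
      out = in;      // overrides the default when enable is true
  end
endmodule
```

Because out receives a value on every pass through the always block, no state needs to be held and no latch is inferred.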
Rather, the following always block infers the circuit in Figure 3.5. The adder is
shared for either case of the select signal, sel, by multiplexing the operand of the
adder. The multiplexer may degrade performance.
Figure 3.5: Resource sharing must be performed in the same always block.
12 default :
13 begin
14 statements
15 end
16 endcase
// ...
input [1:0] in;
reg [3:0] out;
always @(in)
  case (in)
    2'b00: out = 4'b0001;
    2'b01: out = 4'b0010;
    2'b10: out = 4'b0100;
    2'b11: out = 4'b1000;
  endcase
endmodule
4 input [1 : 0] in ;
5 reg [3 : 0] out ;
6 always @ ( in )
When signal in=2'b11, the output is not specified and, just as we saw for if-else
statements, a latch will be inferred. To avoid such a mistake, it is always a good idea
to use a default case item to cover all remaining conditions and ensure that no latch
is implied, as shown below.
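A default case item of this kind might look like the following sketch. The module name decoder_default and the value assigned in the default branch are assumptions.

```verilog
// Hypothetical module: the default item covers in == 2'b11, so no latch is inferred.
module decoder_default (out, in);
  output [3:0] out;
  input  [1:0] in;
  reg    [3:0] out;
  always @(in)
    case (in)
      2'b00:   out = 4'b0001;
      2'b01:   out = 4'b0010;
      2'b10:   out = 4'b0100;
      default: out = 4'b1000;  // assumed value for all remaining conditions
    endcase
endmodule
```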
If you are using Synopsys Design Compiler (DC) as your synthesis tool, you can
prevent latches with the “synopsys full_case” directive, which declares that all
possible branches of a case statement have been specified, provided that you know
the other branches will never occur. Additionally, if DC cannot determine that the
case branches are parallel, a priority encoder will be synthesized. By contrast, you
can declare a case statement as a parallel case with the “synopsys parallel_case”
directive, as shown below.
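The directives are written as a comment on the case statement. The sketch below assumes a one-hot select signal sel whose other values never occur; the module name onehot_mux and its ports are hypothetical.

```verilog
// Hypothetical one-hot multiplexer; sel is assumed to be 3'b001, 3'b010, or 3'b100 only.
module onehot_mux (out, sel, a, b, c);
  output      out;
  input [2:0] sel;
  input       a, b, c;
  reg         out;
  always @(sel or a or b or c)
    case (1'b1) // synopsys full_case parallel_case
      sel[0]: out = a;
      sel[1]: out = b;
      sel[2]: out = c;
    endcase
endmodule
```

Because the directives are comments, a simulator ignores them, so they should be used only when the stated assumption about sel really holds.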
In summary, we MUST completely specify all clauses and all outputs
for every clause in case and if statements. Failure to do so will cause latches to be
synthesized.
If the conditional expression used is parallel and the functional outputs are the
same, as shown in the following two RTL codes, then the hardware synthesized will
be identical, as shown in Figure 3.6. However, it is always preferable to use a case
statement instead of an if-else-if statement to save simulation time and explicitly
infer the parallel multiplexer.
Figure 3.6: Mutually exclusive conditional expressions are synthesized to the same
parallel multiplexer.
The case statement has two variants, introduced by the keywords casez and casex,
which can use (Z or ?) and (Z, X, or ?) in their case items, respectively. Hence,
casex statements are more commonly used than casez statements. The character ?
represents logic 0 or 1. For example, the case item 4'b001? covers 4'b0010 and
4'b0011, and the case item 4'b01?? covers 4'b0100, 4'b0101, 4'b0110, and 4'b0111.
For synthesis, the characters Z and X should be treated as don't cares, i.e., logic 0
or 1, as shown below.
// ...
casex (sel)
  4'b1XXX: out = a; // sel is 1XXX
  4'b01XX: out = b; // sel is 01XX
  4'b001X: out = c; // sel is 001X
  4'b0001: out = d; // sel is 0001
  4'b0000: out = e; // sel is 0000
  default: out = a;
endcase
end
Although the casex statement is used in the above Verilog code, it is still intuitively
not parallel because, for example, 1XXX covers 8 items, namely 1000, 1001,
1010, 1011, 1100, 1101, 1110, and 1111, while 0000 covers only 1 item. If your
goal is to minimize the circuit area, the priority-encoded multiplexers in Figure 3.3
suffice. By contrast, if you want to minimize the path delay and maintain a modest
circuit area, the circuit in Figure 3.7 might be synthesized. In Figure 3.7, the path
delays from all inputs, including sel, a, b, c, d, and e, to the output, out, are balanced
as much as possible. However, the circuit area of Figure 3.7 is obviously larger than
that of Figure 3.3.
All of the elements are then synthesized in the manner shown in Figure 3.8.
Figure 3.7: Another implementation of the multiplexers described using casex state-
ment.
Figure 3.8: The for loop is unrolled and synthesized to five AND gates.
To make the for loop synthesizable (and unrollable), the low_range, high_range
and step must be constants rather than variables. Therefore, the index becomes pre-
dictable so that a fixed number of repeated circuits can be inferred.
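A for loop with constant bounds and step, which a synthesis tool can unroll into five AND gates, can be sketched as follows. The module name and5 and its ports are hypothetical.

```verilog
// Hypothetical module: constant loop bounds and step make the loop unrollable.
module and5 (y, a, b);
  output [4:0] y;
  input  [4:0] a, b;
  reg    [4:0] y;
  integer i;
  always @(a or b)
    for (i = 0; i <= 4; i = i + 1)
      y[i] = a[i] & b[i];   // unrolls into five 2-input AND gates
endmodule
```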
The maximum iteration limit of a for loop for synthesis is 4096. If your design
contains a for loop with iteration of 16384, you may separate one big for loop into
several smaller for loops.
When there are many for loops in your codes, the dummy variables of indices used
in for loops may coincide. You should declare different indices for the for loops in
different always blocks. Otherwise, you can declare the same index, say i, used in all
for loops using the named block as follows.
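A minimal sketch of the named-block approach is shown below. The module name, the block names and_loop and or_loop, and the loop bodies are hypothetical; the point is only that each named block declares its own local index i.

```verilog
// Hypothetical module: the same index name i is declared locally in two named blocks.
module named_block_demo (y1, y2, a, b);
  output [3:0] y1, y2;
  input  [3:0] a, b;
  reg    [3:0] y1, y2;
  always @(a or b)
  begin : and_loop          // named block with its own local index i
    integer i;
    for (i = 0; i < 4; i = i + 1)
      y1[i] = a[i] & b[i];
  end
  always @(a or b)
  begin : or_loop           // a separate i, local to this named block
    integer i;
    for (i = 0; i < 4; i = i + 1)
      y2[i] = a[i] | b[i];
  end
endmodule
```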
// Syntax of function
function [range] name_of_function;
  input declaration
  statements
endfunction
Range defines the width of the return value of the function (default is 1 bit). The in-
put declaration specifies the input signals for a function, and the output of a function
is assigned to the function name. For example, the following module demonstrates
three different ways to implement the combinational circuit of an adder, including
continuous assignment, always block, and function. You can call function in an al-
ways block.
// ...
reg [4:0] c1, c2;
wire [4:0] c3;
function [4:0] fn1;
  input [3:0] a;
  input [3:0] b;
  fn1 = a + b; // Like C language
endfunction
always @(a or b)
// ...
  c2 = a + b;
assign c3 = a + b;
Functions are defined in the module in which they are used, and they cannot contain
delays.
The concatenation operation is used to bundle several values into the single output
of a function when multiple outputs are needed. These outputs are then separated,
also using the concatenation operation. A function can contain one or more
procedural assignment statements (enclosed inside a begin-end pair). A function is
also a procedural block. Therefore, the left side of an assignment in a function can
contain only reg and integer variables. You can also call a function in a continuous
assignment. In the following example, it should be noted that y1 and y2 may overflow.
// ...
begin
  y1_1 = f1 + f2 + 5;
  y2_1 = f1 + f2 + 2;
  // Concatenate multi-outputs to a single output
  fn1 = {y1_1, y2_1};
end
endfunction
Function calls can also be nested, that is, a function can call another function, as
shown in the following example. Notably, fn1 may overflow.
4 fn2 = f1 + f2 ;
5 endfunction
6 function [4 : 0] fn1 ;
7 input [3 : 0] f1 , f2 ;
A Verilog Task begins with keyword task and ends with keyword endtask. It is
the section of a Verilog code that allows the designer to write more reusable, easier
to read codes. When tasks are placed within a testbench, they can be very handy
because tasks can include time delays. This is one of the main differences between
tasks and functions: functions do not allow time delays. Tasks without timing con-
trols are like functions. However, tasks with timing controls are non-synthesizable,
and therefore, they are typically used in a testbench. The syntax of a task is dis-
played below. The 2-D array declaration is not allowed for the input and output ports
declaration.
// Syntax of task
task name_of_task;
  input declaration
  // ...
  statements
endtask
The input declaration specifies the input signals for a task. A task can have no inputs.
The output and inout declaration will be similar to that in a module. A task can also
have no outputs, although a task can include multiple output, input, and inout ports
and/or delays. You can call tasks in an always block or other tasks.
A task that generates the out signal at clock edges using a sequence of 10101010
or 10100110, selected by the sel signal, is shown below. The Verilog code
“@(posedge clk)” in the task is used for the timing control.
9 task seq_gen ;
10 input sel ;
11 output out ;
12 begin
22 endtask
The differences between function and task are listed in Table 3.1. Notably, both
functions and tasks can only contain procedural assignment statements. Always
blocks are not allowed in either function or task.
4 input [ WIDTH -1 : 0] a ;
5 input [ HEIGHT -1 : 0] b ;
6 input [ LENGTH -1 : 0] c ;
11 endmodule
Otherwise, you can change the values of these parameters when instantiating
the module. For example, the following module overrides those values of param-
eters when the above test module is instantiated as WIDTH=5, HEIGHT=4, and
LENGTH=4. Therefore, the parameterized test module does not need to be modi-
fied and is easy to maintain. In this way, you need to specify all parameters and no
parameters can be omitted.
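An ordered parameter override of this kind can be sketched as follows. The parameter and port names follow the text, but the module names test_sketch and top_sketch, the default parameter values, and the body of the parameterized module are assumptions.

```verilog
// Hypothetical parameterized module; defaults and body are assumed for illustration.
module test_sketch (d, a, b, c);
  parameter WIDTH = 4, HEIGHT = 4, LENGTH = 4;
  output [WIDTH:0]    d;
  input  [WIDTH-1:0]  a;
  input  [HEIGHT-1:0] b;
  input  [LENGTH-1:0] c;
  assign d = a + b + c;
endmodule

module top_sketch (d, a, b, c);
  output [5:0] d;
  input  [4:0] a;
  input  [3:0] b;
  input  [3:0] c;
  // Ordered override: the values map to WIDTH, HEIGHT, LENGTH in declaration order.
  test_sketch #(5, 4, 4) u0 (.a(a), .b(b), .c(c), .d(d));
endmodule
```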
4 output [5 : 0] d ;
5 input [4 : 0] a ;
6 input [3 : 0] b ;
7 input [3 : 0] c ;
9 endmodule
6 test u0 (. a ( a ) , . b ( b ) , . c ( c ) , . d ( d ));
7 endmodule
Figure 3.9: (a) A buffer with two inverters. (b) Capacitive effect inside a buffer, where
VX , VA , and VY are voltages at nodes X, A, and Y , respectively. The capacitive effect,
C, at node A is due to the high-frequency parasitic capacitor in the second inverter.
Figure 3.10: (a) The current flow when input voltage, VX , changes from 0 to VDD V.
The charge on the capacitor, C, is sinked through the NMOS. (b) The current flow
when input voltage, VX , changes from VDD to 0 V. The charge on the capacitor, C, is
driven through the PMOS.
Figure 3.11: Rising time, falling time, and propagation delay of an inverter.
the signal switches at a frequency of f = 1/T Hz, the consumed dynamic power is
P = \frac{E}{T} = \frac{1}{2} f C V_{DD}^{2}. \qquad (3.2)
There are two main sources of power consumption, i.e., dynamic and static. Static
power, which is needed for analog ICs, may be generated when the direct current is
drawn from the power supply to ground. Ideally, during a steady state, as shown
in Figure 3.10, the static power of digital circuits should be 0 because there is no
direct path between VDD and GND (0 V) when either PMOS or NMOS switches
off. However, advanced semiconductor devices may have a non-negligible leakage
of current when they are in a steady state and not completely turned off because of
their low threshold voltage.
unreliable circuit. By inserting buffers between the output and the inputs of the next
stage, each buffer output is now driving a fraction of the load of the original output.
This separates the loading on the output and each buffer now drives four components
in Figure 3.12.
Figure 3.12: Using buffers to reduce loading on the output of a component. Notice
that the component marked Sx is the source component, and the component marked
Dx is the driven component.
Similarly, when the number of inputs that must be driven in the next stage is
very large, as in the clock and reset networks, we can repeatedly insert buffers
at the output of another buffer, forming a buffer tree. In Figure 3.13, for example,
the fanout is limited to 4. A two-level buffer tree stemming from the original output
drives the original inputs of the next stage using four inserted buffers. Increasing
the number of levels of a buffer tree exponentially increases the number of inputs
that can be driven. To achieve a synchronous design, the delays of each branch must
be balanced so that the clock skew can be reduced as much as possible. Hence,
buffers are inserted, spaced evenly over the chip area.
• Delay can be lumped onto the last gate driving the output. Lumped delay is
easy but does not allow for different delays if there is more than one path to
a single output.
Figure 3.13: Using a balanced buffer tree to reduce loading on a high fanout output,
such as a clock network, while keeping the clock skew to a minimum.
• Distributed delays are positioned across the gates. Distributed delays are
more accurate than a lumped delay scheme, but may not allow for different
delays if there is more than one path to a single output. For example, pins A
and B to E have the same path delay, while C to E has a different path delay.
• Specify module path delays in a specify block. Module path delays are easy
to model, and allow different delays for different paths.
A gate delay is also called an inertial or intrinsic delay. The physical behavior
of a signal transition is said to have inertia because every conducting path has
capacitance as well as resistance, which means that its charge cannot be changed
instantaneously. If the input pulse width is less than the inertial delay, the pulse
will not propagate through the gate.
// Syntax of delay
#(delay1, delay2, delay3)
If all are specified, the first delay, delay1, refers to the output rising delay (transition
to 1), the second delay, delay2, refers to the output falling delay (transition to 0), and
the last delay, delay3, refers to the output turn-off (transition to Z) delay. When no
delay specifications are given, the default gate delay is zero. When only one delay
specification is given, the output rising, falling, and turn-off delays are specified by
the same delay. When two delays are given, delay1 and delay2 represent the output
rising and falling delays, respectively. The turn-off delay is the minimum of delay1
and delay2.
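The one-, two-, and three-delay forms can be sketched on gate primitives as follows. The module name gate_delay_demo, its ports, and all delay values are illustrative assumptions.

```verilog
`timescale 1 ns / 1 ns
// Hypothetical module; the delay values are chosen only for illustration.
module gate_delay_demo (y1, y2, y3, a, b, en);
  output y1, y2, y3;
  input  a, b, en;
  and    #(3)       g1 (y1, a, b);   // one delay: rise = fall = 3
  or     #(2, 4)    g2 (y2, a, b);   // two delays: rise = 2, fall = 4
  bufif1 #(2, 4, 5) g3 (y3, a, en);  // three delays: rise = 2, fall = 4, turn-off = 5
endmodule
```

The turn-off delay is meaningful only for gates whose output can go to Z, which is why a tri-state buffer is used for the three-delay form.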
In the following Verilog code, if the timescale is defined by `timescale 10 ns/1 ns,
the delays of the NOT, OR, and AND gates will be 20, 25, and 36 ns, respectively.
If a new in[2] is applied at the 10th ns, the simulated output out will be generated
at the 91st ns. If a new in[3] is applied at the 2nd ns, the simulated output out will
be generated at the 63rd ns. Delays are ignored by synthesis.
The delay of a circuit depends on the operating condition of process (P), voltage
(V), and temperature (T). The higher the voltage and the lower the temperature, the
lower the delay, and vice versa. The combination of the worst process, lowest
voltage, and highest temperature is the worst-case operating condition. The
combination of the best process, highest voltage, and lowest temperature is the
best-case operating condition. Therefore, the clock period constraint is the most
stringent in the worst case, while the hold time constraint is the most stringent in
the best case. A situation between the best and worst cases with a common operating
condition is called the typical-case operating condition. Taking the various PVT
factors that can influence the delays (including input rising and falling times) into
consideration, the minimum, typical, and maximum values for each delay can be more
precisely modeled and separated by colons, as shown below.
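A minimal sketch of the min:typ:max notation on a gate primitive is given below. The module name mtm_delay_demo, its ports, and the delay values are illustrative assumptions.

```verilog
// Hypothetical module; each delay is written as minimum:typical:maximum.
module mtm_delay_demo (y, a, b);
  output y;
  input  a, b;
  // rise = 1:2:3, fall = 2:3:4 (values chosen only for illustration)
  and #(1:2:3, 2:3:4) g1 (y, a, b);
endmodule
```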
Secondly, delays can be modeled and timing constraints can be checked by using
the specify block, delimited by the keywords specify and endspecify. A specify block
adds timing specifications to the paths in a module. Parameters can be declared in a
specify block using the specparam keyword. A specify block is typically used to
describe (state-dependent) path delays and (conditional) timing checks. An example
is shown below, with further explanation given in the comments. If a timing
constraint is violated, for example, the setup time check $setup(in1, posedge clk, tS),
the output q of the flip-flop will become X (unknown) and be displayed in red on the
waveform. In addition, a message reporting the timing violation will be shown in the
simulation window.
5 output q , q1 ;
7 input [2 : 0] a ;
8 input [5 : 0] b ;
9 reg [2 : 0] out ;
10 reg q , q1 ;
11 always @ (*)
18 q <= in1 ;
19 always @ ( posedge clk or negedge rst_n )
20 if (! rst_n ) q1 <= 0;
21 else q1 <= in2 ;
22 specify
34 specify
45 specify
53 specify
58 endmodule
and the final placement and routing. The gate delays are specified using SDF charac-
terized by the synthesis or layout tool used for timing simulations. The SDF provides
a tool-independent and uniform way to represent timing information, including (con-
ditional and unconditional) module path delays, device delays, interconnect delays,
port delays, timing checks, and path and net timing constraints.
Cell delays are calculated using the tables in the technology library. The tables
are commonly indexed by input transition versus output load, as shown in Table 3.2.
When the input transition is 0.5 and output capacitance is 0.2, the delay is 0.678.
Table 3.2: Delay table.
                            Input transition
                            0        0.5      1
Output capacitance   0.1    0.345    0.567    0.89
                     0.2    0.456    0.678    0.987
Cell output transition can also be calculated using a table indexed by input transition
and output load (however, it is omitted here).
In advanced technology, wire delays resulting from parasitic capacitance and in-
ductance can be significant, and may take up a relatively large portion of the overall
path delay in modern systems. Minimizing the impacts of wire delays by various
means, such as shortening wire lengths, is an important task of layout tools. Inter-
connect delays or input port delays cannot be specified in a specify block. To simulate
using an interconnect delay, you must annotate the timing using SDF.
None of the delays that we specify in the Verilog code has any effect on the
synthesized circuit. A synthesis tool will ignore the delay in an assignment, or perhaps
issue a warning. We usually write assignments with delays only in testbench models
for stimulus generation, or in designs that interact with macros that impose specific
timing constraints and checks. For example, Figure 3.15 presents the interface
timing diagram of a read-only memory (ROM), where ren, addr, and
data denote the read enable, read address, and read data, respectively.
The timing constraints of both ren and addr require a setup time of 2 time units and
a hold time of 1 time unit. They are checked using the specify block in the behavioral
model of the ROM as follows.
The setup and hold times can be checked together using the system task
$setuphold, as shown below.
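A combined check of this kind might be sketched as follows. The signal names clk, ren, addr, and data and the limits of 2 and 1 time units come from the text; the module name rom_check, the bus widths, and the memory array are assumptions.

```verilog
`timescale 1 ns / 1 ps
// Hypothetical ROM model with combined setup and hold checks.
module rom_check (data, clk, ren, addr);
  output [7:0] data;
  input        clk, ren;
  input  [3:0] addr;
  reg    [7:0] data;
  reg    [7:0] mem [0:15];   // assumed size; contents are not initialized here
  always @(posedge clk)
    if (ren) data <= mem[addr];
  specify
    specparam tS = 2, tH = 1;
    // $setuphold(reference_event, data_event, setup_limit, hold_limit)
    $setuphold(posedge clk, ren,  tS, tH);
    $setuphold(posedge clk, addr, tS, tH);
  endspecify
endmodule
```

Note that $setuphold takes the reference (clock) event first, whereas $setup takes the data event first.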
Since the design is synchronous with the rising edge of clk, if there is no delay
for ren and addr signals, they will violate the hold time constraints. Therefore, at
the RTL simulation stage, we need to add delays using the delay control (introduced
in the next section) to ren and addr signals as follows, where ren_i and addr_i are
internal read enable and read address that are synchronous to clk. At the synthesis
stage, the delays will be ignored and the hold times will be fixed and ensured by
inserting buffers to those paths with hold time violations.
Only a single delay, in the (minimum : typical : maximum) form, can be specified
in procedural assignments, as follows. Consequently, the rising, falling, and turn-off
delays must be the same.
#(1 : 2 : 3) a = ~b;
There are two kinds of assignment delays: an inter-assignment (or regular) delay
and an intra-assignment delay. The inter-assignment delay is typically used in the
testbench. In the following initial block, Verilog statements are executed sequentially.
When the delay control #2 is executed, two time units must elapse. After the
delay, a + b is evaluated and the result is assigned to c. Therefore,
at time unit 2, c becomes (is assigned) 3. In this manner, the
inter-assignment delay can be simply interpreted as “the execution of c = a + b is
delayed by 2 units”. This is also called a delayed evaluation.
Alternatively, the intra-assignment delay for the code “c = #2 a + b” in the following
initial block takes several steps in Verilog. Firstly, temp = a + b is executed
“at time 0”. That is, there is an implicit temporary storage, temp, for each
right-hand-side expression. Secondly, the result of temp, i.e., 3, is assigned to c
(c = temp) only after 2 units of time have elapsed. Put another way, the result of c
is not affected by changes of a and b during those 2 time units. Therefore, at time
unit 2, c will become (be assigned) 3. This is also called a delayed assignment.
In the above two code segments, essentially, the same results are produced. That
is, at time unit 2, c will become 3. However, if there is another initial block shown
below, the additional initial block will cause the value of b to change to 4 at time unit
1, which means that the results of the inter assignment and intra assignment will be
different. Based on their different operations, the values of c produced with the
inter-assignment and intra-assignment delays will be 5 and 3, respectively.
// ...
#1 b = 4;
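The two behaviors can be compared side by side in one simulation-only sketch. The initial values a = 1 and b = 2 are assumptions chosen so that a + b is 3 at time 0, and the module and variable names are hypothetical.

```verilog
// Hypothetical testbench comparing inter- and intra-assignment delays.
module delay_demo;
  integer a, b, c_inter, c_intra;
  initial begin
    a = 1;
    b = 2;
    c_intra = #2 a + b;        // a + b evaluated at time 0 (3), assigned at time 2
  end
  initial #1 b = 4;            // b changes at time unit 1
  initial #2 c_inter = a + b;  // a + b evaluated at time 2 with b = 4
  initial #3 $display("c_inter = %0d, c_intra = %0d", c_inter, c_intra);
endmodule
```

Under these assumptions the simulation prints c_inter = 5 and c_intra = 3, matching the values discussed above.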
Figure 3.16: Gate delays and transistor counts of several logic gates.
Figure 3.17: Schematic of a full adder. The numbers on the logic gates are the gate
delays in ns.
// ...
d <= #12 1;
e <= #3 0;
f <= #2 3;
end
The difference between them is shown in the timing diagram in Figure 3.19.
Note: “<=” also denotes the relational operator less than or equal to. The two
uses are distinguished by context: “<=” is interpreted as the relational operator when
it appears inside an expression, such as the condition of an if-else statement or a
conditional operator, and as a non-blocking assignment otherwise.
Figure 3.20 is another example of the differences between blocking and
non-blocking assignments. The key information is displayed in the comments.
Again, changing the order of the statements using the blocking assignment got a
different result, while changing the order of the statements using the non-blocking
assignment did not. So, the blocking assignment is order sensitive while the non-
blocking assignment is order independent.
In the example shown, we can intuitively grasp how a blocking assignment works
in combinational circuits. For the always block with blocking assignments, the sensi-
tivity list of the always block contains all the inputs, a, b, c, and d, of a combinational
circuit. Each time the inputs change, the always block, and hence, the output out,
must be reevaluated. The statements in an always block are sequentially executed.
The up-to-date values of inputs are used to determine t1 and t2, and finally, new t1
and t2 are used to calculate out.
In an always block with non-blocking assignments, the statements are executed
concurrently. Therefore, when the inputs that trigger the always block change, out
will use the old values of t1 and t2 because their new values are not yet available. To
ensure the same behavior in a combinational circuit, the internal signals of the circuit,
t1 and t2, should also be put in the sensitivity list in addition to the circuit's inputs.
This will re-trigger (re-enter) the always block each time the values of t1 and t2 are
updated, enabling the output out to finally reach its new value. However, this
model is relatively complex and can be confusing, so combinational circuits should be
described using only blocking assignments.
As another example, sequential circuits (triggered by keywords posedge or
negedge) are described by both the blocking and non-blocking assignments in Figure
3.22.
Figure 3.22: Sequential circuits described by both blocking and non-blocking assign-
ments.
To understand what circuits these codes describe, we must analyze their behav-
iors. For the always block with blocking assignments, at every posedge of clk, three
assignments are executed sequentially. Hence, t1 is updated using the values of a
and b at posedge of clk, then t2 is updated using the new value of t1 and the value
of c at posedge of clk. Finally, out is evaluated using the new values of t1 and t2. As
can be seen, t1 and t2 are only used for temporary storage to conveniently partition a
complex expression; they do not represent true hardware register outputs and might
even be optimized away. Notably, the combinational circuit has been optimized in
that out = t1&t2 = (a&b)&(t1&c) = (a&b)&(a&b&c) = a&b&c = t1&c = t2.
For the always block with non-blocking assignments, at every posedge of clk,
three assignments are concurrently executed: (1) t1 is updated by the values of a and
b at posedge of clk, (2) concurrently t2 is updated using the old value (its new value
is not available at this moment) of t1 and the value of c at posedge of clk, and (3)
concurrently out is evaluated using the old values (their new values are not available
at this moment) of t1 and t2.
As shown, blocking and non-blocking assignments describe quite different
sequential circuits. Based on their behaviors, the blocking and non-blocking versions
infer one and three flip-flops, respectively. That is, when t1 and t2 are
described using blocking assignments, they are combinational outputs rather than
sequential outputs. Because this model can be quite confusing, sequential circuits
should be described using only non-blocking assignments.
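A minimal sketch of the two sequential descriptions analyzed above is given here. The expressions follow the relation out = t1&t2 with t1 = a&b and t2 = t1&c quoted in the text; the module and port names are assumptions.

```verilog
// Blocking version: t1 and t2 are temporaries, only out needs a flip-flop.
module seq_blocking (input wire clk, a, b, c, output reg out);
  reg t1, t2;
  always @(posedge clk) begin
    t1  = a & b;      // new t1
    t2  = t1 & c;     // uses the new t1
    out = t1 & t2;    // uses the new t1 and t2, i.e., a & b & c
  end
endmodule

// Non-blocking version: t1, t2, and out are three flip-flops in a chain.
module seq_nonblocking (input wire clk, a, b, c, output reg out);
  reg t1, t2;
  always @(posedge clk) begin
    t1  <= a & b;     // sampled at the clock edge
    t2  <= t1 & c;    // uses the old t1
    out <= t1 & t2;   // uses the old t1 and t2
  end
endmodule
```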
Yet another example is shown in Figure 3.23. As displayed above, it is clear that if
we use blocking assignments to describe sequential circuits, the order of assignments
produces different behaviors, and hence, different circuits. So, blocking assignment
is order sensitive. If we change the assignments to non-blocking, the order of assign-
ments does not affect the designated circuits. That is, non-blocking assignment is
order independent.
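The order sensitivity can be illustrated with a two-stage register example. This is a hypothetical sketch and not necessarily the code of the referenced figure; all names are assumptions.

```verilog
// Blocking, order A: q2 sees the new q1, so q2 simply follows d.
module order_blocking_a (input wire clk, d, output reg q2);
  reg q1;
  always @(posedge clk) begin
    q1 = d;
    q2 = q1;
  end
endmodule

// Blocking, order B: q2 takes the previous q1, giving a two-stage shift register.
module order_blocking_b (input wire clk, d, output reg q2);
  reg q1;
  always @(posedge clk) begin
    q2 = q1;
    q1 = d;
  end
endmodule

// Non-blocking: a two-stage shift register regardless of statement order.
module order_nonblocking (input wire clk, d, output reg q2);
  reg q1;
  always @(posedge clk) begin
    q1 <= d;
    q2 <= q1;
  end
endmodule
```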
To ensure synchronous operations in RTL design and a match between an HDL
model and its synthesized circuit, non-blocking assignments must be used for all
variables that are assigned values with an edge sensitive behavior, i.e., always
clocked. The non-blocking behavior that appears in an edge sensitive always block
Finally, let us examine the combination of different assignments and delay
controls, such as those shown in Figure 3.24. It can be seen that the inter-assignment
delays and blocking assignments have higher priority than event triggers. Therefore,
the always block can be triggered again only after the non-blocking assignment with
inter-assignment delay (signal b), the blocking assignment with intra-assignment delay
(signal c), and the blocking assignment with inter-assignment delay (signal d) have
been completed. That is, reentrance is not allowed for blocking assignments and
inter-assignment delays.
It can be observed that the always block with the non-blocking assignment and
intra-assignment delay (signal a) does not miss any events triggered by the input
signal named in. Also, the continuous assignment with an inter-assignment delay
does not miss any events triggered by the input signal in. Therefore, to model the
output delay of an always block with sensitivity list of (posedge clk), we typically
use the intra-assignment delay to model the clock-to-Q delays of sequential circuits.
To model the output delay of a combinational circuit described using the continuous
assignment, the inter-assignment delay is usually adopted.
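The two modeling styles recommended above may be sketched as follows. The delay values (1 and 2 time units) and all names are assumptions chosen only for illustration.

```verilog
`timescale 1ns/1ps
module delay_model_sketch (
  input  wire clk, d, a, b,
  output reg  q,
  output wire y
);
  // Intra-assignment delay: d is sampled at the clock edge and
  // q is updated 1 time unit later (clock-to-Q delay).
  always @(posedge clk)
    q <= #1 d;

  // Continuous assignment with a delay of 2 time units
  // (output delay of the combinational logic).
  assign #2 y = a & b;
endmodule
```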
The expression $random % s, i.e., the system function $random taken modulo s,
generates a random number in the range [−s + 1, s − 1]. If you require a non-negative
number, you must enclose $random in a pair of braces, i.e., {}, as follows.
integer a, b, c;
a = $random % 60;         // -59 <= a <= 59
b = {$random} % 60;       //   0 <= b <= 59
c = {$random} % 60 + 40;  //  40 <= c <= 99
3.9.2 I/O
The commonly used system tasks for I/O are listed in Table 3.5.
To open a file, you can use the following codes.
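A minimal sketch of opening, writing, and closing a file with the standard system tasks $fopen, $fdisplay, and $fclose is shown here. The file name result.txt and the variable name fd are assumptions.

```verilog
module file_io_sketch;
  integer fd;  // file descriptor (hypothetical name)

  initial begin
    fd = $fopen("result.txt", "w");  // hypothetical file name, opened for writing
    if (fd == 0) begin               // $fopen returns 0 when the file cannot be opened
      $display("Cannot open result.txt");
      $finish;
    end
    $fdisplay(fd, "time=%0t", $time);  // write one line to the file
    $fclose(fd);
  end
endmodule
```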
A timing check is performed using the specify block. The system tasks $setup,
$hold, and $width can check the setup time, hold time, and signal width, respectively.
Figure 3.25: Timing checks: (a) setup time and hold time and (b) width.
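As a sketch of how these checks may be written, the following D flip-flop model places $setup, $hold, and $width in a specify block. The limits (2, 1, and 4 time units) and the module name are assumptions.

```verilog
`timescale 1ns/1ps
module dff_timing_sketch (input wire clk, d, output reg q);
  always @(posedge clk)
    q <= d;

  specify
    $setup(d, posedge clk, 2);  // d must be stable 2 time units before the clock edge
    $hold(posedge clk, d, 1);   // d must be stable 1 time unit after the clock edge
    $width(posedge clk, 4);     // the high pulse of clk must last at least 4 time units
  endspecify
endmodule
```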
Considering the setup time requirement, its impact can be written as:
This causes the clock period to increase, which lowers the clock rate. In such a case,
we say that the registers are all clocked synchronously on each clock edge. The
combinational circuits perform their operations in the interval between one clock
edge and the next, a period called a clock cycle. The clocked synchronous timing
design helps us ensure that operations are completed by combinational circuits before
the time their results are clocked by the registers. This simplifies composition of a
whole RTL design via pipelined (registered) stages.
Considering the hold time requirement, its impact can be written as:
A delay between flip-flops has a minimum value. Hold time requirements are
often violated when flip-flops are connected back-to-back. To counteract this, buffers
should be inserted into the paths which do not have a long enough delay. This will
increase the chip area but will have no impact on the critical path or clock period
because the delays of those critical paths are already too large.
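In their usual textbook form, and assuming zero clock skew, the setup and hold requirements discussed above can be sketched as follows. The symbols are assumptions rather than the book's notation: $T_{clk}$ is the clock period, $t_{cq}$ the clock-to-Q delay of the launching flip-flop, $t_{pd,\max}$ and $t_{pd,\min}$ the maximum and minimum delays of the combinational path, and $t_{setup}$ and $t_{hold}$ the setup and hold times of the capturing flip-flop.

```latex
\begin{align}
  T_{clk} &\ge t_{cq} + t_{pd,\max} + t_{setup} && \text{(setup requirement)} \\
  t_{cq} + t_{pd,\min} &\ge t_{hold}            && \text{(hold requirement)}
\end{align}
```

The first relation shows why a long combinational path lengthens the clock period, and the second why inserting buffers into a short path repairs a hold violation without touching the clock period.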
To ensure the reliability of the chip, it is imperative that whatever operating condi-
tion it is in, the chip must be able to function well. Hence, the setup time requirement
MUST be verified for the worst case operating condition using the maximum delay,
while the hold time requirement MUST be verified in the best case operating con-
dition using the minimum delay. Provided these timing constraints are met, it will
then be possible to use the timing abstraction of synchronous designs. If a setup time
error occurred during chip testing, it may still be possible to resolve the problem by
extending the clock period or lowering the clock frequency. However, if the chip had
a hold time error, it must be considered a fatal error, indicating that your chip may
fail at some point.
The system task, $recovery, checks the recovery time from deassertion of an asyn-
chronous control signal of sequential circuits, such as a reset, to next valid occurrence
of a clock edge, as shown in Figure 3.26. The recovery time, trec , specifies the amount
of time which must be provided for the deassertion of an asynchronous control sig-
nal to recover the normal function of clocked sequential circuits. The system task,
$removal, checks the removal time of an asynchronous control signal of sequential
circuits after a clock edge, as shown in Figure 3.26. Given that a clock edge has
occurred, the removal time, trem , specifies the time relative to that particular clock
edge which must be provided for the deassertion of an asynchronous control signal.
When trec /trem for the deassertion of an asynchronous control signal is smaller than a
clock period, trec /trem resembles the setup/hold time of the flip-flop input.
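A minimal sketch of the two checks for an active-high asynchronous reset, as in Figure 3.26, is given here. The deassertion of the reset is the event negedge rst; the limits (3 and 2 time units) and all names are assumptions.

```verilog
`timescale 1ns/1ps
module dff_async_reset_sketch (input wire clk, rst, d, output reg q);
  always @(posedge clk or posedge rst)
    if (rst) q <= 1'b0;
    else     q <= d;

  specify
    // Reset must be deasserted at least 3 time units before the next clock edge.
    $recovery(negedge rst, posedge clk, 3);
    // Reset must stay asserted at least 2 time units after a clock edge.
    $removal(negedge rst, posedge clk, 2);
  endspecify
endmodule
```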
Timing information is annotated by the system task, $sdf_annotate, using an ini-
tial block, as demonstrated in the following. The SDF file is chip.sdf, and the scope
at which to perform annotation is chip.
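The annotation described above can be sketched as follows. The module chip_design is a hypothetical stand-in for the real design, and the instance name chip serves as the annotation scope; the file chip.sdf is expected in the simulation directory.

```verilog
// Hypothetical stand-in for the gate-level design.
module chip_design (input wire clk);
endmodule

module test_chip;
  reg clk = 1'b0;

  chip_design chip (.clk(clk));  // instance name chip is the annotation scope

  // Annotate the timing in chip.sdf onto the instance chip.
  initial $sdf_annotate("chip.sdf", chip);
endmodule
```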
Figure 3.26: Recovery and removal time limits, where the reset signal is active high.
module test_bench;
  reg [`WORD_SIZE-1:0] in;
  reg en;
  reg clk;
  `include "decode.v"
  initial clk = 0;
  initial begin
`ifdef TEST_CONDITION1
    en = 1;
`else
    en = 0;
`endif
  end
endmodule
The file decode.v contains a piece of code that you do not want to show in the
testbench, such as the instantiation of the decoder module shown below.
input en;
input [2:0] x;
endmodule
The text macro can also be defined on the simulator command line using the option
“+define+TEST_CONDITION1”.
We can define a new macro using the `define directive. For example, the following
macro, CLOG2, returns the ceiling of log2(x), the logarithm to the base 2 of x.
// Defined macro
// Ceiling function of log2(x)
`define CLOG2(x) (x<=2)?1 : (x<=4)?2 : (x<=8)?3 : (x<=16)?4 :
If you are using Synopsys Design Compiler as your synthesis tool, you can con-
trol code segments that do or do not need to be translated by DC using “synopsys
translate_on” and “synopsys translate_off”, respectively. Such a handy control en-
ables you to build behavioral codes or assertions into your design in a way that can
monitor the design functionality, as follows.
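As an illustration, the following sketch embeds a simulation-only monitor between the two pragma comments. The counter, its names, and the message are assumptions.

```verilog
module counter_with_check (
  input  wire       clk, rst_n,
  output reg  [3:0] count
);
  always @(posedge clk or negedge rst_n)
    if (!rst_n) count <= 4'd0;
    else        count <= count + 4'd1;

  // synopsys translate_off
  // Simulation-only monitor, skipped by Design Compiler.
  always @(posedge clk)
    if (rst_n && count == 4'd15)
      $display("%0t: count is about to wrap around", $time);
  // synopsys translate_on
endmodule
```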
If you do not want to check the timing during the RTL simulation, you can
disable it by:
);
...
endmodule

endmodule

output [PE_NUM-1:0] y
);
genvar i;
generate
endmodule

module ProcessElement (
  input a, b,
  output y
);
...
endmodule
PROBLEMS
1. The following RTL design implements an accumulator that can add 16 8-bit in-
puts which are given one at a time. Assert a valid signal when the result is ready.
(a) Find all of the bugs and fix them. (b) Verify the design using a functional
simulation.
input in_valid;
input [7:0] in_data;
wire [7:0] in_data;
wire [3:0] data_count;
initial begin
  out_valid = 0;
  sum = 0;
  data_count = 0;
end
if (in_valid) begin
  data_count = data_count + 1;
  sum = sum + in_data;
end
always @(posedge clk)
2. Design a 4-to-1 multiplexer. The bit width of each input and output must be
configurable or parameterized. Bit width has default value of 8 bits.
a. Using an always block.
b. Using a continuous assignment.
3. Design a 4-to-2 encoder using the case statement. The input A[3 : 0] is a one-hot
signal. That is, only one single bit can be 1 at a time. When A is 4’b0001, output
Y [1 : 0] is 2’d0; when A is 4’b0010, output Y [1 : 0] is 2’d1, and so on. Since
your conditions are not fully specified, you must use the default to specify other
conditions. For the default condition, just output Y [1 : 0] as 2’d0.
4. Design a 4-to-2 priority encoder using a casez or casex statement, together with
a valid output that indicates whether any bit of input A is logic 1. The
truth table is displayed below, where X denotes “don’t care”.
x << w;

reg out;
always @(posedge clk) begin
  if (sel)
    out <= a;
  else
    out <= b;
end

reg [7:0] counter;
always @(posedge clk or negedge rst_n) begin
  if (!rst_n)
    counter <= 0;
  else
    counter <= counter + 1;
end
9. Plot the architecture of the following piece of code. Please note that this is a bad
design that incurs a combinational loop.
reg [7:0] counter;
always @(*) begin
  if (!rst_n)
    counter = 0;
  else
    counter = counter + 1;
end
10. Considering the resource sharing, plot the architectures for the following RTL
codes.
a. RTL code 1: We assume that the adders are not shared.
module noshare (z, v, w, x, k);
  output [3:0] z;
  input [2:0] k, v, w;
  input x;
  wire [3:0] y;

  assign y = x ? k + w : k + v;
  assign z = x ? y + w : y + v;
endmodule
b. RTL code 2: We assume that the adders in the same always block can be
shared.
module share (z, v, w, x, k);
  output [3:0] z;
  input [2:0] k, v, w;
  input x;
  reg [3:0] y, z;

  always @(x or k or v or w) begin
    if (x) y = k + w;
    else y = k + v;
  end
  always @(y or x or w or v) begin
    if (x) z = y + w;
    else z = y + v;
  end
endmodule
integer i;
for (i = 0; i <= 31; i = i + 1) begin
  s[i] = a[i] ^ b[i] ^ carry;
integer i, j;
always @(a or b or c or d) begin
  temp[0] = a;
  temp[1] = b;
  temp[2] = c;
  temp[3] = d;
  for (i = 2; i >= 0; i = i - 1)
    for (j = 0; j <= i; j = j + 1)
      if (temp[j] > temp[j+1]) begin
        buffer = temp[j+1];
        temp[j+1] = temp[j];
        temp[j] = buffer;
      end
  out = temp[3];
end
endmodule
b. How are sorting results arranged or assigned? That is, what is the maximum
number, second highest number, and so on?
function comp_swap (a, b);
  input [3:0] a, b;
  // Larger number is placed at the MSBs
  if (a > b) comp_swap = {b, a};
  else comp_swap = {a, b};
endfunction
rega = data;
regb = rega;
end
endmodule
b. RTL code 2.
endmodule

reg K, out;
always @(en or A or B or C)
  if (en) begin
    K <= !(A & B);
    out <= !(K | C);
  end
endmodule
b. RTL code 2.
reg K, out;
always @(en or A or B or C)
  if (en) begin
    K = !(A & B);
    out = !(K | C);
  end
endmodule

reg [3:0] YA; reg [3:0] PA;
integer N;
PA[N] <= PA[N-1];
PA[0] <= Data;
YA <= PA;
end
endmodule
b. RTL code 2.
reg [3:0] YA, PA;
integer N;
endmodule
c. RTL code 3.
reg [3:0] YA, PA;
integer N;
endmodule
18. There are six different design pieces below. Please plot their waveforms
according to the behaviors of blocking and non-blocking assignments. Then, verify
your waveforms using simulation results. Signals a, b, c, and d are 5 bits; E is 1 bit;
out is 7 bits; and e and f are 6 bits. Which design pieces have the same
functionality? Among them, identify those that obey the RTL coding guidelines.
a. RTL code 1.
b. RTL code 2.
c. RTL code 3.
d. RTL code 4.
e. RTL code 5.
f. RTL code 6.
always @(a or b or c or d) begin
  e = a + b;
  f = c + d;
end
if (E)
  out <= e + f;
19. Timing analysis: Assume that the clock is synchronous without any skew. The
full adder, FA, is implemented using that in Figure 3.18. According to the speci-
fication, the input signals have an input delay of 2 time units. The output signals
have an output delay of 3 time units. The flip-flops have setup time and hold
time of 2.2 and 0.8 time units, respectively. (a) Please identify all paths and find
their path delays in the RTL design. Notice that the registered output, result_r,
and the non-registered output, result, are both the outputs of the module, adder.
(b) Find the critical path and maximum allowable clock frequency.
wire [2:0] result;
reg [2:0] result_r;

endmodule

...
endmodule
R = R - 1;
if (R == 0) E = 1;
else E = 0;
b. RTL code 2.
R <= R - 1;
if (R == 0) E <= 1;
else E <= 0;
21. Draw a block diagram of the following always block. Does the always block
contain unnecessary codes? If yes, please remove them.
22. If the clock period of a design is 2 ns, the minimum width, recovery time, and
removal time for the reset of sequential circuits in the design are 7, 5, and 9 ns,
respectively. Please design the reset signal and an enable signal for the normal
function that will obey the timing specifications.
4 Number Representation
Application-specific integrated circuits process binary information. In many
situations, however, integers alone cannot meet our needs. For example, integers
cannot distinguish a fraction such as 0.567 from 0.123. For this reason, we introduce
the binary point, which is similar to the decimal point. Subsequently, the fixed-point
binary numbers used in most ASIC designs, and their operations, are introduced. To
understand the fixed-point number representation, we must know how to convert
between the binary and decimal values of fixed-point numbers. Fixed-point number
design for digital signal processing applications, including the choice of bit width
and precision (or resolution), is then presented. The dynamic range of fixed-point
numbers is the key to fixed-point design if we want to avoid overflow. In other
situations, we may want to represent data with a very large dynamic range. Hence,
floating-point binary numbers are briefly introduced as well.
$$e_p(x) = \frac{e_a(x)}{|x|}. \tag{4.2}$$
The quality of a number representation is given by the precision (or accuracy), i.e.,
the maximum error over all inputs x within its range X. Thus, the absolute precision
is given by
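Following the definition in the preceding sentence, the absolute precision and, by analogy, the precision percentage can be sketched as below. Here $p_a$ is the symbol used later in this section, while $p_p$ is an assumed symbol for the precision percentage.

```latex
\begin{equation*}
  p_a = \max_{x \in X} e_a(x), \qquad
  p_p = \max_{x \in X} e_p(x).
\end{equation*}
```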
Note that the error and precision percentages are not defined near x = 0.
When we want to represent numbers with a given absolute precision, fixed-point
binary numbers are often used. By contrast, when a given precision percentage is
required, floating-point numbers are more adequate.
For example, suppose we represent real numbers over the range X = [0, 100] as
7-bit binary integers by representing each real number with a nearby integer. Picking
the nearest integer to a real number is referred to as rounding, whereas truncation
simply discards the fractional part. Rounding may require an addition (or increment)
while truncation does not. We would then represent 45.678 as 46 or $101110_2$ using
rounding, and the absolute error of representing this number by rounding is
$e_a(45.678) = |45.678 - 46| = 0.322$. The absolute precision over the entire range
can be found to be $p_a = 0.5$ since a value halfway between two integers, e.g.,
45.500, has the maximum error whether it is rounded up or down. If truncation is
adopted, $e_a(45.678) = 0.678$ and $p_a = 1$.
We should clarify the difference between precision and resolution. The resolution
of both the rounding and truncation representations discussed above is 1.0 because
integers are uniformly spaced one unit apart. However, the precision of rounding is
0.5, and the precision of truncation is 1.0. Note that the smaller the precision is, the
better the number representation is.
$$\frac{\sum_{i=0}^{n-1} a_{i-f}\,2^{i}}{2^{f}} = \sum_{i=0}^{n-1} a_{i-f}\,2^{i-f}. \tag{4.5}$$
As you can see, the first bit position to the right (left) of the binary point has a weight
of $2^{-1} = 0.5$ ($2^{0} = 1$), the second bit position to the right (left) of the binary
point has a weight of $2^{-2} = 0.25$ ($2^{1} = 2$), and so on. Additionally, the weights
of all bits have positive values.
We use the format u(p.f) to refer to the unsigned fixed-point binary number. Using
this shorthand, the system with n = 4 and p = 1 is a u(1.3) fixed-point system.
An example of u(1.2) fixed-point binary numbers and their decimal values is listed in
Table 4.1. In this table, the integer values are the binary numbers without considering
the binary point.
Unfortunately, Verilog does not provide a fixed-point data type. An unsigned
fixed-point binary number can be declared as a[(p−1) : −f] in Verilog. For example,
a u(1.3) wire, a[0 : −3], is declared as follows. The negative index clearly indicates
the position of the binary point and the number of bits to the left and right of it.
Even so, Verilog treats the variable a[0 : −3] as an integer. That is, if
a[0 : −3] = 4'b1001, it is displayed on a waveform or screen as 4'd9 (in decimal)
instead of the real fixed-point number 1.125 (in decimal).
Of course, you can also declare the u(1.3) wire as a[3 : 0] below. However, the
position of the binary point is not self-documented using this declaration. Notice
that vectors declared as [0 : −3] and [3 : 0] are interpreted by Verilog as the same
value; only the indices differ. The index range of [0 : −3] is used to document
the fixed-point number with 1-bit integral and 3-bit fractional parts.
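The two declarations and the integer interpretation described above can be sketched as follows. The module name and the wire name b are assumptions.

```verilog
module fixed_decl_sketch;
  wire [0:-3] a = 4'b1001;  // u(1.3): the index documents the binary point
  wire [3:0]  b = 4'b1001;  // same bits, binary point not documented

  initial begin
    #1;
    // Verilog treats both vectors as the integer 9.
    $display("a = %0d, b = %0d", a, b);
    // Scaling by the resolution 1/8 recovers the fixed-point value 1.125.
    $display("a as u(1.3) = %f", a / 8.0);
  end
endmodule
```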
If we add a sign bit to the left of the integral bits, we refer to the n-bit signed
fixed-point binary number with a p-bit integer part (including one bit allocated
for the sign) and an f-bit fraction as the s(p.f) format, where n = p + f. Like integers,
we use 2's complement numbers in s(p.f) systems. Similar to unsigned numbers, an
n-bit signed fixed-point binary number is a representation where the value of the
number $\{a_{p-1}, a_{p-2}, \ldots, a_0 \,.\, a_{-1}, \ldots, a_{-f+1}, a_{-f}\}$, with p bits
to the left and f bits to the right of the binary point, is given by
$$\frac{-a_{p-1}2^{n-1} + \sum_{i=0}^{n-2} a_{i-f}\,2^{i}}{2^{f}} = -a_{p-1}2^{n-1-f} + \sum_{i=0}^{n-2} a_{i-f}\,2^{i-f} \tag{4.6}$$
where $a_{p-1}$ is the sign bit. As you can see, similar to the unsigned numbers, the
first bit position to the right (left) of the binary point has a weight of $2^{-1} = 0.5$
($2^{0} = 1$), the second bit position to the right (left) of the binary point has a weight
of $2^{-2} = 0.25$ ($2^{1} = 2$), and so on. However, the weight of the sign bit $a_{p-1}$
is negative with a value of $-2^{p-1}$, and the weights of the other bits still have
positive values.
An example of s(1.2) fixed-point binary numbers and their decimal values is listed
in Table 4.2. In this table, the integer values are the binary numbers without
considering the binary point. It can be observed that when the sign bit $a_{p-1}$ of a
binary number is 1, its decimal value is negative.
Example 4.1. Assume that the binary point is four places from the right. What
number is represented by the 8-bit unsigned u(4.4) fixed-point binary number
01010010, that is, 0101.0010?
Solution: The number is
$$0101.0010_2 = 2^{2} + 2^{0} + 2^{-3} = 5.125.$$
Or, since the binary point is four places from the right, you can obtain the result from
the integer value by 82/16 = 5.125.
Example 4.2. As you can see, the implied binary point is not specified in the
hardware. The designer needs to consider an appropriate scaling factor to correctly
interpret the result of the calculations. Assume that the binary point is six places from
the right. What number is represented by the 8-bit unsigned u(2.6) fixed-point binary
number 01010010 of the previous example, that is, 01.010010?
Solution: The number is
$$01.010010_2 = 2^{0} + 2^{-2} + 2^{-5} = 1.28125.$$
Or, since the binary point is two places to the left of that in the previous example, you
can obtain the result directly from that of the previous example by 5.125/4 = 1.28125.
Example 4.3. Assume that the binary point is four places from the right. What
number is represented by the 6-bit signed s(2.4) fixed-point binary number 101101,
that is, 10.1101?
Solution: The number is
$$10.1101_2 = -2^{1} + 2^{-1} + 2^{-2} + 2^{-4} = -1.1875.$$
$$0 \le a \le 2^{p} - r.$$
$$0 \le b \le 2^{n} - 1.$$
The dynamic range of the n-bit unsigned fixed-point number a can also be derived
from the range of the n-bit unsigned integer by multiplying it by the resolution r as
$$0 = \frac{0}{2^{f}} \le a \le \frac{2^{n} - 1}{2^{f}} = r \times (2^{n} - 1) = 2^{p} - r.$$
For a signed s(p.f) fixed-point number a, the largest number we can represent is
$2^{p-1} - r$ when $a_{p-1} = 0$ and $a_{p-2} = \cdots = a_{-f+1} = a_{-f} = 1$ because all
of the weights are positive except that of the sign bit $a_{p-1}$. The smallest or most
negative number is $-2^{p-1}$ when $a_{p-1} = 1$ and
$a_{p-2} = \cdots = a_{-f+1} = a_{-f} = 0$. Therefore, an s(p.f) fixed-point number a is in
the following range,
$$-2^{p-1} \le a \le 2^{p-1} - r.$$
$$-2^{n-1} \le b \le 2^{n-1} - 1.$$
The dynamic range of the n-bit signed fixed-point number a can also be derived
from the range of the n-bit signed integer by multiplying it by the resolution r as
$$-2^{p-1} = \frac{-2^{n-1}}{2^{f}} \le a \le \frac{2^{n-1} - 1}{2^{f}} = r \times (2^{n-1} - 1) = 2^{p-1} - r.$$
The decimal value of a fixed-point binary number can be easily obtained from that
without the binary point using the resolution. We can rewrite the value of an n-bit
unsigned fixed-point binary number as
$$\frac{\sum_{i=0}^{n-1} a_{i-f}\,2^{i}}{2^{f}} = 2^{-f} \sum_{i=0}^{n-1} a_{i-f}\,2^{i} = r \underbrace{\sum_{i=0}^{n-1} a_{i-f}\,2^{i}}_{\text{integer value}}. \tag{4.7}$$
Likewise, we can rewrite the value of an n-bit signed fixed-point binary number as
$$\frac{-a_{p-1}2^{n-1} + \sum_{i=0}^{n-2} a_{i-f}\,2^{i}}{2^{f}} = r \underbrace{\left( -a_{p-1}2^{n-1} + \sum_{i=0}^{n-2} a_{i-f}\,2^{i} \right)}_{\text{integer value}}. \tag{4.8}$$
Consequently, to convert a fixed-point binary number to its decimal value, we just
• Step 1: convert it to an integer,
• Step 2: multiply the result by r.
Example 4.4. Write a Verilog behavioral model to convert an s(2.3) fixed-point
number to its decimal value.
Solution: To derive the decimal value of a fixed-point number a, we can declare
a variable of real type and store in it the fixed-point number scaled by the resolution
$r = 2^{-f}$. In the following Verilog code, the symbol ** is the power operator. For
example, x ** y means x to the power of y, i.e., $x^{y}$.
The system function $itor converts an integer (or signed reg) to a real value.
The signed declaration will be introduced in the next chapter. Verilog does not
support a fixed-point data type, and hence a[p−1 : −f] declared as a signed reg is
treated as an integer. For example, if the fixed-point number a[1 : −3] is 10.101,
$itor(a) returns $-1.1 \times 10^{1}$ (i.e., −11) instead of −11/8 = −1.375 because
a[1 : −3] and a[4 : 0] are treated the same except that their indices differ. For
a[1 : −3], bit 0 is called a[−3], bit 1 is called a[−2], and so on.
Therefore, the result of $itor needs to be scaled by the resolution
$r = 2^{-3} = 1/8 = 0.125$ to obtain the exact real value of the fixed-point number
a[1 : −3], i.e., −11/8 = −1.375.
// Resolution
real r = 1.0 / (2 ** f);  // 1.0 forces real division
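A complete behavioral model along the lines of Example 4.4 might look as follows. It is a sketch only: the module name, the variable a_real, and the test value 10.101 are assumptions, and r is computed with 1.0 in the numerator so that the division is carried out in real arithmetic.

```verilog
module fixed_to_real_sketch;
  parameter p = 2;  // integer bits, including the sign bit
  parameter f = 3;  // fractional bits

  real r;                  // resolution
  reg signed [p-1:-f] a;   // s(2.3) fixed-point number
  real a_real;             // its real (decimal) value

  initial begin
    r = 1.0 / (2 ** f);        // 0.125
    a = 5'b10101;              // 10.101 in s(2.3), integer value -11
    a_real = $itor(a) * r;     // Step 1: to integer, Step 2: multiply by r
    $display("a = %b, integer = %0d, real = %f", a, a, a_real);
  end
endmodule
```

For the test value above, the scaled result is −11/8 = −1.375, as derived in the text.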
Step 1 obtains the scaled version of the decimal number by dividing it by $r = 2^{-f}$
or, equivalently, multiplying it by $1/r = 2^{f}$, which shifts the binary point of the
binary representation of the original decimal number to the right by f positions. Step 2
obtains the integer value of the scaled version by rounding it to the nearest integer to
reduce the conversion error. Step 3 converts this integer to a binary format, which
is much simpler than directly converting a decimal number (with a fractional part) to
its fixed-point format. Step 4 scales the rounded integer to the fixed-point binary
number by shifting the binary point of the binary format to the left by f positions.
The conversion of a decimal number (with a fraction) to a fixed-point binary
number relies on the conversion of an integer to a binary number. To this end, the
decimal number is multiplied by $1/r = 2^{f}$ (Step 1) and then rounded to an integer
(Step 2). The conversion of an integer to a binary number is relatively simple (Step 3).
For example, the unsigned integer 87 can be expressed as
$87 = 64 + 16 + 4 + 2 + 1 = 2^{6} + 2^{4} + 2^{2} + 2^{1} + 2^{0} = (1010111)_2$. Finally,
the binary integer is scaled back to the (approximate) fixed-point binary number by
multiplying it by $r = 2^{-f}$ (Step 4).
Suppose we want to convert 1.816 to our u(1.3) fixed-point format. We first
multiply 1.816 by $2^{f} = 2^{3} = 8$, giving 14.528. Then, we round 14.528 to the integer
15. We then convert it to the binary integer $1111_2$. Finally, we multiply it by
$r = 2^{-f} = 1/8$ (in decimal), i.e., shift its binary point to the left by f = 3 bits,
giving $1.111_2$, which represents 1.875. The absolute error is
$e_a(1.816) = |1.816 - 1.875| = 0.059$. It can be shown that the precision is r/2, or
0.0625 in this case.
Example 4.5. Write a Verilog behavioral model to convert a (decimal) real number
to the s(2.3) fixed-point format.
Solution: To derive the fixed-point format of a decimal number, we can declare
a variable a[p−1 : −f] of signed reg type to store the fixed-point representation
of the decimal number, obtained by scaling the real number by the resolution
$r = 2^{-f}$.
In the following Verilog code, the system function $rtoi converts a real to an
integer or signed reg value by truncation. To round a decimal number, the fractional
part of its scaled version, a_r_scaled_f, is obtained and compared to 0.5 to see if
rounding is needed.
Verilog does not support a fixed-point data type, and hence a[p−1 : −f]
declared as a signed reg is treated as an integer. Therefore, the rounded (and scaled)
result, a_r_round_i, is the final fixed-point version of the decimal number.
For your convenience, the real value of the final fixed-point version, which is an
integer or signed reg, is also obtained by scaling the fixed-point version by the
resolution r.
// Resolution
real r = 1.0 / (2 ** f);  // 1.0 forces real division
real a_r_scaled_i;
real a_r_round_i;
// version
// All parts of scaled version
// Rounding
a_r_scaled_i;
assign a = a_r_round_i; // Final fixed-point version
// version if needed
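A self-contained sketch of the conversion in Example 4.5 is given below. It does not reproduce the listing above: the variable names, the test value −1.3, and the choice of rounding halves away from zero are assumptions made for illustration.

```verilog
module real_to_fixed_sketch;
  parameter p = 2;  // integer bits, including the sign bit
  parameter f = 3;  // fractional bits

  real r;                  // resolution
  real a_real;             // real number to convert (hypothetical test value)
  real a_scaled;           // Step 1: scaled version
  integer a_int;           // Step 2: rounded integer
  reg signed [p-1:-f] a;   // final s(2.3) fixed-point number
  real a_back;             // real value of the fixed-point result

  initial begin
    r = 1.0 / (2 ** f);
    a_real = -1.3;
    a_scaled = a_real / r;        // about -10.4
    a_int = $rtoi(a_scaled);      // truncation
    // Rounding: compare the fractional part with 0.5
    if (a_scaled - a_int >= 0.5)
      a_int = a_int + 1;
    else if (a_scaled - a_int <= -0.5)
      a_int = a_int - 1;
    a = a_int;                    // Steps 3 and 4: the bits are the fixed-point number
    a_back = $itor(a) * r;        // real value of the fixed-point version
    $display("%f -> %b (%f)", a_real, a, a_back);
  end
endmodule
```

The input must lie within the dynamic range of s(2.3), i.e., between −2 and 2 − r; no saturation is performed in this sketch.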
To decide the fixed-point format, in addition to the dynamic range, its resolution
should also be fine enough. In the above example, the resolution of the adopted
s(1.30) format is $r = 2^{-30}$, which equals that of the ideal format, s(2.30).
To decide the fixed-point format of an application, we need to consider two
main factors:
Example 4.6. Write a Verilog module for a code converter that has an input
representing an unsigned number in the range 0 to 24 with a precision of at least 0.01,
and an output representing a signed number in the range −50 to 50 with a precision of
at least 0.01.
Solution: For the input, we need 5 bits before the binary point, since
$\lceil \log_2 24 \rceil = 5$. We need a precision of at most 0.01, so with rounding the
required resolution is 0.02. Since $\lfloor \log_2 0.02 \rfloor \approx \lfloor -5.6439 \rfloor = -6$,
we need 6 bits for the fractional part, which gives us a resolution of $2^{-6} = 0.015625$
and a precision of $2^{-7} = 0.0078125$. That is, the input is in u(5.6) format.
For the output, $\lceil \log_2 50 \rceil = 6$, so we need 6 bits, plus one for the sign bit,
giving 7 bits before the binary point. To give an output with a precision of at least
0.01, we also need 6 bits for the fractional part. That is, the output is in s(7.6) format.
From the above, we just need to extend the 5 integral bits of the input with 2 zero
bits to get the 7 integral bits of the output. The two vacant bits are padded with 00
because the input is unsigned. Since we need the same output precision as the input,
we need the same number of fractional bits, 6. The Verilog code is given below.
endmodule
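The converter derived in Example 4.6 can be sketched as follows. The module and port names are assumptions; the formats u(5.6) and s(7.6) and the zero padding come from the solution above.

```verilog
module code_converter_sketch (
  input  wire        [4:-6] in_u,   // u(5.6): 0 to 24 fits in 5 integer bits
  output wire signed [6:-6] out_s   // s(7.6): sign bit plus 6 integer bits
);
  // The input is unsigned, so the two extra integer bits are padded with 00.
  assign out_s = {2'b00, in_u};
endmodule
```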
4.2.2 OPERATIONS
4.2.2.1 Addition Operation
We can perform the basic operations on fixed-point binary numbers just as if they
were integers. The position of the binary point must be kept in mind before and after
the operation. For example, adding two u(p.f) fixed-point binary numbers gives a
result in the u((p + 1).f) fixed-point format. Therefore, 1 more bit is required if 2
numbers are added. An example is presented in Figure 4.1, where, to avoid overflow,
two u(2.3) fixed-point binary numbers are added, leading to a sum in the u(3.3)
fixed-point format. Generally, if there are M numbers to be added, we should add
$\lceil \log_2 M \rceil$ bits.
Figure 4.1: Adding two u(2.3) fixed-point binary numbers. The sum requires one
more bit which gives us u(3.3) fixed-point format.
adding the u(2.3) format number A =11.101 to the u(4.2) format number B =1111.01
without incurring an overflow. As shown in Figure 4.2, we first align the binary
point of u(4.2) to u(4.3), and then add Y = A + B =11.101 + 1111.010, which gives
Y =10010.111.
Figure 4.2: Adding u(2.3) number to the u(4.2) number. You need to align the binary
point. The sum requires one more bit which gives us u(5.3) fixed-point format.
endmodule
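The alignment of Figure 4.2 can be sketched in Verilog as follows. The module and port names are assumptions; the formats follow the example, where A = 11.101 and B = 1111.01 give Y = 10010.111.

```verilog
module add_u23_u42_sketch (
  input  wire [1:-3] a,   // u(2.3)
  input  wire [3:-2] b,   // u(4.2)
  output wire [4:-3] y    // u(5.3): one extra integer bit for the carry
);
  // Align the binary points: pad a with two integer zeros to u(4.3),
  // and append one fractional zero to b to turn u(4.2) into u(4.3).
  assign y = {2'b00, a} + {b, 1'b0};
endmodule
```

Because y is 8 bits wide, both 7-bit operands are extended to 8 bits before the addition, so the carry out of the most significant bit is kept.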
Generally, for a multiplication whose operands have the u(p1.f1) and u(p2.f2) (or
s(p1.f1) and s(p2.f2)) formats, the product has the u((p1 + p2).(f1 + f2)) (or
s((p1 + p2).(f1 + f2))) fixed-point format without any overflow.
Therefore, the operands need to be sign extended to the bit width of the product, as
shown below.
endmodule
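A sketch of a signed multiplication with explicit sign extension is given here. The operand formats s(2.3) and s(1.4), the parameter names, and the module name are assumptions.

```verilog
module mult_signed_sketch #(
  parameter P1 = 2, F1 = 3,   // s(2.3) operand (assumed format)
  parameter P2 = 1, F2 = 4    // s(1.4) operand (assumed format)
) (
  input  wire signed [P1-1 : -F1]         a,
  input  wire signed [P2-1 : -F2]         b,
  output wire signed [P1+P2-1 : -(F1+F2)] y   // s(3.7) product
);
  // Sign extend both operands to the bit width of the product.
  wire signed [P1+P2-1 : -(F1+F2)] a_ext = {{(P2+F2){a[P1-1]}}, a};
  wire signed [P1+P2-1 : -(F1+F2)] b_ext = {{(P1+F1){b[P2-1]}}, b};

  assign y = a_ext * b_ext;
endmodule
```

When all operands are declared signed, writing y = a * b has the same effect, because Verilog then performs the sign extension to the width of y automatically.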
and rounded to the original precision. This sum is then usually scaled and rounded to
get a final result back in u(0.16) format for unsigned numbers or s(1.15) for signed
numbers.
Example 4.7. The inputs x(n), coefficients $h_m$, m = 0, 1, ..., 7, and y(n) of the
eight-tap FIR filter in Figure 4.4 are all s(1.15) numbers. Determine the bit widths and
fixed-point number formats of the intermediate variables such that no quantization
errors occur.
Solution: The bit widths and fixed-point number formats of all signals are
labeled in Figure 4.5. The block Q quantizes its input using truncation. The product
of two s(1.15) numbers needs the s(1.30) format. The accumulation of 8 s(1.30)
numbers requires the s(4.30) format, in which $\lceil \log_2 8 \rceil = 3$ more bits are added.
Finally, the block Q quantizes its input in s(4.30) format into the s(1.15) format by
dropping unnecessary bits using truncation without overflow detection.
Example 4.8. Write down the RTL code of the eight-tap FIR filter.
Solution: After designing the fixed-point system, it is straightforward to write
down its RTL code as follows. In the Verilog code, x0, x1, ..., and x7 respectively
represent x(n − 7), x(n − 6), ..., and x(n).
always @ (*)
  tmp_y = x7 * h0 + x6 * h1 + x5 * h2 + x4 * h3 +
          x3 * h4 + x2 * h5 + x1 * h6 + x0 * h7;
assign y = tmp_y[0 : -15];
endmodule
To verify the digital design of the eight-tap FIR filter, a behavioral model of the filter using real numbers is developed to check the results of the fixed-point design. It must be emphasized that the behavioral model, which serves as the golden reference, should perform the same operations as the fixed-point design, such as the quantization, so that the two match exactly.
Verilog does not provide a data type for fixed-point numbers. To evaluate the value of the result of a fixed-point operation, we can declare a variable of real type that stores the decimal value of a fixed-point number, obtained by scaling the number by the resolution of the operands, r = 2^(−15), as shown below.
The scaling is required, since our actual interpretation of the variable is a fixed-point
value with a fraction instead of an integer. Real variables are actually represented
using floating-point format. Then, operations are performed on these real variables,
as shown below, where x0, x1,..., x7 respectively represent x(n − 7), x(n − 6),..., x(n).
In the Verilog codes, the signed declaration is very handy for the signed number
arithmetic because sign extension can be automatically performed by Verilog and
synthesis tool.
real r = 1.0/(2**f); // 1.0 forces real division; 1/(2**f) with integer operands truncates to 0
real tmp_y_r;
// s(1.15) fixed-point number version
wire signed [p-1 : -f] x0, x1, x2, x3, x4, x5, x6, x7;
wire signed [p-1 : -f] h0, h1, h2, h3, h4, h5, h6, h7;
// Resolution
real r1 = 1.0/(2**f1); // 1.0 forces real division
// Integer version of y
// version
// Gold result: integer version of quantized y
// version
// Scaled by the resolution r1 of tmp_y_r
assign tmp_y_i = $rtoi(tmp_y_r / r1);
// Quantize y_i
A floating-point number has two components: the exponent e and mantissa m. The
value represented by a floating-point number is given by
\[ v = m \times 2^{e-x} \tag{4.9} \]
where m is a binary fraction, e is a binary integer, and x is a bias on the exponent that
is used to center the dynamic range. The mantissa, m, is a fraction which means that
the binary point is on the left of the most significant bit (MSB) of m. The exponent,
e, is an integer. If the bits of m are {m_{n−1}, ..., m_0} and the bits of e are {e_{k−1}, ..., e_0},
the value of the floating-point number is given by
\[ v = \sum_{i=0}^{n-1} m_i 2^{i-n} \times 2^{\sum_{i=0}^{k-1} e_i 2^i - x}. \tag{4.10} \]
PROBLEMS
1. Convert decimal 67 to u(7.0) fixed-point binary format.
2. Convert decimal 0.375 to u(1.3) fixed-point binary format.
3. Convert decimal 67.75 to u(7.2) fixed-point binary format.
4. Convert u(5.2) unsigned fixed-point binary format, 10110.11, to decimal.
5. Convert the number 4.23 into each of the following fixed-point formats, and then
convert it back to decimal. Obtain the absolute error and error percentage of
each representation.
a. u(4.1);
b. s(5.2);
c. s(5.5);
6. Design a fixed-point system for numbers ranging from 0 to 31 with a precision
of 0.05.
7. Design a floating-point system to represent a measurement of 1 × 10^(−6) to
1 × 10^7 with 2.5% precision. Represent the value 4.5 in this format.
8. Perform the following unsigned binary additions to produce 8-bit results. In each
case, does the addition overflow or not?
a. 00110010 + 10010100.
b. 11110000 + 00110010.
c. 11001100 + 10001111.
9. Write a Verilog code that adds four 12-bit unsigned binary numbers to produce
a 12-bit result with overflow detection.
10. Perform the following unsigned binary subtractions to produce 8-bit results. In
each case, does the subtraction overflow or not?
a. 10111000 − 01010000.
b. 01110000 − 00110010.
c. 01111100 − 10000111.
11. What numbers are represented by the following unsigned u(4.3) fixed-point
binary numbers, 1001001 and 0011110?
12. What is the range and precision of each of the following unsigned fixed-point
representations?
a. 12 bits, with p = 5 and f = 7.
b. 10 bits, with p = 0 and f = 10.
c. 8 bits, with p = 8 and f = 0.
13. How many integral and fractional bits would be required to represent numbers
in the range from 0.0 to 12.0 with a precision of 0.002?
14. Assume signed s(5.3) fixed-point binary numbers. What decimal numbers
are represented by the following binary numbers: 00101100 and 11111101?
15. How many integral and fractional bits would be required to represent numbers
in the range from −5.0 to +5.0 with a precision of 0.01?
16. Prove that the sign extension does not affect the original value of a fixed-point
number.
17. The architecture for the 8-point decimation-in-time (DIT) fast Fourier transform
(FFT) is shown in Figure 4.6. The complex inputs and outputs are parallel-in and
parallel-out, respectively. That is, the input data in a block, x[n], n = 0, 1, ..., 7, and
the output blocks, X[k], k = 0, 1, ..., 7, are available in every clock cycle. The real
and imaginary parts of all inputs, outputs, and twiddle factors W_N^i = e^{−j2πi/N},
i = 0, 1, 2, 3, are represented by s(1.15) fixed-point numbers. Please design a purely
combinational circuit for the feed-forward DIT FFT. Determine the bit widths of
intermediate variables such that no quantization errors occur. The final outputs,
X[k], are quantized by rounding such that they can be represented by s(1.15)
fixed-point numbers.
The bit-wise operators in the RTL codes represent straightforward logic gates shown
in Figure 5.2.
In the Verilog codes, if a, b, c, and out are declared as 2-bit vectors with index
[1 : 0], the logic gates become that shown in Figure 5.3.
Depending on the synthesis tool and design constraints, the arithmetic operator in
the RTL codes may be implemented using logic gates shown in Figure 5.4.
A 3-bit adder is described below. Adding two 3-bit numbers requires a 4-bit sum,
so that no overflow can occur.
assign sum = a + b;
The adder might be synthesized to the ripple-carry adder in Figure 5.5, where each full
adder might in turn be implemented by the logic gates in Figure 5.4. In the ripple-carry
adder, the carry in of the first full adder is tied to a constant 1'b0, the carry out
of the first full adder, c_out[0], is connected to the carry in of the second full adder,
and the carry out of the second full adder, c_out[1], is connected to the carry in
of the third full adder. As a result, the carry propagates from the least significant bit
to the most significant bit just like a ripple in water, which is why such an adder is
named the ripple-carry adder.
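The carry chain described above can be sketched structurally as follows. The module names full_adder_sketch and rca3_sketch and the port names are assumptions made for illustration, and the circuit a synthesis tool actually produces may differ.

```verilog
// Sketch: 3-bit ripple-carry adder built from three full adders.
// Module and port names are hypothetical.
module full_adder_sketch (
  output wire sum, c_out,
  input  wire a, b, c_in
);
  assign sum   = a ^ b ^ c_in;
  assign c_out = (a & b) | ((a ^ b) & c_in);
endmodule

module rca3_sketch (
  output wire [3:0] sum,   // 4 bits, so the addition cannot overflow
  input  wire [2:0] a, b
);
  wire [2:0] c_out;        // carry out of each full adder
  // The carry in of the first stage is tied to 1'b0.
  full_adder_sketch fa0 (.sum(sum[0]), .c_out(c_out[0]),
                         .a(a[0]), .b(b[0]), .c_in(1'b0));
  full_adder_sketch fa1 (.sum(sum[1]), .c_out(c_out[1]),
                         .a(a[1]), .b(b[1]), .c_in(c_out[0]));
  full_adder_sketch fa2 (.sum(sum[2]), .c_out(c_out[2]),
                         .a(a[2]), .b(b[2]), .c_in(c_out[1]));
  // The last carry out becomes the most significant sum bit.
  assign sum[3] = c_out[2];
endmodule
```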
The conditional operator in the RTL codes may be implemented using logic gates
shown in Figure 5.6 or a multiplexer gate in a cell library.
You can specify minimum, typical, and maximum values for the rising and falling
delays in continuous assignments, as can be seen in the following. In this example,
the (minimum : typical : maximum) delays for the rising delay are (1 : 2 : 3), the
(minimum : typical : maximum) delays for the falling delay are (2 : 3 : 4), and the
(minimum : typical : maximum) delays for the turn-off delay are (3 : 4 : 5).
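A continuous assignment with these three delay triplets could look like the following sketch. The module name, signal names, and the AND function are assumptions chosen only to make the example complete.

```verilog
// Sketch: min:typ:max delays in a continuous assignment.
// Names and the driven expression are hypothetical.
module delay_assign_sketch (
  output wire y,
  input  wire a, b
);
  // Order of the delays: rising, falling, turn-off.
  assign #(1:2:3, 2:3:4, 3:4:5) y = a & b;
endmodule
```

Such delays are used for simulation only. Which of the minimum, typical, or maximum values is applied is normally chosen through a simulator option, and synthesis tools generally ignore them.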
  sum1 = a + b;
  sum2 = c + d;
  sum = sum1 + sum2;
end
The calculation of sum is divided into sum1 and sum2. Therefore, sum1 and sum2
must be available before determining sum. This necessitates that the RTL codes
should be executed in order using the blocking assignment.
A single if-else statement also infers a multiplexer. For example,
For loops are unrolled by the synthesis tool and can build cascaded or parallel combinational logic.
Therefore, the index of a for loop, commonly declared with the integer data type, is only a dummy
variable and does not cost any hardware resources. For example, the (cascaded) 8-bit ripple-carry
adder is displayed below.
c = c_in;
In the preceding example, the variable c is used for all full adders, which is quite
confusing. Moreover, it is also misleading that, for each full adder, c is a carry in
(carry out of previous full adder) and carry out of current full adder. It’s better to
declare c[i − 1] and c[i] as carry in and carry out of full adder i, respectively, as shown
below. Such a description coincides with the physical connection of a multiple-bit
adder.
  c[-1] = c_in;
  for (i = 0; i <= 7; i = i + 1)
    {c[i], sum[i]} = a[i] + b[i] + c[i-1];
  c_out = c[7];
end
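For reference, the fragment above can be placed in a complete module as in the following sketch. The module name rca8_sketch, the port list, and the declaration reg [7 : -1] c are assumptions, since the original header lines are not available here.

```verilog
// Sketch: 8-bit ripple-carry adder described with a for loop.
// Module name and declarations are hypothetical.
module rca8_sketch (
  output reg  [7:0] sum,
  output reg        c_out,
  input  wire [7:0] a, b,
  input  wire       c_in
);
  reg [7:-1] c;   // c[i-1] is the carry in and c[i] the carry out of stage i
  integer i;      // loop index only; it infers no hardware
  always @(*) begin
    c[-1] = c_in;
    for (i = 0; i <= 7; i = i + 1)
      {c[i], sum[i]} = a[i] + b[i] + c[i-1];
    c_out = c[7];
  end
endmodule
```

The negative index lets the carry vector line up with the physical connection: stage i reads c[i-1] and drives c[i].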
Functions are synthesized to combinational blocks with one output. For example,
wire [3 : 0] a, b, c;
reg [1 : 0] out;
The synthesized circuit is presented in Figure 5.7. Therefore, the synthesis tool infers
one circuit of sum3 and uses a multiplexer to select different inputs.
Even so, it’s still better to explicitly infer the multiplexer used to select different
inputs, and one combinational circuit of sum3 as follows.
wire [3 : 0] a, b, c;
wire [1 : 0] out;
Another piece of RTL code with the same function is written below.
// Balanced design
always @ (a or b or c or d) begin
  if (a >= b) out1 = a;
  else out1 = b;
  if (c >= d) out2 = c;
  else out2 = d;
  if (out1 >= out2) out = out1;
  else out = out2;
end
Although the two pieces of code above achieve the same functionality, they infer
different structures, as shown in Figure 5.8.
Figure 5.8: Different structures for finding the maximum of 4 inputs: (a) sequential
comparison and (b) parallel comparison.
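Regarding the critical path of the two structures, the sequential design chains three
comparisons in series, whereas the balanced design needs only two levels of comparison.
A balanced design is therefore better for timing.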
Only the (minimum : typical : maximum) values of a single delay can be specified
in procedural assignments. Therefore, the rising, falling, and turn-off delays must be
the same. In the following example, if s is true, the (minimum : typical : maximum)
delay of assigning i1 to out is (1 : 2 : 3) time units; if s is false, the (minimum :
typical : maximum) delay of assigning i0 to out is (2 : 3 : 4) time units.
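A sketch of such a procedural description is given here. The module name and port list are assumptions, while the signal names s, i0, i1, and out follow the text.

```verilog
// Sketch: a single min:typ:max delay per procedural assignment.
// Module name and port list are hypothetical.
module mux2_delay_sketch (
  output reg  out,
  input  wire s, i0, i1
);
  always @(s or i0 or i1)
    if (s) #(1:2:3) out = i1;   // delay of (1 : 2 : 3) time units
    else   #(2:3:4) out = i0;   // delay of (2 : 3 : 4) time units
endmodule
```

As with continuous assignments, these delays only affect simulation.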
The delays of Verilog primitives can be modeled when they are instantiated.
// Combinational loop
always @ (a)
  a = a + 1;
The always block infers the following combinational circuit with a feedback loop, as
shown in Figure 5.9, where ⊕ represents an adder.
// No combinational loop
always @ (a)
  b = a + 1;
In Figure 5.10, a combinational loop does not exist, and b is the result of increment.
To avoid the combinational loop, you can also break the feedback path by using
the sequential circuits, which can store or memorize the previous result.
// No combinational loop
always @ (posedge clk)
  a <= a + 1;
if (sel)
  mux_out2 = mux_in[1];
else
  mux_out2 = mux_in[0];
Logic circuits that are naturally described by a table can be designed
using a case or casex statement to infer the multiplexer. The following example
describes an 8 × 16 table.
case (addr)
  3'd0 : out = tab[0];
  3'd1 : out = tab[1];
  3'd2 : out = tab[2];
  3'd3 : out = tab[3];
  3'd4 : out = tab[4];
  3'd5 : out = tab[5];
  3'd6 : out = tab[6];
  default : out = tab[7];
endcase
5.5.2 DEMULTIPLEXER
The demultiplexer (or demux) is the reverse of the multiplexer. A demultiplexer is a
component that takes a single input line and routes it to one of several output lines
according to the select lines, which are used to select which output line to send the
input. A demultiplexer is also called a data distributor. By setting the input to true,
the demux behaves as a decoder.
The following example shows a 1-to-4 demultiplexer using 2 select lines (sel[1 :
0]) to determine which one of the 4 outputs (demux_out[3 : 0]) is routed from the
input (demux_in). Its characteristics can be described using the truth table in Table
5.1.
The symbol and schematic of the demultiplexer are presented in Figure 5.12.
  demux_out2 = 2'b00;
  case (sel)
    1'b0 : demux_out2[0] = demux_in;
    1'b1 : demux_out2[1] = demux_in;
  endcase
end
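The 1-to-4 demultiplexer described in this section can be sketched in the same style as the fragment above. The module name is an assumption, and the signal names demux_in, demux_out, and sel follow the text.

```verilog
// Sketch: 1-to-4 demultiplexer. Module name is hypothetical.
module demux_1_4_sketch (
  output reg  [3:0] demux_out,
  input  wire       demux_in,
  input  wire [1:0] sel
);
  always @(*) begin
    demux_out = 4'b0000;   // default keeps unselected outputs low
    case (sel)
      2'b00 : demux_out[0] = demux_in;
      2'b01 : demux_out[1] = demux_in;
      2'b10 : demux_out[2] = demux_in;
      default : demux_out[3] = demux_in;   // 2'b11
    endcase
  end
endmodule
```

Assigning the default value before the case statement avoids inferring latches, and tying demux_in to 1'b1 turns the circuit into a 2-to-4 decoder.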
5.5.3 COMPARATOR
The truth table for the comparison of two 1-bit binary numbers is displayed in Table
5.2. The comparator determines whether a is greater than, less than, equal to, or
not equal to b using outputs y1, y2, y3, and y4, respectively. Notably,
$y_4 = a \oplus b = a \cdot \bar{b} + \bar{a} \cdot b$ can be used to decide whether 2 bits are different,
where $\oplus$, $\cdot$, $\overline{(\cdot)}$, and $+$ denote the bitwise XOR, AND, NOT, and OR operators, respectively.
Similarly, $y_3 = \overline{a \oplus b} = \bar{a} \cdot \bar{b} + a \cdot b$ can be used to decide whether 2 bits are the same.
In this book, the ⊕ symbol in a Boolean equation represents the bitwise XOR; otherwise,
it represents an adder.
not (a_inv, a);
and (y2, a_inv, b);
xnor (y3, a, b);
xor (y4, a, b);
assign y3_1 = a == b;
assign y4_1 = a != b;
  if (a > b) y1_2 = 1'b1;
  else y1_2 = 1'b0;
  if (a < b) y2_2 = 1'b1;
  else y2_2 = 1'b0;
  if (a == b) y3_2 = 1'b1;
  else y3_2 = 1'b0;
  if (a != b) y4_2 = 1'b1;
  else y4_2 = 1'b0;
end
The left rotator and right rotator rotate the input circularly. The left rotator by 1
bit simply connects a[0] with b[2], a[1] with b[0], and a[2] with b[1], as shown below.
If we write the Verilog codes below, where b is a variable, we get the barrel shifter,
which has one layer of multiplexers for each bit of the shift amount.
Consequently, an 8-bit barrel shifter needs 3 bits to indicate how many bits to shift,
and hence 3 layers of multiplexers, as shown in Figure 5.14.
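The three multiplexer layers can be made explicit as in the following sketch of an 8-bit left rotator. The module name and the intermediate signals s0 and s1 are assumptions, and a rotating variant is chosen here for illustration; the structure in Figure 5.14 may differ in detail.

```verilog
// Sketch: 8-bit barrel left rotator with one mux layer per shift bit.
// Module and signal names are hypothetical.
module barrel_rotl8_sketch (
  output wire [7:0] y,
  input  wire [7:0] a,
  input  wire [2:0] b    // rotate amount, 0 to 7
);
  // Layer 0: rotate by 1 if b[0] is set.
  wire [7:0] s0 = b[0] ? {a[6:0], a[7]} : a;
  // Layer 1: rotate by 2 if b[1] is set.
  wire [7:0] s1 = b[1] ? {s0[5:0], s0[7:6]} : s0;
  // Layer 2: rotate by 4 if b[2] is set.
  assign y = b[2] ? {s1[3:0], s1[7:4]} : s1;
endmodule
```

Each layer rotates by a power of two, so any amount from 0 to 7 is obtained by enabling the layers that correspond to the set bits of b.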
5.5.5 ENCODER
The truth table for the 4-to-2 encoder is displayed in Table 5.3. When {a, b, c, d} =
0001, the output is encoded as y = 00; when {a, b, c, d} = 0010, the output is encoded
as y = 01; and so on.
The structural, dataflow, and behavioral descriptions of the 4-to-2 encoder are
given below.
not (c_inv, c);
not (d_inv, d);
case (all)
  4'b0001 : y2 = 2'b00;
  4'b0010 : y2 = 2'b01;
  4'b0100 : y2 = 2'b10;
  4'b1000 : y2 = 2'b11;
  default : y2 = 2'b00;
endcase
The structural, dataflow, and behavioral descriptions of the 4-to-2 priority encoder
are given below.
or (y[0], d, and_out);
or (y[1], c, d);
We present another behavioral description of the priority encoder using the if-else-if
statement below. As presented, the encoder gives priority to d. It is also apparent
that the output y3 is not related to a.
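An if-else-if description along these lines might look like the following sketch. The module name is an assumption, and the output coding is inferred from the gate-level fragment above (y[1] = c | d, with d forcing y[0] to 1) under the further assumption that and_out = b & c_inv; it should be checked against the book's truth table.

```verilog
// Sketch: 4-to-2 priority encoder with d as the highest priority.
// Module name and output coding are assumptions.
module pri_enc_4_2_sketch (
  output reg  [1:0] y3,
  input  wire       a, b, c, d
);
  always @(*)
    if (d)      y3 = 2'b11;   // d wins regardless of a, b, c
    else if (c) y3 = 2'b10;
    else if (b) y3 = 2'b01;
    else        y3 = 2'b00;   // a is never tested
endmodule
```

Because the final else branch covers both a = 1 and the all-zero input, a does not appear in the logic.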
5.5.7 DECODER
The truth table of 3-to-8 decoder with enable control is displayed in Table 5.5.
Figure 5.17: Gate-level netlist of the 3-to-8 decoder with enable control.
input e;
input [2 : 0] x;
wire [7 : 0] y;
endmodule
input e;
input [2 : 0] x;
endmodule
The behavioral description of the decoder is shown below, which directly realizes
the logic of truth table.
input e;
input [2 : 0] x;
reg [7 : 0] y2;
always @ (e or x)
  if (!e) y2 = 8'h00;
  else
    case (x)
      3'b000 : y2 = 8'h01;
      3'b001 : y2 = 8'h02;
      3'b010 : y2 = 8'h04;
      3'b011 : y2 = 8'h08;
      3'b100 : y2 = 8'h10;
      3'b101 : y2 = 8'h20;
      3'b110 : y2 = 8'h40;
      default : y2 = 8'h80; // 3'b111
    endcase
endmodule
The hierarchical design of the 3-to-8 decoder based on the 2-to-4 decoder is shown
in Figure 5.18.
output [7 : 0] y;
input e;
input [2 : 0] x;
wire e1, g1, g2;
not u0 (e1, x[2]);
and u1 (g1, e, x[2]);
and u2 (g2, e, e1);
// Instance names must be unique within a module.
decoder_2_4 u3 (y[7 : 4], g1, x[1 : 0]);
decoder_2_4 u4 (y[3 : 0], g2, x[1 : 0]);
endmodule
input e;
input [1 : 0] x;
input [7 : 0] a, b, c, d;
reg [7 : 0] buffer;
integer i, j;
begin
  temp[3] = a;
  temp[2] = b;
  temp[1] = c;
  temp[0] = d;
  for (i = 2; i >= 0; i = i - 1)
    for (j = 0; j <= i; j = j + 1)
      if (temp[j] > temp[j + 1]) begin
        // Swapping is needed.
        buffer = temp[j + 1];
        temp[j + 1] = temp[j];
        temp[j] = buffer;
      end
  sort = {temp[3], temp[2], temp[1], temp[0]};
end
endfunction
The basic unit used to swap 2 numbers, say x and y, is the statements in the begin-
end pair of the if statement, and its schematic is displayed below,
The overall architecture of bubble sorting circuit is presented below, and the in-
dices, i/ j, of the first/second for loops are displayed. As shown, the critical path of the
bubble sorting consists of 5 swapping units, which grows linearly with the number
of sorting numbers.
\[ sum = a \oplus b, \qquad c\_out = a \cdot b \tag{5.1} \]
The functionality of the half adder can be implemented using the structural
description.
input a, b;
wire c_out_bar;
xor u0 (sum, a, b);
nand u1 (c_out_bar, a, b);
The functionality of the half adder can also be implemented using the dataflow
description.
input a, b;
endmodule

input a, b;
always @ (a or b) begin
  sum = a ^ b;
  c_out = a & b;
end
endmodule

input a, b;
always @ (a or b) begin
  case ({a, b})
    2'b00 : begin
      sum = 0; c_out = 0;
    end
    2'b01 : begin
      sum = 1; c_out = 0;
    end
    2'b10 : begin
      sum = 1; c_out = 0;
    end
    default : begin // 2'b11
      sum = 0; c_out = 1;
    end
  endcase
end
endmodule
\[ sum = (a \oplus b) \oplus c\_in, \qquad c\_out = a \cdot b + (a \oplus b) \cdot c\_in \tag{5.2} \]
The functionality of the full adder can be implemented using the dataflow descrip-
tion.
input a, b, c_in;
  sum = (a ^ b) ^ c_in;
  c_out = (a & b) | ((a ^ b) & c_in);
end */
endmodule
Or, you can use the module add_half to build up the bottom-up design, as shown in
Figure 5.22.
input a, b, c_in;
wire w1, w2, w3;
endmodule
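A bottom-up full adder built from two half adders and an OR gate can be sketched as follows. The module names carry a _sketch suffix to mark them as assumptions, and the roles given to w1, w2, and w3 are one plausible reading of the fragment above rather than the book's exact netlist.

```verilog
// Sketch: full adder composed of two half adders.
// Module names and wire roles are hypothetical.
module add_half_sketch (
  output wire sum, c_out,
  input  wire a, b
);
  assign sum   = a ^ b;
  assign c_out = a & b;
endmodule

module add_full_sketch (
  output wire sum, c_out,
  input  wire a, b, c_in
);
  wire w1, w2, w3;
  // First half adder: w1 = a ^ b, w2 = a & b.
  add_half_sketch ha0 (.sum(w1), .c_out(w2), .a(a), .b(b));
  // Second half adder: sum = w1 ^ c_in, w3 = w1 & c_in.
  add_half_sketch ha1 (.sum(sum), .c_out(w3), .a(w1), .b(c_in));
  // A carry is produced by either half adder.
  assign c_out = w2 | w3;
endmodule
```

The result matches Equation (5.2): sum = (a ⊕ b) ⊕ c_in and c_out = a · b + (a ⊕ b) · c_in.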
output c_out;
input [3 : 0] a, b;
input c_in;
endmodule
Comparing the 4-bit adder to the previous (1-bit) full adder designed using the
dataflow description, their RTL codes are almost the same except for the bit width
declaration. Hence, RTL codes designed for a one-bit scalar can be simply
extended to a multi-bit vector. The synthesized circuit depends on the tool you
use (it might be a ripple-carry adder, a carry-lookahead adder, or another adder).
Of course, you can build up the bottom-up design of the 4-bit adder, as shown in
Figure 5.24.
output c_out;
input [3 : 0] a, b;
input c_in;
wire [2 : 0] c_out;
endmodule
\[ x + y = 2^n - 1. \tag{5.3} \]
For example, the 1's complement number y of the 4-bit binary number x = 0001₂ is
y = 1110₂ because 0001₂ (1₁₀) + 1110₂ (14₁₀) = 2^4 − 1 = 15 (1111₂ = 15₁₀). Alternatively,
you can obtain the 1's complement number of x by “inverting every bit of
x”. For example, after inverting every bit of x, you get 1110₂. The one's-complement
number system encodes the 1's complement of x as its negative number in a binary number
representation. Therefore, in a one's-complement number system, the 1's complement
number y = 1110₂ of x = 0001₂ is treated as −1₁₀ instead of +14₁₀.
There are three popular signed number representations for a 4-bit integer,
{x3 x2 x1 x0 }, as shown in Table 5.8, within which x3 is the sign bit. There are two
zeros for the signed-magnitude and 1’s complement representations.
Addition of positive numbers is the same for three representations, but different
when operands have opposite signs. Under the condition that there is no overflow,
the differences among them are listed below.
If n bits are used to represent signed numbers, then the result must be in the range
−2^(n−1) to 2^(n−1) − 1. Otherwise, overflow occurs. Hence, overflow happens when the
result cannot be represented because the value is too large or too small. Intuitively,
overflow occurs only when inputs x and y have the same sign. One way to detect
overflow is to check the sign bit of the sum. If the sign bit of the sum does not match
the sign bits of x and y, then there's overflow. More specifically, its Boolean equation can be
written as
The following table illustrates the sum of two 3-bit signed numbers for detecting the
overflow using (5.4). For simplicity, the addend, y2 y1 y0 = 001 (+1), is fixed. The
results of other addends can be shown similarly and are omitted here.
Table 5.9: Examples for the overflow detection of sum of 2 3-bit signed numbers
using Equation (5.4).
x2x1x0      y2y1y0      c2    s2s1s0      Overflow
011 (+3)    001 (+1)    0     100 (−4)    1
010 (+2)    001 (+1)    0     011 (+3)    0
001 (+1)    001 (+1)    0     010 (+2)    0
000 (+0)    001 (+1)    0     001 (+1)    0
100 (−4)    001 (+1)    0     101 (−3)    0
101 (−3)    001 (+1)    0     110 (−2)    0
110 (−2)    001 (+1)    0     111 (−1)    0
111 (−1)    001 (+1)    1     000 (+0)    0
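The sign-comparison rule described above can be sketched in Verilog as follows. The Boolean form used here is the common textbook expression of that rule and is only assumed to correspond to Equation (5.4); the module name and parameter N are illustrative.

```verilog
// Sketch: overflow detection by comparing sign bits.
// Module name and the exact Boolean form are assumptions.
module ovf_sign_sketch #(parameter N = 3) (
  output wire [N-1:0] s,
  output wire         overflow,
  input  wire [N-1:0] x, y
);
  assign s = x + y;   // N-bit 2's complement sum, carry out dropped
  // Overflow: x and y share a sign that differs from the sign of s.
  assign overflow = ( x[N-1] &  y[N-1] & ~s[N-1]) |
                    (~x[N-1] & ~y[N-1] &  s[N-1]);
endmodule
```

For x = 011 and y = 001, the sum is 100, both operand sign bits are 0 while the sum sign bit is 1, and overflow is asserted, as in the first row of Table 5.9.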
Figure 5.28: Another way to detect the overflow of 2’s complement for addition and
subtraction of 3-bit signed integer.
In the examples, when the carry outs of the 3rd and 4th full adders differ, overflow
occurs. That is, $\text{overflow} = c_3 \cdot \bar{c}_2 + \bar{c}_3 \cdot c_2 = c_3 \oplus c_2$. Generally, for n-bit numbers,
\[ \text{overflow} = c_{n-1} \oplus c_{n-2}. \tag{5.5} \]
How to interpret this result? There are two distinct cases, as displayed below. Let's
look at the left-most full adder, FA_{n−1}, as displayed in Figure 5.29. Case 1: 0 carried
in, and 1 carried out. If a 0 is carried in, then the only way that 1 can be carried out
is if x_{n−1} = 1 and y_{n−1} = 1. That way, the sum s_{n−1} is 0, and the carry out is 1. This
is the case when you add two negative numbers, but the result is positive. Case 2: 1
carried in, and 0 carried out. The only way 0 can be carried out if there's a 1 carried
in is if x_{n−1} = 0 and y_{n−1} = 0. In that case, 0 is carried out, and the sum s_{n−1} is 1.
This is the case when you add two positive numbers and get a negative result.
assign c = a + b;
wire [7 : 0] d;
wire [3 : 0] e;
wire [7 : 0] f;
assign f = d + e;
If you do not want to perform the sign extension manually, you can use the signed
declaration introduced in Verilog-2001. Conveniently, synthesis tools can now synthesize
and optimize signed number arithmetic automatically.
assign f = d + e;
wire signed [7 : 0] d;
wire signed [3 : 0] e;
wire signed [7 : 0] f;
c = $signed(a) + $signed(b);
f = $signed(d) + $signed(e);
Input and output ports can also be declared as “signed” ports as follows.
assign y = a + b_align;
endmodule
In conclusion, it is the most convenient to infer a signed number adder and sub-
tractor in hardware using the signed number declaration. Just declare the original
number of bits in operands and the required number of bits in results. Tools will take
care of the sign extension. You do not have to manually sign extend the operands nor
do you need to implement them in gate level. That is, we only describe the behavior
of adder/subtractor in RTL. Details are left to synthesis. However, the binary point
of fixed-point numbers should still be aligned.
wire signed [9 : 0] c;
wire [9 : 0] c;
assign y = a * b;
endmodule
wire [3 : 0] z;
assign z = x + y;
Similarly, if a 3-bit signed number is added to a 3-bit signed number, the result should have
4 bits. The RTL codes are written as follows. Because −4 ≤ x ≤ 3 and −4 ≤ y ≤ 3,
this yields −8 ≤ z ≤ 6, which can be represented by a 4-bit signed number.
wire signed [3 : 0] z;
assign z = x + y;
If a 3-bit unsigned number multiplies a 3-bit unsigned number, the result should have
6 bits. The RTL codes are written as follows. Because 0 ≤ x ≤ 7 and 0 ≤ y ≤ 7, this
yields 0 ≤ z ≤ 49, which can be represented by a 6-bit unsigned number.
wire [5 : 0] z;
assign z = x * y;
If a 3-bit signed number multiplies a 3-bit signed number, the result should have 6
bits. The RTL codes are written as follows. Because −4 ≤ x ≤ 3 and −4 ≤ y ≤ 3,
this yields −12 ≤ z ≤ 16, which can be represented by a 6-bit signed number.
wire signed [5 : 0] z;
assign z = x * y;
From above, if a 3-bit signed number multiplies a 3-bit signed number, the result x
should have 6 bits. If the results are then accumulated 8 times, the accumulator output
y needs 6 + 3 = 9 bits (rather than 6 + 7 = 13 bits). This can be explained in Figure
5.31.
Figure 5.31: Bit width for the accumulation of 8 multiplication results produced by
3-bit signed number multiplication.
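The bit widths discussed above can be checked with a small combinational sketch that sums eight 3-bit signed products. The module name, the packed input buses a_bus and b_bus, and the loop structure are all assumptions made to keep the example self-contained.

```verilog
// Sketch: sum of eight 3-bit x 3-bit signed products in 9 bits.
// Module name and packed-bus interface are hypothetical.
module mac8_sketch (
  output reg  signed [8:0]  y,            // 6 + log2(8) = 9 bits
  input  wire        [23:0] a_bus, b_bus  // eight 3-bit signed values each
);
  integer i;
  reg signed [2:0] a, b;
  reg signed [5:0] p;                     // one 6-bit signed product
  always @(*) begin
    y = 9'sd0;
    for (i = 0; i < 8; i = i + 1) begin
      a = a_bus[3*i +: 3];                // reinterpret the slice as signed
      b = b_bus[3*i +: 3];
      p = a * b;                          // range −12 to 16
      y = y + p;                          // p is sign extended to 9 bits
    end
  end
endmodule
```

The largest possible magnitude is 8 × 16 = 128, which fits in the 9-bit signed range of −256 to 255, so three extra bits are sufficient.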
bers, a + b gives 001₂ (7₁₀ + 2₁₀ = 1₁₀ (mod 8₁₀)). Even so, 2's complement arithmetic is
graceful, i.e., the final result can be correct even if a temporary overflow happens, provided that
the final result is guaranteed to be free of overflow. However, if the final result has an
overflow, a large error may exist. In the above example, the result changes from
9₁₀ to 1₁₀.
Therefore, it is typically desirable to have a saturation adder, which on an overflow
condition produces a result clamped between −2^(n−1) and 2^(n−1) − 1 for n-bit signed
numbers rather than a modular result. To eliminate such a large error of
arithmetic operations, the saturation arithmetic, within which all operations such as
addition and multiplication are limited to a fixed range between a minimum and a maximum
value, is usually adopted. The saturation arithmetic necessitates the overflow
detection.
Overflow detection using carry outs in (5.5) is simple and requires only an XOR
gate. However, carry outs are not readily available using the behavioral description
of the arithmetic operator, i.e., “+” in Verilog. Therefore, rather than detecting the
difference between carry outs in (5.5), we can detect the difference between sum bits
for overflow as follows,
\[ \text{overflow} = s_n \oplus s_{n-1}, \tag{5.6} \]
where s_n and s_{n−1} are the sign bits of the sum of the sign-extended addition before and
after dropping one bit, respectively, and ⊕ is the bitwise XOR. Intuitively, the
detector outputs an overflow event when s_n is different from s_{n−1}, i.e., the sign bits are
different.
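A signed saturation adder based on comparing the two sign bits of the sign-extended sum can be sketched as follows. The module name, the parameter N, and the signal names of_pos and of_neg are assumptions chosen for illustration.

```verilog
// Sketch: n-bit signed saturation adder using the two top sum bits.
// Module, parameter, and signal names are hypothetical.
module sat_add_s_sketch #(parameter N = 8) (
  output wire signed [N-1:0] sum_q,
  input  wire signed [N-1:0] x, y
);
  // The (N+1)-bit signed context sign extends x and y automatically.
  wire signed [N:0] sum = x + y;
  // The sign bits before and after dropping one bit differ on overflow.
  wire of_pos = ~sum[N] &  sum[N-1];   // true result is too large
  wire of_neg =  sum[N] & ~sum[N-1];   // true result is too small
  assign sum_q = of_pos ? {1'b0, {(N-1){1'b1}}} :   // 2^(N-1) - 1
                 of_neg ? {1'b1, {(N-1){1'b0}}} :   // -2^(N-1)
                          sum[N-1:0];
endmodule
```

Here of_pos | of_neg equals sum[N] ^ sum[N-1], so the detector needs no access to the internal carries of the adder.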
It can be proved that the detectors in (5.5) and (5.6) are equivalent. Due to the sign
extension for obtaining the sum s = x + y, where x and y are n-bit signed numbers and
s is an (n + 1)-bit signed number, we have x_n = x_{n−1} and y_n = y_{n−1}. Therefore,
\[ s_n = x_n \oplus y_n \oplus c_{n-1} = x_{n-1} \oplus y_{n-1} \oplus c_{n-1} \tag{5.7} \]
and
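The remaining step of this proof is not available in the present text. A plausible completion, assuming the standard full-adder sum logic for bit n − 1, is sketched here; it is a reconstruction and not the book's own numbered equation.

```latex
\begin{align*}
s_{n-1} &= x_{n-1} \oplus y_{n-1} \oplus c_{n-2}, \\
s_n \oplus s_{n-1}
  &= (x_{n-1} \oplus y_{n-1} \oplus c_{n-1}) \oplus (x_{n-1} \oplus y_{n-1} \oplus c_{n-2}) \\
  &= c_{n-1} \oplus c_{n-2}.
\end{align*}
```

The terms x_{n−1} and y_{n−1} cancel under XOR, which leaves exactly the carry-based detector in (5.5).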
The following table illustrates the sum of two 3-bit signed numbers for detecting the
overflow using (5.6), which is the same as that using (5.4) in Table 5.9. For simplicity,
the addend, y2 y1 y0 = 001 (+1), is fixed. The results of other addends can be
shown similarly and are omitted here.
Table 5.10: Examples for the overflow detection of sum of 2 3-bit signed numbers
using Equation (5.6).
x2x1x0      y2y1y0      s3    s2s1s0      Overflow
011 (+3)    001 (+1)    0     100 (−4)    1
010 (+2)    001 (+1)    0     011 (+3)    0
001 (+1)    001 (+1)    0     010 (+2)    0
000 (+0)    001 (+1)    0     001 (+1)    0
100 (−4)    001 (+1)    1     101 (−3)    0
101 (−3)    001 (+1)    1     110 (−2)    0
110 (−2)    001 (+1)    1     111 (−1)    0
111 (−1)    001 (+1)    0     000 (+0)    0
The overflow detection in (5.6) can be generalized for dropping more than one
bit in other arithmetic operators, e.g., multiplication. This becomes a quantization
problem. That is, suppose the original result s of an arithmetic operation without overflow
has m bits and is to be truncated by p bits, p ≥ 1, so that s_{m−1}, s_{m−2}, ..., s_{m−p}
are dropped, and s_{m−p−1} is the new sign bit of the truncated result, as shown below,
The overflow may then happen, and it can be detected by the Boolean equation as
To get rid of the overflow, the dropped bits must be unnecessary sign bits such that
s_{m−1}, s_{m−2}, ..., and s_{m−p−1} are the same. In other words, if the sign bit s_{m−1} of the
original result does not agree with the dropped sign bits, s_{m−2}, ..., s_{m−p}, and the
remaining sign bit s_{m−p−1}, the overflow occurs.
We often need to separately detect the “positive” and “negative” overflows for
the saturation arithmetic. A positive (negative) overflow occurs when the original
result s is larger (smaller) than the maximum positive (minimum negative) number
that the quantized result can represent. The detection of positive overflow can thus
be expressed by the Boolean equation as
That is, when s is positive (s_{m−1} = 0), but any of the bits s_{m−2}, s_{m−3}, ..., or s_{m−p−1} is
logic 1, positive overflow occurs. The detection of negative overflow can
similarly be expressed by the Boolean equation as
That is, when s is negative (s_{m−1} = 1), but any of the bits s_{m−2}, s_{m−3}, ..., or s_{m−p−1} is
logic 0, negative overflow occurs.
The following table illustrates the overflow detection for truncating a 5-bit signed
number s to a 3-bit signed number z. Those truncated bits are {s4, s3, s2}. The detection
of positive and negative overflow can be written by the Boolean equations as
and
respectively.
Table 5.11: Overflow detection for truncating a 5-bit signed number s to a 3-bit signed
number z.
s4 s3 s2 s1 s0 z2 z1 z0 Overflow
01111 (+15) 111 (−1) 1 (positive)
01110 (+14) 110 (−2) 1 (positive)
01101 (+13) 101 (−3) 1 (positive)
01100 (+12) 100 (−4) 1 (positive)
01011 (+11) 011 (+3) 1 (positive)
01010 (+10) 010 (+2) 1 (positive)
01001 (+9) 001 (+1) 1 (positive)
01000 (+8) 000 (+0) 1 (positive)
00111 (+7) 111 (−1) 1 (positive)
00110 (+6) 110 (−2) 1 (positive)
00101 (+5) 101 (−3) 1 (positive)
00100 (+4) 100 (−4) 1 (positive)
00011 (+3) 011 (+3) 0
00010 (+2) 010 (+2) 0
00001 (+1) 001 (+1) 0
00000 (+0) 000 (+0) 0
10000 (−16) 000 (+0) 1 (negative)
10001 (−15) 001 (+1) 1 (negative)
10010 (−14) 010 (+2) 1 (negative)
10011 (−13) 011 (+3) 1 (negative)
10100 (−12) 100 (−4) 1 (negative)
10101 (−11) 101 (−3) 1 (negative)
10110 (−10) 110 (−2) 1 (negative)
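The detection rules described before the table can be sketched for this 5-bit to 3-bit case as follows. The module name, the output z_sat, and the signal names of_pos and of_neg are assumptions. Unlike the z column of Table 5.11, which shows the plainly truncated bits, z_sat is clamped on overflow.

```verilog
// Sketch: truncate a 5-bit signed number to 3 bits with saturation.
// Module and signal names are hypothetical.
module trunc_sat_5_3_sketch (
  output wire [2:0] z_sat,
  output wire       of_pos, of_neg,
  input  wire [4:0] s
);
  // Positive overflow: s is positive but s3 or s2 is logic 1.
  assign of_pos = ~s[4] & ( s[3] |  s[2]);
  // Negative overflow: s is negative but s3 or s2 is logic 0.
  assign of_neg =  s[4] & (~s[3] | ~s[2]);
  assign z_sat = of_pos ? 3'b011 :   // clamp to +3
                 of_neg ? 3'b100 :   // clamp to -4
                          s[2:0];
endmodule
```

For s = 00100 (+4), of_pos is asserted and z_sat is 011 (+3), whereas plain truncation would give 100 (−4) as listed in the table.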
The saturation adder of n-bit unsigned numbers is much simpler than that of
signed numbers: it only needs to inspect the carry out of the last, (n − 1)-th, full adder, which
is s_n and is available using the arithmetic operator “+” in Verilog. You can use an n-bit
adder and an n-bit multiplexer to generate the final result, sum_q, as shown below.
assign sum = x + y;
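A complete unsigned saturation adder built around this assignment might look like the following sketch. The module name and the parameter N are assumptions, while sum and sum_q follow the names used in the text.

```verilog
// Sketch: n-bit unsigned saturation adder.
// Module name and parameter are hypothetical.
module sat_add_u_sketch #(parameter N = 8) (
  output wire [N-1:0] sum_q,
  input  wire [N-1:0] x, y
);
  wire [N:0] sum;        // one extra bit holds the last carry out
  assign sum = x + y;    // the (N+1)-bit target keeps the carry in sum[N]
  // The multiplexer selects the all-ones maximum on overflow.
  assign sum_q = sum[N] ? {N{1'b1}} : sum[N-1:0];
endmodule
```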
That is, any truncated bits that have logic 1 lead to the overflow.
assign sum = x + y;
The meaning of the declaration a[(p − 1) : −f] for both unsigned and signed numbers
is {a_{p−1}, a_{p−2}, ..., a_0 . a_{−1}, ..., a_{−f+1}, a_{−f}}, in which the p bits a_{p−1}, ..., a_0 are
the integer bits, the f bits a_{−1}, ..., a_{−f} are the fraction bits, and “.” denotes the
binary point. However, a_{p−1} is the sign bit of a signed number.
If we want to design the bit width of the digital signal processing system in
Figure 5.33, the fixed-point number representation and its declaration are labeled
in the block diagram. Notice that, to save one bit of the product, the product c is
represented by s(1.30). The block Q quantizes the number using the rounding. The
block S clamps or saturates the result on an overflow to minimize the error. The block
R denotes the register.
Addition of fixed-point numbers must align their binary points. Therefore, 15
zeros are padded to the LSBs of g[0 : −15] (in s(1.15) format) to convert it to
s(1.30) format. Besides, manual sign extension is needed if the number of bits of
a sliced operand, which will be treated as an unsigned number, is not enough for an
arithmetic operation. Moreover, the most positive value of f in s(1.15) format is
0111_1111_1111_1111₂ or 16'h7fff, and the most negative value of f in s(1.15)
format is 1000_0000_0000_0000₂ or 16'h8000. According to the fixed-point design
in Figure 5.33, the RTL model is described below.
assign d = c + {g, 15'b0};
// Saturation operation
assign f = of_pos ? 16'h7fff :
g <= f;
input [3 : 0] sel;
input [7 : 0] a, b;
reg [7 : 0] y;
always @ (sel or a or b)
  case (sel)
    4'b0000 : y = a;
    4'b0001 : y = a + 1;
    4'b0010 : y = a + b;
    4'b0011 : y = b + 1;
    4'b0100 : y = a + ~b;
    4'b0101 : y = a + ~b + 1;
    4'b0110 : y = a - 1;
    4'b0111 : y = b;
    4'b1000 : y = a & b;
    4'b1001 : y = a | b;
    4'b1010 : y = a ^ b;
    4'b1011 : y = ~a;
    4'b1100 : y = ~b;
    4'b1101 : y = a >> 1;
    4'b1110 : y = a << 1;
    4'b1111 : y = 0;
    default : y = 8'bX;
  endcase
endmodule
As shown, the carry generate $g_i$ produces a carry $c_i$ of 1 when both $a_i$ and $b_i$ are 1, and the carry propagate $p_i$ determines whether a carry into stage $i$, i.e., $c_{i-1}$, will propagate into stage $(i+1)$ through $c_i$.
The carry out $c_i$ in Equation 5.17 is an iterative equation. We now write the carry output $c_i$ of each stage, $i = 0, 1, \ldots$, and substitute $c_{i-1}$ from the previous stage until reaching the input carry $c_{-1}$:
$$
\begin{aligned}
c_0 &= g_0 + p_0 \cdot c_{-1},\\
c_1 &= g_1 + p_1 \cdot c_0 = g_1 + p_1 \cdot g_0 + p_1 \cdot p_0 \cdot c_{-1},\\
c_2 &= g_2 + p_2 \cdot c_1 = g_2 + p_2 \cdot g_1 + p_2 \cdot p_1 \cdot g_0 + p_2 \cdot p_1 \cdot p_0 \cdot c_{-1},\\
&\;\;\vdots
\end{aligned}
$$
The process can continue until all carries have been expressed in terms of $g_i$, $p_i$, and $c_{-1}$. As presented, $c_i$ now depends on $g_j$ and $p_j$, $j = i, i-1, \ldots, 0$, and $c_{-1}$, and no longer depends on $c_{i-1}$. The carry of the CLA need not propagate like that of the carry-ripple adder. Therefore, the CLA is faster than the traditional carry-ripple adder. However, expanding the iterative equation in Equation 5.17 means that the common term, $c_{i-1}$, cannot be shared by $c_i$ and the subsequent carry outs. Hence, the circuit area of the CLA is larger than that of the carry-ripple adder.
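Similarly, the output sum $s_i$ can also be expanded and expressed using $g_i$, $p_i$, and $c_{-1}$ as follows.
$$
\begin{aligned}
s_0 &= p_0 \oplus c_{-1},\\
s_1 &= p_1 \oplus c_0 = p_1 \oplus (g_0 + p_0 \cdot c_{-1}),\\
s_2 &= p_2 \oplus c_1 = p_2 \oplus (g_1 + p_1 \cdot g_0 + p_1 \cdot p_0 \cdot c_{-1}),\\
&\;\;\vdots
\end{aligned}
$$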
The schematic of the CLA is presented in Figure 5.34. The CLA can add in less time than the carry-ripple adder because $c_3$ does not have to wait for $c_2$ and $c_1$ to propagate. Compared to the carry-ripple adder in Figure 5.5 and the full adder in Figure 5.4, the gain in speed of operation is achieved at the expense of additional hardware complexity.
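The expanded carry and sum equations above can be sketched in Verilog as a 4-bit CLA. This is only an illustrative sketch: the module name cla4 and its port names are hypothetical, c_in plays the role of $c_{-1}$, and it assumes $g_i = a_i \cdot b_i$ and $p_i = a_i \oplus b_i$ (the latter is implied by the sum equation $s_i = p_i \oplus c_{i-1}$).

```verilog
// 4-bit carry-lookahead adder (illustrative sketch, names are hypothetical).
module cla4 (
  input  wire [3:0] a,
  input  wire [3:0] b,
  input  wire       c_in,   // corresponds to c_{-1}
  output wire [3:0] s,
  output wire       c_out
);
  wire [3:0] g = a & b;     // carry generate:  g_i = a_i . b_i (assumed)
  wire [3:0] p = a ^ b;     // carry propagate: p_i = a_i xor b_i (assumed)
  wire [3:0] c;             // c[i] is the carry out of stage i

  // Expanded carries: each one depends only on g, p, and c_in.
  assign c[0] = g[0] | (p[0] & c_in);
  assign c[1] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in);
  assign c[2] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
              | (p[2] & p[1] & p[0] & c_in);
  assign c[3] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
              | (p[3] & p[2] & p[1] & g[0])
              | (p[3] & p[2] & p[1] & p[0] & c_in);

  // s_i = p_i xor c_{i-1}
  assign s     = p ^ {c[2:0], c_in};
  assign c_out = c[3];
endmodule
```

Every carry is a two-level AND-OR function of the inputs, which is where both the speed gain and the extra area come from.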
The block S clamps or saturates the result on an overflow to minimize the error. Note
that it is also possible to build this circuit using three multipliers for obtaining ad, bc,
and (a − b)(c+ d), but we focus on the intuitive implementation with four multipliers
here.
Our complex multiplier uses a signed s(1.15) fixed-point format, which is common in many signal processing tasks. To avoid incurring an overflow error or losing precision until after the final summation, we keep intermediate values at full bit width. For a signed s(p.f) fixed-point number $a$, its dynamic range is $-2^{p-1} \le a \le 2^{p-1} - r$, where $r = 2^{-f}$. For example, the dynamic range of a signed s(1.15) fixed-point number is $-1 \le a \le 1 - 2^{-15}$. Multiplying two s(1.15) numbers, $a$ and $b$, gives a dynamic range of $-(1 - 2^{-15}) \le ab \le 1$, which requires a signed s(2.30) fixed-point number $c$ with a dynamic range of $-2 \le c \le 2 - 2^{-30}$. However, if the maximum value 1 of $ab$, which is obtained only when $a = -1$ and $b = -1$, can be ignored, a signed s(1.30) fixed-point number $d$ suffices, with a dynamic range of $-1 \le d \le 1 - 2^{-30}$. Doing so saves one bit in the product of two signed fixed-point numbers.
Then, adding two of these products in signed s(1.30) format gives an s(2.30) result. We then quantize the s(2.30) results, i.e., the real part s_rnd_r and the imaginary part s_rnd_i, by rounding to s(3.15). The final stage, i.e., the limiter, checks for overflow and saturates the s(3.15) result back to an s(1.15) number.
A positive overflow of s_rnd_r has occurred when the sign bit s_rnd_r[17] is 0 (positive) but does not agree with s_rnd_r[16] or s_rnd_r[15], i.e., at least one of them is 1. For example, if a 4-bit signed number, a, in Table 5.8 is to be truncated to a 3-bit signed number, the 4-bit numbers 4’b0100, 4’b0101, 4’b0110, and 4’b0111 are larger than the maximum positive value, 3’b011, that a 3-bit signed number can represent. In these cases, a positive overflow occurs, which can be detected using a[3] and a[2] by $\overline{a[3]} \cdot a[2]$. Generally, if an m-bit signed number a is to be truncated by p bits, p ≥ 1, the positive overflow can be detected by
$$\overline{a[m-1]} \cdot \left(a[m-2] + a[m-3] + \cdots + a[m-p-1]\right).$$
Figure 5.35: Complex multiplier, where x_r, x_i, y_r, y_i, z_r, and z_i are real part (a) of x, imaginary part (b) of x, real part (c) of y, imaginary part (d) of y, real part of z, and imaginary part of z, respectively.
The overflow detection of s_rnd_i can be performed similarly. Our complex multiplier uses saturation arithmetic to clamp the result on an overflow to minimize the error.
...
// Check overflow
...
// Output of Limiter
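As a minimal sketch of the overflow check and limiter described above, the following module saturates an 18-bit s(3.15) value to a 16-bit s(1.15) value. The module name, the port names s_rnd and z, and the signal of_neg are hypothetical; of_pos follows the positive-overflow condition stated in the text, and the negative-overflow condition is assumed to be its mirror image (sign bit 1 while bit 16 or bit 15 is 0).

```verilog
// Limiter sketch: saturate s(3.15) to s(1.15). Names are hypothetical.
module limiter_s3_15 (
  input  wire [17:0] s_rnd,   // s(3.15): bit 17 is the sign bit
  output wire [15:0] z        // s(1.15): saturated result
);
  // Positive overflow: sign is 0 but bit 16 or bit 15 is 1.
  wire of_pos = ~s_rnd[17] & (s_rnd[16] | s_rnd[15]);
  // Negative overflow (assumed mirror case): sign is 1 but bit 16 or bit 15 is 0.
  wire of_neg =  s_rnd[17] & ~(s_rnd[16] & s_rnd[15]);

  assign z = of_pos ? 16'h7fff :     // clamp to the most positive value
             of_neg ? 16'h8000 :     // clamp to the most negative value
                      s_rnd[15:0];   // in range: drop the two redundant sign bits
endmodule
```

When neither flag is set, bits 17, 16, and 15 are equal, so bit 15 can serve as the sign bit of the s(1.15) result.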
initial begin
  a = -4'd6;
  int32 = -4'd6;
end
In the first assignment, a 4-bit unsigned number (1010₂) is assigned to an 8-bit unsigned number, and −4’d6 is zero extended as 0000_1010₂, which is a potential problem. In the second assignment, a 4-bit unsigned number (1010₂) is assigned to a 32-bit signed number, and −4’d6 is sign extended as 1111_..._1111_1010₂, which is fine for signed-magnitude representation, but is still not good for 2’s complement representation. This is quite confusing. Hence, do not use a base when you refer to a negative number.
In another example,
reg [15:0] c;
initial begin
  a = -1;
  b = 8;
  c = 8;
  #10 b = b + a;
  #10 c = c + a;
end
First, −1₁₀ is assigned to the 4-bit unsigned reg a, which stores the 4-bit 2’s complement of −1₁₀, i.e., 1111₂. Second, 8 is assigned to the 4-bit unsigned reg b, so b is 1000₂. Third, 8 is assigned to the 16-bit unsigned reg c, so c is 0000_0000_0000_1000₂. Fourth, the 4-bit unsigned a and the 4-bit unsigned b are added, and the result is assigned to the 4-bit unsigned b. The sum is 1_0111₂, which is truncated to 0111₂ and then assigned to b. Finally, the 16-bit unsigned c and the 4-bit unsigned a are added, and the result is assigned to the 16-bit unsigned c. The 4-bit unsigned a is zero extended to 0000_0000_0000_1111₂, so the sum is 0000_0000_0001_0111₂, which is then assigned to c.
In another example,
initial begin
  b = 32'hffff_fff0;
  #10 a = b + 1;
end
• If all RHS operands are signed, the result is signed, regardless of operator.
• If any RHS operand is unsigned, the result is unsigned.
• If RHS operands are constant decimal numbers (e.g., −12), they are treated as signed numbers, whereas constant based numbers (e.g., −4’d6) are unsigned.
Step 2: evaluate the RHS expression, producing a result of the type (i.e., sign)
found in Step 1. The result has a size of the largest RHS operand.
Step 3: assign RHS to LHS according to the size of the LHS.
• If the bit width of the RHS is smaller than that of the LHS and the result of the RHS is signed, the signed result is sign extended.
• If the bit width of the RHS is smaller than that of the LHS and the result of the RHS is unsigned, the unsigned result is zero extended.
• If the bit width of the RHS is larger than that of the LHS, the RHS is truncated.
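The extension rules in Step 3 can be illustrated with a small sketch that assigns a 4-bit signed operand and a 4-bit unsigned operand to 8-bit targets. The module and signal names are hypothetical, and only plain assignments are shown so that the result depends on the extension rule alone.

```verilog
// Sign extension versus zero extension on assignment (names are hypothetical).
module ext_rules_demo (
  input  wire signed [3:0] sa,            // 4-bit signed operand
  input  wire        [3:0] ua,            // 4-bit unsigned operand
  output wire signed [7:0] from_signed,
  output wire        [7:0] from_unsigned
);
  // The RHS is signed and narrower than the LHS: it is sign extended.
  assign from_signed   = sa;
  // The RHS is unsigned and narrower than the LHS: it is zero extended.
  assign from_unsigned = ua;
endmodule
```

With the same bit pattern 1110₂ on both inputs, from_signed becomes 1111_1110₂ and from_unsigned becomes 0000_1110₂.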
PROBLEMS
1. Design the combinational shifter with the function table in Table 5.15.
wire [4:0] a, b;
reg [4:0] c;
integer i;
always @(*)
wire a, b, c;
reg d, e;
always @(*) begin
  d = (a & b) & (a | c);
  e = c ^ d;
end
6. Detect the overflow for the 3-bit result of signed addition and subtraction of 3-bit
operands.
7. Prove that the sign extension does not affect the original value of both signed
and unsigned integer numbers.
8. We are familiar with addition using signed-magnitude representation. However,
it is most convenient to implement addition using 2’s complement representa-
tion. If we want to design a circuit for calculating
Y = A + B −C (5.21)
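[Figure: block diagram of the circuit. The inputs A, B, and C pass through signed-magnitude to 2’s complement conversion, then through the three-operand addition Y = A + B − C, and finally through 2’s complement to signed-magnitude conversion to produce Y.]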
a. The first block converts one signed-magnitude number to one 2’s complement
number. To design the first block, i.e., signed-magnitude to 2’s complement
conversion, write down the one-to-one relation between signed-magnitude
and 2’s complement representations. Based on the relation, write down your
RTL code based on a lookup table. Since the conversion should be instanti-
ated 3 times, please design the first block using a module.
b. A possible implementation of the second block is shown in Figure 5.37. To
prevent overflow, what’s the bit width of Y ? Write down your RTL code in a
module. You are not allowed to use the “signed” declaration. Remember to
do sign extension before the addition.
Figure 5.37: Three-number 2’s complement addition.
c. The third block is the “reverse process” of the first block. Based on the re-
lation of signed-magnitude and 2’s complement representations, write down
your RTL code in a module.
d. Write down your RTL code of the top module by instantiating the above three modules.
e. Can you reduce the area of the design by combining the conversion of C[2 : 0]
in the first block and negating C[2 : 0] in the second block?
f. Rewrite (a) using a function.
g. Can you design a smaller signed-magnitude to 2’s complement conversion
circuit using an adder instead of lookup table in item a. of this problem?
What’s the advantage of your new design?
9. a. Using the Boolean algebra to prove that
c. From above, prove that the overflow detectors in (5.4) and (5.5) are equiva-
lent.
10. Design a 6-bit signed-magnitude comparator.
11. Plot the architecture of the ALU in Table 5.14.
12. Prove that the multiplication of two n-bit numbers gives a product of width less
than or equal to 2n bits.
13. Prove that the addition of one n-bit number and one m-bit number gives a sum of width less than or equal to n + m bits.
14. Write an RTL behavioral description for adding two 8-bit signed numbers in
signed-magnitude representation and verify it.
15. Please redesign the module complex_mul such that the maximum value 1 of the
product of two s(1.15) numbers cannot be ignored.
16. An approximation for finding the square root of a number about 1 can be found by computing its Taylor series as
$$\sqrt{x} \approx x + \frac{(x-1)^2}{2} + \frac{(x-1)^3}{6}. \tag{5.24}$$
Design a Verilog module that computes the approximate square root of x using
the u(1.8) format. The output is also assumed to be an u(1.8) number. Your
design must not suffer any intermediate precision loss. What is the worst case
error for all x between 0.5 and 1.5?
17. Plot the architectures of the following two RTL codes using 2-to-1 multiplexers.
Subsequently, analyze the critical paths of them.
RTL codes 1:
always @(*)
  if (sel == 2'b00) out = a;
  else if (sel == 2'b01) out = b;
  else if (sel == 2'b10) out = c;
  else out = d;
RTL codes 2:
always @(*)
  case (sel)
    2'b00: out = a;
    2'b01: out = b;
    2'b10: out = c;
    default: out = d;
  endcase
18. An approximation for finding the reciprocal of a number between 0.3 and 0.8 is given by
$$\frac{1}{x} \approx 1 + (1-x) + (1-x)^2 + (1-x)^3. \tag{5.25}$$
The interface of the decoder is displayed below. When the sums of the elements in every row and every column are divisible by 2, the output r[1 : 0]=2’b00; when a single element causes one row and one column not to be divisible by 2, the output r[1 : 0]=2’b01; otherwise, the output r[1 : 0]=2’b10. In addition, when r[1 : 0]=2’b01, output the row and column indices, i.e., row[1 : 0] and col[1 : 0], respectively, of the element causing the sums not to be divisible by 2; when r[1 : 0]≠2’b01, row[1 : 0] and col[1 : 0] are “don’t care”.
a. Design a module, called add4, that can sum four elements in a row or column.
b. Write the RTL code that decides whether the output of an add4 is exactly divisible by 2.
c. Instantiate add4 8 times. Based on the decision results of every row and column, output r[1 : 0], row[1 : 0], and col[1 : 0].
24. For the half adder circuit in Figure 5.21, complete the timing diagrams with
respect to the following timing models.
a. Gate delays of XOR, NAND, and NOT gates are 2.5, 1.5, 0.5 time units,
respectively.
b. In the typical-case operating condition, rising time delays of XOR, NAND,
and NOT gates are 2.5, 1.5, 0.5 time units, respectively. Falling time delays
of XOR, NAND, and NOT gates are 2.2, 1.2, 0.4 time units, respectively.
c. In the worst-case operating condition, all delays are 1.6 times those in the
typical-case operating condition.
25. Redesign the fixed-point addition in Section 4.2.2 using the signed declaration.
26. Redesign the fixed-point multiplication in Section 4.2.2 using the signed decla-
ration.
6 Sequential Circuits
Versatile and highly complicated functions have been achieved through sequential circuits. This chapter introduces two sequential circuits: the asynchronous latch and the synchronous flip-flop. The rationales for the timing constraints of flip-flops, including the requirements of setup time, hold time, and clock-to-Q delay, are illustrated in detail. We then give examples of behavioral and structural descriptions of sequential circuits. Basic but essential building blocks of sequential circuits, such as registers, shift registers, register files, state machines, (synchronous and asynchronous) counters, and the FIFO buffer (or queue), are presented, together with their RTL codes. Finally, the way to solve the race condition in Verilog codes is briefly described.
6.1.1 LATCH
Recall that the output of a combinational circuit depends only on its current inputs. Also, combinational circuits are acyclic. If we add a feedback path to a combinational circuit, the circuit might become sequential, which allows the circuit to store information about its past inputs. That is, the output of a sequential circuit depends not only on its current inputs but also on its previous inputs. The information stored on the feedback signals is referred to as the state of a sequential circuit.
Latches are level-sensitive sequential circuits since they respond to input changes for the entire duration of the clock pulse width, as shown in Figure 6.2. Consequently, a latch output may change many times during one clock pulse. Latches are difficult to work with for this reason.
All latches are constructed from the SR (set-reset) latch consisting of NOR gates introduced here, as shown in Figure 6.3, where $\overline{(\cdot)}$ denotes the bitwise NOT operation. The set state of the latch output is logic 1, and the reset state of the latch output is logic 0. A latch can maintain a binary state indefinitely until directed by an input signal to switch states.
The truth table of a latch is presented in Table 6.1. Notice that the input, S = 1 and
R = 1, is forbidden because if the next input, S = 0 and R = 0, is applied, the output
Q(t + 1) may oscillate between 0 and 1 states.
are fed back to the inputs as state variables, where · and + denote the bitwise AND and OR operations, respectively. The equation clearly tells us how to derive the new state $Q(t+1)$ as a function of the inputs and the old state $Q(t)$. From the equations, it is easy to see the following. If R = 1 and S = 0, $Q(t+1)$ is reset and $\overline{Q}(t+1)$ is set, which is a stable state. If S = 1 and R = 0, $Q(t+1)$ is set and $\overline{Q}(t+1)$ is reset, which is also a stable state. If S = 0 and R = 0, $Q(t+1) = Q(t)$ and $\overline{Q}(t+1) = \overline{Q}(t)$, i.e., the outputs stay in whatever states they were in, which is still a stable state. If R = 1 and S = 1, $Q(t+1) = 0$ and $\overline{Q}(t+1) = 0$, which is an unstable state: if R = 0 and S = 0 are then applied, $Q(t+1)$ and $\overline{Q}(t+1)$ may oscillate between 0 and 1.
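The behavior described above can be reproduced with a gate-level sketch of the NOR-based SR latch. The module name, the port names, and the instance names u1 and u2 are hypothetical and need not match the labels in Figure 6.3; q_b stands for $\overline{Q}$.

```verilog
// Cross-coupled NOR SR latch (illustrative sketch, names are hypothetical).
module sr_latch_nor (
  input  wire s,     // set
  input  wire r,     // reset
  output wire q,
  output wire q_b    // complement of q in the stable states
);
  nor u1 (q,   r, q_b);   // q   = ~(r | q_b)
  nor u2 (q_b, s, q);     // q_b = ~(s | q)
endmodule
```

Applying S = 1, R = 0 sets q; returning to S = 0, R = 0 holds the stored value; S = 1, R = 1 forces both outputs to 0, which is the forbidden input combination.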
Figure 6.4 presents the SR latch with enable control. To obtain it, first, the NOR-gate SR latch is converted to the NAND-gate implementation. Second, u1 and u4 are merged, and so are u2 and u3. Third, an enable signal, E, is added. When E = 1, the SR latch works as before. When E = 0, the SR latch stays in whatever state it was in.
However, the indeterminate state still makes the SR latch difficult to use. To elim-
inate the undesirable condition of the indeterminate state in the SR latch, the D latch
(transparent latch) in Figure 6.5(a) is designed, where S and R pins of SR latch are
connected so that S is directly tied to D pin and R pin is tied to the inverse of D pin,
as shown in Figure 6.5(b).
Its functional table is displayed in Table 6.3.
The D-latches introduced before use static CMOS gates. CMOS technology,
however, also permits us to construct a D-latch, as shown in Figure 6.7(b), with a
transmission gate, a tristate inverter, and inverters. The tristate inverter is equivalent
to an inverter followed by a transmission gate in Figure 6.7(a), where, when E = 1,
the output Y = A by turning on both NMOS and PMOS in the transmission gate;
when E = 0, the output Y is in tristate by turning off NMOS and PMOS, such that Y
is isolated from A. Most CMOS latches use transmission gates in this style because
this results in a latch that is both smaller and faster than the static CMOS gates.
Compared to the static CMOS D-latch gate in Figure 6.5(b) with 18 transistors, the
D-latch using the transmission gate requires only 12 transistors.
Figure 6.7: (a) Tristate inverter. (b) CMOS D-latch using transmission gate and tristate inverter.
When enable E is high (and $\overline{E}$ is low), the transmission gate formed by NMOS m1 and PMOS m2 is on, allowing the value on input D to pass to storage node S. Output Q follows storage node S, buffered by inverters u2 and u4. Thus, when E is high, the output Q follows input D. When enable E goes low, the transmission gate formed by m1 and m2 turns off, isolating storage node S from the input. At this time, the input is sampled onto the storage node. Concurrently, tristate inverter u3 turns on, closing a storage loop from S back to itself through two inverters, u2 and u3. This feedback loop reinforces the stored value, allowing it to be retained indefinitely.
6.1.2 FLIP-FLOP
Flip-flops respond to input changes only during the change in clock signal (the rising
edge or the falling edge). They are easy to work with though more expensive than
latches. The state of edge-triggered flip-flop changes during a clock-pulse transition.
A D-type positive-edge-triggered flip-flop is shown in Figure 6.8. It has three SR
latches.
We will utilize the truth table of the last SR latch, as shown in Table 6.4.
Figure 6.10: (a) Combinational loop. (b) Feedback of flip-flop output is not a prob-
lem.
Figure 6.11: Rising setup and clock-to-Q delay times requirement. Signal values with
an emphasis on their transitions for 0 to 1 transition in Q are labeled in the schematic.
Figure 6.12: Falling setup and clock-to-Q delay times requirement. Signal values
with an emphasis on their transitions for 1 to 0 transition in Q are labeled in the
schematic.
Moreover, the hold time is defined as that D input must not change after the appli-
cation of the positive-edge clock pulse (the 2nd transition). As shown in Figure 6.13,
the signal marked by the dashed arrow must be stable before the change of D (the 3rd
transition). Therefore, it is the propagation delay of gate u3, i.e., clock to the internal
latch.
Figure 6.13: Hold time requirement. Signal values with an emphasis on their transi-
tions are labeled in the schematic.
The clock-to-Q delay, $t_{CQ}$, is defined as the time it takes for the register output to reach a stable state after a clock edge occurs. We can use the intra-assignment delay to model the clock-to-Q delay, as shown below, where a clock-to-Q delay of 1 time unit is added. The delay will be removed during synthesis. It can be used to distinguish the events of clock rising edges from those of the evaluation of the outputs of sequential circuits. In this example, at rising edges of clk, the D input (or A + B) of the flip-flops is evaluated. Then, the result is delayed and finally assigned to Q after 1 time unit.
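A minimal sketch of this modeling style is given here. The signal names clk, A, B, and Q follow the text, while the module name and the 4-bit widths are assumptions made only for illustration.

```verilog
// Clock-to-Q delay modeled with an intra-assignment delay (illustrative sketch).
module dff_cq_delay (
  input  wire       clk,
  input  wire [3:0] A,     // 4-bit width is an assumption
  input  wire [3:0] B,
  output reg  [3:0] Q
);
  // A + B is evaluated at the rising edge of clk;
  // the update of Q is scheduled 1 time unit later.
  always @(posedge clk)
    Q <= #1 A + B;
endmodule
```

Because the right-hand side is sampled at the clock edge, a change of A or B during the following time unit does not affect the value that reaches Q.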
Only the (minimum : typical : maximum) delays of a single delay can be specified
in procedural assignments. Therefore, the rising, falling, and turn-off delays must be
the same. To distinguish the rising and falling delays, the continuous assignment is
used as follows. In the following example, Q models the real output of the flip-flops.
The (minimum : typical : maximum) delays of rising, falling, and turn-off delays are
(1 : 2 : 3), (2 : 3 : 4), and (3 : 4 : 5), respectively.
// differentiated.
The setup time (2 time units) and hold time (1 time unit) constraints of a flip-flop
are checked by the specify block.
At every positive edge of clk, q is set to d. Therefore, when the input d changes, the registered output q reflects (and memorizes) the value of d at the next rising edge of clk. If we change posedge to negedge, we get the negative edge-triggered D-FF. We use the non-blocking assignment to describe the behavior of sequential circuits. For example, if there are three FFs to be inferred, they can be written in an always block as follows. The variables, a, b, and c, are outputs of hardware registers implemented by D-FFs. They are executed concurrently and, hence, are order independent.
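One way to picture three FFs inferred from a single always block is the following sketch, in which a, b, and c form a small shift chain fed by an input d. The chain structure, the module name, and the input d are assumptions for illustration; the statements are deliberately listed in reverse order to show that non-blocking assignments are order independent.

```verilog
// Three D-FFs in one always block (illustrative sketch, structure is assumed).
module three_ffs (
  input  wire clk,
  input  wire d,
  output reg  a,
  output reg  b,
  output reg  c
);
  always @(posedge clk) begin
    c <= b;   // uses the value of b from before this clock edge
    b <= a;   // uses the value of a from before this clock edge
    a <= d;
  end
endmodule
```

Reordering the three statements gives the same hardware, because every right-hand side is evaluated before any left-hand side is updated.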
The always block can also infer a level-sensitive latch using a sensitivity list similar to that of a combinational circuit. For example, if the output is not fully specified, a latch is inferred.
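A minimal sketch of such an inference is a D latch whose output is assigned only when the enable is high. The module name and the port names E, D, and Q are chosen here for illustration.

```verilog
// Level-sensitive D latch inferred from an incomplete if statement (sketch).
module d_latch (
  input  wire E,    // enable
  input  wire D,
  output reg  Q
);
  always @(E or D)
    if (E) Q = D;   // no else branch: Q keeps its value while E = 0
endmodule
```

Since Q is not assigned when E = 0, it must remember its previous value, which is what makes a synthesis tool infer a latch rather than purely combinational logic.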
  if (!reset_n) q <= 0;
  else q <= d;
There are two functions mixed in the always block: the reset and normal functions. The reset is asynchronous because it is placed in the sensitivity list. That is, whenever the negative edge (negedge) of reset_n happens, the output q of the FF is cleared (assigned logic 0) because reset_n = 0 has a higher priority than the normal function (assigning d). Hence, the FF responds to the asynchronous reset immediately. Otherwise, the normal function takes action. The above RTL code exactly describes the behavior of an edge-triggered D-FF with asynchronous reset in Figure 6.18. The timing diagram is also presented.
Figure 6.18: D-FF with asynchronous reset: (a) schematic and (b) timing diagram.
Typically, the power-on-reset (POR) and the hardware reset (asserted by the reset push button) are applied using the asynchronous reset, while the normal function uses the synchronous reset to reset (or clear) a block or a portion of the digital circuits. Specifically, the active-low POR and hardware reset are ANDed to generate the asynchronous reset, as shown below. The POR can be generated by the voltage regulator once the supply voltage is stable.
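The combination of the two reset styles might be sketched as follows. All names here (por_n, hw_rst_n, clear, and the module name) are hypothetical; the sketch only assumes that both hardware resets are active low and that the synchronous clear is active high.

```verilog
// Asynchronous reset from POR AND hardware reset, plus a synchronous clear.
// Illustrative sketch: all names are hypothetical.
module reset_gen_dff (
  input  wire clk,
  input  wire por_n,      // active-low power-on reset
  input  wire hw_rst_n,   // active-low push-button reset
  input  wire clear,      // synchronous clear used by the normal function
  input  wire d,
  output reg  q
);
  // Either active-low source pulls the combined reset low.
  wire reset_n = por_n & hw_rst_n;

  always @(posedge clk or negedge reset_n)
    if (!reset_n)   q <= 1'b0;   // asynchronous: takes effect immediately
    else if (clear) q <= 1'b0;   // synchronous: takes effect at the clock edge
    else            q <= d;
endmodule
```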
Figure 6.19: D-FF with synchronous reset: (a) schematic and (b) timing diagram.
There are two typical ways to describe the outputs of sequential circuits.
Method 1: one always block combining both combinational and sequential cir-
cuits.
  d = a + b;
// Sequential circuits
always @(posedge clk or negedge reset_n)
  if (!reset_n) q <= 4'd0;
  else q <= d;
6.4.1 REGISTERS
An n-bit register is a group of n binary sequential cells (or flip-flops). A binary se-
quential cell can store one bit of information, or two states: 0 (reset) and 1 (set) state.
The state of a register is an n-tuple of 1’s and 0’s. The registers with synchronous
enable or load are presented below.
input [3:0] d;
reg [3:0] q;
...
  if (enable) q <= d;
endmodule
mode transfers all the bits of the register at the same time to achieve a high bit rate
of data transmission.
A serial transfer is demonstrated in Figure 6.22. The transmitter/receiver converts
a parallel/serial input to serial/parallel output.
Figure 6.22: A serial transfer mode with shift control: (a) block diagram and (b)
timing diagram of gated clock.
We assume that the parallel input and output are 4 bits. When PI_VALID is true,
the input PI_A is loaded into register A. When PO_VALID is true, the output PO_B
is output from register B. The serial output (SO) of register A is connected to the
serial input (SI) of register B. That is, registers A and B are transferred in parallel-in
to serial-out (PISO) and serial-in to parallel-out (SIPO) modes, respectively.
The shift control input (shift_en) determines when and how many times register A is loaded or shifted. This is done with an AND gate that allows clock pulses to pass into the clock terminals of the registers only when the shift control signal is high. Gating the clock signal in this way is called clock gating. This practice may be problematic because it alters the clock path of the circuit, so glitches may be produced on the tx_clk_gated signal. The functionality of the shift register can fail owing to extra edges on the tx_clk_gated signal. Therefore, the control signal, shift_en, should be carefully designed so that no glitch can be produced in the gated clock.
The signal, rx_clk_gated, determines how many times register B is shifted. When a complete word has been shifted in, the signal PO_VALID becomes true.
The RTL codes are written below.
// ******************
// * Transmitter
// ******************
...
// Shift register A
...
  if (PI_VALID)
    A_reg <= PI_A;
  else begin
    for (i = 0; i <= 2; i = i + 1) A_reg[i+1] <= A_reg[i];
  end
end
...
// ******************
// * Receiver
// ******************
// Shift register B
input clk;
Next, it is more convenient to apply the state table for the state reduction shown in Figure 6.25, where x is the single-bit input. Two states are said to be equivalent if, for the same input, they give exactly the same output and transit to the same state or to an equivalent state. When two states are equivalent, such as g and e, and d and f, one of them can be removed without altering the input-output relationships. The state diagram for the reduced table shown in Figure 6.25 finally consists of only five states. In this example, even though the number of flip-flops is not reduced, the combinational logic is still reduced owing to the smaller number of states.
parameter e = 3'b100;
// Combinational logic
always @(*) begin
  state_ns = state_cs;
  case (state_cs)
    a: state_ns = x ? b : a;
    b: state_ns = x ? d : b;
    c: state_ns = x ? d : b;
    d: state_ns = x ? d : e;
    e: state_ns = x ? d : a;
  endcase
end
// Sequential logic
always @(posedge clk or negedge rst_n)
Remember that power is consumed when a bit toggles between 0 and 1. The Gray code changes only one bit between adjacent numbers, as shown in Table 6.6. This code group can be used to reduce the number of bit transitions. Gray code can also be used to detect an error or ambiguity during the transition from one number to the next when multiple bits change.
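For reference, a binary number can be converted to the standard reflected Gray code by XOR-ing it with itself shifted right by one bit. The parameterized module below is a sketch with hypothetical names; it is not taken from the book's listings.

```verilog
// Binary to reflected Gray code conversion (illustrative sketch).
module bin2gray #(parameter N = 3) (
  input  wire [N-1:0] bin,
  output wire [N-1:0] gray
);
  // Each Gray bit is the XOR of two adjacent binary bits; the MSB is unchanged.
  assign gray = bin ^ (bin >> 1);
endmodule
```

For N = 3, the binary values 0 to 4 map to 000, 001, 011, 010, and 110, so consecutive codes differ in exactly one bit.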
The state machine of the Gray encoding is presented below. Since the RTL codes are
the same as those of the traditional binary encoding, they are omitted here except the
parameter definition.
parameter e = 3'b110;
Another assignment often used in the design of state machines that control datapath units is the one-hot assignment, which can shorten the critical path in the datapath units because the decoder of the state machine can be eliminated. The state machine of the one-hot encoding is presented below. For example, if we want to determine whether the state state_cs is state a using the Verilog code “state_cs==a”, it will be optimized to simply “state_cs[0]==1’b1”, and the decoder is not needed. Therefore, the one-hot encoding can achieve a faster circuit because combinational decoders are not required to generate the control signals from the FSM. However, the one-hot encoding needs more registers: their number equals the number of states instead of ⌈log2(·)⌉ of it.
parameter e = 5'b10000;
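Putting the pieces together, a one-hot version of the five-state machine might look like the sketch below. The state transitions are taken from the combinational logic listed earlier and the code for e matches the parameter above; the codes for a to d, the reset state a, the module name, and the output in_state_a are assumptions made for illustration.

```verilog
// One-hot encoded five-state FSM (illustrative sketch; see assumptions above).
module fsm_onehot (
  input  wire clk,
  input  wire rst_n,
  input  wire x,
  output wire in_state_a
);
  parameter a = 5'b00001;   // assumed code
  parameter b = 5'b00010;   // assumed code
  parameter c = 5'b00100;   // assumed code
  parameter d = 5'b01000;   // assumed code
  parameter e = 5'b10000;

  reg [4:0] state_cs, state_ns;

  // Combinational logic
  always @(*) begin
    state_ns = state_cs;
    case (state_cs)
      a: state_ns = x ? b : a;
      b: state_ns = x ? d : b;
      c: state_ns = x ? d : b;
      d: state_ns = x ? d : e;
      e: state_ns = x ? d : a;
    endcase
  end

  // Sequential logic (reset to state a is an assumption)
  always @(posedge clk or negedge rst_n)
    if (!rst_n) state_cs <= a;
    else        state_cs <= state_ns;

  // Decoding a state needs only one register bit, no comparator.
  assign in_state_a = state_cs[0];
endmodule
```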
6.4.5 COUNTER
A counter is essentially a register that goes through a predetermined sequence of
binary states. The gates in the counter are connected in such a way as to generate the
specified sequence of states.
// Synchronous counter
module counter1(out, enable, clk, reset);
output [2:0] out;
...
reg [2:0] out;
...
end
endmodule
If the counter counts till 3’d5, the comparison to 3’d5 is required as follows.
// Synchronous counter
module counter1(out, enable, clk, reset);
output [2:0] out;
...
reg [2:0] out;
...
endmodule
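Since only part of the listing is reproduced here, the following self-contained sketch shows one way a counter that counts up to 3’d5 and then wraps to 0 could be written. The module name counter_mod6 is hypothetical, and the active-high synchronous reset is an assumption.

```verilog
// Synchronous counter that counts 0, 1, ..., 5 and wraps (illustrative sketch).
module counter_mod6 (
  input  wire       clk,
  input  wire       reset,    // synchronous, active high (assumption)
  input  wire       enable,
  output reg  [2:0] out
);
  always @(posedge clk)
    if (reset)
      out <= 3'd0;
    else if (enable) begin
      if (out == 3'd5) out <= 3'd0;        // comparison with the terminal count
      else             out <= out + 1'b1;
    end
endmodule
```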
Example 6.1. A digital alarm clock needs to generate a periodic signal at a frequency
of approximately 500 Hz to drive the speaker for the alarm tone. Use a counter to
divide the system’s master clock signal with a frequency of 1 MHz to derive the 500
Hz alarm tone.
Solution: The RTL code of the alarm tone is presented below. The counter ticks at every positive edge of the master clock until it reaches 1000. When the counter is 1000, the alarm tone toggles. Consequently, the frequency of the alarm tone is 1/2000 that of the master clock, i.e., 1 MHz × 1/2000 = 500 Hz.
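A possible realization of this divider is sketched below. The module, port, and parameter names are hypothetical, and the sketch assumes the counter runs from 0 to HALF_PERIOD − 1 so that the tone toggles exactly once every 1000 master-clock cycles.

```verilog
// 500 Hz alarm tone from a 1 MHz clock (illustrative sketch, names are hypothetical).
module alarm_tone #(parameter HALF_PERIOD = 1000) (
  input  wire clk,      // 1 MHz master clock
  input  wire rst_n,    // asynchronous active-low reset (assumption)
  output reg  tone
);
  reg [9:0] count;      // 10 bits are enough for 0 .. 999

  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      count <= 10'd0;
      tone  <= 1'b0;
    end else if (count == HALF_PERIOD - 1) begin
      count <= 10'd0;   // restart the half-period count
      tone  <= ~tone;   // toggle every HALF_PERIOD clock cycles
    end else
      count <= count + 1'b1;
endmodule
```

One full period of tone spans 2 × HALF_PERIOD = 2000 clock cycles, which gives the 1 MHz / 2000 = 500 Hz stated in the solution.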
  if (!rst_n) Y = 0;
  else Y = div16;
endmodule
Figure 6.27: Asynchronous counter: (a) architecture and (b) timing diagram.
The purpose of register Y at the last stage is to synchronize the output Y to clk.
Each single flip-flop stage divides the frequency of its input signal by two. So, the
asynchronous counter can also be used for the frequency divider. This circuit divides
the clock frequency by 16, and it can be used for a clock divider.
The main advantages of a ripple counter are that it uses much less circuitry in its implementation (since an incrementer is not required) and that it consumes less
power. However, an important timing issue arises from the fact that the flip-flops
in a ripple counter are not all clocked together. Each flip-flop has a propagation
delay between a rising edge occurring on its clock input and the outputs changing
values. Since each flip-flop is clocked from the output of the previous flip-flop, the
propagation delays accumulate. The length of the counter should be considered. For
longer counters, there are more flip-flops through which changes have to propagate.
The accumulated delay may exceed the clock period. For shorter counters, the delay
may be acceptable.
The following RTL codes describe a clock divider divided by 13, instead of power
of 2, using an asynchronous (ripple) counter.
wire [3:0] counter;
...
  if (!rst_n) Y <= 0;
  else if (counter == 11)
    Y <= 1;
  else Y <= 0;
always @(div16_b or div8_b or div4_b or div2_b)
  if (counter == 12)
    clear = 1;
  else clear = 0;
endmodule
wire [3:0] counter;
...
    or posedge clear)
  if (!rst_n) div2 <= 0;
  else if (clear) div2 <= 0;
  else div2 <= ~div2;
assign div2_b = !div2;
...
  if (!rst_n) Y <= 0;
  else if (counter == 10)
    Y <= 1;
  else Y <= 0;
always @(negedge clk or negedge rst_n)
6.4.6 FIFO
As shown in Figure 6.30, the first-in-first-out (FIFO) buffer can be used to store
elements (or data) when the service is temporarily unavailable. One new element can
arrive at a time, and one element can be served and then depart at a time as well.
The terms FIFO buffer and queue are interchangeable. FIFO is a common term in
hardware, while queue is more common in most programming languages. Queue
and stack terms refer to FIFO and last-in-first-out (LIFO) buffers, respectively.
FIFO finds its applications in many areas of hardware design, such as queue or
synchronizer. A FIFO is typically implemented using a circular buffer structure, as
shown below.
We want to design a 10 × 8 FIFO whose memory is realized using flip-flops. The
I/O interface of the FIFO is presented in Table 6.7. There are a read port and a write
port.
Figure 6.31: FIFO memory with a depth (or address space) of 10 elements: (a) phys-
ical structure and (b) logical circular buffer structure.
As presented in Figure 6.32, the FIFO write (or read) pointer, wr_ptr (or rd_ptr), is the address at which a new element will be written (or read). Besides the read pointer, rd_ptr, and the write pointer, wr_ptr, the queue length is counted by queue_length so that all entries of the FIFO memory can be fully utilized.
As shown below, the FIFO buffer is a parameterized design. The parameter
DEPTH_BITS is the ceiling of log2(DEPTH), computed by the macro CLOG2 defined
in Chapter 3, where log2(x) denotes the base-2 logarithm of x. When the FIFO
write (or read) enable, fifo_wr (or fifo_rd), is asserted, an element is written
into (or read from) the FIFO memory location indexed by wr_ptr (or rd_ptr), and
wr_ptr (or rd_ptr) automatically increments. The FIFO memory is implemented
as a circular buffer. Therefore, when wr_ptr (or rd_ptr) reaches the end of the FIFO
memory, i.e., address 9, it wraps around to address 0. When fifo_wr is asserted and
fifo_rd is not, queue_length increments; when fifo_rd is asserted and fifo_wr is
not, queue_length decrements; otherwise, queue_length keeps its value because
either neither or both of fifo_wr and fifo_rd are asserted.
Figure 6.32: The FIFO memory is indexed by write and read pointers. (a) When
wr_ptr= 2 and rd_ptr= 8, queue_length= 4. (b) When wr_ptr= 8 and rd_ptr= 2,
queue_length= 6.
// FIFO write pointer
always @(posedge clk or negedge rst_n)
  if (!rst_n)
    wr_ptr <= 0;
  else if (fifo_wr && wr_ptr == DEPTH-1)
    wr_ptr <= 0;                      // Wrap around at the end of the memory
  else if (fifo_wr)
    wr_ptr <= wr_ptr + 1'b1;
// FIFO read pointer
always @(posedge clk or negedge rst_n)
  if (!rst_n)
    rd_ptr <= 0;
  else if (fifo_rd && rd_ptr == DEPTH-1)
    rd_ptr <= 0;                      // Wrap around at the end of the memory
  else if (fifo_rd)
    rd_ptr <= rd_ptr + 1'b1;
// FIFO length, so that all entries can be fully utilized
always @(posedge clk or negedge rst_n)
  if (!rst_n)
    queue_length <= 0;
  else if (fifo_wr && !fifo_rd)
    queue_length <= queue_length + 1'b1;
  else if (fifo_rd && !fifo_wr)
    queue_length <= queue_length - 1'b1;
The FIFO status is indicated using fifo_full and fifo_notempty signals. When
fifo_full is true, the fifo_wr signal must not be asserted to prevent FIFO overrun.
Similarly, when fifo_notempty is false, the fifo_rd signal must not be asserted to
prevent FIFO underrun.
// FIFO status
assign fifo_full = (queue_length == DEPTH);
assign fifo_notempty = ~(queue_length == 0);
Finally, the FIFO memory, fifo_mem, is declared as a 2-D array and implemented
using flip-flops. At the rising edge of clk, if fifo_wr is asserted, the write data,
fifo_wdata, is written into the fifo_mem entry indexed by the write pointer, wr_ptr.
The read data, fifo_rdata, is output through a combinational multiplexer indexed by
the read pointer, rd_ptr.
// FIFO controller
// ...
always @(*)
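Since only fragments of the memory listing survive here, the storage described above can be sketched as follows. This is a minimal illustration rather than the original listing: the module name fifo_mem_sketch and the parameter DATA_WIDTH are assumptions, DEPTH_BITS is set by hand instead of through the CLOG2 macro, and the pointers are taken as inputs so that the sketch stands alone.

```verilog
// Sketch only: flip-flop based FIFO storage with one write and one read port.
// fifo_mem_sketch and DATA_WIDTH are assumed names, not taken from the text.
module fifo_mem_sketch #(
  parameter DEPTH      = 10,   // Number of entries
  parameter DATA_WIDTH = 8,    // Bits per entry
  parameter DEPTH_BITS = 4     // ceil(log2(DEPTH)), set by hand here
) (
  output wire [DATA_WIDTH-1:0] fifo_rdata,
  input  wire [DATA_WIDTH-1:0] fifo_wdata,
  input  wire [DEPTH_BITS-1:0] wr_ptr,
  input  wire [DEPTH_BITS-1:0] rd_ptr,
  input  wire                  fifo_wr,
  input  wire                  clk
);
  // 2-D array: DEPTH words of DATA_WIDTH bits each
  reg [DATA_WIDTH-1:0] fifo_mem [0:DEPTH-1];

  // Synchronous write at the location selected by the write pointer
  always @(posedge clk)
    if (fifo_wr)
      fifo_mem[wr_ptr] <= fifo_wdata;

  // Combinational read: a multiplexer selected by the read pointer
  assign fifo_rdata = fifo_mem[rd_ptr];
endmodule
```

The storage is left without a reset in this sketch because an entry is only read after it has been written, provided that fifo_rd is never asserted while fifo_notempty is false.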
Another popular FIFO controller is realized using only the read and write pointers,
without the queue_length counter. When the FIFO is full, wr_ptr and rd_ptr
are the same. Therefore, another way to detect the FIFO full status would be to
check whether wr_ptr==rd_ptr. However, when the FIFO is empty, wr_ptr and rd_ptr
are also the same. To differentiate the full status from the empty status, FIFO full
is asserted when the next write pointer equals the current read pointer. That is,
FIFO full is asserted when rd_ptr==wr_ptr+1, or when rd_ptr is 0 and wr_ptr is 9
(the end of the physical structure). As a consequence, one element is always left
unoccupied, so one buffer entry is wasted.
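The pointer-only status detection just described can be sketched as follows. The module name fifo_status_ptr_sketch and the internal signal wr_ptr_next are assumptions made for illustration; the pointer updates themselves are the same as in the controller with queue_length.

```verilog
// Sketch only: full/not-empty flags derived from the two pointers alone.
// fifo_status_ptr_sketch and wr_ptr_next are assumed names.
module fifo_status_ptr_sketch #(
  parameter DEPTH      = 10,
  parameter DEPTH_BITS = 4     // ceil(log2(DEPTH)), set by hand here
) (
  output wire                  fifo_full,
  output wire                  fifo_notempty,
  input  wire [DEPTH_BITS-1:0] wr_ptr,
  input  wire [DEPTH_BITS-1:0] rd_ptr
);
  // Address that the write pointer would take after one more write
  wire [DEPTH_BITS-1:0] wr_ptr_next =
      (wr_ptr == DEPTH-1) ? {DEPTH_BITS{1'b0}} : wr_ptr + 1'b1;

  // Full: the next write would make the pointers equal
  assign fifo_full     = (wr_ptr_next == rd_ptr);
  // Not empty: equal pointers now always mean an empty FIFO
  assign fifo_notempty = (wr_ptr != rd_ptr);
endmodule
```

With DEPTH = 10, this scheme reports full when 9 entries are stored, which is the wasted entry mentioned above.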
initial begin
  #1 wait (!rst_n);       // Wait for assertion of reset
  wait (rst_n);           // Wait for de-assertion of reset
  @(posedge clk) x = 1;   // Assert x for one cycle
  @(posedge clk) x = 0;
end
However, a race condition may occur, leading to different results when the order
of statement execution or, more specifically, the order of active events is changed.
In Figure 6.33, we assume that, at the third rising edge of clk, the Verilog simulator
schedules three active events in the following sequence: the assignment x = 1, the
update of state_ns due to the change of x, and the evaluation of state_cs. That is,
1 is first assigned to x; then the always block that determines the (combinational)
next state of the state machine is triggered (causing state_ns to become state b);
finally, state b is assigned to the temporary storage of the non-blocking assignment
to state_cs.
After the active events are processed, the non-blocking assign update event copies
state b from the temporary storage into state_cs. The change of state_cs causes an
additional active event that triggers the update of state_ns, which becomes state d
because x is 1. Similar events happen at the fourth edge of clk, except that no
additional active event is triggered because state_cs does not change its value after
the non-blocking assign update event.
It is apparent that the interaction between signals generated by blocking assignments
(initial blocks in the testbench) and non-blocking assignments (always blocks
triggered by posedge clk in the RTL codes) may produce a wrong result, as displayed
in Figure 6.33. To avoid the race condition, we can either assign the primary input x
at a time instant other than the clock edges or use non-blocking assignments in the
testbench.
The following example assigns the primary input x 1 time unit after the clock
edges.
initial begin
  #1 wait (!rst_n);          // Wait for assertion of reset
  wait (rst_n);              // Wait for de-assertion of reset
  @(posedge clk) #1 x = 1;   // Assert x for one cycle
  @(posedge clk) #1 x = 0;
end
Alternatively, the following example drives x using non-blocking assignments.
initial begin
  #1 wait (!rst_n);          // Wait for assertion of reset
  wait (rst_n);              // Wait for de-assertion of reset
  @(posedge clk) x <= 1;     // Assert x for one cycle
  @(posedge clk) x <= 0;
end
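To try the race-free stimulus in a simulator, it can be wrapped in a small self-contained testbench such as the sketch below. Everything outside the stimulus block is an assumption made for illustration: the module name tb_race_free_sketch, the 10-time-unit clock, the reset timing, and the single flip-flop x_q that stands in for the state machine.

```verilog
`timescale 1ns/1ps
// Sketch only: a stand-in flip-flop sampling x, driven without a race.
module tb_race_free_sketch;
  reg clk   = 1'b0;
  reg rst_n = 1'b1;
  reg x     = 1'b0;
  reg x_q;                       // Stand-in for the design under test

  always #5 clk = ~clk;          // Assumed clock period of 10 time units

  initial begin                  // Assumed reset pulse, active low
    #2  rst_n = 1'b0;
    #10 rst_n = 1'b1;
  end

  // Stand-in design: registers x at every rising clock edge
  always @(posedge clk or negedge rst_n)
    if (!rst_n) x_q <= 1'b0;
    else        x_q <= x;

  // Stimulus using non-blocking assignments
  initial begin
    #1 wait (!rst_n);            // Wait for assertion of reset
    wait (rst_n);                // Wait for de-assertion of reset
    @(posedge clk) x <= 1'b1;    // Assert x for one cycle
    @(posedge clk) x <= 1'b0;
    repeat (2) @(posedge clk);
    $finish;
  end

  initial $monitor("t=%0t clk=%b x=%b x_q=%b", $time, clk, x, x_q);
endmodule
```

Because x is updated in the non-blocking update phase, the flip-flop at each rising edge samples the value x held before that edge, regardless of the order in which the simulator evaluates the two processes.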
PROBLEMS
1. Redesign the bubble sorting problem using only one processing element (PE),
i.e., comparison-and-swap unit, and a suitable state machine.
a. Plot datapath of your architecture.
b. Specify the critical path in your design.
c. Write down your complete RTL codes (including FSM) and verify it. The bit
width is programmable using the parameterized design and it is assumed to
be 3 bits.
2. Design a state machine that can detect the bit sequence "1011". For example, if
input is "0011_1011_0110", the output is "0000_0001_0010".
3. Plot the architecture of the following RTL codes.
wire a, b, c;
reg d, e, f;
always @(posedge clk) begin
  d <= a ^ b;
  e <= c | d;
  f <= d & e;
end
4. Write down the RTL codes for the 1-bit D-FF with asynchronous set.
5. Write down the RTL codes for the 1-bit D-FF with synchronous set.
6. Write down the RTL codes for the 1-bit D-FF with synchronous enable, as shown
in Figure 6.35. That is, when enable is true, x is assigned to y.
7. Write down the RTL codes for the 1-bit D-FF with synchronous load, as shown
in Figure 6.36. That is, when load_en is true, load_data is assigned to y; other-
wise, x is assigned to y.
// ...
always @(Sel or A)
  if (Sel)
    if (A == 1) begin
      B1 = 0; B2 = 0;
    end
    else begin
      B1 = 1; B2 = 1;
    end
  else begin
    B1 = 2; B2 = 2;
  end
endmodule

// ...
reg [1:0] B1, B2;
always @(Sel or A)
  if (Sel)
    if (A == 1) begin
      B1 = 0; B2 = 0;
    end
    else
      B1 = 1;
  else begin
    B1 = 2; B2 = 2;
  end
endmodule

// ...
reg [1:0] B1, B2;
always @(Sel or A)
  if (Sel)
    if (A == 1) begin
      B1 = 0;
      B2 = 0;
    end
    else begin
      B1 = 1;
      B2 = 1;
    end
endmodule

// ...
reg [1:0] B1, B2;
// ...
  if (Sel) begin
    if (A == 1) begin
      B1 = 0; B2 = 0;
    end
    else B1 = 1;
  end
end
endmodule

if (Load)
  Out = Data_In;
b. With the control pin, up_down, to control the counter to count up or count
down.
c. With the control pin, count_mode, to control the counter to count with an
increment or decrement by 1 or 2.
14. Redesign the asynchronous count-down counter as count-up counter.
15. Redesign the asynchronous count-up counter from 0 to 12, such that the clear
signal is a registered output.
16. A counter can be used for the timer. Please use a 1 MHz clock to generate a
pulse every 1 ms.
17. Design a pseudo-random binary sequence (PRBS) generator, as shown in
Figure 6.37.
18. Design an 8-bit register that shifts left or right (controlled by the signals
left_shift_en and right_shift_en) and that can be loaded with an 8-bit value through
the port load_value[7:0] (enabled by load_en).
19. Clock divider problem: it is simple to derive a divide-by-N clock with a non-50%
duty cycle, where N is an integer. If N is an even number, it is intuitive to obtain
the derived clock by using the ripple clock. Since clock is a very important sig-
nal, you should guarantee that the derived clock is glitch-free.
a. Design the clock divider for N = 2 using ripple and non-ripple clocks.
b. Design the clock divider for N = 5 using ripple and non-ripple clock. Notice
that, for ripple clock, the clear signal should be the flip-flop output, i.e., glitch-
free.
c. Design the clock divider for N = 1.5 using a non-ripple clock. A fractional
clock divider, for N = 1.5, 2.5, 3.5, etc., cannot be obtained directly from
the output of a single flip-flop. Instead, it must be obtained using a
combinational circuit with inputs from several flip-flops. The duty cycle of the
derived clock need not be 50%. Hint: You can combine two counters, each having
two bits and counting with the sequence 2'b00, 2'b01, 2'b11, 2'b00, 2'b01, ...,
i.e., they count three times and then reset. The intent of the sequence is to avoid
unnecessary glitches. These two counters are triggered by the positive edge and
the negative edge of the clock, respectively.
20. Design a register file with 32 64-bit registers using flip-flops with 4 read ports
and 2 write ports.
21. Design the FIFO without the queue_length indicator. In this design, use the
rd_ptr and wr_ptr to determine the fifo_full and fifo_notempty statuses. To accu-
rately indicate the status of queue, an entry of the FIFO could be wasted. What
are the pros and cons of this design compared to that using the queue_length
indicator?
22. Design the stack by modifying the FIFO module. A stack has only a write
pointer, wr_ptr. Its read pointer is implicitly pointed to by wr_ptr−1.
23. A flip-flop has a clock-to-Q delay, tCQ, of 1 ns. What is the delay of a 10-bit
binary ripple counter that uses this type of flip-flop? What is the maximum
frequency at which the counter can operate?
24. a. Design a counter with the following repeated binary sequence: 0, 1, 2, 3, 4,
5.
b. Design a counter with the following repeated binary sequence: 0, 1, 4, 6.
c. Design an 8-bit counter with the following repeated binary sequence: 1, 2, 1,
4, 1, 8, 1, 16, 1, 32, 1, 64, 1, 128.
25. Write a testbench to verify the 10-bit ripple counter.
26. Redesign the edge-triggered flip-flop using the NOR-gate implementation.
27. Please redesign the clock divider divided by 13 such that the clear signal is a
registered output.
28. Write a testbench to verify the 4-bit shift register.
29. We can use two D-latches to establish a flip-flop, as shown in Figure 6.38. We
call it master-slave D flip-flop. The first latch, called the master latch, passes
input d to x (the master is transparent) and output q is not affected by x when clk
is low. When clk goes high, d is sampled to x which is then passed to output q
(the slave is transparent). When clk goes low, x is sampled to q. The slave holds
the value of q when the clock is low.
The gate-level schematic of a master-slave D flip-flop is shown in Figure 6.39.
a. Verify its functionality using the waveform in Figure 6.38(b).
b. The tS (setup time) is just the setup time of the master latch. Determine the
rising setup time of flip-flop.
c. For correct operation of the master-slave flip-flop, it is critical that the output
of the master does not change until tH (hold time) after the slave clock falls.
Determine the hold time of flip-flop.
d. Determine its rising clock-to-Q delay time, tCQ .
30. Repeat the above problem for the master-slave flip-flop shown in Figure 6.40.
31. Similar to the previous problem, we can calculate the timing constraints of the
CMOS master-slave flip-flop constructed from two CMOS latches, as shown in
Figure 6.41. The CMOS master-slave flip-flop can be derived by substituting the
CMOS D-latch in Figure 6.7(b) into the master-slave flip-flop in Figure 6.14.
Figure 6.41: A CMOS master-slave flip-flop constructed from two CMOS latches.
33. Develop a sequential circuit with a single data input D and a single output Q.
The output is high when the input value in the current clock cycle is different
from the input value in the previous clock cycle, as shown in Figure 6.43.
34. Write the RTL codes of a circuit for a free-running counter that counts 30 clock
cycles and produces a control signal that is 1 during every 4th, 18th, and 21st
cycle.
35. Write the RTL codes of a circuit that uses counters to divide a master clock of
20.48 MHz to generate a signal with 50% duty cycle and a frequency of exactly
10 kHz.
36. The schematic in Figure 6.44 shows a ripple counter connected to a decoder.
Plot the outputs of decoder when the ripple counter increments.
39. Write the Verilog codes of the eight-tap FIR filter in Example 4.7.
7 Digital System Designs
Several important system-level hardware design issues, including the pipelining and
parallelism techniques, FIFO and its use for buffering data, arbiter, interconnect,
and memory system, will be presented in this chapter. To derive an efficient and
robust design, we suggest that readers plot the architecture and timing diagrams
of a design before writing its RTL codes. Several examples following this
guideline are presented, such as the complex multiplier, two additions, and the FIR
filter. The architectural diagram lets you understand what components are required
in the design, and the timing diagram governs the operating sequence of the design
and, if necessary, enables you to fine tune the system performance. Finally, a digital
design of Huffman encoding is illustrated from the algorithm design aspect to its
RTL code.
that meets your specifications, you must also guarantee the timing of your circuits.
RTL is ideal for this because its descriptions are a natural fit for the type of pipeline
design that inserts additional registers into a critical path to reduce its depth.
Module instances are also examples of synthesizable RTL statements. However,
one reason for using synthesis technology is to take advantage of its ability to de-
scribe the design at a higher level of abstraction than is possible when using a collec-
tion of module instances or low-level binary operators in a continuous assignment.
The most satisfactory approach is to describe what the design does and trust the
synthesis tool to make all of the correct decisions regarding how the design is imple-
mented. This is the first step on the road to successful high-level design.
As mentioned, RTL design contains behavioral (always block), dataflow (contin-
uous assignment), and structural descriptions. Ideally, the output of a system can be
completely specified using a state table. However, the state table for a large digital
system can become huge and unwieldy due to the enormous number of potential
states of current and previous inputs. To overcome this problem, digital systems are
usually designed using a modular approach: the system is partitioned into modular
subsystems, as shown in Figure 7.3, each of which performs some function. Data
are then exchanged using an interconnect, such as a bus. The problem remains to
partition the system at a level in which the design becomes manageable. Once it has
been successfully done, the rest of the task becomes much simpler and relatively
straightforward. Establishing a stable, workable system-level design is one of the
most interesting and challenging aspects of digital design.
The digital design process begins with a specification, as shown in Figure 7.4.
Major steps are listed and described below.
Figure 7.5: (a) A combinational circuit composed of 4 subcircuits. (b) Four pipeline
stages containing each subcircuit.
The throughput, Θ, of a module is the number of tasks a module can complete per
unit time. For example, if we have an adder that is able to perform one add operation
every 10 ns, we say that the throughput of the adder is 100 MOPS (million operations
per second). The latency, T , of a module is the amount of time it takes the module
to complete one task from beginning to end. For example, if the adder takes 10 ns
to complete an addition from the time the inputs are applied to the time the output
is stable, its latency T is 10 ns. For a simple module, throughput and latency are
reciprocals of one another: Θ = 1/T .
If we accelerate modules through pipelining or parallelism, however, the relation
becomes more complex. For example, suppose we use pipelining or parallelism to
increase the throughput of a module, in which T = 10 ns, Θ = 100 MOPS, by a factor
of 4. If we are using a parallel design, we could build four copies of our module, as
shown in Figure 7.6(a). Modules A-D are identical copies of our original module.
The fork block distributes tasks to the four modules, and the join block combines
the outcomes. Using such a structure, we can start four tasks in parallel. Our latency
is still T = 10 ns because it still takes 10 ns to complete one task. Our throughput,
however, has been increased to Θ = 400 MOPS since we are able to solve four tasks
every 10 ns.
An alternative method of increasing throughput is to pipeline a single copy of the
module, as shown in Figure 7.6(b). Here, we have taken a single module, A, and
divided it into four submodules, A1, . . . , A4. For this example, we assume that we
are able to partition the task evenly so that the delay of each of the four submodules,
Ai, i = 1, 2, 3, 4, is TAi = 2.5 ns. Pipeline registers placed between the stages
hold the result of the preceding submodule, freeing that submodule to begin working
on the next task. Thus, as shown, this pipeline can operate on four tasks at once in
a staggered fashion. As soon as submodule A1 finishes work on Task1, it
starts working on Task2, while submodule A2 continues work on Task1. Each task
continues down the pipeline, advancing one stage each clock cycle, until it is
completed by submodule A4. If we ignore register overhead, our latency is still
T = 2.5 ns × 4 stages = 10 ns and our throughput has been increased to 400 MOPS,
because the system completes a task every 2.5 ns.
Figure 7.8: Pipelined design for computing the average value of five inputs, a, b, c,
d, and e.
The throughput can be further enhanced by combining the parallelism and pipelin-
ing techniques, as shown below. Independent tasks can be fed into modules A, B,
C, and D in every clock cycle. The submodules Ai , Bi , Ci , and Di , i = 1, 2, 3, 4,
complete each subtask in one cycle. Hence, the overall throughput can now achieve
4 × 400 = 1600 MOPS.
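The throughput figures quoted in this subsection can be checked with a short calculation. The subscripted symbols below are labels introduced here for the parallel, pipelined, and combined designs; register overhead is ignored, as in the text.

```latex
\begin{align*}
\Theta               &= \frac{1}{T} = \frac{1}{10\ \text{ns}} = 100\ \text{MOPS}
                        && \text{(original module)} \\
\Theta_{\text{par}}  &= \frac{4}{10\ \text{ns}} = 400\ \text{MOPS}
                        && \text{(four parallel copies)} \\
\Theta_{\text{pipe}} &= \frac{1}{2.5\ \text{ns}} = 400\ \text{MOPS}
                        && \text{(four pipeline stages)} \\
\Theta_{\text{both}} &= 4 \times \frac{1}{2.5\ \text{ns}} = 1600\ \text{MOPS}
                        && \text{(four pipelined copies in parallel)}
\end{align*}
```

In every case, a single task still needs T = 10 ns from start to finish, so only the throughput improves and the relation Θ = 1/T no longer holds.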
Example 7.1. Design a Verilog model for a pipelined circuit that computes the
average value of five inputs, a, b, c, d, and e, as shown in Figure 7.8. The pipeline
consists of three stages. The first stage separately adds a and b, and c and d,
and then registers the results. The value of e must also be registered so that the
second stage can add the stored value of e to the sums calculated in the first stage.
Finally, in the third stage, the result is divided by 5. The inputs and output are all
signed numbers with fixed-point format s(6.8). Please design the circuit such that
there is no overflow in the intermediate results. For your convenience, the
fixed-point number formats are also presented in the figure, where block Q quantizes
a number using truncation and D denotes the D-type flip-flop.
Solution: Since a multiplication is generally simpler than division, we express the
division by 5 as a multiplication by 1/5 using approximately the binary fixed-point
// ...
output out_valid;
input in_valid;
// ...
input clk;
// ...
assign c_plus_d = c + d;
// ...
  if (in_valid_r)
    sum_reg <= sum;   // Pipeline register 2
always @(posedge clk)
// ...
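Because only fragments of the listing survive here, the three-stage structure of Example 7.1 can be sketched as follows. This is not the original solution. The module name avg5_sketch, the word length W (a placeholder for whatever total width the s(6.8) format implies), the reset of the valid flags, and the constant ONE_FIFTH = floor(2^16/5) are all assumptions; the constant used in the book may differ.

```verilog
// Sketch only: three-stage pipelined average of five signed inputs.
// avg5_sketch, W, ONE_FIFTH and the reset on the valid flags are assumptions.
module avg5_sketch #(
  parameter W = 15                        // Assumed total input word length
) (
  output reg  signed [W-1:0] avg,
  output reg                 out_valid,
  input  wire signed [W-1:0] a, b, c, d, e,
  input  wire                in_valid,
  input  wire                clk,
  input  wire                rst_n
);
  // 1/5 scaled by 2^16 and rounded down, so the result always fits in W bits
  localparam signed [14:0] ONE_FIFTH = 15'sd13107;

  reg signed [W:0]   a_plus_b_reg, c_plus_d_reg;  // One guard bit per sum
  reg signed [W-1:0] e_reg;
  reg signed [W+2:0] sum_reg;                     // Three guard bits, five terms
  reg                in_valid_r, in_valid_rr;

  // Full-precision product; the low 16 bits are the fractional part of 1/5
  wire signed [W+17:0] prod = sum_reg * ONE_FIFTH;

  // Valid flags travel alongside the data, one register per stage
  always @(posedge clk or negedge rst_n)
    if (!rst_n)
      {out_valid, in_valid_rr, in_valid_r} <= 3'b000;
    else
      {out_valid, in_valid_rr, in_valid_r} <= {in_valid_rr, in_valid_r, in_valid};

  always @(posedge clk) begin
    if (in_valid) begin                  // Pipeline register 1
      a_plus_b_reg <= a + b;
      c_plus_d_reg <= c + d;
      e_reg        <= e;
    end
    if (in_valid_r)                      // Pipeline register 2
      sum_reg <= a_plus_b_reg + c_plus_d_reg + e_reg;
    if (in_valid_rr)                     // Pipeline register 3, truncation
      avg <= prod[W+15:16];
  end
endmodule
```

Rounding the constant down makes the result a slight underestimate of the true average, but it guarantees that the truncated product never exceeds the W-bit range of the output.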
Figure 7.9: (a) Interface for the flow-controlled timing. (b) Timing diagram.
Figure 7.10: Pipeline under the condition of variable delay: (a) without FIFO and (b)
with FIFO.
The scheduling and timing diagram of the pipeline without FIFO are shown in
Figures 7.11(a) and 7.11(b), respectively. The numbers in parentheses represent the
task identifications (IDs). As displayed, it takes 16 cycles to finish all 6 tasks in
this case.
Figure 7.11: Pipeline without FIFO: (a) scheduling and (b) timing diagram.
Figure 7.12: Pipeline with FIFO: (a) scheduling and (b) timing diagram.
The scheduling and timing diagram of the pipeline with FIFO are shown in
Figures 7.12(a) and 7.12(b), respectively. As displayed, the pipeline stage A1 of the
pipeline with FIFO never stalls, and A2 always has data to process. Moreover,
A1 and A2 can operate at their full speeds because the FIFO temporarily
buffers data from A1 and provides data to A2 when needed. Hence, the processing
proceeds smoothly, and the overall time to complete the 6 tasks is reduced to 15
cycles.
7.1.3 ARBITER
Sometimes, a resource (bus or slave) may be shared between modules. In such cases,
an arbiter is used to prevent more than one master from occupying the resource at
any given time. At each resource cycle, a master that needs the resource sends
a request signal. The arbiter grants the resource to only one master. A master
that receives no grant must wait for a later cycle. Therefore, the arbiter resolves
the bus contention whenever more than one master requests the bus ownership. To
prevent a master from being starved, a fair arbiter, such as a round-robin arbiter,
should be used. A priority arbiter can also be used if it has been determined that one
master is more critical than the others.
Figure 7.13: Architecture for 4 masters requesting one slave, i.e., the average value
module.
// Arbiter of 4 masters
// Master 0 has the highest priority.
module arbiter (gnt, req, clk, rst_n);
// ...
input clk;
input rst_n;
reg [3:0] gnt;
// ...
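Since most of the arbiter listing is missing here, a fixed-priority arbiter with the same ports can be sketched as follows. This is an assumed implementation, not the original state machine: the module name priority_arbiter_sketch, the signal gnt_ns, and the rule that the current owner keeps its grant while its request stays asserted are choices made for illustration.

```verilog
// Sketch only: registered fixed-priority arbiter for 4 masters.
// Master 0 has the highest priority. Names other than gnt, req, clk and
// rst_n are assumptions.
module priority_arbiter_sketch (
  output reg  [3:0] gnt,
  input  wire [3:0] req,
  input  wire       clk,
  input  wire       rst_n
);
  reg [3:0] gnt_ns;   // Next grant (combinational)

  always @(*)
    if (|(gnt & req))
      gnt_ns = gnt;                      // Owner keeps the resource
    else
      casez (req)
        4'b???1: gnt_ns = 4'b0001;       // Master 0 wins over all others
        4'b??10: gnt_ns = 4'b0010;
        4'b?100: gnt_ns = 4'b0100;
        4'b1000: gnt_ns = 4'b1000;
        default: gnt_ns = 4'b0000;       // No request, no grant
      endcase

  always @(posedge clk or negedge rst_n)
    if (!rst_n) gnt <= 4'b0000;
    else        gnt <= gnt_ns;
endmodule
```

The grant is one-hot, so it can directly select a multiplexer input. A round-robin arbiter would additionally remember the last owner and rotate the priority order after each grant.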
7.1.4 INTERCONNECT
Simple modules are connected with direct point-to-point connections, while a large
and complex system can be organized more flexibly using an interconnect, such as
a bus or a network, as shown in Figure 7.14. The links may be parallel or serialized.
A flow control mechanism is often required to apply back pressure to the clients
when contention occurs. The interconnect may or may not permit multiple
simultaneous operations. To achieve a high throughput, the interconnect must support
multiple concurrent transactions whenever they do not conflict.
A transaction may be fulfilled using a packet format, including, at minimum, a
destination device address D and a payload P of arbitrary length. Because the inter-
connect has been addressed, any client i can communicate with any client j while
requiring only a single pair of unidirectional links for input and output to each client
module. A packet (D, P) sent from i to j, i.e., D = j, may result in j sending a re-
sponse or reply packet (S, Q), i.e., S = i, with payload Q back to i. The payload may
contain a request type (e.g., read or write), a (memory-mapped) address in D, and
data or other arguments for a remote operation.
7.1.4.1 Buses
A bus interconnect, such as the one shown in Figure 7.15, is a general-purpose
interconnect that is widely used in applications with modest performance
requirements. We use the term bus to refer to the collection of signals that form the
interconnect. A bus has the advantages of simplicity, a broadcast nature, and the
ability to serialize and order all transactions. The major disadvantage of a bus is its
performance: it allows only one transaction to be sent at a time. In Figure 7.15,
there are two masters and two slaves in the system. The master granted by the arbiter
connects to the slave it requests as if they communicated over a point-to-point
connection. The signals from the master/slave are selected by the multiplexer u0/u3
and subsequently routed to the slave/master by the demultiplexer u1/u2. If there is
only one slave, the demultiplexer u1 and the multiplexer u3 can be omitted.
A bus interface can convert the module's valid-ready flow control to bus
arbitration, as shown in Figure 7.16, where the detailed interface signals are
displayed. Each module's connection to the interface may include device address,
data, and read/write signals. In Figure 7.16, we assume that there are 4 clients.
Notably, the demultiplexers for s_rw, s_wdata0, ..., s_wdata3, and m_rdata0, ...,
m_rdata3 can be omitted because s_rw and s_wdata for a destined slave can be
indicated by s_valid. Similarly, m_rdata for a destined master can be indicated by
s_ready.
Figure 7.16: Modules connected to the bus interconnect. A source module arbitrates
for access to the bus and then drives its transaction onto the bus. The destined client
receives or transmits the data to the source client depending upon whether it performs
a write or read transaction.
Each client has two interfaces, one master and one slave. A client uses its mas-
ter interface to communicate with the slave interface of another client on the bus.
The granted master will seem to directly connect to its destined slave through the
interconnect implemented using the multiplexers and demultiplexers.
The bus protocol is defined below. When a client wishes to begin a transaction on
the bus, it places the address of the destination client in its address field, m_addr,
the data to be communicated in its data field, m_wdata, and the read/write control
on m_rw, and it asserts m_valid to indicate that these signals are valid. A tristate
driver is sometimes used in an off-chip bus. However, on-chip buses are usually
implemented using multiplexers and demultiplexers. This type of bus interface
connects the valid signal from the source client to the bus arbiter, which performs
arbitration and sends a grant signal, arb_gnt, to multiplex the signals of the
requesting master, including m_valid, m_rw, m_addr, and m_wdata. The multiplexed
m_valid, m_rw, and m_wdata are demultiplexed to the corresponding slave according
to the multiplexed device address, m_addr. Similarly, the slave signals, s_ready and
s_rdata, are multiplexed according to the multiplexed device address, m_addr, and
then demultiplexed by arb_gnt to the corresponding master as the signals m_ready
and m_rdata.
The Verilog codes for a bus interconnect of 4 clients are shown below.
// Interconnect of 4 clients
module bus_interconnect (
  // Master interface
  // ...
  // Slave interface
  s_valid, s_rw,
  // ...
  clk, rst_n
);
// Master interface
output [3:0] m_ready;
// ...
input [3:0] m_valid;
input [3:0] m_rw;
input [1:0] m_addr0;
input [1:0] m_addr1;
input [1:0] m_addr2;
input [1:0] m_addr3;
// ...
output [3:0] s_valid;
output [3:0] s_rw;
// ...
input [3:0] s_ready;
// ...
wire [3:0] arb_gnt;
// ...
mux_demux mux_demux (
  // Master interface
  // ...
  .m_rdata3(m_rdata3),
  // ...
  .m_addr3(m_addr3),
  // ...
  // Slave interface
  // ...
);
endmodule
  // ...
  // Slave interface
  // ...
  s_rdata2, s_rdata3
);
// Master interface
output [3:0] m_ready;
// ...
       m_rdata2, m_rdata3;
input [3:0] gnt;   // Grant
// ...
       m_wdata2, m_wdata3;
// Slave interface
// ...
       s_wdata2, s_wdata3;
input [3:0] s_ready;
// ...
       s_rdata2, s_rdata3;
reg sel_m_valid;
reg sel_m_rw;
reg [1:0] sel_m_addr;
// ...
reg [3:0] m_ready;
// ...
reg [3:0] s_valid;
reg [3:0] s_rw;
// ...
reg sel_s_ready;
// ...
// Master side: gnt selects the granted master and returns the slave response
always @(*)
  case (gnt)
    4'b0001: begin
      sel_m_valid = m_valid[0];
      sel_m_rw    = m_rw[0];
      sel_m_addr  = m_addr0;
      sel_m_wdata = m_wdata0;
      m_ready  = {1'b0, 1'b0, 1'b0, sel_s_ready};
      m_rdata0 = sel_s_rdata;
      m_rdata1 = {DATA_WIDTH{1'b0}};
      m_rdata2 = {DATA_WIDTH{1'b0}};
      m_rdata3 = {DATA_WIDTH{1'b0}};
    end
    4'b0010: begin
      sel_m_valid = m_valid[1];
      sel_m_rw    = m_rw[1];
      sel_m_addr  = m_addr1;
      sel_m_wdata = m_wdata1;
      m_ready  = {1'b0, 1'b0, sel_s_ready, 1'b0};
      m_rdata0 = {DATA_WIDTH{1'b0}};
      m_rdata1 = sel_s_rdata;
      m_rdata2 = {DATA_WIDTH{1'b0}};
      m_rdata3 = {DATA_WIDTH{1'b0}};
    end
    4'b0100: begin
      sel_m_valid = m_valid[2];
      sel_m_rw    = m_rw[2];
      sel_m_addr  = m_addr2;
      sel_m_wdata = m_wdata2;
      m_ready  = {1'b0, sel_s_ready, 1'b0, 1'b0};
      m_rdata0 = {DATA_WIDTH{1'b0}};
      m_rdata1 = {DATA_WIDTH{1'b0}};
      m_rdata2 = sel_s_rdata;
      m_rdata3 = {DATA_WIDTH{1'b0}};
    end
    default: begin   // Also for 4'b1000
      sel_m_valid = m_valid[3];
      sel_m_rw    = m_rw[3];
      sel_m_addr  = m_addr3;
      sel_m_wdata = m_wdata3;
      m_ready  = {sel_s_ready, 1'b0, 1'b0, 1'b0};
      m_rdata0 = {DATA_WIDTH{1'b0}};
      m_rdata1 = {DATA_WIDTH{1'b0}};
      m_rdata2 = {DATA_WIDTH{1'b0}};
      m_rdata3 = sel_s_rdata;
    end
  endcase
// Slave side: the selected address picks the destined slave
always @(*)
  case (sel_m_addr)
    2'b00: begin
      sel_s_ready = s_ready[0];
      sel_s_rdata = s_rdata0;
      s_valid  = {1'b0, 1'b0, 1'b0, sel_m_valid};
      s_rw     = {1'b0, 1'b0, 1'b0, sel_m_rw};
      s_wdata0 = sel_m_wdata;
      s_wdata1 = {DATA_WIDTH{1'b0}};
      s_wdata2 = {DATA_WIDTH{1'b0}};
      s_wdata3 = {DATA_WIDTH{1'b0}};
    end
    2'b01: begin
      sel_s_ready = s_ready[1];
      sel_s_rdata = s_rdata1;
      s_valid  = {1'b0, 1'b0, sel_m_valid, 1'b0};
      s_rw     = {1'b0, 1'b0, sel_m_rw, 1'b0};
      s_wdata0 = {DATA_WIDTH{1'b0}};
      s_wdata1 = sel_m_wdata;
      s_wdata2 = {DATA_WIDTH{1'b0}};
      s_wdata3 = {DATA_WIDTH{1'b0}};
    end
    2'b10: begin
      sel_s_ready = s_ready[2];
      sel_s_rdata = s_rdata2;
      s_valid  = {1'b0, sel_m_valid, 1'b0, 1'b0};
      s_rw     = {1'b0, sel_m_rw, 1'b0, 1'b0};
      s_wdata0 = {DATA_WIDTH{1'b0}};
      s_wdata1 = {DATA_WIDTH{1'b0}};
      s_wdata2 = sel_m_wdata;
      s_wdata3 = {DATA_WIDTH{1'b0}};
    end
    default: begin   // Also for 2'b11
      sel_s_ready = s_ready[3];
      sel_s_rdata = s_rdata3;
      s_valid  = {sel_m_valid, 1'b0, 1'b0, 1'b0};
      s_rw     = {sel_m_rw, 1'b0, 1'b0, 1'b0};
      s_wdata0 = {DATA_WIDTH{1'b0}};
      s_wdata1 = {DATA_WIDTH{1'b0}};
      s_wdata2 = {DATA_WIDTH{1'b0}};
      s_wdata3 = sel_m_wdata;
    end
  endcase
endmodule
Example 7.3. Please design the bus used in Figure 7.13 and integrate arbiter and
avg_value modules.
Figure 7.17: Masters, arbiter (req[], gnt[]), and average value module (in_valid,
a~e, avg).
Solution: The bus is implemented using multiplexers that select the input signals,
a, b, c, d, and e, of the avg_value module, as shown in Figure 7.17. The gnt
signal produced by the arbiter is used both as the select signal of the multiplexers
and, after its bits are OR-ed, as the in_valid signal of the avg_value module. Since
there is only one slave module, the demultiplexers of the master-to-slave bus and the
multiplexers of the slave-to-master bus are omitted. In addition, the gnt signal can
be used to indicate the bus (or slave) owner, so the demultiplexers of the
slave-to-master bus are also omitted. Therefore, the output signal, avg, of the
avg_value module is broadcast to all masters. The output signal, out_valid, of the
avg_value module is not used. Owing to the 3-stage pipeline of the avg_value module,
a master obtains the output result, avg, 3 cycles after it is granted.
Assume that masters 0 and 1 send their requests to the arbiter at the same time,
but the other masters (not shown) do not. The timing diagram is shown in Figure
7.18. When gnt[i] of master i is true, master i is the bus (or slave) owner, and its
output data becomes valid after 3 cycles. The output data lasts for 2 cycles due to
the handshaking overhead when the master is granted; hence, the avg signal in either
of these 2 cycles can be used. As displayed, the throughput is quite low when the
request signal, req, de-asserts 3 cycles after the grant. As a result, a master
occupies the bus (indicated by the grant signal) for 5 cycles due to the pipelined
architecture of the avg_value module.
Example 7.4. Please modify the bus protocol in Example 7.3 using the split trans-
action to enhance the throughput.
Solution: Assume that masters 0 and 1 send their requests to the arbiter at the same time while the other masters (not shown) do not. The request signal de-asserts after the master receives its grant signal instead of waiting for its avg result. Due to the split transaction, when the output data is valid, the master’s grant signal has already gone low. The timing diagram is shown in Figure 7.19. As displayed, a master now occupies the bus for 2 cycles. The avg signal destined for master i lasts for 2 cycles due to the handshaking overhead; hence, the avg signal in either of these 2 cycles can be used. To receive its first output data, a master must still wait 3 cycles after it is granted.
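The gain from the split transaction can be estimated with a rough calculation. It is only a sketch: it assumes that the bus is requested continuously and that handing the bus over costs no extra cycles, and the symbols $N_{occ}$ (cycles a master holds the bus) and $\Theta$ (results per cycle) are introduced here purely for illustration.

```latex
\Theta = \frac{1}{N_{occ}}, \qquad
\Theta_{7.3} = \frac{1}{5} = 0.2, \qquad
\Theta_{7.4} = \frac{1}{2} = 0.5, \qquad
\frac{\Theta_{7.4}}{\Theta_{7.3}} = 2.5
```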
Example 7.5. Please modify the bus protocol in Example 7.4 using the split trans-
action to further enhance the throughput.
Solution: We design a new request signal, req_i, used by the state machine of the arbiter. When gnt[i] is true, it masks req[i], so the internal request req_i[i] used by the state machine becomes false. Consequently, the bus can be handed over to the next master earlier than before.
The timing diagram is shown in Figure 7.20. Due to the split transaction, when the output data is ready, the master’s grant signal has already gone low. As displayed, a master now occupies the bus for only 1 cycle because the request signal used by the state machine of the arbiter is shortened by qualifying it with the grant signal. The pipelined avg_value module can now process the input data of the masters at its maximum speed. To receive its first output data, a master must still wait 3 cycles after it is granted.
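The masking of the request by the grant can be sketched as follows. This is a minimal illustration rather than the original listing; the module name and the choice of four masters are assumptions.

```verilog
// Hypothetical sketch: internal request qualified by the grant signal.
// Four masters are assumed for illustration only.
module req_mask_sketch(req_i, req, gnt);
  output [3:0] req_i;  // request seen by the arbiter state machine
  input  [3:0] req;    // raw requests from the masters
  input  [3:0] gnt;    // grants produced by the arbiter

  // A granted master no longer appears as a requester, so the
  // arbiter can hand the bus to the next master one cycle later.
  assign req_i = req & ~gnt;
endmodule
```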
non-conflicting slave such that the performance can be enhanced fourfold. The mod-
ules, arbiter and mux_demux, reuse those defined in the previous sections. The
throughput of a crossbar can be increased by providing a buffer at the crosspoint,
which decouples input and output scheduling.
...
  // Slave interface
  s_valid, s_rw,
  ...
  clk, rst_n
);
// Master interface
output [3:0] m_ready;
...
input [1:0] m_addr0;
input [1:0] m_addr1;
input [1:0] m_addr2;
input [1:0] m_addr3;
...
// Slave interface
output [3:0] s_valid;
output [3:0] s_rw;
...
input [3:0] s_ready;
...
// Slave 0
wire [3:0] m_ready_s0;
...
wire [3:0] m_valid_s0;
wire [3:0] s_valid_s0;
wire [3:0] s_rw_s0;
...
// Slave 1
wire [3:0] m_ready_s1;
...
wire [3:0] m_valid_s1;
wire [3:0] s_valid_s1;
wire [3:0] s_rw_s1;
...
// Slave 2
wire [3:0] m_ready_s2;
...
wire [3:0] m_valid_s2;
wire [3:0] s_valid_s2;
wire [3:0] s_rw_s2;
...
// Slave 3
wire [3:0] m_ready_s3;
...
wire [3:0] m_valid_s3;
wire [3:0] s_valid_s3;
wire [3:0] s_rw_s3;
...
wire [3:0] arb_gnt_s0;
wire [3:0] arb_gnt_s1;
wire [3:0] arb_gnt_s2;
wire [3:0] arb_gnt_s3;
...
  m_ready_s3;
assign m_rdata0 = m_rdata0_s0 | m_rdata0_s1 |
...
  s_valid_s3;
assign s_rw = s_rw_s0 | s_rw_s1 | s_rw_s2 | s_rw_s3;
...
);
...
);
...
);
...
);
endmodule
does not require data to be refreshed. The term “static” (the S of SRAM) indicates
that the stored data persists indefinitely so long as power is applied to the memory
component.
A memory system in a large digital system is often composed of multiple memo-
ries with different characteristics. For example, on-chip SRAM is characterized by its
low latency and high throughput, while DRAM is characterized by its high capacity.
Moreover, DRAM is often external to an ASIC due to a different process technology.
The number of memories used to realize a memory system is governed by its capac-
ity and throughput. If a memory does not have sufficient capacity, multiple memories
must be used, with just one memory being accessible at any given time. Similarly,
if one memory does not have sufficient bandwidth to sustain the required throughput, multiple memories must be used in parallel. Notably, memory bandwidth is usually expressed in bits/sec, rather than in hertz as in communication systems.
The access policy of a memory can be either random access or sequential. Exam-
ples of random-access memories include SRAM, DRAM, ROM, etc. Examples of
sequential access memories include FIFO and stack. Non-volatile memory systems
such as storage disks used for persistent storage lie outside of the scope of this book.
The process of storing information into memory is referred to as a memory write
operation. The process of transferring the stored information out of memory is re-
ferred to as a memory read operation. SRAM and DRAM can perform both write and
read operations, whereas ROM can perform only the read operation. ROM is one kind of programmable logic device (PLD). A PLD is an integrated circuit with internal logic gates connected through configurable paths. Its contents are written through programming, a hardware procedure that specifies the bits inserted into the hardware configuration of the device. Programming a ROM determines which fuses are to be connected or disconnected. Other PLDs include the programmable logic array (PLA), programmable array logic (PAL), and FPGA.
and write data (wdata) have setup time (tS) and hold time (tH) constraints. The read data (rdata) signal has an access time (tA) constraint. A single SSRAM array typically completes an access in one clock cycle.
An SSRAM can be burst read and written. Every cycle, the address (and data for a
write access) can change to a new and random one without incurring any interruption
or overhead, as shown in Figure 7.24.
Example 7.6. The behavior model of a 64 K×8 single-port SSRAM is written below, where rdata, cen, wen, ren, addr, and wdata are the read data, chip enable, write enable, read enable, address, and data input, respectively. The SSRAM is on-chip so
that the read (rdata) and write (wdata) data buses are separate. For off-chip single-port SSRAMs, the read and write data buses are usually shared, i.e., bidirectional, to reduce the pin count. If cen and ren are true, the read operation is performed; if cen and wen are true, the write operation is performed. The delayed assignment using the timing control, #tA, models the access time. The setup time tS and hold time tH are checked using a specify block.
...
input [7:0] wdata;
reg [7:0] mem [0:65535], tempQ, rdata;
parameter tA = 3;
...
endmodule
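Since only part of the listing is reproduced above, the following self-contained sketch illustrates the modeling style just described: an intra-assignment delay for the access time and a specify block for the setup and hold checks. It is not the original listing; the module name, the priority of write over read, and the limits tS = 1 and tH = 1 are assumptions.

```verilog
// Hypothetical sketch of a behavioral 64 K x 8 single-port SSRAM model.
module ssram_sketch(rdata, cen, wen, ren, addr, wdata, clk);
  parameter tA = 3;             // access time, in simulation time units
  output [7:0]  rdata;
  input         cen, wen, ren, clk;
  input  [15:0] addr;
  input  [7:0]  wdata;
  reg [7:0] mem [0:65535];
  reg [7:0] rdata;

  always @(posedge clk)
    if (cen && wen)
      mem[addr] <= wdata;          // synchronous write
    else if (cen && ren)
      rdata <= #tA mem[addr];      // read data appears tA after the clock edge

  specify
    specparam tS = 1, tH = 1;      // assumed setup and hold limits
    $setup(addr,  posedge clk, tS);
    $setup(wdata, posedge clk, tS);
    $hold(posedge clk, addr,  tH);
    $hold(posedge clk, wdata, tH);
  endspecify
endmodule
```

A simulator reports a timing violation whenever addr or wdata changes inside the setup or hold window around the rising clock edge.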
SRAMs are organized as arrays of cells with row decoders and column multiplex-
ers. Depending on the multiplexing factor, various SRAMs with different numbers
of entries and bit widths can be realized. If we need a RAM with a larger capacity or
higher bandwidth, we can combine multiple RAM arrays via bit-slicing or banking.
The bit-slicing technique can be used to design a memory system with a capacity of
64 K×32-bit using four 64 K×8-bit memories to broaden the data width, as shown
in Figure 7.25. All four memory arrays are accessed in parallel. If cen is true and wen is false, the read operation is performed; if cen and wen are both true, the write operation is performed.
The memory space of the RAM arrays using the bit-slicing technique is organized
in Figure 7.26.
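The bit-slicing arrangement can be sketched structurally as below. The 64 K×8 component model and all module names are hypothetical and serve only to make the example self-contained; the control convention (read when cen is true and wen is false) follows the text.

```verilog
// Hypothetical 64 K x 8 SSRAM model, used only by this sketch
module ram64kx8_sketch(rdata, cen, wen, addr, wdata, clk);
  output [7:0]  rdata;
  input         cen, wen, clk;
  input  [15:0] addr;
  input  [7:0]  wdata;
  reg [7:0] mem [0:65535];
  reg [7:0] rdata;
  always @(posedge clk)
    if (cen && wen)       mem[addr] <= wdata;  // write
    else if (cen && !wen) rdata <= mem[addr];  // read
endmodule

// Bit-slicing: four 8-bit slices share cen, wen, and addr,
// and each slice stores one byte of the 32-bit word.
module ram64kx32_bitslice_sketch(rdata, cen, wen, addr, wdata, clk);
  output [31:0] rdata;
  input         cen, wen, clk;
  input  [15:0] addr;
  input  [31:0] wdata;
  ram64kx8_sketch u3(rdata[31:24], cen, wen, addr, wdata[31:24], clk);
  ram64kx8_sketch u2(rdata[23:16], cen, wen, addr, wdata[23:16], clk);
  ram64kx8_sketch u1(rdata[15:8],  cen, wen, addr, wdata[15:8],  clk);
  ram64kx8_sketch u0(rdata[7:0],   cen, wen, addr, wdata[7:0],   clk);
endmodule
```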
We can also adopt the banking technique to design a 64 K×32-bit memory
system using four 16 K×32-bit components to increase the capacity, as shown in
Figure 7.27. A decoder with enable control and a multiplexer are required. When
Figure 7.26: Memory space of the RAM arrays using the bit-slicing technique.
addr[15:14] is 00, the first memory array (counted from top to bottom) is enabled
via the decoder and its rdata is selected via the multiplexer, and so on. Notably,
addr[15:14] used to select the rdata[31:0] needs to be pipelined because read data
are commonly available one cycle later than the read command.
The memory space of the RAM arrays using the banking technique is organized
in Figure 7.28.
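The banking arrangement, including the pipelined bank select, can be sketched as follows. The 16 K×32 component model and all names are hypothetical; the sketch assumes that read data is available one cycle after the read command, as stated above.

```verilog
// Hypothetical 16 K x 32 SSRAM model, used only by this sketch
module ram16kx32_sketch(rdata, cen, wen, addr, wdata, clk);
  output [31:0] rdata;
  input         cen, wen, clk;
  input  [13:0] addr;
  input  [31:0] wdata;
  reg [31:0] mem [0:16383];
  reg [31:0] rdata;
  always @(posedge clk)
    if (cen && wen)       mem[addr] <= wdata;  // write
    else if (cen && !wen) rdata <= mem[addr];  // read
endmodule

// Banking: addr[15:14] selects one of four banks
module ram64kx32_bank_sketch(rdata, cen, wen, addr, wdata, clk);
  output [31:0] rdata;
  input         cen, wen, clk;
  input  [15:0] addr;
  input  [31:0] wdata;
  wire [31:0] rdata0, rdata1, rdata2, rdata3;
  reg  [1:0]  bank_r;  // addr[15:14] delayed by one cycle

  // Decoder with enable: only the addressed bank is enabled
  wire [3:0] bank_cen = cen ? (4'b0001 << addr[15:14]) : 4'b0000;

  ram16kx32_sketch b0(rdata0, bank_cen[0], wen, addr[13:0], wdata, clk);
  ram16kx32_sketch b1(rdata1, bank_cen[1], wen, addr[13:0], wdata, clk);
  ram16kx32_sketch b2(rdata2, bank_cen[2], wen, addr[13:0], wdata, clk);
  ram16kx32_sketch b3(rdata3, bank_cen[3], wen, addr[13:0], wdata, clk);

  // Remember which bank was read so the multiplexer selects the
  // matching rdata in the following cycle.
  always @(posedge clk)
    if (cen && !wen) bank_r <= addr[15:14];

  assign rdata = (bank_r == 2'd0) ? rdata0 :
                 (bank_r == 2'd1) ? rdata1 :
                 (bank_r == 2'd2) ? rdata2 : rdata3;
endmodule
```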
Both configurations have the same capacity (2 Mb) and bandwidth (4 bytes per cycle). In a bit-sliced memory, all memory arrays must be accessed to complete an operation, since each provides a portion of the final result. In a banked memory, by contrast, only one array needs to be accessed, so power can be saved, but an extra decoder and multiplexer are required.
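The capacity and bandwidth figures quoted above can be checked directly, taking 1 K = $2^{10}$ and 1 Mb = $2^{20}$ bits:

```latex
\begin{aligned}
\text{bit-slicing: } & 4 \times (64\,\mathrm{K} \times 8) = 2^{2} \cdot 2^{16} \cdot 2^{3} = 2^{21}\ \text{bits} = 2\ \mathrm{Mb} \\
\text{banking: }     & 4 \times (16\,\mathrm{K} \times 32) = 2^{2} \cdot 2^{14} \cdot 2^{5} = 2^{21}\ \text{bits} = 2\ \mathrm{Mb} \\
\text{bandwidth: }   & 32\ \text{bits/cycle} = 4\ \text{bytes/cycle}
\end{aligned}
```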
We can simplify the connection of memory components to form a larger memory
system by using a tristate buffer for each of the data outputs, as shown in Figure 7.29.
To drive the output, the enable signal must be true. If the enable signal is false, the
input of the tristate buffer is effectively isolated from its output and, of course, the
component the output is connected to as well. If we use memory components with
tristate data outputs to construct a larger memory system, the output multiplexers,
such as the one shown in Figure 7.27, can be omitted.
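A tristate data output is commonly written in Verilog with a conditional assignment to high impedance. The following is a minimal sketch; the module and port names and the 32-bit width are assumptions.

```verilog
// Hypothetical sketch: read data driven onto a shared bus through a tristate buffer
module tristate_out_sketch(dout, din, en);
  output [31:0] dout;  // shared data bus
  input  [31:0] din;   // read data of one memory component
  input         en;    // output enable
  // Drive the bus when enabled; otherwise release it (high impedance)
  assign dout = en ? din : 32'bz;
endmodule
```

When several such outputs are wired to the same bus, at most one enable may be true at a time; the decoder that selects the bank can provide this guarantee.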
Figure 7.28: Memory space of the RAM arrays using the banking technique.
Therefore, we can combine both bit-slicing and banking techniques to create the memory architecture shown in Figure 7.30. Each of the 16 memory units is 16 K×8 bits, so four memory units (one row) are accessed to read or write 32-bit data at a time, while the other 12 units remain idle to save power. In this manner, the clock speed can also be increased owing to the smaller size of the SRAM chips. Four rows, known as banks, are needed to give the required memory capacity of 2 Mb. Only the read data bus is displayed in Figure 7.30. The bank selected by the decoder drives the data bus, while the non-selected banks stay in tristate so that they cannot affect the read data of the selected bank.
The memory space of the RAM arrays using both bit-slicing and banking techniques is organized in Figure 7.31.
Figure 7.31: Memory space of the RAM arrays using both bit-slicing and banking techniques.
Allowing multiple requests to access multiple banks simultaneously through an arbitrated crossbar increases the aggregate memory bandwidth from one word per cycle to min(N, M) words per cycle, where M is the number of requesters and N is the number of interleaved banks. Each memory access is steered to its bank by decoding its memory address, which enables multiple requests to be granted every cycle. Of course, these banks can be further bit-sliced and/or banked. However, if two requests require access to the same bank at the same time, a conflict occurs and one request must be postponed.
Example 7.7. On-chip SRAMs are the most popular method to realize a large data
storage because the area size of a bit implemented using an SRAM is much smaller
than that of a flip-flop. For example, we can design a 1 K×8-bit FIFO using a FIFO
controller with a dual-port SSRAM, as shown in Figure 7.32.
Solution: The RTL codes of the FIFO controller are presented below. For simplicity, one port of the SSRAM is dedicated to the write access, while the other is dedicated to the read access. To avoid possible timing issues, the output signals of the memory interface, including wen, waddr, wdata, ren, and raddr, are registered outputs, and the input signal, rdata, is latched directly by flip-flops without going through any combinational circuit. An output signal, fifo_rdata_valid, of the FIFO read interface is added to indicate the validity of fifo_rdata.
The SSRAM interface signals, including wen, waddr, wdata, ren, and raddr, typically have hold time constraints. Hold time violations will be fixed during the synthesis stage by inserting buffers on those timing paths.
...
  // SSRAM interface
  ...
  clk, rst_n
);
// FIFO interface
output fifo_full;
input fifo_wr;
input [7:0] fifo_wdata;
output fifo_notempty;
output fifo_rdata_valid;
output [7:0] fifo_rdata;
input fifo_rd;
// SSRAM interface
output wen;
output [9:0] waddr;
output [7:0] wdata;
output ren;
output [9:0] raddr;
input [7:0] rdata;
...
reg [7:0] fifo_rdata;
// FIFO controller
...
  if (!rst_n)
    wr_ptr <= 0;
  else if (fifo_wr)
    wr_ptr <= wr_ptr + 1'b1;
always @(posedge clk or negedge rst_n)
  if (!rst_n)
    rd_ptr <= 0;
  else if (fifo_rd)
    rd_ptr <= rd_ptr + 1'b1;
always @(posedge clk or negedge rst_n)
  if (!rst_n)
    queue_length <= 0;
  else if (fifo_wr && !fifo_rd)
    queue_length <= queue_length + 1'b1;
  else if (fifo_rd && !fifo_wr)
    queue_length <= queue_length - 1'b1;
// SSRAM controller, write port
...
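The listing above does not show how the status flags are derived. One plausible way, given here only as a sketch and not as the author's code, is to compare queue_length with the FIFO depth; the module name and the 11-bit counter width (needed to represent 0 to 1024) are assumptions.

```verilog
// Hypothetical sketch: status flags of a 1 K-entry FIFO derived from queue_length
module fifo_status_sketch(fifo_full, fifo_notempty, fifo_wr, fifo_rd, clk, rst_n);
  output fifo_full, fifo_notempty;
  input  fifo_wr, fifo_rd, clk, rst_n;
  reg [10:0] queue_length;  // assumed width: 0..1024 needs 11 bits

  always @(posedge clk or negedge rst_n)
    if (!rst_n)
      queue_length <= 0;
    else if (fifo_wr && !fifo_rd)
      queue_length <= queue_length + 1'b1;  // write only: one more entry
    else if (fifo_rd && !fifo_wr)
      queue_length <= queue_length - 1'b1;  // read only: one entry fewer

  assign fifo_full     = (queue_length == 11'd1024);
  assign fifo_notempty = (queue_length != 11'd0);
endmodule
```

The sketch relies on the user not writing when fifo_full is true and not reading when fifo_notempty is false.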
Figure 7.37: Timing for write and read operations in an asynchronous SRAM.
to generate signals of memory interface is smaller than the timing constraints, the
signals should be lengthened to meet the specification.
Example 7.8. We want to design the controller in Figure 7.38 for an off-chip 1024×32-bit asynchronous SRAM. Because the SRAM is off-chip, the data signal is a bidirectional bus to reduce the pin count. For simplicity, the chip enable is tied low so that the asynchronous
Solution: Since tW = 5 time units, the write and read enable control signals should last for 2 clock cycles of the asynchronous SRAM controller. Similarly, the setup time tS = 5 and hold time tH = 2 require 2 and 1 clock cycles, respectively. The timing diagram is presented in Figure 7.39. The state machine used to generate the control sequence is also presented.
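The cycle counts follow from rounding each timing constraint up to a whole number of controller clock cycles. In the sketch below, $T_{clk}$ denotes the controller clock period in the same time units; it is a symbol introduced here, since the period is not restated in this passage. The quoted cycle counts are consistent with any period in the range shown.

```latex
N = \left\lceil \frac{t}{T_{clk}} \right\rceil, \qquad
\left\lceil \frac{5}{T_{clk}} \right\rceil = 2
\ \text{and}\
\left\lceil \frac{2}{T_{clk}} \right\rceil = 1
\iff 2.5 \le T_{clk} < 5
```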
...
  state_ns = state_cs;
  case (state_cs)
    ST_IDLE: state_ns = cmdi == WR_CMD ? ST_WR1 :
                        cmdi == RD_CMD ? ST_RD1 : ST_IDLE;
    ST_WR1: state_ns = ST_WR2;
    ST_WR2: state_ns = ST_WR3;
    ST_WR3: state_ns = ST_WR4;
    ST_WR4: state_ns = cmdi == WR_CMD ? ST_WR1 :
                       cmdi == RD_CMD ? ST_RD1 : ST_IDLE;
    ST_RD1: state_ns = ST_RD2;
    ST_RD2: state_ns = ST_RD3;
    ST_RD3: state_ns = ST_RD4;
    ST_RD4: state_ns = ST_RD5;
    ST_RD5: state_ns = ST_RD6;
    ST_RD6: state_ns = ST_RD7;
    ST_RD7: state_ns = cmdi == WR_CMD ? ST_WR1 :
                       cmdi == RD_CMD ? ST_RD1 : ST_IDLE;
    default: state_ns = ST_IDLE;
  endcase
end
// Sequential logic
always @(posedge clk or negedge rst_n)
...
The RTL codes for generating the signals of asynchronous SRAM and internal
command interfaces are written below.
...
  addr,
);
...
input [8:0] addr;
...
always @(*)
  case (addr)
    9'd0: rdata = 16'h0123;
    9'd1: rdata = 16'h4567;
    9'd2: rdata = 16'h89AB;
    9'd3: rdata = 16'hCDEF;
    ...
    default: rdata = 16'h0123;
  endcase
endmodule
The behavior model of a 512×16 ROM is written below. We use the $readmemh
or $readmemb system task to load the memory content.
...
input ren;
input [8:0] addr;
input clk;
...
parameter tA = 3;
...
The $readmemh system task expects the content of the named file to be a sequence
of hexadecimal numbers, separated by spaces or line breaks. Thus, the file rom.data
specified in the above example could contain the data:
// ROM data
0123 4567 89AB CDEF
1009 266A 3115 5435
...
Values are read from the file, rom.data, into successive elements of the variable, data_array, until either the end of the file is reached or all elements of the variable are loaded. Similarly, $readmemb expects the file to contain a sequence of binary numbers.
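Because only a few lines of the ROM model are reproduced above, the following sketch shows how the pieces described in the text might fit together. It reuses the names rom.data, data_array, ren, addr, and tA from the text, but the module name and the overall structure are assumptions rather than the original listing.

```verilog
// Hypothetical sketch of a behavioral 512 x 16 ROM loaded with $readmemh
module rom_sketch(rdata, ren, addr, clk);
  parameter tA = 3;                // access time, in simulation time units
  output [15:0] rdata;
  input         ren;
  input  [8:0]  addr;
  input         clk;
  reg [15:0] data_array [0:511];
  reg [15:0] rdata;

  // Load the ROM contents once at the start of simulation
  initial $readmemh("rom.data", data_array);

  // Synchronous read; the data appears tA after the clock edge
  always @(posedge clk)
    if (ren)
      rdata <= #tA data_array[addr];
endmodule
```

If rom.data holds fewer than 512 words, the remaining elements of data_array keep their initial unknown value (x) in simulation.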
The timing diagram of a table implemented using a combinational circuit and a ROM is displayed in Figure 7.40. The table implemented by combinational circuits typically does not need the read enable, ren, and its output, rdata, is selected through the combinational logic of a multiplexer. By contrast, the table implemented by a ROM usually requires an access time, tA, to obtain the read data, rdata1. Therefore, their outputs are available in different clock cycles.
The contents of ROMs should not need to be changed over the lifetime of the
product. ROMs tend to be used for applications in which the number of manufactured
parts is high. For some applications, it might be preferable to occasionally be able
to update the ROM contents, especially for low-volume production. To accomplish
this, a programmable ROM (PROM), an off-the-shelf chip with no contents stored
in its memory cells, can be used. The memory contents of PROM are programmed
into the cells after manufacturing, either using a special programming device before
the chip is inserted into the system, or using special programming circuits when the
chip has already been installed.
PROMs come in a number of forms. Early PROMs used fusible links to program
the memory cells. Once a link was fused, it could not be replaced, so programming
could only be done once. These devices are now largely obsolete. They were replaced by PROMs that could be erased, either with ultraviolet light (so-called EPROMs), or electrically using a higher-than-normal voltage (so-called electrically erasable PROMs, or EEPROMs).
...
  sum_ab <= a + b;
  sum_cd <= c + d;
  y2 <= sum_ab + sum_cd;
end
Solution: Their architecture and timing diagrams are displayed in Figures 7.42(a) and 7.42(b), respectively. As presented in Figure 7.42(a), y1 requires 3 adders, whereas y2 requires 3 adders and 3 additional registers. Additionally, the critical paths of y1 and y2 contain two adders and one adder, respectively. Regarding the timing in Figure 7.42(b), the result y1 is available in the same cycle in which a, b, c, and d are provided. By contrast, y2 is available 2 cycles after a, b, c, and d are given.
However, if inputs are continuously fed into the circuits, one output is available on both y1 and y2 in every cycle; therefore, they achieve the same throughput when clocked with the same period. Since the critical path of y2 is half that of y1, the maximum clock frequency of y2 can ideally be twice that of y1. Consequently, the throughput of y2 is twice that of y1 at their maximum clock frequencies as well.
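The factor of two can be written out explicitly. This is an idealized sketch: $t_{add}$ denotes the delay of one adder (a symbol introduced here), and the register setup and clock-to-output delays are neglected.

```latex
T_{y1} \ge 2\,t_{add}, \qquad T_{y2} \ge t_{add}
\quad\Rightarrow\quad
\frac{f_{max,y2}}{f_{max,y1}} = \frac{1/t_{add}}{1/(2\,t_{add})} = 2
```

Since both circuits deliver one result per clock cycle, the ratio of their maximum throughputs equals this ratio of maximum clock frequencies.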
From above, the architecture and timing diagrams can give us insights on choos-
ing the most suitable design in terms of area, speed, and even power consumption.
Most importantly, the performance can be assessed in an earlier design stage.
Figure 7.42: (a) Architecture and (b) timing diagrams of addition of 4 numbers with
and without pipeline.
There are usually many alternative datapaths capable of meeting the functional re-
quirements of the system, but some will have advantages over others. Choosing
among them usually involves a tradeoff between area and performance.
We demonstrate another example of the complex multiplier designed using the
architecture and timing diagrams here.
Example 7.11. The design of a module, including the datapath and control unit, to
perform a complex multiplication of two complex numbers, is shown in Figure 7.43.
The operands and product are all in Cartesian form. The real and imaginary parts
of the operands are represented as 16-bit signed s(4.12) fixed-point binary numbers.
The real and imaginary parts of the product are represented as 32-bit signed s(8.24)
fixed-point binary numbers.
Solution: The complex multiplication is sequenced by the timing diagram in
Figure 7.44.
The RTL codes are presented below. There are two complex operands, op1 $= a = a_r + j a_i$ and op2 $= b = b_r + j b_i$, where $j = \sqrt{-1}$, and the result is the output signal, prod $= a \times b = a_r b_r - a_i b_i + j(a_r b_i + a_i b_r)$. The real and imaginary parts are indicated by the suffixes r and i, respectively. Consequently, the real and imaginary parts of the result, prod, both require two real multiplications and one real addition/subtraction. Notably, the real/imaginary parts of op1, i.e., $a_r$/$a_i$, are represented by the signals, op1_r/op1_i, and the real/imaginary parts of op2, i.e., $b_r$/$b_i$, are represented by the signals, op2_r/op2_i, in Figure 7.43.
After plotting the architectural diagram and deriving the fixed-point design, the datapath unit design is quite straightforward. To guarantee that the right operations are performed at the right times, we use a state machine as the control unit to govern the operation sequence and generate the corresponding control signals.
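For checking the sequenced design, a purely combinational reference model of the same arithmetic can be handy. The sketch below is not part of the original design: the module name and the output names prod_r and prod_i are assumptions, and it uses four multipliers in parallel instead of sharing one over several cycles.

```verilog
// Hypothetical reference model of the complex multiplication (not the sequenced design)
module cmul_ref_sketch(prod_r, prod_i, op1_r, op1_i, op2_r, op2_i);
  output signed [31:0] prod_r, prod_i;                 // s(8.24) results
  input  signed [15:0] op1_r, op1_i, op2_r, op2_i;     // s(4.12) operands

  // All operands are signed, so the products are sign-extended to the
  // 32-bit width of the left-hand side before the arithmetic is done.
  assign prod_r = op1_r * op2_r - op1_i * op2_i;       // ar*br - ai*bi
  assign prod_i = op1_r * op2_i + op1_i * op2_r;       // ar*bi + ai*br
  // Corner case: if all four inputs equal -32768, the imaginary part
  // is 2^31, which does not fit in a signed 32-bit result and wraps.
endmodule
```

In a testbench, the outputs of this model can be compared with the result of the state-machine-based design once its result is available.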
...
input in_valid;
...
reg is_MUL3_cs_r;
...
// Control unit
...
  state_ns = state_cs;
  case (state_cs)
    ST_IDLE: state_ns = in_valid ? ST_MUL0 : ST_IDLE;
    ST_MUL0: state_ns = ST_MUL1;
    ST_MUL1: state_ns = ST_MUL2;
...
  if (in_valid) begin
    op1_r_r <= op1_r;
    op1_i_r <= op1_i;
    op2_r_r <= op2_r;
    op2_i_r <= op2_i;
  end
assign mul_op1 = (is_MUL0_cs | is_MUL2_cs) ? op1_r_r : op1_i_r;
...
  if (is_MUL1_cs | is_MUL3_cs)
    mul_prod_r2 <= mul_prod;
assign sum_op1 = mul_prod_r1;
...
Figure 7.45: Unfolding and folding architecture of datapath for two additions.
two adders seems to have a higher resource cost (two adders) and higher speed (one
result per clock cycle). By contrast, the folding one using one adder seems to have
a lower resource cost (one adder, ignoring the cost of multiplexers) and lower speed
(one result in two clock cycles).
However, a closer examination of the critical paths gives us a different view, as
shown in Figure 7.46.
The implementation using two adders has a longer critical path, so its clock period
is longer and clock rate is slower. By contrast, the other implementation method,
using one adder, has a shorter critical path, so its clock period is shorter and clock
rate is faster. If the delay of multiplexers (and a little bit wider adder) can be ignored,
the critical path delay of one adder is half that of two adders, so the clock rate of one
adder implementation can be doubled.
The timing diagram is very simple and is omitted here. The architecture using two adders produces one result per clock cycle. Although the architecture using one adder produces only one result every two clock cycles, its clock rate can be doubled, so the throughput of the one-adder implementation can be comparable to that of the two-adder implementation, with the benefit, mentioned above, of a lower resource cost. Hence, which architecture is better for a specific case should be carefully analyzed before deciding on its RTL design.
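The comparison can be made concrete under the idealization stated above, i.e., ignoring the multiplexer delay and the register overhead. Here $t_{add}$ is the delay of one adder and $\Theta$ is the number of results per unit time; both symbols are introduced only for this sketch.

```latex
\Theta_{two} = \frac{1\ \text{result}}{1 \cdot 2\,t_{add}}, \qquad
\Theta_{one} = \frac{1\ \text{result}}{2 \cdot t_{add}}, \qquad
\Theta_{two} = \Theta_{one} = \frac{1}{2\,t_{add}}
```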
// FIR 1
module fir1(y, x0, x1, x2, x3, h0, h1, h2, h3);
output [12:0] y;
input [7:0] x0, x1, x2, x3;
input [2:0] h0, h1, h2, h3;
reg [12:0] y;
always @(*)
  y = h0 * x3 + h1 * x2 + h2 * x1 + h3 * x0;
endmodule
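The expression computed by FIR 1 corresponds to a 4-tap FIR filter. The sketch below assumes that x3 holds the newest sample $x(n)$ and x0 the oldest sample $x(n-3)$, and that all signals are unsigned, as declared.

```latex
y(n) = h_0\,x(n) + h_1\,x(n-1) + h_2\,x(n-2) + h_3\,x(n-3)
```

The 13-bit output width follows from the worst-case values of the 3-bit coefficients and 8-bit samples:

```latex
h_k\,x \le (2^{3}-1)(2^{8}-1) = 1785 < 2^{11}, \qquad
y \le 4 \times 1785 = 7140 < 2^{13}
```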
Another factor to keep in mind when deciding upon implementation is that the
output is a registered one, so the critical path of this module does not influence those
using the filter output, as shown in FIR 2 below.
// FIR 2
module fir2(y, x0, x1, x2, x3, h0, h1, h2, h3, clk);
output [12:0] y;
input [7:0] x0, x1, x2, x3;
input [2:0] h0, h1, h2, h3;
input clk;
reg [12:0] y;
always @(posedge clk)
  y <= h0 * x3 + h1 * x2 + h2 * x1 + h3 * x0;
endmodule
The direct-form FIR filter can be constructed by inserting more registers such that
one item of input data enters the filter every clock cycle, which is more suitable for
limited memory access and pin number reduction, as shown in Figure 7.49, where
ports x and y are x(n) and y(n), respectively. The critical path is one multiplier as
well as two adders. The area complexity of the circuit is 4 multipliers, 3 adders, and
5 registers.
RTL codes of FIR 3 are written below. It should be noted that correct results start
after the 5th clock. After that, one output is available per clock cycle.
// FIR 3
module fir3(y, x, h0, h1, h2, h3, clk);
output [12:0] y;
input [7:0] x;
input [2:0] h0, h1, h2, h3;
input clk;
reg [12:0] y;
reg [7:0] x0, x1, x2, x3;
always @(posedge clk) begin
  x3 <= x;
  x2 <= x3;
  x1 <= x2;
  x0 <= x1;
  y <= (h0 * x3 + h1 * x2) + (h2 * x1 + h3 * x0);
end
endmodule
If we further pipeline the filter, the critical path is further shortened, as shown in
Figure 7.50. Since the complexity of a multiplier is typically much higher than that
of an adder (provided that the coefficients have a non-negligible number of bits), the
critical path is one multiplier. The area complexity of the circuit is 4 multipliers, 3
adders, and 9 registers.
// FIR 4
module fir4(y, x, h0, h1, h2, h3, clk);
output [12:0] y;
input [7:0] x;
input [2:0] h0, h1, h2, h3;
input clk;
reg [12:0] y;
reg [7:0] x0, x1, x2, x3;
reg [10:0] y0, y1, y2, y3;
always @(posedge clk) begin
  x3 <= x;
  x2 <= x3;
  x1 <= x2;
  x0 <= x1;
  y3 <= h0 * x3;
  y2 <= h1 * x2;
  y1 <= h2 * x1;
  y0 <= h3 * x0;
  y <= (y3 + y2) + (y1 + y0);
end
endmodule
Another equivalent FIR filter structure uses a transposed form that can be con-
structed from the direct-form FIR filter by exchanging the input and output and in-
verting the direction of the signal flow, as shown in Figure 7.51. The critical path
now becomes one adder plus one multiplier. The area complexity of the circuit is 4
multipliers, 3 adders, and 3 registers.
The RTL codes of FIR 5 are written below.
// FIR 5
module fir5(y, x, h0, h1, h2, h3, clk);
output [12:0] y;
input [7:0] x;
input [2:0] h0, h1, h2, h3;
input clk;
reg [10:0] y3;
reg [11:0] y2;
reg [12:0] y1;
reg [12:0] y0;
reg [10:0] y3_r;
reg [11:0] y2_r;
reg [12:0] y1_r;
assign y = y0;
always @(*) begin
  y3 = h3 * x;
  y2 = h2 * x + y3_r;
  y1 = h1 * x + y2_r;
  y0 = h0 * x + y1_r;
end
always @(posedge clk) begin
  y3_r <= y3;
  y2_r <= y2;
  y1_r <= y1;
end
endmodule
The results of different architectures of the FIR filter are summarized in Table 7.3, where ⊗, ⊕, and R represent the multiplier, adder, and register, respectively. When the number of coefficient taps increases, the advantages of FIR 5 become clear: it has a fixed and (almost) the shortest critical path as well as (almost) the smallest area.
In this section, several architectures which can be used to implement the FIR
filter have been demonstrated, each with its own specific pros and cons. Designers
must take a number of factors into consideration and carefully explore, analyze, and
optimize different architectures before writing their RTL codes.
The combination stage merges the two symbols in the Huffman table with the lowest
probabilities of occurrence, adds their probabilities of occurrence, and then sorts
the remaining probabilities of occurrence again, as shown in Figure 7.52. The tree
structure used for the splitting stage introduced later is also displayed. In the merged
symbol set, the symbol, A2 , which has a lower probability of occurrence, is put onto
the left subtree, while the symbol, A1 , with a higher probability of occurrence, is put
onto the right subtree. The sum of the probabilities of occurrence of A1 and A2 is
0.11.
Similarly, the second round of the combination stage is displayed in Figure 7.53.
Figure 7.52: The first round of the combination stage: (a) table and (b) tree.
Figure 7.53: The second round of the combination stage: (a) table and (b) tree.
In the newly merged symbol set, the previously merged symbol set, {A1, A2}, which has the lower probability of occurrence, is put onto the left subtree, while the symbol, A4, which has the higher probability of occurrence, is put onto the right subtree.
The third and the fourth (final) rounds of the combination stage are presented in
Figure 7.54. After the fourth round, the combination stage has completed.
An overview of the Huffman table is presented in Figure 7.55.
The last stage is the splitting stage, which is used to encode the symbols into the
tree structure, as shown in Figure 7.56. Here, the symbol A3 has a higher probability
of occurrence than the symbol set {A1, A2, A4, A5}, so it is assigned bit 0, while the
symbol set {A1, A2, A4, A5} is assigned bit 1. This means that when reading the MSB
of a code with bit 0, it must be the symbol A3. However, when reading the MSB of
a code with bit 1, it could be any one of the symbols in {A1, A2, A4, A5}. Therefore,
extra bits must be used for decoding so that the correct symbol can be selected. The
process continues until all symbols have been assigned a unique code, as displayed
in Figure 7.56.
The final Huffman codes are shown in Table 7.6.
Figure 7.54: (a) The third and (b) the fourth (final) rounds of the combination stage.
Figure 7.56: The splitting stage: (a) table and (b) tree.
Huffman coding produces variable-length codes, so masks are used to indicate the code length. For example, if the binary Huffman code of A5 is 10, then HC5 = XX10 and M5 = 0011, which indicates that the least significant two bits of the code, HC5, are valid, while the most significant two bits are “don’t care”.
the position left to the left-most bit 1 of the symbol masks sym_mask. Similarly, the
Huffman codes of all symbols belonging to the symbol set with the second lowest
number of occurrences are prepended bit 0. The members of the i-th symbol set are
indicated by its bit mapping, sym_bmap[i]. The position to the left of the left-most
bit 1 of the symbol mask sym_mask can be determined by adding all the bits of the
symbol mask. In addition, to derive a new symbol mask, the symbol masks of the
symbols in the lowest two symbol sets left shift with bit 1 shifted in.
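The code and mask update performed in a merging state can be sketched for a single symbol as follows. This is an illustration only, not the book's RTL: the module and port names are hypothetical, the code and mask are assumed to be 4 bits wide as in the HC5/M5 example, and the invalid bits of the incoming code are assumed to be zero.

```verilog
// Hypothetical sketch: prepend one bit to a Huffman code and extend its mask
module huff_prepend_sketch(code_out, mask_out, code_in, mask_in, new_bit);
  output [3:0] code_out, mask_out;
  input  [3:0] code_in, mask_in;
  input        new_bit;   // bit to prepend (1 or 0)

  // The number of valid bits equals the sum of the mask bits, which is
  // also the position just left of the left-most 1 of the mask.
  wire [2:0] pos = mask_in[0] + mask_in[1] + mask_in[2] + mask_in[3];

  // Place the new bit at that position
  assign code_out = code_in | ({3'b000, new_bit} << pos);
  // Shift the mask left with a 1 shifted in
  assign mask_out = {mask_in[2:0], 1'b1};
endmodule
```

For instance, prepending bit 1 to the code 0010 with mask 0011 gives the code 0110 with mask 0111.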
In the sorting state, the numbers of occurrences for all alive symbol sets are sorted
just like the state ST_SORT1, which enables the same sorting circuit to be shared.
The merging and sorting states interleave until only one symbol set remains. Finally,
the state ST_DONE outputs the Huffman codes and their masks.
Figure 7.59 shows an example, using the symbols of Table 7.4, of the Huffman encoding performed by the proposed algorithm. In the state ST_CNT, the numbers of occurrences of the 5 symbols are counted and sym_bmap is initialized for each symbol: 1 at bit 0 for A1, 1 at bit 1 for A2, etc.
In the state, ST_SORT1, the symbols are sorted according to the numbers of
times they occur, i.e., sym_cnt. The bit mappings of them, sym_bmap, are also
reordered accordingly. In the state, ST_MERG1, the two symbol sets, {A1} and
{A2}, which have the two lowest numbers of occurrences, are merged. A new sym-
bol set, {A1, A2}, is then formed and its number of occurrences is calculated by
adding all the occurrences of A1 and A2. Simultaneously, the symbol sets, {A2} and
{A1}, with the lowest and the second lowest numbers of occurrences are respectively
prepended bits 1 and 0 to their Huffman codes, sym_code2 and sym_code1. The
members of the symbol sets with the lowest and the second lowest numbers of oc-
currences are indicated by the bit mappings (after ST_SORT1 state), sym_bmap[4]
and sym_bmap[3], respectively. Accordingly, the symbol masks of the symbols in
the lowest (for sym_mask[1]) and the second lowest (for sym_mask[0]) numbers of
occurrences left shift with bit 1 shifted in.
In state ST_SORT2, the symbols are sorted again according to the new values of
sym_cnt. Their bit mappings, sym_bmap, are reordered accordingly.
In state ST_MERG2, the two symbol sets with the two lowest numbers of
occurrences, {A1, A2} and {A4}, are merged. A new symbol set, {A1, A2, A4}, is
formed and its number of occurrences is the sum of the occurrences of {A1, A2}
and {A4}. Simultaneously, bits 1 (for sym_code2 and sym_code1) and 0 (for
sym_code4) are prepended to the Huffman codes of the symbol sets with the lowest
and the second lowest numbers of occurrences, {A1, A2} and {A4}, respectively.
The members of the symbol sets with the lowest and the second lowest numbers of
occurrences are indicated by the bit mappings (after the ST_SORT2 state)
sym_bmap[3] and sym_bmap[2], respectively. Accordingly, the symbol masks of the
symbols in the sets with the lowest (sym_mask[0] and sym_mask[1]) and the second
lowest (sym_mask[3]) numbers of occurrences are shifted left with a 1 shifted in.
This process continues until all symbols have been encoded.
Consequently, the average code length of the Huffman encoding is 4 × 0.09 + 4 ×
0.02 + 1 × 0.51 + 3 × 0.13 + 2 × 0.25 = 1.84 bits. Compared with a system that
does not use Huffman encoding, which requires 3 bits to represent each of the
5 symbols, the Huffman code saves 3 − 1.84 = 1.16 bits per symbol on average.
Figure 7.60 presents another example. The average code length of the Huffman
encoding is 2 × 0.2 + 3 × 0.2 + 3 × 0.2 + 2 × 0.2 + 2 × 0.2 = 2.4 bits. Compared
with a system that does not use Huffman encoding, the average saving per
symbol is 3 − 2.4 = 0.6 bits.
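Both averages are instances of the same weighted sum. As a compact restatement, with notation assumed here rather than taken from the text ($p_i$ is the probability of symbol $A_i$, $l_i$ is the length of its Huffman code, and $\bar{L}$ is the average code length):

```latex
\bar{L} = \sum_{i=1}^{5} p_i \, l_i ,
\qquad
\text{average saving per symbol} = 3 - \bar{L}
```

With five equally likely symbols, as in the second example, the code lengths 2, 3, 3, 2, 2 give $\bar{L} = 0.2 \times 12 = 2.4$ bits.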
// State machine
reg [3:0] state_ns, state_cs;
parameter ST_IDLE  = 4'b0000; parameter ST_CNT   = 4'b0001;
parameter ST_SORT1 = 4'b0011; parameter ST_MERG1 = 4'b0010;
parameter ST_SORT2 = 4'b0110; parameter ST_MERG2 = 4'b0111;
parameter ST_SORT3 = 4'b0101; parameter ST_MERG3 = 4'b0100;
parameter ST_SORT4 = 4'b1100; parameter ST_MERG4 = 4'b1101;
parameter ST_DONE  = 4'b1111;
always @(*) begin
  state_ns = state_cs;
  case (state_cs)
    ST_IDLE  : if (gray_valid) state_ns = ST_CNT;
    ST_CNT   : if (CNT_valid) state_ns = ST_SORT1;
    ST_SORT1 : state_ns = ST_MERG1;
    ST_MERG1 : state_ns = ST_SORT2;
    ST_SORT2 : state_ns = ST_MERG2;
    ST_MERG2 : state_ns = ST_SORT3;
    ST_SORT3 : state_ns = ST_MERG3;
    ST_MERG3 : state_ns = ST_SORT4;
    ST_SORT4 : state_ns = ST_MERG4;
    ST_MERG4 : state_ns = ST_DONE;
    ST_DONE  : state_ns = ST_IDLE;
  endcase
end
The remaining RTL code, which follows the state machine and the proposed timing
diagram, is given below. The register all_cnt counts the total number of
occurrences until 8'd100 has been reached. When gray_valid is true, the register
sym_cnt[i] counts the number of occurrences of the i-th symbol. In the ST_SORT1
state, the original symbol counts, sym_cnt[i], i = 0, 1, ..., 4, are sorted, and
the sorted results, sort_sym_cnt, obtained by the function sort_result, are
stored back in sym_cnt. Therefore, sym_cnt[0] holds the largest symbol count,
sym_cnt[1] the second largest, and so on. In the other sorting states,
sym_cnt latches the sorting results of the merged numbers of occurrences.
To find the maximum of the 5 numbers sym_cnt[4], sym_cnt[3], ..., sym_cnt[0],
the function sort_result compares sym_cnt[4] and sym_cnt[3] and places the larger
one in sym_cnt[3]; it then compares sym_cnt[3] and sym_cnt[2] and places the
larger one in sym_cnt[2]; next it compares sym_cnt[2] and sym_cnt[1] and places
the larger one in sym_cnt[1]; finally, it compares sym_cnt[1] and sym_cnt[0] and
places the larger one in sym_cnt[0]. The maximum value therefore ends up in
sym_cnt[0]. To find the maximum of the remaining 4 numbers, sym_cnt[4],
sym_cnt[3], ..., sym_cnt[1], the same procedure is repeated and the second
largest value is stored in sym_cnt[1], and so on.
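The compare-and-swap passes just described amount to a bubble sort in descending order. The following self-contained sketch illustrates only that pass structure; it is not the book's sort_result function. The module name sort5_desc, the packed 35-bit port format, and the testbench values are assumptions made for illustration, and the bitmap reordering is left out.

```verilog
// Hypothetical sketch: sorts five 7-bit counts so that the maximum
// ends up at index 0, the second maximum at index 1, and so on.
module sort5_desc (
  input  wire [34:0] cnt_in,   // assumed packing: {cnt4, cnt3, cnt2, cnt1, cnt0}
  output wire [34:0] cnt_out   // same packing, sorted in descending order
);
  function [34:0] sort_desc;
    input [34:0] v;
    reg [6:0]  c [0:4];
    reg [6:0]  tmp;
    reg [34:0] r;
    integer i, j, k;
    begin
      for (k = 0; k <= 4; k = k + 1)
        c[k] = v[7*k +: 7];
      // Pass i = 3 moves the maximum to c[0], pass i = 2 moves the
      // second maximum to c[1], and so on.
      for (i = 3; i >= 0; i = i - 1)
        for (j = 3; j >= 3 - i; j = j - 1)
          if (c[j+1] > c[j]) begin
            tmp    = c[j];
            c[j]   = c[j+1];
            c[j+1] = tmp;
          end
      for (k = 0; k <= 4; k = k + 1)
        r[7*k +: 7] = c[k];
      sort_desc = r;
    end
  endfunction

  assign cnt_out = sort_desc(cnt_in);
endmodule

module sort5_desc_tb;
  reg  [34:0] cnt_in;
  wire [34:0] cnt_out;

  sort5_desc dut (.cnt_in(cnt_in), .cnt_out(cnt_out));

  initial begin
    // cnt0 = 9, cnt1 = 2, cnt2 = 51, cnt3 = 13, cnt4 = 25
    cnt_in = {7'd25, 7'd13, 7'd51, 7'd2, 7'd9};
    #1;
    $display("%0d %0d %0d %0d %0d", cnt_out[6:0], cnt_out[13:7],
             cnt_out[20:14], cnt_out[27:21], cnt_out[34:28]);
    // prints: 51 25 13 9 2
  end
endmodule
```

Because the loop bounds are constants, the nested loops describe a fixed network of ten comparators, which is why one sorting circuit can serve every sorting state.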
integer i;
// ...
if (reset)
  all_cnt <= 0;
else if (gray_valid)
  all_cnt <= all_cnt + 1'b1;
else if (CNT_valid)
  all_cnt <= 0;
assign CNT1 = sym_cnt[0]; assign CNT2 = sym_cnt[1];
// ...
if (reset)
  for (i = 0; i <= 4; i = i + 1)
    sym_cnt[i] <= 0;  // nonblocking, consistent with the rest of the block
else if (gray_valid)
  case (gray_data)
    SYM0_PAT: // Incrementer can be shared
      sym_cnt[0] <= sym_cnt[0] + 1'b1;
    SYM1_PAT: // Incrementer can be shared
      sym_cnt[1] <= sym_cnt[1] + 1'b1;
    SYM2_PAT: // Incrementer can be shared
      sym_cnt[2] <= sym_cnt[2] + 1'b1;
    SYM3_PAT: // Incrementer can be shared
      sym_cnt[3] <= sym_cnt[3] + 1'b1;
    SYM4_PAT: // Incrementer can be shared
      sym_cnt[4] <= sym_cnt[4] + 1'b1;
  endcase
else
  case (state_ns)
    ST_SORT1, ST_SORT2, ST_SORT3, ST_SORT4:
      for (i = 0; i <= 4; i = i + 1)
        sym_cnt[i] <= sort_sym_cnt[i];
    ST_MERG1: begin
      // Adder can be shared
      sym_cnt[3] <= sym_cnt[3] + sym_cnt[4];
      sym_cnt[4] <= 0;
    end
    ST_MERG2: begin
      // Adder can be shared
      sym_cnt[2] <= sym_cnt[2] + sym_cnt[3];
      sym_cnt[3] <= 0;
    end
    ST_MERG3: begin
      // Adder can be shared
      sym_cnt[1] <= sym_cnt[1] + sym_cnt[2];
      sym_cnt[2] <= 0;
    end
    ST_MERG4: begin
      // Adder can be shared
      sym_cnt[0] <= sym_cnt[0] + sym_cnt[1];
      sym_cnt[1] <= 0;
    end
  endcase
function [59:0] sort_result; // Function definition
  input [6:0] sym_cnt0, sym_cnt1, sym_cnt2, sym_cnt3,
              sym_cnt4;
  input [4:0] sym_bmap0, sym_bmap1, sym_bmap2, sym_bmap3,
              sym_bmap4;
  reg [6:0] sort_sym_cnt [0:4];
  reg [4:0] sort_sym_bmap [0:4];
  reg [6:0] tmp_cnt;
  reg [4:0] tmp_map;
  integer i, j;
  begin
    sort_sym_cnt[0] = sym_cnt0; sort_sym_cnt[1] = sym_cnt1;
    sort_sym_cnt[2] = sym_cnt2; sort_sym_cnt[3] = sym_cnt3;
    sort_sym_cnt[4] = sym_cnt4;
    sort_sym_bmap[0] = sym_bmap0;
    sort_sym_bmap[1] = sym_bmap1;
    sort_sym_bmap[2] = sym_bmap2;
    sort_sym_bmap[3] = sym_bmap3;
    sort_sym_bmap[4] = sym_bmap4;
    for (i = 3; i >= 0; i = i - 1)
      for (j = 3; j >= 3 - i; j = j - 1)
        if (sort_sym_cnt[j+1] > sort_sym_cnt[j]) begin
          tmp_cnt = sort_sym_cnt[j]; // Sym count swapped
          sort_sym_cnt[j] = sort_sym_cnt[j+1];
          sort_sym_cnt[j+1] = tmp_cnt;
          tmp_map = sort_sym_bmap[j]; // Bitmap swapped
          sort_sym_bmap[j] = sort_sym_bmap[j+1];
          sort_sym_bmap[j+1] = tmp_map;
        end
    sort_result = {sort_sym_bmap[0], sort_sym_bmap[1],
                   sort_sym_bmap[2], sort_sym_bmap[3], sort_sym_bmap[4],
    // ...
The sorting states, ST_SORT1, ST_SORT2, ST_SORT3, and ST_SORT4, sort the
remaining 5, 4, 3, and 2 symbol sets, respectively. In the sorting states, the bit
mapping, sym_bmap, of each symbol set is determined according to the sorting results.
That is, whenever two symbol counts are swapped, the corresponding bit mappings are
swapped as well.
During the merging states, ST_MERG1, ST_MERG2, ST_MERG3, and
ST_MERG4, the symbol sets with the two lowest numbers of occurrences are merged
by adding their numbers of occurrences and OR-ing their bit maps. Notice that, in
the RTL code, integer variables used in different always blocks should be declared
as different variables; alternatively, local variables of a named block can be used.
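As a sketch of the second option, a named block can declare its own loop variable, so that two always blocks may both use the name i without sharing a module-level integer. The module, register, and block names below are hypothetical and are not taken from the book's listing.

```verilog
// Hypothetical sketch: each named block owns a private loop variable.
module named_block_demo (
  input wire clk,
  input wire reset
);
  reg [6:0] cnt  [0:4];
  reg [4:0] bmap [0:4];

  always @(posedge clk) begin : clr_cnt     // named block
    integer i;                              // visible only inside clr_cnt
    if (reset)
      for (i = 0; i <= 4; i = i + 1)
        cnt[i] <= 0;
  end

  always @(posedge clk) begin : init_bmap   // named block
    integer i;                              // a different variable from clr_cnt.i
    if (reset)
      for (i = 0; i <= 4; i = i + 1)
        bmap[i] <= 5'b00001 << i;           // one-hot bitmap per entry
  end
endmodule
```

Keeping the loop variable local avoids two processes writing the same integer, which is the hazard the text warns about when a single module-level variable is reused.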
integer i1;
assign
// ...
if (reset)
  for (i1 = 0; i1 <= 4; i1 = i1 + 1)
    sym_bmap[i1] <= 1'b1 << i1;
else
  case (state_ns)
    ST_SORT1, ST_SORT2, ST_SORT3, ST_SORT4:
      for (i1 = 0; i1 <= 4; i1 = i1 + 1)
        sym_bmap[i1] <= sort_sym_bmap[i1];
    ST_MERG1: begin
      sym_bmap[3] <= sym_bmap[3] | sym_bmap[4];
      sym_bmap[4] <= 0;
    end
    ST_MERG2: begin
      sym_bmap[2] <= sym_bmap[2] | sym_bmap[3];
      sym_bmap[3] <= 0;
    end
    ST_MERG3: begin
      sym_bmap[1] <= sym_bmap[1] | sym_bmap[2];
    // ...
A 1 is prepended to the Huffman codes, sym_code, of all symbols belonging to the
symbol set with the lowest number of occurrences, at the bit location to the left
of the mask, sym_mask, while a 0 is prepended at the same bit location to the
Huffman codes of all symbols belonging to the symbol set with the second lowest
number of occurrences. This bit location, immediately to the left of the leftmost
1 of the mask, can be calculated by adding all the bits in the mask, sym_mask. The
mask then shifts left with one additional 1 shifted in.
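A small simulation-only sketch may make the bit-location rule concrete. It handles a single symbol; the module name, the task prepend, and the 2-bit return width of sum_bits are assumptions made for illustration, while sum_bits itself mirrors the three-bit sum used in the listing.

```verilog
// Hypothetical demo of prepending code bits at the position given by
// the number of ones in the mask.
module mask_loc_demo;
  // Number of ones in the low three mask bits. For a mask of the form
  // 0..01..1 this equals the index of the first unused code bit.
  function [1:0] sum_bits;
    input [2:0] val;
    begin
      sum_bits = val[0] + val[1] + val[2];
    end
  endfunction

  reg [3:0] sym_code, sym_mask;

  // Prepend one bit to the code, then extend the mask by one 1.
  task prepend;
    input b;
    begin
      sym_code[sum_bits(sym_mask[2:0])] = b;
      sym_mask = {sym_mask[2:0], 1'b1};
    end
  endtask

  initial begin
    sym_code = 4'b0000;
    sym_mask = 4'b0000;
    prepend(1'b1);   // written to code bit 0
    prepend(1'b0);   // written to code bit 1
    prepend(1'b1);   // written to code bit 2
    $display("sym_code=%b sym_mask=%b", sym_code, sym_mask);
    // prints: sym_code=0101 sym_mask=0111
  end
endmodule
```

The mask marks bits 2 to 0 as valid, so the resulting code is 101 read from the most recently prepended bit down to the first one.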
wire [3:0] M1, M2, M3, M4, M5;
integer i2, i3;
// ...
if (reset)
  for (i2 = 0; i2 <= 4; i2 = i2 + 1)
    sym_mask[i2] <= 0;
else
  case (state_ns)
    ST_MERG1: for (i2 = 0; i2 <= 4; i2 = i2 + 1) begin
      if (sym_bmap[4][i2] == 1'b1) begin
        sym_code[i2][sym_mask_0_loc[i2]] <= 1'b1;
        sym_mask[i2] <= {sym_mask[i2][2:0], 1'b1};
      end
      if (sym_bmap[3][i2] == 1'b1) begin
        sym_code[i2][sym_mask_0_loc[i2]] <= 1'b0;
        sym_mask[i2] <= {sym_mask[i2][2:0], 1'b1};
      end
    end
    ST_MERG2: for (i2 = 0; i2 <= 4; i2 = i2 + 1) begin
    // ...
// ...
  input [2:0] val;
  sum_bits = val[0] + val[1] + val[2];
endfunction
PROBLEMS
1. Develop a testbench for Example 7.1 together with a behavioral model that serves
as the golden reference for verifying the design output.
2. In Example 7.2, whenever the current state stays in the ST_M0 state, the arbiter
will check the requests of master 1, master 2, and then, master 3. Hence, the
arbitration is not truly equal to masters 1, 2, and 3 in this state because master 0
always has the highest priority. It is best to redesign the arbiter such that a truly
fair arbiter can be obtained.
3. Write the Verilog codes for an arbiter that takes four high-priority requests and
four low-priority requests and outputs the eight grant signals.
a. Write a baseline module in which low-priority requests can be starved.
b. Write a module that after 4 cycles of granting high-priority requests will grant
a low-priority request. A static tie-breaking scheme for requests with equal
priority is adopted.
c. Modify the above module to implement a round-robin way for breaking ties
within a class with the same priority. That is, among four requests with the
same priority, each request will be granted sequentially. For example, request
0 has the highest priority until it has been granted, after which request 1 will
have the highest priority, etc.
4. Implement a system with master, slave, and arbiter.
a. Please write a master module that can send a request for calculating the aver-
age of 5 numbers with fixed-point number format s(6.8).
b. Instantiate 4 masters and master 0 has the highest priority. The slave and
arbiter use the designs in Examples 7.1 and 7.2, respectively.
c. Analyze the throughput of your system. Improve the throughput by modi-
fying the handshake protocol and designs if necessary so that the pipelined
slave can be fully utilized in every cycle.
d. Verify the above designs by simulations.
5. Modify the module of the average of 5 numbers, avg_value, by moving the two
additions to the first pipeline stage so that the second pipeline stage has only one
addition. Identify advantages of the new design, and then verify your design.
Then develop a behavioral model for the module of the average of 5 numbers,
avg_value, as the gold result to verify the result of avg_value module.
6. The memory units that follow are specified by the number of words multiplied
by the number of bits per word. How many address lines and input-output data
lines will be needed in each case?
a. 8K×16 bits,
b. 2G×8 bits,
c. 16M×32 bits,
d. 256K×164 bits.
7. Implement an 8 × 16 ROM using a case statement.
8. For the RAM module in Example 7.6,
a. add the access time, 1.2 time units, of rdata, i.e., the time required for the
rising edge of the clock to output the available data.
b. add the setup time check for 1 time unit and hold time check of 0.2 time units
for all input ports of RAM, including cen, wen, ren, addr, and wdata.
c. write a testbench to verify the above timing constraints.
9. Model a 256 × 16 bits single-port memory array with 4 signals: data bus,
data[15 : 0], address bus, addr[7 : 0], active-low output enable, ren, and active-
low write enable, wen. The memory stores data[15 : 0] on the falling edge of the
wen. The memory should drive the data bus whenever ren is low.
10. For the ROM module in Section 7.2.2,
a. add the access time, 1.2 time units, of data, rdata, i.e., the time required for
the rising edge of the clock to output the available data.
b. add the setup time check for 1 time unit and hold time check of 0.2 time units
for all input ports of ROM, including ren and addr.
c. write a testbench to verify the above timing constraints.
11. Write a Verilog model which can implement the following:
a. A memory of eight bit-sliced arrays, each with 1K×16 bits.
b. A memory of 16 banked arrays, each with 512×128 bits. Only the necessary
bank should be activated.
12. Please design a 50-entry FIFO. Each entry must have 16 bits.
a. Using dual-port SSRAM.
b. Using single-port SSRAM.
c. Please redesign the FIFO using flip-flops. What are the pros and cons for the
implementations using either SSRAM or flip-flops?
13. Please design an interleaved memory system in which M = 8 requesters and
N = 4 memory banks by modifying the crossbar interconnect. Each memory
bank must have 16 K×32 bits.
14. Design a filter that can detect the bit sequence "1011". For example, if the input
is "0011_1011_0110", the output will be "0000_0001_0010".
15. Design a filter that can detect the bit sequence "1011" and its inverse
"0100". For example, if the input is "0100_1011_0110", the output will be
"0001_0001_0010".
16. Design two filters that can detect the bit sequence "1011" and "1101". For exam-
ple, if the input is "0011_1011_0110", the output will be "0000_0011_0110".
17. Design a module that can add 3 numbers, as shown in Figure 7.62. Two adders
must be implemented to produce one result at a time. Inputs a, b, and e are
signed numbers. No overflow is allowed. We must have flip-flopped (or regis-
tered) output. Datapath typically does not need to be reset. Please add control
signals in_valid and out_valid to start the operation and indicate the valid result,
respectively.
18. Design another module that can add 3 numbers, as shown in Figure 7.63. One
adder must be implemented to produce one result every 2 cycles. Inputs a, b,
and e are signed numbers. No overflow is allowed. We must have flip-flopped
(or registered) output. Datapath typically does not need to be reset. Please add
control signals in_valid and out_valid to start the operation and indicate the valid
result, respectively.
19. Develop a sequential circuit that has a single-bit data input S, and produces an
output Y . The output is 1 whenever S has the same value over three successive
clock cycles; otherwise, output is 0.
20. Use a state machine to design a divide-by-3 pulse width reducer with a single
input in and a single output out. The output is asserted once after every three
(nonconsecutive) cycles that the input has been asserted.
21. Write a Verilog model of a circuit that calculates the average of four 16-bit 2’s
complement signed numbers, without checking for overflow.
22. Design an arithmetic unit to implement 1) when cmd is 0, accumulation of 4
8-bit unsigned numbers, and 2) when cmd is 1, multiplication of two 8-bit un-
signed numbers. When cmd_valid and data_in_valid are asserted (true), cmd
and data_in are valid, respectively. After the command is issued, corresponding
operands are input one at a time. For cmd 0/1, four/two operands take four/two
cycles to input. When complete, output the result (data_out) and indicate it
through data_out_valid. Both commands share the output data bus data_out. Pad
zeros at the most significant bits if the bit width of the result is less than 16 bits.
23. Develop a testbench model for a sequential multiplier. Verify that the results
computed by the multiplier are the same as those produced using multiplication
with real numbers.
24. Develop a Verilog model of a pipelined circuit that compares the maximum of
corresponding values in three inputs, a, b, and c. The pipeline should have two
stages: the first stage determines the larger of a and b and saves the value of c;
the second stage finds the larger of c and the maximum of a and b. The inputs
and outputs are all 14-bit signed integers.
25. Draw a datapath for a pipelined complex multiplier that takes five cycles to do
each multiplication action; the pipelined multiplier should take just two cycles
for each pair of complex operands: one cycle for the four multiplications and
one cycle for the subtraction and addition. The input streams are also pipelined.
26. If the delays for a multiplier, adder, and register clock-to-Q are 7.3, 2.6, and 1.2
ns, find the critical path delays of the various FIR filters in Section 7.3.3.
27. Typically, the coefficients of FIR filters are symmetric. If the coefficients of fir1
are h3 = h0 and h2 = h1 , redesign the fir1 by minimizing its area.
28. Redesign the FIR filter 5 by inserting a new pipeline such that its critical path
has only one multiplier. Compare your design to FIR filter 4.
29. If we want to save the cost of a FIR filter, the folding technique in Figure 7.65
can be adopted to design a single processing element FIR filter. As displayed,
the multiply-accumulate (MAC) operation is fundamental for DSP. Please write
down its RTL codes.
30. Design a 4-bit SISO shifter as shown in Figure 7.66. This module must be serial
in, serial out. Clear signal is asynchronous. An output valid indicator is needed.
31. Design a 4-bit SIPO shifter as shown in Figure 7.67. This module is a serial in,
parallel out. Clear signal is asynchronous. An output valid indicator is needed.
32. Design a 4-bit PISO shifter. This module must be parallel in, serial out. Clear
signal is asynchronous. The data input should arrive at most every 4 clock cycles.
An output valid indicator is needed.
33. MAC design.
a. The two operands of the multiplier may not arrive at the same time, which is
indicated by control signals in_valid0 and in_valid1 (not shown in the figure).
The bit widths of the integer operands are 16-bit. When both operands have
arrived, the MAC operation is performed. After 16 MACs are done, the result
is output and the out_valid signal (not shown in the figure) asserts. Please
design the circuit without overflow. Note that the two operands arrive in a
one-to-one manner, so you do not have to buffer the operands.
b. If you want to decrease the critical path by inserting a re-timing D-FF (in-
troducing one more pipeline stage) at the output of the multiplier, it will be
necessary to redesign the circuit.
34. Please implement the matrix multiplication
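$$
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
=
\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
\tag{7.3}
$$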
using the processing element of a systolic array in Figure 7.69. Determine the
bit width by yourself.
Figure 7.69: Processing element of the matrix multiplication, where R denotes the
register.
35. Suppose a system includes a data source that provides a stream of 16-bit data
values and a processing unit that operates on the stream, as shown in Figure
7.70. The source provides successive values at irregular intervals, sometimes
faster than they can be processed, and sometimes slower. It has a valid output
that is 1 during a clock cycle when a data item is available. The processing unit
has a “start” control input to initiate processing and a “done” output that is set to
1 for a cycle when a data item is processed. Show how the source and processing
unit can be connected using the FIFO, including any control sequences required.
Assume that if the FIFO is full when a new data item is provided by the source,
the data item is dropped from the stream.
36. Please design a 64-entry stack using dual-port SSRAM. The stack only has a
write pointer that specifies the write address, and the value of the write pointer
minus one indicates the read address. Initially, the pointer addresses the bottom
(address 0) of the stack. The pointer increases or decreases by one automatically
upon the write or read operation, respectively.
37. Please redesign the complex multiplier in Example 7.11 by
a. two real multipliers,
b. four real multipliers.
38. Please redesign the complex multiplier in Example 7.11 for a complex multi-
plication of a complex number and a complex conjugation of another number.
That is, if op1= a = ar + jai and op2= b = br + jbi , the result is prod= a × b∗ =
ar br + ai bi + j(−ar bi + ai br ).
39. Please design the datapath and control unit for two kinds of implementations of
two additions, as displayed in Figure 7.45.
40. Please redesign the complex multiplier in Example 7.11 using 3 real multi-
plications. That is, if op1= a = ar + jai and op2= b = br + jbi , the result
is prod = (prod1 − prod2) + j(prod3 − prod2 − prod1), where prod1= ar br ,
prod2= ai bi , and prod3= (ar + ai )(br + bi).
41. Please redesign the FIR filter 1 using single processing element of MAC and
the coefficients in Table 7.2. In this design, one valid output should be produced
every 4 clock cycles.
42. Please redesign the Huffman encoder for 8 symbols, with 128 as the total number
of occurrences.
43. Please design the Huffman decoder based upon the Huffman code in Table 7.4.
44. Design the Huffman code generator using the table lookup (TLU) based on the
Huffman coding illustrated in this Chapter.
45. Design the Huffman decoder using the TLU based on the Huffman coding illus-
trated in this Chapter.
46. Rewrite the Verilog codes of Huffman encoder using the named block for all for
loops.
47. Design a save-our-soul (SOS) detector of Morse code using a FSM. Morse code
is a method used in telecommunication to encode text characters, like alpha-
bet, numbers, and a few punctuation marks, using on/off signals as standardized
sequences of two different signal durations, called dots and dashes. The SOS
encoded in Morse code is three dots (S), a space, three dashes (O), a space, and
three dots again (the second S). In a symbol, a dot and a dash are short and long
periods of an on signal, respectively. Dots and dashes within a symbol are
separated by short periods of an off signal, while symbols are separated by a space
encoded by a long period of an off signal. We assume that a dot is represented by
the input being high for exactly one cycle, a dash is represented by the input be-
ing high for exactly three cycles, dots and dashes within a symbol are separated
by the input being low for exactly one cycle, and that a space is represented by
the input being low for three or more cycles. Note that the input going either high
or low for exactly two cycles is an illegal condition. When an illegal condition
happens, previously detected characters must be dropped and ignored. With this
set of definitions, one legal SOS string is 101010001110111011100010101000.
48. The architecture for 8-point DIT-FFT is shown in Figure 7.71. The complex in-
puts and outputs are parallel-in and parallel-out, respectively. Please design a
pipelined FFT such that consecutive blocks can be input continuously. That is,
the input data in a block, x[n], n = 0, 1, ..., 7, and output blocks, X[k], k = 0, 1, ..., 7,
are available in every clock cycle. Input data x[n] and twiddle factors
$W_N^i = e^{-j 2\pi i/N}$, i = 0, 1, 2, 3, are all 8-bit numbers. Determine the bit
widths of intermediate variables such that no quantization errors occur. Your design must be
correct for all kinds of random input.
49. A block diagram is shown in Figure 7.72. Please finish the design named chip.
The masters are behavioral models, which will be provided together with the
testbench. There are three masters, two slaves (MAC and FFT accelerators),
and one arbiter in the system-on-a-chip (SoC). The data bus, arbitrated by the
arbiter, is shared among all masters and slaves. The timing diagram and interface
protocol are shown in Figure 7.73. The master i, i =0, 1, 2, requests the data
bus and slave j, j =0, 1, by the signals, req[i] and slave_id[i], respectively, to
the arbiter. If slave_id[i]= j, once granted by the arbiter via the signal, gnt[i],
the master can send data via the signal, m_data_out_i, to slave j via the data
bus. Meanwhile, the slave j will be selected by the signal, sel[ j], to receive data.
After the slave finishes receiving data, the operation will start immediately. After
the operation is done, the signal ack[ j] will be asserted until the data has been
transferred to the master i via the signal, s_data_out_ j. After the deassertion of
ack[ j], req[i], gnt[i], and sel[ j] will de-assert, too, after which the next master
will be granted and the above handshake protocol will repeat again.
// ...
parameter M2 = 2'b11;
always @(*)
begin
  state_ns = state_cs;
  case (state_cs)
    // ...
      state_ns = M0;
    else if (cmd_done & req[1])
      state_ns = M1;
    else if (cmd_done)
      state_ns = IDLE;
  endcase
end
always @(posedge clk or negedge rst_n)
// ...

// ...
always @(*)
begin
  state_ns = state_cs;
  case (state_cs)
    // ...
    else if (cmd_done)
      state_ns = IDLE;
  endcase
end
b. Please design the interface of slaves conforming to the above handshake pro-
tocol.
c. Integrate the whole chip using some glue logics shown in the block diagram.
d. How to improve the efficiency of the protocol? Can you improve the effi-
ciency by re-designing the handshake protocol?
50. Design for testability is required for the production of all commercial chips. Its
goal is to make your design controllable and observable. The methodology of
DFT has been well established. To test the digital circuits, all combinations of
possible inputs must be evaluated. For example, to test the two-input NAND
gate shown in Figure 7.74, all possible inputs are “00”, “01”, “10”, “11”, which
should be controllable. Its corresponding outputs are “1”, “1”, “1”, “0”, which
should be observable. If all results are correct, the NAND gate is free of
defects.
To control the inputs of the NAND and NOT gates, the scan D-FF in Figure 7.75
is required.
Besides, all the FFs in the design must be chained together to control the out-
puts of all FFs. The scan input data (scan_data_in), i.e., scan output from the
previous FF of the scan chain, is shifted in the scan chain by the control signal
“scan_enable”. That is, the normal function input is bypassed. The reset
signal is not shown here. The clock for the normal function is generated by the
PLL. To control the clock, the scan_clock is selected when scan_mode is as-
serted. Other circuits in Figure 7.74 are unknown. Please manually design the
DFT for the circuits including NAND and NOT gates in the box in Figure 7.74.
You need to replace the D-FF with the scan D-FF and chain all of the FFs. Then,
write the test pattern for the circuits under the test.
51. a. Figure 7.76 shows three different structures for implementing a 5-tap deci-
mation filter with decimation factor M = 2: (a) original, (b) generalized No-
ble identity-derived, and (c) folded FIR structures. Write RTL codes and the
testbench to verify them. What are the pros and cons of the three different
architectures?
b. Please redesign the above decimation filters using the transposed form.
52. Unknown parameter estimation problem: If received samples, y1 and y2 , are re-
lated to known transmitted data, x1 and x2 , by
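$$
\begin{aligned}
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
&= \begin{bmatrix} x_1 & x_2 \\ x_2 & x_1 \end{bmatrix}
   \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \\
&= a_1 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
 + a_2 \begin{bmatrix} x_2 \\ x_1 \end{bmatrix}
\end{aligned}
\tag{7.4}
$$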
i.e., x∗1 x2 + x1 x∗2 = 0, where (·)∗ denotes the complex conjugate. The solution of
unknown parameters, a1 and a2 , can be obtained by
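$$
\begin{aligned}
& \begin{bmatrix} x_1^* & x_2^* \end{bmatrix}
  \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
  = a_1 \begin{bmatrix} x_1^* & x_2^* \end{bmatrix}
        \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
  + a_2 \begin{bmatrix} x_1^* & x_2^* \end{bmatrix}
        \begin{bmatrix} x_2 \\ x_1 \end{bmatrix} \\
\Rightarrow\ & \begin{bmatrix} x_1^* & x_2^* \end{bmatrix}
  \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
  = a_1 \begin{bmatrix} x_1^* & x_2^* \end{bmatrix}
        \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
  \quad \text{(By orthogonality)} \\
\Rightarrow\ & a_1 = \begin{bmatrix} x_1^* & x_2^* \end{bmatrix}
  \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
  = x_1^* y_1 + x_2^* y_2 . \quad \text{(By unity)}
\end{aligned}
\tag{7.5}
$$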
where xr /yr and xi /yi are respectively real and imaginary parts of x/y, is assumed
to require 4 real multiplications.
a. You can design a circuit using 16 multipliers and obtain the result in one
cycle. Plot your architecture and write down your RTL codes in a module.
Please use parameter to define the bit widths of x and y.
b. Identify the critical path in the above architecture.
c. If our goal is to design a circuit with a small area and satisfactory perfor-
mance, i.e., 4 multipliers, plot your architecture with datapath only. You need
not show the control signals.
d. Identify the critical path using the above architecture.
e. Plot your timing diagram and show how you can obtain the same result using
the first architecture.
f. If our goal is to design a circuit with the smallest area, i.e., one multiplier,
plot your architecture with datapath only. You do not have to show the control
signals.
g. Plot the critical path using the above architecture.
h. Plot your timing diagram and show how you can obtain the same result using
the first architecture.
53. Complete the designs in the previous exercise.
a. Please completely show your datapath and control signals (by FSM) using
the 2nd architecture in problem 52 and verify it using Modelsim (by showing
the timing diagram).
b. Please completely show your datapath and control signals (by FSM) using
the 3rd architecture in problem 52 and verify it using Modelsim (by showing
the timing diagram).
54. Serial-to-parallel cyclic redundancy check (CRC) conversion. The circuit in
Figure 7.77 uses a serial CRC-4 architecture.
where IN[n] is serial input and C[3 : 0] is the CRC result, n =0,1,2,.... The circuit
can be transformed to one using parallel architecture with 4 inputs, IN[3 : 0], at
a time, as shown in Figure 7.78.
where ⊕ denotes the bitwise XOR operation, and the subscript n+1 and n denote
the next and current states, respectively. Let C0 [3 : 0] = C[3 : 0], we have
Similarly, we can express C2 [3 : 0] using C[3 : 0]. Repeat this until C3 [3 : 0] can
be expressed using C[3 : 0].
where min and max denote the minimum and maximum operations, respectively.
56. A sample tree structure is presented below.
Design a tree distance analyzer that can output the distance of the longest path
of all node pairs in a tree. For example, the tree displayed above has the longest
path between nodes 1 and 7, i.e., 4. The interface timing diagram
is presented below. The information of all nodes is input sequentially,
starting from nodes 1, 2, 3, .... When the signal in_valid is true, the first
in_data[6 : 0] is the parent node of the current node, and the following
in_data[6 : 0] values are its child nodes. A leaf node has no child nodes.
When the information of all nodes has been transferred, the signal last is true.
After the processing is done, the signal
out_valid becomes true and out_data[6 : 0] represents the longest distance. In
Figure 7.80, there are 3 nodes. Nodes are transmitted sequentially from node 1,
2, to 3. The first node has 3 child nodes, the second node has 0 child nodes, and
the third node has 1 child node. All nodes have only one parent node, and the
parent node of the root node is 0. We assume that the maximum number of nodes
is 100. Each node has a maximum of 3 child nodes.
a. Write your RTL codes to build up a table of the tree data structure and its
child count. In the table of the sample tree, the parent node of node 8 is 0.
Therefore, node 8 is the root node. Sample RTL codes are given below for
your reference.
Figure 7.81: Table of the sample tree data structure together with its child count.
// ...
reg [6:0] ttab_raddr;
reg [1:0] ttab_caddr;
reg [6:0] root;
// ...
integer i1, i2, i3;
// ...
if (reset) begin
  ttab_raddr <= 1;
  for (i2 = 1; i2 <= MAX_NODE; i2 = i2 + 1)
    ttab_chcnt[i2] <= 0;
end
else if (~in_valid & in_valid_r) begin
  ttab_raddr <= ttab_raddr + 1;
  ttab_chcnt[ttab_raddr] <= ttab_caddr - 1;
end
always @(posedge clk) begin
  // ...
  root = 0;
  for (i3 = 1; i3 <= MAX_NODE; i3 = i3 + 1)
    if (ttab_par[i3] == 0) root = i3;
end
b. For each node, find the top 2 maximum distances between it and its child
nodes. For example, in Figure 7.79, node 1 has no child nodes. Therefore, its
top 2 maximum distances are both 0. Node 2 has 3 child nodes, 5, 3, and 6,
with the maximum distances of 2, 1, and 2, respectively. Therefore, its top 2
maximum distances are both 2. Node 6 has 1 child node. Therefore, its top 1
and top 2 maximum distances are 1 and 0, respectively.
Initiated from the root node, 8, you need to traverse all nodes and find their
top 2 maximum distances, max 1 and max 2, in Figure 7.82. The column
labeled by “child nodes processed” is a set of counters of all nodes used to
record the number of child nodes that have been processed. The counter of
a specific node increments when one of its child nodes is done (after finding
the top 2 maximum distances of a child node). When all child nodes are
processed, i.e., the counter reaches its maximum value indicated by child_cnt,
the final top 2 maximum distances of the specified node can be determined.
Write your RTL codes to build up the following table. Sample RTL codes are
given below for your reference.
Figure 7.82: Table for the top 2 maximum distances between a node and its child
nodes.
integer i4, i5;
// ...
always @(*)
// ...
if (reset)
  for (i4 = 1; i4 <= MAX_NODE; i4 = i4 + 1) dtab_chid[i4] <= 0;
else if (is_RETN_ns)
  dtab_chid[pa_node] <= dtab_chid[pa_node] + 1;
assign tmp_max = dtab_max1[cur_node] + 1;
// ...
if (reset)
// ...
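The update of max 1 and max 2 when one child node completes can be sketched as a small combinational block. This is only an illustrative sketch under assumptions: the module name top2_update and all of its ports are hypothetical and not part of the reference design, and cand plays the role of tmp_max above (the child's max 1 plus 1).

```verilog
// Hypothetical helper (not from the reference design): merges one
// candidate distance into a node's running top-2 maximum distances.
module top2_update (
  input  wire [6:0] max1_in,   // current largest distance
  input  wire [6:0] max2_in,   // current second-largest distance
  input  wire [6:0] cand,      // candidate, e.g., the child's max 1 plus 1
  output reg  [6:0] max1_out,
  output reg  [6:0] max2_out
);
  always @(*) begin
    if (cand > max1_in) begin
      // New overall maximum: the old maximum becomes the runner-up
      max1_out = cand;
      max2_out = max1_in;
    end
    else if (cand > max2_in) begin
      // Candidate only displaces the runner-up
      max1_out = max1_in;
      max2_out = cand;
    end
    else begin
      max1_out = max1_in;
      max2_out = max2_in;
    end
  end
endmodule
```

For node 2 of the sample tree, feeding the candidates 2, 1, and 2 in turn, starting from (0, 0), gives (2, 0), (2, 1), and finally (2, 2), which agrees with the statement that its top 2 maximum distances are both 2.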
c. According to the top 2 maximum distances of all nodes, the distance of the
longest path beneath a node can be decided, which is max 1 + max 2 of the
node. As presented in Figure 7.83, the longest distance beneath node 2 is 4.
The maximum distance of all node pairs in a tree is the distance of the longest
path beneath the root node, which is derived on the fly by keeping the maximum
distance over all completed nodes, as shown below.
reg  [6:0] ans;
wire [6:0] result, out_data;
assign result = dtab_max1[cur_node] +
                dtab_max2[cur_node];
always @(posedge clk or posedge reset)
  if (reset) ans <= 0;
  else if ((is_RETN_ns | is_DONE_ns) && result > ans)
    ans <= result;
assign out_valid = is_DONE_cs;
Complete the whole design with the timing diagram shown below, where
dtab_chid denotes the set of counters for the “child nodes processed”, and
dtab_max1 and dtab_max2 denote the top 1 and top 2 maximum distances of
a node (relevant to its child nodes). The timing diagram for establishing the
tree tables, ttab_par, ttab_ch1, ttab_ch2, ttab_ch3, and ttab_chcnt, is omitted
here.
d. For leaf nodes, the memory space for the “max 1”, “max 2”, and “child nodes
processed” entries in Figure 7.82 is wasted. Please use a stack to store the “max 1”,
“max 2”, and “child nodes processed” values to save chip area.
Figure 7.84: Timing diagram of the tree distance analyzer for the sample tree.
8 Advanced System Designs
This chapter discusses several advanced system-level design issues, including
DRAM, flash memory, synchronizer design, and a crypto processor. DRAM chips are
commonly used for main memory. Flash memory is a solid-state non-volatile computer
memory storage medium that can be electrically erased and reprogrammed.
Synchronizer design is often encountered in ASICs wherever signals need to be
transferred from one clock domain to another. We will see that violating setup and hold
times may result in the flip-flop entering an illegal unstable state in which its state
variable is neither a logic 1 nor a logic 0. The system-level design for the
synchronization of signals across different clock domains is comprehensively presented in
three sections: single-bit synchronizer, deterministic multi-bit synchronizer, and
non-deterministic multi-bit synchronizer (with and without flow control). An embedded
coprocessor can offload the main processor. We introduce a specialized crypto
processor for the Advanced Encryption Standard (AES). Finally, the digital design of a
component labeling engine is illustrated from its algorithm to its RTL design.
time of a DRAM to be much longer than that of an SRAM. Moreover, when the
DRAM cell is separated from the bit line by turning off the transistor, the charge
stored on the capacitor will still gradually leak. Thus, the control circuit uses a
process called refreshing to restore the charge on the capacitor before the charge
decays too much. Since DRAM cannot be accessed normally during a refresh,
refreshing must be interleaved between normal memory accesses. Typically,
refreshing is performed periodically, and the DRAM controller treats it as a
higher-priority operation than normal memory accesses.
The cells in a DRAM are usually organized into several 2-D arrays, called banks.
A bank consists of several rows and columns. There are three steps to read and write
a specific address in DRAM, including row activation, column access, and precharge.
• Row activation: a specific row of a bank is activated and read into the sense
amplifiers, which destroys the stored charge in capacitors.
• Column access: commands are used to read and write specific columns of
the activated row.
• Precharge: the precharge command writes the row back into the bank.
For multiple reads and writes issued to different columns of the same row and bank,
no further activation or precharge is required between them. Moreover, rows in
different banks can be activated independently of one another.
As shown in Figure 8.2, an activation command for row 0 of bank 0 is issued in
cycle 1. A second activation command, for row 6 of bank 3, is issued in cycle 2. After
a delay of tAC, a column access command to read column 0 of row 0 of bank 0 is
issued in cycle 3. After tRA, data are output in cycle 7 with a burst length of 2. Row
0 of bank 0 must be precharged in cycle 7 before a different row, row 2, in the same bank
(bank 0) can be accessed. The DRAM controller must wait tPC to allow the precharge
operation of row 0 of bank 0 to complete before performing another row activation
(row 2) on the same bank (bank 0) in cycle 11. If another column in the same row
is accessed, no precharge is necessary. For example, the column access command to
read column 1 of row 6 of bank 3 in cycle 9 follows that of column 2 of row 6 of
bank 3. Accesses to other banks can be interleaved between those to bank 0.
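The three situations described above (the requested row is already active, the bank has no active row, or a different row is active) can be sketched as a small combinational decision. This is a hedged illustration rather than a DRAM controller: the module name dram_cmd_sel, its ports, and the ROW_W parameter are hypothetical, and the timing parameters tAC, tRA, and tPC are not modeled.

```verilog
// Illustrative sketch only: decides which commands must precede the
// column access of a request, given the state of the target bank.
module dram_cmd_sel #(
  parameter ROW_W = 4                 // assumed row-index width
) (
  input  wire             row_open,   // bank currently has an activated row
  input  wire [ROW_W-1:0] open_row,   // index of that activated row
  input  wire [ROW_W-1:0] req_row,    // row targeted by the new request
  output wire             need_precharge,
  output wire             need_activate
);
  // Same row already activated: column access can be issued directly
  wire row_hit = row_open && (open_row == req_row);

  // A different activated row must be written back first
  assign need_precharge = row_open && !row_hit;
  // Any request that is not a row hit needs a row activation
  assign need_activate  = !row_hit;
endmodule
```

With row 0 of a bank active, a request to row 0 needs neither command, whereas a request to row 2 needs a precharge followed by an activation, as in the example of bank 0 above.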
Figure 8.3: Synchronizers are needed between asynchronous and synchronous sys-
tems. They are also needed between different clock domains, pclk and mclk.
violations are unavoidable. We will see that violating setup and hold times may re-
sult in the flip-flop entering an illegal unstable state in which its state variable is
neither a logic 1 nor a logic 0. Even worse, it will stay in the metastable state for an
unknown period of time before it finally reaches one of the two stable states (0 or
1). This synchronization failure will cause serious problems in digital systems. If the
unstable state of the flip-flop output is sampled, it will cause an indeterminate result.
Synchronization failure happens in two distinct scenarios, as shown in Figure
8.3. First, input signals may come from truly asynchronous signal sources. They must
be synchronized before being used in a synchronous digital system. For example, a
keypad pressed by a human produces an asynchronous signal, which
can transition at any time.
Second, a synchronous signal may move from one clock domain to another. A
clock domain is simply a set of signals that are all synchronous with respect to a
single clock. For example, in a computer system, the processor may operate in one
clock domain, pclk = 3 GHz, whereas the memory system operates in a different one,
mclk = 800 MHz. These two clocks have different frequencies. Signals generated by
the processor that are synchronous to pclk cannot be directly used in the memory
system synchronous to mclk, and vice versa. They must be synchronized with the
destination clock domain before being used.
8.3.2 METASTABILITY
After a synchronization failure happens owing to a timing constraint violation, most
of the illegal states of flip-flops decay to a legal 0 or 1 state. However, it is possible
that an illegal metastable state might persist for an arbitrary amount of time before
reaching a legal state.
The CMOS master-slave flip-flop is displayed in Figure 8.4.
After the clock rises, the input transmission gate t1 of the master latch is off and
the feedback tristate inverter u3 is enabled. The master latch thus becomes two back-
to-back inverters. Additionally, the transmission gate t2 of the slave latch is on and
the feedback tristate inverter u5 is disabled. The slave latch thus becomes transparent.
Figure 8.4: A CMOS master-slave flip-flop constructed from two CMOS latches.
Thus, the equivalent circuit of the flip-flop is that of two back-to-back inverters, as
shown in Figure 8.5(a).
Figure 8.5(b) shows the transfer characteristics of the output V2 of the forward
inverter as a function of V1, i.e., V2 = f(V1) (solid line), where f(·) denotes the
transfer function of an inverter, and the transfer characteristics of the output V1 of the
feedback tristate inverter as a function of V2, i.e., V1 = f(V2) (dashed line). There are
three intersections of the two transfer characteristics in the figure. In the absence of
disturbances, these points are stable in the sense that the voltages V1 and V2 will never
change. At any point other than these three stable points, the circuit quickly converges
to one of the outer two stable points, i.e., ∆V ≡ V1 − V2 = +VDD or ∆V = −VDD.
For example, if we disturb the state slightly from the middle metastable point, the
state will quickly converge to the nearest outer stable point. As presented in Figure
8.6, when V1 slightly increases, it will decay to ∆V = +VDD , i.e., V1 = VDD and V2 = 0
through trace 1. Similarly, when V1 slightly decreases, it will decay to ∆V = −VDD ,
i.e., V1 = 0 and V2 = VDD through trace 2. By contrast, if we disturb the state
slightly from either of the two outer stable points, the state returns to that stable point
again. A state, like the middle one, where a small disturbance causes the system
to leave that state, is called metastable.
The behavior of the metastable state is illustrated in Figure 8.7. The ball
would remain on the top of the hill if it were perfectly balanced. In practice, nothing is
perfect, and the ball will eventually roll to one side or the other.
The detailed transistor-level schematic of Figure 8.5(a) is presented in Figure 8.8.
We assume that all n-channel and p-channel FETs are perfectly matched
such that $k_n = k_p$, where $k_n = \mu_n C_{ox} (W/L)_n$ denotes the device characteristics of
the n-channel FET, $k_p = \mu_p C_{ox} (W/L)_p$ denotes the device characteristics of the p-channel
FET, $\mu_n$/$\mu_p$ are the mobilities of electrons/holes, $C_{ox}$ is the gate capacitance per unit
Figure 8.5: (a) When the clock is high, the master latch acts as two back-to-back
inverters, and the slave latch becomes transparent. (b) Transfer characteristics of the
back-to-back inverters.
Figure 8.7: A metastable state that is represented by a ball at the top of a hill.
Figure 8.8: Transistor schematic of the master-slave flip-flop when the clock is high.
Figure 8.9: Convergence of ∆V (t) toward a stable state: (a) ∆V (0) > 0 and (b)
∆V (0) < 0.
where $\tau = \frac{C}{k V_{DD}}$ depends on the characteristics of the devices. In other words, the rate
of change of ∆V is directly proportional to its magnitude. In fact, these dynamics of the
two back-to-back inverters hold not only at the metastable state but whenever the
transistors are in the saturation region.
The solution of this differential equation is simply given by
$$\Delta V(t) = \Delta V(0)\exp\left(\frac{t}{\tau}\right). \tag{8.5}$$
As displayed in Figure 8.9, when ∆V (0) > 0, ∆V (t) will converge toward ∆V (t) =
+VDD , where tCOV denotes the convergence time. By contrast, when ∆V (0) < 0, ∆V (t)
will converge toward ∆V (t) = −VDD .
Therefore, given ∆V (0) > 0, the amount of time that a synchronization failure
takes to converge to ∆V = +VDD is given by
$$t_{COV} = -\tau \log\frac{\Delta V(0)}{V_{DD}} \tag{8.6}$$
where the natural logarithm, log(·), denotes the logarithm to the base e ≈ 2.71828 of
a number. Likewise, for ∆V(0) < 0, the amount of time that a synchronization failure
takes to converge to ∆V = −VDD can be derived similarly and is omitted here. The
convergence time tCOV plotted against ∆V(0) is displayed in Figure 8.10. When ∆V(0) = 0,
the flip-flop is in the metastable state, and the convergence time tCOV = ∞. By contrast,
when ∆V(0) = ±VDD, the flip-flop is already in a stable state, and the convergence
time tCOV = 0.
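The step from Equation 8.5 to Equation 8.6 can be written out by setting ∆V(tCOV) = VDD and solving for tCOV. The numerical case ∆V(0) = VDD/1000 in the second line is an assumed illustration, not a value taken from the text.

```latex
\begin{align*}
V_{DD} = \Delta V(0)\exp\!\left(\frac{t_{COV}}{\tau}\right)
  &\;\Longrightarrow\;
  t_{COV} = \tau\log\frac{V_{DD}}{\Delta V(0)}
          = -\tau\log\frac{\Delta V(0)}{V_{DD}}, \\
\Delta V(0) = \frac{V_{DD}}{1000}
  &\;\Longrightarrow\;
  t_{COV} = \tau\log 1000 \approx 6.9\,\tau .
\end{align*}
```

Each additional factor of 10 by which ∆V(0) is closer to zero adds only about 2.3τ to the convergence time, which is why a modest waiting time already resolves most metastable events.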
Assume that a flip-flop initially has a ∆V(0) uniformly distributed over the
interval (0, +VDD). Given that the stable state ∆V(t) = VDD is eventually reached,
the probability of state error, PSE, that the convergence time tCOV to the
stable state is longer than the waiting time, tw, can be written as
$$P_{SE} = P(t_{COV} > t_w) = \exp\left(-\frac{t_w}{\tau}\right).$$
Figure 8.11: Input signal of a flip-flop cannot change during the slash area due to
setup time and hold time constraints.
cycle, the probability of timing error, PTE , for the setup time or hold time violation
that may cause the sampling flip-flop to enter the unstable state can be written as
$$P_{TE} = \frac{t_S + t_H}{t_C} = f_C (t_S + t_H). \tag{8.9}$$
If the transition frequency of the asynchronous input signal is fI, the frequency of
timing errors is given by
$$f_{TE} = f_I P_{TE}. \tag{8.10}$$
Example 8.1. Assuming that a transition of the asynchronous input signal is equally
likely to occur at any time instance during a cycle, what is the frequency of timing
errors for tS = tH = 0.1 ns, tC = 2 ns, and fI = 10 MHz?
Solution: By putting the values into Equation 8.9, we find PTE = (tS + tH)/tC =
0.2 ns / 2 ns = 0.1. Thus, according to Equation 8.10, the frequency of timing errors
is fTE = fI PTE = 10 MHz × 0.1 = 1 MHz, which is relatively high.
states of FF1 to decay before resampling it with FF2 to produce the synchronized
output arr . As a result, the input signal is sampled by the output clock twice, so
the simple synchronizer is also called the double synchronizer. The double synchro-
nization scheme isolates the unstable signal ar from affecting the clkout domain and
allows (approximately) a clock period of clkout to wait for the unstable signal, ar , to
converge to one of the stable states.
The RTL codes of the double synchronizer are listed below.
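A minimal sketch of such a double synchronizer is given here under assumed names: the module name double_sync and the ports d, clkout, and d_sync follow the naming style of the other synchronizer listings in this section, and the registers d_r and d_rr correspond to FF1 and FF2 (signals ar and arr).

```verilog
// Sketch of a two-flip-flop (double) synchronizer; all names are assumptions.
module double_sync (
  output wire d_sync,   // synchronized output (arr)
  input  wire d,        // registered input from the clkin domain (a)
  input  wire clkout    // destination clock
);
  reg d_r, d_rr;        // FF1 (ar) and FF2 (arr)

  assign d_sync = d_rr;

  always @(posedge clkout) begin
    d_r  <= d;    // FF1 may become unstable on a setup/hold violation
    d_rr <= d_r;  // FF2 resamples FF1 about one clkout period later
  end
endmodule
```

Only d_sync should be used by logic in the clkout domain; d_r is kept private so that the possibly unstable signal does not fan out.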
Figure 8.13 displays the clock domain crossing (CDC) issue. It should be empha-
sized that the input signal a in clock domain clkin must be a registered output so that
there is only one transition in signal a at the clock rising edge, and the instability
issue can be reduced to the minimum. Besides, the frequency of clkout is commonly
higher than that of clkin so that, if the signal a to be synchronized is a strobe signal
that is high for one cycle in its clock domain, clkin, the synchronized signal arr can still
capture the strobe signal.
If the unstable state of ar settles down to the correct/wrong logic of a, the
synchronized signal, arr, will appear to have the correct logic value sooner/later, as shown in
Figures 8.14(a)/(b). When ar settles down to the wrong logic of a, at the next rising
edge of clkout, it will sample the correct logic of a without any possibility of timing
violation.
Figure 8.14: (a) The unstable state of ar settles down to the correct logic of a. (b)
The unstable state of ar settles down to the wrong logic of a.
It is interesting to understand how well the synchronizer works. In other words,
the probability of arr entering an illegal state after a transition on a needs to be
derived. This will happen only if (1) FF1 enters an illegal state and (2) this state has
not converged to one of the stable states before ar is resampled by FF2. FF1 enters an
illegal state with probability PTE, and it will remain in this state after a waiting time
tw with probability PSE. Thus, the synchronization error probability of FF2 entering
an illegal state is given by
$$P_E = P_{TE} P_{SE} = \frac{t_S + t_H}{t_C}\exp\left(-\frac{t_w}{\tau}\right). \tag{8.11}$$
The synchronization error frequency of FF2 entering an illegal state is thus
fE = fI PE . (8.12)
In fact, FF1 has a delay tCQ to reflect its output ar . Also, the FF2 captures its input
ar at the time instance tS before the rising edge of clkout. Therefore, the relationship
between tw , tCQ , tS , and tC is presented in Figure 8.15.
Figure 8.16: Double synchronizer with N = 2. Flip-flops in clkout domain are en-
abled every 2 clock cycles.
Hence, the waiting time tw is not a complete clock cycle, but rather a clock cycle
minus the required overhead:
tw = tC − tS − tCQ (8.13)
where tCQ denotes the clock-to-Q delay.
Example 8.2. If we have tS = tH = tCQ = τ = 0.1 ns, tC = 2 ns, and fI = 10 MHz, find
the probability of FF2 entering an illegal state and the frequency of synchronization
failure.
Solution: We have tw = 1.8 ns and the synchronization error probability of FF2
entering an illegal state is thus given by
$$P_E = \frac{0.1\ \text{ns} + 0.1\ \text{ns}}{2\ \text{ns}}\exp\left(-\frac{1.8\ \text{ns}}{0.1\ \text{ns}}\right) = 1.523 \times 10^{-9}.$$
If signal a has a transition frequency of fI = 10 MHz, then the synchronization
error frequency is given by fE = fI PE = (10 MHz)(1.523 × 10^−9) = 0.0152 Hz.
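To judge whether this error frequency is acceptable, it can help to convert it into an average time between synchronization failures. The sketch below simply takes the reciprocal of fE, which assumes that fE is interpreted as an average failure rate.

```latex
\[
\frac{1}{f_E} = \frac{1}{0.0152\ \text{Hz}} \approx 66\ \text{s}
\]
```

Under that assumption, the double synchronizer of this example would fail roughly once per minute on average.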
If the synchronization failure probability is not low enough, we can increase the
waiting time to reduce it, because the probability decreases exponentially with the
waiting time. A good way to accomplish this is to add a clock enable signal to the two
flip-flops so that they are enabled every N clock cycles. Figure 8.16 presents
the double synchronizer with N = 2. The frequency of clk_en is half that of clkout.
Therefore, the flip-flops FF1 and FF2 in the clkout domain are enabled every 2 clock
cycles.
This can extend the waiting time to tw = NtC − tS − tCQ , such as that of N = 2 in
Figure 8.17. Waiting longer with clock enable is more efficient than using multiple
flip-flops in series. Moreover, in addition to a reduced number of flip-flops, we only
pay the flip-flop overhead tS + tCQ once in the waiting time tw , rather than once per
flip-flop.
Example 8.3. In our example above, change the simple synchronizer to wait for two
clock cycles through the clock enable signal.
Solution: The waiting time becomes tw = NtC − tS − tCQ = 2 × 2 − 0.1 − 0.1 =
3.8 ns and the synchronization error probability of FF2 entering an illegal state is thus
given by PE = 0.1 exp(−38) = 3.1391 × 10^−18. The synchronization error frequency
is thus fE = fI PE = (10 MHz)(3.1391 × 10^−18) = 3.1391 × 10^−11 Hz.
According to Figure 8.16, the RTL codes of the synchronizer with N = 2 and clock
enable are presented below. It should be emphasized that the sampling frequency is
reduced by half. The reduced sampling frequency must be acceptable for the system
specification.
// ...
output d_sync;
// ...
if (clk_en) begin
  d_r  <= d;
  d_rr <= d_r;
end
endmodule
Alternatively, we can increase the waiting time by connecting flip-flops in series. For
example, the triple synchronization is shown in Figure 8.18.
Figure 8.19: Relationship between tw , tCQ , tS , and tC for the triple synchronization.
With flip-flops in series, each additional flip-flop adds tC − tS − tCQ to our waiting
time, because each flip-flop has a clock-to-Q delay, tCQ, and the input voltage sampled
by the flip-flop is the voltage at the setup time, tS, before the clock rising edge.
Therefore, for the synchronization using N + 1 flip-flops in series, the waiting time is
N(tC − tS − tCQ). The waiting time of the triple synchronization is displayed in Figure
8.19.
Example 8.4. In our example above, change the simple synchronizer to wait for two
clock cycles by using three flip-flops in series.
Solution: We have tw = N(tC − tS − tCQ) = 2 × (2 − 0.1 − 0.1) = 3.6 ns and the
synchronization error probability of the output flip-flop entering an illegal state is thus
given by PE = 0.1 exp(−36) = 2.3195 × 10^−17. The synchronization error frequency is fE =
fI PE = (10 MHz)(2.3195 × 10^−17) = 2.3195 × 10^−10 Hz, which is approximately 7.4
times (a factor of e^2) that of the previous example.
According to Figure 8.18, the RTL codes of the synchronizer with N = 2, i.e.,
three back-to-back flip-flops, are presented below.
// ...
output d_sync;
input d, clkout;
reg d_r, d_rr, d_rrr;
assign d_sync = d_rrr;
// Triple synchronization
// ...
begin
  d_r   <= d;
  d_rr  <= d_r;
  d_rrr <= d_rr;
end
endmodule
Figure 8.20: Incorrect method of synchronizing a multi-bit signal when multiple bits
in the counter are changing concurrently.
Figure 8.21: Correct method of synchronizing a multi-bit signal when only single bit
in the counter can change at a time.
// ...
case (cnt)
  3'b000:  cnt_gray = 3'b001;
  3'b001:  cnt_gray = 3'b011;
  3'b011:  cnt_gray = 3'b010;
  3'b010:  cnt_gray = 3'b110;
  3'b110:  cnt_gray = 3'b111;
  3'b111:  cnt_gray = 3'b101;
  3'b101:  cnt_gray = 3'b100;
  default: cnt_gray = 3'b000;
endcase
always @(posedge clk2) begin
  cnt_r  <= cnt;
  cnt_rr <= cnt_r;
end

// ...
case (cnt_rr)
  3'b001:  cnt_bin = 3'b001;
  3'b011:  cnt_bin = 3'b010;
  3'b010:  cnt_bin = 3'b011;
  3'b110:  cnt_bin = 3'b100;
  3'b111:  cnt_bin = 3'b101;
  3'b101:  cnt_bin = 3'b110;
  3'b100:  cnt_bin = 3'b111;
  default: cnt_bin = 3'b000;
endcase
where gray[i] denotes the Gray code, bin[i] denotes the binary data, i = 0, 1, ..., n − 1,
and bin[n] = 0. The 3-bit binary-to-Gray-code conversion is written as the Verilog
function gray below.
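A minimal sketch of such a function uses the standard relation gray[i] = bin[i] ^ bin[i+1] with bin[n] = 0, which is equivalent to XORing the binary word with itself shifted right by one bit. The wrapper module gray_conv, its ports, and the companion function gray2bin are hypothetical additions for illustration only.

```verilog
// Sketch only: 3-bit binary <-> Gray code conversion functions.
// The module wrapper and gray2bin are assumptions made for this example.
module gray_conv (
  input  wire [2:0] bin_in,
  output wire [2:0] gray_out,
  output wire [2:0] bin_back   // gray_out converted back to binary
);
  // Binary to Gray: every output bit is one XOR, independent of the width
  function [2:0] gray;
    input [2:0] bin;
    begin
      gray = bin ^ (bin >> 1);
    end
  endfunction

  // Gray to binary: each bit depends on the bit above it, so the XOR
  // chain grows with the bit width
  function [2:0] gray2bin;
    input [2:0] g;
    reg   [2:0] b;
    begin
      b[2] = g[2];
      b[1] = g[1] ^ b[2];
      b[0] = g[0] ^ b[1];
      gray2bin = b;
    end
  endfunction

  assign gray_out = gray(bin_in);
  assign bin_back = gray2bin(gray_out);
endmodule
```

For instance, binary 011 maps to Gray 010 and binary 100 maps to Gray 110, so the two consecutive counts differ in a single Gray bit although all three binary bits change.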
It can be seen that the critical path of the Gray-to-binary-code conversion increases
linearly with the bit width of the Gray code, while the critical path of the
binary-to-Gray-code conversion is a single XOR gate and is constant regardless of the bit
width. When the bit width becomes wider, the implementation of the conversion between
binary and Gray codes using Boolean equations generally has a longer critical path but
a smaller area than that using the case statement.
It is simple to generate a Gray code with a sequence of 2^N numbers, where
N ≥ 1 is an integer. However, the desired count may not be a power of
2. If we want a Gray code with a sequence of arbitrary even length, it can still be
derived from the original Gray code with a sequence of 2^N numbers. For example, a
Gray code with N = 3 is displayed in Figure 8.22. It can be observed that the codes
before and after a pair of adjacent codes with the same LSB, either 1 or 0, i.e., the circled
ones, always differ by exactly one bit.
As a result, the circled codes can be removed so that remaining ones still exhibit
the property of a Gray code. For example, if we remove the two circled codes 001
and 011 with the same LSB 1 to have a 6-code sequence, remaining neighboring
codes still differ by only one bit. If we further remove 111 and 101 with the same
LSB 1, we have a 4-code sequence. Consequently, a Gray code with a sequence of
arbitrary even length can eventually be obtained. A similar approach can be applied
to circled codes with the same LSB 0.
However, it is impossible to derive a Gray code with an arbitrary odd
count in a similar way. In practice, a Gray code with an odd count can be
derived by folding a Gray code with an even count. For example, to determine
a 3-code sequence, if a 6-code sequence is 000, 010, 110, 111, 101, and 100, then
separated-by-3 codes in the 6-code sequence represent the same binary count. That
is, 000 and 111 represent binary 0, 010 and 101 represent binary 1, and 110 and 100
represent binary 2.
Figure 8.22: A pair of adjacent Gray codes with the same LSB bit.
On the output side, each rising edge of clock clkout advances the Gray-coded read
pointer, rd_ptr, to select each register in turn. Initially, rd_ptr is 00, and it selects the
contents a of register queue[0] to drive the output port, fifo_rdata. The second rising
edge of clkout advances rd_ptr to 01, and selects b from queue[1] to appear on the
output. The third edge selects c from queue[3] according to the rd_ptr of 11, and so
on. When the last element queue[2] has been read, the next element to be read is the
first element, queue[0]. It can be seen that the data stored in queue[0] to queue[3]
are each held for four clock cycles of clkin. By using multiple registers to extend
the valid period of the input data, the FIFO synchronizer can read
data without encountering any unstable or transition state when selecting them on the
output fifo_rdata. Thus, there is no possibility of violating setup and hold times in
the datapath of clkout, provided that the read pointer selects the required data after
they have become stable (after their transition). There are even multiple cycles available
for reading a queued data item before it is used.
In the FIFO synchronizer, the frequency of clkout is typically higher than that
of clkin. As a result, the frequency of read access is typically higher than that of
write access. Under this condition, the FIFO is guaranteed not to overrun,
and neither the input nor the output of the FIFO ever needs to be stalled. Otherwise, a
flow control mechanism is needed. The need to synchronize is simply moved to the
control path. There is a wr_ptr in the clkin domain and a rd_ptr in the clkout domain
used for the write access and read access, respectively.
Example 8.5. Please design a FIFO synchronizer without flow control, as shown in
Figure 8.25. Since flow control is not required, the speed of data output is higher
than or equal to that of the input data. Please write the RTL codes and determine
the worst-case queue depth of the FIFO such that no overrun can occur. The longest
queue depth is required when the frequencies of clkin and clkout are the same. Also,
the frequency of read access is assumed to be the same as that of write access.
The output valid signal, out_valid, is used to indicate the nonempty status of the
FIFO queue. The out_valid is true when the wr_ptr synchronized to the clkout domain
is not the same as the rd_ptr. To synchronize wr_ptr, it is encoded using the Gray
code. Even though a properly designed FIFO synchronizer is guaranteed not to overrun,
the queue depth of the FIFO must still be large enough because the wr_ptr is double
synchronized from the clkin to the clkout domain, which may incur a worst-case delay
of 2 cycles in the clkout domain.
Solution: A FIFO with a depth of 10, implemented using a circular buffer, is
shown in Figure 8.26. When the write pointer, wr_ptr, and the read pointer, rd_ptr,
are the same, the FIFO may be empty or full. For example, starting from the initial write
and read pointers displayed in Figure 8.26, assume that there is no write access so that
wr_ptr is fixed. After 4 read accesses, rd_ptr=wr_ptr=2, and the FIFO is empty.
As another example, assume that there is no read access so that rd_ptr is fixed.
After 6 write accesses, rd_ptr=wr_ptr=8, and the FIFO is full.
Figure 8.26: The FIFO memory is indexed by write and read pointers.
To easily distinguish the FIFO empty and full status using only read and write
pointers (without the queue length counter), a space in FIFO is often purposely left
unoccupied. Therefore, when write pointer and read pointer are the same, the FIFO
can be decided to be empty. By contrast, when the “next” write pointer and read
pointer are the same, the FIFO is full because a space in FIFO is not used. Conse-
quently, the out_valid signal is true when write pointer and read pointer are not the
same.
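The empty and full conditions just described can be sketched for a 4-entry FIFO with 2-bit Gray-coded pointers. This is an assumption-laden illustration: the module name fifo_flags and its ports are hypothetical, and both pointers are assumed to be already available in the same clock domain, so the synchronization of the pointers is not shown.

```verilog
// Sketch only: empty/full flags of a 4-entry FIFO that leaves one entry
// unused. Pointers are 2-bit Gray coded; all names are assumptions.
module fifo_flags (
  input  wire [1:0] wr_ptr,   // Gray-coded write pointer
  input  wire [1:0] rd_ptr,   // Gray-coded read pointer (same domain assumed)
  output wire       empty,
  output wire       full
);
  reg [1:0] next_wr_ptr;

  // Successor in the 2-bit Gray sequence 00 -> 01 -> 11 -> 10 -> 00
  always @(*)
    case (wr_ptr)
      2'b00:   next_wr_ptr = 2'b01;
      2'b01:   next_wr_ptr = 2'b11;
      2'b11:   next_wr_ptr = 2'b10;
      default: next_wr_ptr = 2'b00;
    endcase

  assign empty = (wr_ptr == rd_ptr);       // pointers coincide
  assign full  = (next_wr_ptr == rd_ptr);  // one more write would collide
endmodule
```

With rd_ptr at 00, the FIFO is reported full once wr_ptr reaches 10, i.e., after three writes, so only three of the four entries are ever occupied.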
Considering the double synchronization and a burst write, the timing diagram
to assert the latest out_valid (owing to double synchronization of wr_ptr) is the
worst case for the (longest) queue depth and it is shown in Figure 8.27. The double
Figure 8.27: Worst-case timing diagram of a 4-entry FIFO synchronizer for deter-
mining the queue depth. Frequencies of clkin and clkout are the same.
Figure 8.28: Best-case timing diagram for the queue depth of a 4-entry FIFO syn-
chronizer. Frequencies of clkin and clkout are the same.
From above, we will design a FIFO with queue depth of 4 according to the worst-
case design criterion. Similar to the fifo_ctrl in Example 7.7, there is a wr_ptr in clkin
domain used to indicate the write address, and there is a rd_ptr in clkout domain used
to indicate the read address. To reduce the complexity of the synchronizer, only the
Gray coded wr_ptr signal is double synchronized to the clkout domain to indicate
the nonempty (or out_valid) status of the queue.
When the FIFO is full, the write pointer and read pointer are the same. However,
when the FIFO is empty, they are also the same. To differentiate the FIFO full status
from the FIFO empty status, FIFO full is commonly asserted when the next write
pointer equals the current read pointer. Doing so leaves one element unoccupied,
so one buffer space is wasted. Accordingly, out_valid is asserted whenever
the rd_ptr and the synchronized write pointer, wr_ptr_rr, are not the same. The RTL
codes are written below.
// ...
output [7:0] out_data;
input in_valid;
input [7:0] in_data;
// ...
reg [7:0] out_data;
// ******************
// * clkin domain
// ******************
// ...
if (in_valid) begin
  // Case statement can be simply replaced by
  // queue[wr_ptr] <= in_data;
  case (wr_ptr) // Gray coded pointer
    2'b00: queue[0] <= in_data;
    2'b01: queue[1] <= in_data;
    2'b11: queue[3] <= in_data;
// ...
// * clkout domain
// ******************
// Double sync
// ...
if (rst) begin
  wr_ptr_r  <= 0;
  wr_ptr_rr <= 0;
end
else begin
  wr_ptr_r  <= wr_ptr;
  wr_ptr_rr <= wr_ptr_r;
end
assign out_valid = wr_ptr_rr != rd_ptr;
Example 8.6. Redesign the above FIFO synchronizer without flow control so that
all FIFO spaces can be fully utilized.
Solution: To fully utilize the FIFO space, the queue_length counter must be
implemented. To easily generate the out_valid signal, the queue_length counter is
located in clkout domain. Since the out_valid signal can be generated using the
queue_length counter, double synchronization is not required for wr_ptr. Rather, we
synchronize one-bit signal, in_valid, to count the queue_length so that the circuit
area can be reduced. The RTL code fragment is written below. Other parts are the
same as those RTL codes in the previous example and omitted here.
The synchronized in_valid signal, in_valid_rr, may stay high for
several clock cycles of clkout because the frequency of clkin is usually
lower than that of clkout. Therefore, in_valid_rr should be qualified by in_valid_rrr, i.e.,
in_valid_rr&∼in_valid_rrr, so that a one-cycle input valid indication can be derived.
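The qualification in_valid_rr&∼in_valid_rrr amounts to a rising-edge detector in the clkout domain. The sketch below is a self-contained illustration under assumptions: the module name valid_pulse is hypothetical, the asynchronous active-high reset is an assumed style, and driving fifo_wr from this pulse is a guess at how the write indication is formed. It also assumes that in_valid returns low between consecutive writes, since back-to-back writes would otherwise produce a single pulse.

```verilog
// Sketch only: synchronizes in_valid into the clkout domain and derives
// a one-cycle pulse on its rising edge. Names and reset style are assumptions.
module valid_pulse (
  input  wire clkout,
  input  wire rst,
  input  wire in_valid,   // generated in the clkin domain
  output wire fifo_wr     // high for one clkout cycle per in_valid assertion
);
  reg in_valid_r, in_valid_rr, in_valid_rrr;

  always @(posedge clkout or posedge rst)
    if (rst) begin
      in_valid_r   <= 1'b0;
      in_valid_rr  <= 1'b0;
      in_valid_rrr <= 1'b0;
    end
    else begin
      in_valid_r   <= in_valid;     // first synchronizing stage
      in_valid_rr  <= in_valid_r;   // second synchronizing stage
      in_valid_rrr <= in_valid_rr;  // delayed copy used for edge detection
    end

  // High only in the cycle where in_valid_rr has risen but in_valid_rrr has not
  assign fifo_wr = in_valid_rr & ~in_valid_rrr;
endmodule
```

However long in_valid_rr stays high in the clkout domain, fifo_wr is asserted for exactly one clkout cycle, so the queue_length counter is incremented once per assertion.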
// ******************
// * clkout domain
// ******************
reg [2:0] queue_length;
// Double sync
// ...
if (rst) begin
  in_valid_r   <= 0;
  in_valid_rr  <= 0;
  in_valid_rrr <= 0;
end
else begin
  in_valid_r   <= in_valid;
  in_valid_rr  <= in_valid_r;
  in_valid_rrr <= in_valid_rr;
end
assign out_valid = queue_length != 0;
// ...
if (!rst_n)
  queue_length <= 0;
else if (fifo_wr && !fifo_rd)
  queue_length <= queue_length + 1'b1;
else if (fifo_rd && !fifo_wr)
  queue_length <= queue_length - 1'b1;
output interface should also adopt flow control to prevent underrun, in which invalid
data are delivered because the FIFO is read while it is empty.
The interface of the FIFO synchronizer with valid-ready flow control is shown in
Figure 8.29. On both the input and output interfaces, the valid signal is true if the
transmitter has valid data on the data bus, and the ready signal is true if the receiver
is ready to receive new data. A data transfer takes place only when both valid and
ready signals are true.
On the input interface, the in_ready signal indicates the buffer-not-full status of the
FIFO queue. It is generated by comparing the write and read pointers. Unfortunately,
this comparison is complicated by the fact that the write and read pointers are in
different clock domains, i.e., clkin and clkout, respectively. Using the deterministic
multi-bit synchronizer, we generate a synchronized version of the read pointer in the
clkin domain, rd_ptr_rr. Similarly, on the output interface, the out_valid signal
indicates the buffer-not-empty status of the FIFO queue. It is also generated by
comparing the write and read pointers. Using the deterministic multi-bit synchronizer,
we generate a synchronized version of the write pointer in the clkout domain, wr_ptr_rr.
Based on the write and read pointers, the in_ready and out_valid signals can be
generated. However, if we allow all FIFO entries to be used, both the full and the
empty conditions occur when the write and read pointers are the same.
Unfortunately, it is hard to discriminate between these two conditions, particularly
when the clock frequencies of clkin and clkout are different. Hence, we simply
declare the 4-entry FIFO to be full when the next write pointer, next_wr_ptr, and the
read pointer are the same, i.e., one entry is intentionally left unoccupied, as shown
below. We declare the FIFO to be nonempty when the write and read pointers are not
the same.
...
always @(*)
...
Synchronization delays the synchronized write and read pointers relative to
the original ones, but this delay causes neither queue overrun nor underrun.
The delayed rd_ptr_rr makes the queue-full status clear later than it actually does,
and the delayed wr_ptr_rr makes the queue-nonempty status appear later. Both effects
are conservative: they only postpone write access on the input interface and read
access on the output interface.
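The full and nonempty tests described above can be sketched as follows. This is a minimal illustration, not the book's listing: the module name fifo_flags and the function gray_next are hypothetical, and signals from both clock domains are gathered in one module purely for brevity.

```verilog
// Hypothetical sketch: flag generation for a 4-entry FIFO with 2-bit
// Gray-coded pointers, leaving one entry intentionally unoccupied.
module fifo_flags (
  input  wire [1:0] wr_ptr,      // write pointer (clkin domain)
  input  wire [1:0] rd_ptr_rr,   // read pointer synchronized into clkin
  input  wire [1:0] rd_ptr,      // read pointer (clkout domain)
  input  wire [1:0] wr_ptr_rr,   // write pointer synchronized into clkout
  output wire [1:0] next_wr_ptr,
  output wire       in_ready,    // buffer not full
  output wire       out_valid    // buffer not empty
);
  // Next value in the 2-bit Gray sequence 00 -> 01 -> 11 -> 10 -> 00
  function [1:0] gray_next;
    input [1:0] g;
    begin
      gray_next = {g[0], ~g[1]};
    end
  endfunction

  assign next_wr_ptr = gray_next(wr_ptr);
  // Full when the next write would land on the (synchronized) read pointer
  assign in_ready    = (next_wr_ptr != rd_ptr_rr);
  // Nonempty when the (synchronized) write pointer differs from the read pointer
  assign out_valid   = (wr_ptr_rr != rd_ptr);
endmodule
```

Since each comparison uses a delayed copy of the other domain's pointer, the flags can only be pessimistic, which matches the conservative behavior described in the text.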
The RTL code of the 4-entry nondeterministic FIFO synchronizer with flow control
is written below. Except for the flow control mechanism, most of the RTL
code of the FIFO synchronizer with flow control is the same as that of the FIFO
synchronizer without flow control.
...
input in_valid;
input [7:0] in_data;
output in_ready;
// Output interface
output out_valid;
output [7:0] out_data;
input out_ready;
...
wire in_wr_en;
...
wire out_rd_en;
...
reg [7:0] out_data;
// *********************
// * clkin domain
// *********************
...
always @(*)
...
  if (rst) begin
    rd_ptr_r  <= 0;
    rd_ptr_rr <= 0;
  end
  else begin
    rd_ptr_r  <= rd_ptr;
    rd_ptr_rr <= rd_ptr_r;
  end
assign in_wr_en = in_valid & in_ready;
...
  if (in_wr_en) begin
    case (wr_ptr) // Gray coded pointer
      2'b00: queue[0] <= in_data;
      2'b01: queue[1] <= in_data;
      2'b11: queue[3] <= in_data;
      2'b10: queue[2] <= in_data;
    endcase
  end
// *********************
// * clkout domain
// *********************
...
// Double sync
...
  if (rst) begin
    wr_ptr_r  <= 0;
    wr_ptr_rr <= 0;
  end
  else begin
    wr_ptr_r  <= wr_ptr;
    wr_ptr_rr <= wr_ptr_r;
  end
assign out_rd_en = out_valid & out_ready;
always @(posedge clkout or posedge rst)
...
  case (rd_ptr)
    2'b00: out_data = queue[0];
    2'b01: out_data = queue[1];
    2'b11: out_data = queue[3];
    2'b10: out_data = queue[2];
  endcase
endmodule
The timing diagram of the nondeterministic FIFO synchronizer with flow control
is presented in Figure 8.30. The out_ready signal is assumed to be asserted every 2
cycles.
Figure 8.32: A high-performance embedded computer with multiple buses: one for
the instruction memory, one for the data memory and an accelerator, and one for I/O
controllers.
There are specialized processing elements, named DSPs, optimized for the kinds
of operations involved in dealing with digitized signals, such as audio, video, or other
streams of data from sensors. Even so, applications often still need a general-purpose
processor to perform other tasks, such as interacting with the user and the overall
coordination of system operation. Hence, DSPs are often combined with CPUs in
heterogeneous multiprocessor systems.
In order to keep track of which instruction to fetch next, the CPU has a special
register called the program counter, PC, in which the address of the next instruction is
kept. In the fetching step, the CPU uses the content of the PC to do a read access from
the instruction memory, and then automatically increments the PC. In the decoding
step, the CPU determines the resources required to perform the operation specified
by the instruction. The decoding step in a low-end CPU is simple. By contrast, in a
complex CPU, decoding may involve such actions as checking for resource conflicts
and availability of data, and waiting until resources are free. In the execution step, the
CPU activates corresponding resources to perform the operation. This involves using
control signals generated in the decoding step to select required operands, enable the
ALU to perform the required operation, and route the results to destination registers.
In a non-pipelined CPU, these steps are performed in order, and when the instruction
is finished, the CPU starts again with the fetching step of the next instruction.
Modern high-performance CPUs, however, can overlap the steps of different
instructions while producing the same results as if the steps were performed
sequentially. Techniques used within CPUs to execute multiple instructions
concurrently include pipelining and superscalar techniques.
The block diagram and interface of the crypto processor are shown in Figure
8.33. Detailed design of the control unit and datapath will be presented later. The
program is stored in the instruction memory, which is implemented using a ROM,
while data are stored in the data memory implemented in a RAM. The processor
requires a 256 × 16-bit instruction memory and a 256 × 8-bit data memory. For the
largest key size of 256 bits, we need 15 keys and each key has 128 bits (16 Bytes).
Hence, the maximum required space of data memory is (15 × 16 Bytes (for keys) +
16 Bytes (for plaintext)) = 256 Bytes. The maximum allowed ROM space for the
program is assumed to be 512 Bytes. The one-cycle start signal enables the crypto
processor. When start is true, the signal, klen[1:0], selects the key size of 128, 192,
or 256 bits. The signal, done, is asserted after the ciphertext has been stored into the
data memory.
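The memory figures above can be checked with a short calculation. It assumes the standard AES round counts, N_r = 10, 12, and 14 for 128-, 192-, and 256-bit keys (a property of AES that the text does not state explicitly), so that N_r + 1 keys of 16 bytes each are needed.

```latex
\begin{align*}
M_{\text{data}}(N_r) &= \underbrace{16\,(N_r + 1)}_{\text{keys}}
                       + \underbrace{16}_{\text{plaintext}} \ \text{bytes},\\
M_{\text{data}}(10) &= 192, \qquad
M_{\text{data}}(12) = 224, \qquad
M_{\text{data}}(14) = 256 \ \text{bytes},\\
M_{\text{inst}} &= 256 \times 16\ \text{bits} = 512\ \text{bytes}.
\end{align*}
```

The 256-bit case therefore fills the 256 × 8-bit data memory exactly, and the 256 × 16-bit instruction memory matches the assumed 512-byte program limit.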
The overall AES encryption algorithm for a 128-bit key is displayed in Figure 8.34.
Typically, an encryption algorithm consists of several rounds so that the two
properties that secure a cipher, confusion and diffusion, are achieved. Round keys
are assumed to be provided and stored in the data memory, so the key expansion
function is not implemented here.
Each round consists of several processing steps.
AddRoundKey combines each byte of the state with a byte of the round key using
bitwise XOR. SubBytes is a non-linear substitution step in which each byte is replaced
with another according to a lookup table, called the S-box, as shown in Table 8.1.
ShiftRows is a transposition step in which the last three rows of the state are shifted
cyclically by a certain number of steps. MixColumns is a linear mixing operation
that operates on the columns of the state, combining the four bytes in each
column. The round keys are derived from the initial encryption key using the
key expansion function. AES requires a 128-bit key for each round and for the
initial round key addition. We assume that all keys, including the initial key and the
round keys, have been calculated and stored in the data memory.
In the SubBytes step, each byte $a_{i,j}$ in the state array is substituted with
$S(a_{i,j})$ using the 8-bit S-box, as shown in Figure 8.35.
In the ShiftRows step, the bytes in each row are cyclically shifted by a certain
offset, as shown in Figure 8.36.
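As a rough illustration, ShiftRows can be expressed as pure wiring. The sketch below assumes a hypothetical 128-bit packing in which the byte at row r and column c occupies bits [8*(4*r+c) +: 8] (so byte k corresponds to register Rk of the processor described later), and it assumes the standard AES offsets in which row r is rotated left by r positions. The module name shift_rows is hypothetical.

```verilog
// Hypothetical sketch of ShiftRows as wiring only (no logic gates).
// Assumed packing: byte (row r, column c) is state[8*(4*r+c) +: 8].
module shift_rows (
  input  wire [127:0] state_in,
  output wire [127:0] state_out
);
  genvar r, c;
  generate
    for (r = 0; r < 4; r = r + 1) begin : row
      for (c = 0; c < 4; c = c + 1) begin : col
        // Row r is rotated left by r positions; row 0 is unchanged
        assign state_out[8*(4*r + c) +: 8] =
               state_in[8*(4*r + ((c + r) % 4)) +: 8];
      end
    end
  endgenerate
endmodule
```

With this packing, the new byte 4 comes from the old byte 5 and the new byte 12 comes from the old byte 15.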
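In the MixColumns step, each column is transformed using a fixed matrix (the matrix
left-multiplied by a column gives the new value of that column in the state), and it
can be written as
\[
\begin{bmatrix} a'_{0,j} \\ a'_{1,j} \\ a'_{2,j} \\ a'_{3,j} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{bmatrix},
\quad j = 0, 1, 2, 3 \tag{8.15}
\]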
[Figure 8.34: AES encryption flow. The plaintext enters the initial round (add round key, using the initial key). Each following round applies substitute bytes, shift rows, mix columns, and add round key, using a round key from key expansion. The final round applies substitute bytes, shift rows, and add round key, and produces the ciphertext.]
Table 8.1: S-box. The column is determined by the least significant nibble, and the
row by the most significant nibble. For example, the value 0xc7 is converted into
0xc6.
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
00 63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
10 ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
20 b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
30 04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
40 09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
50 53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
60 d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
70 51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
80 cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
90 60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
a0 e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
b0 e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
c0 ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
d0 70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
e0 e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
f0 8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16
where $a'_{i,j}$ denotes the new state, $i = 0, 1, 2, 3$, and $a_{i,j}$ denotes the old state. The
multiplication by the constant matrix can be simplified. For example,
where $\oplus$ denotes the bitwise XOR operation, $\mathrm{xtime}(a_{i,j}) \equiv \{02\} \cdot a_{i,j}$, and
$\{03\} \cdot a_{i,j} = (\{01\} \oplus \{02\}) \cdot a_{i,j} = a_{i,j} \oplus \mathrm{xtime}(a_{i,j})$. The function xtime(·) is the
Figure 8.36: States in each row are shifted cyclically to the left. The shift amount
differs for each row.
Figure 8.37: Each column of the states can be viewed as being multiplied by a
fixed matrix or fixed polynomial, where c(x) denotes the polynomial that linearly
combines each column.
where ≪ denotes the left shift operation. The MixColumns can also be visualized
in Figure 8.37.
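The column transformation of Equation (8.15) can be sketched in Verilog as below. This is an illustration rather than the book's datapath: the module name mix_column and the function name mul3 are hypothetical, and xtime is written with the standard AES reduction constant 8'h1b (for the polynomial x^8 + x^4 + x^3 + x + 1), which is assumed here.

```verilog
// Hypothetical sketch of MixColumns for a single column, per Equation (8.15).
module mix_column (
  input  wire [7:0] a0, a1, a2, a3,  // old column bytes a_{0,j} .. a_{3,j}
  output wire [7:0] b0, b1, b2, b3   // new column bytes a'_{0,j} .. a'_{3,j}
);
  // Multiply by {02} in GF(2^8): shift left, then reduce if the MSB was set
  function [7:0] xtime;
    input [7:0] a;
    begin
      xtime = {a[6:0], 1'b0} ^ (a[7] ? 8'h1b : 8'h00);
    end
  endfunction

  // Multiply by {03} = {02} xor {01}
  function [7:0] mul3;
    input [7:0] a;
    begin
      mul3 = xtime(a) ^ a;
    end
  endfunction

  assign b0 = xtime(a0) ^ mul3(a1)  ^ a2        ^ a3;
  assign b1 = a0        ^ xtime(a1) ^ mul3(a2)  ^ a3;
  assign b2 = a0        ^ a1        ^ xtime(a2) ^ mul3(a3);
  assign b3 = mul3(a0)  ^ a1        ^ a2        ^ xtime(a3);
endmodule
```

For the commonly used test column (8'hdb, 8'h13, 8'h53, 8'h45), these equations give (8'h8e, 8'h4d, 8'ha1, 8'hbc).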
In the AddRoundKey step, each byte of the state is combined with the corresponding
byte of the round key (or subkey) using bitwise XOR, as shown in Figure 8.38.
Figure 8.38: Each byte of the state is combined with a byte of the round subkey using
the XOR operation.
immediate constant, which represents the jump or subroutine call address. For
instructions with format A, depending on their opcodes, the 6-bit Rd and/or 6-bit Rs
fields may be absent, replaced with data in the data memory addressed by Ra (the
address register of data memory), or simply replaced with a 6-bit immediate constant.
The primary field of an instruction is the 4-bit opcode (short for operation code),
which specifies the operation to be performed and, by implication, the layout of the
remaining fields within the code word. All registers in the processor have 8 bits. They
comprise the general-purpose registers R0-R31, which can store the 128-bit plaintext
and a 128-bit key; two more general-purpose registers, R32 and R33; the program
counter, PC; the address register of data memory, Ra; the read-only status register, Ri,
which holds the E-bit (equivalence bit) in bit 0 and the P-bit (pause bit) in bit 1, with
the other bits reserved; and the read-only round register, Rr = 9, 11, or 13, configured
by the key size signal, klen[1:0], for a 128-bit, 192-bit, or 256-bit key, respectively.
The stack memory has only one entry, so only non-nested subroutine calls are
supported. By keeping the field layout simple and regular, we keep the circuit of the
instruction decoder simple. In a complex processor with a large number of different
instructions, the instruction set is usually encoded with a distinct prefix for each
category of instructions to speed up instruction decoding.
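A regular field layout reduces field extraction to fixed wiring, as the following sketch suggests. The field widths (4-bit opcode, 6-bit Rd, 6-bit Rs, 8-bit constant) come from the text, but the bit positions chosen here are assumptions, as are the module name inst_fields and the output name C8.

```verilog
// Hypothetical sketch of instruction field extraction for a 16-bit word.
// The bit positions below are assumed, not taken from the text.
module inst_fields (
  input  wire [15:0] inst,
  output wire [3:0]  opcode,
  output wire [5:0]  Rd,
  output wire [5:0]  Rs,
  output wire [7:0]  C8
);
  assign opcode = inst[15:12]; // assumed: opcode in the top 4 bits
  assign Rd     = inst[11:6];  // assumed: destination register index
  assign Rs     = inst[5:0];   // assumed: source register index or 6-bit constant
  assign C8     = inst[7:0];   // assumed: 8-bit jump or call address
endmodule
```

Because every field sits at a fixed position, the decoder only needs to examine the opcode to decide which of these fields are meaningful.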
Based on the AES algorithm, we define an instruction set specialized for it, as
shown in Table 8.2, where Rd denotes the destination register, Rs denotes the source
register, Ra is the address register of data memory, (Ra) denotes the content of Ra,
C represents an 8-bit constant for the jmp, jne, and jsb instructions or a 6-bit constant
for the mvc and adc instructions, Ri is the status register, and PC is the program counter.
Notice that Rd and Rs can be any of R0-R33, Ra, PC, Ri, Rk, or Rr. The register
Rk is the stack register, with only one entry, used to save and restore the PC when the
instructions jsb and ret are executed, respectively. However, the read-only registers
PC, Ri, Rk, and Rr cannot be updated manually. The instructions ldm and stm
automatically increment Ra, so no separate increment instruction is needed.
To reduce the space of instruction memory, we support subroutine calls using
the instruction jsb. However, nested subroutine calls are not allowed, which also
reduces the space of stack memory. A subroutine call automatically saves PC+1
(the address of the instruction following the call) into the stack memory, and the
stored PC is restored when the instruction ret is encountered. The jsb instruction
must be used in tandem with ret.
The instruction wat sets the P-bit in Ri and waits for the start signal that activates
the processor. The instruction dne asserts the signal done.
Address mapping of the processor is listed in Table 8.3.
When the CPU is reset, it clears the PC to 0 so that the first instruction is fetched
from address 0 in the instruction memory, and it starts the fetch-decode-execute steps
when the start signal is asserted. The PC automatically increments to fetch instructions
sequentially unless a jump instruction, such as jmp, jne, or jsb, is encountered.
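The PC behavior described above, together with the single-entry stack register Rk, might be sketched as follows. This is not the book's datapath: the module name pc_unit and the control inputs do_jump, do_jsb, and do_ret are hypothetical, and the sketch assumes that PC still holds the address of the jsb instruction when the call is taken.

```verilog
// Hypothetical sketch of the program counter and one-entry stack register.
module pc_unit (
  input  wire       clk,
  input  wire       rst,      // assumed active-high asynchronous reset
  input  wire       inc_PC,   // fetch the next sequential instruction
  input  wire       do_jump,  // hypothetical: jmp, or jne with the E-bit false
  input  wire       do_jsb,   // hypothetical: subroutine call
  input  wire       do_ret,   // hypothetical: return from subroutine
  input  wire [7:0] C,        // 8-bit jump or call address
  output reg  [7:0] PC,
  output reg  [7:0] Rk
);
  always @(posedge clk or posedge rst)
    if (rst) begin
      PC <= 8'd0;             // first instruction is fetched from address 0
      Rk <= 8'd0;
    end
    else if (do_jsb) begin
      Rk <= PC + 8'd1;        // save the address following the call
      PC <= C;
    end
    else if (do_ret)
      PC <= Rk;               // restore the saved address
    else if (do_jump)
      PC <= C;
    else if (inc_PC)
      PC <= PC + 8'd1;
endmodule
```

With only one Rk register, a second jsb before ret would overwrite the saved address, which is why nested calls are not supported.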
The control unit in Figure 8.33 has two blocks: a state machine and a decoder. The
state machine has 4 states, as shown in Figure 8.40. After the assertion of the start
signal, the processor performs the fetch-decode-execute steps for each instruction until
the wat instruction is encountered, which causes the state machine to transition to the
ST_WAIT state. To continue with the encryption of the next block of plaintext, the
start signal should be asserted again. The decoder generates control signals for the
datapath according to the opcode.
The timing diagrams of the instruction set are presented in Figures 8.41–8.43. Notice
that the timing diagrams of mvc, ml2, and ml3 are similar to that of mvr, and the
timing diagrams of jmp and jne are similar to that of jsb. Unlike jsb, however, the
instructions jmp and jne do not store the PC into the stack register, Rk.
Table 8.2: Instruction set.
Instruction     Opcode    Description
ldm Rd, (Ra)    4'b0000   Load data memory addressed by Ra to Rd. Ra is automatically incremented.
stm (Ra), Rs    4'b0001   Store Rs to data memory addressed by Ra. Ra is automatically incremented.
mvr Rd, Rs      4'b0010   Move Rs to Rd.
mvc Rd, C       4'b0011   Move 6-bit constant C to Rd.
cmp Rd, Rs      4'b0100   Compare Rd with Rs. If equal, the E-bit in Ri is set.
adc Rd, C       4'b0101   Add 6-bit constant C to Rd and store the result into Rd.
sbt Rd          4'b0110   Substitute Rd using the S-box and store the result into Rd.
ml2 Rd, Rs      4'b0111   Multiply Rs by 2 in GF(2^8) and store the result into Rd.
ml3 Rd, Rs      4'b1000   Multiply Rs by 3 in GF(2^8) and store the result into Rd.
xor Rd, Rs      4'b1001   XOR Rs with Rd and store the result into Rd.
jmp C           4'b1010   Unconditionally jump to the address specified by 8-bit constant C.
jne C           4'b1011   Jump to the address specified by 8-bit constant C when the E-bit is false.
jsb C           4'b1100   Jump to the subroutine specified by 8-bit constant C. PC+1 is automatically saved.
ret             4'b1101   Return from subroutine call. PC is automatically restored.
dne             4'b1110   Program done. Output the done signal.
wat             4'b1111   Wait for the start signal. Set the P-bit in Ri.
...
  state_ns = state_cs;
  case (state_cs)
    ST_WAIT: state_ns = start ? ST_FET : ST_WAIT;
    ST_FET:  state_ns = ST_DEC;
    ST_DEC:  state_ns = ST_EXE;
    ST_EXE:  state_ns = inst_dec15_rr ? ST_WAIT : ST_FET;
    default: state_ns = state_cs;
  endcase
end
Figure 8.41: Timing diagrams of instructions: (a) ldm, (b) stm, (c) mvr, and (d)
cmp.
In the control unit, the RTL code of the decoder is described below. During
the state ST_FET, the PC is incremented; during the state ST_DEC, the opcode is
decoded to generate the instruction enable signals, inst_dec[15:0], the write enable
signals, wr_en[34:0], for Ra and R33-R0, and the enable signals that latch the operands
for the ALU, op1_en and op2_en. The 4 LSBs of the register Rr are encoded according
to the key size, and the remaining 4 bits are reserved. The signals inst_dec[15:0] and
wr_en[34:0] are pipelined.
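The write-enable generation can be isolated into a small one-hot decoder, sketched below under assumptions: the module name wr_en_onehot and the input update_rd (standing for "the decoded instruction writes Rd") are hypothetical.

```verilog
// Hypothetical sketch: one-hot write enable derived from the 6-bit Rd field.
module wr_en_onehot (
  input  wire [5:0]  Rd,
  input  wire        update_rd,  // hypothetical: current instruction writes Rd
  output reg  [34:0] wr_en       // one bit each for R0-R33 and Ra
);
  integer i;
  always @(*) begin
    wr_en = 35'd0;               // default: no register is written
    if (update_rd)
      for (i = 0; i <= 34; i = i + 1)
        if (Rd == i) wr_en[i] = 1'b1;  // set only the bit selected by Rd
  end
endmodule
```

Register indices above 34 leave wr_en at zero, so a read-only register named as a destination is simply not written. A left shift of a 35-bit constant one by Rd should produce the same one-hot pattern.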
Figure 8.42: Timing diagrams of instructions (continued): (e) adc, (f) sbt, (g) xor,
and (h) jsb.
Figure 8.43: Timing diagrams of instructions (continued): (i) ret, (j) dne, and (k)
wat.
...
wire inc_PC;
...
reg [7:0] R [0:37];
integer i;
...
// Increment PC
assign inc_PC = (state_ns == ST_FET);
always @(*) begin
  inst_dec = 16'd0;
  wr_en    = 35'd0;
  op1_en   = 1'b0;
  op2_en   = 1'b0;
  if (state_ns == ST_DEC) begin
    case (opcode)
      INST_LDM: begin
        inst_dec[0] = 1'b1;
        for (i = 0; i <= 34; i = i + 1)
          if (Rd == i) wr_en[i] = 1'b1; // Update Rd
      end
      INST_STM: begin inst_dec[1] = 1'b1; op2_en = 1'b1; end
      INST_MVR: begin
        inst_dec[2] = 1'b1;
        for (i = 0; i <= 34; i = i + 1)
          if (Rd == i) wr_en[i] = 1'b1; // Update Rd
        op2_en = 1'b1;
      end
      INST_MVC: begin
        inst_dec[3] = 1'b1;
        for (i = 0; i <= 34; i = i + 1)
          if (Rd == i) wr_en[i] = 1'b1; // Update Rd
        op2_en = 1'b1;
      end
      INST_CMP: begin
        inst_dec[4] = 1'b1;
        op1_en = 1'b1; op2_en = 1'b1;
      end
      INST_ADC: begin
        inst_dec[5] = 1'b1;
        for (i = 0; i <= 34; i = i + 1)
          if (Rd == i) wr_en[i] = 1'b1; // Update Rd
        op1_en = 1'b1;
        op2_en = 1'b1;
      end
      INST_SBT: begin
        inst_dec[6] = 1'b1;
        for (i = 0; i <= 34; i = i + 1)
          if (Rd == i) wr_en[i] = 1'b1; // Update Rd
        op1_en = 1'b1;
      end
      INST_ML2: begin
        inst_dec[7] = 1'b1;
        for (i = 0; i <= 34; i = i + 1)
...
  if (rst) begin
    wr_en_r  <= 35'd0;
    wr_en_rr <= 35'd0;
  end
  else begin
    wr_en_r  <= wr_en;
    wr_en_rr <= wr_en_r;
  end
always @(posedge clk or posedge rst)
...
In the datapath, the ROM and RAM interfaces, the registers R0-R33, Ra, PC, Ri, and
Rk, and the done signal are described below.
...
integer i1;
...
assign Rk = R[37];
...
// Rk: R[37], Rr: R[38]
...
  // begin/end keeps the else below paired with the inst_dec0_rr test
  // rather than with the innermost if inside the for loop
  if (inst_dec0_rr) begin
    for (i1 = 0; i1 <= 33; i1 = i1 + 1)
      if (wr_en_rr[i1])
        R[i1] <= ramq; // Load mem to R[0]~R[33]
  end
  else
    casex (inst_dec_r)
      16'bxxxx_xxxx_xxxx_x1xx, 16'bxxxx_xxxx_xxxx_1xxx:
        for (i1 = 0; i1 <= 33; i1 = i1 + 1) // Write to R[0]~R[33]
          if (wr_en_r[i1]) R[i1] <= alu_out;
      16'bxxxx_xxxx_xx1x_xxxx, 16'bxxxx_xxxx_x1xx_xxxx,
      16'bxxxx_xxxx_1xxx_xxxx, 16'bxxxx_xxx1_xxxx_xxxx,
      16'bxxxx_xx1x_xxxx_xxxx:
        for (i1 = 0; i1 <= 33; i1 = i1 + 1) // Write to R[0]~R[33]
          if (wr_en_r[i1]) R[i1] <= alu_out;
    endcase
always @(posedge clk or posedge rst)
...
In the datapath, the multiplexers used to produce the operands of ALU are de-
scribed below.
...
always @(*)
...
In the datapath, the ALU is described below, where the S-box is implemented
using a lookup table.
// Datapath: ALU
reg [7:0] alu_out;
always @(*) begin
  alu_out = op2;
  casex (inst_dec_r)
    16'bxxxx_xxxx_xxx1_xxxx:
      alu_out = {7'd0, (op1 == op2)};
    16'bxxxx_xxxx_xx1x_xxxx:
      alu_out = op1 + op2;
    16'bxxxx_xxxx_x1xx_xxxx:
      alu_out = Sbox(op1);
    16'bxxxx_xxxx_1xxx_xxxx:
      alu_out = ml2(op2);
    16'bxxxx_xxx1_xxxx_xxxx:
      alu_out = ml3(op2);
    16'bxxxx_xx1x_xxxx_xxxx:
      alu_out = op1 ^ op2;
  endcase
end
// S-Box implemented using table lookup (outputs are the hex values of Table 8.1)
function [7:0] Sbox;
  input [7:0] inbyte;
  case (inbyte)
    8'h00: Sbox = 8'h63; 8'h01: Sbox = 8'h7c; 8'h02: Sbox = 8'h77; 8'h03: Sbox = 8'h7b;
    8'h04: Sbox = 8'hf2; 8'h05: Sbox = 8'h6b; 8'h06: Sbox = 8'h6f; 8'h07: Sbox = 8'hc5;
    8'h08: Sbox = 8'h30; 8'h09: Sbox = 8'h01; 8'h0a: Sbox = 8'h67; 8'h0b: Sbox = 8'h2b;
...
  end
endfunction
// X3 function by adding the results of X2 and X1
function [7:0] ml3;
  input [7:0] inbyte;
  reg [7:0] ml2_result;
  begin
    ml2_result = ml2(inbyte);
    ml3 = ml2_result ^ inbyte;
  end
endfunction
# wait for start signal
Main:       wat
# load plaintext
            mvc Ra, 0
            ldm R0, (Ra)
            ldm R1, (Ra)
            ...
            ldm R15, (Ra)
# load initial key
            jsb LoadKey
# xor plaintext and initial key
            jsb AddRoundKey
# repeat 9 rounds for 128-bit key,
# 11 rounds for 192-bit key, and
# 13 rounds for 256-bit key
            mvc R33, 0
# subroutine: MainRound
# A loop that executes Rr times
MainRound:  jsb SubBytes
            jsb ShiftRows
            jsb MixColumns
            jsb LoadKey
            jsb AddRoundKey
            adc R33, 1
            cmp R33, Rr
            jne MainRound
LastRound:  jsb SubBytes
            jsb ShiftRows
            jsb LoadKey
            jsb AddRoundKey
# store ciphertext into data memory
            mvc Ra, 0
            stm (Ra), R0
            stm (Ra), R1
            ...
            stm (Ra), R15
            dne
            jmp Main
The subroutine, LoadKey, shown below loads the key into R16-R31.
The subroutine, AddRoundKey, shown below XORs the plaintext stored in R0-R15
and the key stored in R16-R31.
The subroutine, SubBytes, shown below substitutes each state byte stored in R0-R15
using the S-box.
SubBytes:   sbt R0
            sbt R1
            ...
            sbt R15
            ret
The subroutine, ShiftRows, shown below cyclically left-shifts each row of the state
stored in R0-R15 by a different number of positions.
# shift 2nd row
ShiftRows:  mvr R32, R4
            mvr R4, R5
            mvr R5, R6
            mvr R6, R7
            mvr R7, R32
# shift 3rd row
            mvr R32, R8
            mvr R8, R10
            mvr R10, R32
            mvr R32, R9
            mvr R9, R11
            mvr R11, R32
# shift 4th row
            mvr R32, R15
            mvr R15, R14
            mvr R14, R13
            mvr R13, R12
            mvr R12, R32
            ret
The subroutine, MixColumns, shown below mixes the elements in each column of
the state stored in R0-R15 using a constant matrix with values 0x01, 0x02, and
0x03. While the new states of the first column (R0, R4, R8, and R12) are being
derived, the old states of the first column must retain their values. Therefore, the new
states of the first column are temporarily stored in R16, R20, R24, and R28; the new
states of the second column are temporarily stored in R17, R21, R25, and R29,
and so on. Finally, after all new states in R16-R31 are available, they are copied
into R0-R15.
# mix 1st column
MixColumns: ml2 R16, R0    # 1st element
            ml3 R20, R4
            xor R16, R20
            xor R16, R8
            xor R16, R12
            ml2 R20, R4    # 2nd element
            ml3 R24, R8
            xor R20, R0
            xor R20, R24
            xor R20, R12
            ml2 R24, R8    # 3rd element
Figure 8.50: Pattern used to detect the connectivity of a pixel (marked as black).
Notably, NW, W, N, and NE represent north-west, west, north, and north-east,
respectively.
determined because, using this method, it is known with complete confidence that
the remaining pixels can be used to identify all previously connected pixels.
The SSRAM stores the assigned identification numbers of all pixels. To determine
the object identification number of the current pixel, the identification numbers of its
4 neighboring pixels must also be available. Therefore, a FIFO buffer implemented
with (16 + 1) × 8 registers is required by the algorithm. The FIFO stores the
object identification numbers of all pixels prior to the current pixel, back to that of the
north-west one in the previous row. The FIFO advances every clock cycle, so the
following 4 identification numbers in the FIFO are needed: the first, the second, the
third, and the last, which correspond to the NW, N, NE, and W pixels, respectively.
The method only needs to read the original pixels once, reducing the detection time
to the bare minimum.
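One way to picture the (16 + 1)-entry FIFO is as a shift register with four taps, as in the sketch below. Everything here is an assumption made for illustration: the module name id_fifo, the port names, the convention that fifo[0] holds the most recent pixel, and the reset to the background number. Image-boundary handling is ignored.

```verilog
// Hypothetical sketch of the identification-number FIFO as a shift register.
// Assumed convention: fifo[0] is the most recent pixel, fifo[WIDTH] the oldest.
module id_fifo #(
  parameter WIDTH = 16          // assumed image width in pixels
) (
  input  wire       clk,
  input  wire       rst,        // assumed active-high asynchronous reset
  input  wire       shift_en,   // hypothetical: advance by one pixel
  input  wire [7:0] new_id,     // number just assigned to the current pixel
  output wire [7:0] id_w,
  output wire [7:0] id_ne,
  output wire [7:0] id_n,
  output wire [7:0] id_nw
);
  reg [7:0] fifo [0:WIDTH];     // WIDTH + 1 entries of 8 bits
  integer k;

  always @(posedge clk or posedge rst)
    if (rst)
      for (k = 0; k <= WIDTH; k = k + 1)
        fifo[k] <= 8'd255;      // start with background everywhere
    else if (shift_en) begin
      fifo[0] <= new_id;
      for (k = 1; k <= WIDTH; k = k + 1)
        fifo[k] <= fifo[k-1];
    end

  assign id_w  = fifo[0];        // 1 pixel back
  assign id_ne = fifo[WIDTH-2];  // WIDTH - 1 pixels back
  assign id_n  = fifo[WIDTH-1];  // WIDTH pixels back
  assign id_nw = fifo[WIDTH];    // WIDTH + 1 pixels back
endmodule
```

The NW, N, and NE taps sit next to each other at the old end of the register, while the W tap sits at the new end.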
For these examples, we assume the maximum number of temporarily identifiable
objects to be 254, i.e., from 8’d1 to 8’d254. Notably, the number of temporarily
identified objects may differ from the number of finally identified objects because
some can be merged owing to “the late detection of connectivity”, a concept which
will be introduced later. The object identification number, 8’d255, is reserved to
represent the background. The object identification number, 8’d0, is left unused to
ease debugging.
If the value of the pixel being identified is 0, its identification number is set to
8’d255. If it is 1, there are several possibilities, as shown in Figure 8.51. For case (a)
(in Figure 8.51(a)), the pixel is identified as “unconnected to any previous pixels”
because all of its surrounding identification numbers are the background number,
8’d255. Hence, it is given a new temporary identification number: 8’d1 for the first
new object, 8’d2 for the second, and so on. For case (b), the pixel is identified as
“connected to one previous pixel with identification number 8’d1”. Therefore, the
identification number of the previously identified object is assigned to the pixel. For
case (c), the pixel is identified as “connected to two previous, presumably
unconnected, pixels with identification numbers 8’d1 and 8’d2.” Hence, the smaller
identification number, 8’d1, is assigned to the pixel. Notably, it can be proven
that at most two previously identified objects that had been considered unconnected
can become connected through the new pixel under detection. This is referred to as
“the late detection of connectivity.” These two previously identified objects with
different object numbers need to be merged later.
Figure 8.51: Three possible cases detected using the pattern in Figure 8.50: (a) no
previously identified objects, (b) one previously identified object, and (c) two
previously identified objects.
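The three cases can be captured by a small combinational block, sketched below. It relies on the background number 8’d255 being larger than every object number, so a plain minimum over the four neighbors yields the smallest object number or 255 if there is none. The module name label_decide, the functions min2 and mask, and the ports id_nw, id_n, id_ne, id_w, alloc_new, and need_merge are hypothetical.

```verilog
// Hypothetical sketch of the labeling decision for one pixel.
module label_decide (
  input  wire       pixel,                    // 1: object pixel, 0: background
  input  wire [7:0] id_nw, id_n, id_ne, id_w, // neighbor identification numbers
  input  wire [7:0] obj_id_cnt,               // next unused temporary number
  output reg  [7:0] new_id,                   // number given to the current pixel
  output reg        alloc_new,                // case (a): obj_id_cnt must increase
  output reg        need_merge,               // case (c): two numbers must be merged
  output reg  [7:0] obj_id_min,               // smallest neighbor number
  output reg  [7:0] obj_id_sec_min            // second distinct neighbor number
);
  localparam BG = 8'd255;                     // background number

  function [7:0] min2;
    input [7:0] x, y;
    begin
      min2 = (x < y) ? x : y;
    end
  endfunction

  // Replace values equal to m by background so the next distinct minimum remains
  function [7:0] mask;
    input [7:0] x, m;
    begin
      mask = (x == m) ? BG : x;
    end
  endfunction

  always @(*) begin
    obj_id_min     = min2(min2(id_nw, id_n), min2(id_ne, id_w));
    obj_id_sec_min = min2(min2(mask(id_nw, obj_id_min), mask(id_n, obj_id_min)),
                          min2(mask(id_ne, obj_id_min), mask(id_w, obj_id_min)));
    alloc_new  = 1'b0;
    need_merge = 1'b0;
    if (!pixel)
      new_id = BG;                            // background pixel
    else if (obj_id_min == BG) begin
      new_id    = obj_id_cnt;                 // case (a): new temporary number
      alloc_new = 1'b1;
    end
    else begin
      new_id     = obj_id_min;                // cases (b) and (c): smallest number
      need_merge = (obj_id_sec_min != BG);    // case (c) only
    end
  end
endmodule
```

For neighbors (255, 1, 2, 255) and pixel = 1, the sketch assigns 8’d1 and flags a merge of 8’d2 into 8’d1.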
Using this detection method may result in the late detection of connectivity.
Therefore, a label table (label_tab) is needed to record the identification numbers
that must be merged owing to newly detected connections. The label
table has 254 entries with addresses ranging from 1 to 254. Each entry is initialized
to its own address, as shown in Figure 8.52.
The obj_id_cnt counts the temporarily identified objects and points to (or
represents) the next new temporary identification number. If the pixel value under
identification is 0, it is given identification number 8’d255. Hence, the original pixel
value is replaced by 8’d255, which is written into the FIFO and the same SSRAM address.
As such, the global picture of the connectivity is saved in the SSRAM, whereas the
FIFO stores the local connectivity. If the pixel under identification is 1, for case (a) (in
Figure 8.51(a)), the pixel is given a new temporary identification number, obj_id_cnt,
which is written into the FIFO and the same SSRAM address. Hence, the original pixel
value is replaced by obj_id_cnt. Then, obj_id_cnt increases by 1.
For case (b) (in Figure 8.51(b)), the pixel is identified as “connected to one previous
pixel with identification number 8’d1”. Therefore, the identification number, 8’d1, of
the previously identified object is assigned to the pixel and written into the FIFO and
the same SSRAM address. Thus, obj_id_cnt keeps its value. For case (c)
(in Figure 8.51(c)), the smaller identification number, 8’d1, of the previously
identified objects is assigned to the pixel and written into the FIFO and the same
SSRAM address. Again, obj_id_cnt keeps its value.
Figure 8.53: Recording in label_tab the information needed to later merge two
connected objects with different identification numbers, 8’d1 and 8’d2.
To merge the connected objects with different identification numbers for case (c)
(in Figure 8.51(c)), the content of label_tab addressed by the larger identification
number, 8’d2, is replaced by the smaller one, 8’d1, to indicate that the identification
number, 8’d2, will be merged with the identification number of 8’d1, as shown in
Figure 8.53.
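The initialization and the merge record can be sketched as a small register array. This is an illustration under assumptions: the module name label_table, the strobe need_merge, and the read port rd_addr/rd_data are hypothetical, and a register-based table with asynchronous read is assumed.

```verilog
// Hypothetical sketch of label_tab: initialization and merge recording.
module label_table (
  input  wire       clk,
  input  wire       rst,             // assumed active-high asynchronous reset
  input  wire       need_merge,      // hypothetical strobe for case (c)
  input  wire [7:0] obj_id_min,      // smaller number, e.g. 8'd1
  input  wire [7:0] obj_id_sec_min,  // larger number, e.g. 8'd2
  input  wire [7:0] rd_addr,         // hypothetical read address (1 to 254)
  output wire [7:0] rd_data
);
  reg [7:0] label_tab [1:254];
  integer k;

  always @(posedge clk or posedge rst)
    if (rst)
      for (k = 1; k <= 254; k = k + 1)
        label_tab[k] <= k[7:0];                 // each entry starts as its own address
    else if (need_merge)
      label_tab[obj_id_sec_min] <= obj_id_min;  // larger number points to smaller

  assign rd_data = label_tab[rd_addr];
endmodule
```

After the merge of the example, label_tab[2] holds 8’d1 while label_tab[1] still holds 8’d1, which marks 8’d1 as the number both objects will share.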
Figure 8.54: After determining the object identification number for each pixel: (a)
contents of SSRAM and (b) label_tab. Initial contents in label_tab are overwritten
by those object numbers that need to be merged.
So far, the label_tab has only recorded objects that are connected pairwise. That is,
the pattern used to detect the connectivity of pixels only guarantees that
the connectivity of adjacent pixels is detected; it does not guarantee that their object
identification numbers are the same. For example, the contents of the SSRAM and the
label_tab for the binary image in Figure 8.46 become the ones shown in Figure 8.54(a)
and 8.54(b), respectively. The obj_id_cnt is 12, which means that there are 11
temporarily identified objects.
To merge all connected objects under the same object number, the label_tab shown
in Figure 8.55 must be thoroughly scanned so that a unique identification number can
be used for all connected objects. To accomplish this, the entries in the label_tab are
scanned one by one, from the entry pointed to by (obj_id_cnt−1) down to the entry with
the minimum identification number, i.e., 1. When scanning a specific
entry, its content is looked up and compared to its address. If they are not the same,
its content is used as the next address. This process continues until an entry is reached
whose content and address have the same value. Such an entry represents the unique
(and minimum) identification number that the specific entry must use.
For example, the temporarily identified object number, 8’d11, must finally be
merged with the object number, 8’d8, as shown in Figure 8.55(a). This is achieved
by looking up the content of the entry addressed by 8’d11, which indicates that 8’d11
should be merged with object number 8’d9. Then, the content of the entry addressed
by 8’d9 indicates that 8’d9 should be merged with object number 8’d8. Furthermore,
the content of the entry with the address 8’d8 indicates that 8’d8 should be merged
with object number 8’d8. At this point, the scanning for object number 8’d11 stops
because the address and its content are identical. Finally, the entry for the
temporarily identified object number, 8’d11, is overwritten with 8’d8, the minimum
identification number. This process continues until all temporarily identified object
numbers (from obj_id_cnt−1 = 11 down to 1) have been merged, as shown in
Figure 8.55(b). Eventually, the label_tab specifies that, going from 11 down to 1,
object numbers 11, 10, and 9 should be merged with 8, 8 (the minimum object number
itself) should be merged with 8, 7 and 6 should be merged with 4, 5 should be merged
with 3, 4 should be merged with 4, and so on.
Figure 8.55: Merging of (a) temporarily identified object number, 8’d11 and (b) all
temporarily identified objects. The goal is to derive a unified object number for all
connected pixels.
Finally, the identification numbers stored in SSRAM are remapped to those spec-
ified in the label_tab and the background identification number, 8’d255, is remapped
to 8’d0.
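The merging and remapping steps described above can be sketched as a small behavioral simulation model. This is only an illustration under stated assumptions, not the RTL of the CLE: the module name label_merge_model, the task merge_all, and the function remap are hypothetical, and only the two links stated in the text (11 to 9 and 9 to 8) are loaded, while every other entry keeps a self-pointing value.

```verilog
// Behavioral (non-synthesizable) sketch of the label_tab merge and remap.
// Names other than label_tab, obj_id_cnt, and BG_OBJ_ID are hypothetical.
module label_merge_model;
  localparam [7:0] BG_OBJ_ID = 8'd255;

  reg [7:0] label_tab [1:254];
  integer   i, id, root;

  // Outer loop: visit the ids from obj_id_cnt-1 down to 1.
  // Inner loop: follow the links until an entry points to itself.
  task merge_all;
    input integer obj_id_cnt;
    begin
      for (id = obj_id_cnt - 1; id >= 1; id = id - 1) begin
        root = id;
        while (label_tab[root] != root)
          root = label_tab[root];
        label_tab[id] = root;   // store the minimum id that was found
      end
    end
  endtask

  // Final remap of one value stored in the SSRAM.
  function [7:0] remap;
    input [7:0] tmp_id;
    remap = (tmp_id == BG_OBJ_ID) ? 8'd0 : label_tab[tmp_id];
  endfunction

  initial begin
    // Every entry initially points to itself.
    for (i = 1; i <= 254; i = i + 1)
      label_tab[i] = i;
    // Links taken from the example in the text.
    label_tab[11] = 8'd9;
    label_tab[9]  = 8'd8;

    merge_all(12);

    $display("label_tab[11] = %0d", label_tab[11]);
    $display("remap(11)     = %0d", remap(8'd11));
    $display("remap(255)    = %0d", remap(8'd255));
  end
endmodule
```

In this setup the search for 11 visits 9 and then 8, so entry 11 ends up holding 8, as in Figure 8.55(a). The inner loop terminates because every link recorded during labeling points from a larger identification number to a smaller one.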
Figure 8.57: Architecture of the datapath: FIFO buffer and SSRAM interface.
[Figure residue: datapath diagram with the signal labels FIFO[...], FIFO[DEPTH], shape_nw, shape_n, shape_ne, shape_w, SORT, obj_id_min, obj_id_sec_min, obj_id_cnt, label_tab[...], label_tab[sch_obj_id], mg_obj_id, and sch_obj_id.]
The timing diagram that governs the operations of CLE is presented in Figure 8.59
based on the appropriate algorithm, state machine, and architecture. In the timing
diagram, we assume that there are only two temporarily identified objects: 2 and 1.
Also, the label_tab indicates that object 2 needs to be merged with object 1.
The RTL codes of the state machine are illustrated below.
// ...
  state_ns = state_cs;
  case (state_cs)
    ST_IDLE    : state_ns = ST_RD_PIX;
    ST_RD_PIX  : state_ns = ST_WR_PIX;
    ST_WR_PIX  : state_ns = pix_end ? ST_MG_INI : ST_RD_PIX;
    ST_MG_INI  : state_ns = ST_MG_CHK;
    ST_MG_CHK  : state_ns = id_end ? ST_RD_ID : ST_SCH_INI;
    ST_SCH_INI : state_ns = ST_MG_SCH;
    ST_MG_SCH  : state_ns = sch_end ? ST_MG_CHK : ST_MG_SCH;
    ST_RD_ID   : state_ns = ST_WR_ID;
    ST_WR_ID   : state_ns = pix_end ? ST_DONE : ST_RD_ID;
    ST_DONE    : state_ns = ST_DONE;
  endcase
end
A pixel counter register, pix_cnt, is used to scan all pixels or remap all objects
according to the label_tab. The pix_cnt is also used as the SSRAM address.
The “scanning all pixels” operation reads each original pixel value, determines its
object identification number, and writes the object identification number into both
the SSRAM and FIFO. The temporary identification number is determined using
the shape presented in Figure 8.50. During the state ST_WR_PIX, if the pixel value
under identification is 0, then it is given the identification number 8’d255; if the pixel
under identification is 1, there can be three possible results, which are presented in
Figure 8.51.
During the state ST_WR_ID, the temporary identification number 8'd255, i.e.,
the reserved background object identification number BG_OBJ_ID, is remapped to
8'd0; any other identification number is remapped according to the label_tab.
Therefore, the SSRAM is written at states ST_WR_PIX and ST_WR_ID.
// SRAM interface
reg sram_wen;
reg [7:0] sram_d, obj_id_cnt;
wire [9:0] sram_a;
// ...
always @(*)
// ...
north-western and western pixels are absent and assumed to be BG_OBJ_ID, and
the identification numbers for the northern and north-eastern pixels are respectively
FIFO[1] and FIFO[2], while for those pixels under detection in the last column, the
identification number of the north-eastern pixel is also absent and assumed to be
BG_OBJ_ID.
The register, obj_id_cnt, counts the number of used identifications. If the pixel
value is 1 and obj_id_min is BG_OBJ_ID, i.e., 8'd255, then all four identification
numbers for the pixels around the target pixel are BG_OBJ_ID. Hence, a new
identification number is used and obj_id_cnt increases by 1. In addition to being written
into the SSRAM, the identification number is also written into the FIFO, which stores
the local connectivity for the remaining pixels. The label_tab stores the locally merged
identification numbers. When obj_id_min and obj_id_sec_min are different and neither
equals BG_OBJ_ID, the larger identification number should be merged with the
smaller one; this is recorded by writing obj_id_min into address obj_id_sec_min of
the label_tab.
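The decision rules above can be collected into one combinational sketch. It is not the CLE code itself: the module name and the ports new_id, use_new_id, and merge_en are hypothetical, and it assumes that a newly allocated identification number equals the current value of obj_id_cnt, which this excerpt does not confirm.

```verilog
// Combinational sketch of the per-pixel decision in state ST_WR_PIX.
// The module and the ports new_id, use_new_id, and merge_en are hypothetical.
module cle_id_assign_sketch #(
  parameter [7:0] BG_OBJ_ID = 8'd255
) (
  input            pix,             // binary value of the pixel under identification
  input      [7:0] obj_id_min,      // smallest id among the NW, N, NE, and W pixels
  input      [7:0] obj_id_sec_min,  // second smallest id among them
  input      [7:0] obj_id_cnt,      // assumed to hold the next unused id
  output reg [7:0] new_id,          // id written into the SSRAM and the FIFO
  output           use_new_id,      // obj_id_cnt should increase by 1
  output           merge_en         // label_tab[obj_id_sec_min] <= obj_id_min
);
  // No labeled neighbor: a new identification number is needed.
  assign use_new_id = pix && (obj_id_min == BG_OBJ_ID);

  // Two different labeled neighbors: record that they are connected.
  assign merge_en = pix && (obj_id_min != BG_OBJ_ID) &&
                    (obj_id_sec_min != BG_OBJ_ID) &&
                    (obj_id_sec_min != obj_id_min);

  always @(*)
    if (!pix)            new_id = BG_OBJ_ID;   // background pixel
    else if (use_new_id) new_id = obj_id_cnt;  // assumption: next unused id
    else                 new_id = obj_id_min;  // reuse the smallest neighbor id
endmodule
```

The conditions for use_new_id and merge_en mirror the ones used for obj_id_cnt and label_tab in the listing of the book; only the selection of new_id is an assumption.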
integer i1, i2;
// ...
        shape_w};
assign {sort_result[0], sort_result[1], sort_result[2],
// ...
  if (reset)
    // Object id 0 is not used, 255 is reserved
    obj_id_cnt <= 1;
  else if (state_cs == ST_WR_PIX && sram_q[0] == 1'b1 &&
           obj_id_min == BG_OBJ_ID)
    obj_id_cnt <= obj_id_cnt + 1;
always @(posedge clk or posedge reset)
  if (reset)
    for (i1 = 0; i1 <= DEPTH; i1 = i1 + 1)
      FIFO[i1] <= BG_OBJ_ID;
  else if (state_cs == ST_WR_PIX) begin
    for (i1 = 0; i1 <= DEPTH - 1; i1 = i1 + 1)
      FIFO[i1] <= FIFO[i1 + 1];
    FIFO[DEPTH] <= sram_d;
  end
always @(posedge clk or posedge reset)
  if (reset)
    for (i2 = 1; i2 <= 254; i2 = i2 + 1)
      label_tab[i2] <= i2;
  else if (state_cs == ST_WR_PIX && sram_q[0] == 1'b1 &&
           obj_id_min != BG_OBJ_ID && obj_id_sec_min != BG_OBJ_ID
           && obj_id_sec_min != obj_id_min)
    label_tab[obj_id_sec_min] <= obj_id_min;
  else if (state_cs == ST_MG_SCH && state_ns == ST_MG_CHK)
    label_tab[mg_obj_id] <= sch_obj_id;
// Sort the obj ids
// ...
The register, mg_obj_id, stores the current identification number that is being
searched for the minimum identification number that should be used for merging.
At state ST_MG_INI, the identification numbers in the label_tab are checked for
merging using the mg_obj_id initialized based upon the number of identification
numbers, obj_id_cnt−1. The operation progresses in reverse order from the last to
the first identification number. During each transition to ST_MG_CHK, mg_obj_id
decreases by 1 so that the next smaller identification number is checked.
When mg_obj_id reaches the smallest identification number, the merging stops.
While searching for the minimum identification number with which the current
identification number should be merged, the register sch_obj_id repeatedly loads
the content of the label_tab entry that it points to, i.e., label_tab[sch_obj_id]. When
sch_obj_id and label_tab[sch_obj_id] are equal, the minimum identification number
used for merging has been found and the search stops. Therefore,
the merging process of label_tab effectively implements two nested for loops.
// Merging of label_tab
reg [7:0] mg_obj_id, sch_obj_id;
wire id_end, sch_end;
reg finish;
// ...
Finally, during state ST_WR_ID, the object identification numbers stored in the
SSRAM are remapped based upon the rules specified by the label_tab except that the
identification number, BG_OBJ_ID, is remapped to 8’d0. Also, the control signal,
“finish”, asserts when the state machine enters the state ST_DONE.
PROBLEMS
1. a. If $\Delta V(0) = e^{-1} V_{DD}$, find the time for the circuit to converge to $\Delta V(t) = V_{DD}$.
b. What about $\Delta V(0) = e^{-2} V_{DD}$?
c. What about $\Delta V(0) = 0.25 V_{DD}$?
2. For the stable states, $\Delta V(t) = +3V_{DD}$ or $\Delta V(t) = -3V_{DD}$. What is the smallest
value of $\Delta V(0)$ that converges in less than $5\tau$?
3. What is the frequency of timing error, $f_{TE}$, for $t_S = t_H = 0.1$ ns, $t_C = 5$ ns, and
$f_I = 1$ MHz?
4. Please verify the functions, bin_to_gray and gray_to_bin, in Section 8.3.5.
5. Calculate the MTBF for a system with $f_I = 100$ kHz, $f_C = 1$ GHz, $t_S = t_H = 50$
ps, $\tau = 100$ ps, and $t_{CQ} = 80$ ps that uses three back-to-back flip-flops for a
synchronizer.
6. We want to synchronize a binary sequence from clkin domain to clkout domain.
Please identify the potential problems in the following deterministic multi-bit
synchronizer and fix them.
// ...
case (cnt)
  3'b000 : cnt_gray = 3'b000;
  3'b001 : cnt_gray = 3'b001;
  3'b010 : cnt_gray = 3'b011;
  3'b011 : cnt_gray = 3'b010;
  3'b100 : cnt_gray = 3'b110;
  3'b101 : cnt_gray = 3'b111;
  3'b110 : cnt_gray = 3'b101;
  3'b111 : cnt_gray = 3'b100;
endcase
always @(posedge clkout) begin
// ...
9. Assume that $t_S = 50$ ps, $t_H = 20$ ps, $\tau = 40$ ps, $t_{CQ} = 20$ ps, $f_I = 200$ MHz, and
$f_C = 2$ GHz. Calculate the MTBF for the following synchronizers.
a. Waiting one cycle for synchronization.
b. Waiting five cycles for synchronization.
c. Using five back-to-back flip-flops for synchronization.
10. When using a two-bit simple synchronizer to transfer a two-bit Gray-coded sig-
nal across clock domains, what is the minimum amount of time that needs to
elapse between bit toggles? That is, what is the maximum clock rate at which
the Gray codes can advance?
11. Consider a FIFO synchronizer that uses simple synchronizers composed of three
back-to-back flip-flops to synchronize the read and write pointers. Assuming
the input and output clocks are running at approximately the same frequency
(±10%), what is the minimum FIFO depth that will support data transport at
full rate?
12. Please design a Gray code sequence with 5 elements.
13. Please verify the 8-bit crypto processor by finishing remaining RTL codes and
assembly codes, and translating the assembly codes into binary machine codes
that can be executed directly by the processor. The binary machine is converted
from the assembly code using the instruction format, opcode, and address map-
ping of registers.
14. The status register, Ri, is updated upon the execution of cmp instruction. Hence,
if the cmp instruction executes in a subroutine, the contents of Ri produced
in the main program become invalid. To restore Ri (or other registers) after a
subroutine, it (or they) must be automatically saved into the stack memory when
jsb instruction executes. Please expand the stack register to 2 entries that can
store and load Ri as well as PC to/from the stack memory when encountering
jsb and ret instructions, respectively.
15. Please redesign the 8-bit crypto processor, including the ISA, RTL codes, and
assembly codes, such that the AES decryption algorithm can be implemented.
16. Please redesign the 8-bit crypto processor, including the assembly codes, and
instruction and data memories if needed, such that the AES operation mode of
cipher block chaining (CBC) for 2 blocks of plaintext can be implemented. The 2
blocks of plaintext are put in the first two 16 Bytes of the data memory. Then, the
16-Byte initial vector follows. The maximum required space of 15 × 16 = 240
Bytes allocated for the keys follows the initial vector.
17. Please redesign the 8-bit crypto processor such that both the AES encryption
and decryption algorithms can be implemented. There are two 256 × 8 ROMs
for the encryption and decryption assembly codes, respectively. There are also two
256 × 8 RAMs for the plaintext (to be encrypted) and the ciphertext (to be decrypted),
respectively. The opcodes of encryption and decryption are shared so that the 8-bit
crypto processor need not be extended to 16 bits. To switch between encryption
and decryption, an input signal can select the mode and the required ROM and
RAM.
18. Please redesign and extend the crypto processor to 32 bits such that an instruc-
tion can process 4 states of plaintext at a time. What’s the throughput of your
design? How much performance has been improved?
19. The instruction stages of the original 8-bit crypto processor are not pipelined.
Please redesign a pipelined 8-bit crypto processor such that the fetch-decode-
execute steps can be performed concurrently. What’s the throughput of your
design? How much performance has been improved?
20. Please redesign the 8-bit crypto processor, including the ISA, RTL codes, and
assembly codes, such that the data encryption standard (DES) encryption algo-
rithm can be implemented.
21. Please redesign the CLE such that the temporary identification number of back-
ground can be given by 8’d0.
22. Please redesign the CLE such that only a smaller FIFO with 3 × 8 registers, which
stores the object IDs of the NW, N, and W pixels, is needed. To do this, the memory must
be accessed twice, once for reading the pixel value under detection and once
again for the object ID of the NE pixel. Please compare the new design with the
original CLE using a FIFO of (32 + 1) × 8 registers.
23. Rewrite the Verilog codes of the CLE using the named block for all for loops.
24. Please design an edge detector that can compute the derivatives of the intensity
signal in the x and y directions for detecting abrupt changes in intensity,
particularly at the boundaries of objects. Subsequent analysis should be able to
determine what the objects are. We assume a monochrome image of 480×640
pixels, each of 8 bits. The pixels of an image, from left to right, top to bottom,
are to be stored in a 76800×32 SRAM. Four pixels are in an SRAM address.
Pixel values are interpreted as unsigned integers ranging from 0 (black) to 255
(white). We can use the Sobel edge detector, which approximates the derivative
in each direction for each pixel by a process called convolution. This involves
multiplying 9 adjacent pixels by 9 coefficients, which are often represented by
two 3×3 convolutional masks, Gx and Gy (shown in Figure 8.60), and then sum-
ming the 9 products to form two partial derivatives for the derivative image, Dx
and Dy .
[Figure 8.60: Convolutional masks Gx and Gy.]
However, since we are just interested in finding the maxima and minima in the
magnitude, a sufficient approximation is
$$|D| = |D_x| + |D_y|. \tag{8.19}$$
Note that the pixels around the edge of the image do not have a complete set of
neighboring pixels, so we need to treat them separately. The simplest approach
is to set the derivative value, $|D|$, of the edge pixels to 0. A pseudo code is
presented below. Let P[r][c] denote the pixel value in the original image, and D[r][c]
denote the pixel value in the derivative image, where the row index, r, ranges from 0
to 479, and the column index, c, ranges from 0 to 639. Also, let $G_x[i][j]$ and
$G_y[i][j]$ denote the convolutional masks for the x and y axes, respectively, where
$i, j = -1, 0, +1$.
// ...
// Other pixels
// ...
c. Please redesign the edge detector with a constraint on the number of times
memory can be accessed, such that each memory is only allowed to be read
once.
25. Design and verify the video edge detection based on the Sobel accelerator.
26. Design and verify the video edge detection based on the Sobel accelerator using
a model in which the pixels can only be read once.
9 I/O Interface
The mechanism for transferring information between internal storage and external I/O
devices is called the I/O interface. The I/O interface interacts with the physical world
through I/O devices, such as the display and keypad used as human interfaces.
A bus is a communication system that can transfer data within or between com-
puters. Some buses are used for connecting separate chips on a circuit board. Others
connect separate boards in a system. Bus specifications and protocols vary, depend-
ing on the requirement of their intended use. Off-chip buses use tristate drivers for
signals that have multiple data sources. For example, the PCI bus is used to connect
add-on cards to a computer. On-chip buses are used to connect sub-modules within
an IC. They have separate input and output signals, which allows multiplexers and
demultiplexers to be used to connect components. Examples include the AMBA buses
specified by ARM, the CoreConnect buses specified by IBM, and the Wishbone bus specified
by the OpenCores organization.
To easily integrate components designed by different teams, a number of common
bus protocols have been specified. Components connected using buses should
conform to the same bus protocol. Otherwise, some interface glue logic may be
required. The specification of a bus protocol includes a signal list for connecting
compliant components, and a description of the operation sequences and signal
timings to implement various bus operations. The address width of a bus determines the
memory space it can address, and the data width of a bus determines how much data
can be transferred at a time, and hence the transfer speed.
This chapter introduces the I/O controller for the keypad. Additionally, an I/O
processor is used to program or control the I/O controller. The multiplexed, tristate,
and open-drain buses, and several serial transmission protocols are presented. The
main difference between the programs in embedded systems and general purpose
computers is that the embedded software must be able to react promptly when an
event occurs. Therefore, we introduce the I/O interfaces of embedded software, such
as polling, interrupt, and timer, for embedded systems. Finally, an accelerator of FFT
processor is illustrated from its algorithm to RTL design.
Figure 9.1: (a) Keypad switches arranged in a scanned matrix. (b) A keypad ma-
trix with an output register, row register, for driving row lines and an input register,
column register, for latching column lines.
i.e., row[3 : 0] = 4’b1101, if either digit 4, 5, or 6 has been pushed, its corresponding
column signals will be pulled low and detected. When all of the key switches are
open, all column lines are pulled high by the resistors.
The row signal is controlled by a processor through its I/O interface, as shown in
Figure 9.2. The data buses, din[7:0] and dout[7:0], have 8 bits. When cen is true,
a write (wen = 1'b1) or read (wen = 1'b0) I/O command has been issued. There are
3 I/O ports provided by the controller, and they are decoded by the address signal,
addr[1:0]. The port numbers 0, 1, and 2 are used to access the status register, row
register, and column register, respectively. Bit 0 of the status register indicates that a
valid column signal, col[2:0], has been sampled. Bits 7 to 1 of the status register are
reserved. Bits 3 to 0 of the row register drive the row signal, row[3:0]. Bits 7 to 4
of the row register are reserved. Bits 2 to 0 of the column register store the valid
column signal, col[2:0], and bits 7 to 3 are reserved.
However, a user can push a button or switch at an arbitrary time. Even worse, as
the switch closes, the contact may bounce back and forth several times. This may
cause the circuit to open and close several times before finally settling in the stable,
closed position. Hence, the column signals should be synchronized to avoid
timing failures and debounced to generate a stable column signal. The minimum
scanning period for each row is assumed to be 2 ms, and the debouncing interval is
1 ms.
The steps used by the processor to decide the pushed keys are: 1) configure the
row register; 2) wait for a valid and debounced column signal by reading the status
register (bit 0 of the status register automatically clears after being read); 3) read the
stable column register. A debounced column signal can be obtained by comparing two
column samples separated by 1 ms. If they are the same, a debounced column signal
has been derived; otherwise, another 1 ms must elapse to allow the contact to settle
down.
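The synchronize-then-compare rule can be sketched in isolation as follows. This is a simplified illustration rather than the controller of this section: the module name, the parameters CLK_HZ and TICK_CYCLES, the counter ms_cnt, and the output col_stable are assumptions, and col_valid is modeled as a one-cycle strobe instead of a status bit that clears on read.

```verilog
// Debounce sketch: double-synchronize col, then accept it when two samples
// taken 1 ms apart agree. Parameter and most signal names are hypothetical.
module col_debounce_sketch #(
  parameter CLK_HZ      = 50_000_000,     // 50 MHz system clock
  parameter TICK_CYCLES = CLK_HZ / 1000   // clock cycles per 1 ms
) (
  output reg [2:0] col_stable,  // last debounced column value
  output reg       col_valid,   // one-cycle strobe: col_stable was updated
  input      [2:0] col,         // raw, asynchronous column lines
  input            clk,
  input            rst
);
  reg  [2:0]  col_r, col_rr;    // two-stage synchronizer
  reg  [2:0]  col_old;          // sample taken at the previous 1 ms tick
  reg  [15:0] ms_cnt;           // wide enough for 50,000 cycles
  wire        time_1ms = (ms_cnt == TICK_CYCLES - 1);

  // Double synchronization; idle column lines are pulled high.
  always @(posedge clk or posedge rst)
    if (rst) begin
      col_r  <= 3'b111;
      col_rr <= 3'b111;
    end
    else begin
      col_r  <= col;
      col_rr <= col_r;
    end

  // Free-running 1 ms tick generator.
  always @(posedge clk or posedge rst)
    if (rst)           ms_cnt <= 16'd0;
    else if (time_1ms) ms_cnt <= 16'd0;
    else               ms_cnt <= ms_cnt + 1'b1;

  // Compare the current sample with the one taken 1 ms earlier.
  always @(posedge clk or posedge rst)
    if (rst) begin
      col_old    <= 3'b111;
      col_stable <= 3'b111;
      col_valid  <= 1'b0;
    end
    else begin
      col_valid <= 1'b0;
      if (time_1ms) begin
        col_old <= col_rr;
        if (col_old == col_rr) begin
          col_stable <= col_rr;
          col_valid  <= 1'b1;
        end
      end
    end
endmodule
```

If TICK_CYCLES is overridden with a value above 65,536, ms_cnt must be widened accordingly.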
We develop a Verilog model for the keypad controller that can generate a stable
signal to indicate the status of 12 keys, as shown below. The frequency of system
clock is 50 MHz.
// Module of I/O controller
module io_ctrl(row, dout, col, cen, wen, addr, din,
               clk_50mhz, rst);
output [3:0] row;
output [7:0] dout;
input  [2:0] col;
input  [1:0] addr;
input  [7:0] din;
// ...
reg [3:0] row_reg;
// ...
wire time_1ms;
// ...
reg [7:0] dout;
// ...
reg col_valid;
// ...
  state_ns = state_cs;
  case (state_cs)
    ST_WAIT : state_ns = (cen && wen && addr == 1) ?
                         ST_DET : ST_WAIT;
    ST_DET  : state_ns = (time_1ms && col_old == col_rr) ?
                         ST_DEBD : ST_DET;
    ST_DEBD : state_ns = (cen && !wen && addr == 0) ?
                         ST_WAIT : ST_DEBD;
  endcase
end
// ...
// Double synch
// ...
begin
// ...
The primary field of an instruction is the 4-bit opcode, short for operation code
that specifies the operation to be performed and, by implication, the layout of the
remaining fields within the code word. All registers in the processor have 8 bits.
There are general-purpose registers, R0-R3, the program counter, PC, and the read-
only status register, Ri, for the comparison instruction, including E-bit (equivalence
bit) in bit 0, P-bit (positive bit) in bit 1, N-bit (negative bit) in bit 2, and other bits
are reserved. The stack memory has only one entry, i.e., stack register Rk, that can
support non-nested subroutine call. Notice that Rd and Rs can be either R0-R3, PC,
Ri, or Rk. However, it is not allowed to manually update read-only registers, PC, Ri,
and Rk.
Address mapping of the processor is listed in Table 9.2. The register, Rk, is the
stack register with only one entry used to save and restore PC upon executing the
instructions, jsb and ret, respectively.
Upon reset, the processor starts the fetch-decode-execute steps from the PC= 0.
The PC automatically increments to fetch instructions sequentially unless a jump
instruction, jmp, jne, jp, or jsb, is encountered.
The timing diagrams of the instruction set are presented in Figure 9.3. Notice that the
timing diagrams of jmp, jne, and jp are similar to that of jsb, except that jmp, jne,
and jp do not store the PC into the stack register, Rk.
The detailed datapath of the processor is presented in Figure 9.4.
Figure 9.3: Timing diagrams of instructions: (a) mvc, (b) cmpc, (c) jsb, (d) ret, (e)
out, and (f) inp.
// ...
  state_ns = state_cs;
  case (state_cs)
    ST_FET : state_ns = ST_DEC;
    ST_DEC : state_ns = ST_EXE;
    ST_EXE : state_ns = ST_FET;
  endcase
end
In the control unit, the RTL codes of the decoder are described below. During the
state, ST_FET, the PC is incremented; during the state, ST_DEC, the opcode is
decoded to generate instruction enable signals, inst_dec[8 : 0], write enable signals,
wr_en[3 : 0], for R3-R0, the enable signals to latch operands for the ALU, op1_en
and op2_en. The signals, inst_dec[8 : 0] and wr_en[3 : 0], are pipelined.
// ...
wire inc_PC;
// ...
integer i;
// ...
  inst_dec = 9'd0;
  wr_en    = 4'd0;
  op1_en   = 1'b0;
  op2_en   = 1'b0;
  if (state_ns == ST_DEC) begin
    case (opcode)
      INST_MVC : begin
        inst_dec[0] = 1'b1;
        for (i = 0; i <= 3; i = i + 1)
          if (Rd == i) wr_en[i] = 1'b1; // Update Rd
        op2_en = 1'b1;
      end
      INST_CMPC : begin
        inst_dec[1] = 1'b1;
        op1_en = 1'b1; op2_en = 1'b1;
      end
      INST_JMP : begin inst_dec[2] = 1'b1; op2_en = 1'b1; end
      INST_JNE : begin inst_dec[3] = 1'b1; op2_en = 1'b1; end
      INST_JP  : begin inst_dec[4] = 1'b1; op2_en = 1'b1; end
      INST_JSB : begin inst_dec[5] = 1'b1; op2_en = 1'b1; end
      INST_RET : inst_dec[6] = 1'b1;
      INST_OUT : begin inst_dec[7] = 1'b1; op2_en = 1'b1; end
      INST_INP : begin
        inst_dec[8] = 1'b1;
        for (i = 0; i <= 3; i = i + 1)
          if (Rd == i) wr_en[i] = 1'b1; // Update Rd
      end
    endcase
  end
end
// ...
  if (rst) begin
    wr_en_r  <= 4'd0;
    wr_en_rr <= 4'd0;
  end
  else begin
    wr_en_r  <= wr_en;
    wr_en_rr <= wr_en_r;
  end
always @(posedge clk or posedge rst)
  if (rst) begin
    inst_dec_r   <= 9'd0;
    inst_dec8_rr <= 1'b0;
  end
  else begin
    inst_dec_r   <= inst_dec;
    // For inp to latch data from output port
    inst_dec8_rr <= inst_dec_r[8];
  end
In the datapath, the ROM and I/O interfaces, registers R0-R3, PC, Ri, and Rk are
described below.
// ...
reg [5:0] addr;
// ...
wire [7:0] din;
integer i1;
// ...
assign roma = PC;
// ...
  if (rst)
    wen <= 1'b0;
  else if (inst_dec[7]) wen <= 1'b1; // out
  else wen <= 1'b0;
always @(posedge clk or posedge rst)
  if (rst)
    cen <= 1'b0;
  else if (inst_dec[7] | inst_dec[8]) cen <= 1'b1; // out & inp
  else cen <= 1'b0;
always @(posedge clk)
// ...
  if (rst)
    for (i1 = 4; i1 <= 6; i1 = i1 + 1) R[i1] <= 0;
  else if (inc_PC) R[4] <= R[4] + 1'b1;
  else if (inst_dec8_rr) begin
    // begin/end keeps the next "else" paired with "if (inst_dec8_rr)";
    // without it, the else would bind to the inner "if (wr_en_rr[i1])"
    for (i1 = 0; i1 <= 3; i1 = i1 + 1)
      if (wr_en_rr[i1]) R[i1] <= dout;
  end
  else
    casex (inst_dec_r)
      9'bx_xxxx_xxx1 :                                     // mvc
        for (i1 = 0; i1 <= 3; i1 = i1 + 1)
          if (wr_en_r[i1]) R[i1] <= alu_out;
      9'bx_xxxx_xx1x : R[5][2:0] <= alu_out[2:0];          // cmpc
      9'bx_xxxx_x1xx : R[4] <= alu_out;                    // jmp
      9'bx_xxxx_1xxx : R[4] <= ~R[5][0] ? alu_out : R[4];  // jne
      9'bx_xxx1_xxxx : R[4] <= R[5][1] ? alu_out : R[4];   // jp
      9'bx_xx1x_xxxx : begin                               // jsb
        R[4] <= alu_out; R[6] <= R[4];
      end
      9'bx_x1xx_xxxx : R[4] <= R[6];                       // ret
      9'bx_1xxx_xxxx : ; // No register to update
    endcase
In the datapath, the multiplexers used to produce the operands of ALU are de-
scribed below.
// ...
always @(*)
// ...
// ALU
reg [7:0] alu_out;
always @(*) begin
  alu_out = op2;
  casex (inst_dec_r)
    // cmpc: N-bit, P-bit, and E-bit in bits 2 to 0
    9'bx_xxxx_xx1x : alu_out = {5'd0, (op1 < op2), (op1 > op2),
                                (op1 == op2)};
  endcase
end
to define a constant. The registers, R0, R1, and R2, store the row register, status
register, and column register of the I/O controller, respectively. Rows are scanned by
configuring the row register in the I/O controller. After a valid and debounced column
register has been acquired, R2 can be used to decide the pushed key in the corresponding row.
In the assembly codes, comments start with the “#” character and extend to the
end of the line. Notice that similar and repeated instructions are omitted to save the
space and represented by “...”. We can place a label followed by a colon before an
instruction. The label is the designation for the address of the instruction. We assume
that the assembler works out the address for us. We can then refer to the label in
instructions.
# port address definition
KEY_STATUS equ 0
ROW_REG    equ 1
COL_REG    equ 2
Main: mvc  R0, 14        # select 1st row
      out  ROW_REG, R0
      jsb  WaitColReg
      cmpc R2, 6         # key in the 1st column
                         # has been pressed
      cmpc R2, 5         # key in the 2nd column
                         # has been pressed
      cmpc R2, 3         # key in the 3rd column
                         # has been pressed
      mvc  R0, 13        # select 2nd row
      ...
      mvc  R0, 11        # select 3rd row
      ...
      mvc  R0, 7         # select 4th row
      ...
      jmp  Main          # repeat scanning all rows
After setting a row register, the subroutine, WaitColReg, shown below waits until
a valid column register has been detected and indicated by bit 0 of the status register
in I/O controller.
9.2 BUSES
A bus is an interconnect that can move data between components. The simplest bus
we have seen thus far is a point-to-point connection where one component acts as the
source of data and another one acts as the destination. However, in many systems,
it is necessary to use a common interface to connect multiple sources to multiple
destinations, as shown conceptually in Figure 9.5. The interconnect carries both data
and control signals to sequence the operations on the bus. Three solutions that can
avoid bus contention are presented below.
Example 9.1. A 32-bit data word is serially transmitted between two sides of a system.
The timing diagram of the transmission scheme is presented in Figure 9.11. Assume
that both transmitter and receiver are in the same clock domain. The strobe signal,
load_en, indicates that a data word is ready to transmit. The transmitter outputs the
least significant bit first. After a load_en strobe occurs, the signal, tx_valid,
marks the serial data as valid for 32 cycles so that the receiver can shift the serial data in.
After the transmission is complete, the receiver generates a strobe signal, rx_valid.
Solution: The RTL codes of the transmitter are displayed below, within which we
need a 32-bit shift register with parallel load control.
// ...
reg tx_valid;
// ...
reg [4:0] tx_cnt;
// ...
On the receiver side, we also need a 32-bit shift register, as shown below.
// ...
reg rx_valid;
// ...
reg [4:0] rx_cnt;
always @(posedge clk)
  if (tx_valid)
    rx_shift_reg <= {tx_dout, rx_shift_reg[31:1]};
always @(posedge clk or posedge rst)
// ...
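Because the listings of Example 9.1 are only partly visible here, one possible complete form of the transmitter and receiver is sketched below. It reuses the signal names that do appear (load_en, tx_valid, tx_cnt, tx_dout, rx_valid, rx_cnt, rx_shift_reg); the module names, the port din, the register tx_shift_reg, and the exact cycle alignment are assumptions. The sketch also assumes that load_en is not asserted while a transfer is in progress.

```verilog
// Transmitter sketch: parallel load, then 32 cycles of LSB-first shifting.
module serial_tx_sketch (
  output            tx_dout,
  output reg        tx_valid,
  input      [31:0] din,       // hypothetical parallel data input
  input             load_en,
  input             clk,
  input             rst
);
  reg [31:0] tx_shift_reg;
  reg [4:0]  tx_cnt;

  assign tx_dout = tx_shift_reg[0];          // least significant bit first

  always @(posedge clk)
    if (load_en)       tx_shift_reg <= din;  // parallel load
    else if (tx_valid) tx_shift_reg <= {1'b0, tx_shift_reg[31:1]};

  always @(posedge clk or posedge rst)
    if (rst)                  tx_valid <= 1'b0;
    else if (load_en)         tx_valid <= 1'b1;
    else if (tx_cnt == 5'd31) tx_valid <= 1'b0;  // 32nd bit is on the line

  always @(posedge clk or posedge rst)
    if (rst)           tx_cnt <= 5'd0;
    else if (tx_valid) tx_cnt <= tx_cnt + 1'b1;  // wraps from 31 to 0
endmodule

// Receiver sketch: shift in while tx_valid is high, strobe after 32 bits.
module serial_rx_sketch (
  output reg [31:0] rx_shift_reg,
  output reg        rx_valid,
  input             tx_dout,
  input             tx_valid,
  input             clk,
  input             rst
);
  reg [4:0] rx_cnt;

  always @(posedge clk)
    if (tx_valid)
      rx_shift_reg <= {tx_dout, rx_shift_reg[31:1]};

  always @(posedge clk or posedge rst)
    if (rst)           rx_cnt <= 5'd0;
    else if (tx_valid) rx_cnt <= rx_cnt + 1'b1;

  // One-cycle strobe in the cycle after the last bit has been shifted in.
  always @(posedge clk or posedge rst)
    if (rst) rx_valid <= 1'b0;
    else     rx_valid <= tx_valid && (rx_cnt == 5'd31);
endmodule
```

Shifting right on both sides means that the first transmitted bit, din[0], ends up in rx_shift_reg[0] after 32 shifts, so the received word has the same bit order as the transmitted one.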
data. It waits until the middle of each bit interval and samples the signal into the
receiving shift register. The receiver uses the stop bit to return to the idle state. The
clocks of transmitter and receiver might have slightly different frequencies, i.e., clock
drift, and are not related in phase. The clock drift does not matter provided that each
transmission does not last too long. Historically, computers have a component called
the universal asynchronous receiver/transmitter (UART) for serial communications
ports which were popular for the digital modem. The firmware can program the bit
rate and other transmission parameters.
The third scheme transmits the clock together with the data using Manchester
encoding. The Manchester line code represents a bit 0 with a transition from low to
high in the middle of the bit interval, and a bit 1 with a transition from high to low,
as shown in Figure 9.13. Since every bit has a transition in the middle of its interval,
which marks the sampling time of each bit, the need for complex
clock synchronization is avoided.
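The line code defined above can be sketched as an encoder that runs on a clock at twice the bit rate. The module name and all of its signals (line_out, bit_in, clk2x, second_half, bit_hold) are hypothetical, and the sketch assumes that bit_in presents a new data bit every two clk2x cycles.

```verilog
// Manchester encoder sketch with the convention of the text:
// bit 1 = high then low, bit 0 = low then high within one bit interval.
module manchester_enc_sketch (
  output reg line_out,   // encoded line signal
  input      bit_in,     // data bit, sampled at the start of each bit interval
  input      clk2x,      // clock at twice the bit rate
  input      rst
);
  reg second_half;       // 0: first half of the bit interval, 1: second half
  reg bit_hold;          // data bit captured for the current interval

  always @(posedge clk2x or posedge rst)
    if (rst) begin
      second_half <= 1'b0;
      bit_hold    <= 1'b0;
      line_out    <= 1'b0;
    end
    else begin
      second_half <= ~second_half;
      if (!second_half) begin
        bit_hold <= bit_in;
        line_out <= bit_in;      // first half: level equals the data bit
      end
      else
        line_out <= ~bit_hold;   // second half: mid-bit transition
    end
endmodule
```

Each data bit occupies two half-bit levels on the line, which is why the encoder needs a clock at twice the bit rate.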
Since the clock information of a transmitter is embedded in the line code, the
receiver must be able to recover the transmitted clock and data from the received
signal. The receiver employs the famous phase-locked-loop (PLL), which is an os-
cillator whose frequency and phase can be adjusted to line up with a reference clock
signal. For the synchronization purpose, the transmitter sends a continuous sequence
of encoded data with bit 1 before sending normal data. The PLL on the receiver side
locks onto the sequence of encoded 1 bits (indicated by the PLL_locked signal) to
give a reference clock that can be used to determine the bit intervals of transmitted
data, as shown in Figure 9.14.
The advantage of Manchester encoding is that it can save a separate clock wire.
The disadvantage is that the bandwidth of Manchester encoding is double that of
conventional NRZ encoding. Manchester encoding has been adopted in numerous
serial transmission standards, including the Ethernet standard.
9.4.1 POLLING
Polling of the embedded software is the simplest mechanism for I/O synchronization.
The embedded program uses a busy loop to repeatedly monitor the status input from
a controller to see if an event has occurred. If multiple events occur, the program will
sequentially process the events one at a time.
Example 9.2. In Figure 9.15, a factory automation system monitors two devices
using the processor with the instruction set in Table 9.1. The system has a
temperature sensor for the first device and a pressure sensor for the second
device. The sensor data are sampled by an I/O controller. The program reads the
temperature of the first device, represented as an 8-bit unsigned integer in °C, from an
input register at address 8 in the I/O controller. For the second device, the processor
monitors its pressure, represented as an 8-bit unsigned number in u(4.4) format, from
an input register at address 9 in the I/O controller. If the temperature of the first device
is higher than 60 °C or the pressure of the second device is larger than 1.5 atm, the
alarm bell is enabled by writing logic 1 to bit 0 of an output register at address 10
in the I/O controller; writing 0 disables it. Develop an embedded program to
monitor the inputs and activate the alarm bell when any abnormal condition arises.
Solution: The polling loop must repeatedly read the input registers even when there
are no abnormal events. The assembly code is shown below.
# port address definition
TEMP_REG equ 8
PRES_REG equ 9
ALAR_REG equ 10
Main:   inp  R0, TEMP_REG    # poll temperature
        cmpc R0, 60          # compare with 60
        jp   SetAlarm        # set alarm if
                             # larger than 60
        inp  R1, PRES_REG    # poll pressure
        cmpc R1, 24          # compare with 1.5
                             # for u(4.4) format
        jp   SetAlarm        # set alarm if
                             # larger than 1.5
        out  ALAR_REG, 0     # clear alarm
        jmp  Main            # loop
The subroutine, SetAlarm, shown below sets the alarm and jumps back to the main
loop instead of returning to the instruction after the jump. Doing so keeps the
alarm set until the abnormal condition has been removed.
Polling is so simple that no extra circuits are required except the input and output
registers of the I/O controller. However, the processor must execute continuously
even when there are no events to process. Moreover, if the program is busy processing
another event, it cannot respond immediately.
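The decision made in each pass of the polling loop can be sketched as a small behavioral model. This is only an illustration in Python, not the processor's assembly: the callbacks read_port and write_port are hypothetical stand-ins for the inp and out instructions, and to_u4_4 is an assumed helper that shows why 1.5 atm corresponds to the constant 24.

```python
TEMP_REG, PRES_REG, ALAR_REG = 8, 9, 10   # port addresses from Example 9.2
TEMP_LIMIT = 60                            # degrees C, 8-bit unsigned integer


def to_u4_4(value):
    """Encode a non-negative number as an 8-bit u(4.4) code (hypothetical helper)."""
    code = int(round(value * 16))          # 4 fractional bits: scale by 2**4
    if not 0 <= code <= 0xFF:
        raise ValueError("value out of u(4.4) range")
    return code


PRES_LIMIT = to_u4_4(1.5)                  # 1.5 atm in u(4.4) format


def poll_once(read_port, write_port):
    """One pass of the polling loop; returns the alarm bit that was written."""
    # Like the assembly, the pressure is only read if the temperature is normal.
    abnormal = read_port(TEMP_REG) > TEMP_LIMIT or read_port(PRES_REG) > PRES_LIMIT
    alarm = int(abnormal)
    write_port(ALAR_REG, alarm)
    return alarm
```

A real program would call poll_once in an endless loop, which is exactly the busy waiting that the text identifies as the cost of polling.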
9.4.2 INTERRUPTS
Interrupts are the most common mechanism for I/O synchronization. The processor
executes its normal tasks. When an event occurs, the corresponding I/O controller
interrupts the processor. The processor stops what it was doing, executes an
interrupt handler, and finally restores its internal state and resumes the
interrupted program. Some processors assign different priorities to different
controllers so that a higher-priority event can interrupt the service of a
lower-priority event, but not vice versa.
To implement the interrupt mechanism, the processor has the following features.
• The signal, int_req, is generated by the wired-AND function of the individual
controllers’ requests, which connect to the signal through open-drain or
open-collector drivers.
• The processor must be able to prevent interrupts while it is executing certain
non-interruptible sequences of instructions. Examples are instructions
that update information shared with an interrupt handler. If the processor
is halfway through updating such information, the interrupt handler will
Also, as shown in Figure 9.16, a stack memory, stk_mem, with 6 entries is
provided to store the program counter, status register, and other registers used in the
ISR. The stack memory adopts the last-in first-out policy and has only one pointer,
stk_ptr, which points to the next available entry of the stack, i.e., the write address
of the stack memory. The pointer automatically increments and decrements upon the
instructions push and pop, respectively. Initially, stk_ptr is 0.
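The single-pointer stack described above can be sketched as follows. This is a minimal Python model, not the book's Verilog: the class name StackMemory is hypothetical, and the overflow and underflow checks are assumptions added for illustration, since the text does not say how the hardware treats a full or empty stack.

```python
class StackMemory:
    """Behavioral model of stk_mem with a single write pointer, stk_ptr."""

    def __init__(self, depth=6):
        self.mem = [0] * depth
        self.stk_ptr = 0                 # next free entry (write address)

    def push(self, value):
        if self.stk_ptr == len(self.mem):
            raise OverflowError("stk_mem is full")      # assumed behavior
        self.mem[self.stk_ptr] = value
        self.stk_ptr += 1                # pointer increments on push

    def pop(self):
        if self.stk_ptr == 0:
            raise IndexError("stk_mem is empty")        # assumed behavior
        self.stk_ptr -= 1                # pointer decrements on pop
        return self.mem[self.stk_ptr]
```

Because stk_ptr is a write address, a pop first decrements the pointer and then reads the entry it now points to, which gives the last-in first-out order.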
The timing diagrams of the instructions in Table 9.3 are shown in Figure 9.17.
A new state, ST_INT, of the state machine indicates that an interrupt event has
happened, as signaled by int_req_g. As a result, the int_dis signal is set to
prevent nested interrupts, the PC goes to the ISR at address 1 of the ROM,
PC and Ri are respectively stored at the addresses stk_ptr and stk_ptr+1 of
stk_mem, and stk_ptr points to the next write address, stk_ptr+2. The ISR is
left by the instruction reti, which reverses the actions of an interrupt event.
Figure 9.17: Timing diagrams of instructions: (a) interrupt event occurs, (b) reti, (c)
disi, (d) eni, (e) push, and (f) pop.
The signal, int_req_g, is the original interrupt request, int_req, gated by two
interrupt disable signals, int_dis and int_dis1, as shown below. Either interrupt
disable signal can disable the interrupt. The signal int_dis is maintained by
the processor: it is set when an interrupt occurs, to prevent nested interrupts,
and cleared upon the instruction reti. By contrast, the signal int_dis1 is
maintained by the program: it is set upon the instruction disi and cleared
upon the instruction eni. The instructions disi and eni are used in pairs.
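The gating rule can be illustrated with a short behavioral model. This is a sketch in Python rather than the book's Verilog, and it assumes that all three signals are active-high; the class and method names are hypothetical.

```python
class InterruptGate:
    """Models int_req_g = int_req gated by int_dis and int_dis1 (active-high assumed)."""

    def __init__(self):
        self.int_dis = False     # maintained by the processor
        self.int_dis1 = False    # maintained by the program

    def int_req_g(self, int_req):
        # Either disable signal blocks the request.
        return bool(int_req) and not self.int_dis and not self.int_dis1

    def take_interrupt(self):    # processor enters the ISR
        self.int_dis = True

    def reti(self):              # return from the ISR
        self.int_dis = False

    def disi(self):              # program disables interrupts
        self.int_dis1 = True

    def eni(self):               # program enables interrupts again
        self.int_dis1 = False
```

Keeping two separate flags means that a reti at the end of an ISR cannot accidentally re-enable interrupts inside a critical section that the program protected with disi.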
output [7:0] dout;
output int_req;
input [2:0] col;
input [1:0] addr;
input [7:0] din;
reg [3:0] row_reg;
wire time_1ms;
col_reg_old [0:3];
reg [7:0] dout;
reg [1:0] state_ns, state_cs;
reg int_req;
// ...
  state_ns = state_cs;
  case (state_cs)
    ST_WAIT: state_ns = ST_DET;
    ST_DET:  state_ns = (time_1ms && col_old == col_rr) ?
                        ST_DEBD : ST_DET;
    ST_DEBD: state_ns =
      ((row_reg == 4'b1110 && col_reg != col_reg_old[0]) ||
       (row_reg == 4'b1101 && col_reg != col_reg_old[1]) ||
       (row_reg == 4'b1011 && col_reg != col_reg_old[2]) ||
       (row_reg == 4'b0111 && col_reg != col_reg_old[3])) ?
                        ST_INT : ST_WAIT;
    ST_INT:  state_ns = (cen && wen && addr == 0) ?
                        ST_WAIT : ST_INT;
  endcase
end
// ...
always @(*)
  case (char_reg)
    4'd0: led = {1'b1, 1'b1, 1'b1, 1'b0,
                 1'b0, 1'b0, 1'b1, 1'b0,
                 1'b1, 1'b1, 1'b0, 1'b0,
                 1'b0, 1'b1, 1'b1, 1'b1};
    ...
  endcase
always @(posedge clk_50mhz or posedge rst)
begin
Figure 9.18: Block diagram of the keypad and factory automation system.
INT_STS1 equ 0
ROW_REG  equ 1
COL_REG  equ 2
CHA_REG  equ 3
TEMP_REG equ 8
PRES_REG equ 9
ALAR_REG equ 10
INT_STS2 equ 11
        jmp  Main
ISR:    push R0              # save in stk_mem
        push R1              # save in stk_mem
        push R2              # save in stk_mem
        push R3              # save in stk_mem
ISR1:   inp  R0, INT_STS1    # read keypad int.
        cmpc R0, 1           # check keypad int.
        jne  ISR2            # set alarm if
        out  INT_STS1, 0     # clear int.
9.4.3 TIMER
An embedded system often needs to respond at periodic intervals based on a real-
time clock. The programmable timer generates an interrupt to the processor, and
then the interrupt handler performs any required periodic procedures.
Example 9.3. Develop the Verilog model of a real-time clock controller. The
controller derives a timer tick with a 20 µs period from a 50 MHz system clock.
The programmable timer is implemented using a down counter loadable from an
8-bit output register, called the counter value register. Writing to the counter value
register causes the counter to be loaded. After the down counter reaches 0, it reloads
the value from the counter value register and produces an interrupt. The controller has
an interrupt status register. Writing to the interrupt status register clears the
interrupt. The controller also has an interrupt mask register. When bit 0 of the register is
1, interrupts from the controller are masked, and when it is 0, they are enabled.
Solution: The Verilog model is displayed below. The interrupt status register,
interrupt mask register, and counter value register are placed at port numbers 16,
17, and 18, respectively.
clk_50mhz, rst);
output [7:0] dout;
output int_req;
input [4:0] addr;
input [7:0] din;
reg [9:0] base_cnt;
wire time_20us;
wire time_out;
reg [7:0] dout;
reg int_mask;
reg int_req_reg;
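The counting scheme of Example 9.3 can be checked with a cycle-level model. The following is a Python sketch, not the Verilog solution: the class TimerModel and its method names are hypothetical, the interrupt mask register is omitted, and the exact cycle in which the counter reloads is an assumption. A 20 µs tick at 50 MHz spans 1000 clock cycles, which fits the 10-bit base_cnt declared in the listing.

```python
CLK_HZ = 50_000_000
TICK_SECONDS = 20e-6
CYCLES_PER_TICK = round(CLK_HZ * TICK_SECONDS)   # clock cycles per 20 us tick


class TimerModel:
    """Base counter divides the clock; a down counter counts 20 us ticks."""

    def __init__(self):
        self.base_cnt = 0
        self.counter_value = 0   # counter value register (8 bits)
        self.down_cnt = 0
        self.int_req = False

    def write_counter_value(self, value):
        self.counter_value = value & 0xFF
        self.down_cnt = self.counter_value       # writing loads the counter

    def write_int_status(self):
        self.int_req = False                     # writing clears the interrupt

    def clock(self):
        """Advance the model by one 50 MHz clock cycle."""
        if self.base_cnt != CYCLES_PER_TICK - 1:
            self.base_cnt += 1
            return
        self.base_cnt = 0                        # a 20 us tick has elapsed
        if self.down_cnt == 0:
            self.down_cnt = self.counter_value   # reload and interrupt
            self.int_req = True
        else:
            self.down_cnt -= 1
```

Under these assumptions a counter value of n yields an interrupt every (n + 1) ticks, so a value of 99 would give a 2 ms period.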
9.5 ACCELERATORS
The embedded processor handles all tasks sequentially. However, many
time-consuming or critical tasks can be accelerated by customized hardware, which
also reduces the load on the processor. The key to acceleration is parallelism:
independent tasks can be performed in parallel.
A processor can benefit from instruction-level parallelism by performing the fetch,
decode, and execute stages concurrently. That is, with pipelining, while a new
instruction is being fetched, the preceding instruction can be decoded and the one
before that can be executed. A high-end processor might fetch, decode, and execute
several instructions at once using multiple decoding units and ALUs. Even so, their
lower cost still makes custom hardware accelerators an efficient solution for many
critical tasks, particularly for regularly structured data such as video. The
performance of accelerators is limited only by data dependencies and the
availability of data.
We can quantify the performance gain of an algorithm achieved by accelerating
the kernel, i.e., the critical part that is to be accelerated. Suppose a system takes time
t to execute the algorithm, and that a fraction f of that time is spent executing
the kernel. The code other than the kernel takes the remaining fraction, 1 − f. Hence,
the original execution time can be written as
$$t = f t + (1 - f)\,t. \tag{9.1}$$
If our accelerator speeds up execution of the kernel by a factor α, the total execution
time t′ with the accelerator reduces to
$$t' = \frac{f t}{\alpha} + (1 - f)\,t. \tag{9.2}$$
The overall speedup s is the ratio of the original execution time to the reduced one:
$$s = \frac{t}{t'} = \frac{f t + (1 - f)\,t}{\frac{f t}{\alpha} + (1 - f)\,t} = \frac{1}{\frac{f}{\alpha} + (1 - f)}. \tag{9.3}$$
This formula is also called Amdahl’s Law.
Example 9.4. Suppose we have two kernels in an algorithm. One takes 50% of the
execution time and another takes 10%. Using a hardware accelerator, we could speed
up execution of the first kernel by a factor of 2 or the second kernel by a factor of 5.
Which accelerator gives the best overall performance improvement?
Solution: The overall speedup s1 from the first kernel is
$$s_1 = \frac{1}{\frac{0.5}{2} + (1 - 0.5)} \approx 1.33.$$
Accelerating the second kernel gives an overall speedup s2 as
$$s_2 = \frac{1}{\frac{0.1}{5} + (1 - 0.1)} \approx 1.09.$$
Since s1 > s2, accelerating the first kernel is more efficient.
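The two results of Example 9.4 can be reproduced with a few lines of Python. This is only a numerical check of Equation (9.3); the function name amdahl_speedup is chosen here for illustration.

```python
def amdahl_speedup(f, alpha):
    """Overall speedup when a fraction f of the run time is accelerated by alpha."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / (f / alpha + (1.0 - f))


s1 = amdahl_speedup(0.5, 2)   # first kernel: 50% of the time, 2x faster
s2 = amdahl_speedup(0.1, 5)   # second kernel: 10% of the time, 5x faster
print(f"s1 = {s1:.2f}, s2 = {s2:.2f}")   # prints "s1 = 1.33, s2 = 1.09"
```

Letting α grow without bound in Equation (9.3) gives s → 1/(1 − f), so the second kernel could never yield more than about 1.11 overall, however fast its accelerator is.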
There are two major schemes for implementing parallelism. The first simply
duplicates components so that they operate on different, independent data. Compared
to a design without duplication, the speedup is ideally equal to the number
of replicated components.
The second technique breaks the overall task into a sequence of simpler stages
that operate concurrently as a pipeline, as shown in Figure 9.19, where a
three-stage pipelined structure is displayed. The execution time of the pipeline for a
given data item is approximately the same as that of a non-pipelined design.
However, if one data item can be supplied every clock cycle, the pipeline can
complete one data item every cycle. Thus, the speedup compared to the non-pipelined
version is ideally equal to the number of stages. This scheme is suitable for
applications that involve complex processing steps that can be broken into simpler
ones. Some applications contain independent complex tasks. We can then replicate
the pipeline to obtain the benefits of both the parallel and pipelining schemes.
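The claim that the pipeline speedup approaches the number of stages can be made concrete with a simple cycle count. The sketch below is in Python and rests on idealized assumptions that are not stated in the text: every stage takes exactly one clock cycle, a new item is available every cycle, and the non-pipelined design needs one cycle per stage for each item.

```python
def non_pipelined_cycles(items, stages):
    """Each item passes through all stages before the next one starts."""
    return items * stages


def pipelined_cycles(items, stages):
    """The first item needs `stages` cycles; each later item adds one cycle."""
    if items == 0:
        return 0
    return stages + (items - 1)


def pipeline_speedup(items, stages):
    return non_pipelined_cycles(items, stages) / pipelined_cycles(items, stages)
```

For the three-stage structure, a single item sees no gain (3 cycles either way), whereas 1000 items take 1002 cycles instead of 3000, a speedup just below 3.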
Moving data between memory and the accelerator by software is inefficient.
Instead, the accelerator typically contains a direct memory access (DMA) controller
that, under software control, transfers data to and from memory automatically in
hardware. The software simply configures the DMA through control registers in the
accelerator and then monitors the status register in the accelerator. If the DMA shares
the same bus with the processor for accessing the memory, an arbiter is required to
resolve access conflicts, as shown in Figure 9.20.
Example 9.5. We want to design an 8-point FFT accelerator. The N-point FFT is a
fast algorithm for the discrete Fourier transform expressed as
$$X(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \quad k = 0, 1, \ldots, N - 1,$$
where x(n) and X(k) are time- and frequency-domain data, respectively, $j = \sqrt{-1}$,
and $W_N = e^{-j2\pi/N}$. The structure of the 8-point FFT is shown in Figure 9.21. The
brute-force discrete Fourier transform algorithm above requires $N^2$ complex
multiplications, while the FFT algorithm has $\log_2 N$ stages, each of which needs
N/2 complex multiplications.
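The definition and the two multiplication counts can be tried out numerically. The following Python sketch evaluates the sum directly as written above; the function names are chosen here for illustration, and the FFT count assumes that N is a power of two.

```python
import cmath


def dft(x):
    """Direct evaluation of X(k) = sum over n of W_N^(n k) x(n)."""
    n_pts = len(x)
    w_n = cmath.exp(-2j * cmath.pi / n_pts)          # W_N
    return [sum(w_n ** (n * k) * x[n] for n in range(n_pts))
            for k in range(n_pts)]


def dft_multiplications(n_pts):
    """Complex multiplications of the brute-force algorithm."""
    return n_pts ** 2


def fft_multiplications(n_pts):
    """log2(N) stages with N/2 complex multiplications each (N a power of two)."""
    stages = n_pts.bit_length() - 1
    return (n_pts // 2) * stages
```

For N = 8, the counts are 64 for the direct sum and 12 for the FFT, which is the number of butterfly operations that the accelerator has to schedule per block.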
The interface of the FFT accelerator is shown in Figure 9.22. The input
data[31 : 0] includes both the real (xr) and imaginary (xi) parts of x, i.e.,
data[31 : 0] = {xr[7 : −8], xi[7 : −8]}, where xr and xi have the fixed-point
representation s(8.8). The output fft_di, i = 0, 1, ..., 7, also includes both the
real (Xr) and imaginary (Xi) parts of X, i.e., {Xr[7 : −8], Xi[7 : −8]}, where Xr and
Xi have the fixed-point representation s(8.8).
The interface timing diagram is displayed in Figure 9.23. The input data arrive
serially, while the output data are presented in parallel.
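The packing of one complex sample into the 32-bit input word can be illustrated as follows. This Python sketch assumes that s(8.8) is a 16-bit two's-complement number with 8 fractional bits, as suggested by the bit range [7 : −8]; the helper names are hypothetical, and rounding to the nearest code on conversion is an assumption made for the example.

```python
FRAC_BITS = 8            # s(8.8): 8 integer bits (with sign) and 8 fractional bits


def to_s8_8(value):
    """Encode a real number as a 16-bit two's-complement s(8.8) code."""
    code = int(round(value * (1 << FRAC_BITS)))
    if not -(1 << 15) <= code < (1 << 15):
        raise ValueError("value out of s(8.8) range")
    return code & 0xFFFF


def from_s8_8(code):
    """Decode a 16-bit s(8.8) code back to a real number."""
    code &= 0xFFFF
    if code & 0x8000:            # negative number in two's complement
        code -= 0x10000
    return code / (1 << FRAC_BITS)


def pack_sample(xr, xi):
    """data[31:0] = {xr[7:-8], xi[7:-8]}: real part in the upper half."""
    return (to_s8_8(xr) << 16) | to_s8_8(xi)


def unpack_sample(word):
    return from_s8_8(word >> 16), from_s8_8(word)
```

For example, the sample 1.5 − j0.25 becomes 0x0180 for the real part and 0xFFC0 for the imaginary part, i.e., the word 0x0180FFC0.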
Solution: The most fundamental unit of the FFT is the butterfly operation shown
in Figure 9.24, where (·)∗ denotes the complex conjugate operation. Each butterfly
takes two input data and produces two output data. The output data simply
overwrite the registers that store the corresponding input data. This is called
in-place FFT.
Our design goal is to process streaming input data using as small a circuit area
as possible. In each stage of the FFT, there are 4 butterfly operations. Consequently, the
total number of butterfly operations for a block in 3 stages is 4 × 3 = 12. A block of
FFT data takes 8 cycles to input, which leaves 8 cycles to complete the processing
of an FFT block. Therefore, during each cycle, we must complete ⌈12/8⌉ = 2 butterfly
operations, where ⌈·⌉ denotes the ceiling function.
The timing diagram of the circuit with 2 butterflies is presented in Figure 9.25.
As shown, the inputs of the two butterflies are scheduled in a regular pattern, which
simplifies the hardware design. We must store 2 blocks of FFT data because, when
the second block arrives, the first block is still in progress. Therefore, while one
buffer is used for the FFT operation, the other buffer receives input data. The
ping-pong buffers used to store 2 blocks (R[0]∼R[7] and R[8]∼R[15]) of FFT data can
overlap the I/O operation with the data processing operation.
The processing of each butterfly is regular and simple. However, if we use one state
machine to control both butterflies, the control unit becomes complicated because
there are too many states for the combined operations of the two butterflies. By
contrast, the processing of a block by each individual butterfly is simple. For
butterfly 0, when a new block comes in, it just waits 5 cycles, then processes
points (R[0],R[4]) (0 and 4 in Figure 9.25), (R[1],R[5]) (1 and 5 in Figure 9.25),
(R[0],R[2]) (0 and 2 in Figure 9.25), (R[1],R[3]) (1 and 3 in Figure 9.25), (R[0],R[1])
(0 and 1 in Figure 9.25), and finally (R[2],R[3]) (2 and 3 in Figure 9.25) sequentially
in a fixed pattern. For butterfly 1, when a new block comes in, it just waits 6 cycles,
then processes points (R[2],R[6]) (2 and 6 in Figure 9.25), (R[3],R[7]) (3 and
7 in Figure 9.25), (R[4],R[6]) (4 and 6 in Figure 9.25), (R[5],R[7]) (5 and 7 in Figure
9.25), (R[4],R[5]) (4 and 5 in Figure 9.25), and finally (R[6],R[7]) (6 and 7 in Figure
9.25) sequentially. Butterflies 0 and 1 execute in parallel, and the operation of
butterfly 1 on a block is one cycle later than that of butterfly 0.
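The twelve register pairs listed above can be exercised in software to see that they form a complete in-place 8-point FFT. The Python sketch below is a floating-point illustration, not the RTL. It assumes a decimation-in-frequency butterfly with c = a + b and d = (a − b)·W, where the form of d agrees with the tmp_dr and tmp_di expressions at the end of the RTL listing and the form of c is an assumption, and it does not model any conjugation drawn in Figure 9.24. The twiddle indices are the standard decimation-in-frequency ones, and the bit-reversed read-out is a property of that algorithm rather than something stated in the text.

```python
import cmath

N = 8
TWIDDLE = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]   # W_8^k

# (first register, second register, twiddle index), stage by stage.
SCHEDULE = [
    (0, 4, 0), (1, 5, 1), (2, 6, 2), (3, 7, 3),   # stage 1
    (0, 2, 0), (1, 3, 2), (4, 6, 0), (5, 7, 2),   # stage 2
    (0, 1, 0), (2, 3, 0), (4, 5, 0), (6, 7, 0),   # stage 3
]
BIT_REVERSED = [0, 4, 2, 6, 1, 5, 3, 7]


def butterfly(a, b, w):
    """Assumed decimation-in-frequency butterfly: c = a + b, d = (a - b) * w."""
    return a + b, (a - b) * w


def fft8_in_place(x):
    r = list(x)                                   # registers R[0]..R[7]
    for i, j, k in SCHEDULE:
        r[i], r[j] = butterfly(r[i], r[j], TWIDDLE[k])   # outputs overwrite inputs
    return [r[BIT_REVERSED[k]] for k in range(N)]        # X(k) in natural order
```

In the hardware the two butterflies interleave these operations in time, but each pair still has to wait for the results of the previous stage, which is why the order within the list matters.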
The datapath of the FFT accelerator is presented in Figure 9.26(a). To keep
things simple, the input data are stored in a ping-pong data buffer (R[0]∼R[7] and
R[8]∼R[15]) implemented using a FIFO with 16 entries, i.e., R[0]∼R[15]. In Figure
9.26(b), the FIFO, data_fifo, has one write pointer, data_wr_ptr, but one read pointer
for each butterfly, i.e., two read pointers, bf0_rd_ptr and bf1_rd_ptr, in total. Hence,
the processing of the two butterflies can be decoupled. The proposed architecture
allows the two butterflies to temporarily process different blocks.
The write pointer, data_wr_ptr, of the data FIFO advances when data_valid is true.
The read pointer, bf0_rd_ptr/bf1_rd_ptr, of the data FIFO for butterfly 0/1 advances
when a block has been completed by butterfly 0/1. Completion of a block by
butterfly 0 is indicated by bf0_cnt_sel == 11, and, for butterfly 1, by
bf1_cnt_sel == 12. Therefore, the two butterflies complete the same block at
different times, as shown in the timing diagram. While butterfly 0 is processing one
block, butterfly 1 can process another using the proposed architecture.
This makes the two butterflies appear to operate independently.
Also, we decouple the processing of the two butterflies using two additional butterfly
FIFOs, bf0_fifo and bf1_fifo, one for each butterfly. Each FIFO has 2 entries:
bf0_cnt[0] and bf0_cnt[1] for butterfly 0, and bf1_cnt[0] and bf1_cnt[1] for butterfly
1. One entry is dedicated to one block, so that the states of the two blocks handled
by a butterfly can be recorded separately and decoupled. Such a design makes the
pipelining of two neighboring blocks easier and more regular. Each entry in a butterfly
FIFO is simply an up counter that is enabled when a new block comes in, indicated by
blk_valid, and advances automatically until it reaches the value at which its
corresponding butterfly has completed its operation, i.e., 11 for butterfly 0 and 12
for butterfly 1.
Each butterfly FIFO has one read pointer and one write pointer. The write pointer,
bf_wr_ptr, is shared by bf0_fifo and bf1_fifo and advances when a new block
has arrived, indicated by blk_valid. The read pointers, bf0_rd_ptr and bf1_rd_ptr, of
the butterfly FIFOs are the same as those used for the data FIFO.
After the architecture design, we need to determine the bit width of every
variable. Both the real ($W_{N,r}^{n}$) and imaginary ($W_{N,i}^{n}$) parts of the twiddle
factor, $W_N^{n} = W_{N,r}^{n} + jW_{N,i}^{n}$, have the fixed-point representation s(2.16).
The outputs, $c = c_r + jc_i$ and $d = d_r + jd_i$, can be expressed using the inputs,
$a = a_r + ja_i$ and $b = b_r + jb_i$, and
Figure 9.26: Architecture of the circuit: (a) datapath and (b) control unit. The bf0_fifo
contains two entries: bf0_cnt[0] and bf0_cnt[1]. The bf1_fifo contains two entries:
bf1_cnt[0] and bf1_cnt[1].
To reduce the quantization error, the fractional part of the data registers, R[0]∼R[15],
in the data FIFO should be 16 bits. Hence, the bit widths of the datapath are
planned as in Figure 9.27, where the block Q quantizes its input by truncation.
Finally, the RTL code is listed below. It can be observed from the timing diagram
that only half of the next block has been input when the previous block is output.
Therefore, only 12 registers, i.e., R[0]∼R[11], are actually needed. Consequently, to
save data buffer space, the ping-pong buffer can be changed to a 12-entry FIFO.
The new architecture is left as a problem at the end of this chapter.
input data_valid;
// ...
reg fft_valid;
// ...
// Data queue
reg [3:0] data_wr_ptr;
// Butterfly queue
reg [3:0] bf0_cnt [0:1], bf1_cnt [0:1];
// ...
integer i;
// ...
assign fft_d0 = // ...
assign fft_d1 = // ...
assign fft_d2 = // ...
assign fft_d3 = // ...
assign fft_d4 = // ...
assign fft_d5 = // ...
// ...
assign fft_d7 = // ...
// ...
case (bf0_cnt_sel)
  4'd6: begin
    bf0_op1r = (bf0_rd_ptr) ? Rr[8]  : Rr[0];
    bf0_op1i = (bf0_rd_ptr) ? Ri[8]  : Ri[0];
    bf0_op2r = (bf0_rd_ptr) ? Rr[12] : Rr[4];
    bf0_op2i = (bf0_rd_ptr) ? Ri[12] : Ri[4];
    bf0_wr = wr0; bf0_wi = wi0;
  end
  4'd7: begin
    bf0_op1r = (bf0_rd_ptr) ? Rr[9]  : Rr[1];
    bf0_op1i = (bf0_rd_ptr) ? Ri[9]  : Ri[1];
    bf0_op2r = (bf0_rd_ptr) ? Rr[13] : Rr[5];
    bf0_op2i = (bf0_rd_ptr) ? Ri[13] : Ri[5];
    bf0_wr = wr1; bf0_wi = wi1;
  end
  4'd8: begin
    bf0_op1r = (bf0_rd_ptr) ? Rr[8]  : Rr[0];
    bf0_op1i = (bf0_rd_ptr) ? Ri[8]  : Ri[0];
    bf0_op2r = (bf0_rd_ptr) ? Rr[10] : Rr[2];
    bf0_op2i = (bf0_rd_ptr) ? Ri[10] : Ri[2];
    bf0_wr = wr0; bf0_wi = wi0;
  end
  4'd9: begin
    bf0_op1r = (bf0_rd_ptr) ? Rr[9]  : Rr[1];
    bf0_op1i = (bf0_rd_ptr) ? Ri[9]  : Ri[1];
    bf0_op2r = (bf0_rd_ptr) ? Rr[11] : Rr[3];
    bf0_op2i = (bf0_rd_ptr) ? Ri[11] : Ri[3];
    bf0_wr = wr2; bf0_wi = wi2;
  end
  4'd10: begin
    bf0_op1r = (bf0_rd_ptr) ? Rr[8] : Rr[0];
    bf0_op1i = (bf0_rd_ptr) ? Ri[8] : Ri[0];
    bf0_op2r = (bf0_rd_ptr) ? Rr[9] : Rr[1];
    bf0_op2i = (bf0_rd_ptr) ? Ri[9] : Ri[1];
    bf0_wr = wr0; bf0_wi = wi0;
  end
  4'd11: begin
// ...
// *************************************************
// ...
  tmp_dr = (ar - br) * wr + (bi - ai) * wi;
  tmp_di = (ar - br) * wi + (ai - bi) * wr;
end
endmodule
PROBLEMS
1. A 16-segment LED display, shown in Figure 9.28, can display alphabetic and
numeric characters. Extend the keypad I/O controller to display the pushed digit
on the LED display. You can add an additional register, char_reg[3 : 0], and a
port, led[15 : 0], in the I/O controller. The register is accessed through the port
number 3. The register char_reg[3 : 0] is then decoded to the signal, led[15 : 0],
to drive each segment of the LED.
2. Design the I/O controller for the factory automation system in Section 9.4.1.
3. Design the processor with interrupt using the instruction set in Section 9.4.2.
4. Write the assembly code for the keypad controller, alarm controller, and
real-time clock controller. The period of the real-time clock is 2 ms. The main
program starts by initializing the controllers and interrupts. The interrupt handler
is located at address 1 of the instruction memory. On responding to an interrupt,
it checks the interrupt status register of each I/O controller to determine the
interrupt source, starting with the real-time clock controller. The handler then
proceeds to check for other interrupt sources before returning to the interrupted
program. The ISR of the real-time clock simply sets a flag, rtc_flag, which is
then used by the processor for further real-time processing. After processing,
the processor clears the flag rtc_flag.
5. The original butterfly of the FFT accelerator requires 4 real multipliers. Redesign
the butterfly using 3 real multipliers and evaluate the gate count that can be saved.
6. Change the output of the 16-point FFT to be serial, which is suitable for an SRAM
interface. Redesign the FFT accelerator using the minimum number of data
registers.
7. Change the data buffer of the 8-point FFT from the ping-pong buffer with a total
of 16 entries to a 12-entry FIFO.
8. Integrate the FFT accelerator in the previous problem with the simple processor
with interrupts in Section 9.4.2. The processor and the accelerator share the same
256 × 8 data memory as that shown in Figure 9.20. The arbiter gives the highest
priority to the processor. The accelerator has two output registers for the FFT
processing: an interrupt status register and a command register. The interrupt
status register has a done-field at bit 0, and the other bits are reserved. The
command register has a start-field at bit 0 and a block number-field at bits 7-5. Every
time-domain FFT block needs 16 × 32 bits, i.e., 64 Bytes. Hence, the maximum
block number in a 256-Byte data memory is 4. After the processor has prepared
the time-domain data for the accelerator, the processor programs the accelerator
via the command register through the start-field and block number-field.
When the start-field is set, the DMA in the accelerator reads data and the FFT starts
processing the block specified in the block number-field. When the accelerator
is done, the done-field of the interrupt status register is set, which then
interrupts the processor. The interrupt is cleared by writing the interrupt
status register. After that, the accelerator can wait for the next FFT calculation.
a. Redesign the timing diagram so that a block can be processed earlier and the
number of registers can be reduced.
b. Redesign the FFT accelerator.
c. Integrate the processor, FFT accelerator, and the data memory.
d. Write the assembly code to verify your design.
10 Logic Synthesis with Design Compiler
An RTL design described with high-level constructs, such as always blocks and
continuous assignments, allows us to design according to functionality without
worrying too much about the implementation. In an early stage, you only need to
describe the circuit function you want, without having to worry about how you are
going to implement it. The details of implementing the circuit function, i.e., the
logic gates and their interconnections, are determined later by logic synthesis.
This does not mean that we can arbitrarily use all the constructs of an HDL in our
designs. Many Verilog features are only suitable for high-level behavioral modeling
in testbenches and cannot be synthesized into an equivalent gate-level circuit.
Consequently, RTL models must be written using a subset of Verilog constructs,
and only code that follows a template structure can be inferred as its corresponding
hardware. For example, the synthesis tool Synopsys DC requires that
synchronous registers be expressed using always blocks with either a positive or
negative edge-triggered clock signal. This chapter assumes that you have a synthesis
tool and cell library at hand.
First, this chapter emphasizes design guidelines for synthesis. Next, the steps of
the synthesis methodology for timing, area, and power optimization are introduced:
reading the design, describing the design environment, constraining the design,
compiling the design, reporting and analyzing the design, and saving the design. In
particular, the synthesis commands used for dynamic and static power optimization
are presented. Finally, synthesis skills for solving setup time violations, hold time
violations, multiple port nets, large for loops, and naming rules are exemplified.
(+, −, *, /, %), relational (>, <, >=, <=), equality (==, !=), logical shift (<<,
>>), arithmetic shift (<<<, >>>), concatenation ({}), replication ({{}}),
conditional (?:) operators, etc.
• Design (or reference): a circuit that performs one or more logical functions.
• Cell: an instantiation of a design.
• Port: the input, output, or inout of a design.
• Pin: the input, output, or inout of a cell.
• Net: the wire that connects ports or pins.
• Clock: waveform applied to a port or pin identified as a clock source.
  y <= (x3 * h3 + x2 * h2) + (x1 * h1 + x0 * h0);
endmodule
read_file -format verilog design.v
read_file -format verilog design1.v design2.v ...
DC now supports the AutoRead as follows, where design is your top module name
and your_rtl_dir is the directory containing your design files.
read_file -autoread -top design -recursive \
    {your_rtl_dir}
Alternatively, instead of directly reading the design, you can analyze it, which
checks the Verilog for syntax and synthesizability. Subsequently, you can elaborate
the design to bring it into DC memory using GTECH components. For example, you
can analyze and then elaborate your design as follows.
analyze -format verilog design.v
elaborate design -architecture verilog
If there are multiple instances referencing the same design, you must enable
DC to distinguish among them. Each physical instance must be unique, with its own
copy of the design, even if the instances have the same function. The following
command makes multiple instances that reference the same design distinct for synthesis.
uniquify
If your input is driven by an input pad consisting of PDIDGZ in the TSMC 0.18
µm process, you can set the input driving strength as follows. In this command, pin
C of the input pad, PDIDGZ, in the I/O pad library, tpz973gvwc, drives the input port
named your_input_port_name.
set_driving_cell -library tpz973gvwc -lib_cell PDIDGZ \
    -pin {C} [get_ports your_input_port_name]
You can also assign the same driving cell to all primary input ports using
[all_inputs] as follows.
set_driving_cell -library tpz973gvwc -lib_cell PDIDGZ \
    -pin {C} [all_inputs]
There are three paths in the design illustrated in Figure 10.4: FF1 to FF2, FF2 to
FF3, and FF3 to FF4. Provided that the clock is perfectly synchronous, all signals
must be stable before the rising edge of the clock. The path delay from
FF1 to FF2 is tCQ + M + N = 1 + 4 + 6 = 11 ns. Considering the setup time,
tS = 0.5 ns, the clock period for FF1 to FF2 must be at least 11 + 0.5 = 11.5 ns.
From FF2 to FF3, the path delay is tCQ + X = 1 + 11 = 12 ns. Considering the
setup time, the clock period for FF2 to FF3 must be at least 12 + 0.5 = 12.5 ns.
From FF3 to FF4, the path delay is tCQ + S + T = 1 + 3 + 7 = 11 ns. Considering
the setup time, the clock period for FF3 to FF4 must be at least 11 + 0.5 = 11.5
ns. Taking all three paths into consideration, the minimum clock period is 12.5
ns.
In the above design, to correctly constrain the paths from FF1 to FF2 and FF3 to
FF4, you can specify the input and output max delays for the setup time as follows.
set_input_delay -clock clk -max 5 [get_ports in]
set_output_delay -clock clk -max 7.5 [get_ports out]
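The arithmetic behind the minimum clock period and the two delay values can be reproduced with a short calculation. This is a Python sketch for checking the numbers only. The dictionary layout is chosen for illustration, and the split of the boundary paths is an assumption: FF1 with block M is taken to lie outside the synthesized design on the input side, and block T with FF4 outside it on the output side, which is the reading under which the values 5 and 7.5 in the commands come out.

```python
T_CQ = 1.0       # clock-to-Q delay of a flip-flop, ns
T_SETUP = 0.5    # setup time of a flip-flop, ns

# Combinational delays on each register-to-register path, ns.
LOGIC_DELAYS = {
    "FF1->FF2": {"M": 4.0, "N": 6.0},
    "FF2->FF3": {"X": 11.0},
    "FF3->FF4": {"S": 3.0, "T": 7.0},
}


def min_period(path):
    """Smallest clock period for which the path meets setup."""
    return T_CQ + sum(LOGIC_DELAYS[path].values()) + T_SETUP


periods = {path: min_period(path) for path in LOGIC_DELAYS}
critical_path = max(periods, key=periods.get)
min_clock_period = periods[critical_path]

# Assumed boundary split (see the note above).
input_delay_max = T_CQ + LOGIC_DELAYS["FF1->FF2"]["M"]       # external launch side
output_delay_max = LOGIC_DELAYS["FF3->FF4"]["T"] + T_SETUP   # external capture side
```

With these numbers the critical path is FF2 to FF3 at 12.5 ns, and the derived input and output delays are 5 ns and 7.5 ns.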
In addition to the setup time, the hold time requirements must also be satisfied.
For the design in Figure 10.5, to correctly constrain the paths from FF1 and FF2 to
FF3, you can specify the input max and min delays for the setup time and hold
time, respectively, as follows.
set_input_delay -clock clk -max 8.4 [get_ports in]
set_input_delay -clock clk -min 4.4 [get_ports in]
Figure 10.5: Max and min input delays of the design. The max input delay is usually
used for the setup time check, and the min input delay is used for the hold time check.
The wiring of a design also contributes to the load on the outputs. To configure
the wire load, you can specify the wire load model using the following command. In
the command, the wire load model tsmc18_wl10 in the slow.db library is chosen.
Note that, in the synthesis stage, the wire load is estimated from the area of your
design, which is not very accurate. An accurate wire load is unique to every
wire and can only be obtained after the placement and routing of your design have
been done.
There are three wire load modes: top, segmented, and enclosed. These are
specified depending upon the area of your design. As shown in Figure 10.6, there is a
wire separated into three parts, a, b, and c. The wire load models of each module,
chosen according to their circuit areas, are also displayed.
The top mode specifies that wires use the wire load model of the top-level design.
For example, if the wire load mode is top, the wire load models of a, b, and c will
all follow the 12x12 model of the top-level design. The segmented mode specifies that
each wire segment uses the wire load model of the block that encloses it. For example,
if the wire load mode is segmented, the wire load models of a, b, and c will be the
7x7 (of module2), 9x9 (of module1), and 3x3 (of module3) models, respectively. The
enclosed mode specifies that wires use the wire load model of the block that encloses
all of them. For example, if the wire load mode is enclosed, the wire load models of
a, b, and c will all be 9x9 (of module1). Consequently, the top wire load mode is
the most conservative and stringent, whereas the segmented wire load mode is
the most aggressive and least strict.
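The selection rule of the three modes can be expressed compactly. The Python sketch below is an illustration of the rule, not of how DC implements it. The hierarchy (module1 inside the top-level design, with module2 and module3 inside module1) is an assumption inferred from the results quoted above, since Figure 10.6 is not reproduced here, and all names in the code are hypothetical.

```python
# Assumed hierarchy: top > module1 > {module2, module3}.
PARENT = {"top": None, "module1": "top", "module2": "module1", "module3": "module1"}
WIRE_LOAD_MODEL = {"top": "12x12", "module1": "9x9", "module2": "7x7", "module3": "3x3"}
SEGMENT_BLOCK = {"a": "module2", "b": "module1", "c": "module3"}   # block of each part


def ancestors(block):
    """The block itself followed by its enclosing blocks up to the top level."""
    chain = []
    while block is not None:
        chain.append(block)
        block = PARENT[block]
    return chain


def smallest_enclosing_block(blocks):
    """Lowest block in the hierarchy that contains every given block."""
    for candidate in ancestors(blocks[0]):
        if all(candidate in ancestors(other) for other in blocks):
            return candidate
    raise ValueError("blocks share no common ancestor")


def wire_load_models(mode):
    """Wire load model used for each wire segment under the given mode."""
    if mode == "top":
        chosen = {seg: "top" for seg in SEGMENT_BLOCK}
    elif mode == "segmented":
        chosen = dict(SEGMENT_BLOCK)
    elif mode == "enclosed":
        enclosing = smallest_enclosing_block(list(SEGMENT_BLOCK.values()))
        chosen = {seg: enclosing for seg in SEGMENT_BLOCK}
    else:
        raise ValueError(f"unknown wire load mode: {mode}")
    return {seg: WIRE_LOAD_MODEL[block] for seg, block in chosen.items()}
```

Calling wire_load_models with "top", "segmented", and "enclosed" returns the 12x12 model for all parts, the 7x7, 9x9, and 3x3 models, and the 9x9 model for all parts, respectively, matching the three cases in the text.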
Figure 10.7: There is one external clock, ext_clk, and one generated clock, int_clk in
the clock report.
High fanout nets should be identified and reported in the following manner. The
capacitance of the high fanout nets designated as ideal networks will be shown to be
0.
report_net_fanout -high_fanout
After reading your designs, the check_design command reports error and warning
messages that are important and should be carefully analysed. For example, if you
instantiate a module with more ports than its definition, an error message is reported,
while a warning is reported if a port is not connected to any nets. After setting con-
straints, check_timing should be used to verify that there are no unconstrained paths.
Clock gating (introduced later) should be reported as follows.
report_timing -delay max
An example of the setup time report is shown in Figure 10.9. By default, only one
maximum or minimum delay path (depending on the operating condition) of the de-
sign is displayed. The option “-max_paths” is used to show more paths. In this report,
the start point is the enable signal and the end point is timer/time_1ms_reg[0]/D. To
analyze another path, specify the starting point using the option “-from” and the ending
point using the option “-to”. The column specified by “Incr” represents the incre-
mental delay of the combined net and cell delays of each component. The column
specified by “Path” represents the total path delay from the start point to the output
of a specific component. The letter, r or f, behind the path delay indicates rising or
falling transition for the output signal of a component. The SDF file can be referred
to for individual net and cell delays.
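The options above can be combined as in the following sketch. The endpoints are the ones named in the report discussed here, with enable assumed to be an input port; the path count of 5 is an arbitrary illustration. Because square brackets trigger command substitution in Tcl, the pin name is wrapped in braces.

```tcl
# Show the five worst setup (maximum-delay) paths instead of only one
report_timing -delay max -max_paths 5

# Restrict the report to one start point and one end point
report_timing -delay max \
    -from [get_ports enable] \
    -to   [get_pins {timer/time_1ms_reg[0]/D}]
```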
In this report, the start point, enable signal, has an input delay of 1 ns. The end
point (D input of flip-flop time_1ms_reg[0] or time_1ms_reg[0]/D) arrives at 4.6145
ns after the clock rising edge. The clock network (introduced later) has a period of 4
ns, a latency of 1 ns, and a clock uncertainty of 0.1 ns. According to the setup time
requirement of time_1ms_reg[0], the data time_1ms_reg[0]/D is required to arrive
at 4.8180 ns. Since the data arrives earlier than required, the timing constraint is
satisfied and the slack, 4.8180 − 4.6145 = 0.2035 ns, is positive (met).
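As a rough cross-check of the report, the required time can be decomposed from the clock parameters quoted above. This is only a sketch: it assumes the capturing edge is the next rising edge and that no other adjustments apply. The library setup time $t_S$ is not stated in the text; the value below is merely what the quoted numbers imply.

```latex
\begin{align*}
t_{\text{required}} &= T + t_{\text{latency}} - t_{\text{uncertainty}} - t_S \\
4.8180 &= 4 + 1 - 0.1 - t_S \quad\Rightarrow\quad t_S = 0.0820~\text{ns (inferred)} \\
\text{slack} &= t_{\text{required}} - t_{\text{arrival}} = 4.8180 - 4.6145 = 0.2035~\text{ns}
\end{align*}
```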
An example of a hold time report is shown in Figure 10.10. The hold time
can be checked with the minimum delay in the best-case operating condition
using the fast.db library. The report shows a register (time_1ms_reg[1]/CK) to
register (time_1ms_reg[2]/D) path. The end point (time_1ms_reg[2]/D) arrives at
1.1319 ns after the clock rising edge. According to the hold time requirement of
time_1ms_reg[2], the data time_1ms_reg[2]/D is required to arrive at 1.1016 ns.
Since the data arrives later than required, the timing constraint is satisfied and the
slack, 1.1319 − 1.1016 = 0.0303 ns, is positive (met).
The following command will generate a timing report that reports only paths
which have setup-time violations. To report only paths with hold-time violations,
change the option to “-min”.
report_constraints -all_viol -max -verbose
report_area -hier
In the report shown in Figure 10.11, only the total cell area needs to be considered.
The net interconnect area will not be a problem because it depends on the wire load
model, which cannot yet be determined.
The chip area depends on the semiconductor process. Therefore, it is usually unfair
to compare the chip areas of two designs fabricated using different processes.
By contrast, the gate count is independent of the semiconductor process and
usually gives a good impression of the size of the circuit. The gate count is
roughly the equivalent number of 2-input NAND (NAND2X1) gates, and can be
calculated by
The NAND2X1 area can be looked up in the document of the cell library.
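Consistent with the description above, the gate count is commonly taken as the ratio of the total cell area to the area of one NAND2X1 cell. The symbol names and the numbers in the example are chosen here for illustration and are not from the text.

```latex
\[
\text{gate count} \approx \frac{A_{\text{cell,total}}}{A_{\text{NAND2X1}}},
\qquad \text{e.g.}\quad
\frac{50\,000~\mu\text{m}^2}{10~\mu\text{m}^2} = 5\,000~\text{gates}.
\]
```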
report_power -hier
write -format verilog -hierarchy -output design.vg
The synthesis constraints can be verified using the pre-sim. You can write the SDF
of your design in “design.sdf” for pre-sim as follows.
write_sdf -version 2.1 -context verilog design.sdf
must be respectively used as displayed below, where test.v, chip.vg, and library.v
are testbench, gate-level netlist of design, and cell library model, respectively. The
option, -R, enables the VCS to run the executable file immediately after VCS links
together the executable file, and the option, +v2k, enables new Verilog features in
the IEEE 1364-2001 standard.
The design constraints of the design can also be written into a script file, i.e.,
Synopsys design constraints (SDC) file for layout tool or STA, as follows.
write_sdc design.sdc
create_clock -period 10 [get_ports clk]
The following command defines a clock, clk, with a period of 10 ns and duty cycle
of 40%.
create_clock -period 10 -waveform {0 4} \
    [get_ports clk]
During the synthesis stage, the clock network must be treated as ideal, without
requiring the insertion of any buffers to reduce the load of the high-fanout clock net.
High-fanout networks, such as clock and reset signals, are handled by inserting buffers
during the layout stage. The ideal network is indicated to the synthesizer via the
set_ideal_network command as follows.
Additionally, you must tell the synthesizer that buffers are not to be inserted into the
clock network as follows.
It is best to set the drive strength of the clock network to infinity (a drive resistance
of 0) to get an ideal clock delay as follows.
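The three clock-network settings described above can be sketched together as follows, assuming the clock enters the design on a port named clk. The same commands appear in the complete script at the end of this chapter.

```tcl
# Treat the clock net as ideal despite its high fanout
set_ideal_network [get_ports clk]
# Keep the synthesizer from inserting buffers into the clock network
set_dont_touch_network [get_ports clk]
# Zero drive resistance, i.e. infinite drive strength, on the clock port
set_drive 0 [get_ports clk]
```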
The clock latency in Figure 10.13 can be described below. Notably, the named
clock, ori_clk, is a virtual clock that has been defined but not associated with any
pin/port. The create_generated_clock specifies that clk1 and clk2 be generated using
a virtual clock, ori_clk, but clk1 and clk2 have the same frequency as ori_clk. In this
example, clk1 and clk2 have source latencies of 2 and 0.5 ns, respectively, and they
have clock tree delays of 1 and 1.5 ns, respectively.
create_clock -name ori_clk -period 10
create_generated_clock -source ori_clk -divide_by 1 \
    [get_ports clk1]
create_generated_clock -source ori_clk -divide_by 1 \
    [get_ports clk2]
set_clock_latency -source 2 [get_ports clk1]
set_clock_latency -source 0.5 [get_ports clk2]
set_clock_latency 1 [get_ports clk1]
set_clock_latency 1.5 [get_ports clk2]
set_input_transition 0.5 [get_ports clk]
set_clock_transition 0.1 [get_ports clk]
Hence, considering the clock uncertainty, clock-to-Q delay, and the setup time
constraints of a flip-flop, the clock period requirement should be modified as follows.
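In the usual notation, the setup requirement can be sketched as below. The symbols are assumptions made here for illustration: $T$ is the clock period, $t_{CQ}$ the clock-to-Q delay, $t_{PD,\max}$ the longest combinational delay, $t_S$ the setup time, and $t_U$ the clock uncertainty.

```latex
\[
T \;\ge\; t_{CQ} + t_{PD,\max} + t_S + t_U
\]
```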
Likewise, considering the clock uncertainty and clock-to-Q delay, the hold time con-
straint must be modified as follows.
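In the usual notation, the hold requirement can be sketched as below. The symbols are assumptions made here for illustration: $t_{CQ}$ is the clock-to-Q delay, $t_{PD,\min}$ the shortest combinational delay, $t_H$ the hold time, and $t_U$ the clock uncertainty.

```latex
\[
t_{CQ} + t_{PD,\min} \;\ge\; t_H + t_U
\]
```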
create_clock -period 10 -waveform {0 5} [get_ports clk]
set_clock_latency -source 2 [get_ports clk]
set_clock_latency 3 [get_ports clk]
set_clock_uncertainty 1 [get_ports clk]
Figure 10.18: Specified clock waveform for the setup time constraint.
Figure 10.19: Specified clock waveform for the hold time constraint.
The clock waveform is displayed in Figure 10.18. The worst case for the setup
time is that the capturing clock edge arrives early by the amount of the clock
uncertainty, i.e., 1 ns. If the setup time requirement is 0.5 ns, the D input of the
capturing FF must arrive before 13.5 ns.
The worst case for the hold time is that the capturing clock edge arrives late by
the amount of the clock uncertainty, i.e., 1 ns. If the hold time requirement is 0.3 ns,
the D input of the capturing FF must arrive after 6.3 ns, as shown in Figure 10.19.
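The two bounds above follow from the constraints listed earlier (a period of 10 ns, a source latency of 2 ns, a network latency of 3 ns, and an uncertainty of 1 ns). As a sketch, with symbol names chosen here for illustration:

```latex
\begin{align*}
\text{Setup: } & t_{\text{capture}} = 10 + 2 + 3 = 15~\text{ns}, \\
               & t_{\text{D,latest}} = 15 - 1 - 0.5 = 13.5~\text{ns}. \\
\text{Hold: }  & t_{\text{capture}} = 0 + 2 + 3 = 5~\text{ns}, \\
               & t_{\text{D,earliest}} = 5 + 1 + 0.3 = 6.3~\text{ns}.
\end{align*}
```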
Figure 10.20: Impact of clock uncertainty. (a) The clock latency of the launching FF
is 0.3 ns shorter than that of the capturing FF. (b) The clock latency of the launching
FF is 0.3 ns longer than that of the capturing FF.
Nevertheless, the use of clock skew to solve the setup time problem requires special
care. It may impact the paths that start from the capturing FF. Moreover, the clock
skew should be carefully confirmed under all operation modes, such as the scan mode.
create_clock -period 10 [get_ports ext_clk]
create_generated_clock -source ext_clk -divide_by 2 \
    [get_pins CLK_GEN/u1/Q]
set_ideal_network [get_ports ext_clk]
set_ideal_network [get_pins CLK_GEN/u1/Q]
set_dont_touch_network [get_ports ext_clk]
set_dont_touch_network [get_pins CLK_GEN/u1/Q]
set_drive 0 [get_ports ext_clk]
set_drive 0 [get_pins CLK_GEN/u1/Q]
There are often multiple clock sources for a logic as well, as shown in Figure
10.22.
create_clock -period 10 [get_ports ext_clk]
create_generated_clock -source ext_clk -divide_by 2 \
    [get_pins CLK_GEN/u1/Q]
create_generated_clock -source ext_clk -divide_by 4 \
    [get_pins CLK_GEN/u2/Q]
create_generated_clock -source ext_clk -divide_by 8 \
    [get_pins CLK_GEN/u3/Q]
create_generated_clock -source ext_clk -divide_by 16 \
    [get_pins CLK_GEN/u4/Q]
To analyze the timing in this case, the most stringent clock is selected for synthesis
as follows.
Sometimes, we prefer to manually instantiate the multiplexer for the clock generator.
In this case, you can tell the DC not to touch the cells you instantiated as follows.
set_dont_touch [get_cells CLK_GEN/u5]
If some cells in the library are not preferred, for instance, the JK flip-flops, they
can be excluded from the synthesis by using the following command.
set_dont_use [get_lib_cells slow/JKFF*]
set_dont_use [get_lib_cells fast/JKFF*]
The paths between positive and negative edge clocks need not be specially indicated,
as shown in Figure 10.23. The DC can perform the STA correctly. Derived
negated clocks can be automatically identified by DC as well. You only have to describe
the clock port as follows. Even so, using both the positive and negative edges
of a clock source is still discouraged because the time available to a path between
opposite edges is only half of the clock period. Moreover, derived negative-edge
triggered FFs make chip testing difficult.
create_clock -period 10 [get_ports clk]
create_clock -period 10 [get_ports clk1]
create_clock -period 20 [get_ports clk2]
create_clock -period 25 [get_ports clk3]
Figure 10.25: Clock edges used to check the timing of launching and capturing FFs.
In Figure 10.24, the clocks are synchronous but have different frequencies. To check
the timing, the most stringent clock edges should be selected to guarantee that timing
constraints are met under all circumstances. For example, as displayed in Figure
10.25, from FF1 to FF2, the most critical timing is launched from the 2nd clock edge
of clk1 to the 2nd clock edge of clk2, i.e., 10 ns, instead of from the 1st clock edge
of clk1 to the 2nd clock edge of clk2, i.e., 20 ns. Likewise, the most critical timing
from FF2 to FF3 is 5 ns.
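The critical edge pairs quoted above can be checked by listing the rising edges. The sketch assumes that FF1, FF2, and FF3 are clocked by clk1, clk2, and clk3 (periods of 10, 20, and 25 ns), that all three clocks rise together at 0 ns, and that data launched on an edge is captured on the next strictly later edge of the capturing clock.

```latex
\begin{align*}
\text{clk1 rising edges: } & 0,\ 10,\ 20,\ 30,\ \dots \\
\text{clk2 rising edges: } & 0,\ 20,\ 40,\ 60,\ \dots \\
\text{clk3 rising edges: } & 0,\ 25,\ 50,\ 75,\ 100,\ \dots \\
\text{FF1} \to \text{FF2: } & 10 \to 20 \;\Rightarrow\; 10~\text{ns}
    \quad (\text{versus } 0 \to 20 = 20~\text{ns}) \\
\text{FF2} \to \text{FF3: } & 20 \to 25 \;\Rightarrow\; 5~\text{ns}
    \quad (\text{versus } 0 \to 25,\ 40 \to 50,\ 60 \to 75,\ 80 \to 100)
\end{align*}
```

The clk2/clk3 pattern repeats every 100 ns, the least common multiple of the two periods, so one such window is enough to find the tightest pair.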
When clocks are asynchronous, the simple or FIFO synchronizer introduced in
Chapter 8 must be used. However, the most stringent clock edges may be extremely
close, making it hard to satisfy the timing constraints. Fortunately, the paths across
different clock domains are false paths, which will be introduced later.
create_clock -name ori_clk -period 10
create_generated_clock -source ori_clk -divide_by 1 \
    [get_ports clk]
set_clock_latency 1 [get_clocks clk]
set_clock_uncertainty 0.3 [get_clocks clk]
set_input_delay -clock ori_clk -max 5 -min 4 \
    [get_ports in]
set_output_delay -clock ori_clk -max 7.5 -min 6 \
    [get_ports out]
After CTS, suppose that the real clock tree latency is found to be around 1.5 ns.
The constraints can now be modified by removing the estimated clock latency, skew,
and input and output delays. The set_propagated_clock uses the real clock skews of
FFs, so only the clock jitter in the clock uncertainty needs to be modeled. A new vir-
tual clock, ori_clk1, can be created to be referenced by the input and output signals.
A clock latency of 1.5 ns can be added to ori_clk1, so that the clock (ori_clk1) of the
input and output signals will be synchronous with that (clk) of the design. Conse-
quently, in addition to those constraints used before the CTS, additional constraints
can be specified as follows.
create_clock -name ori_clk1 -period 10
remove_clock_latency [get_clocks clk]
remove_clock_uncertainty [get_clocks clk]
remove_input_delay [get_ports in]
remove_output_delay [get_ports out]
set_clock_latency 1.5 ori_clk1
set_clock_uncertainty 0.1 [get_clocks clk]
set_propagated_clock [get_clocks clk]
set_input_delay -clock ori_clk1 -max 5 -min 4 \
    [get_ports in]
set_output_delay -clock ori_clk1 -max 7.5 -min 6 \
    [get_ports out]
set_max_delay 3 -from [all_inputs] -to [all_outputs]
Alternatively, as a second method, you can use a virtual clock together with input and
output delays to constrain a purely combinational circuit as follows.
create_clock -name clk -period 5
set_input_delay -clock clk -max 1 \
    [all_inputs]
set_output_delay -clock clk -max 1 \
    [all_outputs]
A false path is also a timing exception: the timing constraints along the path will
be ignored. For example, if the timing paths from A, through C, to OUT are false
paths in Figure 10.27, we can set the false paths as follows.
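A sketch of such a constraint is given below. It assumes that A and OUT are ports of the design; C/Y is a hypothetical pin name standing in for whichever pin of C the path passes through in Figure 10.27.

```tcl
# Ignore timing on every path that starts at A, passes through C,
# and ends at OUT (C/Y is a placeholder pin name, not from the text)
set_false_path -from [get_ports A] \
    -through [get_pins C/Y] \
    -to [get_ports OUT]
```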
The paths across different clock domains are commonly false paths, as shown in
Figure 10.28. The simple or FIFO synchronizer introduced in Chapter 8 guarantees
the functionality of signals crossing different clock domains.
We can set the false paths of all signals from clock domain CLKA to clock domain
CLKB as follows.
set_false_path -from [get_clocks CLKA] \
    -to [get_clocks CLKB]
If there are also signals from clock domain CLKB to clock domain CLKA, we can
set the false paths of all signals from clock domain CLKB to clock domain CLKA
as follows.
set_false_path -from [get_clocks CLKB] \
    -to [get_clocks CLKA]
In asynchronous interfaces, the interval between the clock edges of the launching and
capturing FFs can become so small that timing violations will eventually happen.
However, since the instability of signals crossing different clock domains has been
resolved by the synchronizer, these timing violations are false alarms. They are
suppressed by declaring the paths as false paths to the synthesis tool.
To eliminate false alarms during gate-level simulations with timing information
back annotated, the first FF (facing the asynchronous clock) in the synchronizer
should be replaced with a cell having the same original functionality but without
timing checks. The models of FFs without timing checks will need to be manually
crafted. Similarly, the cells referring to them in the gate-level netlist will have to be
manually replaced.
set_max_capacitance 1.2 [get_ports A/OUT]
set_max_transition 0.2 [get_ports A/OUT]
set_max_fanout 6 [get_ports A/OUT]
The optimization of timing, area, and power to meet their constraints is presented
below.
the number of operations per clock period, or, conversely, to minimize the clock pe-
riod per operation. Changes made in the architecture exploration stage of the design
flow will have the greatest impact on the performance, such as the application of
parallelism, which is only limited by the data dependency. Since parallelism requires
additional resources which take up area and consume power, increasing parallelism
is contrary to reducing area and power consumption. By contrast, the pipelining tech-
nique can usually obtain a good performance under the premise of moderate increase
of area and power consumption. Clearly, tradeoffs must be made between parallelism
and pipelining techniques.
We need to estimate the achievable clock frequency because it is part of the performance
analysis for candidate architectures. Alternatively, the clock frequency can be
specified in advance according to system requirements. Whichever way we adopt, the
clock period will be used as a design constraint for subsequent stages of the design
flow.
When we move in accordance with the design flow, the emphasis of the design
criteria will shift gradually from performance to timing. In a synchronous design, the
clock period constrains the propagation delay of the combinational circuit between
the registers. This includes paths from input ports through combinational logic to
register inputs, paths from register outputs to register inputs, paths from register out-
puts through combinational logic to output ports, and paths from input ports directly
through combinational logic to output ports, as shown in Figure 10.30.
Especially in cases in which different blocks are designed by different design-
ers, it is important to guarantee that the combined path from a register output in one
block to a register input in another block meets the clock period constraint. One way
to do this is to allocate a timing budget for each block by specifying the maximum
output delay of a path from a register output to its output port in one block, and the
maximum input delay of a path from an input port to a register input in another block.
Since it is sometimes difficult to accurately estimate the propagation delays of com-
binational circuits, it is a common practice to require that each block has registered
outputs. In a large high-speed design, where the wiring delay across different blocks
may be significant, it may also be appropriate to require that inputs are registered in
each block.
Optimizing and analyzing the timing of a design is typically performed using
static timing analysis. In Figure 10.30, there are 4 kinds of timing paths for the static
timing analysis. The static timing analysis estimates path delays using the timing
information of each cell specified in the technology library, together with simple wire
load models. A typical compiling command using the medium effort is displayed below.
compile -map_effort medium
At the synthesis stage, since the design has not been placed and routed, the de-
lays of cells and wires can only be estimated. However, using the estimates will be
sufficient to guide the timing optimization at this stage. To cope with possible mis-
matches between the estimated and real delays, the clock period constraint can be
conservatively configured to 90% of its target. The static timing analysis determines
whether the clock period constraint has been met and lets you identify the
critical paths in your design. If necessary, you can then modify your design to reduce
its critical path delay.
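The 90% margin mentioned above could be scripted as in the following sketch. The 10 ns target and the variable name target_period are assumptions chosen for illustration; the clock port is assumed to be named clk.

```tcl
# Over-constrain the clock to 90% of the target period
set target_period 10.0
set cycle [expr {0.9 * $target_period}]
create_clock -period $cycle [get_ports clk]
```

Keeping the target in a variable makes it easy to relax the margin once accurate post-layout delays are available.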
In the physical design stage, we can configure the aspect ratio and (area) utiliza-
tion of the design and choose the locations of hard macros and soft designs via floor-
planning. After floorplanning, the interconnect wires between blocks can be globally
routed. If there is no routing congestion, detailed placement and routing can be per-
formed. However, this process is very computationally intensive. When the physical
design has been established, real delay values of components and wires can be ex-
tracted. We can then repeat the static timing analysis using the accurate delays to
verify the timing constraints again.
If the synthesized netlist or physical design does not meet the timing constraints,
we can still fine-tune the timing using different synthesis or placement and routing
commands. However, if the constraints still cannot be met by the design, there may
be no choice but to revisit earlier stages of the design flow and choose different
architectures at higher levels of abstraction.
all factors, the final chip cost is approximately proportional to the square of the chip
area. No doubt, if possible, the smaller the chip area the better.
Similar to the timing of a design, choices made earlier in the design flow affect the
chip area most. At the synthesis stage, we can specify constraints on the area in the
synthesis tool, as shown below. When the delays of the design have been optimized,
cells with smaller area will replace those in non-timing critical paths.
set_max_area 0
compile -map_effort medium
At the physical design stage, the chip area can be optimized through intervention
in the floorplanning, and placement and routing of the circuit. However, only fine
tuning of the chip area will be possible. For a design with simple wiring complexity,
a high (area) utilization can be achieved, while, for a design with high wiring
complexity, a low (area) utilization is usually traded off for timing.
As Figure 10.31 shows, there are two main kinds of power consumption: dynamic
and static. In the power model of a CMOS buffer, the dynamic power consumption
includes the dynamic switching power used to charge the output load $C_L$ (by
$I_L$) and the internal load $C_{IL}$ (by $I_{IL}$) of a cell, and the short-circuit power
consumption (due to $I_{short}$) in a cell. The dynamic switching power used for
charging a load $C$ is characterized by
\begin{equation}
P_D = \frac{1}{2} \alpha f C V_{DD}^{2}, \tag{10.2}
\end{equation}
where $\alpha$ represents the toggle rate, $f$ is the clock frequency, $C$ is the capacitive load,
and $V_{DD}$ is the supply voltage. To reduce the dynamic power used to charge the loads,
one can reduce the switching activity, clock frequency, capacitive load, or the supply
voltage. The technique used to adapt to various operating requirements by adjusting
the clock frequency and supply voltage is called dynamic voltage and frequency
scaling (DVFS).
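As a worked illustration of the dynamic switching power, consider the following numbers. They are assumptions chosen for the example, not values from the text: a toggle rate of 0.1, a clock frequency of 100 MHz, a load of 10 fF, and a supply voltage of 1.8 V.

```latex
\begin{align*}
P_D &= \tfrac{1}{2} \times 0.1 \times (100 \times 10^{6}) \times (10 \times 10^{-15}) \times 1.8^{2}
     = 1.62 \times 10^{-7}~\text{W} \approx 0.162~\mu\text{W}, \\
P_D\big|_{V_{DD} = 0.9~\text{V}} &= \tfrac{1}{2} \times 0.1 \times (100 \times 10^{6}) \times (10 \times 10^{-15}) \times 0.9^{2}
     = 4.05 \times 10^{-8}~\text{W}.
\end{align*}
```

Halving the supply voltage cuts this component by a factor of four, which is why supply voltage scaling is so effective compared with scaling the frequency alone.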
When charging the output or internal loads, both the NMOS and PMOS transistors may
turn on for a short period of time, leading to a short-circuit current $I_{short}$.
Consequently, short-circuit power consumption occurs and, during the short-circuit
period, it is characterized by
\begin{equation}
P_S = \alpha f I_{short} V_{DD}. \tag{10.3}
\end{equation}
The static power consumption consists of the leakage power (due to $I_{leak}$), which is
consumed even when there is no signal transition. The leakage power is a complex function
(ignored here) of the supply voltage $V_{DD}$, the threshold voltage $V_t$ of transistors,
and the aspect ratio $W/L$.
Dynamic power optimization is achieved using clock gating in the RTL code,
gate-level optimization using the synthesis tool, and the multi-VDD multi-supply
(MVMS) library. Static power optimization reduces leakage power by using the
multi-Vt library, which includes cells with different threshold voltages. Through the use
of a multi-Vt library, low-Vt cells on critical paths can improve their timing, while
high-Vt cells on non-critical paths can save power. Hence, low-leakage and
high-performance designs can be achieved together with a multi-Vt library.
Statistically, clock gating can save from 20% to 40% of dynamic power, depend-
ing on your design, while gate-level optimization using the synthesis tool can save
2% to 6% of dynamic power. Reducing the leakage power using the synthesis tool
can save 20% to 80% of static power.
Figure 10.34: Schematics (a) without clock gating and (b) with clock gating.
    q2 <= enable ? a ^ b : q2;
always @(posedge clk)
  case (enable)
    1'b0: q3 <= q3; // Redundant
    1'b1: q3 <= a ^ b;
  endcase
The derived circuits with and without clock gating are displayed in Figure 10.34.
In addition to an AND gate, the clock gating uses a latch as well. In Figure 10.35,
the clock gating produces no glitches on the gated clock, gclk.
The script for automatic clock gating is displayed below.
insert_clock_gating
compile
compile -gated_clock
The clock gating can be manually inserted into the RTL code, although this is generally
discouraged. However, if it is determined that this is the preferable solution, the
following command should be used before compiling to synthesize the gated clock.
replace_clock_gates -global
set_dynamic_optimization true
compile -map_effort medium -area_effort low \
    -power_effort high
For the design in Figure 10.37 as another example, only one of the arithmetic
operations is required depending on the sel[1 : 0] signal at every clock cycle.
To stop data from feeding into the DW arithmetic components through operand
isolation, the DC can automatically insert activation logic, as the AND gates used
to isolate the operands of the adder shown in Figure 10.38. However, the operand
isolation might impact the logic depth of the combinational circuit.
The script for the operand isolation is displayed below. For weight=1 of the
set_operand_isolation_slack command, if the timing slack is 0.5 worse than before,
the operand isolation will be terminated. If weight=0, the tool can decide whether
the operand isolation will be performed.
set_dynamic_optimization true
compile -inc
create_power_model
read_saif -input design.saif -instance your_design
report_rtl_power
To reduce the synthesis time, multi-core synthesis is supported through the use of the
following command, where 4 CPU cores are available to be used for synthesis.
set_host_options -max_cores 4
We can also direct the tool to optimize logic across block boundaries.
compile -boundary_optimization
Figure 10.39: (a) Unconnected output ports are removed, (b) redundant inverters are
optimized, and (c) constants are propagated to reduce logic.
compile_ultra
Alternatively, the design hierarchy of a design can be broken by flattening the design.
The logic flattening allows the compiler to obtain a better optimized design.
ungroup -all -flatten
compile -incremental_mapping -map_effort high
The implementation shown in Figure 10.40 has several critical constraints: a clock
period of 10 ns, clock-to-Q delay, tCQ , of 1 ns, input delay of 1 ns, and setup time,
tS , of 1 ns as well. If the worst negative slack of your design is still too large, or
there are too many paths with negative slack, it is best to go back and redesign at the
architectural exploration stage.
During the synthesis stage, it is often best to apply the optimize_registers for
retiming of the registers, as shown in Figure 10.41.
The compile or compile_ultra command only optimizes the combinational logic
and does not change the location of registers. There are three commands for the
register retiming which can move registers: optimize_registers, pipeline_registers,
set_operating_conditions best
set_fix_hold
compile -inc -only_hold_time
Typically, buffers are inserted in those paths with the hold time violation where
the path delay of combinational logic is too small. If the hold time constraint is not
met, the circuit may fail to operate at any clock speed due to the hold time violation.
Hence, hold time violations must be solved in all situations.
In these cases, the written Verilog gate-level netlist may contain assign constructs
as follows.
Unfortunately, layout tools may not be able to handle assign statements in the Ver-
ilog netlist. To ensure that your netlist does not contain assign statements, you
can separate the multiple port nets during compilation as follows. In the com-
mand, the option “-all” fixes both feedthrough signals and constants and the option
“-buffer_constants” buffers constants instead of duplicating them.
set_fix_multiple_port_nets -all -buffer_constants \
    [get_designs *]
set hdlin_while_loop_iterations 8192
Unfortunately, though the iteration limit of a for loop can be extended, it still has
a hard maximum limit of 10000. If your design contains a for loop that iterates over
16384 shift registers, as shown below, the synthesis will eventually be interrupted with
the above error message again.
To get around this issue, you can separate one big for loop into several smaller for
loops as follows.
set_bus_inference_style {%s_%d_}
set_bus_naming_style {%s_%d_}
Then, you can apply new naming rules before writing out your netlist as follows.
change_names -hierarchy -rules script_of_your_rules
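Based on the threshold, the data, bin_data, of the binary image is determined by the
pixel data, gray_data[7 : 0], of the grayscale image as
\begin{equation}
\text{bin\_data} =
\begin{cases}
0, & \text{gray\_data} < \text{threshold} \\
1, & \text{gray\_data} \ge \text{threshold}
\end{cases}. \tag{10.4}
\end{equation}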
The interface timing diagram is displayed in Figure 10.46. There are 64 grayscale
pixel data, d[i], i = 0, 1, ..., 63, and 64 binary data, b[i], i = 0, 1, ..., 63. Along with
binary data, the threshold is valid for 64 cycles.
Since the threshold can be determined only after 64 pixel data have been received,
64 pixel data must be stored into a FIFO to decide the binary data. The streamline
FIFO does not need flow control. Read access of 64 pixel data from FIFO follows
write access of 64 pixel data, and hence, only one pointer, wrrd_ptr, is required and
it is shared for both read and write accesses. The FIFO is implemented below, where
the macro, CLOG2, is defined in Chapter 3.
reg gray_valid_r;
wire rd_stb;
// ...
  if (!rst_n)
    wrrd_ptr <= 0;
  else if (gray_valid | bin_valid)
    wrrd_ptr <= wrrd_ptr + 1'b1;
assign bin_valid = ~gray_valid & (rd_stb | ~(wrrd_ptr == 0));
The ATE algorithm is implemented below. Notably, the fraction part of thresh-
old is unconditionally truncated. The outputs, bin_valid, bin_data, and threshold
are combinational outputs. A pipeline stage can be inserted if registered outputs are
needed.
// ATE algorithm
reg [BITS-1:0] max, min;
wire [BITS:0] sum;
// ...
reg bin_data;
// Maximum pixel
// ...
// Binary data
always @(*)
// ...
### Create Clock ###
set cycle 10
create_clock -period $cycle [get_ports clk]
set_ideal_network [get_ports clk]
set_dont_touch_network [get_ports clk]
set_drive 0 [get_ports clk]
set_clock_uncertainty 1 [get_ports clk]
set_clock_latency 0 [get_ports clk]
set_fix_hold [get_ports clk]
### Design Environment ###
set_input_delay -clock [get_clocks] -max 4 \
    [remove_from_collection [all_inputs] [get_clocks]]
set_input_delay -clock [get_clocks] -min 2 \
    [remove_from_collection [all_inputs] [get_clocks]]
set_output_delay -clock [get_clocks] -max 4 \
    [all_outputs]
set_output_delay -clock [get_clocks] -min 2 \
    [all_outputs]
set_load [load_of "slow/INVX1/A"] [all_outputs]
set_driving_cell -library slow -lib_cell INVX1 \
    -pin {Y} \
    [remove_from_collection [all_inputs] [get_clocks]]
set_operating_conditions -min_library fast \
    -min fast -max_library slow -max slow
set_wire_load_model -name tsmc18_wl10 -library slow
### Compile Design ###
compile -map_effort medium
PROBLEMS
1. Identify design objects for the circuit in Figure 10.47.
a. Make a list of all the ports in the design.
b. Make a list of all the cells that have the letter “U” in their names.
c. Make a list of all the nets ending with “CLK”.
d. Make a list of all the “Q” pins in the design.
e. Make a list of all the references.
module decode_3_8 (E, X, Y);
  output [7:0] Y;
  input E;
  input [2:0] X;
  wire E1, G1, G2;
  not u0 (E1, X[2]);
  and u1 (G1, E, X[2]);
  and u2 (G2, E, E1);
  decoder_2_4 u0 (G1, X[1:0], Y[7:4]);
  decoder_2_4 u1 (G2, X[1:0], Y[3:0]);
endmodule
module decoder_2_4 (Y, E, X);
  output [3:0] Y;
  input E;
  input [1:0] X;
  assign Y = E ? 1'b1 << X : 4'h0;
endmodule
3. Synthesize the module, fir2, in Section 7.3.3 using the following steps.
a. Specify your constraint file with a clock frequency of 100 MHz and clock
uncertainty of 0.3 ns. The input delay constraint is 1 ns and the output delay
constraint is 1.5 ns.
b. Analyze your design and constraints by report_design, report_hierarchy, re-
port_port, report_clock, report_net_fanout, and check_timing.
c. Report timing.
d. Report area.
e. Report power.
f. Write Verilog gate-level netlist and SDF.
g. Run the dynamic timing analysis (pre-sim).
h. Use the VCD file to get a better power estimate of the gate-level design than
that of the RTL model.
4. Repeat Problem 3 for the module, fir2. However, the input and output delays
should reference the negative clock edge.
5. Repeat Problem 3 for the module, fir2. However, set the timing exception of false
path to those from the input coefficients of the FIR filter, h0 , h1 , h2 , and h3 , be-
cause they are constants and do not change during the operation.
6. Repeat Problem 3 for the module, fir2, but in this case the clock signal, clk, is
generated by the following clock generator, clk_gen. Suppose that the clock fre-
quencies of clk0, clk1, clk2, clk3 are 100, 200, 300, and 400 MHz, respectively.
a. Integrate modules, fir2 and clk_gen, into a module named chip.
b. Modify your script file to synthesize chip.
c. Run pre-sim to confirm that your design can operate at all clocks without any
timing violations.
// ...
input [1:0] sel;
always @(*)
  case (sel)
    2'b00: clk = clk0;
    2'b01: clk = clk1;
    2'b10: clk = clk2;
    2'b11: clk = clk3;
    default: clk = clk0;
  endcase
endmodule
7. Repeat Problem 3 for the module, fir2. However, the clock signal, clk, is gener-
ated by the following clock generator, clk_gen1. The clock, clk1, is a divide-by-2
generated clock from clk. Suppose that the clock frequency of clk0 is 100 MHz.
a. Integrate modules, fir2 and clk_gen1, into a module named chip.
b. Modify your script file to synthesize chip.
c. Run pre-sim to confirm that your design can operate under sel = 0, 1 without
any timing violations.
// ...
input sel;
input rst_n;
reg clk1;
// ...
  case (sel)
    1'b0: clk = clk0;
    1'b1: clk = clk1;
    default: clk = clk0;
  endcase
endmodule
8. Plot the timing diagrams of the circuits with and without clock gating in Figure
10.34. Verify the glitch-free clock gating in Figure 10.34(b).
9. Suppose a clocked synchronous design uses registers with a setup time of 1.2 ns
and a clock-to-Q delay of 0.6 ns. The clock has an uncertainty of 0.3 ns. Three
register-to-register paths in the combinational circuits have propagation delays of
2.6 ns, 1.9 ns, and 3.3 ns.
a. What is the maximum clock frequency at which the datapath can be operated?
b. If the path with a delay of 3.3 ns is optimized so that its delay is reduced to 2.3
ns, what is the maximum clock frequency for the optimized datapath?
10. Suppose a clocked synchronous design in Figure 10.48, in which registers have a
setup time of 200 ps and a clock-to-Q delay of 100 ps, has a timing constraint in
which the clock frequency is 800 MHz. The propagation delays through combi-
national elements in the datapath and control path are displayed in the figure. The
control path uses a Mealy FSM.
transitions from logic 0 to 1, its output will transition from logic 1 to 0 after the
propagation delay due to the load of the inverter.
Power dissipation is the power consumed by the gate that must be available from
the power supply. There are two main kinds of power consumption: dynamic and
static. The static power contains the leakage power. The dynamic power is consumed
due to signal transitions, including the dynamic switching power and transient short-
circuit power.
Spurious electrical signals can induce undesirable voltages on the connecting
wires due to the inductance between logic gates. These unwanted signals are re-
ferred to as noise. Noise margin is the maximum external noise added to an input of
a logic gate that does not cause an undesirable change in its output.
We briefly introduce several CMOS logic gates here. To understand their operation,
it helps to know that 1) an NMOS transistor conducts when its gate-to-source voltage
is positive (and larger than its threshold voltage), 2) a PMOS transistor conducts when its
gate-to-source voltage is negative (and smaller than its threshold voltage), and 3) either
an NMOS or a PMOS transistor is turned off if its gate-to-source voltage is zero.
• Inverter
The basic CMOS logic gate is an inverter or NOT gate, which consists of
one PMOS transistor and one NMOS transistor. When the input is low, the
gates of both the PMOS and the NMOS are at 0 V. The gate-to-source voltages of the NMOS,
VGSN, and the PMOS, VGSP, are 0 V and −VDD V, respectively. Hence, the NMOS
turns off while the PMOS turns on. In this situation, the output voltage
becomes VDD V. When the input is high, the reverse condition occurs, and
the output voltage is 0 V. The net result is the logical NOT function.
Figure A.2: NOT gate: (a) symbol, (b) description in Verilog, (c) truth table, and (d)
CMOS schematics.
• Buffer gate
The buffer is constructed using two back-to-back inverters. Its Boolean
equation is the inverse of a NOT gate. The net result is the logical buffer
function. A buffer can decrease the propagation delay of a logic gate when
the gate is driving a large capacitive load.
• NAND gate
The NAND gate consists of two NMOS transistors connected in series
between GND and the drain-output, and ensures that the drain-output is
only driven low (logical 0) when both gate inputs, A and B, are high (logical
1). The complementary parallel-connection of the two PMOS transistors
between VDD and drain-output ensures that the drain-output is driven high
(logical 1) when one or both gate inputs are low (logical 0). The net result
is the logical NAND function.
• AND gate
The AND gate is constructed using a NAND gate and a NOT gate in series.
Its Boolean equation is the inverse of a NAND gate.
• NOR gate
Figure A.3: Buffer gate: (a) symbol, (b) description in Verilog, (c) truth table, and
(d) CMOS schematics.
Figure A.4: NAND gate: (a) symbol, (b) description in Verilog, (c) truth table, and
(d) CMOS schematics.
• XNOR gate
The XNOR gate is composed of two parallel branches, each with two NMOS
transistors in series, between GND and the drain-output, which ensures that the
drain-output is driven low (logical 0) only when the gate inputs, A and B,
have opposite logical values. The PMOS transistors are connected in a
complementary fashion to the NMOS transistors. The net result is the
logical XNOR function.
• Transmission gate
A transmission gate enables bidirectional signal transmission. The
CMOS transmission gate works like a voltage-controlled switch. It consists
of one NMOS and one PMOS transistor with common source and drain
connections. The gates of the NMOS and PMOS transistors are controlled by
Ē (the complement of E) and E, respectively. The idea is that both transistors
are non-conducting when E = 1. Hence, the output Y is in the high-impedance
state. Conversely, both transistors are conducting when E = 0. In this
situation, the output Y is directly connected to input A. The transmission
gate is not a static CMOS gate because its output is not a restored logic
function of its input.
Figure A.5: AND gate: (a) symbol, (b) description in Verilog, (c) truth table, and (d)
CMOS schematics.
A question may arise: “Why are there two parallel paths through the NMOS and
PMOS?” First, parallel paths through the NMOS and PMOS can provide a larger
current for sourcing or sinking the output load of a transmission gate.
Second, as shown in Figure A.12(a), when E = 0, A = 1, and Y = 0, the
charge passes through the NMOS so that Y rises from 0 V toward VDD V. It must be
emphasized that the MOS transistor, either NMOS or PMOS, used in a transmission
gate is symmetric, so its source and drain terminals are interchangeable.
In this case, the source and drain terminals of the NMOS are on the
right-hand side and left-hand side, respectively. As Y rises, the gate-to-source
voltage VGSN gradually decreases. However, the NMOS conducts only when
VGSN ≥ VTN, where VTN > 0 denotes the threshold voltage of the NMOS. Therefore,
if the transmission gate pulls Y high through the NMOS, Y will be a
weak one because VSN ≤ VGN − VTN = VDD − VTN, which is less than VDD,
where VSN and VGN are the source voltage and gate voltage of the NMOS.
As shown in Figure A.12(b), the charge can instead pass through the PMOS to pull
Y high when E = 0, A = 1, and Y = 0. The source and drain terminals
of the PMOS are on the left-hand side and right-hand side, respectively. The
PMOS conducts when VGSP ≤ VTP, where VTP < 0 denotes the threshold
voltage of the PMOS. Since VGSP = 0 − VDD = −VDD ≤ VTP is a constant,
the PMOS always conducts and the charge passes through it until Y
reaches VDD, which is a strong one.
Figure A.6: NOR gate: (a) symbol, (b) description in Verilog, (c) truth table, and (d)
CMOS schematics.
Similarly, as shown in Figure A.13(a), when E = 0, A = 0, and Y = 1, the
charge passes through the NMOS so that Y changes from VDD V to 0 V. In this
case, the source and drain terminals of the NMOS are on the left-hand side
and right-hand side, respectively. The gate-to-source voltage VGSN = VDD is
a constant. Consequently, the NMOS always conducts and the current
flows through it until VY = 0 V, which yields a strong zero. By contrast, as
shown in Figure A.13(b), the charge can pass through the PMOS to pull Y low
when E = 0, A = 0, and Y = 1. The source and drain terminals of the PMOS
are on the right-hand side and left-hand side, respectively. The PMOS
conducts only while its gate-to-source voltage satisfies VGSP ≤ VTP. Therefore,
Y will be a weak zero because VSP ≥ VGP − VTP = −VTP > 0, which is higher than 0 V,
where VSP and VGP are the source voltage and gate voltage of the PMOS.
In summary, in a transmission gate, the PMOS is good at passing logic 1 and
the NMOS is good at passing logic 0. To provide both a strong logic 1 and a
strong logic 0, the PMOS and NMOS are connected in parallel.
Figure A.7: OR gate: (a) symbol, (b) description in Verilog, (c) truth table, and (d)
CMOS schematics.
Figure A.8: Multiplexer gate: (a) symbol, (b) description in Verilog, (c) truth table,
and (d) schematics.
Figure A.9: XOR gate: (a) symbol, (b) description in Verilog, (c) truth table, and (d)
CMOS schematics.
Figure A.10: XNOR gate: (a) symbol, (b) description in Verilog, (c) truth table, and
(d) CMOS schematics.
Figure A.11: Transmission gate: (a) symbol, (b) truth table, and (c) CMOS schemat-
ics.
primitive multiplexer (o, x, y, s);
  output o;
  input x, y, s;
  table
    // x, y, s : o
       0  ?  1 : 0;
       1  ?  1 : 1;
       ?  0  0 : 0;
       ?  1  0 : 1;
       0  0  x : 0;
       1  1  x : 1;
  endtable
endprimitive
Figure A.14: Tristate buffer gate: (a) symbol, (b) description in Verilog, (c) truth
table, and (d) schematics.
Figure A.15: D-Type edge-triggered flip-flop gate: (a) symbol, (b) description in Ver-
ilog, (c) function table, and (d) schematics.
reg clk;
initial begin
  clk = 0;
  forever begin
    #10 clk = 1;
    #10 clk = 0;
  end
end
• while
A while loop executes a statement (or block of statements) as long as its
expression is true. Its syntax is shown below.
The following code segment describes the generation of a clock with period
of 20 time units.
reg clk;
initial begin
  clk = 0;
  while (1) begin
  // ...
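A complete generator of this kind might look like the following minimal sketch. The module name clk_gen_while and the parameter HALF_PERIOD are illustrative assumptions, not names taken from the text; the only fact used from the text is the 20-time-unit period.

```verilog
// Sketch only: testbench-style clock generator (not synthesizable).
// clk_gen_while and HALF_PERIOD are assumed names for illustration.
module clk_gen_while;
  parameter HALF_PERIOD = 10;   // half of the 20-time-unit period
  reg clk;

  initial begin
    clk = 0;
    while (1) begin
      #HALF_PERIOD clk = ~clk;  // toggle every 10 time units
    end
  end
endmodule
```

Because the loop condition is always true, the simulation must be ended elsewhere, for example by a $finish in the testbench.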
• repeat
A repeat loop executes a statement (or block of statements) a fixed number
of times. Its syntax is shown below. If the iteration_number is a constant,
it can be synthesized. However, it is recommended to use repeat loop in
testbench only.
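As a hedged illustration of the repeat loop in a testbench, the sketch below waits for a fixed number of clock edges. The module name repeat_demo, the counter count, and the iteration number 8 are assumptions chosen for the example.

```verilog
// Sketch only: repeat loop used in a testbench to wait for 8 rising edges.
// repeat_demo and count are assumed names for illustration.
module repeat_demo;
  reg clk;
  reg [3:0] count;

  // Free-running clock with a period of 20 time units
  initial begin
    clk = 0;
    forever #10 clk = ~clk;
  end

  initial begin
    count = 0;
    repeat (8) begin            // body executes exactly 8 times
      @(posedge clk);
      count = count + 1'b1;
    end
    $display("count = %0d", count);
    $finish;
  end
endmodule
```

The loop body runs a fixed number of times regardless of any signal value, which is what distinguishes repeat from while.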
• wait
In the following example, after the assertion, and then, deassertion of the
active-low reset signal, reset_n, normal function follows.
initial begin
  wait (!reset_n); // Wait for the assertion of
                   // active-low reset_n
  wait (reset_n);  // Wait for the deassertion of
                   // reset_n
  // Normal function here
end
initial
  fork
    #2 a = 1;
    #4 b = 2;
  join
• event
You can define your own events and write Verilog code in an event-driven
style. A named event is a data type that you can trigger in a procedural
block to cause an action. It must be declared before you can reference it.
The -> operator triggers the named event. The syntax of the named
event is displayed below.
event receive_data;
event check_data_format;
always @(posedge clk)
begin
// ...
always @(receive_data)
begin
// ...
always @(check_data_format)
begin
// ...
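Since only part of the listing is available here, the following is a self-contained sketch of the same event-driven style, not the original code. The event names receive_data and check_data_format come from the listing above; the signals data_valid and data, the module name event_demo, and the stimulus are assumptions made for illustration.

```verilog
// Sketch only: chaining two named events in a testbench.
// data_valid, data, and event_demo are assumed names.
module event_demo;
  event receive_data;
  event check_data_format;
  reg clk;
  reg data_valid;
  reg [7:0] data;

  initial begin
    clk = 0;
    forever #10 clk = ~clk;
  end

  // Trigger receive_data whenever valid data is seen at a clock edge
  always @(posedge clk)
    if (data_valid) -> receive_data;

  // React to receive_data, then trigger the next event in the chain
  always @(receive_data) begin
    $display("%0t: received %h", $time, data);
    -> check_data_format;
  end

  always @(check_data_format)
    $display("%0t: checking format of %h", $time, data);

  // Stimulus changes on the falling edge, away from the sampling edge
  initial begin
    data_valid = 0;
    data = 8'h00;
    @(negedge clk); data = 8'hA5; data_valid = 1;
    @(negedge clk); data_valid = 0;
    #20 $finish;
  end
endmodule
```

Each always block waits on its own event, so the actions run in the order in which the events are triggered.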
C.1 EXAMPLES
A net can physically have three states: 0, 1, and Z (high impedance). The high-
impedance state is symbolized by Z in Verilog. The high-impedance state is used to
model open circuit, which means that the net connected to an output of a logic gate
with high-impedance state appears to be disconnected, and inputs of other logic gates
connected to the high-impedance net are not affected by it.
The tri data type is used when, at any time, all drivers of the tri net except one
must have the high-impedance value, Z. The designer must ensure this, as shown
below.
When condition is 2’b00, a, b, and c are 1, Z, and Z, respectively. That is, only
a drives tri_out, and b and c are high impedance (Z). When conditions are 2’b01,
2’b10, and 2’b11, only b, c, and none drive tri_out, respectively. The tristate wire,
tri, with states (0, 1, Z) is implemented using a tristate buffer.
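The behavior described above can be sketched as follows. This is an assumed reconstruction rather than the original listing: the names condition, a, b, c, and tri_out follow the text, while the module name tri_demo and the data inputs d_a, d_b, and d_c are hypothetical.

```verilog
// Sketch only: three tristate drivers sharing one tri net.
// tri_demo, d_a, d_b, and d_c are assumed names for illustration.
module tri_demo (
  input  wire [1:0] condition,
  input  wire       d_a, d_b, d_c,
  output tri        tri_out
);
  // Each driver is Z unless its own condition is selected
  wire a = (condition == 2'b00) ? d_a : 1'bz;
  wire b = (condition == 2'b01) ? d_b : 1'bz;
  wire c = (condition == 2'b10) ? d_c : 1'bz;

  // All three drive the same net; at most one is not Z at a time
  assign tri_out = a;
  assign tri_out = b;
  assign tri_out = c;
  // condition == 2'b11: no driver is active, so tri_out is Z
endmodule
```

Mutually exclusive conditions are what guarantee that no two drivers are active together; the tri declaration itself does not enforce this.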
When multiple wires drive the same net with conflicting values, an unknown X results.
As shown below, wires a and b drive the same wire, w.
wire w;
assign w = a;
assign w = b; // Conflict: w becomes X when a and b differ
The conflict can be resolved by the wand or wor declarations, depending on the
drive strengths of the drivers. The wand net is used in the following situation.
wand w;
assign w = a;
assign w = b;
Wires a and b are connected as if a virtual AND gate existed, which generates the
wired-AND w depending on the drive strengths of a and b, as shown in Figure C.1.
The wand net is also obtained by using open-drain drivers, as shown in Figure
C.2. The wire w is pulled to logic 1 through its connection to VDD only when both
a and b are logic 0. Otherwise, w is logic 0. That is, the net result is the logic
w = ā·b̄, where the overbar denotes the complement.
The wor net is used in the following situation.
wor w;
assign w = a;
assign w = b;
Inputs a and b are connected as if a virtual OR gate existed, which generates the
wired-OR w depending on the drive strengths of a and b, as shown in Figure C.3.
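The resolution rules of the two net types can be observed with a small simulation. The sketch below is illustrative only; the module name wired_logic_demo and the net names w_and and w_or are assumptions.

```verilog
// Sketch only: observing wand and wor resolution in simulation.
// wired_logic_demo, w_and, and w_or are assumed names.
module wired_logic_demo;
  reg  a, b;
  wand w_and;
  wor  w_or;

  // Two drivers on each net
  assign w_and = a;
  assign w_and = b;
  assign w_or  = a;
  assign w_or  = b;

  initial begin
    a = 1'b0; b = 1'b1;
    #1 $display("a=%b b=%b wand=%b wor=%b", a, b, w_and, w_or);
    a = 1'b1; b = 1'b1;
    #1 $display("a=%b b=%b wand=%b wor=%b", a, b, w_and, w_or);
    $finish;
  end
endmodule
```

With a = 0 and b = 1, the wand net resolves to 0 and the wor net to 1, whereas a plain wire with the same two drivers would be X.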
Figure D.2: Circuits of unsigned number multiplication: (a) structure of the circuit,
(b) schematics for the blocks in the top row, and (c) schematics for the blocks in
remaining rows.
Figure D.4 illustrates method 3 for a regular 5 × 5 signed number multiplier using
partial products, which is suitable for hardware implementation. As displayed,
method 3 requires four 6-bit adders in all cases, which is slightly more than method 2.
It must be emphasized here that when the multiplier is negative ((c) and (d)),
the last adder should be implemented as a subtractor, because multiplier Q is
negative and its sign bit represents a value of −2^(n−1). When multiplier Q is positive
((a) and (b)), the partial product produced by its sign bit is 0, so the subtractor
can still be used, because subtracting 0 is the same as adding 0.
Hence, the last partial product (produced by the sign bit of the multiplier)
should be subtracted regardless of the sign of the multiplier.
Multiplication of signed numbers (5 × 5 multiplier) in hardware uses the systolic
array, as shown in Figure D.5. Schematics for the blocks in the top row and middle
row are the same as those in Figures D.2(b) and D.2(c), respectively. The bottom row
implements the subtractor for the last partial product by inverting every bit of the
last partial product (using the NAND gate) and connecting the carry-in of the adder
to logic 1.
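The rule that the last partial product is subtracted can be checked with a behavioral sketch. This is not the systolic array of Figure D.5; it is an assumed high-level model in which the module name signed_mult5 and the signals m, q, p, m_ext, and pp0 to pp4 are hypothetical.

```verilog
// Sketch only: behavioral 5 x 5 signed multiplier built from partial
// products, with the sign-bit partial product subtracted.
// All names are assumed for illustration.
module signed_mult5 (
  input  wire signed [4:0] m,   // multiplicand
  input  wire signed [4:0] q,   // multiplier
  output wire signed [9:0] p    // product
);
  // Sign-extend the multiplicand to the product width
  wire signed [9:0] m_ext = m;

  // One partial product per multiplier bit
  wire signed [9:0] pp0 = q[0] ? m_ext : 10'sd0;
  wire signed [9:0] pp1 = q[1] ? m_ext : 10'sd0;
  wire signed [9:0] pp2 = q[2] ? m_ext : 10'sd0;
  wire signed [9:0] pp3 = q[3] ? m_ext : 10'sd0;
  wire signed [9:0] pp4 = q[4] ? m_ext : 10'sd0;

  // The sign bit of q has weight -2^4, so its partial product is subtracted
  assign p = pp0 + (pp1 <<< 1) + (pp2 <<< 2) + (pp3 <<< 3) - (pp4 <<< 4);
endmodule
```

For example, with m = −3 and q = −5 (5'b11011), the terms are −3, −6, 0, and −24, and the subtracted term is −48, which gives 15. When q is positive, pp4 is 0 and the subtraction has no effect.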
Figure D.3: Two methods for the signed number multiplication (5 × 5 multiplication)
by hand: (a) sign extending to the maximum bit number you need and (b) sign ex-
tending partial products and partial sums. Multiplicand is negative and multiplier is
positive.
Figure D.4: Method 3: regular 5 × 5 signed number multiplier using the partial prod-
uct for hardware implementation. In this example, there are 4 different cases: (a)
positive multiplicand and positive multiplier, (b) negative multiplicand and positive
multiplier, (c) positive multiplicand and negative multiplier, and (d) negative multi-
plicand and negative multiplier.
Figure D.5: Circuits of signed number multiplication: (a) structure of the circuit and
(b) schematics for the blocks in the bottom row.
• Hardware mindset: Don’t forget that you are designing hardware. A workable
digital circuit depends on two essential factors: function and timing.
Thinking about the functions of the inputs and outputs can help bridge the gap
between a concept and its physical realization. Digital circuits operate
in parallel, whereas computer programs execute sequentially.
As Verilog is used to describe the behavior of hardware, don’t treat Verilog
as another “software” language, such as C.
• Architecture exploration: There exist tradeoffs between designs that
adopt different architectures, such as pipelining or parallelism, which have
different performance and require different hardware resources. Consequently,
the timing diagram should be designed accordingly. To this end, you
must understand what your module will be synthesized to, analyze which
(and unrollable), the loop index must be constant rather than variable. The maximum
iteration limit of a for loop for synthesis is 4096. If your design contains a for loop
with 16384 iterations, you can split it into several smaller for loops.
Combinational loops are not allowed. You may need to declare different signal
names for the inputs and outputs of a combinational logic block. Alternatively,
sequential logic can be used to break the timing loop.
It’s better to manually optimize the circuits rather than depending too much on
the tool. For example, the following two pieces of code implement the same
functionality. However, it’s good practice to explicitly select the operands of a function,
sum3, as shown below.
wire [1:0] sel;
wire [3:0] a, b, c;
reg a_sel, b_sel, c_sel;
wire [1:0] out;
// ...
By contrast, the following piece of code relies on the synthesis tool to derive a
single combinational logic block, sum3, with its operands selected by the sel signal.
The two pieces of code may therefore infer the same logic, even though, at first glance,
sum3 seems to be called (or instantiated) 4 times.
// ...
reg [1:0] out;
// ...
By contrast, the following description hides the next state and the combinational
circuit that produces it.
In addition to binary encoding of the state machine, Gray code encoding and
one-hot encoding can be adopted. Gray code encoding can reduce the dynamic power of
a state machine, and one-hot encoding can achieve a faster sequential circuit because
the decoders used to generate the control signals of the state machine are not needed.
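As a hedged illustration, the sketch below shows a four-state machine with Gray code state encoding and an explicit next-state signal. The module name fsm_gray, the ports start, done, and busy, and the state names are assumptions; state_cs and state_ns follow the naming style used elsewhere in this appendix.

```verilog
// Sketch only: Gray-coded FSM with explicit current and next state.
// fsm_gray, start, done, busy, and the state names are assumed.
module fsm_gray (
  input  wire clk,
  input  wire rst_n,
  input  wire start,
  input  wire done,
  output wire busy
);
  // Consecutive states differ in exactly one bit
  localparam [1:0] S_IDLE = 2'b00,
                   S_LOAD = 2'b01,
                   S_RUN  = 2'b11,
                   S_DONE = 2'b10;

  reg [1:0] state_cs, state_ns;

  // State register
  always @(posedge clk or negedge rst_n)
    if (!rst_n) state_cs <= S_IDLE;
    else        state_cs <= state_ns;

  // Next-state combinational logic
  always @(*) begin
    state_ns = state_cs;        // default: hold the current state
    case (state_cs)
      S_IDLE: if (start) state_ns = S_LOAD;
      S_LOAD:            state_ns = S_RUN;
      S_RUN : if (done)  state_ns = S_DONE;
      S_DONE:            state_ns = S_IDLE;
      default:           state_ns = S_IDLE;
    endcase
  end

  assign busy = (state_cs != S_IDLE);
endmodule
```

A one-hot variant would widen the state registers to four bits and use the codes 4'b0001, 4'b0010, 4'b0100, and 4'b1000, so that each control signal can be taken from a single state bit.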
Grouping variables with the same control signals together in an always block can
save simulation time and is also a good coding style. For example:
Instead, put dissimilar variables into different always blocks. For the example in
Figure E.1:
For the example in Figure E.2, we assume that “state_cs==S0” and “enable” are
mutually exclusive. So, variables a and b are controlled by different signals, and
they are better described using two always blocks.
For the example in Figure E.3, it’s good to describe only the changed conditions,
while the unchanged condition is implied.
The if-else statement and conditional operator imply the multiplexer. Good style
takes advantage of if-else priority that can be commonly shared, as shown in Figure
E.4.
Sharing common parts of complex expressions using assignments to intermediate
variables is good style. For example, in the following original code:
It can be observed that the result (base1 × 256) + (base2 ≪ 4) is a common part of
the two expressions. We can declare an additional variable, base, for the common
part and share it in the two expressions as follows:
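A minimal sketch of this sharing is given below. Only base1, base2, and base come from the text; the module name share_base, the offsets off1 and off2, the outputs addr1 and addr2, and all bit widths are assumptions made for illustration.

```verilog
// Sketch only: sharing a common subexpression through an intermediate net.
// share_base, off1, off2, addr1, addr2, and the widths are assumed.
module share_base (
  input  wire [7:0]  base1, base2,
  input  wire [7:0]  off1, off2,
  output wire [16:0] addr1, addr2
);
  // Common part, computed once
  wire [16:0] base = (base1 * 256) + (base2 << 4);

  // Both expressions reuse the shared result
  assign addr1 = base + off1;
  assign addr2 = base + off2;
endmodule
```

Naming the common part makes the intended sharing explicit instead of leaving it to the optimizer of the synthesis tool.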
Figure E.1: Coding dissimilar flip-flops, a (with synchronous reset) and b (without
synchronous reset): (a) circuits, (b) bad codes putting dissimilar variables into the
same always block, and (c) good codes putting dissimilar variables into different
always blocks.
Figure E.2: Additional example: (a) bad codes putting dissimilar variables, a and b,
into the same always block and (b) good codes putting dissimilar variables, a and b,
into different always blocks.
Figure E.3: Another example: (a) bad codes describing unchanged condition first
and (b) good codes describing only changed conditions, while unchanged condition
is implicitly implied.
The case statement implies the multiplexer, and a design with balanced path de-
lays is good for timing. For example, a multiplexer inferred by the parallel case
statement has a more balanced delay (and hence a shorter longest delay) than that
using the if-else-if statement.
Moreover, a balanced design using parentheses is good for timing. For example, the
expression Y1 = (A + B) + (C + D), whose critical path has two adders, is
better than Y3 = A + B + C + D, whose critical path has three adders, as
shown in Figure E.5.
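The two expressions can be written side by side as in the following sketch. The names Y1, Y3, A, B, C, and D follow the text; the module name adder_balance and the bit widths are assumptions.

```verilog
// Sketch only: balanced versus chained addition.
// adder_balance and the 8-bit operand width are assumed.
module adder_balance (
  input  wire [7:0] A, B, C, D,
  output wire [9:0] Y1, Y3
);
  // Balanced tree: A+B and C+D are computed in parallel, then summed
  assign Y1 = (A + B) + (C + D);

  // Chained form: evaluated left to right as ((A + B) + C) + D
  assign Y3 = A + B + C + D;
endmodule
```

Both outputs have the same value; the difference lies only in the adder structure that the description suggests to the synthesis tool, which may still restructure the logic under timing constraints.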
The comparator might be implemented using XNOR or XOR gates to reduce
the circuit delay. The XNOR and XOR gates test whether two bits are the same
or different, respectively. In the following example, the signal a[7 : 0] is compared to
a constant 8'hA5 using XNOR.
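One way to write such a comparison is sketched below. The signal a and the constant 8'hA5 come from the text; the module name cmp_a5 and the output match are assumptions.

```verilog
// Sketch only: equality comparison against 8'hA5 using bitwise XNOR.
// cmp_a5 and match are assumed names for illustration.
module cmp_a5 (
  input  wire [7:0] a,
  output wire       match
);
  // a ~^ 8'hA5 is 1 in every bit position where a equals the constant;
  // the reduction AND is 1 only if all eight positions agree.
  assign match = &(a ~^ 8'hA5);
endmodule
```

The result equals that of (a == 8'hA5); the XNOR form simply makes the gate-level structure explicit.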
flip-flops should be triggered by the external scan clock using the multiplexers as
follows.
For the modules with flip-flops, three additional ports are dedicated to the scan test
as shown below, including scan_en, scan_in, and scan_out, which are used by the
scan chains compiled by the synthesis tool.
...
endmodule
These three ports will be automatically connected to scan flip-flops in Figure E.6
during the scan chain insertion. When scan_en is true, scan_in is selected. There-
fore, during the scan mode (configured by scan_mode=1), at every positive edge of
scan_clk, the scan_in is latched by the scan flip-flop. For those flip-flops in a scan
chain, their stored data will be shifted in and out at the clock edge. Doing so makes
all data stored in flip-flops controllable.
During the scan mode, when scan_en is false, the normal function input driven
by the output of a combinational logic is selected and captured subsequently at the
clock edge. Since the inputs of a combinational logic are connected to the outputs of
controllable scan flip-flops, the normal function input is predictable and can be used
to check whether defects exist in the combinational and sequential logic.
After the placement and routing, the scan chain may be reordered to reduce the
wire lengths according to their physical locations.
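The behavior of a multiplexed scan flip-flop, as described above, can be modeled with the following sketch. In practice the scan cells come from the cell library and are inserted by the tool, so this is only an assumed behavioral model; the module name scan_dff and the ports d and q are hypothetical, while scan_en, scan_in, and scan_out follow the text.

```verilog
// Sketch only: behavioral model of a mux-D scan flip-flop.
// scan_dff, d, and q are assumed names for illustration.
module scan_dff (
  input  wire clk,
  input  wire scan_en,   // 1: shift scan data, 0: capture functional data
  input  wire scan_in,   // from the previous flip-flop in the chain
  input  wire d,         // normal function input
  output reg  q,
  output wire scan_out   // to the next flip-flop in the chain
);
  always @(posedge clk)
    q <= scan_en ? scan_in : d;

  assign scan_out = q;
endmodule
```

Connecting the scan_out of one cell to the scan_in of the next forms the shift register through which test patterns are loaded and responses are read out.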
It is much more readable to use symbolic constants. Symbolic names can be
connected to constant values either as constants (using Verilog `define) or as parameters
(using the parameter statement).
`define RED_LIGHT_TIME 9
if (counter == `RED_LIGHT_TIME)
...
Often, a long signal is broken into a number of subfields. For example, a 16-bit
instruction may be split into a 4-bit opcode and two 6-bit addresses. Consider the
following statement:
This is harder to understand, and there is a danger of getting the indices wrong,
particularly when the opcode subfield is redefined. Code becomes much more readable
when symbolic names are defined for these subfields. We can declare a new signal,
opcode, as follows.
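A sketch of such a declaration is given below. The signal opcode and the field widths come from the text; the module name instr_fields, the signal instruction, the address names addr_a and addr_b, and the bit positions of the fields are assumptions.

```verilog
// Sketch only: naming the subfields of a 16-bit instruction.
// instr_fields, instruction, addr_a, addr_b, and the bit positions
// are assumed for illustration.
module instr_fields (
  input  wire [15:0] instruction,
  output wire [3:0]  opcode,
  output wire [5:0]  addr_a,
  output wire [5:0]  addr_b
);
  assign opcode = instruction[15:12];  // 4-bit opcode
  assign addr_a = instruction[11:6];   // first 6-bit address
  assign addr_b = instruction[5:0];    // second 6-bit address
endmodule
```

If the instruction format changes, only these assignments need to be updated, and every use of opcode stays correct.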
Good naming is important for a project. For example, you can use lowercase letters
for all signal names and port names, and uppercase letters for constant names. For
active-low signals, the signal name is suffixed with an underscore followed by a
lowercase character, e.g., rst_n. Your code will be much more readable if your signal
names describe their functionality. Consider, for example, the statement
Statements are easier to understand if they fit in one line. While easily readable,
long names can make code appear cluttered. With appropriate naming rules and
supporting documents, moderately short or abbreviated names can still be readable, as in
the following, where ba denotes the base address.
Interaction between initial (in the testbench) and always blocks (in the design)
may induce a race condition. To solve the race condition, we can either assign the
primary input of a design at a time instance other than clock edges or try to use
non-blocking assignments in testbench.
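Both remedies are illustrated in the following sketch. The stand-in design (a single flip-flop that samples din) and the names tb_race_free, din, and q are assumptions made for the example.

```verilog
// Sketch only: two ways to avoid a testbench/design race.
// tb_race_free, din, and q are assumed names; the always block below
// stands in for the design under test.
module tb_race_free;
  reg clk, rst_n, din;
  reg q;

  initial begin
    clk = 0;
    forever #10 clk = ~clk;
  end

  // Stand-in design: samples din at the rising clock edge
  always @(posedge clk or negedge rst_n)
    if (!rst_n) q <= 1'b0;
    else        q <= din;

  initial begin
    rst_n = 0;
    din   = 0;
    @(negedge clk) rst_n = 1;
    // Remedy 1: change the input away from the sampling edge
    @(negedge clk) din = 1;
    // Remedy 2: change the input at the sampling edge with a
    // non-blocking assignment, so the design still samples the old value
    @(posedge clk) din <= 0;
    @(negedge clk);
    $display("q = %b", q);
    $finish;
  end
endmodule
```

At the rising edge where din is updated with a non-blocking assignment, the flip-flop still captures the previous value of din, so q is 1 when it is displayed. A blocking assignment at that same edge would make the captured value depend on the simulator's evaluation order.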
Only use `timescale in the top (testbench) module; it is then inherited by all
submodules.
Well-written code should have lots of high-quality comments. Comments should
describe design rationale and goal. Consider the following code fragment:
These comments do not convey any information. They merely restate the code and
should be deleted. Also, at first glance, there are 4 arithmetic units (2 each for
out1 and out2) in the code. Now, consider the following new code and comments:
The new code and comments give the big-picture view of the code. First, the operands,
op1 and op2, are factored manually and shared by out1 and out2. Second, the arithmetic
units are optimized manually so that only one adder and one subtractor are required.
fabrication, 1, 10, 16, 26, 27, 522 register, 15, 17, 19, 31, 41, 107, 110,
layout, 1, 7, 9, 10, 12, 13, 16, 22, 23, 112, 194, 209, 215, 219, 222,
25, 26, 98, 99, 456, 505, 507, 225–229, 233, 236, 255, 257,
534, 580 260, 261, 306, 307, 312, 315,
lead frame, 1 316, 318, 319, 330, 380–382,
mask, 1, 9–11, 13 394, 399–401, 404, 409, 421,
package, 1, 26, 27, 297, 522 428, 431, 433, 442, 443, 445,
scan chain, 21, 580, 581 446, 449, 452, 456, 457, 459–
standard IC, 16 462, 464, 466, 468, 471, 472,
test, 1, 21, 27, 455, 515, 522, 580, 477, 491, 492, 502, 506, 518,
581 521, 522, 525, 531, 532, 535,
wafer, 1, 16, 27, 522 567, 575, 582
intellectual property (IP), 10, 16, 493 sequential, 15, 17, 19, 21, 23, 29,
interconnect, 257, 268, 270, 278, 284, 49, 51, 75, 77, 106–109, 112,
285, 391, 453 149, 159, 209, 210, 215, 219,
logic gate, 2, 4, 5, 9, 10, 13, 14, 17, 22, 221, 222, 224, 225, 231, 255,
25, 102–104, 149–151, 255, 259, 492, 509, 553, 571–575,
491, 545–547, 563, 580 580, 581
combinational, 15, 17, 19, 21, 23, logic synthesis, 1, 9, 10, 13–16, 21–23,
24, 29, 43, 44, 46–48, 51, 52, 25, 26, 75, 86, 186, 187, 219,
75, 77, 87, 106–109, 112, 149, 294, 491–494, 498, 507, 509,
152–155, 158, 210, 221, 222, 514, 522, 523, 525, 530, 531,
224, 225, 231, 237, 242, 255, 534, 535, 572, 574
259, 294, 303–307, 492, 517, area report, 503
518, 521, 522, 525, 528, 531, clock tree synthesis (CTS), 509,
532, 553, 571–575, 578, 580, 510, 512, 516, 517
581 Design Compiler (DC), 82, 115,
flip-flop (FF), 4, 11, 14, 15, 17, 19, 491, 493, 495, 496, 505, 506,
23, 24, 47, 51, 97, 107, 108, 514, 515, 520, 527, 531, 534,
112, 209, 210, 213–217, 219– 582
226, 228, 230, 231, 233, 235, design report, 499
236, 239, 242, 261, 294, 306, design rule constraint (DRC), 506
363–365, 367–371, 373–376, non-synthesizable, 39, 40, 45, 60,
492, 502, 507, 509–517, 519, 62, 75, 76, 88, 492, 553, 559,
523, 525, 552, 557, 572, 573, 575
575, 577, 580–582 power report, 504
gate-level netlist, 14, 16, 21–23, 48, Synopsys design constraints (SDC),
67, 69, 115, 158, 168, 169, 506
505, 519, 534–536, 572, 575 synthesis command, 491, 572
gate-level schematic, 15 synthesis constraint, 10, 21, 505
latch, 47, 51, 75, 77, 78, 81, 82, 108, synthesis environment, 493
209–215, 219–222, 361, 364– synthesis methodology, 491
366, 492, 526, 552, 575 synthesis report, 23
INDEX 587
  synthesis script, 506
  synthesis technology, 257
  synthesis tool, 21, 22, 25, 26, 35, 48, 82, 98, 99, 115, 143, 150, 154, 155, 160, 185, 186, 255, 257, 491, 504, 519, 523–525, 527, 572, 574, 578, 581, 582
  synthesizable, 17, 21, 30, 35, 39, 41, 45, 47, 50, 52, 59, 75, 86, 158, 257, 455, 491, 571, 573
  timing exception, 518
  timing report, 503
master-slave flip-flop (FF), 220, 364
memory system, 255, 286, 288–291, 298, 364
  asynchronous SRAM, 297, 299–302
  dynamic random-access memory (DRAM), 7, 8, 11, 285, 286, 361–363
  flash memory, 361, 363, 391
  read-only memory (ROM), 11, 99, 286, 303–305, 391, 395, 409, 449, 462
  static random-access memory (SRAM), 11, 229, 285, 286, 288, 293, 294, 297, 361–363, 572, 580
  synchronous SRAM (SSRAM), 286–288, 294, 418, 421–428, 430, 431, 433
Moore’s law, 7, 8, 12
overflow, 57, 60, 87, 88, 137, 139–141, 150, 180, 182, 183, 187–191, 193, 194, 198, 200, 261, 262
  negative overflow, 192, 200
  overflow detection, 149, 182, 183, 190–193, 200
  positive overflow, 191, 192, 198, 200
  saturation arithmetic, 149, 190, 191, 200
overrun, 381, 382, 384, 387, 389
parallelism, 255, 259, 260, 470, 471, 521, 571, 572
physical implementation, 1, 8, 16, 510, 580
pipeline, 17, 19, 20, 23, 24, 31, 112, 255, 257–261, 263–265, 275, 278, 306, 307, 316, 394, 404, 448, 470, 471, 475, 493, 521, 567, 571–573, 580
post-sim, 10, 69, 115
power model, 493, 524, 527
  dynamic (switching) power, 4, 92, 93, 491, 504, 523–525, 527, 546
  static power, 4, 93, 491, 524, 525, 546
  switching activity interchange format (SAIF), 529
pre-sim, 10, 69, 115, 505
propagated clock, 517
race condition, 108, 209, 244–246, 582, 583
register-transfer level (RTL), 1, 14, 15, 17, 20, 23, 31, 37, 48, 186, 187, 255, 455, 491, 527
  high-level RTL, 17, 30, 75, 76, 255
  low-level RTL, 17, 30, 35, 76, 255, 580
  RTL code, 10, 21, 27, 50, 82, 115, 142, 149–151, 153, 156, 171, 174, 177, 178, 187, 188, 195, 209, 223, 227, 232, 234–236, 244, 255, 294, 302, 304, 306, 311, 314, 316–319, 328, 330, 333, 371, 374, 375, 381, 385, 389, 404, 428, 448, 456, 477, 524, 525, 527, 531, 571, 573, 575
  RTL description, 17, 255
  RTL design, 15–17, 19, 21–23, 26, 32, 75, 107, 112, 174, 255, 257, 306, 314, 361, 441, 573
  RTL model, 15, 48, 56, 194, 491, 505, 525, 572
  RTL netlist, 575
  RTL simulation, 10, 100, 115, 237, 239, 492, 509, 534
  RTL statement, 257
standard delay format (SDF), 10, 69, 99, 112, 115, 502, 505
state machine, 230
synchronizer, 239, 361, 370–372, 374–376, 380–385, 387–389, 391, 392, 516, 519
  clock domain crossing (CDC), 371
system-level design, 16, 26, 255, 257, 361
testbench, 12, 19, 20, 29, 33, 45–47, 66–68, 88, 91, 99, 101, 113–115, 244, 491, 560, 571, 583
threshold voltage, 1, 93, 366, 524, 525, 546, 550, 551
timing constraint, 19, 22–26, 97, 99, 112, 209, 299, 363, 364, 502, 503, 516, 518, 520, 522, 525
  clock-to-Q delay, 109, 209, 217–219, 373, 375, 509, 510, 531, 580
  hold time, 1, 15, 23–25, 96, 99, 100, 110–112, 209, 217, 219, 220, 287, 288, 294, 300, 306, 363, 364, 369, 370, 381, 491, 497, 500, 502, 503, 510–512, 532, 573
  recovery time, 112
  removal time, 112, 113
  setup time, 1, 15, 23–25, 97, 99, 110–112, 209, 217, 218, 220, 287, 288, 297, 300, 306, 363, 364, 369, 370, 375, 381, 491, 497, 501, 502, 509–513, 531, 532
  width, 110, 111
timing diagram, 19, 20, 23, 24, 47, 99, 104, 220, 221, 223–225, 227, 236, 255, 263–265, 275–278, 287, 300, 301, 305–308, 310, 328–330, 380, 382–384, 391, 392, 401, 404, 405, 428, 429, 446, 447, 456, 457, 462, 463, 472–475, 525, 537, 538, 571–573
timing verification, 1, 14, 23, 69, 94
transistor, 1–3, 5, 7–9, 11, 12, 14, 31, 75, 76, 102, 103, 213, 285, 303, 361, 362, 367, 368, 454, 455, 523, 524, 549
  BJT, 4, 14
  CMOS, 4, 5, 13, 52, 213, 364, 365, 524, 525, 545–552, 554, 555
  FET, 4, 365, 366
  MOS, 52, 550
  MOSFET, 4, 91
  NMOS, 4, 91–93, 213, 214, 524, 545–551, 555, 556
  PMOS, 4, 91–93, 213, 214, 524, 545–551, 555, 556
  transistor-level netlist, 16
  transistor-level schematic, 30, 31, 365
  transistor-transistor logic (TTL), 3, 4
underrun, 387–389
value change dump (VCD), 68, 529
Verilog hardware description language (HDL), 12, 13, 15–17, 19, 29, 32, 43, 243
  always block, 21, 23, 29, 51, 75, 78–80, 86, 87, 89, 106–109, 152, 158, 222–225, 244, 255, 257, 333, 491, 492, 525, 573, 575–577, 583
  behavioral description, 17, 30, 75, 154, 160, 162, 163, 166–168, 170, 175, 190
  blocking assignment, 29, 45, 49, 50, 75, 102, 103, 105–108, 152, 153, 243, 244, 492, 572, 575, 578
  case statement, 80
  continuous assignment, 29, 39, 43, 44, 75, 87, 100, 109, 149, 151, 160, 215, 219, 243, 244, 255, 257, 297, 491, 492, 573, 580
  dataflow description, 17, 30, 75, 169, 175, 176, 178