Unit 2


CAO - UNIT 2

Fixed-Point Representation

This representation uses a fixed number of bits for the integer part and for the fractional part. For example, if the fixed-point format is IIII.FFFF, then the minimum value you can store is 0000.0001 and the maximum value is 9999.9999. A fixed-point number representation has three parts: the sign field, the integer field, and the fractional field.

We can represent these numbers using:

 Signed (sign-magnitude) representation: range from -(2^(k-1) - 1) to (2^(k-1) - 1), for k bits.
 1’s complement representation: range from -(2^(k-1) - 1) to (2^(k-1) - 1), for k bits.
 2’s complement representation: range from -2^(k-1) to (2^(k-1) - 1), for k bits.

2’s complement representation is preferred in computer systems because it represents zero unambiguously and makes arithmetic operations easier.
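As a quick check, the three ranges above can be computed directly. The function below is only an illustrative sketch; its name and dictionary keys are not from the text.

```python
# Sketch: ranges representable with k bits under the three signed encodings
# described above (function name and keys are illustrative).
def signed_ranges(k):
    return {
        "sign-magnitude":  (-(2**(k-1) - 1), 2**(k-1) - 1),
        "ones-complement": (-(2**(k-1) - 1), 2**(k-1) - 1),
        "twos-complement": (-(2**(k-1)),     2**(k-1) - 1),
    }

print(signed_ranges(8))  # twos-complement gives (-128, 127)
```

Note how only 2’s complement gains the extra negative value, since it has a single representation of zero.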

Example − Assume a number uses a 32-bit format that reserves 1 bit for the sign, 15 bits for the integer part, and 16 bits for the fractional part.

Then -43.625 is represented as follows:

1 | 000000000101011 | 1010000000000000

where 0 is used to represent + and 1 is used to represent −; 000000000101011 is the 15-bit binary value of the decimal integer 43, and 1010000000000000 is the 16-bit binary value of the fraction 0.625.
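The example can be reproduced with a small sketch of this 1 + 15 + 16 bit format. The function name is hypothetical, and the sketch assumes the magnitude fits in 15 integer and 16 fractional bits.

```python
# Sketch of the 1 + 15 + 16 fixed-point format from the example above
# (encode_fixed is an illustrative name, not from the text).
def encode_fixed(x):
    sign = 1 if x < 0 else 0
    mag = abs(x)
    integer = int(mag)                           # 15-bit integer field
    frac = int(round((mag - integer) * 2**16))   # 16-bit fractional field
    return f"{sign}{integer:015b}{frac:016b}"

print(encode_fixed(-43.625))
# sign 1, integer 000000000101011, fraction 1010000000000000
```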

The advantage of a fixed-point representation is performance; the disadvantage is the relatively limited range of values it can represent. It is therefore usually inadequate for numerical analysis, as it does not allow enough range and accuracy: a number whose representation exceeds 32 bits would have to be stored inexactly.

For the 32-bit format given above, the smallest positive number is 2^-16 ≈ 0.000015 and the largest positive number is (2^15 - 1) + (1 - 2^-16) = 2^15 - 2^-16 ≈ 32768, and the gap between consecutive numbers is a constant 2^-16.

The radix point can be moved left or right by pairing the number with an exponent, so that the integer field holds only a single 1; this idea leads to the floating-point representation.

Floating-Point Representation

This representation does not reserve a specific number of bits for the integer part or the fractional part. Instead, it reserves a certain number of bits for the number itself (called the mantissa or significand) and a certain number of bits to say where within that number the radix point sits (called the exponent).

The floating-point representation of a number has two parts: the first part is a signed fixed-point number called the mantissa; the second part designates the position of the decimal (or binary) point and is called the exponent. The fixed-point mantissa may be a fraction or an integer. Only the mantissa m and the exponent e are physically represented in the register (including their signs). A floating-point binary number is represented in a similar manner except that it uses base 2 for the exponent. A floating-point number is said to be normalized if the most significant digit of the mantissa is 1.

So, the actual number is (-1)^s × (1 + m) × 2^(e - Bias), where s is the sign bit, m is the mantissa, e is the exponent value, and Bias is the bias number.

Note that signed integers and exponents may be represented in sign-magnitude, one’s complement, or two’s complement representation.

The floating-point representation is more flexible: any non-zero number x can be written in the normalized form ±(1.b1b2b3...)₂ × 2^n.
Example − Suppose a number uses a 32-bit format: 1 sign bit, 8 bits for a signed exponent, and 23 bits for the fractional part. The leading 1 is not stored (as it is always 1 for a normalized number) and is referred to as the “hidden bit”.

Then -53.5 is normalized as -53.5 = (-110101.1)₂ = (-1.101011)₂ × 2^5, which is represented as follows, where 00000101 is the 8-bit binary value of the exponent +5.

Note that the 8-bit exponent field is used to store integer exponents -126 ≤ n ≤ 127.

The smallest normalized positive number that fits into 32 bits is (1.00000000000000000000000)₂ × 2^-126 = 2^-126 ≈ 1.18 × 10^-38, and the largest normalized positive number that fits into 32 bits is (1.11111111111111111111111)₂ × 2^127 = (2^24 - 1) × 2^104 ≈ 3.40 × 10^38.

These numbers are represented as shown below.
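These two extremes can be verified with Python’s `struct` module, which packs a 32-bit pattern and reinterprets it as a single-precision float. The helper name below is illustrative, not from the text.

```python
import struct

def bits_to_float(sign, exp, frac):
    """Assemble a 32-bit IEEE pattern from its three fields (illustrative helper)."""
    word = (sign << 31) | (exp << 23) | frac
    return struct.unpack(">f", struct.pack(">I", word))[0]

smallest = bits_to_float(0, 1, 0)                # 1.00...0 x 2^-126
largest = bits_to_float(0, 254, (1 << 23) - 1)   # 1.11...1 x 2^127
print(smallest)  # ≈ 1.18e-38
print(largest)   # ≈ 3.40e38
```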

The precision of a floating-point format is the number of positions reserved for binary digits
plus one (for the hidden bit). In the examples considered here the precision is 23+1=24.

The gap between 1 and the next normalized floating-point number is known as the machine epsilon; the gap is (1 + 2^-23) - 1 = 2^-23 for the example above. Note that this is not the same as the smallest positive floating-point number: unlike the fixed-point case, the spacing between floating-point numbers is non-uniform.

Note that non-terminating binary numbers cannot be represented exactly in floating-point representation; e.g., 1/3 = (0.010101...)₂ cannot be a floating-point number, as its binary representation is non-terminating.
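Both observations can be checked by rounding through 32-bit precision with `struct`. The `f32` helper is an assumption of this sketch, not part of the text.

```python
import struct

def f32(x):
    """Round a Python float to the nearest 32-bit float (illustrative helper)."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

eps = f32(1.0 + 2**-23) - 1.0
print(eps == 2**-23)    # True: machine epsilon of the 24-bit significand
print(f32(1/3) == 1/3)  # False: 1/3 has no exact binary representation
```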

IEEE Floating point Number Representation

IEEE (Institute of Electrical and Electronics Engineers) has standardized floating-point representation as shown in the following diagram.
So, the actual number is (-1)^s × (1 + m) × 2^(e - Bias), where s is the sign bit, m is the mantissa, e is the exponent value, and Bias is the bias number. The sign bit is 0 for a positive number and 1 for a negative number. Exponents are represented in a biased (excess) representation rather than two’s complement.

According to the IEEE 754 standard, a floating-point number is represented in the following formats:

 Half Precision (16 bit): 1 sign bit, 5 bit exponent, and 10 bit mantissa
 Single Precision (32 bit): 1 sign bit, 8 bit exponent, and 23 bit mantissa
 Double Precision (64 bit): 1 sign bit, 11 bit exponent, and 52 bit mantissa
 Quadruple Precision (128 bit): 1 sign bit, 15 bit exponent, and 112 bit mantissa

Example: represent -17 in the IEEE-754 32-bit format.
Sign = 1 (-ve number)
Exponent: bias for 32 bits = 127 (2^(8-1) - 1 = 127); 17 = 10001₂ = 1.0001 × 2^4, so the stored exponent is 127 + 4 = 131 = 10000011₂
Mantissa: fractional part = 00010000000000000000000
So -17 = 1 10000011 00010000000000000000000₂
Fig 2.10: Floating point formats

Example: The IEEE-754 32-bit floating-point pattern is 0 10000000 110 0000 0000 0000 0000 0000. What is the number?
Sign bit S = 0 (positive number)
Exponent E = 10000000₂ = 128₁₀ (in normalized form)
Fraction is 1.11₂ (with the implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75₁₀
The number is +1.75 × 2^(128-127) = +3.5₁₀
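A pattern like this can be decoded mechanically with `struct`; the helper below is a sketch (its name is hypothetical) and reproduces the result above.

```python
import struct

def decode(bit_string):
    """Decode a 32-bit IEEE-754 pattern written as a string of 0s and 1s
    (illustrative helper, not from the text)."""
    word = int(bit_string.replace(" ", ""), 2)
    return struct.unpack(">f", struct.pack(">I", word))[0]

print(decode("0 10000000 11000000000000000000000"))  # 3.5
```

The same helper checks any of the worked patterns in this section.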

Example: Suppose the IEEE-754 32-bit floating-point pattern is 1 01111110 100 0000 0000 0000 0000 0000. Find the decimal number.
Sign bit S = 1 (negative number)
Exponent E = 01111110₂ = 126₁₀ (in normalized form)
Fraction is 1.1₂ (with the implicit leading 1) = 1 + 2^-1 = 1.5₁₀
The number is -1.5 × 2^(126-127) = -0.75₁₀
Example: Suppose the IEEE-754 32-bit floating-point pattern is 1 01111110 000 0000 0000 0000 0000 0001. What is the decimal number?
Sign bit S = 1 (negative number)
Exponent E = 01111110₂ = 126₁₀ (in normalized form)
Fraction is 1.000 0000 0000 0000 0000 0001₂ (with the implicit leading 1) = 1 + 2^-23
The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625

Ripple Carry Adder

An adder causes carry-propagation delay when performing arithmetic operations such as multiplication and division, since these use several addition or subtraction steps. This is a major problem for the adder, and improving the speed of addition improves the speed of all other arithmetic operations; hence reducing the carry-propagation delay of adders is of great importance. Different logic-design approaches have been employed to overcome the carry-propagation problem. One widely used approach is carry look-ahead, which solves this problem by calculating the carry signals in advance, based on the input signals. This type of adder circuit is called a carry look-ahead adder. Here a carry signal is generated in two cases:

1. When both input bits A and B are 1

2. When one of the two bits is 1 and the carry-in is 1

In ripple carry adders, for each adder block, the two bits to be added are available instantly. However, each adder block must wait for the carry to arrive from the previous block, so it cannot generate its sum and carry until its input carry is known. Each block waits for the previous block to produce its carry, which results in a considerable time delay called the carry-propagation delay.

Consider the above 4-bit ripple carry adder. Each full adder produces its sum as soon as its input signals are applied. But its carry input does not reach its final steady-state value until the carry from the previous stage has reached its steady-state value. Therefore the carry must propagate through all the stages before the output and carry settle to their final steady-state values.

The propagation time is equal to the propagation delay of one adder block multiplied by the number of adder blocks in the circuit. For example, if each full adder stage has a propagation delay of 20 nanoseconds, the final carry will reach its correct value after 60 (20 × 3) nanoseconds. The situation gets worse as we extend the number of stages to add more bits.
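The rippling of the carry can be modeled in software. This sketch (function names are illustrative) chains full adders so that each stage consumes the previous stage’s carry-out, exactly as described above.

```python
# Sketch of a 4-bit ripple-carry adder built from full adders; each stage's
# carry-in is the previous stage's carry-out, which models the rippling.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add(a_bits, b_bits, cin=0):
    """a_bits and b_bits are LSB-first lists of 0/1 bits."""
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, cin = full_adder(a, b, cin)  # carry ripples into the next stage
        sum_bits.append(s)
    return sum_bits, cin

# 1011 (11) + 0110 (6) = 10001 (17): sum bits LSB-first, final carry 1
print(ripple_add([1, 1, 0, 1], [0, 1, 1, 0]))  # ([1, 0, 0, 0], 1)
```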
1. Explain Carry Look Ahead adders in detail

A carry-lookahead adder (CLA) is a type of adder used in digital logic. A carry-lookahead adder improves speed by reducing the time required to determine carry bits: it calculates one or more carry bits before the sum, which reduces the wait time to compute the result of the higher-order bits.

Operation Mechanism

Carry look-ahead depends on two things:

1. Calculating, for each digit position, whether that position is going to propagate a carry if one comes in from the right.
2. Combining these calculated values so as to deduce quickly, for each group of digits, whether that group is going to propagate a carry that comes in from the right.

CLA – Concept

 To reduce the computation time, there are faster ways to add two binary numbers by using carry look-ahead adders.
 They work by creating two signals P and G, known as the Carry Propagate and Carry Generate signals.
 The carry propagate signal passes the carry to the next level, whereas the carry generate signal produces the output carry regardless of the input carry.
 The block diagram of a 4-bit Carry Lookahead Adder is shown below.

The number of gate levels for the carry propagation can be found from the circuit of the full adder. The signal from the input carry Cin to the output carry Cout requires an AND gate and an OR gate, which constitute two gate levels. So if there are four full adders in the parallel adder, the output carry C5 would pass through 2 × 4 = 8 gate levels from C1 to C5. For an n-bit parallel adder, there are 2n gate levels to propagate through.

The corresponding Boolean expressions to construct a carry look-ahead adder are given here. In the carry look-ahead circuit we need to generate the two signals, carry propagate (P) and carry generate (G):

 Pi = Ai ⊕ Bi
 Gi = Ai · Bi

The output sum and carry can be expressed as


 Sumi = Pi ⊕ Ci
 Ci+1 = Gi + ( Pi · Ci)

Having these we could design the circuit. We can now write the Boolean function
for the carry output of each stage and substitute for each Ci its value from the previous
equations:
 C1 = G0 + P0 · C0
 C2 = G1 + P1 · C1 = G1 + P1 · G0 + P1 · P0 · C0
 C3 = G2 + P2 · C2 = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · C0
 C4 = G3 + P3 · C3 = G3 + P3 · G2 + P3 · P2 · G1 + P3 · P2 · P1 · G0 + P3 · P2 · P1 · P0 · C0

Algebraic calculations for carry out


• ci+1 = gi + pici

c1 = g0 + p0c0

c2 = g1 + p1c1
= g1 + p1(g0 + p0c0)
= g1 + p1g0 + p1p0c0

c3 = g2 + p2c2
= g2 + p2(g1 + p1g0 + p1p0c0)
= g2 + p2g1 + p2p1g0 + p2p1p0c0

c4 = g3 + p3c3
= g3 + p3(g2 + p2g1 + p2p1g0 + p2p1p0c0)
= g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0c0
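The generate/propagate equations above can be sketched in code; `cla_carries` is a hypothetical name. For brevity this sketch evaluates ci+1 = gi + pi·ci iteratively, which yields exactly the same carry values as the fully expanded sums of products.

```python
# Sketch of the carry-lookahead equations: all carries derived from the
# generate (g = a AND b) and propagate (p = a XOR b) signals.
def cla_carries(a_bits, b_bits, c0=0):
    """LSB-first bit lists; returns [c1, c2, c3, c4]."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = []
    c = c0
    for gi, pi in zip(g, p):
        c = gi | (pi & c)  # ci+1 = gi + pi*ci, expandable to sum-of-products
        carries.append(c)
    return carries

# Same operands as the ripple-carry example: 1011 (11) + 0110 (6)
print(cla_carries([1, 1, 0, 1], [0, 1, 1, 0]))  # [0, 1, 1, 1]
```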

4 Bit – Carry look ahead adder circuit diagram


2. Explain IEEE 754 Format in detail

In mathematics, a floating-point number is otherwise called a real-number representation.

Examples:

 3.14159265…ten (pi)
 1.0ten × 10^-9 (seconds in a nanosecond)
 3.15576ten × 10^9 (seconds in a typical century)

A number in scientific notation that has a single digit to the left of the decimal point and no leading 0s is called a normalized number.

For example, 1.0ten × 10^-9 is in normalized scientific notation, but 0.1ten × 10^-8 and 10.0ten × 10^-10 are not.

Binary numbers can also be written in scientific notation: 1.0two × 2^-1; here the point is called the binary point for clarity.

The general form of a binary floating-point number is 1.xxxxxxxxxtwo × 2^yyyy.

Floating-Point Representation:

The representation has two parts: a fraction part and an exponent part. The size of the fraction part is used to increase or decrease precision, and the size of the exponent part is used to increase or decrease the range of the number.

The representation has 1 sign bit, where bit 0 represents a positive value and bit 1 represents a negative value.

In general, floating-point numbers are of the form

(-1)^S × F × 2^E

where F is the value in the fraction field and E is the value in the exponent field.

Overflow & Underflow

Overflow here means that the exponent is too large to be represented in the exponent field.
The smallest non zero fraction that can be represented in the fraction field, when it istoo small
to store in faction field then we call it as underflow.
Double precision representation:

The range of the single-precision format is 2.0 × 10^-38 to 2.0 × 10^38.

The range of the double-precision format is 2.0 × 10^-308 to 2.0 × 10^308.

To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers
implicit.

Hence, the number is actually 24 bits long in single precision (implied 1 and a 23-bit fraction), and 53 bits long
in double precision (1 + 52).

So the new representation will be (-1)^S × (1 + F) × 2^E

The IEEE format uses NaN (Not a Number) to represent undefined results, such as the calculation of 0/0; infinities have their own representations.

Consider the representation of the exponent values +1 and -1. If 2^-1 were stored in 2’s complement format, the value -1 would look as if a large exponent value were sitting in the exponent field; to avoid this confusion, a bias value is used.

IEEE 754 uses a bias of 127 for single precision, so an exponent of -1 is represented by the bit pattern of the value -1 + 127ten, or 126ten, and +1 is represented by 1 + 127ten, or 128ten.

So the value represented by a floating-point number is

 (-1)^S × (1 + F) × 2^(Exponent - bias)

The range of single-precision numbers is then from as small as ±1.00000000000000000000000two × 2^-126 to as large as ±1.11111111111111111111111two × 2^+127.
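The biased encoding of the exponent can be sketched in one line; the function name is an assumption of this example.

```python
# Sketch: the stored exponent field holds the true exponent plus the bias
# (127 for single precision); exponent_field is an illustrative name.
def exponent_field(exp, bias=127):
    return format(exp + bias, "08b")

print(exponent_field(-1))  # 01111110 (126ten)
print(exponent_field(+1))  # 10000000 (128ten)
```

Biased exponents also make magnitude comparison of stored patterns simple, since a larger field value always means a larger exponent.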
3. State and explain the arithmetic multiplication operations on floating point numbers.

Multiplying decimal numbers in scientific notation by hand: 1.110ten × 10^10 × 9.200ten × 10^-5. Assume that we can store only four digits of the significand and two digits of the exponent.

Step 1: Unlike addition, we calculate the exponent of the product by simply adding the exponents of the operands together:

New exponent = 10 + (-5) = 5


Step 2: Next comes the multiplication of the significands:

  1.110ten
× 9.200ten
  0000
 0000
2220
9990
10212000ten

There are three digits to the right of the decimal point for each operand, so the decimal point is placed six digits from the right in the product significand: 10.212000ten. Assuming that we can keep only three digits to the right of the decimal point, the product is 10.212 × 10^5.

Step 3: This product is unnormalized, so we need to normalize it:

10.212ten × 10^5 = 1.0212ten × 10^6

and check for overflow and underflow.

Step 4: We assumed that the significand is only four digits long (excluding the sign), so we must round the number. The number 1.0212ten × 10^6 is rounded to four digits in the significand: 1.021ten × 10^6.

Step 5: The sign of the product depends on the signs of the original operands. If they are both the same, the sign is positive; otherwise, it is negative. Hence, the product is +1.021ten × 10^6.
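The five steps above can be sketched with Python’s `decimal` module, which keeps the arithmetic exact; the variable names are illustrative.

```python
from decimal import Decimal

# Sketch of the five multiplication steps using exact decimal arithmetic.
a_sig, a_exp = Decimal("1.110"), 10
b_sig, b_exp = Decimal("9.200"), -5

exp = a_exp + b_exp                   # Step 1: add the exponents -> 5
sig = a_sig * b_sig                   # Step 2: multiply significands -> 10.212000
while abs(sig) >= 10:                 # Step 3: normalize the product
    sig, exp = sig / 10, exp + 1
sig = sig.quantize(Decimal("1.000"))  # Step 4: round to a four-digit significand
# Step 5: signs are equal, so the product stays positive
print(sig, exp)                       # 1.021 6
```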
An example based on the above algorithm, using binary numbers (Binary Floating-Point Multiplication): multiply the numbers 0.5ten and -0.4375ten in binary. First convert the numbers to binary:
 0.5ten = 0.1two = 0.1two × 2^0 = 1.000two × 2^-1
 -0.4375ten = -0.0111two = -0.0111two × 2^0 = -1.110two × 2^-2
Step 1. Add the exponents without bias: -1 + (-2) = -3
Step 2. Multiply the significands:

  1.000two
× 1.110two
  0000
 1000
1000
1000
1110000two

There are three digits to the right of the binary point for each operand, so the product is 1.110000two × 2^-3; keeping only 4 bits gives 1.110two × 2^-3.
Step 3. Check whether the product is normalized, and check for underflow or overflow. It is already normalized, so the product is 1.110two × 2^-3.
Step 4. Round the product to 4 bits; it already fits, so no action is needed: 1.110two × 2^-3

Step 5. Since the signs of the operands differ, the product is negative: -1.110two × 2^-3.
Converting to decimal, -1.110two × 2^-3 = -0.001110two = -7 × 2^-5 = -7/32ten = -0.21875ten.
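The final result is easy to sanity-check with ordinary floats, since all the values involved are exactly representable in binary:

```python
# Quick check of the binary multiplication example with ordinary floats.
product = 0.5 * -0.4375
print(product)                    # -0.21875
print(-1.75 * 2**-3 == product)   # True: 1.110two is 1.75, scaled by 2^-3
```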

Booth Multiplier

Booth’s algorithm gives a procedure for multiplying binary integers in signed 2’s complement representation efficiently, i.e., with fewer additions/subtractions. It exploits the fact that strings of 0’s in the multiplier require no addition, just shifting, and that a string of 1’s in the multiplier from bit weight 2^k down to weight 2^m can be treated as 2^(k+1) - 2^m. As in all multiplication schemes, Booth’s algorithm requires examination of the multiplier bits and shifting of the partial product. Prior to the shifting, the multiplicand may be added to the partial product, subtracted from the partial product, or left unchanged, according to the following rules:

1. The multiplicand is subtracted from the partial product upon encountering the first least
significant 1 in a string of 1’s in the multiplier

2. The multiplicand is added to the partial product upon encountering the first 0 (provided
that there was a previous ‘1’) in a string of 0’s in the multiplier.

3. The partial product does not change when the multiplier bit is identical to the previous
multiplier bit.

Hardware Implementation of Booth’s Algorithm – The hardware implementation of Booth’s algorithm requires the register configuration shown in the figure below.
Booth’s Algorithm Flowchart –

We name the registers AC, BR, and QR. Qn designates the least significant bit of the multiplier in register QR. An extra flip-flop Qn+1 is appended to QR to facilitate a double inspection of the multiplier. The flowchart for Booth’s algorithm is shown below.

Flow chart of Booth’s Algorithm.

AC and the appended bit Qn+1 are initially cleared to 0, and the sequence counter SC is set to a number n equal to the number of bits in the multiplier. The two multiplier bits in Qn and Qn+1 are inspected. If the two bits are equal to 10, the first 1 in a string has been encountered; this requires subtraction of the multiplicand from the partial product in AC. If the two bits are equal to 01, the first 0 in a string of 0’s has been encountered; this requires addition of the multiplicand to the partial product in AC. When the two bits are equal, the partial product does not change. An overflow cannot occur, because additions and subtractions of the multiplicand alternate; as a consequence, the two numbers that are added always have opposite signs, a condition that excludes overflow. The next step is to shift right the partial product and the multiplier (including Qn+1). This is an arithmetic shift right (ashr) operation, which shifts AC and QR to the right and leaves the sign bit in AC unchanged. The sequence counter is decremented and the computational loop is repeated n times. The product of negative numbers is important: while multiplying negative numbers we need to find the 2’s complement of the number to change its sign, because it is easier to add than to perform binary subtraction. The product of two negative numbers is demonstrated below along with the 2’s complement.
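The register-level description above can be sketched as follows. The names `ac`, `qr`, and `q_extra` mirror AC, QR, and Qn+1; the function itself is an illustrative sketch, not a cycle-accurate hardware model.

```python
# Sketch of Booth's algorithm on n-bit two's-complement operands.
def booth_multiply(multiplicand, multiplier, n):
    mask = (1 << n) - 1
    ac, qr, q_extra = 0, multiplier & mask, 0
    for _ in range(n):
        pair = (qr & 1, q_extra)        # inspect Qn and Qn+1
        if pair == (1, 0):              # first 1 of a string: AC = AC - BR
            ac = (ac - multiplicand) & mask
        elif pair == (0, 1):            # first 0 after 1's: AC = AC + BR
            ac = (ac + multiplicand) & mask
        # arithmetic shift right of AC, QR, Qn+1 as one unit
        q_extra = qr & 1
        qr = ((qr >> 1) | ((ac & 1) << (n - 1))) & mask
        ac = (ac >> 1) | (ac & (1 << (n - 1)))  # keep AC's sign bit
    product = (ac << n) | qr
    if product & (1 << (2 * n - 1)):    # interpret the 2n bits as signed
        product -= 1 << (2 * n)
    return product

print(booth_multiply(-7, 3, 4))  # -21
```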
Restoring Division Algorithm For Unsigned Integer

A division algorithm provides a quotient and a remainder when we divide two numbers. Division algorithms are generally of two types: slow and fast.

Slow algorithm and Fast algorithm

Slow division algorithms include restoring, non-restoring, non-performing restoring, and SRT division; fast algorithms include Newton–Raphson and Goldschmidt. Here we will perform the restoring algorithm for unsigned integers. The term “restoring” is due to the fact that the value of register A is restored after each iteration.

Here, register Q contains the quotient and register A contains the remainder. The n-bit dividend is loaded into Q and the divisor is loaded into M. Register A is initially kept 0, and it is this register whose value is restored during iteration, which is why the method is named restoring.

Step Involved

 Step-1: First the registers are initialized with the corresponding values (Q = dividend, M = divisor, A = 0, n = number of bits in the dividend)

 Step-2: The contents of registers A and Q are shifted left as if they were a single unit

 Step-3: The content of register M is subtracted from A and the result is stored in A

 Step-4: The most significant bit of A is checked: if it is 0, the least significant bit of Q is set to 1; otherwise, if it is 1, the least significant bit of Q is set to 0 and the value of register A is restored, i.e., to its value before the subtraction of M

 Step-5: The value of counter n is decremented

 Step-6: If the value of n becomes zero we exit the loop; otherwise we repeat from Step 2

 Step-7: Finally, register Q contains the quotient and A contains the remainder
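The seven steps above can be sketched directly; the function name and the use of Python integers for the registers are assumptions of this sketch.

```python
# Sketch of restoring division for unsigned n-bit integers.
def restoring_divide(dividend, divisor, n):
    a, q, m = 0, dividend, divisor      # Step 1: initialize A, Q, M
    for _ in range(n):
        # Step 2: shift A,Q left as a single unit
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & ((1 << n) - 1)
        a -= m                          # Step 3: trial subtraction
        if a < 0:                       # Step 4: MSB of A is 1
            q &= ~1                     #   Q[0] = 0
            a += m                      #   restore A
        else:
            q |= 1                      #   Q[0] = 1
    return q, a                         # Step 7: quotient, remainder

print(restoring_divide(11, 3, 4))  # (3, 2)
```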

Non-Restoring Division For Unsigned Integer

In the earlier section on restoring division we learned about the restoring method. Non-restoring division is less complex than the restoring one because simpler operations are involved, i.e., addition and subtraction, and no restoring step is performed. The method relies on the sign bit of the register named A, which initially contains zero. The flowchart is given below.

Let’s walk through the steps involved:

 Step-1: First the registers are initialized with the corresponding values (Q = dividend, M = divisor, A = 0, n = number of bits in the dividend)
 Step-2: Check the sign bit of register A
 Step-3: If it is 1, shift left the content of AQ and perform A = A + M; otherwise, shift left AQ and perform A = A - M (that is, add the 2’s complement of M to A and store the result in A)
 Step-4: Again check the sign bit of register A
 Step-5: If the sign bit is 1, Q[0] becomes 0; otherwise Q[0] becomes 1 (Q[0] is the least significant bit of register Q)
 Step-6: Decrement the value of n by 1
 Step-7: If n is not equal to zero, go to Step 2; otherwise go to the next step
 Step-8: If the sign bit of A is 1, then perform A = A + M
 Step-9: Register Q contains the quotient and A contains the remainder.
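A matching sketch of the non-restoring steps, again with Python integers standing in for the registers (so the sign bit of A is simply the test A < 0):

```python
# Sketch of non-restoring division for unsigned n-bit integers.
def non_restoring_divide(dividend, divisor, n):
    a, q, m = 0, dividend, divisor        # Step 1: initialize A, Q, M
    for _ in range(n):
        msb_q = (q >> (n - 1)) & 1
        q = (q << 1) & ((1 << n) - 1)
        if a < 0:                         # Steps 2-3: shift AQ left, then...
            a = ((a << 1) | msb_q) + m    #   A = A + M when A is negative
        else:
            a = ((a << 1) | msb_q) - m    #   A = A - M otherwise
        if a < 0:                         # Steps 4-5: set Q[0] from sign of A
            q &= ~1                       #   Q[0] = 0
        else:
            q |= 1                        #   Q[0] = 1
    if a < 0:                             # Step 8: final correction
        a += m
    return q, a                           # Step 9: quotient, remainder

print(non_restoring_divide(11, 3, 4))  # (3, 2)
```

Note that there is no restore inside the loop; a negative partial remainder is corrected on the next iteration by adding M instead of subtracting it.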
