Floating Point

• An IEEE floating point representation consists of

– A Sign Bit (no surprise)
– An Exponent (“times 2 to the what?”)
– Mantissa (“Significand”), which is assumed to be 1.xxxxx (thus, one
bit of the mantissa is implied as 1)
– This is called a normalized representation
• So a mantissa = 0 really is interpreted to be 1.0, and a
mantissa of all 1111 is interpreted to be 1.1111
• Special cases are used to represent denormalized
mantissas (true mantissa = 0), NaN, etc., as will be
discussed later
Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
– Portability issues for scientific code
• Now almost universally adopted
• Two representations
– Single precision (32-bit)
– Double precision (64-bit)
IEEE Floating-Point Format
Field:   S      Exponent  Fraction
single:  1 bit  8 bits    23 bits
double:  1 bit  11 bits   52 bits

x = (–1)^S × (1 + Fraction) × 2^(Exponent – Bias)

• S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)

• Normalize significand: 1.0 ≤ |significand| < 2.0
– Always has a leading pre-binary-point 1 bit, so no need to represent it
explicitly (hidden bit)
– Significand is Fraction with the “1.” restored
• Exponent: excess representation: actual exponent
+ Bias
– Ensures exponent is unsigned
– Single: Bias = 127; Double: Bias = 1023
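As a rough illustration of the format above, the following Python sketch pulls the three fields out of a single-precision value and rebuilds the number from the formula. The helper names decode_single and value_from_fields are made up for this example, and the rebuild only covers normalized numbers.

```python
import struct

def decode_single(x):
    """Return (S, Exponent, Fraction) for x rounded to IEEE single precision."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, still biased
    fraction = bits & 0x7FFFFF        # 23 bits, the hidden "1." is not stored
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction, bias=127):
    """x = (-1)^S * (1 + Fraction) * 2^(Exponent - Bias); normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - bias)

fields = decode_single(-0.75)
print(fields)                      # sign, biased exponent, raw fraction bits
print(value_from_fields(*fields))  # rebuilt value
```

For double precision the same idea applies with an 11-bit exponent, a 52-bit fraction, and a bias of 1023.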
Single-Precision Range
• Exponents 00000000 and 11111111 reserved
• Smallest value
– Exponent: 00000001
⇒ actual exponent = 1 – 127 = –126
– Fraction: 000…00 ⇒ significand = 1.0
– ±1.0 × 2^–126 ≈ ±1.2 × 10^–38
• Largest value
– Exponent: 11111110
⇒ actual exponent = 254 – 127 = +127
– Fraction: 111…11 ⇒ significand ≈ 2.0
– ±2.0 × 2^+127 ≈ ±3.4 × 10^+38
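The two extremes can be checked by building the bit patterns directly. This is only a sketch; single_from_bits is a helper name invented here.

```python
import struct

def single_from_bits(bits):
    """Interpret a 32-bit unsigned integer as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest = single_from_bits(0x00800000)  # 0 00000001 000...00
largest = single_from_bits(0x7F7FFFFF)   # 0 11111110 111...11

print(smallest, smallest == 2.0 ** -126)
print(largest, largest == (2 - 2.0 ** -23) * 2.0 ** 127)
```

The largest significand is 2 – 2^–23, which the slide rounds to 2.0.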
Double-Precision Range
• Exponents 0000…00 and 1111…11 reserved
• Smallest value
– Exponent: 00000000001
⇒ actual exponent = 1 – 1023 = –1022
– Fraction: 000…00 ⇒ significand = 1.0
– ±1.0 × 2^–1022 ≈ ±2.2 × 10^–308
• Largest value
– Exponent: 11111111110
⇒ actual exponent = 2046 – 1023 = +1023
– Fraction: 111…11 ⇒ significand ≈ 2.0
– ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308
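Assuming the Python interpreter stores float as an IEEE double (true for CPython on common platforms), sys.float_info reports the same limits, so the slide's numbers can be compared against it:

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normalized double
largest = sys.float_info.max           # largest finite double

print(smallest_normal, smallest_normal == 2.0 ** -1022)
print(largest, largest == (2 - 2.0 ** -52) * 2.0 ** 1023)
```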
Representation of Floating Point
• IEEE 754 single precision

Bit 31: Sign | Bits 30–23: Biased exponent | Bits 22–0: Normalized Mantissa (implicit 24th bit = 1)

Value of an FP number: (-1)^s × F × 2^(E-127)

Exponent   Mantissa   Object Represented
0          0          0
0          non-zero   denormalized
1-254      anything   FP number
255        0          ± infinity
255        non-zero   NaN
Why biased exponent?
• For faster comparisons (for sorting, etc.), allow integer
comparisons of floating point numbers:

• Unbiased exponent:

1/2 0 1111 1111 000 0000 0000 0000 0000 0000

2 0 0000 0001 000 0000 0000 0000 0000 0000

• Biased exponent:

1/2 0 0111 1110 000 0000 0000 0000 0000 0000

2 0 1000 0000 000 0000 0000 0000 0000 0000
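A small sketch of the comparison argument: with the biased encoding, the raw 32-bit patterns of 1/2 and 2 sort in the same order as the values, while the two's-complement exponent patterns shown above do not. The helper single_bits is a name chosen for this example.

```python
import struct

def single_bits(x):
    """Raw 32-bit pattern of x as an unsigned integer (IEEE single precision)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

half, two = single_bits(0.5), single_bits(2.0)
print(f"{half:032b}")
print(f"{two:032b}")
print(half < two)   # True: integer order matches numeric order

# Hypothetical encoding with a two's-complement exponent field (not IEEE):
unbiased_half = int("0" + "11111111" + "0" * 23, 2)   # exponent -1
unbiased_two = int("0" + "00000001" + "0" * 23, 2)    # exponent +1
print(unbiased_half < unbiased_two)   # False: integer order is wrong
```

This ordering property holds directly for non-negative values; negative values are stored in sign-magnitude form and need the sign handled separately.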
Basic Technique

• Represent the decimal in the form +/- 1.xxx_b x 2^y

• And “fill in the fields”
– Remember biased exponent and implicit “1.” mantissa!
• Examples:
– 0.0: 0 00000000 00000000000000000000000
– 1.0 (1.0 x 2^0): 0 01111111 00000000000000000000000
– 0.5 (0.1 binary = 1.0 x 2^-1): 0 01111110 00000000000000000000000
– 0.75 (0.11 binary = 1.1 x 2^-1): 0 01111110 10000000000000000000000
– 3.0 (11 binary = 1.1*2^1): 0 10000000 10000000000000000000000
– -0.375 (-0.011 binary = -1.1*2^-2): 1 01111101 10000000000000000000000
– 1 10000011 01000000000000000000000 = - 1.01 * 2^4 = -20.0
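The "fill in the fields" step can be sketched as a tiny function that takes the sign, the exponent y, and the bits after the binary point of 1.xxx. The name fill_fields and its argument layout are assumptions made for this illustration; it does not handle 0.0, denormals, infinities, or NaN.

```python
def fill_fields(sign, y, fraction_bits):
    """Build 'S EEEEEEEE FFF...F' for +/- 1.<fraction_bits> x 2^y (single precision)."""
    exponent_field = format(y + 127, "08b")       # biased exponent, 8 bits
    fraction_field = fraction_bits.ljust(23, "0")  # pad the fraction to 23 bits
    return f"{sign} {exponent_field} {fraction_field}"

print(fill_fields(0, -1, "1"))    # 0.75  = 1.1 x 2^-1
print(fill_fields(0, 1, "1"))     # 3.0   = 1.1 x 2^1
print(fill_fields(1, 4, "01"))    # -20.0 = -1.01 x 2^4
```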
Basic Technique

• One can compute the mantissa much as one would
convert decimal whole numbers to binary.
• Take the decimal and repeatedly multiply the fractional
component by 2. The whole number portion is the next binary
digit.
• For whole numbers, append the binary whole number to the
mantissa and shift the exponent until the mantissa is in
normalized form.
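The repeated-multiplication procedure can be sketched as follows. The function name fraction_to_binary and the max_bits cutoff are choices made for this example; the cutoff is there because many decimal fractions (0.1, for instance) never terminate in binary.

```python
def fraction_to_binary(frac, max_bits=23):
    """Binary digits of a fraction 0 <= frac < 1, by repeated multiplication by 2."""
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        whole = int(frac)        # the whole-number portion is the next binary digit
        bits.append(str(whole))
        frac -= whole            # keep only the fractional component
    return "".join(bits)

whole_part = bin(36)[2:]                  # whole numbers convert the usual way
frac_part = fraction_to_binary(0.5625)
print(whole_part + "." + frac_part)       # 36.5625 in binary, before normalizing
```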
Floating-Point Example
• Represent –0.75
– –0.75 = (–1)^1 × 1.1_2 × 2^–1
– Fraction = 1000…00_2
– Exponent = –1 + Bias
• Single: –1 + 127 = 126 = 01111110_2
• Double: –1 + 1023 = 1022 = 01111111110_2
• Single: 1 01111110 1000…00
• Double: 1 01111111110 1000…00
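Both encodings of –0.75 can be printed with Python's struct module as a cross-check; this is an illustrative snippet, not part of the original example.

```python
import struct

single = int.from_bytes(struct.pack(">f", -0.75), "big")
double = int.from_bytes(struct.pack(">d", -0.75), "big")

s = f"{single:032b}"
d = f"{double:064b}"
print(s[0], s[1:9], s[9:])     # sign, 8-bit exponent, 23-bit fraction
print(d[0], d[1:12], d[12:])   # sign, 11-bit exponent, 52-bit fraction
```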
Floating-Point Example
• What number is represented by the single-precision float
1 10000001 01000…00
– S = 1
– Fraction = 01000…00_2
– Exponent = 10000001_2 = 129
• x = (–1)^1 × (1 + 0.01_2) × 2^(129 – 127)
= (–1) × 1.25 × 2^2
= –5.0
Converting to Floating Point
• E.g., Express 36.5625_10 as a 32-bit floating
point number (in hexadecimal)
• Step 1
– Express original value in binary
36.5625_10 = 100100.1001_2

• Step 2
– Normalize
100100.1001_2 =

1.001001001_2 x 2^5
• Step 3
– Determine S, E, and M
+1.001001001_2 x 2^5, so n = 5
E = n + 127
= 5 + 127
= 132
= 10000100_2

S = 0 (because the value is positive)
M = 00100100100000000000000 (the bits after the leading "1.", padded to 23 bits)

• Step 4
– Put S, E, and M together to form 32-bit binary
0 10000100 00100100100000000000000_2
• Step 5
– Express in hexadecimal

0 10000100 00100100100000000000000_2 =

0100 0010 0001 0010 0100 0000 0000 0000_2 =

4 2 1 2 4 0 0 0_16

Answer: 42124000_16
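The five steps can be followed in code. The sketch below uses a made-up helper, to_single_hex, that normalizes the value, computes S, E, and M, and packs them into hexadecimal. It assumes a nonzero value that fits a normalized single exactly (it truncates rather than rounds) and does not handle overflow, denormals, infinities, or NaN.

```python
import struct
from fractions import Fraction

def to_single_hex(value):
    """Encode a nonzero, exactly representable value as 8 hex digits."""
    if value == 0:
        raise ValueError("zero is a special case and is not handled here")
    s = 0 if value > 0 else 1                # sign bit
    m = Fraction(abs(value))                 # exact magnitude
    n = 0
    while m >= 2:                            # normalize to 1.xxx x 2^n
        m /= 2
        n += 1
    while m < 1:
        m *= 2
        n -= 1
    e = n + 127                              # biased exponent
    frac = int((m - 1) * 2**23)              # drop the implied "1.", keep 23 bits
    bits = (s << 31) | (e << 23) | frac      # put S, E, and M together
    return format(bits, "08X")

print(to_single_hex(36.5625))
print(struct.pack(">f", 36.5625).hex().upper())   # cross-check with struct
```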
Converting from Floating Point
• E.g., What decimal value is represented by the
following 32-bit floating point number?
• Step 1
– Express in binary and find S, E, and M
C17B0000_16 =

1 10000010 11110110000000000000000_2

S = 1 (1 = negative, 0 = positive)
• Step 2
– Find “real” exponent, n
– n = E – 127
= 10000010_2 – 127
= 130 – 127
= 3
• Step 3
– Put S, M, and n together to form binary result
– (Don’t forget the implied “1.” on the left of the
mantissa.)
-1.1111011_2 x 2^n =
-1.1111011_2 x 2^3 =
-1111.1011_2

• Step 4
– Express result in decimal
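-1111.1011_2: the integer part is 1111_2 = 15; the fraction part is .1011_2:
2^-1 = 0.5
2^-3 = 0.125
2^-4 = 0.0625
0.5 + 0.125 + 0.0625 = 0.6875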

Answer: -15.6875
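The same decoding can be traced in a few lines of Python and compared with what struct reports. The variable names mirror S, E, M, and n from the steps; the snippet is only a check on the worked example.

```python
import struct

bits = int("C17B0000", 16)
s = bits >> 31              # Step 1: sign
e = (bits >> 23) & 0xFF     #         biased exponent
m = bits & 0x7FFFFF         #         23 stored mantissa bits
n = e - 127                 # Step 2: "real" exponent
value = (-1) ** s * (1 + m / 2**23) * 2.0 ** n   # Steps 3-4, implied "1." restored

print(s, e, n, value)
print(struct.unpack(">f", bytes.fromhex("C17B0000"))[0])   # cross-check
```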
Denormal Numbers
• Exponent = 000...0 ⇒ hidden bit is 0

x = (–1)^S × (0 + Fraction) × 2^(1 – Bias)

• Smaller than normal numbers
– allow for gradual underflow, with diminishing
precision
• Denormal with fraction = 000...0

x = (–1)^S × (0 + 0) × 2^(1 – Bias) = ±0.0

Two representations
of 0.0!
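A short sketch of both points for single precision: the smallest denormal sits far below the smallest normal number, and the two zeros compare equal even though their sign bits differ. single_from_bits is a helper name invented for this example.

```python
import math
import struct

def single_from_bits(bits):
    """Interpret a 32-bit unsigned integer as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

tiny = single_from_bits(0x00000001)          # exponent 000...0, fraction 000...01
smallest_normal = single_from_bits(0x00800000)
print(tiny == 2.0 ** -149)                   # 2^-23 x 2^-126
print(0.0 < tiny < smallest_normal)

pos_zero = single_from_bits(0x00000000)
neg_zero = single_from_bits(0x80000000)
print(pos_zero == neg_zero)                  # the two zeros compare equal
print(math.copysign(1.0, neg_zero))          # but the sign bit is still there
```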
Infinities and NaNs
• Exponent = 111...1, Fraction = 000...0
– ±Infinity
– Can be used in subsequent calculations, avoiding
need for overflow check
• Exponent = 111...1, Fraction ≠ 000...0
– Not-a-Number (NaN)
– Indicates illegal or undefined result
• e.g., 0.0 / 0.0
– Can be used in subsequent calculations
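The special patterns can be built and used in arithmetic as sketched below. One Python-specific caveat: 0.0 / 0.0 raises ZeroDivisionError in Python instead of returning NaN, so the example produces a NaN with infinity minus infinity. single_from_bits is again a made-up helper.

```python
import math
import struct

def single_from_bits(bits):
    """Interpret a 32-bit unsigned integer as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

inf = single_from_bits(0x7F800000)   # exponent 111...1, fraction 000...0
nan = single_from_bits(0x7FC00000)   # exponent 111...1, fraction non-zero

print(inf, inf + 1.0 == inf)         # infinity survives further arithmetic
print(math.isnan(nan), nan == nan)   # a NaN is not equal to anything, itself included
print(math.isnan(inf - inf))         # an undefined result yields NaN
print(math.isnan(nan * 0.0))         # NaN propagates through later calculations
```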
Representation of Floating Point
• IEEE 754 double precision
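Bit 31: Sign | Bits 30–20: Biased exponent | Bits 19–0: Normalized Mantissa (implicit 53rd bit)
(first 32-bit word shown; the remaining 32 mantissa bits follow in a second word)

Value of an FP number: (-1)^s × F × 2^(E-1023)

Exponent   Mantissa   Object Represented
0          0          0
0          non-zero   denormalized
1-2046     anything   FP number
2047       0          ± infinity
2047       non-zero   NaN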
Is FP addition associative?
• Associativity law for addition: a + (b + c) = (a + b) + c

• Let a = – 2.7 x 10^23, b = 2.7 x 10^23, and c = 1.0

• a + (b + c) = – 2.7 x 10^23 + ( 2.7 x 10^23 + 1.0 ) = – 2.7 x 10^23 + 2.7 x 10^23 = 0.0

• (a + b) + c = ( – 2.7 x 10^23 + 2.7 x 10^23 ) + 1.0 = 0.0 + 1.0 = 1.0

• Beware – Floating Point addition is not associative!

• The result is approximate…
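The example can be reproduced directly, assuming Python floats are IEEE doubles: adding 1.0 to 2.7 x 10^23 is lost to rounding, so the grouping changes the answer.

```python
a, b, c = -2.7e23, 2.7e23, 1.0

left = a + (b + c)    # b + c rounds back to b, so the 1.0 disappears
right = (a + b) + c   # a + b is exactly 0.0, so the 1.0 survives

print(left, right, left == right)
```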

Floating point addition
Start

1. Compare the exponents of the two numbers.
Shift the smaller number to the right until its
exponent would match the larger exponent

2. Add the significands

3. Normalize the sum, either shifting right and
incrementing the exponent or shifting left
and decrementing the exponent

Overflow or underflow? Yes: Exception. No: continue with step 4

4. Round the significand to the appropriate
number of bits

Still normalized? No: go back to step 3. Yes: Done
Floating-Point Addition
• Now consider a 4-digit binary example
– 1.000_2 × 2^–1 + –1.110_2 × 2^–2 (0.5 + –0.4375)
• 1. Align binary points
– Shift number with smaller exponent
– 1.000_2 × 2^–1 + –0.111_2 × 2^–1
• 2. Add significands
– 1.000_2 × 2^–1 + –0.111_2 × 2^–1 = 0.001_2 × 2^–1
• 3. Normalize result & check for over/underflow
– 1.000_2 × 2^–4, with no over/underflow
• 4. Round and renormalize if necessary
– 1.000_2 × 2^–4 (no change) = 0.0625
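The steps of this example can be mimicked with integer arithmetic. The sketch below is a deliberately simplified model, not how real hardware does it: significands are signed integers in units of 2^–3 (so 1.000_2 is 8 and –1.110_2 is –14), bits shifted out during alignment are simply dropped (no guard or rounding bits, so step 4 is a no-op), and overflow/underflow is not checked. The names fp_add and to_float are made up for this illustration.

```python
def fp_add(sig_a, exp_a, sig_b, exp_b, frac_bits=3):
    """Add two values sig x 2^-frac_bits x 2^exp; return a normalized (sig, exp)."""
    one = 1 << frac_bits                       # integer form of 1.000_2
    # 1. Align binary points: shift the number with the smaller exponent right.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    mag_b = abs(sig_b) >> (exp_a - exp_b)      # truncates any bits shifted out
    sig_b = mag_b if sig_b >= 0 else -mag_b
    # 2. Add significands.
    total, exp = sig_a + sig_b, exp_a
    if total == 0:
        return 0, 0
    # 3. Normalize: shift until the magnitude is 1.xxx_2 again.
    sign = 1 if total > 0 else -1
    mag = abs(total)
    while mag >= 2 * one:
        mag >>= 1
        exp += 1
    while mag < one:
        mag <<= 1
        exp -= 1
    return sign * mag, exp

def to_float(sig, exp, frac_bits=3):
    """Value of the (sig, exp) pair as an ordinary Python float."""
    return sig / (1 << frac_bits) * 2.0 ** exp

result = fp_add(8, -1, -14, -2)    # 1.000_2 x 2^-1 + -1.110_2 x 2^-2
print(result, to_float(*result))
```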
FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
– Much longer than integer operations
• FP adder usually takes several cycles
– Can be pipelined
Floating Point Multiplication Algorithm
Floating-Point Multiplication
• Now consider a 4-digit binary example
– 1.000_2 × 2^–1 × –1.110_2 × 2^–2 (0.5 × –0.4375)
• 1. Add exponents
– Unbiased: –1 + –2 = –3
– Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127
• 2. Multiply significands
– 1.000_2 × 1.110_2 = 1.110_2 ⇒ 1.110_2 × 2^–3
• 3. Normalize result & check for over/underflow
– 1.110_2 × 2^–3 (no change) with no over/underflow
• 4. Round and renormalize if necessary
– 1.110_2 × 2^–3 (no change)
• 5. Determine sign: +ve × –ve ⇒ –ve
– –1.110_2 × 2^–3 = –0.21875
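A matching integer sketch of the multiplication steps follows, under the same simplifying assumptions as a toy model: significands are signed integers in units of 2^–3 (1.000_2 is 8, –1.110_2 is –14), extra product bits are truncated rather than rounded, and overflow/underflow is not checked. The function name fp_mul is hypothetical.

```python
def fp_mul(sig_a, exp_a, sig_b, exp_b, frac_bits=3):
    """Multiply two values sig x 2^-frac_bits x 2^exp; return a normalized (sig, exp)."""
    one = 1 << frac_bits
    # 1. Add exponents (unbiased here; with biased fields, subtract one bias).
    exp = exp_a + exp_b
    # 2. Multiply significands, then drop the extra fraction bits (truncation).
    mag = (abs(sig_a) * abs(sig_b)) >> frac_bits
    # 3. Normalize: a product of two 1.xxx values is below 4, so one shift suffices.
    if mag >= 2 * one:
        mag >>= 1
        exp += 1
    # 4. Rounding is skipped in this sketch.
    # 5. Determine sign: negative when exactly one operand is negative.
    sign = -1 if (sig_a < 0) != (sig_b < 0) else 1
    return sign * mag, exp

sig, exp = fp_mul(8, -1, -14, -2)     # 1.000_2 x 2^-1 times -1.110_2 x 2^-2
print(sig, exp, sig / 8 * 2.0 ** exp)

biased = (-1 + 127) + (-2 + 127) - 127
print(biased, biased - 127)           # biased exponent field and its actual value
```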

