Slide 10.10: Floating-point addition

Floating-Point Addition

1.111₂×2^-1 + 1.011₂×2^-3 ≈ 1.001₂×2^0

The following steps show how to add the above numbers, written in binary scientific notation. For simplicity, we assume 4 bits of precision (1 integer bit plus 3 bits of fraction).
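As a sanity check on the example, the operands and the rounded result can be evaluated exactly. This is a minimal Python sketch; the helper name `value` is made up for illustration and is not part of the slide.

```python
from fractions import Fraction


def value(significand_bits: str, exponent: int) -> Fraction:
    """Exact value of a binary significand such as '1.111' times 2**exponent."""
    whole, _, frac = significand_bits.partition(".")
    # Read all digits as one binary integer, then rescale by the fraction width.
    mantissa = Fraction(int(whole + frac, 2), 2 ** len(frac))
    return mantissa * Fraction(2) ** exponent


a = value("1.111", -1)         # 15/16
b = value("1.011", -3)         # 11/64
exact = a + b                  # 71/64, which is 1.000111 in binary
rounded = value("1.001", 0)    # 9/8, the 4-bit result

print(exact, rounded, rounded - exact)
```

The exact sum needs 7 significant bits, so the 4-bit result differs from it by 1/64. That difference is the rounding error introduced in Step 4.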

Step 1. Making Exponents Equal

We cannot add the significands directly because their exponents are not equal. To make the exponents equal, shift the significand of the number with the smaller exponent right, incrementing its exponent by 1 for each bit shifted, until its exponent matches that of the larger number:

   1.011₂×2^-3 = 0.1011₂×2^-2 = 0.01011₂×2^-1
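The alignment above can be sketched in Python. This sketch assumes a significand stored as an integer `sig` with `frac_bits` bits after the binary point; the names `to_binary` and `shift_right_once` are hypothetical. It keeps every shifted-out bit, as the slide does, whereas real hardware keeps only a few extra bits.

```python
def to_binary(sig: int, frac_bits: int) -> str:
    """Format sig / 2**frac_bits as a binary string (frac_bits must be > 0)."""
    bits = format(sig, "b").zfill(frac_bits + 1)
    return bits[:-frac_bits] + "." + bits[-frac_bits:]


def shift_right_once(sig: int, frac_bits: int, exp: int):
    """Move the binary point one place left and add 1 to the exponent.

    Widening frac_bits instead of dropping a bit keeps the value unchanged.
    """
    return sig, frac_bits + 1, exp + 1


sig, frac_bits, exp = 0b1011, 3, -3      # 1.011 x 2^-3
target_exp = -1                          # exponent of the larger operand
steps = [(to_binary(sig, frac_bits), exp)]
while exp < target_exp:
    sig, frac_bits, exp = shift_right_once(sig, frac_bits, exp)
    steps.append((to_binary(sig, frac_bits), exp))

for text, e in steps:
    print(f"{text} x 2^{e}")
```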

Step 2. Adding the Significands

Add the significands as shown below. The result of the addition is:

   1.111₂×2^-1 + 0.01011₂×2^-1 = 10.00111₂×2^-1

     1.111
  +  0.01011
  ———————————
    10.00111
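The column addition above can be reproduced with integer arithmetic once both significands use the same number of fraction bits. This is a sketch under the same assumed representation (integer significand plus a fraction-bit count); `add_aligned` and `to_binary` are illustrative names.

```python
def to_binary(sig: int, frac_bits: int) -> str:
    """Format sig / 2**frac_bits as a binary string (frac_bits must be > 0)."""
    bits = format(sig, "b").zfill(frac_bits + 1)
    return bits[:-frac_bits] + "." + bits[-frac_bits:]


def add_aligned(sig_a: int, frac_a: int, sig_b: int, frac_b: int):
    """Add two significands that already share the same exponent."""
    frac_bits = max(frac_a, frac_b)
    # Pad the shorter fraction with zeros on the right so the points line up.
    total = (sig_a << (frac_bits - frac_a)) + (sig_b << (frac_bits - frac_b))
    return total, frac_bits


# 1.111 (3 fraction bits) + 0.01011 (5 fraction bits), both times 2^-1
total, frac_bits = add_aligned(0b1111, 3, 0b001011, 5)
print(to_binary(total, frac_bits))
```

The sum has two bits to the left of the binary point, which is why it has to be normalized afterwards.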

Step 3. Normalizing the Sum

The result 10.00111₂×2^-1 needs to be normalized as follows:

   10.00111₂×2^-1 = 1.000111₂×2^0

Shifting the significand right by 1 bit must be accompanied by incrementing the exponent by 1, so that the value stays the same.
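Normalization can be sketched as a loop that shifts until exactly one 1 bit sits left of the binary point. The function below is illustrative and assumes an integer significand with `frac_bits` fraction bits. It also handles a sum that is too small (a leading zero), a case that does not arise in this example.

```python
def normalize(sig: int, frac_bits: int, exp: int):
    """Return (sig, frac_bits, exp) with the significand in the range [1, 2)."""
    if sig == 0:
        raise ValueError("zero has no normalized form")
    # Significand >= 2: move the point left, increment the exponent.
    while (sig >> frac_bits) >= 2:
        frac_bits += 1
        exp += 1
    # Significand < 1: shift left, decrement the exponent.
    while (sig >> frac_bits) == 0:
        sig <<= 1
        exp -= 1
    return sig, frac_bits, exp


# 10.00111 x 2^-1 becomes 1.000111 x 2^0
print(normalize(0b1000111, 5, -1))
```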

Step 4. Rounding the Significand

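Round the significand to fit in the available number of bits. We assumed 4 bits of precision (3 bits of fraction), so round the significand to the nearest value with 3 fraction bits. The discarded bits (111) are more than half of the last retained place, so the significand rounds up: 1.000111₂×2^0 ≈ 1.001₂×2^0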
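Here is a sketch of rounding to nearest in Python. The slide does not say how ties are broken; this sketch assumes ties go to the even significand, which is the IEEE 754 default. The function name is hypothetical.

```python
def round_significand(sig: int, frac_bits: int, keep_bits: int):
    """Round sig / 2**frac_bits to keep_bits fraction bits (nearest, ties to even)."""
    drop = frac_bits - keep_bits
    if drop <= 0:
        # Nothing to discard; pad with zeros if the target is wider.
        return sig << (-drop), keep_bits
    kept = sig >> drop                      # bits that stay
    remainder = sig & ((1 << drop) - 1)     # bits that are discarded
    half = 1 << (drop - 1)                  # exactly half of the last kept place
    if remainder > half or (remainder == half and kept & 1):
        kept += 1                           # round up
    return kept, keep_bits


# 1.000111 with 6 fraction bits, rounded to 3 fraction bits
print(round_significand(0b1000111, 6, 3))   # the kept value is 0b1001, i.e. 1.001
```

Rounding up can carry into a new leading bit (for example 1.111111 rounds to 10.000), in which case the result must be normalized again.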

Step 5. Checking for Overflow or Underflow

Check whether the exponent has become too large (overflow) or too small (underflow) to be represented.
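The range check depends on the storage format, which the slide does not specify. The sketch below assumes the normalized exponent range of IEEE 754 single precision (-126 to 127); the constant and function names are illustrative.

```python
# Assumption: IEEE 754 single-precision range for normalized numbers.
E_MIN, E_MAX = -126, 127


def classify_exponent(exp: int, e_min: int = E_MIN, e_max: int = E_MAX) -> str:
    """Report whether an exponent fits the assumed format."""
    if exp > e_max:
        return "overflow"
    if exp < e_min:
        return "underflow"
    return "ok"


print(classify_exponent(0))   # exponent of the example result 1.001 x 2^0
```

Under this assumption the example's exponent of 0 is well inside the range, so neither condition occurs.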
