• Floating-point arithmetic is an approximation of real arithmetic
  • It can represent fractional values.
  • Floating-point numbers provide only an approximation, because there is an infinite number of possible real values, while floating-point representation uses a finite number of bits.
  • When a floating-point format cannot represent a real value exactly, it chooses the closest value it can represent.
  • Fixed-point formats allow fractional representation at the expense of range. Floating-point solves this with dynamic range.
  • Disadvantages of floating-point – most values cannot be represented exactly; some encodings allow multiple representations of the same number, wasting bit patterns and reducing the effective range; and it complicates arithmetic. Comparing inexact results with equality tests leads to errors; compare against an error tolerance instead.
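Because results are inexact, testing floats for exact equality is unreliable. A minimal sketch in Python (the `approx_equal` helper and the `1e-9` tolerance are illustrative choices, not a standard API):

```python
# 0.1 and 0.2 have no exact binary representation, so their sum
# is only close to 0.3, not equal to it.
a = 0.1 + 0.2
print(a == 0.3)          # False: the values differ in the last few bits

def approx_equal(x, y, tol=1e-9):
    """Compare within a relative tolerance instead of exact equality."""
    return abs(x - y) <= tol * max(abs(x), abs(y))

print(approx_equal(a, 0.3))  # True
```

The standard library offers `math.isclose()` for the same purpose.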

Components of a Floating Point Representation

  1. Sign bit – indicates whether the value is positive or negative.
  2. Mantissa (significand) – the base value, which usually falls within a limited range such as 0 to 1. The length of the mantissa determines the precision to which numbers can be represented. The mantissa also implies the position of the radix point.
  3. Exponent – scales the mantissa by a power of the radix, providing the dynamic range.
  4. Guard digits (guard bits) – extra digits carried during intermediate calculations that greatly enhance accuracy.
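The fields above can be pulled apart with a few bit operations. A sketch that decodes an IEEE single-precision value via Python's `struct` module (the `decode_single` helper is illustrative):

```python
import struct

def decode_single(x):
    """Split an IEEE single-precision value into sign, exponent, mantissa."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign     = bits >> 31              # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 stored bits (HO bit is implied)
    return sign, exponent, mantissa

print(decode_single(1.0))   # (0, 127, 0): biased exponent 127 means 2**0
```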

Approximation Techniques

  • Truncation
  • Rounding up
  • Rounding down
  • Rounding to nearest
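A sketch of the four techniques applied when shortening a value to a fixed number of fractional decimal digits (the `shorten` helper and mode names are illustrative; note that IEEE hardware rounds ties to even, while this sketch rounds halves up for simplicity):

```python
import math

def shorten(value, digits, mode):
    """Reduce value to `digits` fractional decimal digits using one technique."""
    scale = 10 ** digits
    scaled = value * scale
    if mode == 'truncate':           # drop the excess digits (toward zero)
        result = math.trunc(scaled)
    elif mode == 'round_up':         # toward +infinity (ceiling)
        result = math.ceil(scaled)
    elif mode == 'round_down':       # toward -infinity (floor)
        result = math.floor(scaled)
    elif mode == 'round_nearest':    # ties go up here, not to even
        result = math.floor(scaled + 0.5)
    return result / scale

for mode in ('truncate', 'round_up', 'round_down', 'round_nearest'):
    print(mode, shorten(3.14159, 2, mode))
```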

IEEE Floating-Point Formats

Single-Precision Floating-Point Format

  • 24-bit mantissa
  • 8-bit exponent
  • The HO bit of the mantissa is always assumed to be one, so for normalized values the mantissa is between 1.0 and just less than 2.0
  • The mantissa uses a sign-magnitude representation (a separate sign bit plus an unsigned mantissa), not two’s complement
  • With a 24-bit mantissa, you get approximately 6 1/2 decimal digits of precision
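A sketch of that precision limit: forcing a value through single precision with `struct` keeps only about the first seven significant digits (the `to_single` helper is illustrative):

```python
import struct

def to_single(x):
    """Round-trip a Python float (a double) through IEEE single precision."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

pi = 3.14159265358979
print(to_single(pi))   # roughly 3.1415927...; later digits are lost
```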

Double-Precision Floating-Point Format

  • 11-bit exponent
  • 52-bit mantissa
  • Approximately 15 decimal digits of precision (53 effective mantissa bits, counting the implied HO bit)
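Python's built-in `float` is this double-precision format, and `sys.float_info` exposes its parameters directly:

```python
import sys

# mant_dig counts the implied HO bit, so it reports 53, not 52.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.dig)        # 15 guaranteed decimal digits
print(sys.float_info.epsilon)    # gap between 1.0 and the next double
```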

Extended-Precision Floating-Point Format

  • Intel’s x87 FPU provides an extended-precision 80-bit floating-point format
  • 15-bit exponent
  • 64-bit mantissa (unlike the 32- and 64-bit formats, the HO bit is stored explicitly rather than implied)