• Floating-point arithmetic is an approximation of real arithmetic
• It can represent fractional values.
• Floating-point numbers provide only an approximation, because there is an infinite number of possible real values, while floating-point representation uses a finite number of bits.
• When a floating-point format cannot represent a real value exactly, it chooses the closest value it can represent.
• Fixed-point formats allow fractional representation at the expense of range; floating-point solves this with a dynamic range.
• Disadvantages of floating-point: most values cannot be represented exactly; some encodings are redundant (multiple representations of the same number), which reduces the usable range; and arithmetic is more complicated. Comparing values of limited precision can produce wrong results, so comparisons should allow for an error tolerance.
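The tolerance-based comparison mentioned above can be sketched in Python with the standard library's `math.isclose`:

```python
import math

# 0.1, 0.2, and 0.3 are all approximations, so the sum does not
# compare equal to the literal 0.3.
a = 0.1 + 0.2
print(a == 0.3)              # False: exact comparison of approximations
print(math.isclose(a, 0.3))  # True: comparison within a relative tolerance
```

`math.isclose` defaults to a relative tolerance of 1e-09; an absolute tolerance (`abs_tol`) is needed when comparing against values near zero.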

## Components of a Floating-Point Representation

1. Sign bit
2. Mantissa, significand – the base value, usually normalized to a limited range such as 1.0 to just under 2.0. The length of the mantissa determines the precision to which numbers can be represented. The format also implies the radix-point position.
3. Exponent
4. Guard digits, guard bits – extra digits carried during a calculation to greatly enhance the accuracy of the result.
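As an illustration of the first three components, this Python sketch (using the standard `struct` module) pulls the sign, exponent, and mantissa fields out of an IEEE single-precision value; the field widths assumed here are those of the 32-bit format described below:

```python
import struct

def decompose(x):
    """Split a value, stored as a 32-bit float, into its three fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31            # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 stored bits; the HO 1 is implicit
    return sign, exponent, mantissa

print(decompose(1.0))    # (0, 127, 0): 1.0 = +1.0 * 2**(127-127)
print(decompose(-2.5))   # (1, 128, 2097152): -2.5 = -1.25 * 2**(128-127)
```

The mantissa field of `-2.5` is 2097152 (0x200000) because the fractional part 0.25 of 1.25 scales to 0.25 × 2²³.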

## Approximation Techniques

• Truncation
• Rounding up
• Rounding down
• Rounding to nearest
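The four techniques can be demonstrated with Python's standard rounding functions (`math.trunc`, `math.floor`, `math.ceil`, and the built-in `round`, which rounds to nearest):

```python
import math

x = 2.675
print(math.trunc(x))    # 2  truncation: drop the fractional part
print(math.floor(x))    # 2  rounding down (toward negative infinity)
print(math.ceil(x))     # 3  rounding up (toward positive infinity)
print(round(x))         # 3  rounding to nearest

# Python's round() breaks ties by rounding to the nearest even value:
print(round(2.5))       # 2
print(round(3.5))       # 4

# Approximation error interacts with rounding: 2.675 is stored as a value
# slightly below 2.675, so rounding to two places gives 2.67, not 2.68.
print(round(2.675, 2))  # 2.67
```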

## IEEE Floating-Point Formats

### Single-Precision Floating-Point Format

• 24-bit mantissa
• 8-bit exponent
• The HO bit of the mantissa is always assumed to be one, so the mantissa of a normalized value always falls between 1.0 and just less than 2.0
• The mantissa uses a sign-magnitude representation: a separate sign bit with an unsigned mantissa
• With a 24-bit mantissa, you will get approximately 6 1/2 decimal digits of precision
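A quick way to see this limit is to round-trip a value through single precision; the helper below is a Python sketch using the standard `struct` module, and shows that only about the first seven significant digits survive:

```python
import struct

def to_float32(x):
    """Round a Python double to the nearest single-precision value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

# Digits beyond roughly the seventh are lost in the 24-bit mantissa.
print(to_float32(3.14159265358979))   # 3.1415927410125732
```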

### Double-Precision Floating-Point Format

• 11-bit exponent
• 52-bit mantissa
• Approximately 14 1/2 digits of precision.
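CPython's `float` is exactly this double-precision format, so its parameters can be inspected directly through `sys.float_info`:

```python
import sys

# 53 significand bits: 52 stored plus the implicit HO bit.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.dig)        # 15: decimal digits guaranteed exact
# 0.1 is not exactly representable; extra printed digits expose the error.
print(f"{0.1:.20f}")             # 0.10000000000000000555
```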

### Extended-Precision Floating-Point Format

• Intel offers an 80-bit extended-precision floating-point format
• 15-bit exponent
• 64-bit mantissa (the HO bit is stored explicitly, unlike in the 32- and 64-bit formats)