Floating-point arithmetic is an approximation of real arithmetic

It has the ability to represent fractional values.

Floating-point numbers provide only an approximation because there are infinitely many possible real values, while a floating-point representation uses a finite number of bits.

When a floating-point format cannot represent a real value exactly, it chooses the closest value that it can represent.
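A quick way to see this "closest representable value" behavior in Python (which uses IEEE double precision for its floats): 0.1 has no exact binary representation, so the stored value is only near 0.1, and the error surfaces in comparisons.

```python
# 0.1 has no exact binary representation; Python stores the nearest
# representable double. Converting that stored double to a Decimal
# reveals the value actually held in memory.
from decimal import Decimal

print(Decimal(0.1))       # the exact value stored for "0.1"
print(0.1 + 0.2 == 0.3)   # False: both sides carry rounding error
```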

Fixed-point formats allow fractional representation at the expense of range. Floating-point solves this with dynamic range: the exponent lets the radix point move.

Disadvantages of floating-point – it cannot represent most values exactly; some numbers have multiple possible representations, which wastes encodings and reduces range; and it complicates arithmetic. Comparing values of limited precision directly can lead to errors, so comparisons need to allow for an error tolerance.
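The tolerance-based comparison mentioned above can be sketched in Python; `math.isclose` from the standard library does exactly this kind of check.

```python
import math

a = 0.1 + 0.2
b = 0.3

# Direct equality fails because both values carry rounding error.
print(a == b)                             # False

# Compare within a relative tolerance instead.
print(math.isclose(a, b, rel_tol=1e-9))   # True
```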

Components of a Floating Point Representation

Sign bit

Mantissa, significand – the base value, which usually falls within a limited range, such as between 0 and 1. The length of the mantissa determines the precision to which numbers can be represented. It also implies the position of the radix point.

Exponent

Guard digits, guard bits – extra digits used during calculation to greatly enhance accuracy.

Approximation Techniques

Truncation

Rounding up

Rounding down

Rounding to nearest
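The four approximation techniques above can be illustrated with Python's standard math functions (assuming Python 3, where built-in `round` performs round-to-nearest with ties going to the even digit):

```python
import math

x = 2.675

print(math.trunc(x))   # truncation: drop the fraction, toward zero -> 2
print(math.ceil(x))    # rounding up -> 3
print(math.floor(x))   # rounding down -> 2

# Round to nearest; exact ties go to the even result ("banker's rounding").
print(round(2.5), round(3.5))   # -> 2 4
```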

IEEE Floating-Point Formats

Single-Precision Floating-Point Format

24-bit mantissa

8-bit exponent

The HO bit of the mantissa is always assumed to be one, meaning the mantissa of a normalized value is always between 1.0 and just less than 2.0

The mantissa is stored in sign-magnitude form: the separate sign bit supplies the sign, and the mantissa itself is an unsigned magnitude (not a complement encoding)

With a 24-bit mantissa, you will get approximately 6½ decimal digits of precision
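The single-precision layout described above (sign bit, 8-bit biased exponent, 23 stored mantissa bits plus the implied HO bit) can be inspected directly; a minimal sketch using Python's `struct` module, where `decode_f32` is a helper name introduced here for illustration:

```python
import struct

def decode_f32(x: float):
    # Pack as big-endian IEEE single precision, then reinterpret the
    # same 4 bytes as an unsigned 32-bit integer to expose the fields.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 stored bits; HO bit is implied
    return sign, exponent - 127, mantissa

print(decode_f32(1.0))    # (0, 0, 0): 1.0 = +1.0 * 2**0
print(decode_f32(-6.5))   # -6.5 = -1.625 * 2**2, so sign=1, exponent=2
```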