Floating-point arithmetic is an approximation of real arithmetic.
It can represent fractional values.
Floating-point numbers provide only an approximation, because there is an infinite number of possible real values, while floating-point representation uses a finite number of bits.
When a floating-point format cannot represent a real value exactly, it chooses the closest value that it can represent.
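A small Python sketch of this "closest representable value" behavior. It assumes the interpreter's float is an IEEE 754 double, which holds on mainstream platforms; the helper name stored_value is made up for illustration.

```python
from decimal import Decimal
from fractions import Fraction


def stored_value(x: float) -> Decimal:
    """Return the exact decimal expansion of the binary value stored for x."""
    return Decimal(x)


# 0.1 has no finite binary expansion, so the nearest representable value is stored.
print(stored_value(0.1))

# 0.5 is a power of two, so it is stored exactly.
print(stored_value(0.5))

# The gap between the stored value and the true one-tenth is tiny but not zero.
error = Fraction(0.1) - Fraction(1, 10)
print(float(error))
```

Decimal and Fraction are used here only because they convert a float without any further rounding, so they expose exactly what was stored.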
Fixed-point formats allow fractional representation at the expense of range. Floating-point solves this with dynamic range.
Disadvantages of floating-point – many values can't be represented exactly; there can be multiple representations of the same number, which wastes bit patterns and so reduces range; arithmetic is more complicated. Comparing imprecise numbers for exact equality leads to errors, so check against an error bound or tolerance instead.
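The comparison problem can be sketched in Python as follows. math.isclose is part of the standard library; nearly_equal is a hypothetical hand-rolled equivalent, and the tolerance values are illustrative assumptions rather than recommendations.

```python
import math

a = 0.1 + 0.2
b = 0.3

# Exact comparison fails because both sides carry small rounding errors.
print(a == b)

# Comparing within a relative tolerance treats them as equal.
print(math.isclose(a, b, rel_tol=1e-9))


def nearly_equal(x: float, y: float, rel_tol: float = 1e-9, abs_tol: float = 0.0) -> bool:
    """Hypothetical helper: true when x and y differ by no more than the tolerance."""
    return abs(x - y) <= max(rel_tol * max(abs(x), abs(y)), abs_tol)


print(nearly_equal(a, b))
```

A relative tolerance scales with the size of the operands; an absolute tolerance (abs_tol) is needed as well when comparing against values at or near zero.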
Components of a Floating-Point Representation
Mantissa, significand – base value that usually falls within a limited range, such as between 0 and 1. The length of the mantissa determines the precision to which numbers can be represented. It also contains the radix point position.
Guard digits, guard bits – extra digits used during calculation to greatly enhance accuracy.
Rounding to nearest
IEEE Floating-Point Formats
Single-Precision Floating-Point Format
The HO (high-order) bit of the mantissa is always assumed to be one, meaning the mantissa is usually between 1.0 and just less than 2.0.
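The mantissa uses a sign-magnitude format (sometimes loosely called one’s complement) rather than two’s complement: a separate sign bit gives the sign, and the mantissa bits hold an unsigned value.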
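A rough Python sketch of how a single-precision value splits into its fields. The layout used here (1 sign bit, 8 exponent bits with a bias of 127, 23 stored mantissa bits) comes from the IEEE 754 standard rather than from these notes, and decode_single is a hypothetical helper that handles normalized values only.

```python
import struct


def decode_single(x: float):
    """Return (sign, unbiased exponent, 24-bit mantissa) for x rounded to single precision."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent_field in (0, 255):
        raise ValueError("zero, denormal, infinity, and NaN are not handled in this sketch")
    # Restore the implied HO bit to obtain the full 24-bit mantissa.
    mantissa = (1 << 23) | fraction
    return sign, exponent_field - 127, mantissa


def mantissa_value(mantissa: int) -> float:
    """Interpret the 24-bit mantissa as a value in the range [1.0, 2.0)."""
    return mantissa / (1 << 23)


print(decode_single(1.0))
print(decode_single(-1.0))
print(decode_single(0.15625))
```

Note that 1.0 and -1.0 decode to the same exponent and mantissa and differ only in the sign bit.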
With a 24-bit mantissa, you get roughly 6 1/2 to 7 decimal digits of precision (24 bits × log10(2) ≈ 7.2 digits at best).
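The limit of a 24-bit mantissa can be observed with a short Python sketch. It assumes the platform's C float is IEEE 754 single precision, which is what the struct module's "f" format uses on mainstream systems; to_single is a hypothetical helper name.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (a double) to single precision and widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 2**24 fits in a 24-bit mantissa and survives the round trip unchanged.
print(to_single(16777216.0))

# 2**24 + 1 needs 25 bits, so it cannot be stored exactly and is rounded.
print(to_single(16777217.0))

# One-tenth only matches the double-precision value in its leading digits.
print(to_single(0.1))
```

Integers up to 2**24 are exact in single precision; past that point, consecutive integers start to collapse onto the same stored value.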