
4.6: Floating Point Numbers (FPNs): Representation and Operations


    Truncation and Representation

    Floating point numbers represent a challenge both in how we represent these numbers and in how we perform arithmetic operations on these numbers. To begin, we express a number \(x\) in base 2 as \[x=\sigma_{1}\left(\sum_{k=0}^{\infty} b_{k} 2^{-k}\right) \times 2^{\sigma_{2} \mathbb{E}},\] in which the \(b_{k}\) are binary numbers (0 or 1), \(\mathbb{E}\) is an integer, and \(\sigma_{1}\) and \(\sigma_{2}\) are signs (\(\pm 1\)). We assume that we have normalized the expansion such that \(b_{0}=1\). (In fact, we may express \(x\) in, say, base 10 rather than base 2; this in turn will lead to a different floating point format.)

    In some cases, we may only require a finite sum - a sum with a finite number of nonzero terms - to represent \(x\). For example, \(x=2\) may be expressed by the single non-zero term \(b_{0}=1\) (and \(\mathbb{E}=+1\)). However, more generally, a finite number of non-zero \(b_{k}\) will not suffice - even \(1/10\) leads to a repeating binary fraction. We thus must truncate the series to develop the floating point number (FPN) approximation of \(x\): \[x_{\mathrm{FPN}}=\sigma_{1}\left(\sum_{k=0}^{K} b_{k}^{\prime} 2^{-k}\right) \times 2^{\sigma_{2} \mathbb{E}^{\prime}}.\] Here \(b_{k}^{\prime}=b_{k}\), \(1 \leq k \leq K\) - we perform truncation of our series - and \(\mathbb{E}^{\prime}\) is the minimum of \(\mathbb{E}\) and \(\mathbb{E}_{\max }\) - we truncate the range of the exponent.
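
    For instance (an added illustration, not from the text; the exact output may vary with MATLAB version and display settings), MATLAB stores \(1/10\) only approximately, and seemingly exact decimal identities consequently fail:

    >> fprintf('%.20f\n', 0.1)   % the double closest to 1/10, printed to 20 places
    0.10000000000000000555
    >> (0.1 + 0.2) - 0.3         % not exactly zero in finite precision
    ans =
        5.5511e-17
    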

    We now represent or encode \(x_{\mathrm{FPN}}\) in terms of (a finite number of) 0’s and 1’s. Towards this end we assign one bit each to the signs \(\sigma_{1}\) and \(\sigma_{2}\); we assign \(p=K\) bits for the binary numbers \(b_{k}^{\prime}\), \(1 \leq k \leq K\), to represent the mantissa (or significand); we assign \(p_{\mathbb{E}}\) bits to represent the exponent \(\mathbb{E}\) (and hence \(\mathbb{E}_{\max }=2^{p_{\mathbb{E}}}\)). (Our choice of base 2 makes the encoding of our approximation in 0’s and 1’s particularly simple.) In the 64-bit IEEE 754 binary double (now called binary64) floating point format, \(p=52\) and \(p_{\mathbb{E}}=10\) (corresponding to \(\mathbb{E}_{\max }=1024\), and hence a largest representable magnitude of roughly \(10^{308}\)), such that in total - including the sign bits - we require \(2+52+10=64\) bits. (The storage scheme actually implemented in practice is slightly different: we need not store the leading unity bit and hence we effectively realize \(p=53\); the exponent sign \(\sigma_{2}\) is in fact represented as a shift.) There are two primary sources or types of error in the approximation of \(x\) by \(x_{\mathrm{FPN}}\): the first is FPN truncation of the mantissa to \(p\) bits; the second is FPN truncation of the exponent to \(p_{\mathbb{E}}\) bits. The former, FPN mantissa truncation, is generally rather benign given the rather large value of \(p\). However, in some cases, FPN mantissa truncation errors can be seriously amplified by arithmetic operations. The latter, FPN exponent truncation, takes the form of either overflow - decimal exponents larger than roughly 308, represented in MATLAB as plus or minus Inf - which is typically an indication of ill-posedness, or underflow - decimal exponents smaller than roughly \(-308\), represented in MATLAB as 0 - which is typically less of a concern.
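
    For example (an added illustration; display details may differ slightly on your system), MATLAB reports the largest finite double as realmax, and exponents beyond the representable range overflow to Inf or underflow to 0:

    >> realmax                   % largest finite double, about 1.8 x 10^308
    ans =
        1.7977e+308
    >> 2^1100                    % exponent too large: overflow to Inf
    ans =
        Inf
    >> 2^(-1100)                 % exponent too small: underflow to 0
    ans =
        0
    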

    We note that the word "precision" is typically reserved to indicate the number of bits or digits with which a floating point number is approximated on any particular hardware (and IEEE format); typically we focus on the mantissa. For example, 64-bit precision, or "double-precision," corresponds to 52 (or 53) binary digits of precision - roughly 16 decimal digits of precision - in the mantissa. Precision can also be characterized in terms of "machine precision" or "machine epsilon," which is essentially the (relative) magnitude of the FPN truncation error in the worst case: we can find machine epsilon from the MATLAB built-in function eps, as we will illustrate below. We will define machine epsilon more precisely, and later construct a code to find an approximation to machine epsilon, once we have understood floating point arithmetic.

    Oftentimes we will analyze a numerical scheme in hypothetical "infinite-precision" arithmetic in order to understand the errors due to numerical approximation and solution in the absence of finite-precision FPN truncation effects. But we must always bear in mind that in finite precision arithmetic additional errors will be incurred due to the amplification of FPN truncation errors by various arithmetic operations. We shortly discuss the latter in particular to identify the kinds of operations which we should, if possible, avoid.

    Finally, we remark that there are many ways in which we may choose to display a number, say, in the command window. How we display the number will not affect how the number is stored in memory or how it is approximated in various operations. The reader can do >> help format to understand the different ways to control the length of the mantissa and the form of the exponent in displayed floating point numbers. (Confusingly, format in the context of how we display a number carries a different meaning from format in the context of the (IEEE) FPN protocol.)
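
    For instance (an added illustration; the exact spacing of the output depends on the MATLAB version), the display format changes only what is printed, not what is stored:

    >> format long
    >> pi
    ans =
        3.141592653589793
    >> format short
    >> pi
    ans =
        3.1416
    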

    Arithmetic Operations

    We shall focus on addition since in fact this particular (simple) operation is the cause of most difficulties. We shall consider two numbers \(x_{1}\) and \(x_{2}\) which we wish to add: the first number has mantissa \(m_{1}\) and exponent \(\mathbb{E}_{1}\) and the second number has mantissa \(m_{2}\) and exponent \(\mathbb{E}_{2}\). We presume that \(\mathbb{E}_{1}>\mathbb{E}_{2}\) (if not, we simply re-define "first" and "second").

    First, we divide the second mantissa by \(2^{\mathbb{E}_{1}-\mathbb{E}_{2}}\) to obtain \(m_{2}^{\prime}=m_{2} 2^{-\left(\mathbb{E}_{1}-\mathbb{E}_{2}\right)}\): in this form, \(x_{2}\) now has mantissa \(m_{2}^{\prime}\) and exponent \(\mathbb{E}_{1}\). (Note this division corresponds to a shift of the mantissa: to obtain \(m_{2}^{\prime}\) we shift \(m_{2}\) by \(\mathbb{E}_{1}-\mathbb{E}_{2}\) places to the right - and pad with leading zeros.) At this stage we have lost no precision. However, in actual practice we can only retain the first \(p\) bits of \(m_{2}^{\prime}\) (since we only have \(p\) bits available for a mantissa): we denote by \(m_{2}^{\prime \prime}\) the truncation of \(m_{2}^{\prime}\) to fit within our \(p\)-bit restriction. Finally, we perform our FPN sum \(z=x_{1}+x_{2}\): \(z\) has mantissa \(m_{1}+m_{2}^{\prime \prime}\) and exponent \(\mathbb{E}_{1}\). (Our procedure here is a simplification of the actual procedure - but we retain most of the key features.)
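
    As a purely illustrative sketch of these steps - a toy model rather than the actual hardware procedure, with hypothetical values for p, m1, E1, m2, and E2 - we can mimic the alignment and truncation in MATLAB:

    % Toy sketch of FPN addition with a p-bit mantissa (illustrative values only).
    p  = 4;                                    % pretend we have only 4 mantissa bits
    m1 = 1.5;   E1 = 0;                        % x1 = m1 * 2^E1
    m2 = 1.25;  E2 = -6;                       % x2 = m2 * 2^E2, with E1 > E2
    m2_shifted   = m2 * 2^(E2 - E1);           % shift m2 right by E1 - E2 places
    m2_truncated = floor(m2_shifted*2^p)/2^p;  % retain only the first p bits
    z = (m1 + m2_truncated) * 2^E1             % here z = x1: every bit of x2 is lost
    

    With these particular values the shifted mantissa falls entirely below the p-bit window, so the truncation discards all of \(x_{2}\) - exactly the loss of precision described next.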

    We can immediately see the difficulty: as we shift \(m_{2}\) to the right we are losing \(\mathbb{E}_{1}-\mathbb{E}_{2}\) bits of precision. If the two exponents \(\mathbb{E}_{1}\) and \(\mathbb{E}_{2}\) are very different, we could lose all the significant digits in \(x_{2}\). Armed with FPN we can in fact develop a simple definition of machine epsilon: the smallest epsilon such that \(1+\) epsilon \(>1\), where of course by + we now mean finite precision FPN addition. Later we will take advantage of this definition to write a short program which computes machine epsilon; for our purposes here, we shall simply use the MATLAB built-in function eps.
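
    A minimal sketch of such a program (the text develops this more carefully later) simply halves a candidate value until adding half of it to 1 no longer changes the result:

    % Minimal sketch: halve epsilon until 1 + epsilon/2 is indistinguishable from 1.
    epsilon = 1;
    while 1 + epsilon/2 > 1
        epsilon = epsilon/2;
    end
    epsilon                        % should agree with the built-in eps
    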

    It is clear that finite-precision and infinite-precision arithmetic are different and will yield different results - the difference is commonly referred to as "round-off" error. Indeed, finite-precision arithmetic does not even honor all the usual (e.g., commutative, associative) rules. We consider the example (recall that in MATLAB operations are performed from left to right in the absence of any precedence rules):

    >> mach_eps = eps 
    mach_eps =
        2.2204e-16
    >> (mach_eps/2 + 1 + mach_eps/2 - 1)/mach_eps
    ans =
        0
    >> (mach_eps/2 + mach_eps/2 + 1 - 1)/mach_eps
    ans =
        1
    >>
    

    Clearly, in infinite precision arithmetic both expressions should evaluate to unity. However, in finite precision the order matters: in the first expression, by definition mach_eps/2 + 1 evaluates to 1; in the second expression, mach_eps/2 + mach_eps/2 adds two numbers of identical exponent - no loss in precision - which are then large enough (just!) to survive addition to 1. This anomaly is a "bug" but can also be a feature: we can sometimes order our operations to reduce the effect of round-off errors.

    But there are situations which are rather difficult to salvage. In the following example we approximate the derivative of \(\sin(x)\) at \(x = \pi/4\) by a forward first-order difference with an increasingly small increment dx:

    >> cos(pi/4)
    ans =
        0.707106781186548
    >> dx = .01;
    >> deriv_dx = (sin(pi/4 + dx) - sin(pi/4))/dx
    deriv_dx =
        0.703559491689210
    >> dx = 1e-8;
    >> deriv_dx = (sin(pi/4 + dx) - sin(pi/4))/dx
    deriv_dx =
        0.707106784236800
    >> dx = 1e-20;
    >> deriv_dx = (sin(pi/4 + dx) - sin(pi/4))/dx
    deriv_dx =
        0
    >>
    

    We observe that what Newton intended - and the error bound we presented in Chapter 3 - is indeed honored: as dx tends to zero the finite difference (slope) approaches the derivative \(\cos(\pi/4)\). But not quite: as dx falls below machine precision, the numerator can no longer see the difference, and we obtain an \(O(1)\) error - precisely in the limit in which we should see a more and more accurate answer. (As the reader can no doubt guess, pi, sin, and cos are all MATLAB built-in functions.)

    This is in fact very typical behavior. In order to make numerical errors small we must take smaller increments or more degrees of freedom; however, if we go "too far," finite-precision effects unfortunately "kick in." This trade-off could in fact be debilitating if machine precision were not sufficiently small, and indeed in the early days of computing, with only relatively few bits to represent FPNs, it was a struggle to balance numerical accuracy with finite-precision round-off effects. These days, with the luxury of 64-bit precision, round-off errors are somewhat less of a concern. However, there are situations in which round-off effects can become important.

    In particular, we note that the problem in our derivative example is not just the numerator but also the \(\mathrm{dx}\) in the denominator. As a general rule, we wish to avoid - where possible - division by small numbers, which tends to amplify the effects of finite-precision truncation. (This relates to stability, an important theme which we will encounter in many, often related, guises in subsequent chapters.) We will see that even in much more sophisticated examples - solution of large linear systems - "avoid division by small numbers" remains an important guideline and is often a feature (by construction) of good algorithms. The problem is of course aggravated when we must perform many operations as opposed to just a few.
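
    As a brief added illustration (not from the text; output shown in the default short format): a cancellation error in the numerator is greatly amplified once we divide by the small number x^2, whereas an algebraically equivalent form avoids the difficulty:

    >> x = 1e-8;
    >> (1 - cos(x))/x^2          % exact limit as x -> 0 is 1/2, but cos(x) rounds to 1
    ans =
        0
    >> 2*sin(x/2)^2/x^2          % algebraically equivalent, no cancellation
    ans =
        0.5000
    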

    We have focused our attention on addition since, as indicated, this operation is often the proximal cause of the round-off difficulties. Other operations are performed in the "obvious" way. For example, to multiply two numbers, we multiply the mantissas and add the exponents and then re-adjust to conform to the necessary representation. Division and exponentiation follow similar recipes. Obviously underflow and overflow can be undesired byproducts but these are typically easier to avoid and not "fundamental."
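
    As a purely illustrative sketch of this recipe, in the same toy model (and with the same hypothetical p = 4 mantissa bits) as the addition sketch above:

    % Toy sketch of FPN multiplication: multiply mantissas, add exponents,
    % renormalize so the mantissa lies in [1,2), and truncate to p bits.
    p  = 4;
    m1 = 1.5;   E1 = 3;                    % x1 = 12
    m2 = 1.75;  E2 = -1;                   % x2 = 0.875
    m = m1*m2;  E = E1 + E2;               % raw product: mantissa lies in [1,4)
    if m >= 2, m = m/2; E = E + 1; end     % renormalize
    m = floor(m*2^p)/2^p;                  % truncate the mantissa to p bits
    z = m * 2^E                            % here 10.5, which happens to be exact
    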


    This page titled 4.6: Floating Point Numbers (FPNs): Representation and Operations is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Masayuki Yano, James Douglass Penn, George Konidaris, & Anthony T Patera (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.