15.2: Dissecting a Float

    To understand which operations are involved in the addition above, we must know how floats are represented internally in the computer. Pharo’s Float format is a widespread standard found on most computers: IEEE 754-1985 double precision, on 64 bits (see http://en.Wikipedia.org/wiki/IEEE_754-1985 for more details). In this format, a Float is represented in base 2 by the formula:

    \[ \mathit{sign} \cdot \mathit{mantissa} \cdot 2^{\mathit{exponent}} \nonumber \]

    • The sign is represented with 1 bit.
    • The exponent is represented with 11 bits.
    • The mantissa is a fractional number in base 2, with a leading 1 before the binary point and 52 binary digits after it. Only the 52 digits after the point are stored. In Pharo, the method to obtain the mantissa is Float>>significand. Examples follow later in this chapter.

    For example, a series of 52 bits:

    0110010000000000000000000000000000000000000000000000
    

    means the mantissa is:

    1.0110010000000000000000000000000000000000000000000000

    which represents the following sum of fractions:

    \[ 1 + \dfrac{0}{2} + \dfrac{1}{2^2} + \dfrac{1}{2^3} + \dfrac{0}{2^4} + \dfrac{0}{2^5} + \dfrac{1}{2^6} + \ldots + \dfrac{0}{2^{52}} \nonumber \]

    The mantissa value is thus between 1 (included) and 2 (excluded) for normal numbers. The largest possible mantissa, with all 52 fraction bits set to 1, is just below 2:

    1 + ((1 to: 52) detectSum: [:i | (2 raisedTo: i) reciprocal]) asFloat
        → 1.9999999999999998
    

    Building a float. Let us construct such a mantissa:

    (#(0 2 3 6) detectSum: [:i | (2 raisedTo: i) reciprocal]) asFloat.
        → 1.390625
    

    Now let us multiply by \( 2^3 \) to get a non-zero exponent:

    (#(0 2 3 6) detectSum: [:i | (2 raisedTo: i) reciprocal]) asFloat * (2 raisedTo: 3).
        → 11.125
    

    Or using the method timesTwoPower:

    (#(0 2 3 6) detectSum: [:i | (2 raisedTo: i) reciprocal]) asFloat timesTwoPower: 3.
        → 11.125
    

    In Pharo, you can retrieve this information:

    11.125 sign.
        → 1
    11.125 significand.
        → 1.390625
    11.125 exponent.
        → 3
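
    The three values above can be recombined according to the formula given at the start of this section. The following is a minimal sketch that has only been reasoned through for a positive float; the temporary variable x is introduced here for illustration.

```smalltalk
| x |
x := 11.125.
"Unary messages are evaluated first, then the binary *, then the keyword message timesTwoPower:,
 so this computes (sign * significand) * 2^exponent."
x sign * x significand timesTwoPower: x exponent.
    "→ 11.125"
```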
    

    In Pharo, there is no message to manipulate the normalized mantissa directly. Instead, the mantissa can be handled as an Integer, after a 52-bit shift to the left. There is a good reason for this: operating on Integers is easier because the arithmetic is exact. The result includes the leading 1 and should thus be 53 bits long for a normal number (that is the Float precision):

    11.125 significandAsInteger
        → 6262818231812096
    
    11.125 significandAsInteger printStringBase: 2.
        → '10110010000000000000000000000000000000000000000000000'
    
    '10110010000000000000000000000000000000000000000000000' size
        → 53
    
    11.125 significandAsInteger highBit.
        → 53
    
    Float precision.
        → 53
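
    The sign, exponent and mantissa fields listed at the start of this section can also be read straight from the 64-bit word. This sketch assumes that Float>>asIEEE64BitWord is available in your image, and relies on the fact that IEEE 754 double precision stores the exponent with a bias of 1023. The temporary variable names are chosen for illustration only.

```smalltalk
| word signBit exponentField mantissaField |
word := 11.125 asIEEE64BitWord.                        "the 64 bits as one Integer (assumed selector)"
signBit := word bitShift: -63.                         "→ 0, the number is positive"
exponentField := (word bitShift: -52) bitAnd: 16r7FF.  "→ 1026, the 11 exponent bits"
mantissaField := word bitAnd: (1 bitShift: 52) - 1.    "the 52 stored fraction bits"
exponentField - 1023.                                  "→ 3, same as 11.125 exponent"
"Putting the implicit leading 1 back gives the 53-bit integer seen above."
mantissaField + (1 bitShift: 52) = 11.125 significandAsInteger.
    "→ true"
```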
    

    You can also retrieve the exact fraction corresponding to the internal representation of the Float:

    11.125 asTrueFraction.
        → (89/8)
    (#(0 2 3 6) detectSum: [:i | (2 raisedTo: i) reciprocal]) * (2 raisedTo: 3).
        → (89/8)
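
    The same fraction can be rebuilt from the integer mantissa and the exponent. Since significandAsInteger is the mantissa shifted left by 52 bits, the shift has to be undone in the power of 2. This is a small sketch using exact Integer and Fraction arithmetic; the temporary variable x is illustrative.

```smalltalk
| x |
x := 11.125.
"2 raisedTo: a negative integer answers an exact Fraction, so no rounding occurs here."
x significandAsInteger * (2 raisedTo: x exponent - 52).
    "→ (89/8)"
(x significandAsInteger * (2 raisedTo: x exponent - 52)) = x asTrueFraction.
    "→ true"
```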
    

    So far, we have retrieved exactly the input we injected into the Float. Are Float operations exact after all? Well, no: we only played with fractions having a power of 2 as denominator and a few bits in the numerator. If either of these conditions is not met, we will not find an exact Float representation of our number. For example, it is not possible to represent 1/5 with a finite number of binary digits. Consequently, a decimal fraction like 0.1 cannot be represented exactly with the representation above.

    (1/5) asFloat = (1/5).
        → false
    
    (1/5) = 0.2
        → false
    

    Let us see in detail how to obtain the fractional bits of 1/5, i.e., 2r1/2r101. To do so, we lay out the division:

    Division layout.
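
    The division can be written out as repeated doubling of the remainder, producing one quotient bit per step. This is a worked sketch of the usual long-division procedure, starting from the remainder 1:

    \[ \begin{array}{rcrcll} 2 \times 1 & = & 2 & = & 0 \times 5 + 2 & \quad \text{(bit 0)} \\ 2 \times 2 & = & 4 & = & 0 \times 5 + 4 & \quad \text{(bit 0)} \\ 2 \times 4 & = & 8 & = & 1 \times 5 + 3 & \quad \text{(bit 1)} \\ 2 \times 3 & = & 6 & = & 1 \times 5 + 1 & \quad \text{(bit 1)} \end{array} \nonumber \]

    Reading the quotient bits in groups of four, each group 0011 is worth 3/16 of the weight of the previous one, and the resulting series sums back to 1/5:

    \[ 0.001100110011\ldots_2 = \dfrac{3}{16} + \dfrac{3}{16^2} + \dfrac{3}{16^3} + \ldots = \dfrac{3/16}{1 - 1/16} = \dfrac{3}{15} = \dfrac{1}{5} \nonumber \]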

    What we see is a cycle: every 4 Euclidean divisions, we get the quotient 2r0011 and the remainder 1. That means that we need an infinite repetition of the bit pattern 0011 to represent 1/5 in base 2. Let us see how Pharo converts (1/5) to a Float:

    (1/5) asFloat significandAsInteger printStringBase: 2.
        → '11001100110011001100110011001100110011001100110011010'
    
    (1/5) asFloat exponent.
        → -3
    

    That’s the bit pattern we expected, except that the last bits 001 have been rounded up to 010. This is the default rounding mode of Float: round to nearest, ties to even. We now understand why 0.2 is represented inexactly in the machine. 0.1 has the same mantissa, with an exponent of -4:

    0.2 significand
        → 1.6
    
    0.1 significand
        → 1.6
    
    0.2 exponent
        → -3
    
    0.1 exponent
        → -4
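
    The direction of each rounding can be checked with exact Fraction arithmetic, using only asTrueFraction as introduced above. In this sketch the printed results were worked out by hand from the bit patterns shown in this section. A positive difference means the stored value lies above the intended decimal value.

```smalltalk
"Exact value stored for 0.2, minus the intended 1/5."
0.2 asTrueFraction - (1/5).
    "→ (1/90071992547409920)"
"Same check for 0.1: same mantissa, exponent one lower, so the error is halved."
0.1 asTrueFraction - (1/10).
    "→ (1/180143985094819840)"
```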
    

    So, when we entered 0.1 + 0.2, we didn’t get exactly (1/10) + (1/5). Instead, we got:

    0.1 asTrueFraction + 0.2 asTrueFraction.
        → (10808639105689191/36028797018963968)
    

    But that’s not the whole story. Let us inspect the bit patterns of the numerator and denominator of the fraction above, and check the span of the numerator’s bit pattern, that is, the position of the highest bit set to 1 (leftmost) and the position of the lowest bit set to 1 (rightmost):

    10808639105689191 printStringBase: 2.
        → '100110011001100110011001100110011001100110011001100111'
    
    10808639105689191 highBit.
        → 54
    
    10808639105689191 lowBit.
        → 1
    
    36028797018963968 printStringBase: 2.
        → '10000000000000000000000000000000000000000000000000000000'
    

    The denominator is a power of 2, as expected, but we need 54 bits of precision to store the numerator, and Float only provides 53. Another rounding error is therefore introduced to fit the result into the Float representation:

    (0.1 asTrueFraction + 0.2 asTrueFraction) asFloat = (0.1 asTrueFraction + 0.2 asTrueFraction).
        → false
    
    (0.1 asTrueFraction + 0.2 asTrueFraction) asFloat significandAsInteger printStringBase: 2.
        → '10011001100110011001100110011001100110011001100110100'
    

    To summarize what happened, including the conversions from decimal representation to Float representation:

    (1/10) asFloat         0.1    inexact (rounded up)
    (1/5) asFloat          0.2    inexact (rounded up)
    (0.1 + 0.2) asFloat    ...    inexact (rounded up)

    Three inexact operations occurred and, as bad luck would have it, all three roundings went upward, so the errors accumulated instead of cancelling out. On the other hand, interpreting 0.3 causes a single rounding error, that of (3/10) asFloat. We now understand why we cannot expect \( 0.1 + 0.2 = 0.3 \).
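
    The gap between the two results can be measured exactly in the same way. This is a minimal sketch using asTrueFraction and Number>>sign; the printed results were derived by hand from the 53-bit patterns above.

```smalltalk
"0.3 is read with a single rounding, and that rounding goes downward."
(0.3 asTrueFraction - (3/10)) sign.
    "→ -1"
"The computed sum lies above 3/10."
((0.1 + 0.2) asTrueFraction - (3/10)) sign.
    "→ 1"
"The two floats are neighbours: they differ by one unit in the last place, 2^-54."
(0.1 + 0.2) asTrueFraction - 0.3 asTrueFraction.
    "→ (1/18014398509481984)"
```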

    As an exercise, you could show why \( 1.3 * 1.3 \neq 1.69 \).


    This page titled 15.2: Dissecting a Float is shared under a CC BY-SA 3.0 license and was authored, remixed, and/or curated by Alexandre Bergel, Damien Cassou, Stéphane Ducasse, Jannik Laval (Square Bracket Associates) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.