With our modern base 10 numeral system we indicate a
negative number by placing a minus
(-) sign in front of it.
When using binary we need to use a different system to
indicate negative numbers.
There is only one scheme in common use on modern
hardware, but C99 defines three acceptable methods for
negative value representation.
The most straightforward method is to simply say that
one bit of the number indicates either a negative or
positive value depending on whether it is set or not.
This is analogous to the mathematical approach of writing
+ and -, and is fairly logical;
some early computers did represent negative
numbers in this way. But binary numbers open up some
other possibilities which make the life of hardware designers
easier.
However, notice that the value
0 now has two
representations: one with the sign bit set and one without.
These are sometimes referred to as
-0 and
+0 respectively.
One's complement simply applies the
not operation to the positive number to
represent the negative number. So, for example, the value
-90 (-0x5A) is represented by ~01011010 =
10100101.
The biggest advantage of this scheme is that no special
logic is required to add a negative number to a positive
number, except that any carry left over at the end must be
added back into the final value. Consider
Table 2.15. One's Complement Addition
| Decimal | Binary | Op |
|---------|-----------|----|
| -90 | 10100101 | + |
| 100 | 01100100 | |
| 9 | **1** 00001001 | intermediate sum with carry |
| 10 | 00001010 | carry added back |
If you add the bits one by one, you find you end up
with a carry bit at the end (highlighted above). By adding
this back to the intermediate result we end up with the
correct value, 10.
We still have the problem of two representations of
zero, however. No modern computer uses one's
complement, mostly because there is a better scheme.
Two's complement is just like one's complement, except
that one is added to the negative
representation and any left-over carry bit is discarded. So to
continue with the example from before,
-90 would be
~01011010 + 1 = 10100101 + 1 =
10100110.
This means there is a slight asymmetry in the
numbers that can be represented; for example, with an 8-bit
integer we have
2^8 =
256 possible values. With our sign-bit
representation we could represent only -127 through 127, but with
two's complement we can represent -128 through 127. This is
because we have removed the problem of having two zeros;
consider that "negative zero" is (~00000000
+ 1) = (11111111 + 1) = 00000000 (note the
discarded carry bit).
Table 2.16. Two's Complement Addition
| Decimal | Binary | Op |
|---------|----------|----|
| -90 | 10100110 | + |
| 100 | 01100100 | |
| 10 | 00001010 | carry discarded |
You can see that by implementing two's complement,
hardware designers need only provide logic for addition
circuits; subtraction can be done by negating the value to
be subtracted (taking its two's complement) and then
adding.
Similarly you could implement multiplication with
repeated addition and division with repeated subtraction.
Consequently two's complement can reduce all simple
mathematical operations down to addition!
All modern computers use two's complement
representation.
Because of the two's complement format, when increasing
the size of a signed value, it is important that the
additional bits be sign-extended;
that is, copied from the top bit of the existing
value.
For example, the 32-bit
int value
-10 would be represented
in two's complement binary as
11111111111111111111111111110110.
If one were to cast this to a 64-bit long
long int, we would need to ensure that
the additional 32-bits were set to
1 to maintain the same
sign as the original.
Thanks to two's complement, it is sufficient to take
the top bit of the existing value and replace all the added
bits with this value. This process is referred to as
sign-extension and is usually handled
by the compiler in situations defined by the language
standard, with the processor generally providing special
instructions to take a value and sign-extend it to some
larger value.
So far we have only discussed integer or whole numbers;
the class of numbers that can represent decimal values is
called floating point.
To create a decimal number, we require some way to
represent the concept of the decimal place in binary. The
most common scheme for this is known as the IEEE-754
floating point standard, because the standard is
published by the Institute of Electrical and Electronics
Engineers. The scheme is conceptually quite simple and is
somewhat analogous to "scientific notation".
In scientific notation the value
123.45 might commonly be
represented as
1.2345 × 10^2.
We call 1.2345 the
mantissa or
significand,
10 the
radix and
2 the
exponent.
In the IEEE floating point model, we break up the
available bits to represent the sign, mantissa and exponent of
a decimal number. A decimal number is represented by
sign × significand ×
2^exponent.
The sign bit equates to either
1 or
-1. Since we are working in
binary, we always have the implied radix of
2.
There are differing widths for a floating point value --
below we examine only a 32-bit value. More bits allow
greater precision.
Table 2.17. IEEE Floating Point
| Sign | Exponent | Significand/Mantissa |
|------|----------|----------------------|
| S | EEEEEEEE | MMMMMMMMMMMMMMMMMMMMMMM |
The other important factor is the bias
of the exponent. The exponent field needs to be able to
represent both positive and negative exponents, so an
implied value of 127 is
subtracted from it. For example, an exponent of
0 is stored as a field value of
127, a field value of
128 represents an exponent of
1 and
126 represents
-1.
Each bit of the significand adds a little more precision
to the values we can represent. Consider the scientific
notation representation of the value
198765. We could write this
as
1.98765 × 10^6,
which corresponds to the representation below.

Table 2.18. Scientific Notation for 1.98765 × 10^6

| 10^0 | . | 10^-1 | 10^-2 | 10^-3 | 10^-4 | 10^-5 |
|------|---|-------|-------|-------|-------|-------|
| 1 | . | 9 | 8 | 7 | 6 | 5 |
Each additional digit allows a greater range of decimal
values we can represent. In base 10, each digit after the
decimal place increases the precision of our number by 10
times. For example, we can represent
0.0 through
0.9 (10 values) with one
digit of decimal place, 0.00
through 0.99 (100 values)
with two digits, and so on. In binary, rather than each
additional digit giving us 10 times the precision, we only get
two times the precision, as illustrated in the table below.
This means that our binary representation does not always map
in a straight-forward manner to a decimal
representation.
Table 2.19. Significands in binary
| 2^0 | . | 2^-1 | 2^-2 | 2^-3 | 2^-4 | 2^-5 |
|-----|---|------|------|------|------|------|
| 1 | . | 1/2 | 1/4 | 1/8 | 1/16 | 1/32 |
| 1 | . | 0.5 | 0.25 | 0.125 | 0.0625 | 0.03125 |
With only one bit of precision, our fractional precision
is not very great; we can only say that the fraction is either
0 or
0.5. If we add another bit
of precision, we can now say that the decimal value is one of
0, 0.25, 0.5 or 0.75. With
another bit of precision we can represent the values
0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75 and 0.875.
Increasing the number of bits therefore allows us
greater and greater precision. However, since the range of
possible numbers is infinite we will never have enough bits to
represent any possible value.
For example, if we have only two bits of precision and
need to represent the value
0.3, we can only say that it
is closest to 0.25; obviously
this is insufficient for almost any application. With 23 bits
of significand we have a much finer resolution, but it is
still not enough for many applications. A
double value increases the
number of significand bits to 52 (and also increases the range
of exponent values). Some hardware has an 80-bit extended
floating point type, with a full 64 bits of significand; this
allows tremendous precision and should be suitable for all but
the most demanding of applications.
Example 2.4. Floats versus Doubles
```
$ cat float.c
#include <stdio.h>

int main(void)
{
    float a = 0.45;
    float b = 8.0;

    double ad = 0.45;
    double bd = 8.0;

    printf("float+float, 6dp    : %f\n", a+b);
    printf("double+double, 6dp  : %f\n", ad+bd);
    printf("float+float, 20dp   : %10.20f\n", a+b);
    printf("double+double, 20dp : %10.20f\n", ad+bd);

    return 0;
}

$ gcc -o float float.c
$ ./float
float+float, 6dp    : 8.450000
double+double, 6dp  : 8.450000
float+float, 20dp   : 8.44999998807907104492
double+double, 20dp : 8.44999999999999928946

$ python
Python 2.4.4 (#2, Oct 20 2006, 00:23:25)
[GCC 4.1.2 20061015 (prerelease) (Debian 4.1.1-16.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 8.0 + 0.45
8.4499999999999993
```
A practical example is illustrated above. Notice that
for the default six decimal places of precision given by
printf both answers are the
same, since they are rounded correctly. However, when
asked to give the results to a larger precision, in this case
20 decimal places, we can see the results start to diverge.
The code using doubles gives a
more accurate result, but it is still not
exactly correct. We can also see that
programmers not explicitly dealing with
float values still have
problems with the precision of variables!
In scientific notation, we can represent a value in
many different ways. For example,
10023 × 10^0 =
1002.3 × 10^1 =
100.23 × 10^2. We
thus define the normalised version as
the one where 1/radix <= significand <
1. In binary this ensures that the
leftmost bit of the significand is always
one. Knowing this, we can gain an extra bit of
precision by having the standard say that the leftmost bit
being one is implied.
Table 2.20. Example of normalising 0.375
| 2^0 | . | 2^-1 | 2^-2 | 2^-3 | 2^-4 | 2^-5 | Exponent | Calculation |
|-----|---|------|------|------|------|------|----------|-------------|
| 0 | . | 0 | 1 | 1 | 0 | 0 | 2^0 | (0.25+0.125) × 1 = 0.375 |
| 0 | . | 1 | 1 | 0 | 0 | 0 | 2^-1 | (0.5+0.25) × 0.5 = 0.375 |
| 1 | . | 1 | 0 | 0 | 0 | 0 | 2^-2 | (1+0.5) × 0.25 = 0.375 |
As you can see above, we can normalise the value
by moving the bits upwards as long as we compensate by
decreasing the exponent.
A common problem programmers face is finding the
first set bit in a bitfield. Consider the bitfield
0100; from the right the
first set bit would be bit
2 (starting from zero, as
is conventional).
The standard way to find this value is to shift
right, check if the lowest bit is a
1 and either terminate or
repeat. This is a slow process; if the bitfield is 64
bits long and only the very last bit is set, you must go
through all the preceding 63 bits!
However, if this bitfield value were the significand
of a floating point number and we were to normalise it,
the value of the exponent would tell us how many times it
was shifted. The process of normalising a number is
generally built into the floating point hardware unit on
the processor, so it operates very fast; usually much faster
than the repeated shift-and-check operations.

The example program below illustrates two methods of
finding the first set bit on an Itanium processor. The
Itanium, like most server processors, has support for an
80-bit extended floating point type,
with a 64-bit significand. This means an
unsigned long neatly fits
into the significand of a long
double. When the value is loaded it is
normalised, and thus by reading the exponent value
(minus the bias of 65535) we can see how far it was
shifted.
Example 2.5. Program to find first set bit
```
#include <stdio.h>

int main(void)
{
    int i = 0x8000;
    int count = 0;

    while ( !(i & 0x1) ) {
        count++;
        i = i >> 1;
    }

    printf("First non-zero (slow) is %d\n", count);

    long double d = 0x8000UL;
    long exp;

    asm ("getf.exp %0=%1" : "=r"(exp) : "f"(d));

    printf("The first non-zero (fast) is %ld\n", exp - 65535);

    return 0;
}
```
In the example code below we extract the components of
a floating point number and print out the value it
represents. This will only work for a 32 bit floating point
value in the IEEE format; however this is common for most
architectures with the
float type.
Example 2.6. Examining Floats
```
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* 2^n for n >= 0 */
int two_to_pos(int n)
{
    if (n == 0)
        return 1;
    return 2 * two_to_pos(n - 1);
}

/* 2^n for n < 0 */
double two_to_neg(int n)
{
    if (n == 0)
        return 1;
    return 1.0 / (two_to_pos(abs(n)));
}

double two_to(int n)
{
    if (n >= 0)
        return two_to_pos(n);
    return two_to_neg(n);
}

/* Calculate the value of the significand, bit by bit */
double calc_float(int m, int bit)
{
    if (bit > 23)
        return 0;

    if ((m >> bit) & 1)
        return 1.0L / two_to(23 - bit) + calc_float(m, bit + 1);

    return calc_float(m, bit + 1);
}

int main(int argc, char *argv[])
{
    float f;
    int m, i, sign, exponent, significand;

    if (argc != 2) {
        printf("usage: float 123.456\n");
        exit(1);
    }

    if (sscanf(argv[1], "%f", &f) != 1) {
        printf("invalid input\n");
        exit(1);
    }

    /* Copy the raw bits of the float into an integer for masking */
    memcpy(&m, &f, 4);

    /* The sign is the top bit */
    sign = (m >> 31) & 0x1;

    /* The exponent field is the next 8 bits, biased by 127 */
    exponent = ((m >> 23) & 0xFF) - 127;

    /* The significand is the low 23 bits, with the implied
     * leading one restored */
    significand = (m & 0x7FFFFF) | 0x800000;

    printf("%f = %d * (", f, sign ? -1 : 1);
    for (i = 23; i >= 0; i--) {
        if ((significand >> i) & 1)
            printf("%s1/2^%d", (i == 23) ? "" : " + ", 23 - i);
    }
    printf(") * 2^%d\n", exponent);

    printf("%f = %d * (", f, sign ? -1 : 1);
    for (i = 23; i >= 0; i--) {
        if ((significand >> i) & 1)
            printf("%s1/%d", (i == 23) ? "" : " + ", (int)two_to(23 - i));
    }
    printf(") * 2^%d\n", exponent);

    printf("%f = %d * %.12g * %f\n", f, (sign ? -1 : 1),
           calc_float(significand, 0), two_to(exponent));

    printf("%f = %.12g\n", f,
           (sign ? -1 : 1) * calc_float(significand, 0) * two_to(exponent));

    return 0;
}
```
Sample output of the value
8.45, which we previously
examined, is shown below.
Example 2.7. Analysis of 8.45
```
$ ./float 8.45
8.450000 = 1 * (1/2^0 + 1/2^5 + 1/2^6 + 1/2^7 + 1/2^10 + 1/2^11 + 1/2^14 + 1/2^15 + 1/2^18 + 1/2^19 + 1/2^22 + 1/2^23) * 2^3
8.450000 = 1 * (1/1 + 1/32 + 1/64 + 1/128 + 1/1024 + 1/2048 + 1/16384 + 1/32768 + 1/262144 + 1/524288 + 1/4194304 + 1/8388608) * 2^3
8.450000 = 1 * 1.05624997616 * 8.000000
8.450000 = 8.44999980927
```
From this example, we get some idea of how the
inaccuracies creep into our floating point numbers.