This little C program converts a decimal value — represented as a string — into a double-precision floating-point number:
#include <string.h> int main (void) { double intPart = 0, fracPart = 0, conversion; unsigned int i; char decimal[] = "3.14159"; i = 0; /* Left to right */ while (decimal[i] != '.') { intPart = intPart*10 + (decimal[i] - '0'); i++; } i = strlen(decimal)-1; /* Right to left */ while (decimal[i] != '.') { fracPart = (fracPart + (decimal[i] - '0'))/10; i--; } conversion = intPart + fracPart; }
The conversion is done using the elegant Horner’s method, summing each digit according to its decimal place value. So why do I call it “quick and dirty?” Because the binary floating-point value it produces is not necessarily the closest approximation to the input decimal value — the so-called correctly rounded result. (Remember that most real numbers cannot be represented exactly in floating-point.) Most of the time it will produce the correctly rounded result, but sometimes it won’t — the result will be off in its least significant bit(s). There’s just not enough precision in floating-point to guarantee the result is correct every time.
I will demonstrate this program with different input values, some of which convert correctly, and some of which don’t. In the end, you’ll appreciate one reason why library functions like strtod() exist — to perform efficient, correctly rounded conversion.
Testing the Conversions
How do you test whether a quick and dirty conversion is correct? First you need to know what the correctly rounded value is, and then you need to compare it to the quick and dirty value.
Here are two ways to find the correctly rounded value:
- Assign the decimal constant to a double and let your compiler compute it for you.
- Enter the decimal constant in my arbitrary-precision decimal/binary converter and then round it by hand to 53 significant bits (the number of significant bits in a double).
I did both, because I know that sometimes even compilers do the conversions incorrectly (see my articles on Visual C++ and gcc/glibc). In any case, I verified that all the compiler conversions in the examples below are correct.
Next we need to compare the quick and dirty value to the correctly rounded value — but how? Comparing floating-point values for equality is not reliable. That means we should display the values and compare them manually. But we can’t depend on printf(), because in many implementations it won’t print all the significant digits, either before the decimal point or after the decimal point (this is true of Visual C++, the compiler I am using). This leaves several options:
- Print the raw IEEE representation of the double, using my functions print_raw_double_hex() or print_raw_double_binary().
- Print the value of the double as a hexadecimal floating-point constant.
- Print the value of the double in binary scientific notation, using my function print_double_binsci().
- Print the value of the double in binary, using my function fp2bin().
I think seeing the binary representation is the most enlightening; of the three binary display options, I chose fp2bin().
The Test Code
I modified the code in the introduction of this article to allow for testing, as described; here it is
#include <string.h> #include <stdio.h> #include "fp2bin.h" int main (void) { char binString[FP2BIN_STRING_MAX]; double intPart = 0, fracPart = 0, conversion; unsigned int i; char decimal[] = "3.14159"; /* Converted manually */ double compiler_conversion = 3.14159; /* Converted by compiler */ i = 0; /* Left to right */ while (decimal[i] != '.') { intPart = intPart*10 + (decimal[i] - '0'); i++; } i = strlen(decimal)-1; /* Right to left */ while (decimal[i] != '.') { fracPart = (fracPart + (decimal[i] - '0'))/10; i--; } conversion = intPart + fracPart; /* Print the converted values, in binary */ fp2bin(compiler_conversion,binString); printf("Correct = %s\n",binString); fp2bin(conversion,binString); printf("Q and D = %s\n",binString); }
When run (I ran on an Intel Core Duo processor), the output shows that the compiler and quick and dirty conversions of 3.14159 are both correct:
Correct = 11.00100100001111110011111000000011011100001100110111 Q and D = 11.00100100001111110011111000000011011100001100110111
Their value in decimal, obtained from my binary/decimal converter, is 3.14158999999999988261834005243144929409027099609375.
Incorrect Conversions
Incorrect conversions are due to the limitations of floating-point arithmetic; specifically, these three things affect the conversion calculations:
- Large integers require more bits than can fit in a floating-point variable.
Multiplying a binary integer by ten is exact, but once the integer exceeds the largest integer that a double can represent exactly, rounding errors can occur.
- Division by ten in binary arithmetic is inexact.
Each time the fractional value is divided by ten, there is a chance of roundoff error.
- Addition of integers and fractions can incur rounding error.
When an integer and fraction are added, some of the bits of the fraction can be lost.
(Also, values near DBL_MAX can convert incorrectly to “infinity,” although I’m not considering overflow here.)
The example values below — which I discovered by both random testing and trial and error — are all rounded incorrectly by my quick and dirty program.
The Integer 18014398509481993
The number 18,014,398,509,481,993 (254 + 9) is not converted correctly by the quick and dirty program; this is its output (the lines I changed are char decimal[] = “18014398509481993.0”; and double compiler_conversion = 18014398509481993.0;):
Correct =
1000000000000000000000000000000000000000000000000001000.0
Q and D =
1000000000000000000000000000000000000000000000000001100.0
Their values, in decimal, are
- Correct = 18014398509481992 = 254 + 8
- Q and D = 18014398509481996 = 254 + 12
These values differ by 4, which is one unit in the last place (the last place for a double is the 53rd significant bit).
The Fraction 0.9199
The number 0.9199 is not converted correctly by the quick and dirty program; this is its output (the lines I changed are char decimal[] = “0.9199”; and double compiler_conversion = 0.9199;):
Correct =
0.11101011011111101001000011111111100101110010010001111
Q and D =
0.1110101101111110100100001111111110010111001001000111
Their values, in decimal, are
- Correct = 0.91990000000000005098144129078718833625316619873046875
= 8285722594436239/253
- Q and D = 0.9198999999999999399591388282715342938899993896484375
= 4142861297218119/252 = 8285722594436238/253
These values differ by 2-53, which is one unit in the last place.
The Combined Integer and Fraction 1.89
The number 1.89 is not converted correctly by the quick and dirty program; this is its output (the lines I changed are char decimal[] = “1.89”; and double compiler_conversion = 1.89;):
Correct =
1.1110001111010111000010100011110101110000101000111101
Q and D =
1.111000111101011100001010001111010111000010100011111
Their values, in decimal, are
- Correct = 1.8899999999999999023003738329862244427204132080078125
= 1 + 4008203668359741/252
- Q and D = 1.890000000000000124344978758017532527446746826171875
= 1 + 2004101834179871/251 = 1 + 4008203668359742/252
These values differ by 2-52, which is one unit in the last place.
The interesting thing is that 0.89 by itself converts correctly; it’s only when it’s added to 1 that the final result becomes incorrect.
The Fraction 3.50582559e-71
The number 3.50582559e-71 is not converted correctly by the quick and dirty program; this is its output (in the program source, I wrote 3.50582559e-71 as 0.0000000000000000000000000000000000000000000000000000000000000000000000350582559):
Correct =
0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011110111110001010011001011001000010111111010110010001
Q and D =
0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011110111110001010011001011001000010111111010110000011
Their values, in decimal, are
- Correct =
0.00000000000000000000000000000000000000000000000000000000000000000000003505825590000000127666442084156194149833722363535945086623257683235283315238231093157098293845387261005915287514287812242780435016915881911641225954878731368200574173594143601202788307347191221197135746479034423828125
= 8717642643535249/2287
- Q and D =
0.00000000000000000000000000000000000000000000000000000000000000000000003505825589999994497525328603499564935354782628876135671208090707866566080104779573609823758603905804048854042779018384586485878398467052729618730030977116116317972699506139481161193227620742618455551564693450927734375
= 8717642643535235/2287
These values differ by 14/2287, which is 14 units in the last place. (This is the largest error I found — in fact, the only example of an error this large — in the over 100 million random conversions I tried).
Discussion
In practice, you’ll depend on your compiler or interpreter or a library call for correctly rounded conversion. The de facto standard code used by many implementations is David Gay’s strtod() function in dtoa.c. Take a look at it to get a sense of how involved correctly rounded conversion is. (dtoa.c handles scientific notation, and also includes code for correct conversion in the other direction — floating-point to decimal).
For a little more detailed discussion of Horner’s method, see my PHP functions bin2dec_i() and bin2dec_f(). Those two functions implement the same algorithms as the quick and dirty program, only using decimal arithmetic and powers of two (there’s nothing “dirty” about them though — they’re binary to decimal and arbitrary precision).
Conversion to Single-Precision
Conversion to floats works similarly, also producing correctly and incorrectly rounded results. There’s one thing I observed, which you might find surprising: a given decimal value may convert incorrectly to a double but correctly to a float!
Interesting article, although the problems are not as pronounced as you make them seem.
Highlighting the decimals which are not right before rounding is not proper. In your last example (1.89), the problem does not manifest in the third decimal, but in the eighteenth! This is because if you’d show only the first three decimals, you would round to three, not truncate to three.
Okke,
With the shading, I didn’t mean to imply those decimal digits were imprecise — just that they didn’t match. But you’re right, it is misleading when compared to the bit shading, which DOES show where the imprecision starts in binary (well, it does now — I was off by one bit in two cases).
As for rounding, my point was to show the exact values in the variables, not something rounded by printf() for display.
Thanks for the feedback. (I
removed the decimal shadingrewrote the decimals as dyadic fractions and shaded them where they differed.)I added a fourth example, 3.50582559e-71, to show how large of an error is possible (14 ULPs in this case).