Decimal to Floating-Point Needs Arbitrary Precision

In my article “Quick and Dirty Decimal to Floating-Point Conversion” I presented a small C program that converts a decimal string to a double-precision binary floating-point number. The number it produces, however, is not necessarily the closest — or so-called correctly rounded — double-precision binary floating-point number. This is not a failing of the algorithm; mathematically speaking, the algorithm is correct. The flaw comes in its implementation in limited precision binary floating-point arithmetic.

The quick and dirty program is implemented in native C, so it’s limited to double-precision floating-point arithmetic (although on some systems, extended precision may be used). Higher precision arithmetic — in fact, arbitrary precision arithmetic — is needed to ensure that all decimal inputs are converted correctly. I will demonstrate the need for high precision by analyzing three examples, all taken from Vern Paxson’s paper “A Program for Testing IEEE Decimal–Binary Conversion”.

Correct Rounding

A correctly rounded conversion from a decimal string to a double-precision floating-point number is the 53 significant bit value closest to the decimal input — and in the event of a tie, the value with significant bit number 53 equal to 0. (This is the prevailing definition, being based on the default IEEE 754 rounding mode. But to be truly correctly rounded — theoretically at least — all four IEEE 754 rounding modes must be honored.) The closest 53-bit number is obtained by rounding what would be the full-precision binary number to 53 bits. This can be done correctly only if the correct value of bit 54 — and sometimes the correct values of an arbitrary number of bits beyond — is known. Here are the rules:

  • If bit 54 is 0, round down
  • If bit 54 is 1
    • If any bit beyond bit 54 is 1, round up
    • If all bits beyond bit 54 are 0 (meaning the number is halfway between two floating-point numbers)
      • If bit 53 is 0, round down
      • If bit 53 is 1, round up

The rules seem simple enough, but there’s a problem: the calculation needs to be precise enough so that the true value of bit 54 and the appropriate number of bits beyond is known. Decimal numbers that are close to halfway between two floating-point numbers require the most precision to get right; the three examples that follow are such numbers.

Example 1: 7.8459735791271921 x 1065

7.8459735791271921 x 1065 is shorthand for this number:


In binary, it equals


Rounded correctly to 53 bits it equals


which in binary scientific notation equals

1.110111001101000000001000100111000001001100010100111 x 2218

Why This Conversion is Difficult in Floating-Point

The full-precision binary value above was obtained using my decimal to binary converter. The converter uses arbitrary precision decimal arithmetic, so every bit in the result is correct. Since bit 54 (the first highlighted bit above) is 0, it is rounded down to get the correctly rounded 53-bit result.

When the conversion is done in floating-point, things get a bit murky. With insufficient precision, it’s impossible to tell whether bit 54 is really 0, since it’s a tiny fraction away from being 1. The conversion algorithm cannot make the distinction unless the calculation is known to be precise through the 119th bit (the second highlighted bit above).

Converting with Varying Floating-Point Precision

To demonstrate the effect of floating-point precision, I modified my quick and dirty conversion program to use GMP floating-point arithmetic. I increased the precision of the GMP conversions incrementally until I reached the intermediate result with sufficient precision; for each increment, I printed the result in binary and rounded correctly by hand.

I converted 7.8459735791271921 x 1065 to double precision using the quick and dirty program (53-bit precision) and GMP with 64, 96, and 128 bits of precision (on my system, GMP changes precision internally at 32-bit granularity, plus a few extra bits); here are the results, expressed in binary scientific notation:

  • Quick and Dirty (53-bit precision):

    1.110111001101000000001000100111000001001100010101 x 2218

  • GMP, with 64 and 96 bit precision:

    1.110111001101000000001000100111000001001100010100111 x 2218

  • GMP, with 128-bit precision:

    1.110111001101000000001000100111000001001100010100111 x 2218

The quick and dirty result is incorrect, and is off by two units in the last place (ULPs). All three GMP values are correct, but the 64 and 96 bit values are correct by chance; before rounding, the intermediate result (not shown) has bit 53 equal to 0 and bit 54 equal to 1, with all bits beyond equal to 0. That makes it look like a halfway case, so it rounds down to the correct value — by sheer luck. The 128-bit GMP calculation is precise enough that the 119th bit is known to be 0, meaning bit 54 is 0 and that the result should be rounded down.

Example 2: 3.571 x 10266

3.571 x 10266 is an 886 bit binary number, so I’ll only show the relevant portion — its first 77 bits:


Rounded correctly to 53 bits it equals

1.011000100110010001001100011000011101010000011010101 x 2885

Here are the quick and dirty and GMP conversions:

  • Quick and Dirty (53-bit precision):

    1.011000100110010001001100011000011101010000011011001 x 2885

  • GMP, with 64-bit precision:

    1.0110001001100100010011000110000111010100000110101001 x 2885

  • GMP, with 96-bit precision:

    1.011000100110010001001100011000011101010000011010101 x 2885

The quick and dirty and 64-bit GMP results are both incorrect; the quick and dirty result is off by eight ULPs, and the 64-bit GMP result is off by one ULP. The 64-bit conversion goes wrong because GMP computes this intermediate value (it’s 66 bits, since GMP can carry extra bits beyond the expected precision):


At 64-bit precision, bit 54 appears to be 0, and hence the result is rounded down.

As expected, the 96-bit GMP result is correct, since only 77 bits of precision are needed to decide the fate of bit 54.

Example 3: 3.08984926168550152811 x 10-32

3.08984926168550152811 x 10-32 is a decimal fraction which, when written in binary, is non-terminating; for our purposes, we only need to see its first 234 places (104 leading 0s, and significant bits 1-53 and 54-130):


Rounded correctly to 53 bits it equals

1.0100000011011110010010000110011101100110010100111011 x 2-105

Here are the quick and dirty and GMP conversions:

  • Quick and Dirty (53-bit precision):

    1.010000001101111001001000011001110110011001010011101 x 2-105

  • GMP, with 64 and 96 bit precision:

    1.010000001101111001001000011001110110011001010011101 x 2-105

  • GMP, with 128-bit precision:

    1.0100000011011110010010000110011101100110010100111011 x 2-105

  • GMP, with 160-bit precision:

    1.0100000011011110010010000110011101100110010100111011 x 2-105

The quick and dirty, 64-bit GMP, and 96-bit GMP results are all incorrect; they are all the same, and are off by 1 ULP. (The Visual C++ compiler gets this conversion wrong too, getting the same wrong answer!)

The 128-bit GMP result is correct by accident. The intermediate value used to compute it (not shown) has a 0 at bit 53, a 1 at bit 54, and then a string of 74 0s followed by a 1 (yes, 129 bits). The full-precision result has 75 0s followed by a 1. The difference does not matter though, since both bit patterns indicate that the result should be rounded up.

The 160-bit GMP result is correct, which is expected since 130 bits of the full-precision binary result are needed to guarantee the right rounding decision is made.

How Much Precision is Enough?

In the examples above, I knew how much precision was needed for correct rounding because I already knew the full-precision binary equivalents of the decimal inputs — by external means. But how would a floating-point conversion routine know how much precision is enough? Naively, it would have to assume the worst case, computing with enough precision to match the input (for integer values, this would mean computing the full-precision binary equivalent). That would be overkill, since most of the time much less precision is needed to make a correct rounding decision.

David Gay’s strtod() function in dtoa.c — the de facto standard conversion routine — avoids this problem by avoiding high precision floating-point. It uses standard double-precision floating-point for “easy” conversions, and arbitrary precision integer arithmetic for “hard” conversions; it has the smarts to know when to use one or the other.


Thanks to Vern Paxson for helping me understand the “very close to but less than 1/2 ULP case.”



  1. Interesting. I’m curious to know where you got the example numbers that you use (especially the first one).

  2. Mark,

    All three examples are from Vern Paxson’s paper (I linked to it in the introduction). I put them in normalized scientific notation form; here’s how they appear in his paper:

    Example 1: Page 18: 78459735791271921 x 10+049

    Example 2: Page 19: 3571 x 10+263

    Example 3: Page 19: 308984926168550152811 x 10−052

    He has tables with other examples, although I found that a lot of them just work out by accident due to “round to nearest even” (I think that might have to do with being unable to control GMP precision at tighter granularity).

    The paper discusses an extension to a program called Testbase that generates these hard cases automatically.

    (P.S. I’m sorry your comments are moderated. I have no idea why that happens Now I have a guess as to why this happens — we’ll see if I’m right if your next comment makes it through.)

  3. Ah yes, you did say where the examples came from in the article. Sorry for not reading carefully!

  4. Under the section “Why This Conversion is Difficult in Floating-point”, you mention that the conversion algorithm cannot make the distinction unless the calculation is known to be precise thru the 119th bit.

    Why the 119th bit? Is this purely hypothetical since the double-precision binary arithmetic that the quick & dirty conversion program uses cannot “see” that far anyway. Thanks

  5. Or to put it another way, how many bits after bit 54 must be a 1 before bit 54 is considered to be a 1.

  6. @Sean,

    Counting 1s is not the right way to look at it. The precise calculation (the one I show from my decimal to binary converter) shows definitively what bit 54 and beyond is. In the example you cite, many 1s follow bit 54 = 0, which means bit 54 is very close to not being 0. This is an indication that an imprecise calculation may get it wrong.

    The quick and dirty algorithm can’t compute bit 54 and beyond accurately, so in a way including it in this discussion is unfair. It’s the higher precision GMP calculations that are more illustrative of the difficulty of handling the closeness to halfway.

Comments are closed.

Copyright © 2008-2024 Exploring Binary

Privacy policy

Powered by WordPress