Using Integers to Check a Floating-Point Approximation

For decimal inputs that don’t qualify for fast path conversion, David Gay’s strtod() function does three things: first, it uses IEEE double-precision floating-point arithmetic to calculate an approximation to the correct result; next, it uses arbitrary-precision integer arithmetic (AKA big integers) to check if the approximation is correct; finally, it adjusts the approximation, if necessary. In this article, I’ll explain the second step — how the check of the approximation is done.


The goal of David Gay’s strtod() is to perform efficient, correctly rounded decimal to floating point conversion. To that end, it minimizes the overhead due to arbitrary-precision arithmetic, which is unavoidable for some conversions. In particular, strtod() avoids division of big integers, as would be required by a full-blown big integer algorithm. Instead of dividing, it multiplies, adds, subtracts, and compares.

The key to the algorithm is the floating-point approximation. It can be compared to the original decimal input to see if it’s too small, correct, or too big (it will be within 10 ULPs or so of correct). The approximation can be adjusted accordingly.

(My discussion will deal only with normal double-precision numbers; strtod() handles subnormal numbers as well.)

Correctly Rounded Conversions

For a floating-point number b to be considered the correctly rounded conversion of a decimal number d, it must be the closest floating-point number to d. That definition sounds simple enough, but there’s a problem: the meaning of “closest” can depend on the value of b. In any case, the tests for closeness are all based on a measure called a ULP — a binary (for our purposes) unit in the last place. A ULP is the gap between adjacent floating-point numbers.

I’ll present the tests as three separate cases, although in code they must be considered together.

Case 1: d is less than one-half ULP away from b

If d is less than one-half of a ULP from b, then b is the correctly rounded conversion of d (this test is ambiguous when b is a power of two — I’ll explain below). Mathematically, this is stated as

b – h < d < b + h ,

where h = 1/2·u, and where u equals one ULP.

Here’s what the relationship looks like:

Correct rounding when d is less than one-half ULP away from b
Correct rounding when d is less than one-half ULP away from b

Equivalently, this relationship can be written as

h < db < h ,


|db| < h .

Case 2: d is one-half ULP away from b

If d is one-half of a ULP away from b, then b is the correctly rounded conversion of d only if b’s least significant bit is 0 (this test is ambiguous when b is a power of two — I’ll explain below). This is called round-half-to-even rounding.

The relationship is written as

|db| = h

Correct rounding when d is one-half ULP away from b
Correct rounding when d is one-half ULP away from b

Case 3: b is a power of two

If b is a power of two, two different ULP values come into play, making our half ULP test ambiguous. One ULP below b is half as much as one ULP above b, since a ULP doubles at each increasing power of two boundary. This means that if b is a power of two, it is the correctly rounded conversion of d only if d is no more than one-half ULP above b or no more than one-quarter of that same (upper) ULP below b; here is the inequality:

h/2 ≤ dbh .

This is what it looks like:

Correct rounding when b is a power of two
Correct rounding when b is a power of two

We use the “≤” signs in the inequality because b is the correctly rounded result even if d = bh/2 or d = b + h. This is because the significands of (normal) powers of two are zero.

Comparing Using Big Integers

To explain how the algorithm works, I will use this example: d = 1.7864e-45 and b = 0x1.465a72e467d88p-149 (the latter is expressed as a hexadecimal floating-point constant). Here’s how d and b are related:

1.7864 x 10^-45 In Relation to Its Binary Floating-Point Approximation 0x1.465a72e467d88p-149
1.7864 x 10-45 In Relation to Its Binary Floating-Point Approximation, 0x1.465a72e467d88p-149

The diagram shows that b is the closest floating-point number to d — but how do we check that programmatically, using binary arithmetic? We need to apply the above test: |db| < h. (I know a priori this is not a special case — I wanted to keep my example simple.) But there’s a problem: how do we represent d? Answer: as a big integer. If we also represent b and h as big integers, such that they are proportional to d, we can apply the test given by the inequality.

I’ve broken the process into four steps:

Step 1: Represent the significant digits of d and b as integers

The first step is to represent d as an integer times a power of ten, and b as an integer times a power of two:

d = dInt x 10dExp
b = bInt x 2bExp

d was already put in this form when the floating point approximation b was computed (but for the purposes of the comparison, all digits of d are used, not just the first 16). b can be put in this form simply, by accessing its underlying floating-point representation: we “unnormalize” its significand into an integer by decreasing its power of two exponent. We represent the significant digits as a 53-bit integer, which means we subtract 52 from its power of two exponent.

For our example, d = 1.7864e-45 is represented as 17864 x 10-49, and b = 0x1.465a72e467d88p-149, which in binary scientific notation is

1.0100011001011010011100101110010001100111110110001 x 2-149 ,

is represented (in decimal numerals) as 5741268244528520 x 2-201.

Step 2: Represent h as a power of two

Since bInt is 53-bit integer, u = 2bExp, and h is half that: 2bExp-1. For consistency in my discussion, I’ll substitute the variable hExp for bExp-1, giving

h = 2hExp .

Step 3: Scale the inequality

To complete the conversion to integers, we need to get rid of any negative powers; we do this by scaling both sides of the inequality. Let’s look at the inequality again:

|db| < h .

Here are the three values as we expressed them above:

d = dInt x 10dExp
b = bInt x 2bExp
h = 2hExp

For efficiency, the scaling process will factor out common powers of two from the three powers; to facilitate this, we rewrite 10dExp as 2dExp x 5dExp; now we have

d = dInt x 2dExp x 5dExp
b = bInt x 2bExp
h = 2hExp

If any of the exponents dExp, bExp, or hExp are negative, we make them nonnegative by scaling. To reflect any scaling, let’s rename d, b, and h to dS, bS, and hS, respectively. The scaled inequality is now written

|dSbS| < hS ,

and the initial values of dS, bS, and hS are

dS = dInt x 2dExp x 5dExp
bS = bInt x 2bExp
hS = 2hExp

To show how scaling works, let’s continue with our example above; here are the initial values of the variables:

dS = 17864 x 2-49 x 5-49
bS = 5741268244528520 x 2-201
hS = 2-202

In this example, all three exponents — dExp, bExp, and hExp — are negative; we scale the values with a common factor S that makes them all nonnegative. The least common factor that does this is S = 2202 x 549; our variables then become

dS = 17864 x 2153
bS = 5741268244528520 x 21 x 549
hS = 549

Multiplied out, these are

dS = 203970822259994138521801764465966248930731085529088
bS = 203970822259994122305215569213032722473144531250000
hS = 17763568394002504646778106689453125

Step 4: Apply the test

Now all we have to do is check the inequality, |dSbS| < hS:

|dSbS| = 16216586195252933526457586554279088, which is less than hS = 17763568394002504646778106689453125. This proves that b is the correct conversion of d.

(In the diagram for this example above, I’ve drawn the relationship of d and b proportionally: (dSbS)/hS is approximately 0.91, which means that d is about 0.91h above b.)

Computing the Scaling Exponents In Code

In code, the scaling is done by operating on native integer variables representing the exponents of the powers of two and five in each of the three scaled values. After the exponents are created, common factors of two are removed to reduce the size of the resulting big integers. Here is how I coded it in C (this is essentially how it’s done in David Gay’s strtod() and Java’s FloatingDecimal class; I’ve renamed the variables and structured it a bit differently to make it clearer):

//Initialize scaling exponents
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 0;
bS_Exp5 = 0;
hS_Exp2 = 0;
hS_Exp5 = 0;

//Adjust for decimal exponent
if (dExp >= 0) {
   dS_Exp2 += dExp;
   dS_Exp5 += dExp;
else {
   bS_Exp2 -= dExp;
   bS_Exp5 -= dExp;
   hS_Exp2 -= dExp;
   hS_Exp5 -= dExp;

//Adjust for binary exponent
if (bExp >= 0) {
    bS_Exp2 += bExp;
else {
   dS_Exp2 -= bExp;
   hS_Exp2 -= bExp;

//Adjust for half ulp exponent   
if (hExp >= 0) {
   hS_Exp2 += hExp;
else {
   dS_Exp2 -= hExp;
   bS_Exp2 -= hExp;

//Remove common power of two factor from all three scaled values
common_Exp2 = min(dS_Exp2, min(bS_Exp2, hS_Exp2));
dS_Exp2 -= common_Exp2; 
bS_Exp2 -= common_Exp2;
hS_Exp2 -= common_Exp2;

Here is how the code executes for our example (I’ll show the state of the exponent variables after adjusting for each exponent):

//After initializing scaling exponents
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 0;
bS_Exp5 = 0;
hS_Exp2 = 0;
hS_Exp5 = 0;

//After adjusting for decimal exponent (dExp = -49)
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 49;
bS_Exp5 = 49;
hS_Exp2 = 49;
hS_Exp5 = 49;

//After adjusting for binary exponent (bExp = -201)
dS_Exp2 = 201;
dS_Exp5 = 0;
bS_Exp2 = 49;
bS_Exp5 = 49;
hS_Exp2 = 250;
hS_Exp5 = 49;

//After adjusting for half ulp exponent (hExp = -202)
dS_Exp2 = 403;
dS_Exp5 = 0;
bS_Exp2 = 251;
bS_Exp5 = 49;
hS_Exp2 = 250;
hS_Exp5 = 49;

//After removing common power of two factor of 2^250
dS_Exp2 = 153;
dS_Exp5 = 0;
bS_Exp2 = 1;
bS_Exp5 = 49;
hS_Exp2 = 0;
hS_Exp5 = 49;

After this code executes, the powers are multiplied out as big integers and used to create big integers dS, bS, and hS.

When d, b, and h are already integers

When all three exponents are nonnegative, d, b, and h start out as integer values. In this case, they are scaled down, by their common power of two factor.

Testing the Inequality In Code

In strtod(), the checking centers around a comparison of |db| (actually, |bd|, which is equivalent) to h, which results in three outcomes: |db| < h, |db| = h, or |db| > h. The special cases are worked into the first two outcomes. (Java’s FloatingDecimal class does similar checks, but it is structured differently.)

Regarding the required check at a power of two boundary (comparing db to –h/2), strtod() scales db by two — to ensure that all values remain integers. This gives the equivalent comparison of 2(db) to –h. (In many cases, h is divisible by two; strtod() could just as easily divide h by two instead of multiplying db by two. Java’s FloatingDecimal class does the former.)

Adjusting b

strtod() makes these checks in a loop, and adjusts b when |db| > h.

strtod()’s Optimization Called “Bigcomp”

strtod() has an optimization called “bigcomp”, which limits the big integer value of d to 40 decimal digits. For decimal inputs of 40 significant digits or less, all 40 are used for d; for decimal inputs greater than 40 digits, only the first 18 are used. Compensating for this truncation of d leads to a more complicated test procedure. (This optimization is on by default; you have to #define variable “NO_STRTOD_BIGCOMP” to turn it off.)




  1. Hi Rick,

    For case 3, are there some testcases match d < b and b is a power of 2 ? I can't create such a testcase to test my code.

  2. @Aman,

    Here’s an example of how to generate a testcase:

    Start with a power of two, like 2-10. An “upper ULP” is 2-62 (the value of the 53rd significant bit), so a “lower ULP” is 2-63. Halfway between the lower ULP and b is 2-64. Pick d between 2-10 – 2-64 = 0.0009765624999999999457898913757247782996273599565029144287109375 and 2-10 = 0.0009765625.

    Let me know how it turns out.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

(Cookies must be enabled to leave a reduces spam.)

Copyright © 2008-2024 Exploring Binary

Privacy policy

Powered by WordPress