For decimal inputs that don’t qualify for fast path conversion, David Gay’s strtod() function does three things: first, it uses IEEE double-precision floating-point arithmetic to calculate an approximation to the correct result; next, it uses arbitrary-precision integer arithmetic (AKA big integers) to check if the approximation is correct; finally, it adjusts the approximation, if necessary. In this article, I’ll explain the second step — how the check of the approximation is done.

## Overview

The goal of David Gay’s strtod() is to perform efficient, correctly rounded decimal to floating point conversion. To that end, it minimizes the overhead due to arbitrary-precision arithmetic, which is unavoidable for some conversions. In particular, strtod() avoids *division* of big integers, as would be required by a full-blown big integer algorithm. Instead of dividing, it multiplies, adds, subtracts, and compares.

The key to the algorithm is the floating-point approximation. It can be compared to the original decimal input to see if it’s too small, correct, or too big (it will be within 10 ULPs or so of correct). The approximation can be adjusted accordingly.

(My discussion will deal only with normal double-precision numbers; strtod() handles subnormal numbers as well.)

## Correctly Rounded Conversions

For a floating-point number *b* to be considered the correctly rounded conversion of a decimal number *d*, it must be the *closest* floating-point number to *d*. That definition sounds simple enough, but there’s a problem: the meaning of “closest” can depend on the value of *b*. In any case, the tests for closeness are all based on a measure called a ULP — a binary (for our purposes) *unit in the last place*. A ULP is the gap between adjacent floating-point numbers.
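For a concrete feel for ULPs, Python's math.ulp (not something strtod() uses, just a convenient way to inspect doubles) shows the gap at a few magnitudes:

```python
import math

# The gap between adjacent doubles depends on magnitude: it doubles at
# each increasing power-of-two boundary.
print(math.ulp(1.0))   # 2^-52 = 2.220446049250313e-16
print(math.ulp(2.0))   # 2^-51, twice as large
print(math.ulp(0.5))   # 2^-53, half as large
```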

I’ll present the tests as three separate cases, although in code they must be considered together.

### Case 1: *d* is less than one-half ULP away from *b*

If *d* is less than one-half of a ULP from *b*, then *b* is the correctly rounded conversion of *d* (this test is ambiguous when *b* is a power of two — I’ll explain below). Mathematically, this is stated as

*b* – *h* < *d* < *b* + *h* ,

where *h* = 1/2·*u*, and where *u* equals one ULP.

Here’s what the relationship looks like:

Equivalently, this relationship can be written as

–*h* < *d* – *b* < *h* ,

or

|*d* – *b*| < *h* .
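The Case 1 test can be sketched with exact rational arithmetic; here is a minimal Python illustration, with Fraction standing in for strtod()'s big integer machinery:

```python
from fractions import Fraction
import math

# Case 1 check for the decimal input 0.1 against the nearest double.
# Fraction represents both the decimal string and the double exactly.
d = Fraction("0.1")                # the decimal input, exactly
b = Fraction(0.1)                  # the double approximation, exactly
h = Fraction(math.ulp(0.1)) / 2    # one-half ULP of the double 0.1

print(abs(d - b) < h)  # True: the double 0.1 is the correctly rounded result
```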

### Case 2: *d* is one-half ULP away from *b*

If *d* is one-half of a ULP away from *b*, then *b* is the correctly rounded conversion of *d* only if *b*’s least significant bit is 0 (this test is ambiguous when *b* is a power of two — I’ll explain below). This is called round-half-to-even rounding.

The relationship is written as

|*d* – *b*| = *h* .
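Round-half-to-even can be seen directly by constructing exact halfway cases. Here is a small Python sketch near 1.5, where one ULP is 2^{-52} (again using exact rationals rather than strtod()'s integers):

```python
from fractions import Fraction

# When d is exactly one-half ULP from b, ties go to the neighbor whose
# least significant bit is 0. Near 1.5, one-half ULP is 2^-53.
h = Fraction(2) ** -53

d1 = Fraction(3, 2) + h             # halfway between 1.5 and the next double
print(float(d1) == 1.5)             # True: 1.5's last significand bit is 0

d2 = Fraction(3, 2) + 3 * h         # halfway between the next two doubles
print(float(d2) == 1.5 + 2 ** -51)  # True: rounds up to the even neighbor
```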

### Case 3: *b* is a power of two

If *b* is a power of two, two different ULP values come into play, making our half ULP test ambiguous. One ULP *below* *b* is half as much as one ULP *above* *b*, since a ULP doubles at each increasing power of two boundary. This means that if *b* is a power of two, it is the correctly rounded conversion of *d* only if *d* is no more than one-half ULP above *b* or no more than one-quarter of that same (upper) ULP below *b*; here is the inequality:

–*h*/2 ≤ *d* – *b* ≤ *h* .

This is what it looks like:

We use the “≤” signs in the inequality because *b* is the correctly rounded result even if *d* = *b* – *h*/2 or *d* = *b* + *h*. This is because the significand fraction bits of a (normal) power of two are all zero, so its least significant bit is 0 and round-half-to-even resolves both ties in *b*’s favor.
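A quick Python sketch of the boundary behavior at *b* = 1.0, where the upper ULP is 2^{-52} and so *h* = 2^{-53} (exact rationals again, just for illustration):

```python
from fractions import Fraction

# At b = 1.0 (a power of two), the ULP below is half the ULP above.
h = Fraction(2) ** -53   # one-half of the upper ULP

print(float(Fraction(1) - h / 2) == 1.0)  # d = b - h/2 still rounds to 1.0
print(float(Fraction(1) + h) == 1.0)      # d = b + h still rounds to 1.0
```

Both endpoints are ties, and both resolve to 1.0 because its least significant bit is 0.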

## Comparing Using Big Integers

To explain how the algorithm works, I will use this example: *d* = 1.7864e-45 and *b* = 0x1.465a72e467d88p-149 (the latter is expressed as a hexadecimal floating-point constant). Here’s how *d* and *b* are related:

The diagram shows that *b* is the closest floating-point number to *d* — but how do we check that programmatically, using binary arithmetic? We need to apply the above test: |*d* – *b*| < *h*. (I know a priori this is not a special case — I wanted to keep my example simple.) But there’s a problem: how do we represent *d*? Answer: as a big integer. If we also represent *b* and *h* as big integers, such that they are proportional to *d*, we can apply the test given by the inequality.

I’ve broken the process into four steps:

### Step 1: Represent the significant digits of *d* and *b* as integers

The first step is to represent *d* as an integer times a power of ten, and *b* as an integer times a power of two:

*d* = *dInt* × 10^{dExp}

*b* = *bInt* × 2^{bExp}

*d* was already put in this form when the floating-point approximation *b* was computed (but for the purposes of the comparison, *all* digits of *d* are used, not just the first 16). *b* can be put in this form simply, by accessing its underlying floating-point representation: we “unnormalize” its significand into an integer by decreasing its power of two exponent. Since we represent the significand as a 53-bit integer, we subtract 52 from the power of two exponent.

For our example, *d* = 1.7864e-45 is represented as 17864 × 10^{-49}, and *b* = 0x1.465a72e467d88p-149, which in binary scientific notation is

1.0100011001011010011100101110010001100111110110001 x 2^{-149} ,

is represented (in decimal numerals) as 5741268244528520 × 2^{-201}.
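In Python, this unnormalization can be sketched with math.frexp (strtod() does it in C by bit manipulation; this is just an illustration):

```python
import math

# Unnormalize b into bInt * 2^bExp, with bInt a 53-bit integer.
# frexp returns m in [0.5, 1) with b = m * 2^e, so shifting m up 53 bits
# (an exact operation here) yields the 53-bit significand.
b = float.fromhex("0x1.465a72e467d88p-149")
m, e = math.frexp(b)
bInt = int(m * 2 ** 53)   # exact: m has at most 53 significant bits
bExp = e - 53

print(bInt)   # 5741268244528520
print(bExp)   # -201
```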

### Step 2: Represent *h* as a power of two

Since *bInt* is a 53-bit integer, *u* = 2^{bExp}, and *h* is half that: 2^{bExp-1}. For consistency in my discussion, I’ll substitute the variable *hExp* for *bExp*-1, giving

*h* = 2^{hExp} .

### Step 3: Scale the inequality

To complete the conversion to integers, we need to get rid of any negative powers; we do this by scaling both sides of the inequality. Let’s look at the inequality again:

|*d* – *b*| < *h* .

Here are the three values as we expressed them above:

*d* = *dInt* × 10^{dExp}

*b* = *bInt* × 2^{bExp}

*h* = 2^{hExp}

For efficiency, the scaling process will factor out common powers of two from the three powers; to facilitate this, we rewrite 10^{dExp} as 2^{dExp} x 5^{dExp}; now we have

*d* = *dInt* × 2^{dExp} × 5^{dExp}

*b* = *bInt* × 2^{bExp}

*h* = 2^{hExp}

If any of the exponents *dExp*, *bExp*, or *hExp* are negative, we make them nonnegative by scaling. To reflect any scaling, let’s rename *d*, *b*, and *h* to *dS*, *bS*, and *hS*, respectively. The scaled inequality is now written

|*dS* – *bS*| < *hS* ,

and the initial values of *dS*, *bS*, and *hS* are

*dS* = *dInt* × 2^{dExp} × 5^{dExp}

*bS* = *bInt* × 2^{bExp}

*hS* = 2^{hExp}

To show how scaling works, let’s continue with our example above; here are the initial values of the variables:

*dS* = 17864 × 2^{-49} × 5^{-49}

*bS* = 5741268244528520 × 2^{-201}

*hS* = 2^{-202}

In this example, all three exponents — *dExp*, *bExp*, and *hExp* — are negative; we scale the values with a common factor *S* that makes them all nonnegative. The smallest factor that does this is *S* = 2^{202} × 5^{49}; our variables then become

*dS* = 17864 × 2^{153}

*bS* = 5741268244528520 × 2^{1} × 5^{49}

*hS* = 5^{49}

Multiplied out, these are

*dS* = 203970822259994138521801764465966248930731085529088

*bS* = 203970822259994122305215569213032722473144531250000

*hS* = 17763568394002504646778106689453125
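These values are easy to reproduce with Python's built-in big integers (a quick check of the arithmetic, not part of strtod()):

```python
# Multiply out the scaled factorizations from the example.
dS = 17864 * 2 ** 153
bS = 5741268244528520 * 2 ** 1 * 5 ** 49
hS = 5 ** 49

print(dS)  # 203970822259994138521801764465966248930731085529088
print(bS)  # 203970822259994122305215569213032722473144531250000
print(hS)  # 17763568394002504646778106689453125
```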

### Step 4: Apply the test

Now all we have to do is check the inequality, |*dS* – *bS*| < *hS*:

|*dS* – *bS*| = 16216586195252933526457586554279088, which is less than *hS* = 17763568394002504646778106689453125. This proves that *b* is the correctly rounded conversion of *d*.

(In the diagram for this example above, I’ve drawn the relationship of *d* and *b* proportionally: (*dS* – *bS*)/*hS* is approximately 0.91, which means that *d* is about 0.91*h* above *b*.)
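Putting steps 1 through 4 together, here is a compact Python sketch; the function name and structure are mine, it handles only the Case 1 test, and it assumes *b* is a normal double:

```python
import math

# d is given as dInt * 10^dExp; returns True if |d - b| < h, using only
# big integer multiplies, subtracts, and compares (no division).
def check_case1(dInt, dExp, b):
    m, e = math.frexp(b)
    bInt, bExp = int(m * 2 ** 53), e - 53   # step 1: b = bInt * 2^bExp
    hExp = bExp - 1                          # step 2: h = 2^hExp
    e2 = min(dExp, bExp, hExp)               # step 3: clear negative powers,
    e5 = min(dExp, 0)                        #   factoring out common 2s and 5s
    dS = dInt * 2 ** (dExp - e2) * 5 ** (dExp - e5)
    bS = bInt * 2 ** (bExp - e2) * 5 ** (-e5)
    hS = 2 ** (hExp - e2) * 5 ** (-e5)
    return abs(dS - bS) < hS                 # step 4: apply the test

print(check_case1(17864, -49, float.fromhex("0x1.465a72e467d88p-149")))  # True
```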

## Computing the Scaling Exponents In Code

In code, the scaling is done by operating on native integer variables representing the exponents of the powers of two and five in each of the three scaled values. After the exponents are created, common factors of two are removed to reduce the size of the resulting big integers. Here is how I coded it in C (this is essentially how it’s done in David Gay’s strtod() and Java’s FloatingDecimal class; I’ve renamed the variables and structured it a bit differently to make it clearer):

```c
//Initialize scaling exponents
dS_Exp2 = 0; dS_Exp5 = 0;
bS_Exp2 = 0; bS_Exp5 = 0;
hS_Exp2 = 0; hS_Exp5 = 0;

//Adjust for decimal exponent
if (dExp >= 0) {
   dS_Exp2 += dExp;
   dS_Exp5 += dExp;
} else {
   bS_Exp2 -= dExp;
   bS_Exp5 -= dExp;
   hS_Exp2 -= dExp;
   hS_Exp5 -= dExp;
}

//Adjust for binary exponent
if (bExp >= 0) {
   bS_Exp2 += bExp;
} else {
   dS_Exp2 -= bExp;
   hS_Exp2 -= bExp;
}

//Adjust for half ulp exponent
if (hExp >= 0) {
   hS_Exp2 += hExp;
} else {
   dS_Exp2 -= hExp;
   bS_Exp2 -= hExp;
}

//Remove common power of two factor from all three scaled values
common_Exp2 = min(dS_Exp2, min(bS_Exp2, hS_Exp2));
dS_Exp2 -= common_Exp2;
bS_Exp2 -= common_Exp2;
hS_Exp2 -= common_Exp2;
```

Here is how the code executes for our example (I’ll show the state of the exponent variables after adjusting for each exponent):

```c
//After initializing scaling exponents
dS_Exp2 = 0;   dS_Exp5 = 0;
bS_Exp2 = 0;   bS_Exp5 = 0;
hS_Exp2 = 0;   hS_Exp5 = 0;

//After adjusting for decimal exponent (dExp = -49)
dS_Exp2 = 0;   dS_Exp5 = 0;
bS_Exp2 = 49;  bS_Exp5 = 49;
hS_Exp2 = 49;  hS_Exp5 = 49;

//After adjusting for binary exponent (bExp = -201)
dS_Exp2 = 201; dS_Exp5 = 0;
bS_Exp2 = 49;  bS_Exp5 = 49;
hS_Exp2 = 250; hS_Exp5 = 49;

//After adjusting for half ulp exponent (hExp = -202)
dS_Exp2 = 403; dS_Exp5 = 0;
bS_Exp2 = 251; bS_Exp5 = 49;
hS_Exp2 = 250; hS_Exp5 = 49;

//After removing common power of two factor of 2^250
dS_Exp2 = 153; dS_Exp5 = 0;
bS_Exp2 = 1;   bS_Exp5 = 49;
hS_Exp2 = 0;   hS_Exp5 = 49;
```

After this code executes, the powers are multiplied out as big integers and used to create big integers *dS*, *bS*, and *hS*.
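A direct Python transliteration of the exponent bookkeeping, run on the example's exponents, reproduces the final state shown in the trace:

```python
# Mirror of the C scaling code, applied to the running example.
dExp, bExp, hExp = -49, -201, -202
dS_Exp2 = dS_Exp5 = bS_Exp2 = bS_Exp5 = hS_Exp2 = hS_Exp5 = 0

# Adjust for decimal exponent
if dExp >= 0:
    dS_Exp2 += dExp; dS_Exp5 += dExp
else:
    bS_Exp2 -= dExp; bS_Exp5 -= dExp
    hS_Exp2 -= dExp; hS_Exp5 -= dExp

# Adjust for binary exponent
if bExp >= 0:
    bS_Exp2 += bExp
else:
    dS_Exp2 -= bExp; hS_Exp2 -= bExp

# Adjust for half ulp exponent
if hExp >= 0:
    hS_Exp2 += hExp
else:
    dS_Exp2 -= hExp; bS_Exp2 -= hExp

# Remove common power of two factor
common = min(dS_Exp2, bS_Exp2, hS_Exp2)
dS_Exp2 -= common; bS_Exp2 -= common; hS_Exp2 -= common

print(dS_Exp2, dS_Exp5)  # 153 0
print(bS_Exp2, bS_Exp5)  # 1 49
print(hS_Exp2, hS_Exp5)  # 0 49
```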

### When *d*, *b*, and *h* are already integers

When all three exponents are nonnegative, *d*, *b*, and *h* start out as integer values. In this case, they are scaled *down*, by their common power of two factor.

## Testing the Inequality In Code

In strtod(), the checking centers around a comparison of |*d* – *b*| (actually, |*b* – *d*|, which is equivalent) to *h*, which results in three outcomes: |*d* – *b*| < *h*, |*d* – *b*| = *h*, or |*d* – *b*| > *h*. The special cases are worked into the first two outcomes. (Java’s FloatingDecimal class does similar checks, but it is structured differently.)

Regarding the required check at a power of two boundary (comparing *d* – *b* to –*h*/2), strtod() scales *d* – *b* by two — to ensure that all values remain integers. This gives the equivalent comparison of 2(*d* – *b*) to –*h*. (In many cases, *h* is divisible by two; strtod() could just as easily divide *h* by two instead of multiplying *d* – *b* by two. Java’s FloatingDecimal class takes the division approach.)
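Here is a sketch of the three-outcome test with the boundary case folded in; the function and its parameters are my own framing, not strtod()'s actual code:

```python
def correctly_rounded_outcome(dS, bS, hS, lsb_even, power_of_two):
    """Is b the correctly rounded conversion of d, given the scaled values?"""
    diff2 = 2 * (dS - bS)            # 2(d - b), kept integral throughout
    if power_of_two and diff2 < 0:
        return -diff2 <= hS          # Case 3, lower side: -(h/2) <= d - b
    if abs(diff2) < 2 * hS:          # Case 1: |d - b| < h
        return True
    return abs(diff2) == 2 * hS and lsb_even   # Case 2: ties go to even

# The running example: not a boundary case, and well inside half a ULP.
dS = 17864 * 2 ** 153
bS = 5741268244528520 * 2 * 5 ** 49
hS = 5 ** 49
print(correctly_rounded_outcome(dS, bS, hS, lsb_even=True, power_of_two=False))  # True
```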

### Adjusting *b*

strtod() makes these checks in a loop, and adjusts *b* when |*d* – *b*| > *h*.

## strtod()’s Optimization Called “Bigcomp”

strtod() has an optimization called “bigcomp”, which limits the big integer value of *d* to 40 decimal digits. For decimal inputs of 40 significant digits or less, all of the digits are used for *d*; for decimal inputs greater than 40 digits, only the first 18 are used. Compensating for this truncation of *d* leads to a more complicated test procedure. (This optimization is on by default; you have to #define the symbol NO_STRTOD_BIGCOMP to turn it off.)

## References

- David Gay’s dtoa.c (see the strtod() function).
- David Gay’s Technical Report “Correctly Rounded Binary-Decimal and Decimal-Binary Conversions”.
- William D. Clinger’s paper “How to Read Floating Point Numbers Accurately” (see ‘algorithmR’).
- FloatingDecimal.java (see the doubleValue() method).
- Python’s version of David Gay’s dtoa.c.
