strtod()’s Initial Decimal to Floating-Point Approximation

David Gay’s strtod() function does decimal to floating-point conversion using both IEEE double-precision floating-point arithmetic and arbitrary-precision integer arithmetic. For some inputs, a simple IEEE floating-point calculation suffices to produce the correct result; for other inputs, a combination of IEEE arithmetic and arbitrary-precision arithmetic is required. In the latter case, IEEE arithmetic is used to calculate an approximation to the correct result, which is then refined using arbitrary-precision arithmetic. In this article, I’ll describe the approximation calculation, which is based on a form of binary exponentiation.

Overview of the Approximation Calculation

Conceptually, the approximation calculation involves two floating-point values: the significant digits of the decimal number, expressed as an integer, and a corresponding power of ten. For example, for the decimal string 163.118762e+109, the calculation is 163118762 × 10^103; for the decimal string 8.453127e-67, the calculation is 8453127 × 10^-73. If there are more than 16 significant digits, only the first 16 are used. The decimal string 6.2187331579177550499956283e+100, for example, would be treated as 6.218733157917755e+100, making the calculation 6218733157917755 × 10^85.
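
The adjusted exponent is simply the decimal string’s exponent minus its count of significant digits after the decimal point. Here is a minimal sketch of that adjustment; the function name adjustedExponent is my own, for illustration, and not part of strtod():

/* A decimal string whose kept significant digits form the integer D,
   with fractionDigits of them after the decimal point and exponent e,
   represents D x 10^(e - fractionDigits). */
int adjustedExponent(int e, int fractionDigits)
{
 return e - fractionDigits; /* 163.118762e+109: 109 - 6 = 103;
                               8.453127e-67: -67 - 6 = -73 */
}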

The powers of ten can be very small or very large. For double-precision floating-point (I’m considering normal numbers only), they range from 10^-323 (e.g., for DBL_MIN = 2.2250738585072014e-308) to 10^308 (e.g., for 1e+308). The goal is to compute them efficiently.

Right-to-Left Binary Exponentiation

An efficient way to compute powers of ten is to use right-to-left binary exponentiation. Conceptually, the process is this: given a power b^n to compute, write n as a sum of powers of two; then, using the laws of exponents, rewrite b^n as a product of b raised to each of those powers of two. For example:

10^103 = 10^(1 + 2 + 4 + 32 + 64) = 10^1 * 10^2 * 10^4 * 10^32 * 10^64.

Evaluating this expression requires only ten multiplications — six to compute the powers of ten 10^2 through 10^64 (by successive squaring), and four to multiply the five required powers together. If 10^103 were computed naively, it would require 102 multiplications.

As an example for computing negative exponents, consider

10^-73 = 10^-(1 + 8 + 64) = 10^-1 * 10^-8 * 10^-64.

This requires eight multiplications — six to compute the powers of ten 10^-2 through 10^-64, and two to multiply the three required powers together. The naive calculation would do 72 multiplications.
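
To make the process concrete, here is a minimal sketch of right-to-left binary exponentiation in C, computing the successive squares on the fly; this is my own illustration, not code from strtod(). Note that 10^-1 has no exact double representation, so the negative case accumulates extra rounding error, one reason to prefer the table-based approach described next:

double pow10_binary(int n)
{
 int neg = n < 0;
 double result = 1.0;
 double square = neg ? 0.1 : 10.0;  /* 10^-1 or 10^1 (0.1 is inexact) */

 if (neg)
   n = -n;
 while (n > 0)
   {
    if (n & 1)        /* this power of two is part of the exponent */
      result *= square;
    square *= square; /* successive squaring: 10^1, 10^2, 10^4, ... */
    n >>= 1;
   }
 return result;
}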

Required Set of Powers

For the range of exponents of double-precision values, the required sets of “binary powers” (powers of ten raised to powers of two) for right-to-left binary exponentiation are

  • {10^-1, 10^-2, 10^-4, 10^-8, 10^-16, 10^-32, 10^-64, 10^-128, 10^-256} for negative exponents, and
  • {10^1, 10^2, 10^4, 10^8, 10^16, 10^32, 10^64, 10^128, 10^256} for positive exponents.

Instead of computing these powers for each conversion, they can be computed once and stored in a table. This reduces the number of multiplications further. In the case of our examples, 10^103 would now require only four multiplications, and 10^-73 would require only two.
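
Here is a table-driven variant of the sketch above, where the squares are looked up instead of computed; again, this is my own illustration, not strtod()’s code:

static const double posTens[] =
  {1e1, 1e2, 1e4, 1e8, 1e16, 1e32, 1e64, 1e128, 1e256};
static const double negTens[] =
  {1e-1, 1e-2, 1e-4, 1e-8, 1e-16, 1e-32, 1e-64, 1e-128, 1e-256};

double pow10_table(int n)
{
 const double* tens = n < 0 ? negTens : posTens;
 double result = 1.0;
 int i;

 if (n < 0)
   n = -n;
 for (i = 0; n > 0; i++, n >>= 1)  /* walk the bits, right to left */
   if (n & 1)
     result *= tens[i];            /* tens[i] holds 10^(+/- 2^i) */
 return result;
}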

The Approximation Calculation in strtod()

The approximation calculation in strtod() is based on the calculations I just described, but there are a few differences. The significant digits and power of ten are not built up in two separate variables; instead, one variable is initialized with the significant digits, and then the power of ten is factored in incrementally. This allows for a two-phase calculation: the first phase is a fast-path-like calculation that handles any component power of ten between 10^-15 and 10^15; the second phase uses binary exponentiation to finish the computation of the power of ten, for component binary powers 10^-16 and smaller and 10^16 and greater.

In the first phase of the calculation, only positive powers of ten, 10^1 to 10^15, are used. When negative powers of ten 10^-1 to 10^-15 are called for, the calculation is changed to the equivalent calculation that divides the significant digits by the corresponding positive power of ten. A separate table contains each of the powers of ten from 10^1 to 10^15, which are looked up directly. The one division or multiplication replaces the up to three multiplications that might be needed for binary exponentiation (10^1 * 10^2 * 10^4 * 10^8). This is not only potentially more efficient, but also reduces the number of possibly inexact operations (operations in which the result must be rounded).

Here are two examples: 163.118762e+109, which is parsed as 163118762 × 10^103, is calculated as 163118762 * 10^7 * 10^32 * 10^64; 8.453127e-67, which is parsed as 8453127 × 10^-73, is calculated as 8453127 / 10^9 * 10^-64.

The Calculation In Code

I’ve taken strtod()’s approximation code and distilled it into my own C program. The main difference between my code and strtod() is that strtod() handles overflow (for values near DBL_MAX) and underflow (for values that are near DBL_MIN or are subnormal). I’ve also renamed a few variables and made a few other small changes to make the code more readable.

/* Prototype for the auxiliary parsing function, which is not shown;
   it returns its results through pointers, as described in the notes
   below (the exact signature is my assumption) */
void parseDecimalApprox(char* decimal, int* sign,
                        long long* sigDigits16, int* exponent);

double strtod_approx(char* decimal)
{
 int exponent, sign, i;
 long long sigDigits16;

 /* Powers of ten 10^0 to 10^15 for the first phase (1e0 is just a
    placeholder; see notes below) */
 static const double fastPosTens[] = {
   1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
   1e10, 1e11, 1e12, 1e13, 1e14, 1e15};
 /* "Binary" powers of ten for the second phase */
 static const double binaryPosTens[] = {
   1e16, 1e32, 1e64, 1e128, 1e256};
 static const double binaryNegTens[] = {
   1e-16, 1e-32, 1e-64, 1e-128, 1e-256};

 double d;

 /* Get the sign, the first 16 significant digits (as an integer),
    and the power of ten exponent */
 parseDecimalApprox(decimal,&sign,&sigDigits16,&exponent);

 d = sigDigits16;
 if (sign)
   d = -d;

 if (exponent > 0)
   {
    if ((i = exponent & 0xF))    /* low four bits: 10^1 to 10^15 */
      d *= fastPosTens[i];
    if ((exponent >>= 4))        /* remaining bits: binary powers */
      for(i = 0; exponent > 0; i++, exponent >>= 1)
        if (exponent & 1)
          d *= binaryPosTens[i];
   }
 else
   if (exponent < 0)
     {
      exponent = -exponent;
      if ((i = exponent & 0xF))  /* divide by 10^1 to 10^15 */
        d /= fastPosTens[i];
      if ((exponent >>= 4))      /* multiply by 10^-16 and beyond */
        for(i = 0; exponent > 0; i++, exponent >>= 1)
          if (exponent & 1)
            d *= binaryNegTens[i];
     }

 return d;
}

Notes

  • This function calls an auxiliary function I wrote called parseDecimalApprox(), which I’ve not shown. It returns the sign of the number, the value of its first 16 significant digits, and its power of ten exponent. (A rough sketch of such a parser appears after these notes.)
  • There are three tables of powers of ten, all precomputed at compile time:
    • fastPosTens: This holds the 15 exactly representable powers of ten needed for the first phase of the calculation (1e0 is not used — it’s just a placeholder).
    • binaryPosTens: This holds the five positive powers of ten used for binary exponentiation.
    • binaryNegTens: This holds the five negative powers of ten used for binary exponentiation.
  • If the exponent is zero, no power of ten is factored in (it would be 10^0 = 1).
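
For reference, here is a rough sketch of what parseDecimalApprox() might look like. This is my reconstruction from the description above, not the author’s code; it assumes a well-formed decimal string and does no error checking:

#include <stdlib.h> /* for atoi() */

/* Hypothetical reconstruction of the unshown parsing helper */
void parseDecimalApprox(char* decimal, int* sign,
                        long long* sigDigits16, int* exponent)
{
 long long digits = 0;
 int nSig = 0, decExp = 0, inFraction = 0;

 *sign = (*decimal == '-');
 if (*decimal == '-' || *decimal == '+')
   decimal++;

 for (; *decimal && *decimal != 'e' && *decimal != 'E'; decimal++)
   {
    if (*decimal == '.')
      inFraction = 1;
    else if (nSig == 0 && *decimal == '0')
      {
       if (inFraction)
         decExp--;         /* leading zeros after the point */
      }
    else if (nSig < 16)    /* keep at most 16 significant digits */
      {
       digits = digits*10 + (*decimal - '0');
       nSig++;
       if (inFraction)
         decExp--;         /* kept fraction digit */
      }
    else if (!inFraction)
      decExp++;            /* truncated integer digit */
   }

 if (*decimal)             /* 'e' or 'E' */
   decExp += atoi(decimal + 1);

 *sigDigits16 = digits;
 *exponent = decExp;
}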

With the algorithm realized in code, you can see why this way of computing the power of ten is called right-to-left binary exponentiation. Converting the exponent to a sum of powers of two is the same as converting it to binary; in code, the exponent simply is a binary integer. The bits of the exponent are accessed from right to left — least significant to most significant — using the right shift operator; a 1-bit means the corresponding power of two is part of the exponent. The exponent will be at most 9 bits long.
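
As a usage example (assuming strtod_approx() and its helper are in the same file), this small driver, which I wrote for illustration, runs the two example strings through the function and prints the results alongside the C library’s strtod():

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
 const char* tests[] = {"163.118762e+109", "8.453127e-67"};
 size_t i;

 for (i = 0; i < sizeof tests / sizeof tests[0]; i++)
   printf("%-16s -> approx = %.17g, strtod = %.17g\n",
          tests[i], strtod_approx((char*)tests[i]),
          strtod(tests[i], NULL));
 return 0;
}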

Analysis

There are at most five floating-point multiplications and divisions in this calculation: three multiplications plus one multiplication or division for the worst case powers of ten, and one multiplication to combine the significant digits and the power of ten. If the significant digits start out inexact — either because they are truncated to 16 digits, or they are 16 digits (truncated or otherwise) but represent a value greater than 2^53 − 1 — that’s a maximum of six inexact operations.

Using my code above, I measured approximations as far as 10 ULPs off; 1.00431469722921494e-140 is one example. (This is consistent with my analysis of David Gay’s code, where the maximum difference I found was 11 ULPs. Why the difference? It could be due to the code that handles underflow and overflow, or it could be because I didn’t run the same exact tests in both cases.)

Contrast this with my quick and dirty decimal to floating-point conversion program, which does a multiplication or division for every digit. Using that program, I found an example that was 14 ULPs off.
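
To quantify such differences, here is a minimal sketch of an ULP-distance function. It assumes IEEE 754 binary64 doubles and two finite values of the same sign, whose consecutive values have consecutive bit patterns:

#include <stdint.h>
#include <string.h>

/* ULP distance between two finite, same-sign doubles: for IEEE 754
   binary64, consecutive doubles have consecutive bit patterns, so the
   difference of the bit patterns counts the ULPs between them. */
long long ulpDistance(double a, double b)
{
 int64_t ia, ib;
 memcpy(&ia, &a, sizeof ia);
 memcpy(&ib, &b, sizeof ib);
 return ia > ib ? ia - ib : ib - ia;
}

/* Example use, comparing the approximation to the library conversion:
   ulpDistance(strtod_approx("1.00431469722921494e-140"),
               strtod("1.00431469722921494e-140", NULL))          */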

Discussion

A Program That Converts Decimal Values Uses Decimal Values?

The code converts decimal numbers to floating-point, and yet it itself needs decimal numbers converted to floating-point — isn’t that circular? Not really. The compiler’s implementation would have to be different, or at least modified so that the decimal constants are specified in some other way — say as hexadecimal floating-point constants.
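
For instance, C99’s hexadecimal floating-point constants specify the bits directly and bypass decimal conversion entirely. 10^16, which is exactly representable in a double, could be written this way (the value is my own hand-verified illustration):

/* 10^16 = 0x2386F26FC10000 exactly, normalized as a C99 hexadecimal
   floating-point constant */
static const double tenTo16 = 0x1.1c37937e08p+53;  /* == 1e16 */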

This raises another issue: what if the compiler converts these decimal literals incorrectly? After all, this is part of a larger conversion routine whose goal is correct conversion. My guess is that it does not matter. The worst that can happen is that the approximation will be a little less accurate; it will ultimately be corrected — by the arbitrary-precision based reconciliation process in strtod(). (For what it’s worth, I checked — Visual C++ got all of the conversions right.)

Java’s FloatingDecimal Class

The same approximation calculation is done in the doubleValue() method of Java’s FloatingDecimal class, which is modeled on David Gay’s strtod().

Using Division For the Negative Exponent Case

For the negative exponent case of the binary exponentiation calculation, I tried using division (by positive powers of ten) instead of multiplication (by negative powers of ten). The division-based calculation ran a little slower, which was expected. But interestingly, the approximations were a little less accurate, on average. I don’t know why.
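
The change amounts to replacing the negative-exponent loop inside strtod_approx() with division by the positive binary powers, something like this fragment (my paraphrase of the experiment, not code from strtod()):

      if ((exponent >>= 4))
        for(i = 0; exponent > 0; i++, exponent >>= 1)
          if (exponent & 1)
            d /= binaryPosTens[i];  /* was: d *= binaryNegTens[i] */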

Comments

  1. Two questions.

    1) Why use multiplication by negative powers of ten, why not division by positive powers of ten? Is it efficiency, accuracy or just arbitrary?

    2) David Gay’s code doesn’t always respect significant digits. For instance given the input “12340e-35” the initial approximation is calculated from 12340 and -35 not 1234 and -34. From my very limited testing sometimes this gives a less accurate initial approximation. Any thoughts?

  2. John,

    1) I had a similar question when I wrote this article. I had modified the code to do division instead of multiplication and found the initial approximations were slightly less accurate (and took longer to compute). I don’t know if that was the rationale behind it.

    Another possibility is that this is required to handle underflow. In the real code, strtod() scales 10^-256 (by 2^106) to, as the comment says, “avoid setting the underflow flag unnecessarily”.

    2) In a sense you’ve made the 0 significant by putting it there, so you get what you deserve 🙂 . On the other hand, there’s no reason why strtod() couldn’t strip it away. It’s probably not a very common case though. (If strtod() did check for insignificant trailing 0s it could add a new fast path case. For example, 12340e-23 is not computed as a fast path, but the equivalent 1234e-22 is.)
