I’ve written about two implementations of decimal string to double-precision binary floating-point conversion: David Gay’s strtod(), and glibc’s strtod(). GCC, the GNU Compiler Collection, has yet another implementation; it uses it to convert decimal floating-point literals to double-precision. It is much simpler than David Gay’s and glibc’s implementations, but there’s a hitch: limited precision causes it to produce some incorrect conversions. Nonetheless, I wanted to explain how it works, since I’ve been studying it recently. (I looked specifically at the conversion of floating-point literals in C code, although the same code is used for other languages.)

(**Update 12/3/13:** GCC has replaced this conversion code with a correct, MPFR-based algorithm.)

## Overview

gcc’s conversion code is written in C. It lives in file gcc/real.c (and real.h), primarily in the real_from_string() function. Unlike the two strtod() implementations, it doesn’t use big integers. It uses “biggish” floats; that is, its own multiple-precision floating-point implementation, which it calls the *real* type.

gcc’s conversion algorithm is very simple: All digits of the input are converted to an integer in *real* format, which is then multiplied or divided by a positive power of ten (if necessary), also in *real* format. For example:

- 1.789e6 is converted by multiplying 1789 by 10^{3}.
- 1.23e-5 is converted by dividing 123 by 10^{7}.
- 456.833e2 is converted by dividing 456833 by 10.
- 9007199254740992.456 is converted by dividing 9007199254740992456 by 10^{3}.

gcc initializes a *real* *n* with the digits of the input using the standard algorithm for converting decimal integer strings to binary: multiply by ten, add a digit, repeat. For conversions involving a power of ten, the power of ten is computed using binary exponentiation. Powers of ten of the form 10^{2^i} are computed on the fly as needed, but once computed, are stored in a table.

For conversions requiring multiplication by a power of ten, gcc multiplies *n*, in succession, by the appropriate powers of ten 10^{2^i}. For example, multiplication by 10^{13} is done by multiplying *n* by 10^{1}, 10^{4}, and then 10^{8}. For conversions requiring *division* by a power of ten, the power of ten is created in its own *real*, *d*. gcc then divides *n* by *d* using its own division routine.

The conversion produces a *real* result that can be transformed into one of several hardware floating-point formats, double-precision being one of them (single-precision and extended-precision are also supported). For doubles, any result with more than 53 significant bits is rounded by looking at bits 54 and beyond; the round-to-nearest/ties-to-even rule is used. (gcc handles subnormal numbers, which for doubles have 52 bits or less; it rounds accordingly.)

## GCC’s Real Type

gcc *reals* are conceptually similar to IEEE floats or doubles, yet there are several differences:

- The significand has greater precision.
- The exponent has a wider range.
- The significand has no implicit leading 1 bit.
- The significand is normalized into the range [0.5, 1.0); in other words, the bits are all to the right of the implied radix point, and the first bit is a 1.

The integer 1789, for example, is stored as 0.11011111101 x 2^{11}.

gcc implements its own operations on *reals* — add, multiply, subtract, divide, shift, compare, etc. It does no rounding during these operations.

### Limited Precision

The precision *p* of a *real* is 160 bits on 32-bit systems and 192 bits on 64-bit systems. (I will assume 192-bit precision for the rest of this article.) This is not enough to do all conversions correctly.

For starters, this means that only inputs of up to 57 or 58 digits can be represented exactly; the integer constructed from the input's digits is limited to 192 bits (log_{10}(2^{192}) is approximately 57.8). Also, of the stored powers of ten 10^{2^i}, only those up to 10^{64} are exact. This is because 5^{64} requires only 149 bits, but 5^{128} requires 298 bits. (A power of ten 10^{n} = 5^{n} · 2^{n}; the 2^{n} factor is absorbed by the binary exponent, so only 5^{n} consumes significand bits.)

## The Division Process

The gcc *real* type implements division as a series of shift and subtract steps; each step produces one bit of the quotient. It is essentially binary long division. One difference in gcc’s algorithm is that it shifts the dividend (the numerator of the fraction) left at each step, rather than shifting the divisor (the denominator of the fraction) right. Also, gcc operates on fixed-size fields (192 bits), so the subtractions involve the entire dividend, rather than a working prefix. The two methods are logically equivalent.

The radix points (power of two exponents) are ignored during division. The significands of the operands are treated as unsigned multi-precision integers (each is three 64-bit words — 192/64). The exponent of the result is set at the end, based on the exponents of the operands and whether the result needs normalization.

Here is the division algorithm, which I’ve written in C-like pseudocode:

```
quotient = 0
for i = 191 down to 0
    if i < 191                      // not the first step
        msb = most significant bit of dividend
        dividend = dividend << 1
    if msb == 1 || divisor <= dividend
        dividend = dividend - divisor
        quotient[i] = 1
```

A left shift (<<) shifts the most significant bit out and a 0 into the least significant bit; the variable *msb* captures the outgoing bit so it’s not lost. It is needed to ensure that the subtraction works: conceptually, the dividend may need 193 bits for the subtraction to work out, which is why *msb* is saved. The actual subtraction, however, is performed with only 192 bits. This means that, when *msb* is set, the subtraction is done even if the shifted dividend is less than the divisor.

So how does the subtraction work out when this happens? It works automatically, due to the modular nature of unsigned integer subtraction. Consider 8-bit unsigned integers, and subtract 240 from 30: the answer is 46, since the result wraps modulo 2^{8} = 256. This differs from the signed answer, -210, by 256. But 256 is exactly the place value of *msb* at this precision, so logically we’re subtracting 240 from the 9-bit value 256 + 30 = 286, which gives 46.

## Example

For input 9.1, gcc creates the fraction 91/10, or 1011011/1010 in binary. First, let’s see how you would divide it by hand, using binary long division:

```
        1001.000110011001...
       ---------------------
1010 ) 1011011.000000000000
       1010
       ----
          1011
          1010
          ----
             10000
              1010
             -----
              1100
              1010
             -----
             10000
              1010
             -----
              1100
              1010
             -----
             10000
                .
                .
                .
```

(I generated 16 bits of the quotient in order to correspond to my illustration of gcc’s calculations below. But when doing this by hand you would stop when trying to generate the 12th bit, recognizing the repeat of the subtraction 10000 – 1010; you would write the answer as 1001.00011, with the trailing 0011 repeating.)

### gcc’s Conversion

To keep this example manageable, I’m going to pretend gcc uses a precision of 16 bits. In this case, it would represent the dividend as 0.1011011000000000 x 2^{7}, and the divisor as 0.1010000000000000 x 2^{4}. It initializes the quotient to 0 and sets its power of two exponent (to be ignored for now) to 7 – 4 + 1 = 4. For the division, it ignores the exponents and treats the significands as integers. Here are the steps (each entry shows the result of a step):

| Step | Divisor | Dividend | Quotient | Operations |
|---|---|---|---|---|
| 0 | 1010000000000000 | 1011011000000000 | 0000000000000000 | (initial) |
| 1 | 1010000000000000 | 0001011000000000 | 1000000000000000 | sub |
| 2 | 1010000000000000 | 0010110000000000 | 1000000000000000 | shift |
| 3 | 1010000000000000 | 0101100000000000 | 1000000000000000 | shift |
| 4 | 1010000000000000 | 0001000000000000 | 1001000000000000 | shift, sub |
| 5 | 1010000000000000 | 0010000000000000 | 1001000000000000 | shift |
| 6 | 1010000000000000 | 0100000000000000 | 1001000000000000 | shift |
| 7 | 1010000000000000 | 1000000000000000 | 1001000000000000 | shift |
| 8 | 1010000000000000 | 0110000000000000 | 1001000100000000 | shift, sub |
| 9 | 1010000000000000 | 0010000000000000 | 1001000110000000 | shift, sub |
| 10 | 1010000000000000 | 0100000000000000 | 1001000110000000 | shift |
| 11 | 1010000000000000 | 1000000000000000 | 1001000110000000 | shift |
| 12 | 1010000000000000 | 0110000000000000 | 1001000110010000 | shift, sub |
| 13 | 1010000000000000 | 0010000000000000 | 1001000110011000 | shift, sub |
| 14 | 1010000000000000 | 0100000000000000 | 1001000110011000 | shift |
| 15 | 1010000000000000 | 1000000000000000 | 1001000110011000 | shift |
| 16 | 1010000000000000 | 0110000000000000 | 1001000110011001 | shift, sub |
(A bit is set in the quotient exactly at the steps marked “sub” — in other words, when there was a subtraction.)

The resulting quotient has its most significant bit set, so it doesn’t need normalization; its initial exponent stands. In gcc’s real format (modified to 16 bits of precision) the quotient is 0.1001000110011001 x 2^{4}, representing 1001.000110011001. (You can verify this with my binary calculator.) The dividend at the final step is non-zero, indicating a remainder; the division is inexact.

Step 8 is an example of unsigned subtraction. The most significant bit of the dividend was set but was shifted out, resulting in the dividend (0000000000000000, or 0) being less than the divisor (1010000000000000, or 40960). Unsigned integer subtraction produces 0 – 40960 = 24576. This is the correct result, because if you take the most significant bit into account, the dividend is really 2^{16} + 0 = 65536 + 0 = 65536, making the real subtraction 65536 – 40960 = 24576.

## Is It Slower Than strtod()?

I would guess that, for most inputs, gcc’s conversion is slower than both strtod()s. For conversions that require high precision, gcc could be faster; but that’s a bit of an “apples and oranges” comparison, since gcc’s conversions may be incorrect.

Why does gcc do this? Having an alternate conversion routine that is less accurate and possibly slower seems rather pointless. Unless strtod() was a major bottleneck for compilation times, this seems like a bad idea, even if it is harmless in most cases.

Bruce,

I don’t know the history, but check out the bug report “real.c rounding not perfect”. It has discussion about why using David Gay’s or glibc’s implementation would not be appropriate. It seems the consensus would be to use MPFR, if they were to fix it.