How GCC Converts Decimal Literals to Floating-Point

I’ve written about two implementations of decimal string to double-precision binary floating-point conversion: David Gay’s strtod(), and glibc’s strtod(). GCC, the GNU Compiler Collection, has yet another implementation; it uses it to convert decimal floating-point literals to double-precision. It is much simpler than David Gay’s and glibc’s implementations, but there’s a hitch: limited precision causes it to produce some incorrect conversions. Nonetheless, I wanted to explain how it works, since I’ve been studying it recently. (I looked specifically at the conversion of floating-point literals in C code, although the same code is used for other languages.)

(Update 12/3/13: GCC has replaced this conversion code with a correct, MPFR-based algorithm.)


gcc’s conversion code is written in C. It lives in file gcc/real.c (and real.h), primarily in the real_from_string() function. Unlike the two strtod() implementations, it doesn’t use big integers. It uses “biggish” floats; that is, its own multiple-precision floating-point implementation, which it calls the real type.

gcc’s conversion algorithm is very simple: All digits of the input are converted to an integer in real format, which is then multiplied or divided by a positive power of ten (if necessary), also in real format. For example:

  • 1.789e6 is converted by multiplying 1789 by 103.
  • 1.23e-5 is converted by dividing 123 by 107.
  • 456.833e2 is converted by dividing 456833 by 10.
  • 9007199254740992.456 is converted by dividing 9007199254740992456 by 103.

gcc initializes a real n with the digits of the input using the standard algorithm to convert decimal integer strings to binary: multiply by ten, add a digit, repeat. For conversions involving a power of ten, the power of ten is computed using binary exponentiation. Powers of ten of the form 102i are computed on the fly as needed, but once one computed, are stored in a table.

For conversions requiring multiplication by a power of ten, gcc multiplies n, in succession, by the appropriate powers of ten 102i. For example, multiplication by 1013 is done by multiplying n by 101, 104, and then 108. For conversions requiring division by a power of ten, the power of ten is created in its own real, d. gcc then divides n by d using its own division routine.

The conversion produces a real result that can be transformed into one of several hardware floating-point formats, double-precision being one of them (single-precision and extended-precision are also supported). For doubles, any result with more than 53 significant bits is rounded by looking at bits 54 and beyond; the round-to-nearest/ties-to-even rule is used. (gcc handles subnormal numbers, which for doubles have 52 bits or less; it rounds accordingly.)

GCC’s Real Type

gcc reals are conceptually similar to IEEE floats or doubles, yet there are several differences:

  • The significand has greater precision.
  • The exponent has a wider range.
  • The significand has no implicit leading 1 bit.
  • The significand is normalized into the range [0.5, 1.0); in other words, the bits are all to the right of the implied radix point, and the first bit is a 1.

The integer 1789, for example, is stored as 0.11011111101 x 211.

gcc implements its own operations on reals — add, multiply, subtract, divide, shift, compare, etc. It does no rounding during these operations.

Limited Precision

The precision p of a real is 160 bits on 32-bit systems and 192 bits on 64-bit systems (I will be assuming 192-bit precision for the rest of this article.) This is not enough to do all conversions correctly.

For starters, this means that only inputs of up to 57 or 58 digits can be represented exactly; the integer constructed from its digits is limited to 192 bits (log10(2192) is approximately 57.8). Also, of the stored powers of ten 102i, only those up to 1064 are exact. This is because 564 requires only 149 bits, but 5128 requires 298 bits.

The Division Process

The gcc real type implements division as a series of steps of shifts and/or subtracts. Each step produces one bit of the quotient. It is essentially binary long division. One difference in gcc’s algorithm is that it shifts the dividend (the numerator of the fraction) left at each step, rather than shift the divisor (the denominator of the fraction) right. Also, gcc operates on fixed-sized fields (192 bits), so the subtracts involve the entire dividend, rather than a working prefix. The two methods are logically equivalent.

The radix points (power of two exponents) are ignored during division. The significands of the operands are treated as unsigned multi-precision integers (each is three 64-bit words — 192/64). The exponent of the result is set at the end, based on the exponents of the operands and whether the result needs normalization.

Here is the division algorithm, which I’ve written in C-like pseudocode:

quotient = 0
for i = 191 to 0
  if i < 191 //Not first step
    msb = most significant bit of dividend
    dividend << 1
  if msb == 1 || divisor <= dividend
     dividend -= divisor
     quotient[i] = 1

A left shift (<<) shifts the most significant bit out and shifts a 0 into the least significant bit; the variable msb captures this bit so it’s not lost. It is needed to ensure that the subtraction works. Conceptually, you may need 193 bits in the dividend for the subtraction to work, so that’s why the msb is saved. On the other hand, the actual subtraction is performed with only 192 bits. This means that, when msb is set, the subtraction may be done even if the shifted dividend is less than the divisor.

So how does the subtraction work out when this happens? It works automatically due to the modular nature of unsigned integer subtraction. Consider subtraction of 8-bit unsigned integers. For example, subtract 240 from 30: the answer is 46. This differs from the signed answer by -28 = -256. But think of what the value of msb is at this precision: 256. So logically we’re subtracting 240 from the 9-bit value 256 + 30 = 286, which is 46.


For input 9.1, gcc creates the fraction 91/10, or 1011011/1010 in binary. First, let’s see how you would divide it by hand, using binary long division:

1010 )1011011.000000000000

(I generated 16 bits of the quotient in order to correspond to my illustration of gcc’s calculations below. But when doing this by hand you would stop when trying to generate the 12th bit, recognizing the repeat of the subtraction 10000 – 1010; you would write the answer as 1001.00011.)

gcc’s Conversion

To keep this example manageable, I’m going to pretend gcc uses a precision of 16 bits. In this case, it would represent the dividend as 0.1011011000000000 x 27, and the divisor as 0.1010000000000000 x 24. It initializes the quotient to 0 and sets its power of two exponent (to be ignored for now) to 7 – 4 + 1 = 4. For the division, it ignores the exponents and treats the significands as integers. Here are the steps (each entry shows the result of a step):

gcc’s Computation of 91/10
Step Divisor Dividend Quotient Operations
0 1010000000000000 1011011000000000 0000000000000000 (initial)
1 1010000000000000 0001011000000000 1000000000000000 sub
2 1010000000000000 0010110000000000 1000000000000000 shift
3 1010000000000000 0101100000000000 1000000000000000 shift
4 1010000000000000 0001000000000000 1001000000000000 shift, sub
5 1010000000000000 0010000000000000 1001000000000000 shift
6 1010000000000000 0100000000000000 1001000000000000 shift
7 1010000000000000 1000000000000000 1001000000000000 shift
8 1010000000000000 0110000000000000 1001000100000000 shift, sub
9 1010000000000000 0010000000000000 1001000110000000 shift, sub
10 1010000000000000 0100000000000000 1001000110000000 shift
11 1010000000000000 1000000000000000 1001000110000000 shift
12 1010000000000000 0110000000000000 1001000110010000 shift, sub
13 1010000000000000 0010000000000000 1001000110011000 shift, sub
14 1010000000000000 0100000000000000 1001000110011000 shift
15 1010000000000000 1000000000000000 1001000110011000 shift
16 1010000000000000 0110000000000000 1001000110011001 shift, sub

(A highlighted bit in the quotient corresponds to the step in which it was set; in other words, when there was a subtraction.)

The resulting quotient has its most significant bit set, so it doesn’t need normalization; its initial exponent stands. In gcc’s real format (modified to 16 bits of precision) the quotient is 0.1001000110011001 x 24, representing 1001.000110011001. (You can verify this with my binary calculator.) The dividend at the final step is non-zero, indicating a remainder; the division is inexact.

Step 8 is an example of unsigned subtraction. The most significant bit of the dividend was set but was shifted out, resulting in the dividend (0000000000000000, or 0) being less than the divisor (1010000000000000, or 40960). Unsigned integer subtraction produces 0 – 40960 = 24576. This is the correct result, because if you take the most significant bit into account, the dividend is really 216 + 0 = 65536 + 0 = 65536, making the real subtraction 65536 – 40960 = 24576.

Is It Slower Than strtod()?

I would guess that, for most inputs, gcc’s conversion is slower than both strtod()s. For conversions that require high precision, gcc could be faster; but that’s a bit of an “apples and oranges” comparison, since gcc’s conversions may be incorrect.



  1. Why goes gcc do this? Having an alternate conversion routine that is less accurate and possibly slower seems rather pointless. Unless strtod() was a major bottleneck for compilation times it seems like this is a bad idea, even if it is harmless in most cases.

Comments are closed.

Copyright © 2008-2024 Exploring Binary

Privacy policy

Powered by WordPress