For my recent search for short examples of double rounding errors in decimal to double to float conversions I wrote a Kotlin program to generate and test random decimal strings. While this was sufficient to find examples, I realized I could do a more direct search by generating only decimal strings with the underlying double rounding error bit patterns. I’ll show you the Java BigDecimal based Kotlin program I wrote for this purpose.
In my previous exploration of double rounding errors in decimal to float conversions I showed two decimal numbers that experienced a double rounding error when converted to float (single-precision) through an intermediate double (double-precision). I generated the examples indirectly by setting bit combinations that forced the error, using their corresponding exact decimal representations. As a result, the decimal numbers were long (55 digits each). Mark Dickinson derived a much shorter 17 digit example, but I hadn’t contemplated how to generate even shorter numbers — or whether they existed at all — until Per Vognsen wrote me recently to ask.
The easiest way for me to approach Per’s question was to search for examples, rather than try to find a way to construct them. As such, I wrote a simple Kotlin1 program to generate decimal strings and check them. I tested all float-range (including subnormal) decimal numbers of 9 digits or fewer, and tens of billions of random 10 to 17 digit float-range (normal only) numbers. I found example 7 to 17 digit numbers that, when converted to float through a double, suffer a double rounding error.
I’ve written about the formulas used to compute the number of decimal digits in a binary integer and the number of decimal digits in a binary fraction. In this article, I’ll use those formulas to determine the maximum number of digits required by the double-precision (double), single-precision (float), and quadruple-precision (quad) IEEE binary floating-point formats.
The maximum digit counts are useful if you want to print the full decimal value of a floating-point number (worst case format specifier and buffer size) or if you are writing or trying to understand a decimal to floating-point conversion routine (worst case number of input digits that must be converted).
The binary fraction 0.101 converts to the decimal fraction 0.625; the binary fraction 0.1010001 converts to the decimal fraction 0.6328125; the binary fraction 0.00111011011 converts to the decimal fraction 0.23193359375. In each of those examples, the binary fraction converts to a decimal fraction — that is, a terminating decimal representation — that has the same number of digits as the binary fraction has bits.
One digit per bit? We know that’s not true for binary integers. But it is true for binary fractions; every binary fraction of length n has a corresponding equivalent decimal fraction of length n.
This is the reason why you get all those “extra” digits when you print the full decimal value of an IEEE binary floating-point fraction, and why glibc strtod() and Visual C++ strtod() were once broken.
In a computer, decimal floating-point numbers are converted to binary floating-point numbers for calculation, and binary floating-point numbers are converted to decimal floating-point numbers for display or storage. In general, these conversions are inexact; they are rounded, and rounding is governed by the spacing of numbers in each set.
Floating-point numbers are unevenly spaced, and the spacing varies with the base of the number system. Binary floating-point numbers have power of two sized gaps that change size at power of two boundaries. Decimal floating-point numbers are similarly spaced, but with power of ten sized gaps changing size at power of ten boundaries. In this article, I will discuss the spacing of decimal floating-point numbers.
An IEEE 754 binary floating-point number is a number that can be represented in normalized binary scientific notation. This is a number like 1.00000110001001001101111 x 2-10, which has two parts: a significand, which contains the significant digits of the number, and a power of two, which places the “floating” radix point. For this example, the power of two turns the shorthand 1.00000110001001001101111 x 2-10 into this ‘longhand’ binary representation: 0.000000000100000110001001001101111.
The significands of IEEE binary floating-point numbers have a limited number of bits, called the precision; single-precision has 24 bits, and double-precision has 53 bits. The range of power of two exponents is also limited: the exponents in single-precision range from -126 to 127; the exponents in double-precision range from -1022 to 1023. (The example above is a single-precision number.)
Limited precision makes binary floating-point numbers discontinuous; there are gaps between them. Precision determines the number of gaps, and precision and exponent together determine the size of the gaps. Gap size is the same between consecutive powers of two, but is different for every consecutive pair.
(Updated June 22, 2015: added a tenth display form, “decimal integer times a power of ten”.)
In the strictest sense, converting a decimal number to binary floating-point means putting it in IEEE 754 format — a multi-byte structure composed of a sign field, an exponent field, and a significand field. Viewing it in this raw form (binary or hex) is useful, but there are other forms that are more enlightening.
I’ve written an online converter that takes a decimal number as input, converts it to floating-point, and then displays its exact floating-point value in ten forms (including the two raw IEEE forms). I will show examples of these forms in this article.
I’ve written about two implementations of decimal string to double-precision binary floating-point conversion: David Gay’s strtod(), and glibc’s strtod(). GCC, the GNU Compiler Collection, has yet another implementation; it uses it to convert decimal floating-point literals to double-precision. It is much simpler than David Gay’s and glibc’s implementations, but there’s a hitch: limited precision causes it to produce some incorrect conversions. Nonetheless, I wanted to explain how it works, since I’ve been studying it recently. (I looked specifically at the conversion of floating-point literals in C code, although the same code is used for other languages.)
I’ve discussed how David Gay’s strtod() function computes an initial floating-point approximation and then uses a loop to check and correct it, but I have not discussed how the correction is made. That is the last piece of the strtod() puzzle, and I will cover it in this article.
In decimal, “0.9 repeating”, or 0.9, equals 1. In binary, a similar thing is true: “0.1 repeating”, or 0.1, equals 1. I’ll show you three ways to prove it, using the three bicimal to fraction conversion algorithms I described recently.
Many new programmers become aware of binary floating-point after seeing their programs give odd results: “Why does my program print 0.10000000000000001 when I enter 0.1?”; “Why does 0.3 + 0.6 = 0.89999999999999991?”; “Why does 6 * 0.1 not equal 0.6?” Questions like these are asked every day, on online forums like stackoverflow.com.
The answer is that most decimals have infinite representations in binary. Take 0.1 for example. It’s one of the simplest decimals you can think of, and yet it looks so complicated in binary:
The bits go on forever; no matter how many of those bits you store in a computer, you will never end up with the binary equivalent of decimal 0.1.
In my article “Binary Division” I showed how binary long division converts a fraction to a repeating bicimal. In this article, I’ll show you a well-known procedure — what I call the subtraction method — to do the reverse: convert a repeating bicimal to a fraction.