Articles with the ‘Floating-point’ Tag

Incorrect Round-Trip Conversions in Visual C++

Paul Bristow, a Boost.Math library author and reader of my blog, recently alerted me to a problem he discovered many years ago in Visual C++: some double-precision floating-point values fail to round-trip through a stringstream as a 17-digit decimal string. Interestingly, the 17-digit strings that C++ generates are not the problem; they are correctly rounded. The problem is that the conversion of those strings to floating-point is sometimes incorrect, off by one binary ULP.

I’ve previously discovered that Visual Studio makes incorrect decimal to floating-point conversions, and that Microsoft is OK with it — at least based on their response to my now deleted bug report. But incorrect decimal to floating-point conversions in this context seems like a problem that needs fixing. When you serialize a double to a 17-digit decimal string, shouldn’t you get the same double back later? Apparently Microsoft doesn’t think so, because Paul’s bug report has also been deleted.

Continue reading …

The Shortest Decimal String That Round-Trips: Examples

The exact decimal equivalent of an arbitrary double-precision binary floating-point number is typically an unwieldy looking number, like this one:

0.1000000000000000055511151231257827021181583404541015625

In general, when you print a floating-point number, you don’t want to see all its digits; most of them are “garbage” in a sense anyhow. But how many digits do you need? You’d like a short string, yet you’d want it long enough so that it identifies the original floating-point number. A well-known result in computer science is that you need 17 significant decimal digits to identify an arbitrary double-precision floating-point number. If you were to round the exact decimal value of any floating-point number to 17 significant digits, you’d have a number that, when converted back to floating-point, gives you the original floating-point number; that is, a number that round-trips. For our example, that number is 0.10000000000000001.

But 17 digits is the worst case, which means that fewer digits — even as few as one — could work in many cases. The number required depends on the specific floating-point number. For our example, the short string 0.1 does the trick. This means that 0.1000000000000000055511151231257827021181583404541015625 and 0.10000000000000001 and 0.1 are the same, at least as far as their floating-point representations are concerned.

Continue reading …

How the Negative Powers of Ten and Two Are Interleaved

I showed how the positive powers of ten and two are interleaved, and said that the interleaving of the negative powers of ten and two is its mirror image. In this article, I will show you why, and prove that the same properties hold.

Powers of Ten and Two in Logarithmic Scale (Nonpositive Powers Highlighted)

Powers of Ten and Two in Logarithmic Scale (Nonpositive Powers Highlighted)

Continue reading …

How the Positive Powers of Ten and Two Are Interleaved

To truly understand decimal to binary and binary to decimal conversion, you should know how the powers of ten and the powers of two relate. In particular, you should know how they are interleaved. The interleaving explains why, for example, the number of bits required to represent an n-digit decimal integer varies. Consequently, it also explains why 9,007,199,254,740,991 (253 – 1) is representable in binary floating-point, and why 9,007,199,254,740,993 (253 + 1) is not.

Powers of Ten and Two in Logarithmic Scale (Nonnegative Powers Highlighted)

Powers of Ten and Two in Logarithmic Scale (Nonnegative Powers Highlighted)

In this article, I will discuss the interleaving of the positive powers of ten and two, and prove some properties of it. (The interleaving of the negative powers is the mirror image of the positive powers, centered around 100 = 20 = 1.)

Continue reading …

Number of Bits in a Decimal Integer

Every integer has an equivalent representation in decimal and binary. Except for 0 and 1, the binary representation of an integer has more digits than its decimal counterpart. To find the number of binary digits (bits) corresponding to any given decimal integer, you could convert the decimal number to binary and count the bits. For example, the two-digit decimal integer 29 converts to the five-digit binary integer 11101. But there’s a way to compute the number of bits directly, without the conversion.

Sometimes you want to know, not how many bits are required for a specific integer, but how many are required for a d-digit integer — a range of integers. A range of integers has a range of bit counts. For example, four-digit decimal integers require between 10 and 14 bits. For any d-digit range, you might want to know its minimum, maximum, or average number of bits. Those values can be computed directly as well.

Continue reading …

How strtod() Works (and Sometimes Doesn’t)

Numbers are entered into computers as strings of text. These strings are converted to binary, into the numeric form understood by the computer’s hardware. Numbers with a decimal point — numbers we think of as real numbers — are converted into a format called binary floating-point. The procedure that converts decimal strings to binary floating-point — IEEE double-precision binary floating-point in particular — goes by the name strtod(), which stands for string to double.

Converting decimal strings to doubles correctly and efficiently is a surprisingly complex task; fortunately, David M. Gay did this for us when he wrote this paper and released this code over 20 years ago. (He maintains this reference copy of strtod() to this day.) His code appears in many places, including in the Python, PHP, and Java programming languages, and in the Firefox, Chrome, and Safari Web browsers.

I’ve spent considerable time reverse engineering strtod(); neither the paper nor the code are easy reads. I’ve written articles about how each of its major pieces work, and I’ve discovered bugs (as have a few of my readers) along the way. This article ties all of my strtod() research together.

Continue reading …

Adjusting the Floating-Point Approximation in strtod()

I’ve discussed how David Gay’s strtod() function computes an initial floating-point approximation and then uses a loop to check and correct it, but I have not discussed how the correction is made. That is the last piece of the strtod() puzzle, and I will cover it in this article.

Continue reading …

Why 0.1 Does Not Exist In Floating-Point

Many new programmers become aware of binary floating-point after seeing their programs give odd results: “Why does my program print 0.10000000000000001 when I enter 0.1?”; “Why does 0.3 + 0.6 = 0.89999999999999991?”; “Why does 6 * 0.1 not equal 0.6?” Questions like these are asked every day, on online forums like stackoverflow.com.

The answer is that most decimals have infinite representations in binary. Take 0.1 for example. It’s one of the simplest decimals you can think of, and yet it looks so complicated in binary:

Decimal 0.1 In Binary ( To 1369 Places)

Decimal 0.1 In Binary ( To 1369 Places)

The bits go on forever; no matter how many of those bits you store in a computer, you will never end up with the binary equivalent of decimal 0.1.

Continue reading …

A Bug in the Bigcomp Function of David Gay’s strtod()

Last week, a reader of my blog, Geza Herman, told me about a bug he found in David Gay’s strtod() function. In random testing of decimal numbers nearly halfway between double-precision floating-point numbers, he discovered this 53-digit number, which converts incorrectly:

1.8254370818746402660437411213933955878019332885742187

As Geza noted, the problem is in the bigcomp() function, an optimization that kicks in for long decimal inputs. I traced his example through bigcomp() — I’ll show you what’s going on.

Continue reading …

Bigcomp: Deciding Truncated, Near Halfway Conversions

In my article “Using Integers to Check a Floating-Point Approximation,” I briefly mentioned “bigcomp,” an optimization strtod() uses to reduce big integer overhead when checking long decimal inputs. bigcomp does a floating-point to decimal conversion — right in the middle of a decimal to floating-point conversion mind you — to generate the decimal expansion of the number halfway between two target floating-point numbers. This decimal expansion is compared to the input decimal string, and the result of the comparison dictates which of the two target numbers is the correctly rounded result.

In this article, I’ll explain how bigcomp works, and when it applies. Also, I’ll talk briefly about its performance; my informal testing shows that, under the default setting, bigcomp actually worsens performance for some inputs.

Continue reading …

Using Integers to Check a Floating-Point Approximation

For decimal inputs that don’t qualify for fast path conversion, David Gay’s strtod() function does three things: first, it uses IEEE double-precision floating-point arithmetic to calculate an approximation to the correct result; next, it uses arbitrary-precision integer arithmetic (AKA big integers) to check if the approximation is correct; finally, it adjusts the approximation, if necessary. In this article, I’ll explain the second step — how the check of the approximation is done.

Continue reading …

strtod()’s Initial Decimal to Floating-Point Approximation

David Gay’s strtod() function does decimal to floating-point conversion using both IEEE double-precision floating-point arithmetic and arbitrary-precision integer arithmetic. For some inputs, a simple IEEE floating-point calculation suffices to produce the correct result; for other inputs, a combination of IEEE arithmetic and arbitrary-precision arithmetic is required. In the latter case, IEEE arithmetic is used to calculate an approximation to the correct result, which is then refined using arbitrary-precision arithmetic. In this article, I’ll describe the approximation calculation, which is based on a form of binary exponentiation.

Continue reading …