## College Notebook: When I Was Taught Floating-Point

In my article “Floating-Point Questions Are Endless on stackoverflow.com” I showed examples of the many questions asked that demonstrate lack of knowledge of the most basic property of floating-point — that not all decimal values are representable in binary. In response to a reader’s comment on my article I wrote:

It would be interesting to know how it’s taught today (it’s been a very long time since I was taught it). I can’t imagine though that the person teaching it wouldn’t say — within a sentence or two of saying “floating-point” — that it “can’t represent all decimal numbers accurately”.

That prompted me to look through my box of thirty plus year old college (undergraduate) notebooks. I found notebooks for four classes in which I was taught floating-point. The notes from three of those classes confirm what I thought — that we were warned early of the decimal/binary mismatch. But in the first class of the four — the beginner’s class — it’s less clear what we were told. I’ll show you images of the relevant excerpts from my notes. (I notice I had some elements of cursive in my handwriting back then.)

## GLIBC strtod() Incorrectly Converts 2^-1075

A reader of my blog, Water Qian, reported a bug to me after reading my article “How GLIBC’s strtod() Works”. I recently tested strtod(), which was fixed to do correct rounding in glibc 2.17; I had found no incorrect conversions.

Water tested the conversion of 2^-1075 (in retrospect, an obvious corner case I should have tried) and found that it converted incorrectly to 0x0.0000000000001p-1022. That’s 2^-1074, the smallest double-precision value. Because 2^-1075 lies exactly halfway between 0 and 2^-1074, it should have converted to 0 under round-to-nearest/ties-to-even rounding.

(Update 11/13/13: This bug has been fixed for version 2.19.)
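The halfway case is easy to check with exact arithmetic. The following is a minimal Java sketch, not glibc code; the class and method names (`HalfwayToZero`, `twoToMinus1075`, `isExactlyHalfway`) are made up for illustration. It builds the exact decimal value of 2^-1075, confirms that doubling it gives exactly 2^-1074, and then hands the decimal string to Java's own parser. What the last line prints depends on the Java runtime's conversion routine, so no output is claimed for it here.

```java
import java.math.BigDecimal;

public class HalfwayToZero {
    // Exact decimal value of 2^-1075 (hypothetical helper).
    // 1 / 2^1075 has a finite decimal expansion, so the exact divide() succeeds.
    static BigDecimal twoToMinus1075() {
        return BigDecimal.ONE.divide(BigDecimal.valueOf(2).pow(1075));
    }

    // True if 2^-1075 doubled equals 2^-1074, i.e. it is midway between 0 and 2^-1074.
    static boolean isExactlyHalfway() {
        BigDecimal half = twoToMinus1075();
        BigDecimal smallest = new BigDecimal(Double.MIN_VALUE); // exactly 2^-1074
        return half.add(half).compareTo(smallest) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isExactlyHalfway());

        // Ties-to-even picks the neighbor with the even significand:
        // 0 (significand 0) rather than 2^-1074 (significand 1).
        System.out.println(Double.parseDouble(twoToMinus1075().toString()));
    }
}
```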

## Floating-Point Questions Are Endless on stackoverflow.com

For years I’ve followed, through RSS, floating-point related questions on stackoverflow.com. Every day it seems there is a question like “why does 19.24 plus 6.95 equal 26.189999999999998?” I decided to track these questions, to see if my sense of their frequency was correct. I found that, in the last 40 days, there were 18 such questions. That’s not one per day, but still — a lot!
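For readers who want to see where the stray digits come from, here is a small Java sketch of that sum (the class name `SumExample` and its `sum` method are just for illustration). `new BigDecimal(double)` shows the exact binary value each literal is stored as, and neither one is exactly 19.24 or 6.95.

```java
import java.math.BigDecimal;

public class SumExample {
    // The sum as computed in double-precision.
    static double sum() {
        return 19.24 + 6.95;
    }

    public static void main(String[] args) {
        // Exact values actually stored for the two literals.
        System.out.println(new BigDecimal(19.24));
        System.out.println(new BigDecimal(6.95));

        // The double sum is not the double nearest to 26.19.
        System.out.println(sum());
        System.out.println(sum() == 26.19); // false

        // Decimal arithmetic on the strings gives the expected answer.
        System.out.println(new BigDecimal("19.24").add(new BigDecimal("6.95"))); // 26.19
    }
}
```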

## How GCC Converts Decimal Literals to Floating-Point

I’ve written about two implementations of decimal string to double-precision binary floating-point conversion: David Gay’s strtod(), and glibc’s strtod(). GCC, the GNU Compiler Collection, has yet another implementation; it uses it to convert decimal floating-point literals to double-precision. It is much simpler than David Gay’s and glibc’s implementations, but there’s a hitch: limited precision causes it to produce some incorrect conversions. Nonetheless, I wanted to explain how it works, since I’ve been studying it recently. (I looked specifically at the conversion of floating-point literals in C code, although the same code is used for other languages.)

## How GLIBC’s strtod() Works

The string to double function, strtod(), converts decimal numbers represented as strings into binary numbers represented in IEEE double-precision floating-point. Many programming environments implement their string to double conversions with David Gay’s strtod(); glibc, the GNU C Library, does not.

Like David Gay’s strtod(), glibc’s strtod() produces correctly rounded conversions. But it uses a simpler algorithm: it doesn’t have a floating-point only fast path for small inputs; it doesn’t compute a floating-point approximation to the correct result; it doesn’t check the approximation with big integers; it doesn’t adjust the approximation and recheck it; it doesn’t have an optimization for really long inputs. Instead, it handles all inputs uniformly, converting their integer and fractional parts separately, using only big integers. I will give an overview of how glibc’s strtod() works.

## GCC Conversions Are Incorrect, Architecture or Otherwise

Recently I wrote about my retesting of the gcc C compiler’s string to double conversions and how it appeared that its incorrect conversions were due to an architecture-dependent bug. My examples converted incorrectly on 32-bit systems, but worked on 64-bit systems — at least most of them. I decided to dig into gcc’s source code and trace its execution, and I found the architecture dependency I was looking for. But I found more than that: due to limited precision, gcc will do incorrect conversions on any system. I’ve constructed an example to demonstrate this.

## Correctly Rounded Conversions in GCC and GLIBC

Three years ago I wrote about how the gcc C compiler and the glibc strtod() function do some decimal to double-precision floating-point conversions incorrectly. I recently retested their conversions and found out two things: glibc’s strtod() has been fixed, and gcc’s conversion code, while still unfixed, produces correct conversions on some machines.

## A Better Way to Convert Integers in David Gay’s strtod()

A reader of my blog, John Harrison, suggested a way to improve how David Gay’s strtod() converts large integers to doubles. Instead of approximating the conversion and going through the correction loop to check and correct it — the signature processes of strtod() — he proposed doing the conversion directly from a binary big integer representation of the decimal input. strtod() does lots of processing with big integers, so the facility to do this is already there.

I implemented John’s idea in a copy of strtod(). The path for large integers is so much simpler and faster that I can’t believe it never occurred to me to do it this way. It’s also surprising that strtod() never implemented it this way to begin with.
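To give a feel for the idea, here is a rough Java sketch of converting a decimal integer directly from its big integer form. It is not John's code or the change made to strtod() (which is written in C); the class `BigIntToDouble` and the method `toDouble` are hypothetical, and the sketch assumes a non-negative integer string. It keeps the top 53 bits of the big integer, rounds on the discarded bits using ties-to-even, and scales by the appropriate power of two.

```java
import java.math.BigInteger;

public class BigIntToDouble {
    // Converts a non-negative decimal integer string to the nearest double,
    // rounding ties to even (illustrative helper, not strtod() itself).
    static double toDouble(String decimalDigits) {
        BigInteger n = new BigInteger(decimalDigits);
        int bits = n.bitLength();
        if (bits <= 53) {
            return (double) n.longValue(); // fits in the significand: exact
        }
        int shift = bits - 53;                               // low bits to discard
        BigInteger top = n.shiftRight(shift);                // top 53 bits
        BigInteger rem = n.subtract(top.shiftLeft(shift));   // discarded bits
        BigInteger half = BigInteger.ONE.shiftLeft(shift - 1);
        long sig = top.longValue();
        int cmp = rem.compareTo(half);
        if (cmp > 0 || (cmp == 0 && (sig & 1L) == 1L)) {
            sig++; // round up; 2^53 is still exactly representable
        }
        // Exact scaling by a power of two; overflows to Infinity for huge inputs.
        return Math.scalb((double) sig, shift);
    }

    public static void main(String[] args) {
        System.out.println(toDouble("9007199254740993"));               // 2^53 + 1, a tie
        System.out.println(toDouble("123456789012345678901234567890"));
    }
}
```

There is no approximation and no correction loop: the rounding decision comes straight from the bits that don't fit.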

## Floating-Point Is So Insane Even a Ten-Year Old Can See It

I’ve been teaching my sons Java by watching the Udacity course “Introduction to Programming: Problem Solving with Java” with them. In lesson four, we were introduced to the vagaries of floating-point arithmetic. The instructor talks about how this calculation

```java
double pennies = 4.35 * 100;
```

produces 434.99999999999994 as its output.

I told my kids “it has to do with binary numbers” and “I write about this all the time on my blog”. Now of course I know this trips people up, but it really struck me to see the reaction firsthand. (I have long since forgotten my own first reaction.) It really hit home that thousands of new programmers are exposed to this every day.
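The pennies calculation can be explored a little further with a short sketch. This is not from the Udacity course; the class `Pennies` and its helper methods are invented here to show why the result matters in practice: truncating the product loses a penny, while rounding it (or doing the arithmetic in decimal) does not.

```java
import java.math.BigDecimal;

public class Pennies {
    // Truncates the double product, as a cast to an integer type would.
    static long truncated(double dollars) {
        return (long) (dollars * 100);
    }

    // Rounds the double product to the nearest whole number of pennies.
    static long rounded(double dollars) {
        return Math.round(dollars * 100);
    }

    // Does the conversion in decimal, starting from the string.
    static long exact(String dollars) {
        return new BigDecimal(dollars).movePointRight(2).longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(new BigDecimal(4.35)); // exact stored value, slightly below 4.35
        System.out.println(4.35 * 100);           // 434.99999999999994
        System.out.println(truncated(4.35));      // 434
        System.out.println(rounded(4.35));        // 435
        System.out.println(exact("4.35"));        // 435
    }
}
```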

## Incorrect Round-Trip Conversions in Visual C++

Paul Bristow, a Boost.Math library author and reader of my blog, recently alerted me to a problem he discovered many years ago in Visual C++: some double-precision floating-point values fail to round-trip through a stringstream as a 17-digit decimal string. Interestingly, the 17-digit strings that C++ generates are not the problem; they are correctly rounded. The problem is that the conversion of those strings to floating-point is sometimes incorrect, off by one binary ULP.

I’ve previously discovered that Visual Studio makes incorrect decimal to floating-point conversions, and that Microsoft is OK with it, at least based on their response to my now deleted bug report. But incorrect decimal to floating-point conversions in this context seem like a problem that needs fixing. When you serialize a double to a 17-digit decimal string, shouldn’t you get the same double back later? Apparently Microsoft doesn’t think so, because Paul’s bug report has also been deleted.

## The Shortest Decimal String That Round-Trips: Examples

The exact decimal equivalent of an arbitrary double-precision binary floating-point number is typically an unwieldy looking number, like this one:

0.1000000000000000055511151231257827021181583404541015625

In general, when you print a floating-point number, you don’t want to see all its digits; most of them are “garbage” in a sense anyhow. But how many digits do you need? You’d like a short string, yet you’d want it long enough so that it identifies the original floating-point number. A well-known result in computer science is that you need 17 significant decimal digits to identify an arbitrary double-precision floating-point number. If you were to round the exact decimal value of any floating-point number to 17 significant digits, you’d have a number that, when converted back to floating-point, gives you the original floating-point number; that is, a number that round-trips. For our example, that number is 0.10000000000000001.

But 17 digits is the worst case, which means that fewer digits — even as few as one — could work in many cases. The number required depends on the specific floating-point number. For our example, the short string 0.1 does the trick. This means that 0.1000000000000000055511151231257827021181583404541015625 and 0.10000000000000001 and 0.1 are the same, at least as far as their floating-point representations are concerned.
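A brute-force way to see this is to round the exact decimal value to 1, 2, 3, ... significant digits and stop at the first string that converts back to the original double. The Java sketch below does that; it is only an illustration, not the algorithm any particular library uses to print doubles, and the names `ShortestRoundTrip`, `seventeen`, and `shortest` are made up. It assumes a finite input and relies on `Double.parseDouble` being correctly rounded.

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class ShortestRoundTrip {
    // The exact decimal value of d, rounded to 17 significant digits.
    static String seventeen(double d) {
        return new BigDecimal(d).round(new MathContext(17)).toString();
    }

    // The fewest-digit rounding of d's exact value that converts back to d.
    static String shortest(double d) {
        BigDecimal exact = new BigDecimal(d); // exact binary value, in decimal
        for (int digits = 1; digits < 17; digits++) {
            String candidate = exact.round(new MathContext(digits)).toString();
            if (Double.parseDouble(candidate) == d) {
                return candidate;
            }
        }
        return seventeen(d); // 17 digits is the worst case
    }

    public static void main(String[] args) {
        System.out.println(new BigDecimal(0.1)); // all the digits
        System.out.println(seventeen(0.1));      // 0.10000000000000001
        System.out.println(shortest(0.1));       // 0.1
    }
}
```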

## How the Negative Powers of Ten and Two Are Interleaved

I showed how the positive powers of ten and two are interleaved, and said that the interleaving of the negative powers of ten and two is its mirror image. In this article, I will show you why, and prove that the same properties hold.

Copyright © 2008-2022 Exploring Binary
