(Updated June 22, 2015: added a tenth display form, “decimal integer times a power of ten”.)
In the strictest sense, converting a decimal number to binary floating-point means putting it in IEEE 754 format — a multi-byte structure composed of a sign field, an exponent field, and a significand field. Viewing it in this raw form (binary or hex) is useful, but there are other forms that are more enlightening.
I’ve written an online converter that takes a decimal number as input, converts it to floating-point, and then displays its exact floating-point value in ten forms (including the two raw IEEE forms). I will show examples of these forms in this article.
After discovering that App Inventor represents numbers in floating-point, I wanted to see how it handled some edge case decimal/floating-point conversions. In one group of tests, I gave it numbers that were converted to floating-point incorrectly in other programming languages (I include the famous PHP and Java numbers). In another group of tests, I gave it numbers that, when converted to floating-point and back, demonstrate the rounding algorithm used when printing halfway cases. It turns out that App Inventor converts all examples correctly, and prints numbers using the round-half-to-even rule.
I am exploring App Inventor, an Android application development environment for novice programmers. I am teaching it to my kids, as an introductory step towards “real” app development. While playing with it I wondered: are its numbers implemented in decimal? No, they aren’t. They are implemented in double-precision binary floating-point. I put together a few simple block programs to demonstrate this, and to expose the usual floating-point “gotchas”.
GCC was recently fixed so that its decimal to floating-point conversions are done correctly; it now calls the MPFR function mpfr_strtofr() instead of using its own algorithm. However, GCC still does its conversion in two steps: first it converts to an intermediate precision (160 or 192 bits), and then it rounds that result to a target precision (53 bits for double-precision floating-point). That is double rounding — how does it avoid double rounding errors? It uses round-to-odd rounding on the intermediate result.
This fix, which will be available in version 4.9, scraps the old algorithm and replaces it with a call to MPFR function mpfr_strtofr(). I tested the fix on version 4.8.1, replacing its copy of gcc/real.c with the fixed one. I found no incorrect conversions.
While running some of GCC’s string to double conversion testcases I discovered a bug in David Gay’s strtod(): it converts some very small subnormal numbers incorrectly. Unlike numbers 2-1075 or smaller, which should convert to zero under round-to-nearest/ties-to-even rounding, numbers between 2-1075 and 2-1074 should convert to 2-1074, the smallest number representable in double-precision binary floating-point. strtod() correctly converts the former to 0, but it incorrectly converts the latter to 0 as well.
It would be interesting to know how it’s taught today (it’s been a very long time since I was taught it). I can’t imagine though that the person teaching it wouldn’t say — within a sentence or two of saying “floating-point” — that it “can’t represent all decimal numbers accurately”.
That prompted me to look through my box of thirty plus year old college (undergraduate) notebooks. I found notebooks for four classes in which I was taught floating-point. The notes from three of those classes confirm what I thought — that we were warned early of the decimal/binary mismatch. But in the first class of the four — the beginner’s class — it’s less clear what we were told. I’ll show you images of the relevant excerpts from my notes. (I notice I had some elements of cursive in my handwriting back then.)
Water tested the conversion of 2-1075 — in retrospect an obvious corner case I should have tried — and found that it converted incorrectly to 0x0.0000000000001p-1022. That’s 2-1074, the smallest double-precision value. It should have converted to 0, under round-to-nearest/ties-to-even rounding.
(Update 11/13/13: This bug has been fixed for version 2.19.)
For years I’ve followed, through RSS, floating-point related questions on stackoverflow.com. Every day it seems there is a question like “why does 19.24 plus 6.95 equal 26.189999999999998?” I decided to track these questions, to see if my sense of their frequency was correct. I found that, in the last 40 days, there were 18 such questions. That’s not one per day, but still — a lot!
I’ve written about two implementations of decimal string to double-precision binary floating-point conversion: David Gay’s strtod(), and glibc’s strtod(). GCC, the GNU Compiler Collection, has yet another implementation; it uses it to convert decimal floating-point literals to double-precision. It is much simpler than David Gay’s and glibc’s implementations, but there’s a hitch: limited precision causes it to produce some incorrect conversions. Nonetheless, I wanted to explain how it works, since I’ve been studying it recently. (I looked specifically at the conversion of floating-point literals in C code, although the same code is used for other languages.)