The maximum digit counts are useful if you want to print the full decimal value of a floating-point number (worst case format specifier and buffer size) or if you are writing or trying to understand a decimal to floating-point conversion routine (worst case number of input digits that must be converted).
The binary fraction 0.101 converts to the decimal fraction 0.625; the binary fraction 0.1010001 converts to the decimal fraction 0.6328125; the binary fraction 0.00111011011 converts to the decimal fraction 0.23193359375. In each of those examples, the binary fraction converts to a decimal fraction — that is, a terminating decimal representation — that has the same number of digits as the binary fraction has bits.
One digit per bit? We know that’s not true for binary integers. But it is true for binary fractions; every binary fraction of length n has a corresponding equivalent decimal fraction of length n.
This is the reason why you get all those “extra” digits when you print the full decimal value of an IEEE binary floating-point fraction, and why glibc strtod() and Visual C++ strtod() were once broken.
Every double-precision floating-point number can be specified with 17 significant decimal digits or less. A simple way to generate this 17-digit number is to round the full-precision decimal value of the double to 17 digits. For example, the double-precision value 0x1.6d4c11d09ffa1p-3, which in decimal is 1.783677474777478899614635565740172751247882843017578125 x 10-1, can be recovered from the decimal floating-point literal 1.7836774747774789e-1. The extra digits are unnecessary, since they will only take you to the same double.
On the other hand, an arbitrary, arbitrarily long decimal literal rounded or truncated to 17 digits may not convert to the double-precision value it’s supposed to. This is a subtle point, one that has even tripped up implementers of widely used decimal to floating-point conversion routines (glibc strtod() and Visual C++ strtod(), for example).
How many decimal digits of precision does a binary floating-point number have?
For example, does an IEEE single-precision binary floating-point number, or float as it’s known, have 6-8 digits? 7-8 digits? 6-9 digits? 6 digits? 7 digits? 7.22 digits? 6-112 digits? (I’ve seen all those answers on the Web.)
Any double-precision floating-point number can be identified with at most 17 significant decimal digits. This means that if you convert a floating-point number to a decimal string, round it (to nearest) to 17 digits, and then convert that back to floating-point, you will recover the original floating-point number. In other words, the conversion will round-trip.
Sometimes (many) fewer than 17 digits will serve to round-trip; it is often desirable to find the shortest such string. Some programming languages generate shortest decimal strings, but many do not.1 If your language does not, you can attempt this yourself using brute force, by rounding a floating-point number to increasing length decimal strings and checking each time whether conversion of the string round-trips. For double-precision, you’d start by rounding to 15 digits, then if necessary to 16 digits, and then finally, if necessary, to 17 digits.
There is an interesting anomaly in this process though, one that I recently learned about from Mark Dickinson on stackoverflow.com: in rare cases, it’s possible to overlook the shortest decimal string that round-trips. Mark described the problem in the context of single-precision binary floating-point, but it applies to double-precision binary floating-point as well — or any precision binary floating-point for that matter. I will look at this anomaly in the context of double-precision floating-point, and give a detailed analysis of its cause.
PHP’s base_convert() is a useful function that converts integers between any pair of bases, 2 through 36. However, you might hesitate to use it after reading this vague and mysterious warning in its documentation:
base_convert() may lose precision on large numbers due to properties related to the internal “double” or “float” type used.
The truth is that it works perfectly for integers up to a certain maximum — you just have to know what that is. I will show you this maximum value in each of the 35 bases, and how to check if the values you are using are within this limit.
I’ve previously written a Windows program (in C++) and an Android app (in Java) to turn my Fretlight guitar into a binary clock. I’ve now written a Python program to do the same, running under Raspbian Linux on a Raspberry Pi computer. I will show you the code and tell you how to run it.