In my article “Fifteen Digits Don’t Round-Trip Through SQLite Reals” I showed examples of decimal floating-point numbers — 15 significant digits or less — that don’t round-trip through double-precision binary floating-point variables stored in SQLite. The round-trip failures occur because SQLite’s floating-point to decimal conversion routine uses limited-precision floating-point arithmetic.
SQLite has a limited-precision floating-point to decimal conversion routine which it uses to print double-precision floating-point values retrieved from a database. As I’ve discovered, its limited-precision conversion results in decimal numbers of 15 significant digits or less that won’t round-trip. For example, if you store the number 9.944932e+31, it will print back as 9.94493200000001e+31.
SQLite also has a limited-precision decimal to floating-point conversion routine, which it uses to convert input decimal numbers to double-precision floating-point numbers for storage in a database. I’ve found that some of its conversions are incorrect, by as many as four ULPs, and that some decimal numbers fail to round-trip because of this: “garbage in, garbage out,” as they say.
I’ve discovered that decimal floating-point numbers of 15 significant digits or less don’t always round-trip through SQLite. Consider this example, executed on version 3.7.3 of the pre-compiled SQLite command shell:
sqlite> create table t1(d real);
sqlite> insert into t1 values(9.944932e+31);
sqlite> select * from t1;
9.94493200000001e+31
SQLite represents a decimal floating-point number that has real affinity as a double-precision binary floating-point number — a double. A decimal number of 15 significant digits or less is supposed to be recoverable from its double-precision representation. In SQLite, however, this guarantee is not met; this is because its floating-point to decimal conversion routine is implemented in limited-precision floating-point arithmetic.
double f(double a)
{
  double b, c;
  b = 10*a - 10;
  c = a - 0.1*b;
  return c;
}
Based solely on reading the code, you’ll conclude that it always returns 1: c = a - 0.1*(10*a - 10) = a - (a - 1) = 1. But if you execute the code, you’ll find that it may or may not return 1, depending on the input. If you know anything about binary floating-point arithmetic, that won’t surprise you; what might surprise you is how far from 1 the answer can be: as far away as a large negative number!
In my article “Quick and Dirty Decimal to Floating-Point Conversion” I presented a small C program that uses double-precision floating-point arithmetic to convert decimal strings to binary floating-point numbers. The program converts some numbers incorrectly, despite using an algorithm that’s mathematically correct; its limited precision calculations are to blame. I dubbed the program “quick and dirty” because it’s simple, and overall converts reasonably accurately.
For this article, I took a similar approach to the conversion in the opposite direction — from binary floating-point to decimal string. I wrote a small C program that combines two mathematically correct algorithms: the classic “repeated division by ten” algorithm to convert integer values, and the classic “repeated multiplication by ten” algorithm to convert fractional values. The program uses double-precision floating-point arithmetic, so like its quick and dirty decimal to floating-point counterpart, its conversions are not always correct — though reasonably accurate. I’ll present the program and analyze some example conversions, both correct and incorrect.
int main (void)
The answer depends on which compiler you use. If you compile the program with Visual C++ and run it on Windows, it prints 0.3; if you compile it with gcc and run it on Linux, it prints 0.2.
The compilers — actually, their run time libraries — are using different rules to break decimal rounding ties. The two-digit number 0.25, which has an exact binary floating-point representation, is equally near two one-digit decimal numbers: 0.2 and 0.3; either is an acceptable answer. Visual C++ uses the round-half-away-from-zero rule, and gcc (actually, glibc) uses the round-half-to-even rule, also known as bankers’ rounding.
This inconsistency of printed output is not limited to C — it spans many programming environments. In all, I tested fixed-format printing in nineteen environments: in thirteen of them, round-half-away-from-zero was used; in the remaining six, round-half-to-even was used. I also discovered an anomaly in some environments: numbers like 0.15 — which look like halfway cases but are actually not when viewed in binary — may be rounded incorrectly. I’ll report my results in this article.
Hexadecimal floating-point constants, also known as hexadecimal floating-point literals, are an alternative way to represent floating-point numbers in a computer program. A hexadecimal floating-point constant is shorthand for binary scientific notation, which is an abstract — yet direct — representation of a binary floating-point number. As such, hexadecimal floating-point constants have exact representations in binary floating-point, unlike decimal floating-point constants, which in general do not.
Hexadecimal floating-point constants are useful for two reasons: they bypass decimal to floating-point conversions, which are sometimes done incorrectly, and they bypass floating-point to decimal conversions which, even if done correctly, are often limited to a fixed number of decimal digits. In short, their advantage is that they allow for direct control of floating-point variables, letting you read and write their exact contents.
In this article, I’ll show you what hexadecimal floating-point constants look like, and how to use them in C.
Recently I had to replace a 4-way switch, an electrical component that lets you turn a light on or off from three or more locations. While installing it, I thought about the binary properties of my three-switch circuit: I saw powers of two, binary logic, and binary gray code. I drew some diagrams, including a state machine. I’ll share my thoughts and diagrams with you in this article.
Double rounding is when a number is rounded twice, first from n0 digits to n1 digits, and then from n1 digits to n2 digits. Double rounding is often harmless, giving the same result as rounding once, directly from n0 digits to n2 digits. However, sometimes a doubly rounded result will be incorrect, in which case we say that a double rounding error has occurred.
For example, consider the 6-digit decimal number 7.23496. Rounded directly to 3 digits — using round-to-nearest, round half to even rounding — it’s 7.23; rounded first to 5 digits (7.2350) and then to 3 digits it’s 7.24. The value 7.24 is incorrect, reflecting a double rounding error.
In a computer, double rounding occurs in binary floating-point arithmetic; the typical example is a calculated result that’s rounded to fit into an x87 FPU extended precision register and then rounded again to fit into a double-precision variable. But I’ve discovered another context in which double rounding occurs: conversion from a decimal floating-point literal to a single-precision floating-point variable. The double rounding is from full-precision binary to double-precision, and then from double-precision to single-precision.
In this article, I’ll show example conversions in C that are tainted by double rounding errors, and how attaching the ‘f’ suffix to floating-point literals prevents them — in gcc C at least, but not in Visual C++!
An IEEE double-precision floating-point number, or double, is a 64-bit encoding of a rational number. Internally, the 64 bits are broken into three fields: a 1-bit sign field, which represents positive or negative; an 11-bit exponent field, which represents a power of two; and a 52-bit fraction field, which represents the significant bits of the number. These three fields — together with an implicit leading 1 bit — represent a number in binary scientific notation, with 1 to 53 bits of precision.
For example, consider the decimal number 33.75. It converts to a double with a sign field of 0, an exponent field of 10000000100, and a fraction field of 0000111000000000000000000000000000000000000000000000. The 0 in the sign field means it’s a positive number (1 would mean it’s negative). The value of 10000000100 in the exponent field, which equals 1028 in decimal, means the exponent of the power of two is 5 (the exponent field value is offset, or biased, by 1023). The fraction field, when prefixed with an implicit leading 1, represents the binary fraction 1.0000111. Written in normalized binary scientific notation, following the convention that the fraction is written in binary and the power of two is written in decimal, 33.75 equals 1.0000111 x 2^5.
For correctly rounded decimal to floating-point conversions, many open source projects rely on David Gay’s strtod() function. In the default rounding mode, IEEE 754 round-to-nearest, this function is known to give correct results (notwithstanding recent bugs, which have been fixed). However, in the less frequently used IEEE 754 directed rounding modes — round toward positive infinity, round toward negative infinity, and round toward zero — strtod() gives incorrectly rounded results for some inputs.