Articles with the ‘Floating-point’ Tag - Exploring Binary

Displaying IEEE Doubles in Binary Scientific Notation

By Rick Regan July 1st, 2010

An IEEE double-precision floating-point number, or double, is a 64-bit encoding of a rational number. Internally, the 64 bits are broken into three fields: a 1-bit sign field, which represents positive or negative; an 11-bit exponent field, which represents a power of two; and a 52-bit fraction field, which represents the significant bits of the number. These three fields — together with an implicit leading 1 bit — represent a number in binary scientific notation, with 1 to 53 bits of precision.

For example, consider the decimal number 33.75. It converts to a double with a sign field of 0, an exponent field of 10000000100, and a fraction field of 0000111000000000000000000000000000000000000000000000. The 0 in the sign field means it’s a positive number (1 would mean it’s negative). The value of 10000000100 in the exponent field, which equals 1028 in decimal, means the exponent of the power of two is 5 (the exponent field value is offset, or biased, by 1023). The fraction field, when prefixed with an implicit leading 1, represents the binary fraction 1.0000111. Written in normalized binary scientific notation — following the convention that the fraction is written in binary and the power of two is written in decimal — 33.75 equals 1.0000111 x 2⁵.

In this article, I’ll show you the C function I wrote to display a double in normalized binary scientific notation. This function is useful, for example, when verifying that decimal to floating-point conversions are correctly rounded.

Continue reading “Displaying IEEE Doubles in Binary Scientific Notation”

Incorrect Directed Conversions in David Gay’s strtod()

By Rick Regan June 23rd, 2010

For correctly rounded decimal to floating-point conversions, many open source projects rely on David Gay’s strtod() function. In the default rounding mode, IEEE 754 round-to-nearest, this function is known to give correct results (notwithstanding recent bugs, which have been fixed). However, in the less frequently used IEEE 754 directed rounding modes — round toward positive infinity, round toward negative infinity, and round toward zero — strtod() gives incorrectly rounded results for some inputs.

Continue reading “Incorrect Directed Conversions in David Gay’s strtod()”

Visual C++ and GLIBC strtod() Ignore Rounding Mode

By Rick Regan June 18th, 2010

When a decimal number is converted to a binary floating-point number, the floating-point number, in general, is only an approximation to the decimal number. Large integers, and most decimal fractions, require more significant bits than can be represented in the floating-point format. This means the decimal number must be rounded, to one of the two floating-point numbers that surround it.

Common practice considers a decimal number correctly rounded when the nearest of the two floating-point numbers is chosen (and when both are equally near, when the one with significant bit number 53 equal to 0 is chosen). This makes sense intuitively, and also reflects the default IEEE 754 rounding mode — round-to-nearest. However, there are three other IEEE 754 rounding modes, which allow for directed rounding: round toward positive infinity, round toward negative infinity, and round toward zero. For a conversion to be considered truly correctly rounded, it must honor all four rounding modes — whichever is currently in effect.

I evaluated the Visual C++ and glibc strtod() functions under the three directed rounding modes, like I did for round-to-nearest mode in my articles “Incorrectly Rounded Conversions in Visual C++” and “Incorrectly Rounded Conversions in GCC and GLIBC”. What I discovered was this: they only convert correctly about half the time — pure chance! — because they ignore the rounding mode altogether.

Continue reading “Visual C++ and GLIBC strtod() Ignore Rounding Mode”

Incorrectly Rounded Conversions in GCC and GLIBC

By Rick Regan June 3rd, 2010

Visual C++ rounds some decimal to double-precision floating-point conversions incorrectly, but it’s not alone; the gcc C compiler and the glibc strtod() function do the same. In this article, I’ll show examples of incorrect conversions in gcc and glibc, and I’ll present a C program that demonstrates the errors.

Continue reading “Incorrectly Rounded Conversions in GCC and GLIBC”

Incorrectly Rounded Conversions in Visual C++

By Rick Regan May 28th, 2010

In my analysis of decimal to floating-point conversion I noted an example that was converted incorrectly by the Microsoft Visual C++ compiler. I’ve found more examples — including a class of examples — that it converts incorrectly. I will analyze those examples in this article.

Continue reading “Incorrectly Rounded Conversions in Visual C++”

Decimal to Floating-Point Needs Arbitrary Precision

By Rick Regan May 20th, 2010

In my article “Quick and Dirty Decimal to Floating-Point Conversion” I presented a small C program that converts a decimal string to a double-precision binary floating-point number. The number it produces, however, is not necessarily the closest — or so-called correctly rounded — double-precision binary floating-point number. This is not a failing of the algorithm; mathematically speaking, the algorithm is correct. The flaw comes in its implementation in limited precision binary floating-point arithmetic.

The quick and dirty program is implemented in native C, so it’s limited to double-precision floating-point arithmetic (although on some systems, extended precision may be used). Higher precision arithmetic — in fact, arbitrary precision arithmetic — is needed to ensure that all decimal inputs are converted correctly. I will demonstrate the need for high precision by analyzing three examples, all taken from Vern Paxson’s paper “A Program for Testing IEEE Decimal–Binary Conversion”.

Continue reading “Decimal to Floating-Point Needs Arbitrary Precision”

Quick and Dirty Decimal to Floating-Point Conversion

By Rick Regan April 30th, 2010

This little C program converts a decimal value — represented as a string — into a double-precision floating-point number:

#include <string.h>

int main (void)
{
 double intPart = 0, fracPart = 0, conversion;
 unsigned int i;
 char decimal[] = "3.14159";

 i = 0; /* Left to right */
 while (decimal[i] != '.') {
    intPart = intPart*10 + (decimal[i] - '0');
    i++;
   }

 i = strlen(decimal)-1; /* Right to left */
 while (decimal[i] != '.') {
    fracPart = (fracPart + (decimal[i] - '0'))/10;
    i--;
   }

 conversion = intPart + fracPart;
}

The conversion is done using the elegant Horner’s method, summing each digit according to its decimal place value. So why do I call it “quick and dirty?” Because the binary floating-point value it produces is not necessarily the closest approximation to the input decimal value — the so-called correctly rounded result. (Remember that most real numbers cannot be represented exactly in floating-point.) Most of the time it will produce the correctly rounded result, but sometimes it won’t — the result will be off in its least significant bit(s). There’s just not enough precision in floating-point to guarantee the result is correct every time.

I will demonstrate this program with different input values, some of which convert correctly, and some of which don’t. In the end, you’ll appreciate one reason why library functions like strtod() exist — to perform efficient, correctly rounded conversion.

Continue reading “Quick and Dirty Decimal to Floating-Point Conversion”

When Doubles Don’t Behave Like Doubles

By Rick Regan April 9th, 2010

In my article “When Floats Don’t Behave Like Floats” I explained how calculations involving single-precision floating-point variables may be done, under the covers, in double or extended precision. This leads to anomalies in expected results, which I demonstrated with two C programs — compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor.

In this article, I’ll do a similar analysis for double-precision floating-point variables, showing how similar anomalies arise when extended precision calculations are done. I modified my two example programs to use doubles instead of floats. Interestingly, the doubles version of program 2 does not exhibit the anomaly. I’ll explain.

Continue reading “When Doubles Don’t Behave Like Doubles”

When Floats Don’t Behave Like Floats

By Rick Regan March 30th, 2010

These two programs — compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor — demonstrate an anomaly that occurs when using single-precision floating point variables:

Program 1

#include "stdio.h"
int main (void)
{
 float f1 = 0.1f, f2 = 3.0f, f3;

 f3 = f1 * f2;
 if (f3 != f1 * f2)
   printf("Not equal\n");
}

Prints “Not equal”.

Program 2

#include "stdio.h"
int main (void)
{
 float f1 = 0.7f, f2 = 10.0f, f3;
 int i1, i2;

 f3 = f1 * f2;
 i1 = (int)f3;
 i2 = (int)(f1 * f2);
 if (i1 != i2)
   printf("Not equal\n");
}

Prints “Not equal”.

In each case, f3 and f1 * f2 differ. But why? I’ll explain what’s going on.

Continue reading “When Floats Don’t Behave Like Floats”

Floating-Point Error in the NPR Media Player

By Rick Regan October 31st, 2009

The NPR Media Player apparently uses floating-point numbers to represent timestamps, based on this image (click it to enlarge):

NaNs in NPR Media Player (thumbnail) — NaNs in NPR Media Player (click image to enlarge).

Continue reading “Floating-Point Error in the NPR Media Player”

Displaying the Raw Fields of a Floating-Point Number

By Rick Regan May 20th, 2009

A double-precision floating-point number is represented internally as 64 bits, divided into three fields: a sign field, an exponent field, and a fraction field. You don’t need to know this to use floating-point numbers, but knowing it can help you understand them. This article shows you how to access those fields in C code, and how to print them — in binary or hex.

Continue reading “Displaying the Raw Fields of a Floating-Point Number”

Converting Floating-Point Numbers to Binary Strings in C

By Rick Regan May 6th, 2009

If you want to print a floating-point number in binary using C code, you can’t use printf() — it has no format specifier for it. That’s why I wrote a program to do it, a program I describe in this article.

(If you’re wondering why you’d want to print a floating-point number in binary, I’ll tell you that too.)

Continue reading “Converting Floating-Point Numbers to Binary Strings in C”