The Answer is One (Unless You Use Floating-Point)

What does this C function do?

double f(double a)
{
 double b, c;

 b = 10*a - 10;
 c = a - 0.1*b;

 return (c);
}

Based solely on reading the code, you’ll conclude that it always returns 1: c = a – 0.1*(10*a – 10) = a – (a-1) = 1. But if you execute the code, you’ll find that it may or may not return 1, depending on the input. If you know anything about binary floating-point arithmetic, that won’t surprise you; what might surprise you is how far from 1 the answer can be — as far away as a large negative number!

Continue reading “The Answer is One (Unless You Use Floating-Point)”

Quick and Dirty Floating-Point to Decimal Conversion

In my article “Quick and Dirty Decimal to Floating-Point Conversion” I presented a small C program that uses double-precision floating-point arithmetic to convert decimal strings to binary floating-point numbers. The program converts some numbers incorrectly, despite using an algorithm that’s mathematically correct; its limited precision calculations are to blame. I dubbed the program “quick and dirty” because it’s simple, and overall converts reasonably accurately.

For this article, I took a similar approach to the conversion in the opposite direction — from binary floating-point to decimal string. I wrote a small C program that combines two mathematically correct algorithms: the classic “repeated division by ten” algorithm to convert integer values, and the classic “repeated multiplication by ten” algorithm to convert fractional values. The program uses double-precision floating-point arithmetic, so like its quick and dirty decimal to floating-point counterpart, its conversions are not always correct — though reasonably accurate. I’ll present the program and analyze some example conversions, both correct and incorrect.

Continue reading “Quick and Dirty Floating-Point to Decimal Conversion”

Inconsistent Rounding of Printed Floating-Point Numbers

What does this C program print?

#include <stdio.h>
int main (void)
{
 printf ("%.1f\n",0.25);
}

The answer depends on which compiler you use. If you compile the program with Visual C++ and run on it on Windows, it prints 0.3; if you compile it with gcc and run it on Linux, it prints 0.2.

The compilers — actually, their run time libraries — are using different rules to break decimal rounding ties. The two-digit number 0.25, which has an exact binary floating-point representation, is equally near two one-digit decimal numbers: 0.2 and 0.3; either is an acceptable answer. Visual C++ uses the round-half-away-from-zero rule, and gcc (actually, glibc) uses the round-half-to-even rule, also known as bankers’ rounding.

This inconsistency of printed output is not limited to C — it spans many programming environments. In all, I tested fixed-format printing in nineteen environments: in thirteen of them, round-half-away-from-zero was used; in the remaining six, round-half-to-even was used. I also discovered an anomaly in some environments: numbers like 0.15 — which look like halfway cases but are actually not when viewed in binary — may be rounded incorrectly. I’ll report my results in this article.

Continue reading “Inconsistent Rounding of Printed Floating-Point Numbers”

Double Rounding Errors in Floating-Point Conversions

Double rounding is when a number is rounded twice, first from n0 digits to n1 digits, and then from n1 digits to n2 digits. Double rounding is often harmless, giving the same result as rounding once, directly from n0 digits to n2 digits. However, sometimes a doubly rounded result will be incorrect, in which case we say that a double rounding error has occurred.

For example, consider the 6-digit decimal number 7.23496. Rounded directly to 3 digits — using round-to-nearest, round half to even rounding — it’s 7.23; rounded first to 5 digits (7.2350) and then to 3 digits it’s 7.24. The value 7.24 is incorrect, reflecting a double rounding error.

In a computer, double rounding occurs in binary floating-point arithmetic; the typical example is a calculated result that’s rounded to fit into an x87 FPU extended precision register and then rounded again to fit into a double-precision variable. But I’ve discovered another context in which double rounding occurs: conversion from a decimal floating-point literal to a single-precision floating-point variable. The double rounding is from full-precision binary to double-precision, and then from double-precision to single-precision.

In this article, I’ll show example conversions in C that are tainted by double rounding errors, and how attaching the ‘f’ suffix to floating-point literals prevents them — in gcc C at least, but not in Visual C++!

Continue reading “Double Rounding Errors in Floating-Point Conversions”

Displaying IEEE Doubles in Binary Scientific Notation

An IEEE double-precision floating-point number, or double, is a 64-bit encoding of a rational number. Internally, the 64 bits are broken into three fields: a 1-bit sign field, which represents positive or negative; an 11-bit exponent field, which represents a power of two; and a 52-bit fraction field, which represents the significant bits of the number. These three fields — together with an implicit leading 1 bit — represent a number in binary scientific notation, with 1 to 53 bits of precision.

For example, consider the decimal number 33.75. It converts to a double with a sign field of 0, an exponent field of 10000000100, and a fraction field of 0000111000000000000000000000000000000000000000000000. The 0 in the sign field means it’s a positive number (1 would mean it’s negative). The value of 10000000100 in the exponent field, which equals 1028 in decimal, means the exponent of the power of two is 5 (the exponent field value is offset, or biased, by 1023). The fraction field, when prefixed with an implicit leading 1, represents the binary fraction 1.0000111. Written in normalized binary scientific notation — following the convention that the fraction is written in binary and the power of two is written in decimal — 33.75 equals 1.0000111 x 25.

In this article, I’ll show you the C function I wrote to display a double in normalized binary scientific notation. This function is useful, for example, when verifying that decimal to floating-point conversions are correctly rounded.

Continue reading “Displaying IEEE Doubles in Binary Scientific Notation”

Quick and Dirty Decimal to Floating-Point Conversion

This little C program converts a decimal value — represented as a string — into a double-precision floating-point number:

#include <string.h>

int main (void)
{
 double intPart = 0, fracPart = 0, conversion;
 unsigned int i;
 char decimal[] = "3.14159";

 i = 0; /* Left to right */
 while (decimal[i] != '.') {
    intPart = intPart*10 + (decimal[i] - '0');
    i++;
   }

 i = strlen(decimal)-1; /* Right to left */
 while (decimal[i] != '.') {
    fracPart = (fracPart + (decimal[i] - '0'))/10;
    i--;
   }

 conversion = intPart + fracPart;
}

The conversion is done using the elegant Horner’s method, summing each digit according to its decimal place value. So why do I call it “quick and dirty?” Because the binary floating-point value it produces is not necessarily the closest approximation to the input decimal value — the so-called correctly rounded result. (Remember that most real numbers cannot be represented exactly in floating-point.) Most of the time it will produce the correctly rounded result, but sometimes it won’t — the result will be off in its least significant bit(s). There’s just not enough precision in floating-point to guarantee the result is correct every time.

I will demonstrate this program with different input values, some of which convert correctly, and some of which don’t. In the end, you’ll appreciate one reason why library functions like strtod() exist — to perform efficient, correctly rounded conversion.

Continue reading “Quick and Dirty Decimal to Floating-Point Conversion”

When Doubles Don’t Behave Like Doubles

In my article “When Floats Don’t Behave Like Floats” I explained how calculations involving single-precision floating-point variables may be done, under the covers, in double or extended precision. This leads to anomalies in expected results, which I demonstrated with two C programs — compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor.

In this article, I’ll do a similar analysis for double-precision floating-point variables, showing how similar anomalies arise when extended precision calculations are done. I modified my two example programs to use doubles instead of floats. Interestingly, the doubles version of program 2 does not exhibit the anomaly. I’ll explain.

Continue reading “When Doubles Don’t Behave Like Doubles”

When Floats Don’t Behave Like Floats

These two programs — compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor — demonstrate an anomaly that occurs when using single-precision floating point variables:

Program 1

#include "stdio.h"
int main (void)
{
 float f1 = 0.1f, f2 = 3.0f, f3;

 f3 = f1 * f2;
 if (f3 != f1 * f2)
   printf("Not equal\n");
}

Prints “Not equal”.

Program 2

#include "stdio.h"
int main (void)
{
 float f1 = 0.7f, f2 = 10.0f, f3;
 int i1, i2;

 f3 = f1 * f2;
 i1 = (int)f3;
 i2 = (int)(f1 * f2);
 if (i1 != i2)
   printf("Not equal\n");
}

Prints “Not equal”.

In each case, f3 and f1 * f2 differ. But why? I’ll explain what’s going on.

Continue reading “When Floats Don’t Behave Like Floats”

Finding Numbers That Are Palindromic In Multiple Bases

A palindromic number, or number palindrome, is a number like 74347, which is the same written forward and backward.

A number can be palindromic in any base, not just decimal. For example, 101101 is a palindrome in binary. A number can also be palindromic in more than one base, like decimal 719848917, which is 101010111010000000010111010101 in binary and 5272002725 in octal.

An efficient way to find palindromes in a single base is to generate them, iterating through each integer and constructing palindromes from them. An efficient way to find numbers that are palindromic in multiple bases is to take a palindrome in one base and test if it’s a palindrome in one or more additional bases.

In this article, I’ll show you C code I wrote that finds multi-base numeric palindromes. I used this code to generate tables of numbers that are palindromic in decimal and binary, decimal and hexadecimal, and decimal and octal. I also used this code to solve Euler problem 36, which asks for the sum of all numbers, less than one million, that are palindromic in decimal and binary.

Continue reading “Finding Numbers That Are Palindromic In Multiple Bases”

Displaying the Raw Fields of a Floating-Point Number

A double-precision floating-point number is represented internally as 64 bits, divided into three fields: a sign field, an exponent field, and a fraction field. You don’t need to know this to use floating-point numbers, but knowing it can help you understand them. This article shows you how to access those fields in C code, and how to print them — in binary or hex.

Continue reading “Displaying the Raw Fields of a Floating-Point Number”

Converting Floating-Point Numbers to Binary Strings in C

If you want to print a floating-point number in binary using C code, you can’t use printf() — it has no format specifier for it. That’s why I wrote a program to do it, a program I describe in this article.

(If you’re wondering why you’d want to print a floating-point number in binary, I’ll tell you that too.)

Continue reading “Converting Floating-Point Numbers to Binary Strings in C”

Base Conversion in PHP Using BCMath

PHP has a component called BCMath which does arbitrary-precision, decimal arithmetic. I used BCMath in my decimal/binary converter because:

  • Arbitrary-precision lets it operate on very large and very small numbers, numbers that can’t be represented in standard computer word sizes.
  • Decimal arithmetic lets it use the same algorithms I’d use to convert between decimal and binary by hand.

(If you’ve written a conversion routine in standard code, especially one to convert decimal fractions to binary, you’ll see the advantage of the second point.)

This article describes the implementation of my conversion routines with BCMath.

Continue reading “Base Conversion in PHP Using BCMath”

Copyright © 2008-2024 Exploring Binary

Privacy policy

Powered by WordPress

css.php