# Double Rounding Errors in Floating-Point Conversions

Copyright © 2008-2015 Exploring Binary

http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/

Double rounding is when a number is rounded twice, first from n_{0} digits to n_{1} digits, and then from n_{1} digits to n_{2} digits. Double rounding is often harmless, giving the same result as rounding once, directly from n_{0} digits to n_{2} digits. However, sometimes a doubly rounded result will be incorrect, in which case we say that a ** double rounding error** has occurred.

For example, consider the 6-digit decimal number 7.23496. Rounded directly to 3 digits — using round-to-nearest, round half to even rounding — it’s 7.23; rounded first to 5 digits (7.2350) and then to 3 digits it’s 7.24. The value 7.24 is incorrect, reflecting a double rounding error.

In a computer, double rounding occurs in binary floating-point arithmetic; the typical example is a calculated result that’s rounded to fit into an x87 FPU extended precision register and then rounded again to fit into a double-precision variable. But I’ve discovered another context in which double rounding occurs: conversion from a decimal floating-point literal to a single-precision floating-point variable. The double rounding is from full-precision binary to double-precision, and then from double-precision to single-precision.

In this article, I’ll show example conversions in C that are tainted by double rounding errors, and how attaching the ‘f’ suffix to floating-point literals prevents them — in gcc C at least, but not in Visual C++!

## Double Rounding Error Example 1

The following C program demonstrates a double rounding error that occurs when a decimal floating-point literal is converted to a float by the compiler. The program converts the same decimal number three times: first to a double *d*; then to a float *f*, using the ‘f’ (float) suffix; and then to a float *fd*, without using the ‘f’ suffix.

#include <stdio.h> int main (void) { double d; float f, fd; d = 0.5000000894069671353303618843710864894092082977294921875; f = 0.5000000894069671353303618843710864894092082977294921875f;fd = 0.5000000894069671353303618843710864894092082977294921875;printf ("d = %a\n",d); printf ("f = %a\n",f); printf ("fd = %a\n",fd); }

### Output on Linux

Here’s the output when the program is compiled and run on Linux using gcc:

d = 0x1.000003p-1 f = 0x1.000002p-1fd = 0x1.000004p-1

Translated from hexadecimal floating-point constants to binary, the output is

d = 0.1000000000000000000000011 f = 0.100000000000000000000001 fd = 0.10000000000000000000001

To interpret the output, first consider the full-precision binary equivalent of the decimal number in the program (it is a terminating binary fraction):

0.1000000000000000000000010111111111111111111111111111111

Now let’s look at the three variables in turn:

*d*is the correctly rounded conversion of the full-precision binary value to double-precision — 53 significant bits. The full-precision binary value is rounded up, since the value of bits 54 and beyond is greater than 1/2 ULP:0.1000000000000000000000010111111111111111111111111111111

*f*is the correctly rounded conversion of the full-precision binary value to single-precision — 24 significant bits. The full-precision binary value is rounded down, since the value of bits 25 and beyond is*less than*1/2 ULP:0.1000000000000000000000010111111111111111111111111111111

*fd*is anto single-precision. It’s 1 ULP above the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it up to 53 bits, giving 0.1000000000000000000000011, and then rounding that up to 24 bits, giving 0.10000000000000000000001. In other words, the first rounding causes a carry to propagate to bit 25, creating a halfway case that gets resolved by rounding up to nearest even.**incorrectly rounded conversion**

#### Summary

With the ‘f’ suffix on a decimal literal, gcc converts a full-precision binary value directly to single-precision, avoiding double rounding. Without the ‘f’ suffix, gcc converts the full-precision binary value to double-precision and then to single-precision; this indirect conversion from full-precision to single-precision is double rounding, and in this case results in a double rounding error.

### Output on Windows

Here’s the output when the program is compiled and run on Windows using Visual C++:

d = 0x1.000003p-1f = 0x1.000004p-1fd = 0x1.000004p-1

Like with gcc, *fd* is doubly rounded; but Visual C++ doubly rounds *f* as well! **It seems Visual C++ ignores the ‘f’ suffix**, first rounding to double-precision and then rounding to single-precision. (I am presuming this is the case. It’s also possible that Visual C++ is honoring the ‘f’ suffix, only it’s doing the conversion incorrectly — due to reasons other than double rounding.)

## Double Rounding Error Example 2

Here’s a similar example, except that it demonstrates a double rounding error that results in a value that is 1 ULP *below* the correctly rounded value:

#include <stdio.h> int main (void) { double d; float f, fd; d = 0.5000000298023224154508881156289135105907917022705078125; f = 0.5000000298023224154508881156289135105907917022705078125f;fd = 0.5000000298023224154508881156289135105907917022705078125;printf ("d = %a\n",d); printf ("f = %a\n",f); printf ("fd = %a\n",fd); }

### Output on Linux

Here’s the output when the program is compiled and run on Linux using gcc:

d = 0x1.000001p-1 f = 0x1.000002p-1fd = 0x1p-1

Translated from hexadecimal floating-point constants to binary, the output is

d = 0.1000000000000000000000001 f = 0.100000000000000000000001 fd = 0.1

The full-precision binary equivalent of the decimal number in the program is

0.1000000000000000000000001000000000000000000000000000001

Again, let’s look at the three variables:

*d*is the correctly rounded conversion of the full-precision binary value to 53 significant bits. The full-precision binary value is rounded down, since the value of bits 54 and beyond is less than 1/2 ULP:0.1000000000000000000000001000000000000000000000000000001

*f*is the correctly rounded conversion of the full-precision binary value to 24 significant bits. The full-precision binary value is rounded up, since the value of bits 25 and beyond is*greater than*1/2 ULP:0.1000000000000000000000001000000000000000000000000000001

*fd*is anto single-precision. It’s 1 ULP below the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it down to 53 bits, giving 0.1000000000000000000000001, and then rounding that down to 24 bits, giving 0.1. In other words, the first rounding creates a halfway case that gets resolved by rounding down to nearest even.**incorrectly rounded conversion**

As in example 1, use of the ‘f’ suffix prevents double rounding.

### Output on Windows

Here’s the output when the program is compiled and run on Windows using Visual C++:

d = 0x1.000001p-1f = 0x1.000000p-1fd = 0x1.000000p-1

As in example 1, Visual C++ ignores the ‘f’ suffix, resulting in double rounding for both *f* and *fd*.

## Summary

Floating-point literals are subject to double rounding when assigned to single-precision variables, resulting in incorrectly rounded decimal to floating-point conversions. If you’re using the gcc C compiler, you can avoid this by attaching the ‘f’ suffix to your literals.

### A Note on the Examples

The examples above are admittedly contrived, but they serve to illustrate the potential for double rounding errors during conversion.

## Please Try My Examples

If you are using a compiler other than Visual C++ or gcc, please try out my examples. In particular, I’d like to know if your compiler doubly rounds decimal literals when the ‘f’ suffix is used.

## Bug Report

I wrote a bug report against Visual C++; this was Microsoft’s response:

“This is indeed an issue with the compiler but we regret that we cannot fix it for the next release due to its priority.”

## References

Here is some background reading on double rounding:

- Handbook of Floating-Point Arithmetic (see pages 75-77).
- When Double Rounding is Odd (see the two pages titled “Double rounding going bad”).
- When Is Double Rounding Innocuous? (see the first two pages).

August 19th, 2010 at 5:18 am

It would be interesting to know if the MSVC results are only a problem with long decimal input strings, or whether the problem persists with short (i.e., <= 17 significant digits) strings.

For example, what results do you get on MSVC from the following snippet?

float f = 1.0000000596046448f;

printf ("f = %a\n",f);

This is a 17-digit decimal number that lies in the interval (1+2^-24, 1+2^-24+2^-53), so on a well-behaved compiler I'd expect f to have value 1 + 2^-23, or just 1 (as a result of double rounding) if the 'f' suffix is removed.

And I get these expected results on gcc 4.2.1 / amd64.

August 19th, 2010 at 9:28 am

Mark,

That’s a good example (shorter examples are always more compelling). Visual C++ does in fact double round as you guessed. Here’s the analysis, in the format of the article:

Output on Windows

1.0000000596046448 in binary is

1.00000000000000000000000100000000000000000000000000000001110…

As a double it rounds correctly to 1.000000000000000000000001

As an ‘f’ suffix float it

shouldround correctly to 1.00000000000000000000001, but it doubly rounds to 1.0.April 9th, 2012 at 7:09 pm

[…] Double rounding can lead to inaccurate results, even when doing something as simple as assigning a constant to a float […]

October 17th, 2012 at 1:40 pm

I want to know how could these double rounding error can be avoided. Please throw some light on the solution..

October 17th, 2012 at 1:57 pm

@Lusion,

You avoid double rounding errors by directly rounding to the final precision; in this case, 24 bits.

July 20th, 2013 at 12:03 pm

[…] to get predictable rounding on x87 is to store to memory, which costs performance and may lead to double-rounding errors. The net result is that it may be impossible or impractical to get the x87 FPU to give […]