Double Rounding Errors in Floating-Point Conversions

Double rounding is when a number is rounded twice, first from n0 digits to n1 digits, and then from n1 digits to n2 digits. Double rounding is often harmless, giving the same result as rounding once, directly from n0 digits to n2 digits. However, sometimes a doubly rounded result will be incorrect, in which case we say that a double rounding error has occurred.

For example, consider the 6-digit decimal number 7.23496. Rounded directly to 3 digits — using round-to-nearest, round half to even rounding — it’s 7.23; rounded first to 5 digits (7.2350) and then to 3 digits it’s 7.24. The value 7.24 is incorrect, reflecting a double rounding error.

In a computer, double rounding occurs in binary floating-point arithmetic; the typical example is a calculated result that’s rounded to fit into an x87 FPU extended precision register and then rounded again to fit into a double-precision variable. But I’ve discovered another context in which double rounding occurs: conversion from a decimal floating-point literal to a single-precision floating-point variable. The double rounding is from full-precision binary to double-precision, and then from double-precision to single-precision.

In this article, I’ll show example conversions in C that are tainted by double rounding errors, and how attaching the ‘f’ suffix to floating-point literals prevents them — in gcc C at least, but not in Visual C++!

Double Rounding Error Example 1

The following C program demonstrates a double rounding error that occurs when a decimal floating-point literal is converted to a float by the compiler. The program converts the same decimal number three times: first to a double d; then to a float f, using the ‘f’ (float) suffix; and then to a float fd, without using the ‘f’ suffix.

#include <stdio.h>

int main (void)
{
 double d;
 float f, fd; 

 d =  0.5000000894069671353303618843710864894092082977294921875;
 f =  0.5000000894069671353303618843710864894092082977294921875f;
 fd = 0.5000000894069671353303618843710864894092082977294921875;

 printf ("d =  %a\n",d);
 printf ("f =  %a\n",f);
 printf ("fd = %a\n",fd);
}

Output on Linux

Here’s the output when the program is compiled and run on Linux using gcc:

d =  0x1.000003p-1
f =  0x1.000002p-1
fd = 0x1.000004p-1

Translated from hexadecimal floating-point constants to binary, the output is

d =  0.1000000000000000000000011
f =  0.100000000000000000000001
fd = 0.10000000000000000000001

To interpret the output, first consider the full-precision binary equivalent of the decimal number in the program (it is a terminating binary fraction):

0.1000000000000000000000010111111111111111111111111111111

Now let’s look at the three variables in turn:

  • d is the correctly rounded conversion of the full-precision binary value to double-precision — 53 significant bits. The full-precision binary value is rounded up, since the value of bits 54 and beyond is greater than 1/2 ULP: 0.1000000000000000000000010111111111111111111111111111111
  • f is the correctly rounded conversion of the full-precision binary value to single-precision — 24 significant bits. The full-precision binary value is rounded down, since the value of bits 25 and beyond is less than 1/2 ULP: 0.1000000000000000000000010111111111111111111111111111111
  • fd is an incorrectly rounded conversion to single-precision. It’s 1 ULP above the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it up to 53 bits, giving 0.1000000000000000000000011, and then rounding that up to 24 bits, giving 0.10000000000000000000001. In other words, the first rounding causes a carry to propagate to bit 25, creating a halfway case that gets resolved by rounding up to nearest even.

Summary

With the ‘f’ suffix on a decimal literal, gcc converts a full-precision binary value directly to single-precision, avoiding double rounding. Without the ‘f’ suffix, gcc converts the full-precision binary value to double-precision and then to single-precision; this indirect conversion from full-precision to single-precision is double rounding, and in this case results in a double rounding error.

Output on Windows

Here’s the output when the program is compiled and run on Windows using Visual C++:

d =  0x1.000003p-1
f =  0x1.000004p-1
fd = 0x1.000004p-1

Like with gcc, fd is doubly rounded; but Visual C++ doubly rounds f as well! It seems Visual C++ ignores the ‘f’ suffix, first rounding to double-precision and then rounding to single-precision. (I am presuming this is the case. It’s also possible that Visual C++ is honoring the ‘f’ suffix, only it’s doing the conversion incorrectly — due to reasons other than double rounding.)

Double Rounding Error Example 2

Here’s a similar example, except that it demonstrates a double rounding error that results in a value that is 1 ULP below the correctly rounded value:

#include <stdio.h>

int main (void)
{
 double d;
 float f, fd; 

 d =  0.5000000298023224154508881156289135105907917022705078125;
 f =  0.5000000298023224154508881156289135105907917022705078125f;
 fd = 0.5000000298023224154508881156289135105907917022705078125;

 printf ("d =  %a\n",d);
 printf ("f =  %a\n",f);
 printf ("fd = %a\n",fd);
}

Output on Linux

Here’s the output when the program is compiled and run on Linux using gcc:

d =  0x1.000001p-1
f =  0x1.000002p-1
fd = 0x1p-1

Translated from hexadecimal floating-point constants to binary, the output is

d =  0.1000000000000000000000001
f =  0.100000000000000000000001
fd = 0.1

The full-precision binary equivalent of the decimal number in the program is

0.1000000000000000000000001000000000000000000000000000001

Again, let’s look at the three variables:

  • d is the correctly rounded conversion of the full-precision binary value to 53 significant bits. The full-precision binary value is rounded down, since the value of bits 54 and beyond is less than 1/2 ULP: 0.1000000000000000000000001000000000000000000000000000001
  • f is the correctly rounded conversion of the full-precision binary value to 24 significant bits. The full-precision binary value is rounded up, since the value of bits 25 and beyond is greater than 1/2 ULP: 0.1000000000000000000000001000000000000000000000000000001
  • fd is an incorrectly rounded conversion to single-precision. It’s 1 ULP below the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it down to 53 bits, giving 0.1000000000000000000000001, and then rounding that down to 24 bits, giving 0.1. In other words, the first rounding creates a halfway case that gets resolved by rounding down to nearest even.

As in example 1, use of the ‘f’ suffix prevents double rounding.

Output on Windows

Here’s the output when the program is compiled and run on Windows using Visual C++:

d =  0x1.000001p-1
f =  0x1.000000p-1
fd = 0x1.000000p-1

As in example 1, Visual C++ ignores the ‘f’ suffix, resulting in double rounding for both f and fd.

Summary

Floating-point literals are subject to double rounding when assigned to single-precision variables, resulting in incorrectly rounded decimal to floating-point conversions. If you’re using the gcc C compiler, you can avoid this by attaching the ‘f’ suffix to your literals.

A Note on the Examples

The examples above are admittedly contrived, but they serve to illustrate the potential for double rounding errors during conversion.

Please Try My Examples

If you are using a compiler other than Visual C++ or gcc, please try out my examples. In particular, I’d like to know if your compiler doubly rounds decimal literals when the ‘f’ suffix is used.

Bug Report

I wrote a bug report against Visual C++; this was Microsoft’s response:

“This is indeed an issue with the compiler but we regret that we cannot fix it for the next release due to its priority.”

References

Here is some background reading on double rounding:

Update (November 2020): I recently did a deeper investigation into double rounding errors in decimal to double to float conversions, specifically looking for short examples that demonstrate the error. For example, the two (long) examples in this article can be shortened — Example 1 to 17 digits and Example 2 to 16 digits — and still cause the error. However, not all long examples can be shortened to 17 digits (or fewer), which may not seem obvious at first.

Dingbat

7 comments

  1. It would be interesting to know if the MSVC results are only a problem with long decimal input strings, or whether the problem persists with short (i.e., <= 17 significant digits) strings.

    For example, what results do you get on MSVC from the following snippet?

    float f = 1.0000000596046448f;
    printf ("f = %a\n",f);

    This is a 17-digit decimal number that lies in the interval (1+2^-24, 1+2^-24+2^-53), so on a well-behaved compiler I'd expect f to have value 1 + 2^-23, or just 1 (as a result of double rounding) if the 'f' suffix is removed.
    And I get these expected results on gcc 4.2.1 / amd64.

  2. Mark,

    That’s a good example (shorter examples are always more compelling). Visual C++ does in fact double round as you guessed. Here’s the analysis, in the format of the article:

    #include <stdio.h>
    
    int main (void)
    {
     double d;
     float f, fd; 
    
     d =  1.0000000596046448;
     f =  1.0000000596046448f;
     fd = 1.0000000596046448;
    
     printf ("d =  %a\n",d);
     printf ("f =  %a\n",f);
     printf ("fd = %a\n",fd);
    }
    

    Output on Windows

    d =  0x1.000001p+0
    f =  0x1.000000p+0
    fd = 0x1.000000p+0
    

    1.0000000596046448 in binary is

    1.00000000000000000000000100000000000000000000000000000001110…

    As a double it rounds correctly to 1.000000000000000000000001
    As an ‘f’ suffix float it should round correctly to 1.00000000000000000000001, but it doubly rounds to 1.0.

  3. I want to know how could these double rounding error can be avoided. Please throw some light on the solution..

Comments are closed.

Copyright © 2008-2024 Exploring Binary

Privacy policy

Powered by WordPress

css.php