Incorrectly Rounded Conversions in Visual C++

In my analysis of decimal to floating-point conversion I noted an example that was converted incorrectly by the Microsoft Visual C++ compiler. I’ve found more examples — including a class of examples — that it converts incorrectly. I will analyze those examples in this article.

My Analysis

I did my analysis using Visual C++ Express Edition — both the 2008 and 2010 versions. I converted decimal numbers to double-precision floating-point, via the strtod() library function and floating-point literals converted implicitly by the compiler. I discovered examples of incorrect conversion in three ways: by random testing (using only strtod() so that it could be automated easily), by trial and error, and by direct construction.

My random testing was extensive, but not exhaustive. I generated random decimal strings (positive numbers only) in scientific notation, with a random number of digits (up to 21) and a random exponent (between -308 and +308). I compared the Visual C++ conversion to that given by David Gay’s strtod() function, the de facto standard correct conversion routine. This is what I found:

  • Conversions were incorrect about 0.03% of the time (three hundredths of a percent).
  • No conversion was incorrect by more than one unit in the last place (ULP).
  • Incorrect conversions included both “hard” and “easy” cases (defined below).
  • All incorrect conversions that I verified by hand were either “halfway” or “greater than halfway” cases (defined below).
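
The random test described above can be sketched roughly as follows. This is a minimal illustration, not the harness I actually used: the names random_decimal(), count_mismatches(), converter_fn, and libc_strtod() are made up for this sketch, rand() stands in for whatever random source you prefer, and the reference converter (for example, a build of David Gay's strtod()) is assumed to be supplied by the caller as a function pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical signature shared by the converter under test and the reference. */
typedef double (*converter_fn)(const char *str);

/* Write a random positive decimal such as "3.14159e-12" into buf:
   1 to 21 significant digits and an exponent between -308 and +308. */
static void random_decimal(char *buf, size_t size)
{
    int ndigits = 1 + rand() % 21;
    int exponent = rand() % 617 - 308;
    size_t pos = 0;

    assert(size >= 32);                        /* longest output is 28 bytes with NUL */
    buf[pos++] = (char)('1' + rand() % 9);     /* nonzero leading digit */
    if (ndigits > 1)
        buf[pos++] = '.';
    for (int i = 1; i < ndigits; i++)
        buf[pos++] = (char)('0' + rand() % 10);
    snprintf(buf + pos, size - pos, "e%d", exponent);
}

/* Convert `trials` random strings with both functions and
   return how many times the two results differ. */
static long count_mismatches(converter_fn under_test, converter_fn reference, long trials)
{
    char buf[32];
    long mismatches = 0;

    for (long i = 0; i < trials; i++) {
        random_decimal(buf, sizeof buf);
        if (under_test(buf) != reference(buf))
            mismatches++;
    }
    return mismatches;
}

/* Adapter so the C library's strtod() matches converter_fn. */
static double libc_strtod(const char *str)
{
    return strtod(str, NULL);
}
```

Calling count_mismatches(libc_strtod, reference, trials) with a trusted reference gives the raw error count; dividing by the number of trials gives the error rate.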

For all the examples below, I used my C function fp2bin() to inspect the converted floating-point values. I verified that both strtod() and the compiler converted the same decimal number to the same double-precision floating-point value. I also verified the correctly converted value by hand, using the full-precision binary numbers generated by my decimal to binary converter.
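
To show the kind of inspection involved, here is a small sketch that prints a double in the binary scientific notation used in the examples below. It is not my fp2bin() function; double_to_binary_sci() is a hypothetical helper written for this illustration, and it assumes that double is an IEEE 754 binary64 value and that the input is positive and normalized.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a positive, normalized double as "1.bbb x 2^e",
   with trailing zeros of the fraction trimmed. */
static void double_to_binary_sci(double d, char *buf, size_t size)
{
    uint64_t bits, fraction;
    int exponent;
    size_t pos = 0;

    assert(size >= 72);                      /* "1." + 52 bits + " x 2^-1022" + NUL */
    memcpy(&bits, &d, sizeof bits);          /* reinterpret the IEEE 754 encoding */
    fraction = bits & ((UINT64_C(1) << 52) - 1);
    exponent = (int)((bits >> 52) & 0x7FF) - 1023;
    assert((bits >> 63) == 0 && exponent > -1023 && exponent < 1024);

    buf[pos++] = '1';                        /* implicit leading bit */
    if (fraction != 0) {
        size_t end = pos;
        buf[pos++] = '.';
        for (int i = 51; i >= 0; i--) {
            int bit = (int)((fraction >> i) & 1);
            buf[pos++] = (char)('0' + bit);
            if (bit)
                end = pos;                   /* remember the last 1 bit */
        }
        pos = end;                           /* drop trailing zeros */
    }
    snprintf(buf + pos, size - pos, " x 2^%d", exponent);
}
```

Passing the result of strtod() and the value of the corresponding floating-point literal through a helper like this makes it easy to see whether the two agree bit for bit.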

Halfway Cases

A halfway case occurs when a decimal number converts to a full-precision binary number that is exactly halfway between two adjacent binary floating-point values. By convention, correct rounding breaks the tie by choosing the floating-point value whose least significant bit is 0 (round-half-to-even).

In terms of double-precision floating-point (normalized numbers only), the halfway case happens when significant bit 54 of the full-precision binary number is equal to 1, and all bits beyond are equal to 0. If bit 53 is equal to 0, then the full-precision binary number is rounded down to 53 bits; if bit 53 is equal to 1, then the full-precision binary number is rounded up to 53 bits. In either case, the correctly rounded result has 0 in its 53rd significant bit.
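
The rule above can be sketched for integers that fit in 64 bits. This is only an illustration of the rounding decision, not how any particular strtod() is implemented; the names classify_53(), truncate_53(), and round_half_even_53() are invented for this sketch, and the input is assumed to be less than 2^63.

```c
#include <assert.h>
#include <stdint.h>

enum halfway_class { EXACT, BELOW_HALFWAY, HALFWAY, ABOVE_HALFWAY };

/* Number of significant bits in n (0 for n == 0). */
static int bit_length(uint64_t n)
{
    int len = 0;
    while (n != 0) {
        len++;
        n >>= 1;
    }
    return len;
}

/* Classify the bits beyond significant bit 53. */
static enum halfway_class classify_53(uint64_t n)
{
    int extra = bit_length(n) - 53;
    uint64_t discarded, half;

    if (extra <= 0)
        return EXACT;
    discarded = n & ((UINT64_C(1) << extra) - 1);
    half = UINT64_C(1) << (extra - 1);       /* bit 54 set, nothing beyond */
    if (discarded == 0)
        return EXACT;
    if (discarded < half)
        return BELOW_HALFWAY;
    return discarded == half ? HALFWAY : ABOVE_HALFWAY;
}

/* Keep the top 53 bits and discard the rest (always round down). */
static uint64_t truncate_53(uint64_t n)
{
    int extra = bit_length(n) - 53;
    return extra <= 0 ? n : (n >> extra) << extra;
}

/* Round to 53 bits, breaking ties toward an even bit 53. */
static uint64_t round_half_even_53(uint64_t n)
{
    int extra = bit_length(n) - 53;
    enum halfway_class c = classify_53(n);
    uint64_t kept;

    assert(bit_length(n) < 64);              /* leave room for a carry out */
    if (extra <= 0)
        return n;
    kept = n >> extra;
    if (c == ABOVE_HALFWAY || (c == HALFWAY && (kept & 1)))
        kept++;
    return kept << extra;
}
```

For a halfway input with bit 53 equal to 1, round_half_even_53() and truncate_53() return values that differ by one ULP, which is exactly the discrepancy examined in this section.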

My testing shows that Visual C++ consistently rounds halfway cases incorrectly; specifically, it always rounds down, even when bit 53 of the full-precision binary number equals 1.

Example 1 (Incorrect)

The integer 9,214,843,084,008,499 converts to this binary number:

100000101111001101100111011000101011101110110000110011

Bit 54 is 1 and there are no bits beyond it, so this is a halfway case. Bit 53 is 1, so the correctly rounded value, in binary scientific notation, is

1.000001011110011011001110110001010111011101100001101 x 2^53

Visual C++ computes the converted value as

1.0000010111100110110011101100010101110111011000011001 x 2^53

which is one ULP below the correctly rounded value.
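
One way to measure "one ULP below" is to compare the raw encodings of the two doubles. The following is a sketch under stated assumptions: ulp_distance() is a hypothetical helper, double is assumed to be IEEE 754 binary64, and both arguments are assumed to be positive and finite, in which case adjacent doubles have adjacent integer encodings.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Signed distance in ULPs from b to a, for positive finite doubles. */
static int64_t ulp_distance(double a, double b)
{
    uint64_t ua, ub;

    assert(a > 0.0 && b > 0.0);
    memcpy(&ua, &a, sizeof ua);   /* reinterpret the IEEE 754 encodings */
    memcpy(&ub, &b, sizeof ub);
    return (int64_t)ua - (int64_t)ub;
}
```

For Example 1, the two candidate results are the exactly representable integers 9214843084008498 and 9214843084008500, and ulp_distance(9214843084008498.0, 9214843084008500.0) is -1.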

Example 2 (Incorrect)

The dyadic fraction

0.500000000000000166533453693773481063544750213623046875,

which equals 2^-1 + 2^-53 + 2^-54, converts to this binary number:

0.100000000000000000000000000000000000000000000000000011

It has 54 significant bits, with bits 53 and 54 equal to 1. The correctly rounded value is

1.000000000000000000000000000000000000000000000000001 x 2^-1

Visual C++ computes the converted value as

1.0000000000000000000000000000000000000000000000000001 x 2^-1

which is one ULP below the correctly rounded value.

Discussion

Examples 1 and 2 would be considered “easy” cases in correct rounding parlance: they require only one extra bit beyond double precision to make the correct rounding decision. These examples support my assertion that Visual C++ always truncates halfway cases.
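
If you want to try Examples 1 and 2 against your own C library, a check along these lines should do. It is a sketch, with hypothetical function names; the expected values are written as expressions that are exact in double precision (an integer below 2^54 that is even, and 0.5 plus 1/2^52), so the check does not depend on how the compiler converts long decimal literals.

```c
#include <stdlib.h>

/* Example 1: the halfway integer must round up to the even neighbor. */
static int halfway_example_1_ok(void)
{
    double expected = 9214843084008500.0;    /* exactly representable */
    return strtod("9214843084008499", NULL) == expected;
}

/* Example 2: 2^-1 + 2^-53 + 2^-54 must round up to 2^-1 + 2^-52. */
static int halfway_example_2_ok(void)
{
    double expected = 0.5 + 1.0 / 4503599627370496.0;   /* 0.5 + 2^-52, exact */
    return strtod("0.500000000000000166533453693773481063544750213623046875",
                  NULL) == expected;
}
```

Each function returns 1 when strtod() produces the correctly rounded value and 0 otherwise.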

Greater than Halfway Cases

The greater than halfway case occurs when a decimal number converts to a full-precision binary number that is between two binary floating-point values, but closer to the one above. Visual C++ gets some of these wrong and some of these right.

Example 3 (Incorrect)

The integer 30,078,505,129,381,147,446,200 converts to this binary number:

11001011110100011110001110010011100011011000000100000100000000010111
0111000

Bit 54 is 1 and there is a 1 bit beyond bit 54, so the correctly rounded value is

1.1001011110100011110001110010011100011011000000100001 x 2^74

Visual C++ computes the converted value as

1.10010111101000111100011100100111000110110000001 x 2^74

which is one ULP below the correctly rounded value.
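
Integers like the one in this example are too large for a 64-bit type, but their full-precision binary form can still be produced by repeatedly halving the decimal digit string. The following is a minimal sketch of that idea, not the code behind my decimal to binary converter; decimal_to_binary() is a hypothetical name, and the input is assumed to be at most 128 decimal digits with no commas or sign.

```c
#include <assert.h>
#include <string.h>

/* Convert a decimal integer string to a binary string.
   Returns the number of bits written to out (excluding the NUL). */
static size_t decimal_to_binary(const char *dec, char *out, size_t out_size)
{
    unsigned char digits[128];
    size_t n = strlen(dec), start = 0, nbits = 0;

    assert(n > 0 && n <= sizeof digits);
    for (size_t i = 0; i < n; i++) {
        assert(dec[i] >= '0' && dec[i] <= '9');
        digits[i] = (unsigned char)(dec[i] - '0');
    }

    /* Divide by 2 until nothing is left; each remainder is the next
       bit, least significant first. */
    while (start < n) {
        int rem = 0;
        for (size_t i = start; i < n; i++) {
            int cur = rem * 10 + digits[i];
            digits[i] = (unsigned char)(cur / 2);
            rem = cur % 2;
        }
        assert(nbits + 1 < out_size);
        out[nbits++] = (char)('0' + rem);
        while (start < n && digits[start] == 0)
            start++;                         /* skip leading zeros of the quotient */
    }

    /* Reverse so the most significant bit comes first. */
    for (size_t i = 0; i < nbits / 2; i++) {
        char tmp = out[i];
        out[i] = out[nbits - 1 - i];
        out[nbits - 1 - i] = tmp;
    }
    out[nbits] = '\0';
    return nbits;
}
```

Given "30078505129381147446200" (Example 3 without the commas), this produces a 75-bit string, from which bits 53, 54, and beyond can be read off directly.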

Example 4 (Correct)

The integer 1,777,820,000,000,000,000,001 converts to this binary number:

11000000110000000110101011011101011010010111000111101100000000000000
001

Just like the prior example, bit 54 is 1 and there is a 1 bit beyond it — but Visual C++ converts this correctly, to

1.100000011000000011010101101110101101001011100011111 x 2^70

Example 5 (Incorrect)

The dyadic fraction

0.500000000000000166547006220929549868969843373633921146392822265625

which equals 2^-1 + 2^-53 + 2^-54 + 2^-66, converts to this binary number:

0.100000000000000000000000000000000000000000000000000011000000000001

The correctly rounded value is

1.000000000000000000000000000000000000000000000000001 x 2^-1

But Visual C++ converts it to this, one ULP below the correctly rounded value:

1.0000000000000000000000000000000000000000000000000001 x 2^-1

Example 6 (Correct)

The dyadic fraction

0.50000000000000016656055874808561867439493653364479541778564453125

which equals 2^-1 + 2^-53 + 2^-54 + 2^-65, converts to this binary number:

0.10000000000000000000000000000000000000000000000000001100000000001

Visual C++ converts this correctly, to

1.000000000000000000000000000000000000000000000000001 x 2^-1

Example 7 (Incorrect)

The decimal fraction 0.3932922657273 is non-terminating in binary (it’s not dyadic); here is the relevant portion of it — its first 82 places:

0.011001001010111011001101010010110001000110001000111001100000000000
0000000000000001

Significant bits 53 and 54 are both 1, as is significant bit 81; it rounds correctly to

1.100100101011101100110101001011000100011000100011101 x 2^-2

Visual C++ computes the value one ULP below:

1.1001001010111011001101010010110001000110001000111001 x 2^-2

Discussion

Visual C++ gets example 4 correct, even though it appears harder than example 3, which it converts incorrectly: example 4 has seven more zeros between bit 54 and the next 1 bit (16 versus 9).

Examples 5 and 6 are dyadic fractions, which theoretically should be easy to convert. Example 5 has 66 digits after the decimal point, and example 6 has 65. (A dyadic fraction has as many bits after the binary point as it has digits after the decimal point.) Example 6 converts correctly, but example 5 does not. Is it because Visual C++ stops looking beyond 65 places?
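
The relationship between bits and digits in a dyadic fraction can be seen by converting the bits to decimal exactly: each halving step appends one decimal digit. Here is a short sketch of that conversion; dyadic_to_decimal() is a hypothetical helper written for this illustration, and it takes only the bits after the binary point.

```c
#include <assert.h>
#include <string.h>

/* Convert the bits after the binary point (e.g. "11" for binary 0.11)
   to the exact digits after the decimal point (e.g. "75").
   Returns the number of decimal digits written to out. */
static size_t dyadic_to_decimal(const char *bits, char *out, size_t out_size)
{
    size_t nbits = strlen(bits), ndigits = 0;

    assert(nbits < out_size);                /* at most one digit per bit */

    /* Work from the last bit to the first: value = (bit + value) / 2. */
    for (size_t i = nbits; i-- > 0; ) {
        int carry;

        assert(bits[i] == '0' || bits[i] == '1');
        carry = bits[i] - '0';               /* integer part before halving */
        for (size_t j = 0; j < ndigits; j++) {
            int cur = carry * 10 + (out[j] - '0');
            out[j] = (char)('0' + cur / 2);
            carry = cur % 2;
        }
        if (carry)
            out[ndigits++] = '5';            /* a leftover half adds one digit */
    }
    out[ndigits] = '\0';
    return ndigits;
}
```

When the last bit passed in is a 1, the number of decimal digits returned equals the number of bits, which is why example 5 (66 bits) has 66 decimal places and example 6 (65 bits) has 65.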

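Example 7 is a hard case: in the portion shown, the only 1 bit beyond bit 54 is significant bit 81, so the value lies just barely above halfway. Visual C++ gets it wrong.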

My Bug Report

In April 2009, I submitted a bug report to Microsoft, titled “Incorrectly Rounded Decimal to Binary Conversions in Visual C++.” Microsoft closed the bug, but said they would “re-consider [sic] for future versions of VC++”. We’ll see. (Update 5/9/13: My bug report has been deleted.)

You could argue that being only one ULP away from the correct result is inconsequential. Technically, I think it’s ‘legal’ from the perspective of the IEEE floating-point specification. (This seems to be conventional wisdom, but I can’t find the exact wording in the spec — can someone enlighten me?) However, widely used open source code — David Gay’s strtod() function in dtoa.c — gets these conversions right. Should Visual C++ do the same?

(I would imagine this ‘problem’ also affects the rest of Visual Studio, but I haven’t investigated.)

3 comments

  1. I cannot believe that this rounding problem would exist in such late releases of Visual Studio, surely Microsoft should do something about it?

    As you rightly state – widely used open source code gets it correct, so why can’t a product as highly esteemed as Visual Studio do so too? Something doesn’t seem quite right. To me, even if it is one ULP away from the correct answer, it should still be fixed.

    Anthony Daly CEO
    UniversalConverter.net

Copyright © 2008-2024 Exploring Binary
