<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Exploring Binary &#187; Numbers in computers</title>
	<atom:link href="http://www.exploringbinary.com/category/numbers-in-computers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.exploringbinary.com</link>
	<description>Binary Numbers, Binary Code, and Binary Logic</description>
	<lastBuildDate>Wed, 08 Sep 2010 13:37:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Double Rounding Errors in Floating-Point Conversions</title>
		<link>http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/</link>
		<comments>http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 16:21:21 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=291</guid>
		<description><![CDATA[Double rounding is when a number is rounded twice, first from n0 digits to n1 digits, and then from n1 digits to n2 digits. Double rounding is often harmless, giving the same result as rounding once, directly from n0 digits to n2 digits. However, sometimes a doubly rounded result will be incorrect, in which case [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/">Double Rounding Errors in Floating-Point Conversions</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Wikipedia article definition of double rounding" href="http://en.wikipedia.org/wiki/Rounding#Double_rounding">Double rounding</a> is when a number is rounded twice, first from n<sub>0</sub> digits to n<sub>1</sub> digits, and then from n<sub>1</sub> digits to n<sub>2</sub> digits. Double rounding is often harmless, giving the same result as rounding once, directly from n<sub>0</sub> digits to n<sub>2</sub> digits. However, sometimes a doubly rounded result will be incorrect, in which case we say that a <strong><em>double rounding error</em></strong> has occurred. </p>
<p>For example, consider the 6-digit decimal number 7.23496. Rounded directly to 3 digits &#8212; using round-to-nearest, <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round half to even</a> rounding &#8212; it&#8217;s 7.23; rounded first to 5 digits (7.2350) and then to 3 digits it&#8217;s 7.24. The value 7.24 is incorrect, reflecting a double rounding error.</p>
<p>In a computer, double rounding occurs in binary floating-point arithmetic; the typical example is a calculated result that&#8217;s rounded to fit into an x87 FPU extended precision register and then rounded again to fit into a double-precision variable. But I&#8217;ve discovered another context in which double rounding occurs: conversion from a decimal floating-point literal to a single-precision floating-point variable. The double rounding is from full-precision binary to double-precision, and then from double-precision to single-precision.</p>
<p>In this article, I&#8217;ll show example conversions in C that are tainted by double rounding errors, and how attaching the &#8216;f&#8217; suffix to floating-point literals prevents them &#8212; in gcc C at least, but not in Visual C++!</p>
<p><span id="more-291"></span></p>
<h2>Double Rounding Error Example 1</h2>
<p>The following C program demonstrates a double rounding error that occurs when a decimal floating-point literal is converted to a float by the compiler. The program converts the same decimal number three times: first to a double <em>d</em>; then to a float <em>f</em>, using the &#8216;f&#8217; (float) suffix; and then to a float <em>fd</em>, without using the &#8216;f&#8217; suffix.</p>
<pre>
#include &lt;stdio.h&gt;

int main (void)
{
 double d;
 float f, fd; 

 d =  0.5000000894069671353303618843710864894092082977294921875;
 f =  0.5000000894069671353303618843710864894092082977294921875f;
 <strong>fd = 0.5000000894069671353303618843710864894092082977294921875;</strong>

 printf (&quot;d =  %a\n&quot;,d);
 printf (&quot;f =  %a\n&quot;,f);
 printf (&quot;fd = %a\n&quot;,fd);
}
</pre>
<h3>Output on Linux</h3>
<p>Here&#8217;s the output when the program is compiled and run on Linux using gcc:</p>
<pre>
d =  0x1.000003p-1
f =  0x1.000002p-1
<strong>fd = 0x1.000004p-1</strong>
</pre>
<p>Translated from hexadecimal floating-point constants to binary, the output is</p>
<pre class="noborder">
d =  0.1000000000000000000000011
f =  0.100000000000000000000001
fd = 0.10000000000000000000001
</pre>
<p>To interpret the output, first consider the <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">full-precision binary equivalent</a> of the decimal number in the program (it is a terminating binary fraction):</p>
<p>0.1000000000000000000000010111111111111111111111111111111</p>
<p>Now let&#8217;s look at the three variables in turn:</p>
<ul>
<li><em>d</em> is the <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/#correct-rounding">correctly rounded conversion</a> of the full-precision binary value to double-precision &#8212; 53 significant bits. The full-precision binary value is rounded up, since the value of bits 54 and beyond is greater than 1/2 <a title="Wikipedia article &ldquo;Unit in the last place&rdquo;" href="http://en.wikipedia.org/wiki/Unit_in_the_last_place">ULP</a>:
<p>0.10000000000000000000000101111111111111111111111111111<span class="highlight_gray4">11</span></p>
</li>
<li><em>f</em> is the correctly rounded conversion of the full-precision binary value to single-precision &#8212; 24 significant bits. The full-precision binary value is rounded down, since the value of bits 25 and beyond is <em>less than</em> 1/2 ULP:
<p>0.100000000000000000000001<span class="highlight_gray4">0111111111111111111111111111111</span></p>
</li>
<li><em>fd</em> is an <em><strong>incorrectly rounded conversion</strong></em> to single-precision. It&#8217;s 1 ULP above the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it up to 53 bits, giving 0.100000000000000000000001<span class="highlight_gray4">1</span>, and then rounding that up to 24 bits, giving 0.10000000000000000000001. In other words, the first rounding causes a carry to propagate to bit 25, creating a <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/#halfway-case">halfway case</a> that gets resolved by rounding up to nearest even.</li>
</ul>
<h4>Summary</h4>
<p>With the &#8216;f&#8217; suffix on a decimal literal, gcc converts a full-precision binary value directly to single-precision, avoiding double rounding. Without the &#8216;f&#8217; suffix, gcc converts the full-precision binary value to double-precision and then to single-precision; this indirect conversion from full-precision to single-precision is double rounding, and in this case results in a double rounding error.</p>
<h3>Output on Windows</h3>
<p>Here&#8217;s the output when the program is compiled and run on Windows using Visual C++:</p>
<pre>
d =  0x1.000003p-1
<strong>f =  0x1.000004p-1</strong>
<strong>fd = 0x1.000004p-1</strong>
</pre>
<p>Like with gcc, <em>fd</em> is doubly rounded; but Visual C++ doubly rounds <em>f</em> as well!  <strong>It seems Visual C++ ignores the &#8216;f&#8217; suffix</strong>, first rounding to double-precision and then rounding to single-precision. (I am presuming this is the case. It&#8217;s also possible that Visual C++ is honoring the &#8216;f&#8217; suffix, only it&#8217;s <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">doing the conversion incorrectly</a> &#8212; due to reasons other than double rounding.)</p>
<h2>Double Rounding Error Example 2</h2>
<p>Here&#8217;s a similar example, except that it demonstrates a double rounding error that results in a value that is 1 ULP <em>below</em> the correctly rounded value:</p>
<pre>
#include &lt;stdio.h&gt;

int main (void)
{
 double d;
 float f, fd; 

 d =  0.5000000298023224154508881156289135105907917022705078125;
 f =  0.5000000298023224154508881156289135105907917022705078125f;
 <strong>fd = 0.5000000298023224154508881156289135105907917022705078125;</strong>

 printf (&quot;d =  %a\n&quot;,d);
 printf (&quot;f =  %a\n&quot;,f);
 printf (&quot;fd = %a\n&quot;,fd);
}
</pre>
<h3>Output on Linux</h3>
<p>Here’s the output when the program is compiled and run on Linux using gcc:</p>
<pre>
d =  0x1.000001p-1
f =  0x1.000002p-1
<strong>fd = 0x1p-1</strong>
</pre>
<p>Translated from hexadecimal floating-point constants to binary, the output is</p>
<pre class="noborder">
d =  0.1000000000000000000000001
f =  0.100000000000000000000001
fd = 0.1
</pre>
<p>The full-precision binary equivalent of the decimal number in the program is</p>
<p>0.1000000000000000000000001000000000000000000000000000001</p>
<p>Again, let&#8217;s look at the three variables:</p>
<ul>
<li><em>d</em> is the correctly rounded conversion of the full-precision binary value to 53 significant bits. The full-precision binary value is rounded down, since the value of bits 54 and beyond is less than 1/2 ULP:
<p>0.10000000000000000000000010000000000000000000000000000<span class="highlight_gray4">01</span></p>
</li>
<li><em>f</em> is the correctly rounded conversion of the full-precision binary value to 24 significant bits. The full-precision binary value is rounded up, since the value of bits 25 and beyond is <em>greater than</em> 1/2 ULP:
<p>0.100000000000000000000000<span class="highlight_gray4">1000000000000000000000000000001</span></p>
</li>
<li><em>fd</em> is an <em><strong>incorrectly rounded conversion</strong></em> to single-precision. It&#8217;s 1 ULP below the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it down to 53 bits, giving 0.100000000000000000000000<span class="highlight_gray4">1</span>, and then rounding that down to 24 bits, giving 0.1. In other words, the first rounding creates a halfway case that gets resolved by rounding down to nearest even.</li>
</ul>
<p>As in example 1, use of the ‘f’ suffix prevents double rounding. </p>
<h3>Output on Windows</h3>
<p>Here&#8217;s the output when the program is compiled and run on Windows using Visual C++:</p>
<pre>
d =  0x1.000001p-1
<strong>f =  0x1.000000p-1</strong>
<strong>fd = 0x1.000000p-1</strong>
</pre>
<p>As in example 1, Visual C++ ignores the &#8216;f&#8217; suffix, resulting in double rounding for both <em>f</em> and <em>fd</em>.</p>
<h2>Summary</h2>
<p>Floating-point literals are subject to double rounding when assigned to single-precision variables, resulting in incorrectly rounded decimal to floating-point conversions. If you&#8217;re using the gcc C compiler, you can avoid this by attaching the &#8216;f&#8217; suffix to your literals.</p>
<h3>A Note on the Examples</h3>
<p>The examples above are admittedly contrived, but they serve to illustrate the potential for double rounding errors during conversion. </p>
<h2>Please Try My Examples</h2>
<p>If you are using a compiler other than Visual C++ or gcc, please try out my examples. In particular, I&#8217;d like to know if your compiler doubly rounds decimal literals when the &#8216;f&#8217; suffix is used.</p>
<h2>Bug Report</h2>
<p>I wrote a <a title="See Microsoft Bug Report &ldquo;Double Rounding Error When the 'f' Suffix is Used on Floating-Point Literals&rdquo;" href="http://connect.microsoft.com/VisualStudio/feedback/details/585669/double-rounding-error-when-the-f-suffix-is-used-on-floating-point-literals">bug report against Visual C++</a>; this was Microsoft&#8217;s response:</p>
<blockquote><p>&ldquo;This is indeed an issue with the compiler but we regret that we cannot fix it for the next release due to its priority.&rdquo;</p></blockquote>
<h2>References</h2>
<p>Here is some background reading on double rounding:</p>
<ul>
<li><a title="See Book &ldquo;Handbook of Floating-Point Arithmetic&rdquo;" href="http://books.google.com/books?id=VDJVFM8WyzsC&#038;printsec=frontcover&#038;dq=Handbook+of+Floating-Point+Arithmetic&#038;source=bl&#038;ots=qlW0NFNKgW&#038;sig=5x2gReyRufmK0HQzaAT-V84e4To&#038;hl=en&#038;ei=FklhTJziIoO88gah2J2jCg&#038;sa=X&#038;oi=book_result&#038;ct=result&#038;resnum=3&#038;ved=0CCUQ6AEwAg#v=onepage&#038;q&#038;f=false">Handbook of Floating-Point Arithmetic</a> (see pages 75-77).</li>
<li><a title="See Presentation &ldquo;When Double Rounding is Odd&rdquo;" href="http://www.lri.fr/~melquion/doc/05-imacs17_1-expose.pdf">When Double Rounding is Odd</a> (see the two pages titled &ldquo;Double rounding going bad&rdquo;).</li>
<li><a title="See Paper &ldquo;When Is Double Rounding Innocuous?&rdquo;" href="http://delivery.acm.org/10.1145/230000/221334/p21-figueroa.pdf?key1=221334&#038;key2=8593441821&#038;coll=portal&#038;dl=ACM&#038;CFID=15151515&#038;CFTOKEN=6184618">When Is Double Rounding Innocuous?</a> (see the first two pages).</li>
</ul>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/">Double Rounding Errors in Floating-Point Conversions</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Displaying IEEE Doubles in Binary Scientific Notation</title>
		<link>http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/</link>
		<comments>http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 04:39:28 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=289</guid>
		<description><![CDATA[An IEEE double-precision floating-point number, or double, is a 64-bit encoding of a rational number. Internally, the 64 bits are broken into three fields: a 1-bit sign field, which represents positive or negative; an 11-bit exponent field, which represents a power of two; and a 52-bit fraction field which, when appended to an implicit leading [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">Displaying IEEE Doubles in Binary Scientific Notation</a></p>
]]></description>
			<content:encoded><![CDATA[<p>An IEEE double-precision floating-point number, or double, is a 64-bit encoding of a rational number. Internally, the 64 bits are broken into three fields: a 1-bit sign field, which represents positive or negative; an 11-bit exponent field, which represents a power of two; and a 52-bit fraction field which, when appended to an implicit leading 1, represents a normalized binary fraction. The three fields together represent a number in binary scientific notation, with 53 bits of precision.</p>
<p>For example, consider the decimal number 33.75.  It converts to a double with a sign field of 0, an exponent field of 10000000100, and a fraction field of 0000111000000000000000000000000000000000000000000000. The 0 in the sign field means it&#8217;s a positive number (1 would mean it&#8217;s negative). The value of 10000000100 in the exponent field, which equals 1028 in decimal, means the exponent of the power of two is 5 (the exponent field value is offset, or biased, by 1023). The fraction field represents the normalized binary fraction 1.0000111. Written in binary scientific notation &#8212; following the convention that the fraction is written in binary and the power of two is written in decimal &#8212; 33.75 equals 1.0000111 x 2<sup>5</sup>.</p>
<p>In this article, I&#8217;ll show you the C function I wrote to display a double in binary scientific notation. This function is useful, for example, when <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">verifying that decimal to floating-point conversions are correctly rounded</a>.</p>
<p><span id="more-289"></span></p>
<h2>The Code</h2>
<p>I wrote a function called <strong>print_double_binsci()</strong> that prints a double-precision floating-point number in binary scientific notation. It is based on a call to my function <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">parse_double()</a>, which isolates the three fields of a double.</p>
<p>I declared and defined this function in files I named binsci.h and binsci.c, respectively.</p>
<h3>binsci.h</h3>
<pre>
/***********************************************************/
/* binsci.h: Function to print an IEEE double-precision    */
/*           floating-point number in binary scientific    */
/*           notation                                      */
/***********************************************************/
void print_double_binsci(double d);
</pre>
<h3>binsci.c</h3>
<pre>
/***********************************************************/
/* binsci.c: Function to print an IEEE double-precision    */
/*           floating-point number in binary scientific    */
/*           notation                                      */
/***********************************************************/
#include &lt;stdio.h&gt;
#include &quot;<a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">rawdouble.h</a>&quot;
#include &quot;binsci.h&quot;

void print_double_binsci(double d)
{
 int i;
 unsigned char sign_field;
 unsigned short exponent_field;
 unsigned long long fraction_field;

 //Isolate the three fields of the double
 <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">parse_double</a>(d,&#038;sign_field,&#038;exponent_field,&#038;fraction_field);

 //Print a minus sign, if necessary
 if (sign_field == 1)
   printf(&quot;-&quot;);

 //Print the implicit leading 1 and radix point
 printf(&quot;1.&quot;);

 //Print the fraction field
 for (i=0; i&lt;=51; i++)
   if ((fraction_field &gt;&gt; (51-i)) &#038; 1)
     printf(&quot;1&quot;);
   else
     printf(&quot;0&quot;);

 //Print the power, taking into account exponent bias
 printf(&quot; x 2^%d\n&quot;,exponent_field - 1023);
}
</pre>
<h3>Notes</h3>
<ul>
<li>All 53 bits are printed, including superfluous trailing 0s.</li>
<li>Numbers that are not raised to a power are printed with the suffix &ldquo;x 2^0&rdquo;.</li>
<li>Not-a-number (NaN) and infinity values are not handled.</li>
</ul>
<h2>Examples of Usage</h2>
<p>I wrote a program, called binsciTest.c, that shows some example calls to print_double_binsci():</p>
<pre>
/***********************************************************/
/* binsciTest.c: Program to test printing of IEEE double   */
/*               precision floating-point numbers in       */
/*               binary scientific notation                */
/***********************************************************/
#include &lt;stdio.h&gt;
#include &quot;binsci.h&quot;

int main (void)
{
 printf(&quot;33.75 =\n&quot;);
 print_double_binsci(33.75);
 printf(&quot;\n&quot;);

 printf(&quot;0.1 =\n&quot;);
 print_double_binsci(0.1);
 printf(&quot;\n&quot;);

 printf(&quot;-0.6 =\n&quot;);
 print_double_binsci(-0.6);
 printf(&quot;\n&quot;);

 printf(&quot;3.518437208883201171875e13 =\n&quot;);
 print_double_binsci(3.518437208883201171875e13);
 printf(&quot;\n&quot;);

 printf(&quot;9214843084008499.0 =\n&quot;);
 print_double_binsci(9214843084008499.0);
 printf(&quot;\n&quot;);

 printf(&quot;30078505129381147446200.0 =\n&quot;);
 print_double_binsci(30078505129381147446200.0);
 printf(&quot;\n&quot;);

 printf(&quot;1777820000000000000001.0 =\n&quot;);
 print_double_binsci(1777820000000000000001.0);
 printf(&quot;\n&quot;);

 printf(&quot;0.3932922657273 =\n&quot;);
 print_double_binsci(0.3932922657273);

 return (0);
}
</pre>
<p>(Some of these examples were taken from my articles <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a> and <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in GCC and GLIBC&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a>.)</p>
<p>I compiled and ran it on both Windows and Linux:</p>
<ul>
<li>On Windows, I built a project in Visual C++ with files binsci.c, binsci.h, binsciTest.c, rawdouble.c, and rawdouble.h, and compiled and ran it in there.</li>
<li>On Linux, I compiled with &ldquo;gcc binsciTest.c binsci.c rawdouble.c -o binsciTest&rdquo; and then ran it with &#8220;./binsciTest&#8221;.</li>
</ul>
<h3>Output</h3>
<pre>
33.75 =
1.0000111000000000000000000000000000000000000000000000 x 2^5

0.1 =
1.1001100110011001100110011001100110011001100110011010 x 2^-4

-0.6 =
-1.0011001100110011001100110011001100110011001100110011 x 2^-1

3.518437208883201171875e13 =
1.0000000000000000000000000000000000000000000000000001 x 2^45

9214843084008499.0 =
1.0000010111100110110011101100010101110111011000011001 x 2^53

30078505129381147446200.0 =
1.1001011110100011110001110010011100011011000000100000 x 2^74

1777820000000000000001.0 =
1.1000000110000000110101011011101011010010111000111110 x 2^70

0.3932922657273 =
1.1001001010111011001101010010110001000110001000111001 x 2^-2
</pre>
<p>Except for 33.75, which is exact, all the other examples are 53 significant bit <em>approximations</em> to the decimal numbers they stand in for (type &lsquo;0.1&rsquo; into my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">decimal to binary converter</a> and see for yourself).</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">Displaying IEEE Doubles in Binary Scientific Notation</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Incorrect Directed Conversions in David Gay&#8217;s strtod()</title>
		<link>http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/</link>
		<comments>http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/#comments</comments>
		<pubDate>Wed, 23 Jun 2010 13:47:58 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=287</guid>
		<description><![CDATA[For correctly rounded decimal to floating-point conversions, many open source projects rely on David Gay&#8217;s strtod() function. In the default rounding mode, IEEE 754 round-to-nearest, this function is known to give correct results (notwithstanding recent bugs, which have been fixed). However, in the less frequently used IEEE 754 directed rounding modes &#8212; round toward positive [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/">Incorrect Directed Conversions in David Gay&#8217;s strtod()</a></p>
]]></description>
			<content:encoded><![CDATA[<p>For <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">correctly rounded decimal to floating-point conversions</a>, many open source projects rely on <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>. In the default rounding mode, IEEE 754 <em>round-to-nearest</em>, this function is known to give correct results (notwithstanding <a title="Change Log for David Gay's dtoa.c" href="http://www.netlib.org/fp/changes">recent bugs</a>, which have been fixed). However, in the less frequently used IEEE 754 directed rounding modes &#8212; <em>round toward positive infinity</em>, <em>round toward negative infinity</em>, and <em>round toward zero</em> &#8212; strtod() gives incorrectly rounded results for some inputs.</p>
<p><span id="more-287"></span></p>
<p>(<strong>Update 7/11/10: dtoa.c is now fixed &#8212; see below</strong>.)</p>
<h2>Discovering the Problem</h2>
<p>As a follow up to my article &ldquo;<a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a>,&rdquo; I asked Mark Dickinson &#8212; the developer who maintains Python&#8217;s copy of dtoa.c &#8212; why Python does not enable what I call <em>directed conversions</em>. Out of that discussion came Mark&#8217;s conjecture: the directed rounding code was not correct. He soon followed up with an example to prove it.</p>
<h2>An Automated Way to Find Examples</h2>
<p>I wanted to find more examples, but I wanted a C program to do it for me. I needed a way to generate examples whose correct outcomes were known (in my evaluations of round-to-nearest conversions, David Gay&#8217;s strtod() served this purpose). Inspired by Mark&#8217;s example, this is the algorithm I devised:</p>
<ol>
<li>Generate a random decimal string that represents a number that is not an integer and is not <a title="Wikipedia article on dyadic fraction" href="http://en.wikipedia.org/wiki/Dyadic_fraction">dyadic</a>.</li>
<li>Convert the decimal string to a double (I used David Gay&#8217;s strtod(), but the built-in strtod() would have been fine &#8212; correct rounding does not matter here).</li>
<li>Convert the double to a decimal string, giving the exact decimal representation of a floating-point number (I used <a title="David Gay's g_fmt.c" href="http://www.netlib.org/fp/g_fmt.c">David Gay&#8217;s g_fmt() function</a>, since I&#8217;m running on Windows and <a title="Read Rick Regan's article &ldquo;Print Precision of Dyadic Fractions Varies by Language&rdquo;" href="http://www.exploringbinary.com/print-precision-of-dyadic-fractions-varies-by-language/">printf() in Visual C++ can&#8217;t print all the decimal digits of a floating-point number</a>).</li>
<li>Change the last digit of the new decimal string &#8212; which by the way is always 5 (<a title="Read Rick Regan's Article &ldquo;Patterns in the Last Digits of the Positive Powers of Five&rdquo;" href="http://www.exploringbinary.com/patterns-in-the-last-digits-of-the-positive-powers-of-five/">this article</a> indirectly explains why) &#8212; to either 4 or 6, depending on whether you want a number just below or just above the floating-point number.</li>
</ol>
<p>Except for Mark&#8217;s example, the examples below were found by my C program implementation of this algorithm. I ran the program against a copy of dtoa.c most recently changed on &ldquo;Tue Feb  2 23:05:34 MST 2010,&rdquo; which I downloaded on May 27, 2010.</p>
<h2>Examples: Decimal Numbers Just Above a Floating-Point Number</h2>
<p>This section shows two examples that David Gay&#8217;s strtod() (hereafter referred to only as strtod()) rounds incorrectly; they are very close to, but just <em>above</em> (in absolute value), a floating-point number.</p>
<h3>Example 1</h3>
<p>This is Mark&#8217;s example, and it shows a positive number rounding to nearest in round toward positive infinity mode:</p>
<p>1.100000000000000088817841970012523233890533447265626</p>
<p>It converts to this binary number, which has a non-terminating binary fraction portion:</p>
<p>1.0001100110011001100110011001100110011001100110011010<span class="highlight_gray4">0</span>0000000000000<br />
00000000000000000000000000000000000000000000000000000000000000000000 00000000000000000000000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded toward positive infinity the result should be</p>
<p>1.0001100110011001100110011001100110011001100110011011</p>
<p>In hexadecimal “%a” notation, this equals 0&#120;1.199999999999bp+0.</p>
<p>strtod() converts it incorrectly to 0&#120;1.199999999999ap+0, which is one ULP below the correctly rounded value. 0&#120;1.199999999999ap+0 is the floating-point number equal to decimal 1.100000000000000088817841970012523233890533447265625, which differs from the input decimal value only in the last decimal digit (5 vs 6).</p>
<h3>Example 2</h3>
<p>This example shows a negative number rounding to nearest in round toward negative infinity mode:</p>
<p>-0.91276999999999997026378650843980722129344940185546876</p>
<p>It converts to this non-terminating binary number:</p>
<p>-0.11101001101010110100101101110010110001010001100101111<span class="highlight_gray4">0</span>000000000000<br />
0000000000000000000000000000000000000000000000000000000000000000000<br />
0000000000000000000000000000000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded toward negative infinity the result should be -0&#120;1.d35696e58a330p-1; strtod() converts it incorrectly to -0&#120;1.d35696e58a32fp-1.</p>
<h2>Examples: Decimal Numbers Just Below a Floating-Point Number</h2>
<p>This section shows two examples that strtod() rounds incorrectly; they are very close to, but just <em>below</em> (in absolute value), a floating-point number.</p>
<h3>Example 3</h3>
<p>This example shows a negative number rounding to nearest in round toward positive infinity mode: </p>
<p>-266.240000000000009094947017729282379150390624</p>
<p>Rounded toward positive infinity it should be -0&#120;1.0a3d70a3d70a3p+8; strtod() rounds it incorrectly to -0&#120;1.0a3d70a3d70a4p+8.</p>
<h3>Example 4</h3>
<p>This example shows a positive number rounding to nearest in both round toward negative infinity and round toward zero modes: </p>
<p>8.2556288587679180024720432899523381023022507640626854730214745359262<br />
45152950286865234374e-17 </p>
<p>Rounded toward negative infinity and rounded toward zero it should be 0&#120;1.7cb9433617c9bp-54; strtod() rounds it incorrectly &#8212; in both cases &#8212; to 0&#120;1.7cb9433617c9cp-54.</p>
<h2>Discussion</h2>
<p>In my article &ldquo;<a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a>&rdquo; I said</p>
<blockquote><p>The percentage of incorrect conversions for both Visual C++ and glibc ranged between 49.00% and 49.06%, suggesting they were not reacting to the rounding mode at all. (Why it wasn’t almost exactly 50% I don’t know.) </p></blockquote>
<p>It might not have been 50% because, although I didn&#8217;t know it at the time, strtod()&#8217;s conversions weren&#8217;t 100% correct! <strike>(I&#8217;ll rerun those tests when dtoa.c is fixed.)</strike> (Update: the percentages are still near 49%, even with the fixed dtoa.c, so scratch that theory.)</p>
<h3>Do You Use Directed Rounding?</h3>
<p>If your project uses dtoa.c, could you let me know whether it enables directed conversions? That is, does it define the &ldquo;Honor_FLT_ROUNDS&rdquo; flag?</p>
<p>You should be aware that <a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">strtod() reacts to the rounding mode, although incorrectly, when Honor_FLT_ROUNDS is not defined</a>.</p>
<h2>Bug Report</h2>
<p>I reported this to David Gay; I&#8217;ll keep you posted.</p>
<h3>Update: dtoa.c is Fixed</h3>
<p>David Gay updated dtoa.c with a fix, and it appears to work: all four examples above now round correctly, as did the millions of random examples I checked.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/">Incorrect Directed Conversions in David Gay&#8217;s strtod()</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visual C++ and GLIBC strtod() Ignore Rounding Mode</title>
		<link>http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/</link>
		<comments>http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 13:28:25 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=286</guid>
		<description><![CDATA[When a decimal number is converted to a binary floating-point number, the floating-point number, in general, is only an approximation to the decimal number. Large integers, and most decimal fractions, require more significant bits than can be represented in the floating-point format. This means the decimal number must be rounded, to one of the two [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a></p>
]]></description>
			<content:encoded><![CDATA[<p>When a decimal number is converted to a binary floating-point number, the floating-point number, in general, is only an approximation to the decimal number. Large integers, and most decimal fractions, require more significant bits than can be represented in the floating-point format. This means the decimal number must be rounded, to one of the two floating-point numbers that surround it.</p>
<p>Common practice considers a decimal number correctly rounded when the <em>nearest</em> of the two floating-point numbers is chosen (and when both are equally near, when the one with significant bit number 53 equal to 0 is chosen). This makes sense intuitively, and also reflects the default IEEE 754 rounding mode &#8212; <em>round-to-nearest</em>. However, there are three other IEEE 754 rounding modes, which allow for directed rounding: <em>round toward positive infinity</em>, <em>round toward negative infinity</em>, and <em>round toward zero</em>. For a conversion to be considered truly correctly rounded, it must honor all four rounding modes &#8212; whichever is currently in effect.</p>
<p>I evaluated the Visual C++ and glibc strtod() functions under the three directed rounding modes, like I did for round-to-nearest mode in my articles &ldquo;<a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a>&rdquo; and &ldquo;<a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in GCC and GLIBC&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a>.&rdquo;. What I discovered was this: they only convert correctly about half the time &#8212; pure chance! &#8212; because they ignore the rounding mode altogether.</p>
<p><span id="more-286"></span></p>
<h2>Example: Correct Rounding in All Four Rounding Modes</h2>
<p>The diagram below shows how correct conversion works in each of the four rounding modes; I used the decimal values 0.1 and -0.1 as examples:</p>
<div class="wp-caption aligncenter" style="width: 557px"><img src="http://www.exploringbinary.com/wp-content/uploads/RoundingIntervalCorrect.png" alt="How 0.1 and -0.1 Convert Correctly in Each of the Four Rounding Modes" width="547" height="169"/><p class="wp-caption-text">How 0.1 and -0.1 Convert Correctly in Each of the Four Rounding Modes</p></div>
<p>The conversions &#8212; shown in hexadecimal &ldquo;%a&rdquo; notation &#8212; were done by <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a> and also verified by hand. In terms of absolute value, there are two different results; here they are in binary:</p>
<ul>
<li>0&#120;1.9999999999999p-4 =
<p>1.1001100110011001100110011001100110011001100110011001  x 2<sup>-4</sup></p>
</li>
<li>0&#120;1.999999999999ap-4 =
<p>1.100110011001100110011001100110011001100110011001101 x 2<sup>-4</sup></p>
</li>
</ul>
<h2>Example: Incorrect Rounding in Visual C++ and glibc</h2>
<p>Visual C++ and glibc strtod() do four of the eight conversions incorrectly:</p>
<div class="wp-caption aligncenter" style="width: 555px"><img src="http://www.exploringbinary.com/wp-content/uploads/RoundingIntervalIncorrect.png" alt="How 0.1 and -0.1 Convert Incorrectly in Visual C++ and glibc" width="545" height="166"/><p class="wp-caption-text">How 0.1 and -0.1 Convert Incorrectly in Visual C++ and glibc</p></div>
<p>0.1, in round toward zero and round toward negative infinity modes, converts incorrectly to 0&#120;1.999999999999ap-4. -0.1, in round toward zero and round toward positive infinity modes, converts incorrectly to -0&#120;1.999999999999ap-4. That is, Visual C++ and glibc are converting 0.1 and -0.1 to a single absolute value in every case: 0&#120;1.999999999999ap-4. That, not coincidentally, is the nearest floating-point number.</p>
<h2>Visual C++ and glibc Always Round to Nearest</h2>
<p>To see what percentage of conversions Visual C++ and glibc got right, I tested strtod() with random decimal values &#8212; both positive and negative &#8212;  in each of the three directed rounding modes. I compared their results to those computed by David Gay&#8217;s strtod(). The percentage of incorrect conversions for both Visual C++ and glibc ranged between 49.00% and 49.06%, suggesting they were not reacting to the rounding mode at all. (Why it wasn&#8217;t almost exactly 50% I don&#8217;t know.) To confirm this, I generated random decimal values and converted each in all four modes; in every case, the four conversions were the same.</p>
<p><strong>Conclusion</strong>: Visual C++ and glibc are using arbitrary-precision arithmetic in their calculations, and their algorithms are hard-coded to perform round-to-nearest rounding.</p>
<p>(<em>Update</em>: I was assuming David Gay&#8217;s strtod() got the conversions right in the directed rounding modes, but it turns out <a title="Read Rick Regan's article &ldquo;Incorrect Directed Conversions in David Gay&rsquo;s strtod()&rdquo;" href="http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/">there was a bug &#8212; now fixed &#8212; that caused some conversions to come out incorrectly rounded</a>. In any case, my conclusion is unaffected, since ultimately I didn&#8217;t need David Gay&#8217;s strtod() to reach it.)</p>
<h2>Enabling Rounding Mode Detection in David Gay&#8217;s strtod()</h2>
<p>David Gay&#8217;s strtod() function does not honor rounding mode by default; it must be enabled, by building dtoa.c with the flag &ldquo;Honor_FLT_ROUNDS&rdquo;. This forces the inclusion of file fenv.h, which defines the function fegetround(). fegetround() is called to get the current processor rounding mode, which strtod() uses to round its results correctly.</p>
<p>To make this work in Visual C++ &#8212; which doesn&#8217;t have the file fenv.h &#8212; I had to simulate fegetround(). I did this by calling _controlfp_s(&#038;cur_cw,0,0) and translating the rounding mode masks (I am using the x87 FPU).</p>
<h3>Conversion Algorithm Uses Floating-Point Arithmetic</h3>
<p>David Gay&#8217;s strtod() function uses floating-point as well as arbitrary-precision calculations in its conversion, which you can see by reading the source code. You can also see this by building dtoa.c without &ldquo;Honor_FLT_ROUNDS&rdquo; and running it in the different rounding modes. Although the conversions won&#8217;t  be correct (in general), they will differ, meaning they were computed with the help of the processor&#8217;s floating-point unit. </p>
<h2>Discussion</h2>
<h3>Compile Time Vs. Run Time Conversions</h3>
<p>I&#8217;ve been discussing strtod(), which does conversions at run time. But conversions can be done at compile time too &#8212; using floating-point literals. For the sake of argument, let&#8217;s say both strtod() and compile time conversions honor the current rounding mode (I don&#8217;t know if compile time conversions honor the rounding mode, but I expect they don&#8217;t). Assume also that both give the same result for the same rounding mode for the same decimal input. The compiled and strtod() conversions can still differ &#8212; if the rounding mode is different at run time than at compile time.</p>
<h3>GLIBC Bug Report</h3>
<p>The glibc bug report &ldquo;<a title="Bugzilla Bug 3479: Incorrect rounding in strtod()" href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=3479">Bugzilla Bug 3479: Incorrect rounding in strtod()</a>&rdquo; mentions, as an aside, strtod()&#8217;s failure to respond to the current rounding mode (see comment 6).</p>
<h3>Are There Any Applications That Depend on Directed Conversions?</h3>
<p>Are there any Visual C++ or glibc based applications that depend on these &ldquo;directed&rdquo; conversions? I hope not &#8212; it doesn&#8217;t work.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Incorrectly Rounded Conversions in GCC and GLIBC</title>
		<link>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/</link>
		<comments>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/#comments</comments>
		<pubDate>Thu, 03 Jun 2010 22:42:42 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=285</guid>
		<description><![CDATA[Visual C++ rounds some decimal to double-precision floating-point conversions incorrectly, but it&#8217;s not alone; the gcc C compiler and the glibc strtod() function do the same. In this article, I&#8217;ll show examples of incorrect conversions in gcc and glibc, and I&#8217;ll present a C program that demonstrates the errors. (To understand my methodology and terminology, [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Visual C++ rounds some decimal to double-precision floating-point conversions incorrectly</a>, but it&#8217;s not alone; the gcc C compiler and the glibc strtod() function do the same. In this article, I&#8217;ll show examples of incorrect conversions in gcc and glibc, and I&#8217;ll present a C program that demonstrates the errors.</p>
<p><span id="more-285"></span></p>
<p>(To understand my methodology and terminology, please read my article &ldquo;<a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a>.&rdquo;)</p>
<h2>My Analysis</h2>
<p>I ran my tests on an Intel Core Duo processor running <a title="Ubuntu Linux" href="http://www.ubuntu.com/">Ubuntu Linux</a> version 10.04, and I used the <a title="GCC, the GNU Compiler Collection" href="http://gcc.gnu.org/">gcc C compiler</a> version 4.4.3 and the <a title="Embedded GLIBC (EGLIBC)" href="http://www.eglibc.org/home">embedded glibc library (eglibc)</a> version 2.11.1. (As far as I can tell, eglibc uses the same conversion code as <a title="GNU C Library" href="http://www.gnu.org/software/libc/">glibc</a>, so I will refer to eglibc as glibc for the rest of this article.) (<em>Update</em>: I reran my tests on glibc, getting the same results.)</p>
<p>I did automated testing of glibc strtod() string conversions and manual testing of gcc floating-point literal conversions. This is what found in my automated testing of strtod():</p>
<ul>
<li>Conversions were done incorrectly only about four ten-thousandths of a percent of the time.</li>
<li>No conversion was incorrect by more than one <em>unit in the last place</em> (ULP).</li>
<li>Incorrect conversions included both &ldquo;hard&rdquo; and &ldquo;easy&rdquo; cases.</li>
<li>There were incorrect conversions covering all three cases: &ldquo;halfway&rdquo;, &ldquo;greater than halfway&rdquo;, and &ldquo;less than halfway.&rdquo;</li>
</ul>
<p>My manual testing of gcc was not as comprehensive, of course, but I did manage to find some examples of incorrect conversions.</p>
<p>For the remainder of this article, I will discuss six examples of incorrect conversions &#8212; examples uncovered by my automated and manual testing, as well as examples found in Bugzilla bug reports for gcc and glibc.</p>
<h2>Incorrect Conversion by Both GCC and GLIBC</h2>
<p>Here&#8217;s an example that&#8217;s converted incorrectly by <em>both</em> gcc and glibc &#8212; to the same incorrect value:</p>
<h3>Example 1 (Halfway Case)</h3>
<p>0.500000000000000166533453693773481063544750213623046875, </p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup>, converts to this binary number:</p>
<p>0.10000000000000000000000000000000000000000000000000001<span class="highlight_gray4">1</span></p>
<p>It has 54 significant bits, with bits 53 and 54 equal to 1. The correctly rounded value &#8212; in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> &#8212; is</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>gcc and glibc strtod() both compute the converted value as</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<p>(I constructed this example by hand.)</p>
<h2>Incorrect Conversions by GLIBC Only</h2>
<p>Here are three examples that are converted incorrectly by glibc but converted correctly by gcc:</p>
<h3>Example 2 (Halfway Case)</h3>
<p>3.518437208883201171875e13 converts to this binary number:</p>
<p>1000000000000000000000000000000000000000000000.0000001<span class="highlight_gray4">1</span></p>
<p>It has 54 significant bits, with bits 53 and 54 equal to 1. The correctly rounded value is</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>45</sup></p>
<p>glibc strtod() computes the converted value as</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>45</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<p>(This example was taken from &ldquo;<a title="Bugzilla Bug 3479: Incorrect rounding in strtod()" href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=3479">Bugzilla Bug 3479: Incorrect rounding in strtod()</a>.&rdquo;)</p>
<h3>Example 3 (Greater than Halfway Case)</h3>
<p>62.5364939768271845828 converts to this binary number, which has a non-terminating binary fraction portion:</p>
<p>111110.10001001010101111010101101010100111110010100011<span class="highlight_gray4">1</span>000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Bits 53 and 54 are equal to 1, so the correctly rounded value is</p>
<p>1.11110100010010101011110101011010101001111100101001 x 2<sup>5</sup></p>
<p>glibc strtod() computes the converted value as</p>
<p>1.1111010001001010101111010101101010100111110010100011 x 2<sup>5</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<p>(I discovered this example during random testing.)</p>
<h3>Example 4 (Less than Halfway Case)</h3>
<p>8.10109172351e-10 converts to this non-terminating binary fraction:</p>
<p>0.000000000000000000000000000000110111101010111001011101011101111000<br />
01111110100001100<span class="highlight_gray4">0</span>1111111111111<span class="highlight_gray4">0</span>&#8230;</p>
<p>Significant bit 54 is 0, so it should be rounded down to get the correctly rounded value</p>
<p>1.10111101010111001011101011101111000011111101000011 x 2<sup>-31</sup></p>
<p>glibc strtod() computes the converted value as</p>
<p>1.1011110101011100101110101110111100001111110100001101 x 2<sup>-31</sup></p>
<p>which is one ULP above the correctly rounded value.</p>
<p>(I discovered this example during random testing.)</p>
<h2>Incorrect Conversions by GCC Only</h2>
<p>Here are two examples that are converted incorrectly by gcc but converted correctly by glibc:</p>
<h3>Example 5 (Halfway Case)</h3>
<p>The number</p>
<p>1.50000000000000011102230246251565404236316680908203125</p>
<p>which equals 1 + 2<sup>-1</sup> + 2<sup>-53</sup>, converts to this binary number:</p>
<p>1.1000000000000000000000000000000000000000000000000000<span class="highlight_gray4">1</span></p>
<p>It rounds correctly to 1.1, since bit 53 is 0. </p>
<p>gcc converts it to </p>
<p>1.1000000000000000000000000000000000000000000000000001</p>
<p>which is one ULP above the correctly rounded value.</p>
<p>(I constructed this example by hand.)</p>
<h3>Example 6 (Less than Halfway Case)</h3>
<p>9007199254740991.4999999999999999999999999999999995 converts to this binary number, which has a non-terminating binary fraction portion:</p>
<p>11111111111111111111111111111111111111111111111111111.<span class="highlight_gray4">0</span>1111111111111<br />
1111111111111111111111111111111111111111111111111111111111111111111<br />
11111111111111111111111111111<span class="highlight_gray4">0</span>&#8230;</p>
<p>Bit 54 is 0, so it converts correctly to</p>
<p>1.1111111111111111111111111111111111111111111111111111 x 2<sup>52</sup></p>
<p>gcc converts it to</p>
<p>1.0 x 2<sup>53</sup></p>
<p>which is one ULP above the correctly rounded value.</p>
<p>(This example was taken from &ldquo;<a title="GCC Bugzilla Bug 21718: real.c rounding not perfect" href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21718">Bugzilla Bug 21718: real.c rounding not perfect</a>.&rdquo;)</p>
<h2>A C Program that Reports the Behavior</h2>
<p>Here&#8217;s a C program you can use to confirm incorrect rounding on your own Linux system:</p>
<pre>
// gcc -o examples examples.c
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define NUM_CONVERT 6

int main (void)
{
 unsigned int i;

 char decimal[NUM_CONVERT][60] = {
  "0.500000000000000166533453693773481063544750213623046875",
  "3.518437208883201171875e13",
  "62.5364939768271845828",
  "8.10109172351e-10",
  "1.50000000000000011102230246251565404236316680908203125",
  "9007199254740991.4999999999999999999999999999999995"
 };

 double gcc_conversion[NUM_CONVERT] = {
  0.500000000000000166533453693773481063544750213623046875,
  3.518437208883201171875e13,
  62.5364939768271845828,
  8.10109172351e-10,
  1.50000000000000011102230246251565404236316680908203125,
  9007199254740991.4999999999999999999999999999999995
};

 char correct[NUM_CONVERT][30] = {
  "0x1.0000000000002p-1",
  "0x1.0000000000002p+45",
  "0x1.f44abd5aa7ca4p+5",
  "0x1.bd5cbaef0fd0cp-31",
  "0x1.8p+0",
  "0x1.fffffffffffffp+52"
 };

 for (i=0; i &lt; NUM_CONVERT; i++)
 {
  printf ("%s\n",decimal[i]);
  printf (" Correct = %s\n",correct[i]);
  printf (" gcc =     %a\n",gcc_conversion[i]);
  printf (" strtod =  %a\n\n",strtod(decimal[i],NULL));
 }
}
</pre>
<p>This is the output on my system (I shaded the incorrect conversions):</p>
<pre>
0.500000000000000166533453693773481063544750213623046875
 Correct = 0x1.0000000000002p-1
 <span class="highlight_gray4">gcc =     0x1.0000000000001p-1</span>
 <span class="highlight_gray4">strtod =  0x1.0000000000001p-1</span>

3.518437208883201171875e13
 Correct = 0x1.0000000000002p+45
 gcc =     0x1.0000000000002p+45
 <span class="highlight_gray4">strtod =  0x1.0000000000001p+45</span>

62.5364939768271845828
 Correct = 0x1.f44abd5aa7ca4p+5
 gcc =     0x1.f44abd5aa7ca4p+5
 <span class="highlight_gray4">strtod =  0x1.f44abd5aa7ca3p+5</span>

8.10109172351e-10
 Correct = 0x1.bd5cbaef0fd0cp-31
 gcc =     0x1.bd5cbaef0fd0cp-31
 <span class="highlight_gray4">strtod =  0x1.bd5cbaef0fd0dp-31</span>

1.50000000000000011102230246251565404236316680908203125
 Correct = 0x1.8p+0
 <span class="highlight_gray4">gcc =     0x1.8000000000001p+0</span>
 strtod =  0x1.8p+0

9007199254740991.4999999999999999999999999999999995
 Correct = 0x1.fffffffffffffp+52
 <span class="highlight_gray4">gcc =     0x1p+53</span>
 strtod =  0x1.fffffffffffffp+52
</pre>
<p>The values for the correct conversions were obtained using <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>. I ran a separate program to compute them and then hard coded them in this program. (I could also have constructed the correct hex formats by hand, using my analysis in the examples above.)</p>
<p>If you try this on your system, let me know what happens; I expect the same output. Please include the type of Linux, and version numbers if possible.</p>
<h2>Bug Reports</h2>
<p>There are existing bug reports for gcc and glibc:</p>
<ul>
<li><a title="GCC Bugzilla Bug 21718: real.c rounding not perfect" href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21718">Bugzilla Bug 21718: real.c rounding not perfect</a>.</li>
<li><a title="Bugzilla Bug 3479: Incorrect rounding in strtod()" href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=3479">Bugzilla Bug 3479: Incorrect rounding in strtod()</a>.</li>
</ul>
<p>It&#8217;s not clear whether these bugs will (or should?) be fixed; if they are, I&#8217;d imagine <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a> would be used to replace the current code.</p>
<p>I would guess this problem affects the other GCC compilers, although I haven&#8217;t tested.</p>
<h2>GCC/GLIBC Vs. Visual C++</h2>
<p>Here are the differences I found between gcc/glibc and Visual C++ in my two analyses:</p>
<ul>
<li>glibc strtod() converted incorrectly about two orders of magnitude less frequently than Visual C++ strtod().</li>
<li>I found no integer values that rounded incorrectly in gcc or glibc, unlike in Visual C++.</li>
<li>gcc and glibc strtod() can convert the same decimal number differently; I found no such example in Visual C++ and its strtod() function.</li>
<li>Of the <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">five examples Visual C++ converts incorrectly</a>, only one &#8212; example 2 &#8212; is converted incorrectly by gcc or glibc (in this case, by both). Of the six examples in this article converted incorrectly by gcc or glibc, two are converted incorrectly by Visual C++: examples 1 and 2.</li>
<li>When gcc or glibc rounds halfway cases incorrectly, it does so either up or down. I found Visual C++ to consistently round incorrectly down.</li>
<li>I found an incorrect conversion for a &ldquo;less than halfway case&rdquo; in gcc/glibc; I found no such example in Visual C++.</li>
</ul>
<h2>Summary</h2>
<p>I&#8217;ve demonstrated that a decimal floating-point number is not guaranteed to convert to its <em>closest</em> binary floating-point number. Also, I&#8217;ve shown that a decimal floating-point number may be converted to <em>different</em> binary floating-point numbers, depending on which compiler or run-time library is used.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Incorrectly Rounded Conversions in Visual C++</title>
		<link>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/</link>
		<comments>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/#comments</comments>
		<pubDate>Fri, 28 May 2010 18:56:23 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=283</guid>
		<description><![CDATA[In my analysis of decimal to floating-point conversion I noted an example that was converted incorrectly by the Microsoft Visual C++ compiler. I&#8217;ve found more examples &#8212; including a class of examples &#8212; that it converts incorrectly. I will analyze those examples in this article. My Analysis I did my analysis using Visual C++ Express [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my analysis of <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">decimal to floating-point conversion</a> I noted an example that was converted incorrectly by the Microsoft Visual C++ compiler. I&#8217;ve found more examples &#8212; including a class of examples &#8212; that it converts incorrectly. I will analyze those examples in this article.</p>
<p><span id="more-283"></span></p>
<h2>My Analysis</h2>
<p>I did my analysis using Visual C++ Express Edition &#8212; both the 2008 and 2010 versions. I converted decimal numbers to double-precision floating-point, via the strtod() library function and floating-point literals converted implicitly by the compiler. I discovered examples of incorrect conversion in three ways: by random testing (using only strtod() so that it could be automated easily), by trial and error, and by direct construction. </p>
<p>My random testing was extensive, but not exhaustive. I generated random decimal strings (positive numbers only) in scientific notation, with a random number of digits (up to 21) and a random exponent (between -308 and +308). I compared the Visual C++ conversion to that given by <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>, the de facto standard correct conversion routine. This is what I found:</p>
<ul>
<li>Conversions were done incorrectly about three hundredths of a percent of the time.</li>
<li>No conversion was incorrect by more than one <em>unit in the last place</em> (ULP).</li>
<li>Incorrect conversions included both &ldquo;hard&rdquo; and &ldquo;easy&rdquo; cases (defined below).</li>
<li>All incorrect conversions that I verified by hand were either &ldquo;halfway&rdquo; or &ldquo;greater than halfway&rdquo; cases (defined below).</li>
</ul>
<p>For all the examples below, I used my C function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin()</a> to inspect the converted floating-point values. I verified that both strtod() and the compiler converted the same decimal number to the same double-precision floating-point value. I also verified the correctly converted value by hand, using the full-precision binary numbers generated by my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">decimal to binary converter</a>.</p>
<h2><a name="halfway-case"></a>Halfway Cases</h2>
<p>A <em>halfway case</em> is when a decimal number converts to a full-precision binary number that is halfway between two binary floating-point values. By convention, in correct rounding, the tie is broken by choosing the floating-point value with its least significant bit equal to 0. </p>
<p>In terms of double-precision floating-point, the halfway case happens when significant bit 54 of the full-precision binary number is equal to 1, and all bits beyond are equal to 0. If bit 53 is equal to 0, then the full-precision binary number is rounded down to 53 bits; if bit 53 is equal to 1, then the full-precision binary number is rounded up to 53 bits. In either case, the correctly rounded result has 0 in its 53rd significant bit.</p>
<p>My testing shows that Visual C++ consistently rounds halfway cases incorrectly; specifically, it always rounds down, even when bit 53 of the full-precision binary number equals 1.</p>
<h3>Example 1 (Incorrect)</h3>
<p>The integer 9,214,843,084,008,499 converts to this binary number:</p>
<p>10000010111100110110011101100010101110111011000011001<span class="highlight_gray4">1</span></p>
<p>Bit 54 is 1, and with no other bits beyond, is a halfway case. Bit 53 is 1, so the correctly rounded value &#8212; in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> &#8212; is</p>
<p>1.000001011110011011001110110001010111011101100001101 x 2<sup>53</sup></p>
<p>Visual C++ computes the converted value as</p>
<p>1.0000010111100110110011101100010101110111011000011001 x 2<sup>53</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<h3>Example 2 (Incorrect)</h3>
<p>The <a title="Wikipedia article on dyadic fraction" href="http://en.wikipedia.org/wiki/Dyadic_fraction">dyadic fraction</a></p>
<p>0.500000000000000166533453693773481063544750213623046875, </p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup>, converts to this binary number:</p>
<p>0.100000000000000000000000000000000000000000000000000011</p>
<p>It has 54 significant bits, with bits 53 and 54 equal to 1. The correctly rounded value is</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>Visual C++ computes the converted value as</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<h3>Discussion</h3>
<p>Examples 1 and 2 would be considered &ldquo;easy&rdquo; cases in correct rounding parlance. They require only one extra bit beyond double precision to make the correct rounding decision. These examples support my assertion that Visual C++ universally adopts truncation for halfway cases.</p>
<h2>Greater than Halfway Cases</h2>
<p>The <em>greater than halfway case</em> occurs when a decimal number converts to a full-precision binary number that is between two binary floating-point values, but closer to the one above. Visual C++ gets some of these wrong and some of these right.</p>
<h3>Example 3 (Incorrect)</h3>
<p>The integer 30,078,505,129,381,147,446,200 converts to this binary number:</p>
<p>11001011110100011110001110010011100011011000000100000<span class="highlight_gray4">1</span>000000000<span class="highlight_gray4">1</span>0111<br />
0111000</p>
<p>Bit 54 is 1 and there is a 1 bit beyond bit 54, so the correctly rounded value is</p>
<p>1.1001011110100011110001110010011100011011000000100001 x 2<sup>74</sup></p>
<p>Visual C++ computes the converted value as</p>
<p>1.10010111101000111100011100100111000110110000001 x 2<sup>74</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<h3>Example 4 (Correct)</h3>
<p>The integer 1,777,820,000,000,000,000,001 converts to this binary number:</p>
<p>11000000110000000110101011011101011010010111000111101<span class="highlight_gray4">1</span>00000000000000<br />
00<span class="highlight_gray4">1</span></p>
<p>Just like the prior example, bit 54 is 1 and there is a 1 bit beyond it &#8212; but Visual C++ converts this <em><strong>correctly</strong></em>, to </p>
<p>1.100000011000000011010101101110101101001011100011111 x 2<sup>70</sup></p>
<h3>Example 5 (Incorrect)</h3>
<p>The dyadic fraction</p>
<p>0.500000000000000166547006220929549868969843373633921146392822265625</p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup> + 2<sup>-66</sup>, converts to this binary number:</p>
<p>0.10000000000000000000000000000000000000000000000000001<span class="highlight_gray4">1</span>00000000000<span class="highlight_gray4">1</span></p>
<p>It converts correctly to</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>But Visual C++ converts it to this, one ULP below the correctly rounded value:</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<h3>Example 6 (Correct)</h3>
<p>The dyadic fraction</p>
<p>0.50000000000000016656055874808561867439493653364479541778564453125 </p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup> + 2<sup>-65</sup>, converts to this binary number:</p>
<p>0.10000000000000000000000000000000000000000000000000001<span class="highlight_gray4">1</span>0000000000<span class="highlight_gray4">1</span></p>
<p>Visual C++ converts this <em><strong>correctly</strong></em>, to</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<h3>Example 7 (Incorrect)</h3>
<p>The decimal fraction 0.3932922657273 is non-terminating in binary (it&#8217;s not dyadic); here is the relevant portion of it &#8212; its first 82 places:</p>
<p>0.011001001010111011001101010010110001000110001000111001<span class="highlight_gray4">1</span>00000000000<br />
000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Significant bits 53 and 54 are both 1, as is significant bit 81; it rounds correctly to</p>
<p>1.100100101011101100110101001011000100011000100011101 x 2<sup>-2</sup></p>
<p>Visual C++ computes the value one ULP below:</p>
<p>1.1001001010111011001101010010110001000110001000111001 x 2<sup>-2</sup></p>
<h3>Discussion</h3>
<p>Visual C++ gets example 4 correct, even though it appears harder than example 3, which it converts incorrectly; example 4 has six more zeros beyond bit 54.</p>
<p>Examples 5 and 6 are dyadic fractions, which theoretically should be easy to convert. Example 5 is 66 digits long, and example 6 is 65 digits long. (Dyadic fractions have the same number of bits as decimal digits.)  Example 6 converts correctly, but example 5 does not. Is it because Visual C++ stops looking beyond 65 places?</p>
<p>Example 7 is a hard case, which Visual C++ does get wrong.</p>
<h2>My Bug Report</h2>
<p>In April 2009, I submitted a bug report to Microsoft, titled &ldquo;<a title="See Microsoft bug report &ldquo;Incorrectly Rounded Decimal to Binary Conversions in Visual C++&rdquo;" href="http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=434491">Incorrectly Rounded Decimal to Binary Conversions in Visual C++</a>.&rdquo; Microsoft closed the bug, but said they would &ldquo;re-consider [sic] for future versions of VC++&rdquo;. We&#8217;ll see.</p>
<p>You could argue that being only one ULP away from the correct result is inconsequential. Technically, I think it&#8217;s &lsquo;legal&rsquo; from the perspective of the IEEE floating-point specification. (This seems to be conventional wisdom, but I can&#8217;t find the exact wording in the spec &#8212; can someone enlighten me?) However, widely used open source code &#8212; <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function in dtoa.c</a> &#8212; gets these conversions right. Should Visual C++ do the same?</p>
<p> (I would imagine this &lsquo;problem&rsquo; also affects the rest of Visual Studio, but I haven&#8217;t investigated.)</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Decimal to Floating-Point Needs Arbitrary Precision</title>
		<link>http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/</link>
		<comments>http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/#comments</comments>
		<pubDate>Thu, 20 May 2010 18:42:28 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=282</guid>
		<description><![CDATA[In my article &#8220;Quick and Dirty Decimal to Floating-Point Conversion&#8221; I presented a small C program that converts a decimal string to a double-precision binary floating-point number. The number it produces, however, is not necessarily the closest &#8212; or so-called correctly rounded &#8212; double-precision binary floating-point number. This is not a failing of the algorithm; [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">Decimal to Floating-Point Needs Arbitrary Precision</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my article &ldquo;<a title="Read Rick Regan's article &ldquo;Quick and Dirty Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">Quick and Dirty Decimal to Floating-Point Conversion</a>&rdquo; I  presented a small C program that converts a decimal string to a double-precision binary floating-point number. The number it produces, however, is not necessarily the closest &#8212; or so-called <em>correctly rounded</em> &#8212; double-precision binary floating-point number. This is not a failing of the algorithm; mathematically speaking, the algorithm is correct. The flaw comes in its implementation in limited precision binary floating-point arithmetic. </p>
<p>The quick and dirty program is implemented in native C, so it&#8217;s limited to double-precision floating-point arithmetic (although on some systems, extended precision may be used). Higher precision arithmetic &#8212; in fact, arbitrary precision arithmetic &#8212; is needed to ensure that all decimal inputs are converted correctly. I will demonstrate the need for high precision by analyzing three examples, all taken from Vern Paxson&#8217;s paper &ldquo;<a title="Read Vern Paxson's paper &ldquo;A Program for Testing IEEE Decimal–Binary Conversion&rdquo;" href="http://www.icir.org/vern/papers/testbase-report.pdf">A Program for Testing IEEE Decimal–Binary Conversion</a>&rdquo;.</p>
<p><span id="more-282"></span></p>
<h2><a name="correct-rounding"></a>Correct Rounding</h2>
<p>A <a title="Read David M. Gay's paper &ldquo;Correctly Rounded Binary-Decimal and Decimal-Binary Conversions&rdquo;" href="http://www.ampl.com/REFS/rounding.pdf">correctly rounded conversion </a> from a decimal string to a double-precision floating-point number is the 53 significant bit value closest to the decimal input &#8212; and in the event of a tie, the value with significant bit number 53 equal to 0. (This is the prevailing definition, being based on the default IEEE 754 rounding mode. But to be truly correctly rounded &#8212; theoretically at least &#8212; <a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">all four IEEE 754 rounding modes must be honored</a>.) The closest 53-bit number is obtained by rounding what would be the full-precision binary number to 53 bits. This can be done correctly only if the correct value of bit 54 &#8212; <em>and sometimes the correct values of an arbitrary number of bits beyond</em> &#8212; is known. Here are the rules:</p>
<ul>
<li>If bit 54 is 0, round down</li>
<li>If bit 54 is 1
<ul>
<li>If any bit beyond bit 54 is 1, round up</li>
<li>If all bits beyond bit 54 are 0 (meaning the number is halfway between two floating-point numbers)
<ul>
<li>If bit 53 is 0, round down</li>
<li>If bit 53 is 1, round up</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>The rules seem simple enough, but there&#8217;s a problem: the calculation needs to be precise enough so that the true value of bit 54 and the appropriate number of bits beyond is known. Decimal numbers that are close to halfway between two floating-point numbers require the most precision to get right; the three examples that follow are such numbers.</p>
<h2>Example 1: 7.8459735791271921 x 10<sup>65</sup></h2>
<p>7.8459735791271921 x 10<sup>65</sup> is shorthand for this number:</p>
<p>784597357912719210000000000000000000000000000000000000000000000000</p>
<p>In binary, it equals</p>
<p>11101110011010000000010001001110000010011000101001110<span class="highlight_gray4">0</span>11111111111111<br />
11111111111111111111111111111111111111111111111111<span class="highlight_gray4">0</span>00111100101001110<br />
11111110111110111000110111111101010000000000000000000000000000000000<br />
000000000000000</p>
<p>Rounded correctly to 53 bits it equals</p>
<p>11101110011010000000010001001110000010011000101001110000000000000000<br />
00000000000000000000000000000000000000000000000000000000000000000000<br />
00000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000</p>
<p>which in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> equals</p>
<p>1.110111001101000000001000100111000001001100010100111 x 2<sup>218</sup></p>
<h3>Why This Conversion is Difficult in Floating-Point</h3>
<p>The full-precision binary value above was obtained using my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">decimal to binary converter</a>. The converter uses <a title="Read Rick Regan's Article &ldquo;Base Conversion in PHP Using BCMath&rdquo;" href="http://www.exploringbinary.com/base-conversion-in-php-using-bcmath/">arbitrary precision decimal arithmetic</a>, so every bit in the result is correct. Since bit 54 (the first highlighted bit above) is 0, it is rounded down to get the correctly rounded 53-bit result.</p>
<p>When the conversion is done in floating-point, things get a bit murky. With insufficient precision, it&#8217;s impossible to tell whether bit 54 is really 0, since it&#8217;s a tiny fraction away from being 1. The conversion algorithm cannot make the distinction unless the calculation is known to be precise through the 119th bit (the second highlighted bit above).</p>
<h3>Converting with Varying Floating-Point Precision</h3>
<p>To demonstrate the effect of floating-point precision, I modified my <a title="Read Rick Regan's article &ldquo;Quick and Dirty Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">quick and dirty conversion program</a> to use <a href="http://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/" title="Read Rick Regan's Article &ldquo;How to Install and Run GMP on Windows Using MPIR&rdquo;">GMP floating-point arithmetic</a>. I increased the precision of the GMP conversions incrementally until I reached the intermediate result with sufficient precision; for each increment, I printed the result in binary and rounded correctly by hand.</p>
<p>I converted 7.8459735791271921 x 10<sup>65</sup> to double precision using the quick and dirty program  (53-bit precision) and GMP with 64, 96, and 128 bits of precision (on my system, GMP changes precision internally at 32-bit granularity, plus a few extra bits); here are the results, expressed in binary scientific notation:</p>
<ul>
<li>Quick and Dirty (53-bit precision):
<p>1.110111001101000000001000100111000001001100010101 x 2<sup>218</sup>
</p>
</li>
<li>GMP, with 64 and 96 bit precision:
<p>1.110111001101000000001000100111000001001100010100111 x 2<sup>218</sup>
</p>
</li>
<li>GMP, with 128-bit precision:
<p>1.110111001101000000001000100111000001001100010100111 x 2<sup>218</sup>
</p>
</li>
</ul>
<p>The quick and dirty result is <em>incorrect</em>, and is off by two units in the last place (ULPs). All three GMP values are correct, but the 64 and 96 bit values are correct by chance; before rounding, the intermediate result (not shown) has bit 53 equal to 0 and bit 54 equal to 1, with all bits beyond equal to 0. That makes it look like a halfway case, so it rounds down to the correct value &#8212; by sheer luck. The 128-bit GMP calculation is precise enough that the 119th bit is known to be 0, meaning bit 54 is 0 and that the result should be rounded down.</p>
<h2>Example 2: 3.571 x 10<sup>266</sup></h2>
<p>3.571 x 10<sup>266</sup> is an 886 bit binary number, so I&#8217;ll only show the relevant portion &#8212; its first 77 bits:</p>
<p>10110001001100100010011000110000111010100000110101001<span class="highlight_gray4">1</span>00000000000000<br />
00000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded correctly to 53 bits it equals</p>
<p>1.011000100110010001001100011000011101010000011010101 x 2<sup>885</sup></p>
<p>Here are the quick and dirty and GMP conversions:</p>
<ul>
<li>Quick and Dirty (53-bit precision):
<p>1.011000100110010001001100011000011101010000011011001 x 2<sup>885</sup>
</p>
</li>
<li>GMP, with 64-bit precision:
<p>1.0110001001100100010011000110000111010100000110101001 x 2<sup>885</sup>
</p>
</li>
<li>GMP, with 96-bit precision:
<p>1.011000100110010001001100011000011101010000011010101 x 2<sup>885</sup>
</p>
</li>
</ul>
<p>The quick and dirty and 64-bit GMP results are both <em>incorrect</em>; the quick and dirty result is off by eight ULPs, and the 64-bit GMP result is off by one ULP. The 64-bit conversion goes wrong because GMP computes this intermediate value (it&#8217;s 66 bits, since GMP can carry extra bits beyond the expected precision):</p>
<p>10110001001100100010011000110000111010100000110101001<span class="highlight_gray4">0</span>11111111111<span class="highlight_gray4">0</span></p>
<p>At 64-bit precision, bit 54 appears to be 0, and hence the result is rounded down.</p>
<p>As expected, the 96-bit GMP result is correct, since only 77 bits of precision are needed to decide the fate of bit 54.</p>
<h2>Example 3: 3.08984926168550152811 x 10<sup>-32</sup></h2>
<p>3.08984926168550152811 x 10<sup>-32</sup> is a decimal fraction which, when written in binary, is non-terminating; for our purposes, we only need to see its first 234 places (104 leading 0s, and significant bits 1-53 and 54-130):</p>
<p>0.000000000000000000000000000000000000000000000000000000000000000000<br />
00000000000000000000000000000000000000101000000110111100100100001100<br />
11101100110010100111010<span class="highlight_gray4">1</span>00000000000000000000000000000000000000000000<br />
0000000000000000000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded correctly to 53 bits it equals</p>
<p>1.0100000011011110010010000110011101100110010100111011 x 2<sup>-105</sup></p>
<p>Here are the quick and dirty and GMP conversions:</p>
<ul>
<li>Quick and Dirty (53-bit precision):
<p>1.010000001101111001001000011001110110011001010011101 x 2<sup>-105</sup>
</p>
</li>
<li>GMP, with 64 and 96 bit precision:
<p>1.010000001101111001001000011001110110011001010011101 x 2<sup>-105</sup>
</p>
</li>
<li>GMP, with 128-bit precision:
<p>1.0100000011011110010010000110011101100110010100111011 x 2<sup>-105</sup>
</p>
</li>
<li>GMP, with 160-bit precision:
<p>1.0100000011011110010010000110011101100110010100111011 x 2<sup>-105</sup>
</p>
</li>
</ul>
<p>The quick and dirty, 64-bit GMP, and 96-bit GMP results are all <em>incorrect</em>; they are all the same, and are off by 1 ULP. (The Visual C++ compiler gets this conversion wrong too, getting the same wrong answer!)</p>
<p>The 128-bit GMP result is correct by accident. The intermediate value used to compute it (not shown) has a 0 at bit 53, a 1 at bit 54, and then a string of 74 0s followed by a 1 (yes, 129 bits). The full-precision result has <em>75</em> 0s followed by a 1. The difference does not matter though, since both bit patterns indicate that the result should be rounded up.</p>
<p>The 160-bit GMP result is correct, which is expected since 130 bits of the full-precision binary result are needed to guarantee the right rounding decision is made.</p>
<h2>How Much Precision is Enough?</h2>
<p>In the examples above, I knew how much precision was needed for correct rounding because I already knew the full-precision binary equivalents of the decimal inputs &#8212; by external means. But how would a floating-point conversion routine know how much precision is enough? Naively, it would have to assume the worst case, computing with enough precision to match the input (for integer values, this would mean computing the full-precision binary equivalent). That would be overkill, since most of the time much less precision is needed to make a correct rounding decision.</p>
<p><a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function in dtoa.c</a> &#8212; the de facto standard conversion routine &#8212; avoids this problem by avoiding high precision floating-point. It uses standard double-precision floating-point for most conversions, and arbitrary precision integer arithmetic for &ldquo;tough&rdquo; conversions; it has the smarts to know when to use one or the other. (I will discuss that code in a future article.)</p>
<h2>Acknowledgement</h2>
<p>Thanks to <a title="Vern Paxson's page at Berkeley" href="http://www.eecs.berkeley.edu/Faculty/Homepages/paxson.html">Vern Paxson</a> for helping me understand the &ldquo;very close to but less than 1/2 ULP case.&rdquo;</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">Decimal to Floating-Point Needs Arbitrary Precision</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Quick and Dirty Decimal to Floating-Point Conversion</title>
		<link>http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/</link>
		<comments>http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/#comments</comments>
		<pubDate>Fri, 30 Apr 2010 15:21:20 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=281</guid>
		<description><![CDATA[This little C program converts a decimal value &#8212; represented as a string &#8212; into a double-precision floating-point number: #include &#60;string.h&#62; int main (void) { double intPart = 0, fracPart = 0, conversion; unsigned int i; char decimal[] = &#34;3.14159&#34;; i = 0; /* Left to right */ while (decimal[i] != '.') { intPart = [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">Quick and Dirty Decimal to Floating-Point Conversion</a></p>
]]></description>
			<content:encoded><![CDATA[<p>This little C program converts a decimal value &#8212; represented as a string &#8212; into a double-precision floating-point number: </p>
<pre>
#include &lt;string.h&gt;

int main (void)
{
 double intPart = 0, fracPart = 0, conversion;
 unsigned int i;
 char decimal[] = &quot;3.14159&quot;;

 i = 0; /* Left to right */
 while (decimal[i] != '.') {
    <strong>intPart = intPart*10 + (decimal[i] - '0');</strong>
    i++;
   }

 i = strlen(decimal)-1; /* Right to left */
 while (decimal[i] != '.') {
    <strong>fracPart = (fracPart + (decimal[i] - '0'))/10;</strong>
    i--;
   }

 conversion = intPart + fracPart;
}
</pre>
<p>The conversion is done using the elegant Horner&#8217;s method, summing each digit according to its decimal place value. So why do I call it &ldquo;quick and <em>dirty</em>?&rdquo; Because the binary floating-point value it produces is not necessarily the closest approximation to the input decimal value &#8212; the so-called <em>correctly rounded</em> result. (Remember that most real numbers cannot be represented exactly in floating-point.) Most of the time it <em>will</em> produce the correctly rounded result, but sometimes it won&#8217;t &#8212; the result will be off in its least significant bit(s). There&#8217;s just not enough precision in floating-point to guarantee the result is correct every time.</p>
<p>I will demonstrate this algorithm with different input values, some of which convert correctly, and some of which don&#8217;t. In the end, you&#8217;ll appreciate one reason why library functions like strtod() exist &#8212; to perform efficient, correctly rounded conversion.</p>
<p><span id="more-281"></span></p>
<h2>Testing the Conversions</h2>
<p>How do you test whether a quick and dirty conversion is correct? First you need to know what the correctly rounded value is, and then you need to compare it to the quick and dirty value.</p>
<p>Here are two ways to find the correctly rounded value:</p>
<ul>
<li>Assign the decimal constant to a double and let your compiler compute it for you.</li>
<li>Enter the decimal constant in my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">arbitrary-precision decimal/binary converter</a> and then round it by hand to 53 significant bits (the number of significant bits in a double).</li>
</ul>
<p>I did both, because I know that sometimes even compilers do the conversions incorrectly (see my articles on <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Visual C++</a> and <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in GCC and GLIBC&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">gcc/glibc</a>). In any case, I verified that all the compiler conversions in the examples below are correct.</p>
<p>Next we need to compare the quick and dirty value to the correctly rounded value &#8212; but how? <a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/" title="Read Rick Regan's Article &ldquo;When Doubles Don’t Behave Like Doubles&rdquo;">Comparing floating-point values for equality is not reliable</a>. That means we should display the values and compare them manually. But we can&#8217;t depend on printf(), because in many implementations it won&#8217;t print all the significant digits, either <a title="Read Rick Regan's article &ldquo;Print Precision of Floating-Point Integers Varies Too&rdquo;" href="http://www.exploringbinary.com/print-precision-of-floating-point-integers-varies-too/">before the decimal point</a> or <a title="Read Rick Regan's article &ldquo;Print Precision of Dyadic Fractions Varies by Language&rdquo;" href="http://www.exploringbinary.com/print-precision-of-dyadic-fractions-varies-by-language/">after the decimal point</a> (this is true of Visual C++, the compiler I am using). This leaves two options:</p>
<ul>
<li>Print the <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">raw IEEE representation</a> of the double.</li>
<li>Print the value of the double in binary, using my function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin()</a>.</li>
</ul>
<p>I chose to use fp2bin(), because seeing the binary representation is more enlightening.</p>
<h3>The Test Code</h3>
<p>I modified the code in the introduction of this article to allow for testing, as described; here it is</p>
<pre>
#include &lt;string.h&gt;
#include &lt;stdio.h&gt;
#include &quot;fp2bin.h&quot;

int main (void)
{
 char binString[FP2BIN_STRING_MAX];
 double intPart = 0, fracPart = 0, conversion;
 unsigned int i;
 char decimal[] = &quot;<strong>3.14159</strong>&quot;; /* Converted manually */
 double compiler_conversion = <strong>3.14159</strong>; /* Converted by compiler */

 i = 0; /* Left to right */
 while (decimal[i] != '.')
   {
    intPart = intPart*10 + (decimal[i] - '0');
    i++;
   }

 i = strlen(decimal)-1; /* Right to left */
 while (decimal[i] != '.')
   {
    fracPart = (fracPart + (decimal[i] - '0'))/10;
    i--;
   }

 conversion = intPart + fracPart;

 /* Print the converted values, in binary */
 fp2bin(compiler_conversion,binString);
 printf(&quot;Correct = %s\n&quot;,binString);
 fp2bin(conversion,binString);
 printf(&quot;Q and D = %s\n&quot;,binString);
}
</pre>
<p>When run (I ran on an Intel Core Duo processor), the output shows that the compiler and quick and dirty conversions of <strong>3.14159</strong> are both correct:</p>
<pre>
Correct = 11.00100100001111110011111000000011011100001100110111
Q and D = 11.00100100001111110011111000000011011100001100110111
</pre>
<p>Their value in decimal, obtained from my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">binary/decimal converter</a>, is<br />
3.14158999999999988261834005243144929409027099609375.</p>
<h2>Incorrect Conversions</h2>
<p>Incorrect conversions are due to the limitations of floating-point arithmetic; specifically, these three things affect the conversion calculations:</p>
<ul>
<li><strong>Large integers require more bits than can fit in a floating-point variable</strong>.
<p>Multiplying a binary integer by ten is exact, but once the integer exceeds the largest integer that a double can represent exactly, rounding errors can occur.</p>
</li>
<li><strong>Division by ten in binary arithmetic is inexact</strong>.
<p>Each time the fractional value is divided by ten, there is a chance of roundoff error.</p>
</li>
<li><strong>Addition of integers and fractions can incur rounding error</strong>.
<p>When an integer and fraction are added, some of the bits of the fraction can be lost.</p>
</li>
</ul>
<p>The example values below &#8212; which I discovered by both random testing and trial and error &#8212; are all rounded incorrectly by my quick and dirty algorithm.</p>
<h3>The Integer 18014398509481993</h3>
<p>The number 18,014,398,509,481,993 (2<sup>54</sup> + 9) is not converted correctly by the quick and dirty algorithm; this is the output of the program (the lines I changed are <em>char decimal[] = &#8220;18014398509481993.0&#8243;;</em> and <em>double compiler_conversion = 18014398509481993.0;</em>): </p>
<pre>
Correct = 1000000000000000000000000000000000000000000000000001000.0
Q and D = 1000000000000000000000000000000000000000000000000001<span class="highlight_gray4">100</span>.0
</pre>
<p>Their values, in decimal, are</p>
<ul>
<li>Correct = 18014398509481992 = 2<sup>54</sup> + <span class="highlight_gray4">8</span></li>
<li>Q and D = 18014398509481996 = 2<sup>54</sup> + <span class="highlight_gray4">12</span></li>
</ul>
<p>These values differ by 4, which is one <a title="Wikipedia article &ldquo;Unit in the last place&rdquo;" href="http://en.wikipedia.org/wiki/Unit_in_the_last_place">unit in the last place</a> (the last place for a double is the 53rd significant bit).</p>
<h3>The Fraction 0.9199</h3>
<p>The number 0.9199 is not converted correctly by the quick and dirty algorithm; this is the output of the program (the lines I changed are <em>char decimal[] = &#8220;0.9199&#8243;;</em> and  <em>double compiler_conversion = 0.9199;</em>): </p>
<pre>
Correct = 0.11101011011111101001000011111111100101110010010001111
Q and D = 0.1110101101111110100100001111111110010111001001000111<span class="highlight_gray4"> </span>
</pre>
<p>Their values, in decimal, are</p>
<ul>
<li>Correct = 0.91990000000000005098144129078718833625316619873046875
<p> = 828572259443623<span class="highlight_gray4">9</span>/2<sup>53</sup></p>
</li>
<li>Q and D = 0.9198999999999999399591388282715342938899993896484375
<p> = 4142861297218119/2<sup>52</sup> = 828572259443623<span class="highlight_gray4">8</span>/2<sup>53</sup></p>
</li>
</ul>
<p>These values differ by 2<sup>-53</sup>, which is one unit in the last place.</p>
<h3>The Combined Integer and Fraction 1.89</h3>
<p>The number 1.89 is not converted correctly by the quick and dirty algorithm; this is the output of the program (the lines I changed are <em>char decimal[] = &#8220;1.89&#8243;;</em> and <em>double compiler_conversion = 1.89;</em>): </p>
<pre>
Correct = 1.1110001111010111000010100011110101110000101000111101
Q and D = 1.11100011110101110000101000111101011100001010001111<span class="highlight_gray4">1 </span>
</pre>
<p>Their values, in decimal, are</p>
<ul>
<li>Correct = 1.8899999999999999023003738329862244427204132080078125
<p> = 1 + 400820366835974<span class="highlight_gray4">1</span>/2<sup>52</sup></p>
</li>
<li>Q and D = 1.890000000000000124344978758017532527446746826171875
<p> = 1 + 2004101834179871/2<sup>51</sup> = 1 + 400820366835974<span class="highlight_gray4">2</span>/2<sup>52</sup></p>
</li>
</ul>
<p>These values differ by 2<sup>-52</sup>, which is <em>two</em> units in the last place.</p>
<p>The interesting thing is that 0.89 by itself converts correctly; it&#8217;s only when it&#8217;s added to 1 that the final result becomes incorrect.</p>
<h2>Discussion</h2>
<p>In practice, you&#8217;ll depend on your compiler or interpreter or a library call for correctly rounded conversion. The de facto standard code used by many implementations is <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function in dtoa.c</a>. Take a look at it to get a sense of how involved  correctly rounded conversion is. (dtoa.c handles scientific notation, and also includes code for correct conversion in the other direction &#8212; floating-point to decimal).</p>
<p>For a little more detailed discussion of Horner&#8217;s method, see my PHP functions <a title="Read Rick Regan's Article &ldquo;Base Conversion in PHP Using BCMath&rdquo;" href="http://www.exploringbinary.com/base-conversion-in-php-using-bcmath/">bin2dec_i() and bin2dec_f()</a>. Those two functions implement the same quick and dirty algorithm, only using decimal arithmetic and powers of two (there&#8217;s nothing &ldquo;dirty&rdquo; about them though &#8212; they&#8217;re binary to decimal and arbitrary precision).</p>
<h3>Conversion to Single-Precision</h3>
<p>Conversion to floats works similarly, also producing correctly and incorrectly rounded results. There&#8217;s one thing I observed, which you might find surprising: a given decimal value may convert incorrectly to a double but <em>correctly</em> to a float!</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">Quick and Dirty Decimal to Floating-Point Conversion</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>When Doubles Don’t Behave Like Doubles</title>
		<link>http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/</link>
		<comments>http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/#comments</comments>
		<pubDate>Fri, 09 Apr 2010 17:15:35 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=276</guid>
		<description><![CDATA[In my article &#8220;When Floats Don’t Behave Like Floats&#8221; I explained how calculations involving single-precision floating-point variables may be done, under the covers, in double or extended precision. This leads to anomalies in expected results, which I demonstrated with two C programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel Core [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/">When Doubles Don’t Behave Like Doubles</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my article &ldquo;<a href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/" title="Read Rick Regan's Article &ldquo;When Floats Don’t Behave Like Floats&rdquo;">When Floats Don’t Behave Like Floats</a>&rdquo; I explained how calculations involving single-precision floating-point variables may be done, under the covers, in double or extended precision. This leads to anomalies in expected results, which I demonstrated with two C programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor.</p>
<p>In this article, I&#8217;ll do a similar analysis for <em>double</em>-precision floating-point variables, showing how similar anomalies arise when extended precision calculations are done. I modified my two example programs to use doubles instead of floats. Interestingly, the doubles version of program 2 does not exhibit the anomaly. I&#8217;ll explain.</p>
<p><span id="more-276"></span></p>
<h2>The Environment in Which the Anomalies Occur</h2>
<p>To see the described anomalies, floating-point calculations must be done in precision greater than double-precision. For Intel processors, this means that x87 FPU instructions must be used (as opposed to <a title="Wikipedia article defining SSE" href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">SSE instructions</a>), and the processor must be calculating in <strong>extended-precision</strong> mode. </p>
<p>Your compiler determines which instructions to generate, although you may influence it with compiler options. On my system, Visual C++ generates x87 FPU instructions by default. As for the precision mode, it has a default &#8212; which is double-precision on my system &#8212; that can be changed by a call to the <em>controlfp</em> function. I call this function in the programs below.</p>
<h2>Program 1</h2>
<pre>
#include &quot;stdio.h&quot;
#include &quot;float.h&quot;
int main (void)
{
 double d1 = 0.3333333333333333, d2 = 3.0, d3;
 unsigned int cur_cw;

 _controlfp_s(&#038;cur_cw,_PC_64,MCW_PC); 

 d3 = d1 * d2;
 if (d3 != d1 * d2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Program 1 prints &ldquo;<strong>Not equal</strong>&rdquo;, for the same reason its float counterpart does: greater precision is being used under the covers.</p>
<h3>Computing d3</h3>
<p>Using my function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin</a>, I printed the binary values of the three floating point variables in the computation:</p>
<ul>
<li>d1 = 0.010101010101010101010101010101010101010101010101010101</li>
<li>d2 = 11</li>
<li>d3 = 1</li>
</ul>
<p>The value of d3 is only an approximation to d1 * d2. Here&#8217;s the true value of d1 * d2, computed by hand:</p>
<pre>
  0.010101010101010101010101010101010101010101010101010101
  x                                                     11
  --------------------------------------------------------
     10101010101010101010101010101010101010101010101010101
    10101010101010101010101010101010101010101010101010101
  --------------------------------------------------------
  0.11111111111111111111111111111111111111111111111111111<strong>1</strong>
</pre>
<p>The product is 54 bits, which can be represented exactly in extended precision, but only approximately in double-precision. Since bit 54 is a 1, the value is rounded up, to 53 bits. The result is d3 = 1.</p>
<p>Here&#8217;s the relevant assembler code generated by the compiler:</p>
<pre>
 <strong>d3 = d1 * d2;</strong>
0041131F  fld         qword ptr [d1]
00411322  fmul        qword ptr [d2]
00411325  fstp        qword ptr [d3]
</pre>
<h3>Comparing d3 and d1 * d2</h3>
<p>The assembler code shows what is happening:</p>
<pre>
 <strong>if (d3 != d1 * d2)</strong>
00411328  fld         qword ptr [d1]
0041132B  fmul        qword ptr [d2]
0041132E  fcomp       qword ptr [d3]
</pre>
<p>d1 * d2 is computed in extended precision and left on the stack. It is then compared to d3, which still has 53 bits of precision even though it’s “promoted” to extended precision. The two compared values will differ.</p>
<h2>Program 2</h2>
<pre>
#include &quot;stdio.h&quot;
#include &quot;float.h&quot;
int main (void)
{
 double d1 = 0.3333333333333333, d2 = 3.0, d3;
 long long i1, i2;
 unsigned int cur_cw;

 _controlfp_s(&#038;cur_cw,_PC_64,MCW_PC); 

 d3 = d1 * d2;
 i1 = (int)d3;
 i2 = (int)(d1 * d2);
 if (i1 != i2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Program 2 does not print anything, which is surprising knowing that its float counterpart does. According to our analysis of program 1 &#8212; program 2 uses the same values for d1 and d2 &#8212; d1 * d2 rounds to 1 in double precision, but is slightly less than 1 in extended precision. Therefore, we would expect i1 = 1 and i2 = 0. Instead, both are 1.</p>
<p>The explanation is simple: the extended precision calculation of d1 * d2 is &ldquo;spilled&rdquo; into a double before the conversion to integer, so it will be equal to d3:</p>
<h3>Computing i2</h3>
<pre>
 <strong>i2 = (int)(d1 * d2);</strong>
004114A5  fld         qword ptr [d1]
004114A8  fmul        qword ptr [d2]
004114AB  call        @ILT+245(__ftol2_sse) (4110FAh)
           ||
           ||
           \/
--- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm
00411B40  cmp         dword ptr [___sse2_available (417598h)],0
00411B47  je          _ftol2 (411B76h)
00411B49  push        ebp
00411B4A  mov         ebp,esp
00411B4C  sub         esp,8
00411B4F  and         esp,0FFFFFFF8h
00411B52  <span class="highlight_gray4">fstp        qword ptr [esp]</span>
00411B55  cvttsd2si   eax,mmword ptr [esp]
00411B5A  leave
00411B5B  ret
</pre>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/">When Doubles Don’t Behave Like Doubles</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>When Floats Don&#8217;t Behave Like Floats</title>
		<link>http://www.exploringbinary.com/when-floats-dont-behave-like-floats/</link>
		<comments>http://www.exploringbinary.com/when-floats-dont-behave-like-floats/#comments</comments>
		<pubDate>Tue, 30 Mar 2010 15:07:21 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=275</guid>
		<description><![CDATA[These two programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor &#8212; demonstrate an anomaly that occurs when using single-precision floating point variables: Program 1 #include &#34;stdio.h&#34; int main (void) { float f1 = 0.1f, f2 = 3.0f, f3; f3 = f1 * f2; if (f3 != f1 [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/">When Floats Don&#8217;t Behave Like Floats</a></p>
]]></description>
			<content:encoded><![CDATA[<p>These two programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel  Core Duo processor &#8212; demonstrate an anomaly that occurs when using single-precision floating point variables:</p>
<p><strong>Program 1</strong></p>
<pre>
#include &quot;stdio.h&quot;
int main (void)
{
 float f1 = 0.1f, f2 = 3.0f, f3;

 f3 = f1 * f2;
 if (f3 != f1 * f2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Prints &ldquo;<strong>Not equal</strong>&rdquo;.</p>
<p><strong>Program 2</strong></p>
<pre>
#include &quot;stdio.h&quot;
int main (void)
{
 float f1 = 0.7f, f2 = 10.0f, f3;
 int i1, i2;

 f3 = f1 * f2;
 i1 = (int)f3;
 i2 = (int)(f1 * f2);
 if (i1 != i2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Prints &ldquo;<strong>Not equal</strong>&rdquo;.</p>
<p>In each case, f3 and f1 * f2 differ. But why? I&#8217;ll explain what&#8217;s going on.</p>
<p><span id="more-275"></span>(This article was inspired by <a title="Stack Overflow Question &ldquo;Why is my number being rounded incorrectly?&rdquo;" href="http://stackoverflow.com/questions/2509576/why-is-my-number-being-rounded-incorrectly">this question</a> and several related questions on stackoverflow.com.)</p>
<h2>Analyzing Program 1</h2>
<p>Program 1 compares two floating point values for equality &#8212; something conventional wisdom says not to do. But why should that rule apply in this case? We&#8217;re just checking that a variable holds the value we just gave it. How could it not? </p>
<p>The root of the &ldquo;problem&rdquo; is this: the compiler generates instructions that do floating-point calculations in <a title="Wikipedia article defining extended precision" href="http://en.wikipedia.org/wiki/Extended_precision">extended precision</a>. Whereas floats are 4 bytes long and have 24 significant bits of precision, extended precision values &#8212; which are stored in registers on the floating point stack &#8212; are 10 bytes long and have 64 bits of precision. </p>
<h3>Computing f3</h3>
<p>Using my function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin</a>, I printed the binary values of the three floating point variables in the computation:</p>
<ul>
<li>f1 = 0.000110011001100110011001101</li>
<li>f2 = 11</li>
<li>f3 = 0.010011001100110011001101</li>
</ul>
<p>The value of f3 is only an approximation to f1 * f2. To understand why, let&#8217;s calculate the true value of f1 * f2; that is, let&#8217;s calculate it by hand, using binary multiplication:</p>
<pre>
  0.000110011001100110011001101
  x                          11
  -----------------------------
       110011001100110011001101
      110011001100110011001101
-------------------------------
  0.010011001100110011001100111
</pre>
<p>So f1 * f2 = 0.010011001100110011001100111. It fits comfortably within extended precision. But to assign it to f3 &#8212; a float &#8212; it must be rounded. f1 * f2 has 26 significant bits, but a float holds only 24. Rounding it to the nearest 24 bit value makes it 0.010011001100110011001101.</p>
<p>The assembler code generated by the compiler confirms what we&#8217;re seeing:</p>
<pre>
 <strong>f3 = f1 * f2;</strong>
0041130B  fld         dword ptr [f1]
0041130E  fmul        dword ptr [f2]
00411311  <span class="highlight_gray4">fstp        dword ptr [f3]</span>
</pre>
<p>f1 * f2 is computed in extended precision &#8212; enough bits to hold its true value &#8212; but that extra precision is lost when stored in f3.</p>
<h3>Comparing f3 and f1 * f2</h3>
<p>The answer is in the assembler code, so let&#8217;s get right to it:</p>
<pre>
 <strong>if (f3 != f1 * f2)</strong>
00411314  fld         dword ptr [f1]
00411317  fmul        dword ptr [f2]
0041131A  fld         dword ptr [f3]
0041131D  <span class="highlight_gray4">fucompp</span>
</pre>
<p>Again, f1 * f2 is computed in extended precision, but this time <strong>its true value is retained</strong> &#8212; it is left on the stack. f3 is then loaded onto the stack (of course it still has only 24 bits of precision, even though it&#8217;s been &ldquo;promoted&rdquo; to extended precision). The two values on top of the stack are then compared. Clearly, they differ.</p>
<h2>Analyzing Program 2</h2>
<p>Program 2 &ldquo;fails&rdquo; for the same reason as program 1, except that the &ldquo;error&rdquo; is magnified by the conversion of f3 and f1 * f2 to integers: the integer part of f3 is <strong>7</strong>, and  the integer part of f1 * f2 is <strong>6</strong>.</p>
<h3>Computing i1</h3>
<p>The binary values of the three floating point variables in the computation are:</p>
<ul>
<li>f1 = 0.101100110011001100110011</li>
<li>f2 = 1010</li>
<li>f3 = 111</li>
</ul>
<p>The value of f3 is an integer. To understand why, let&#8217;s calculate the true value of f1 * f2:</p>
<pre>
  0.101100110011001100110011
  x                     1010
  --------------------------
                           0
   101100110011001100110011
                         0
 101100110011001100110011
----------------------------
110.11111111111111111111111
</pre>
<p>So f1 * f2 = 110.11111111111111111111111. To assign it to f3 it must be rounded. f1 * f2 has 26 significant bits, and rounding it to the nearest 24 bit value makes it 111, or 7 decimal. Clearly, this means i1 will be 7 as well.</p>
<p>Here&#8217;s the assembler code:</p>
<pre>
 <strong>f3 = f1 * f2;</strong>
0041130B  fld         dword ptr [f1]
0041130E  fmul        dword ptr [f2]
00411311  fstp        dword ptr [f3]
 <strong>i1 = (int)f3;</strong>
00411314  fld         dword ptr [f3]
00411317  call        @ILT+155(__ftol2_sse) (4110A0h)
           ||
           ||
           \/
--- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm
00411600  cmp         dword ptr [___sse2_available (416554h)],0
00411607  je          _ftol2 (411636h)
00411609  push        ebp
0041160A  mov         ebp,esp
0041160C  sub         esp,8
0041160F  and         esp,0FFFFFFF8h
00411612  fstp        qword ptr [esp]
00411615  cvttsd2si   eax,mmword ptr [esp]
0041161A  leave
0041161B  ret
</pre>
<h3>Computing i2</h3>
<p>Here&#8217;s the assembler code for computing i2:</p>
<pre>
 <strong>i2 = (int)(f1 * f2);</strong>
0041131F  fld         dword ptr [f1]
00411322  fmul        dword ptr [f2]
00411325  call        @ILT+155(__ftol2_sse) (4110A0h)
           ||
           ||
           \/
--- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm
00411600  cmp         dword ptr [___sse2_available (416554h)],0
00411607  je          _ftol2 (411636h)
00411609  push        ebp
0041160A  mov         ebp,esp
0041160C  sub         esp,8
0041160F  and         esp,0FFFFFFF8h
00411612  <span class="highlight_gray4">fstp        qword ptr [esp]</span>
00411615  <span class="highlight_gray4">cvttsd2si   eax,mmword ptr [esp]</span>
0041161A  leave
0041161B  ret
</pre>
<p>The interesting thing in the calculation of i2 is that f1 * f2 is stored in a <em>double</em> before being cast to an integer (see the highlighted fstp instruction). f1 * f2 in double precision is 110.11111111111111111111111; that is, its true 26 bit value. Casting this to an integer results in 110, or decimal 6. If the compiler had stored f1 * f2 in a float and used the cvtts<strong>s</strong>2si instruction instead, the answer would have been 7 &#8212; just like in the first calculation.</p>
<h3>Discussion</h3>
<p>In program 2, it would appear that single-precision is more accurate than extended precision. After all, forcing the value into a float gives the expected answer. This is just a happy coincidence. Two losses of precision &#8212; the conversion of 0.7 to floating-point and the rounding up of the product &#8212; have effectively canceled each other out.</p>
<h2>Behavior Depends on the Processor and How It Is Used</h2>
<p>This anomaly may not occur on your machine. It depends on your processor, and in particular, the instructions used and the mode that it&#8217;s in. My example programs were compiled into Intel x87 FPU instructions, and ran with the x87 precision control field set to 53-bits. (I said above that my programs were using extended precision; technically they weren&#8217;t, but double precision is sufficient to make them &ldquo;fail&rdquo;. The true values of f1 * f2 are less than 53 bits.)</p>
<p>Intel processors can also do floating point using SSE (Streaming SIMD Extensions) instructions. I recompiled my programs using the Visual C++ compiler options /arch:SSE (single-precision floating-point) and /arch:SSE2  (double-precision floating-point). For the SSE option, there was no change; the compiler, at its discretion, still decided to generate x87 instructions. (See the <a title="A comment on this article" href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/#comment-4316">comment below</a> about the Mac&#8217;s use of SSE instructions making the anomaly disappear.)</p>
<p>Recompiling with /arch:SSE2 I got different assembler code, but the same output; here&#8217;s the assembler code for program 1:</p>
<pre>
<strong>f3 = f1 * f2;</strong>
00411313  cvtss2sd    xmm0,dword ptr [f1]
00411318  cvtss2sd    xmm1,dword ptr [f2]
0041131D  mulsd       xmm0,xmm1
00411321  cvtsd2ss    xmm0,xmm0
00411325  movss       dword ptr [f3],xmm0
<strong>if (f3 != f1 * f2)</strong>
0041132A  cvtss2sd    xmm0,dword ptr [f1]
0041132F  cvtss2sd    xmm1,dword ptr [f2]
00411334  mulsd       xmm0,xmm1
00411338  cvtss2sd    xmm1,dword ptr [f3]
0041133D  ucomisd     xmm1,xmm0
</pre>
<p>The SSE double-precision instructions are used. f3 &#8212; single precision &#8212; is compared to the double-precision intermediate result f1 * f2, resulting in a mismatch.</p>
<h3>Please Try it Out</h3>
<p>If you have access to a different compiler or processor, please try these programs out. Let me know what you find!</p>
<h2>The Message</h2>
<p>You can&#8217;t count on floats being handled as single precision values; they can be processed in double or extended precision, as dictated by your compiler and CPU.</p>
<p class="break">(I have written a companion article called &ldquo;<a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/" title="Read Rick Regan's Article &ldquo;When Doubles Don’t Behave Like Doubles&rdquo;">When Doubles Don’t Behave Like Doubles</a>&rdquo;.)</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/">When Floats Don&#8217;t Behave Like Floats</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/when-floats-dont-behave-like-floats/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Floating-Point Error in the NPR Media Player</title>
		<link>http://www.exploringbinary.com/floating-point-error-in-the-npr-media-player/</link>
		<comments>http://www.exploringbinary.com/floating-point-error-in-the-npr-media-player/#comments</comments>
		<pubDate>Sat, 31 Oct 2009 14:11:58 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=254</guid>
		<description><![CDATA[The NPR Media Player apparently uses floating-point numbers to represent timestamps, based on this image (click it to enlarge): The counter displays NaN:NaN:NaN. It should show the current time count, like 01:12 (why it shows three NaNs instead of two I don&#8217;t know). NaN is the floating-point representation of &#8220;not a number,&#8221; which means something [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/floating-point-error-in-the-npr-media-player/">Floating-Point Error in the NPR Media Player</a></p>
]]></description>
			<content:encoded><![CDATA[<p>The NPR Media Player apparently uses floating-point numbers to represent timestamps, based on this image (click it to enlarge):</p>
<div class="wp-caption aligncenter" style="width: 460px"><a href="http://www.exploringbinary.com/wp-content/uploads/NPR-NaNs.jpg" title="Click to see full-size image of NaNs in NPR Media Player"><img src="http://www.exploringbinary.com/wp-content/uploads/NPR-NaNs-thumbnail.jpg" alt="NaNs in NPR Media Player (thumbnail)" width="450" height="391"/></a><p class="wp-caption-text">NaNs in NPR Media Player (click image to enlarge).</p></div>
<p><span id="more-254"></span>The counter displays <strong>NaN:NaN:NaN</strong>. It should show the current time count, like 01:12  (why it shows three NaNs instead of two I don&#8217;t know). <a title="Wikipedia article on Not-A-Number" href="http://en.wikipedia.org/wiki/NaN">NaN</a> is the floating-point representation of &#8220;not a number,&#8221; which means something went wrong in their calculation of the timestamp. </p>
<p>This happened after I hibernated my computer and powered up, while the media player was paused.</p>
<p>(Maybe my blogging &#8220;will bring change&#8221; to the NPR Media Player <img src='http://www.exploringbinary.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  .)</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/floating-point-error-in-the-npr-media-player/">Floating-Point Error in the NPR Media Player</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/floating-point-error-in-the-npr-media-player/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Displaying the Raw Fields of a Floating-Point Number</title>
		<link>http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/</link>
		<comments>http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/#comments</comments>
		<pubDate>Wed, 20 May 2009 13:28:08 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=235</guid>
		<description><![CDATA[A double-precision floating-point number is represented internally as 64 bits, divided into three fields: a sign field, an exponent field, and a fraction field. You don&#8217;t need to know this to use floating-point numbers, but knowing it can help you understand them. This article shows you how to access those fields in C code, and [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">Displaying the Raw Fields of a Floating-Point Number</a></p>
]]></description>
			<content:encoded><![CDATA[<p>A double-precision floating-point number is represented internally as 64 bits, divided into three fields: a sign field, an exponent field, and a fraction field. You don&#8217;t need to know this to <em>use</em> floating-point numbers, but knowing it can help you understand them. This article shows you how to access those fields in C code, and how to print them &#8212; in binary or hex.</p>
<p><span id="more-235"></span></p>
<h2>Treating a Floating-Point Variable as Raw Bits</h2>
<p>To access the raw storage of a double-precision floating-point variable, you need to treat the double not as a number, but as a generic sequence of bits. One way to do this is to cast the double as an array of characters. However, this is not ideal: it exposes <a title="Wikipedia article on endianness" href="http://en.wikipedia.org/wiki/Endianness">endianness</a>, and it does not allow for easy access to the IEEE defined fields &#8212; fields which span byte boundaries.</p>
<p>A better solution is to recast the double as an integer &#8212; a 64 bit integer. This gives endian-independent, bit-level access to the <a title="Read my article “What Powers of Two Look Like Inside a Computer”" href="http://www.exploringbinary.com/what-powers-of-two-look-like-inside-a-computer/">IEEE representation</a>. That is the solution I use in the C code below.</p>
<p>(<em><strong>Note:</strong></em> My code works on architectures &#8212; like Intel &#8212; that use the same endianness for both floating point and integer values. My code will <em>not</em> work with mixed-endian encoding, like that used by ARM &ldquo;soft float.&rdquo;)</p>
<h2>The Code</h2>
<p>I wrote three functions:</p>
<ul>
<li><strong>print_raw_double_binary()</strong>. This function casts the double as an integer and then shifts off its bits one at a time, printing each bit as it goes. A space is added between fields to delineate them.</li>
<li><strong>parse_double()</strong>. This function isolates the three fields of the double with shifts and bit masks and turns them into integers.</li>
<li><strong>print_raw_double_hex()</strong>. This function calls parse_double() to isolate the fields and then calls printf() with the &#8216;%X&#8217; format specifier to print them in hex (spaces are printed between fields).</li>
</ul>
<p>I declared and defined these functions in files I named rawdouble.h and rawdouble.c, respectively. I also wrote a test program, rawTest.c, to test the functions on various inputs.</p>
<h3>rawdouble.h</h3>
<pre>
/***********************************************************/
/* rawdouble.h: Functions to access and print the raw      */
/*              contents of an IEEE double floating-point  */
/*              variable                                   */
/*                                                         */
/* Rick Regan, http://www.exploringbinary.com              */
/*                                                         */
/***********************************************************/
void print_raw_double_binary(double d);

void parse_double(double d,
                  unsigned char *sign_field,
                  unsigned short *exponent_field,
                  unsigned long long *fraction_field);

void print_raw_double_hex(double d);
</pre>
<h3>rawdouble.c</h3>
<pre>
/***********************************************************/
/* rawdouble.c: Functions to access and print the raw      */
/*              contents of an IEEE double floating-point  */
/*              variable                                   */
/*                                                         */
/* Rick Regan, http://www.exploringbinary.com              */
/*                                                         */
/***********************************************************/
#include &quot;stdio.h&quot;
#include &quot;rawdouble.h&quot;

void <strong>print_raw_double_binary</strong>(double d)
{
 unsigned long long *double_as_int = (unsigned long long *)&#038;d;
 int i;

 for (i=0; i&lt;=63; i++)
   {
    if (i==1)
      printf(&quot; &quot;); // Space after sign field
    if (i==12)
      printf(&quot; &quot;); // Space after exponent field

    if ((*double_as_int &gt;&gt; (63-i)) &#038; 1)
      printf(&quot;1&quot;);
    else
      printf(&quot;0&quot;);
   }
 printf(&quot;\n&quot;);
}

void <strong>parse_double</strong>(double d,
                  unsigned char *sign_field,
                  unsigned short *exponent_field,
                  unsigned long long *fraction_field)
{
 unsigned long long *double_as_int = (unsigned long long *)&#038;d;

 *sign_field = (unsigned char)(*double_as_int &gt;&gt; 63);
 *exponent_field = (unsigned short)(*double_as_int &gt;&gt; 52 &#038; 0x7FF);
 *fraction_field = *double_as_int &#038; 0x000FFFFFFFFFFFFFULL;
}

void <strong>print_raw_double_hex</strong>(double d)
{
 unsigned char sign_field;
 unsigned short exponent_field;
 unsigned long long fraction_field;

 parse_double(d,&#038;sign_field,&#038;exponent_field,&#038;fraction_field);

 printf(&quot;%X %X %llX\n&quot;,sign_field,exponent_field,fraction_field);
}
</pre>
<h3>rawTest.c</h3>
<pre>
/***********************************************************/
/* rawTest.c: Program to test raw double functions         */
/*                                                         */
/* Rick Regan, http://www.exploringbinary.com              */
/*                                                         */
/***********************************************************/
#include &quot;stdio.h&quot;
#include &quot;assert.h&quot;
#include &quot;rawdouble.h&quot;

int main(int argc, char *argv[])
{
 double d;
 int i;

 /* Make sure unsigned long long is 8 bytes */
 assert (sizeof(unsigned long long) == sizeof(double));

 d = 0.5;
 printf(&quot;0.5:\n&quot;);
 print_raw_double_binary(d);
 print_raw_double_hex(d);
 printf(&quot;\n&quot;);

 d = 0.1;
 printf(&quot;0.1:\n&quot;);
 print_raw_double_binary(d);
 print_raw_double_hex(d);
 printf(&quot;\n&quot;);

 /* d = NaN */
 printf(&quot;NaN:\n&quot;);
 d = 0;
 d/=d;
 print_raw_double_binary(d);
 print_raw_double_hex(d);
 printf(&quot;\n&quot;);

 /* d = +infinity */
 printf(&quot;+infinity:\n&quot;);
 d = 1e300;
 d*=d;
 print_raw_double_binary(d);
 print_raw_double_hex(d);
 printf(&quot;\n&quot;);

 /* d = 2^-1074 */
 printf(&quot;2^-1074:\n&quot;);
 d = 1;
 for (i=1; i&lt;=1074; i++)
   d/=2;
 print_raw_double_binary(d);
 print_raw_double_hex(d);

 return (0);
}
</pre>
<h2>Notes</h2>
<ul>
<li>The code works for all double values, including negative numbers, not-a-number (NaN), and infinity.</li>
<li>You could use a union of a double and an integer instead of type casting the double to an integer, but this would require an extra step: copying the double to the union.</li>
<li>The statement <em>assert (sizeof(unsigned long long) == sizeof(double))</em> in rawTest.c is there to ensure that long long integers are 64 bits on your platform. You can remove it once the assertion checks out.</li>
</ul>
<h2>Compiling and Running</h2>
<p>I compiled and ran this code on both Windows and Linux:</p>
<ul>
<li>On Windows, I built a project in Visual C++ and compiled and ran it in there.</li>
<li>On Linux, I compiled with &ldquo;gcc rawTest.c rawdouble.c -o rawTest&rdquo; and then ran with &ldquo;./rawTest&rdquo;.</li>
</ul>
<h2>Output</h2>
<pre>
0.5:
0 01111111110 0000000000000000000000000000000000000000000000000000
0 3FE 0

0.1:
0 01111111011 1001100110011001100110011001100110011001100110011010
0 3FB 999999999999A

NaN:
1 11111111111 1000000000000000000000000000000000000000000000000000
1 7FF 8000000000000

+infinity:
0 11111111111 0000000000000000000000000000000000000000000000000000
0 7FF 0

2^-1074:
0 00000000000 0000000000000000000000000000000000000000000000000001
0 0 1
</pre>
<h2>Modifying to Print Floats</h2>
<p>The code can be modified to print floats; here&#8217;s an outline of what to change:</p>
<ul>
<li>Use &#8216;float&#8217; instead of &#8216;double&#8217;.</li>
<li>Use &#8216;unsigned long&#8217; instead of &#8216;unsigned long long&#8217;.</li>
<li>Use the &#8216;%lX&#8217; printf() format specifier instead of &#8216;%llX&#8217;</li>
<li>Use bit masks and shift amounts appropriate for the <a title="Wikipedia article on IEEE Floating-point" href="http://en.wikipedia.org/wiki/IEEE_754-1985">float representation</a>.</li>
<li>In print_raw_double_binary(), or what would become print_raw_float_binary(), print a space after bit 9 instead of after bit 12.</li>
<li>Rename the routines and variables to use &#8216;float&#8217; instead of &#8216;double&#8217;.</li>
</ul>
<h2>Addendum: Printing in Binary Scientific Notation</h2>
<p>Another way to display a floating-point number is in binary scientific notation; I wrote a C function called <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">print_double_binsci()</a> to do this.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">Displaying the Raw Fields of a Floating-Point Number</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
