<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Exploring Binary &#187; Numbers in computers</title>
	<atom:link href="http://www.exploringbinary.com/category/numbers-in-computers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.exploringbinary.com</link>
	<description>Binary Numbers, Binary Code, and Binary Logic</description>
	<lastBuildDate>Tue, 31 Jan 2012 16:38:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Bigcomp: Deciding Truncated, Near Halfway Conversions</title>
		<link>http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/</link>
		<comments>http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/#comments</comments>
		<pubDate>Fri, 02 Dec 2011 15:40:16 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Convert to decimal]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Exponents]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Fractions]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=353</guid>
		<description><![CDATA[In my article &#8220;Using Integers to Check a Floating-Point Approximation,&#8221; I briefly mentioned &#8220;bigcomp,&#8221; an optimization strtod() uses to reduce big integer overhead when checking long decimal inputs. bigcomp does a floating-point to decimal conversion &#8212; right in the middle of a decimal to floating-point conversion mind you &#8212; to generate the decimal expansion of [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/">Bigcomp: Deciding Truncated, Near Halfway Conversions</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my article &ldquo;<a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">Using Integers to Check a Floating-Point Approximation</a>,&rdquo; I briefly mentioned &ldquo;bigcomp,&rdquo; an optimization strtod() uses to reduce big integer overhead when checking long decimal inputs. bigcomp does a floating-point to decimal conversion &#8212; right in the middle of a decimal to floating-point conversion mind you &#8212; to generate the decimal expansion of the number halfway between two target floating-point numbers. This decimal expansion is compared to the input decimal string, and the result of the comparison dictates which of the two target numbers is the correctly rounded result.</p>
<p>In this article, I&#8217;ll explain how bigcomp works, and when it applies. Also, I&#8217;ll talk briefly about its performance; my informal testing shows that, under the default setting, bigcomp actually <em>worsens</em> performance for some inputs.</p>
<p><span id="more-353"></span></p>
<h2>When Bigcomp Applies</h2>
<p>Bigcomp is enabled by default in strtod() (it can be disabled by &ldquo;&#35;defining&rdquo; the macro &ldquo;NO_STRTOD_BIGCOMP&rdquo;). It applies to decimal strings longer than STRTOD_DIGLIM significant digits, which by default is &ldquo;&#35;defined&rdquo; to 40. For decimal inputs <em>dStr</em> of 40 significant digits or fewer, the <a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">normal approximation check</a> is done, using all significant digits of <em>dStr</em> for its big integer representation <em>d</em>; for inputs longer than 40 digits, only the first 18 digits are used for <em>d</em>. This truncation of <em>d</em> leads to smaller big integers, but it can lead to cases where the approximation check cannot decide between adjacent floating-point values. When that happens, the bigcomp procedure kicks in, through a call to the bigcomp() function.</p>
<p>Specifically, bigcomp() is called when the approximation check cannot decide between the approximation <em>b</em> and its successor, <em>b + u</em>. (As with all my articles in this series, I will only be discussing how strtod() works for round-to-nearest/round-half-to-even rounding.) This happens when the truncated value of <em>d</em> falls between <em>b</em> and <em>b + h</em>; that is, between <em>b</em> and one-half ULP above <em>b</em>.</p>
<p>(In this article, I distinguish between the input <em>dStr</em> and its big integer representation <em>d</em>, which may or may not be a truncation of <em>dStr</em>.) </p>
<h2>Overview of the Algorithm</h2>
<p>bigcomp() compares the significant digits of <em>dStr</em> to a string representation of the significant digits of <em>b + h</em>. It goes left-to-right, digit by significant digit, stopping when the two strings differ. (In the bulk of cases, the difference occurs between the 16th and 18th digits, although in the worst case the difference can occur at the 54th digit; it depends on how close the decimal number is to <em>b + h</em>.) If <em>dStr</em> is <em>less</em> than <em>b + h</em>, bigcomp() returns <em>b</em>; if <em>dStr</em> is <em>greater</em> than <em>b + h</em>, bigcomp() returns <em>b + u</em>;  if <em>dStr</em> equals <em>b + h</em>, either <em>b</em> or <em>b + u</em> is chosen, according to round-half-to-even rounding.</p>
<p>bigcomp() generates the significant digits of <em>b + h</em> using essentially the same algorithm as the dtoa() function, the floating-point to decimal conversion routine co-located in <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s dtoa.c</a>. It uses big integer arithmetic, including big integer <em>division</em>.</p>
<p>bigcomp() starts by creating a fraction that represents the significant digits of <em>b + h</em> (not <em>b + h</em> itself). In this form, it generates a digit by multiplying the numerator of the fraction by ten, and then using integer division to form a quotient and remainder. The quotient represents the next digit; the remainder represents the remaining digits, which are accessed one at a time by repeating this process. Digits are generated until there is a mismatch between <em>dStr</em> and <em>b + h</em>.</p>
<h2>The Algorithm, By Example</h2>
<p>I&#8217;ll break the algorithm into five steps, and illustrate it with two examples. I set STRTOD_DIGLIM to 19 so I could generate shorter examples (both are 20 digits).</p>
<h3>Example 1: Check 1.3694713649464322631e-11</h3>
<p>In this example, we want to check if <em>dStr</em> = 1.3694713649464322631e-11 is closer to <em>b</em> = 0&#120;1.e1d703bb5749cp-37 or <em>b + u</em> = 0&#120;1.e1d703bb5749dp-37. We do this by comparing <em>dStr</em> to <em>b + h</em> =  0&#120;1.e1d703bb5749c8p-37. (I describe <em>b + h</em> using a <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constant</a>, even though it won&#8217;t be represented as an IEEE floating-point number. This number has 54 bits, with bit 54 being a 1. It is <em>b</em> with a hexadecimal &lsquo;8&rsquo; tacked on at the end.)</p>
<h4>1. Represent <em>b + h</em> as an integer times a power of two</h4>
<p>We start by representing <em>b</em> as a 53-bit integer times a power of two, as we do in the <a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">basic approximation check</a>:</p>
<pre class="noborder">
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup>
  = 8476617176609948 x 2<sup>-89</sup>
</pre>
<p>Since <em>bInt</em> is a 53-bit integer, <em>u</em> = 2<sup><em>bExp</em></sup>, and <em>h</em> is 2<sup><em>bExp</em>-1</sup>. Therefore,</p>
<pre class="noborder">
<em>b + h</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup> + 2<sup><em>bExp</em>-1</sup>
      = (<em>bInt</em> x 2 + 1) x 2<sup><em>bExp</em>-1</sup>
      = (8476617176609948 x 2 + 1) x 2<sup>-90</sup>
      = 16953234353219897 x 2<sup>-90</sup>
</pre>
<p><em>In code</em>: We create <em>bInt</em> and <em>bExp</em> from <em>b</em>&rsquo;s <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">underlying floating-point representation</a>: we treat its significand as a 53-bit integer, and subtract 52 from its power of two exponent. We create a big integer from <em>bInt</em>, and then we make it the integer portion of <em>b + h</em> by shifting it left one bit and &lsquo;ORing&rsquo; a 1 into its least significant bit. We maintain the exponent of the power of two as a native integer, setting it to <em>bExp</em>-1.</p>
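<p>As an illustration only (this is not code from dtoa.c), the step can be sketched in Python, whose integers are arbitrary precision. The function name <em>halfway_parts</em> is hypothetical; the sketch assumes a positive, normal double, and it uses multiplication and addition where the C code shifts and ORs:</p>

```python
import math

def halfway_parts(b):
    """Return (integer, exponent) with b + h == integer * 2**exponent.

    Hypothetical helper; assumes b is a positive, normal IEEE double.
    """
    mantissa, exponent = math.frexp(b)   # b = mantissa * 2**exponent, mantissa in [0.5, 1)
    b_int = int(mantissa * 2**53)        # significand as a 53-bit integer (exact)
    b_exp = exponent - 53                # so that b = b_int * 2**b_exp
    half_int = 2 * b_int + 1             # same as shifting left one bit and setting the low bit
    return half_int, b_exp - 1


print(halfway_parts(float.fromhex("0x1.e1d703bb5749cp-37")))
# (16953234353219897, -90)
```

<p>The printed pair matches the integer and exponent derived by hand above.</p>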
<h4>2. Scale <em>b + h</em> so its first significant digit is just to the left of the decimal point</h4>
<p>If you were to compute <em>b + h</em> from the expression above, you&#8217;d see its value is</p>
<p>0.0000000000136947136494643226044416541235590733258456475063269408565<br />
24743139743804931640625</p>
<p>(We&#8217;re not going to do that; I just wanted to show you where we&#8217;re headed. By the way, I computed it using <a title="Read Rick Regan's article &ldquo;Exploring Binary Numbers With PARI/GP Calculator&rdquo;" href="http://www.exploringbinary.com/exploring-binary-numbers-with-parigp-calculator/">PARI/GP</a>.)</p>
<p>What we want to do is scale <em>b + h</em> so that it looks like</p>
<p>1.3694713649464322604441654123559073325845647506326940856524743139743<br />
804931640625</p>
<p>In other words, we want to multiply it by a power of ten so that, when viewed as a decimal number, its significant digits look normalized; 10<sup>11</sup> does the trick, effectively shifting <em>b + h</em> <em>left</em> by 11 decimal places:</p>
<pre class="noborder">
Scaled <em>b + h</em> = 16953234353219897 x 2<sup>-90</sup> x 10<sup>11</sup>
</pre>
<p>This &ldquo;normalization&rdquo; sets up the digit generation algorithm.</p>
<p><em>In code</em>: A full-blown floating-point to decimal conversion would need to use logarithms to compute the power of ten; but in this case, we can obtain the power of ten exponent directly from <em>dStr</em>.</p>
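<p>One way to read the exponent off the input string, sketched in Python with the standard decimal module (strtod() gets it from its own parsing of <em>dStr</em>, not this way). The function name <em>scaling_exponent</em> is hypothetical, and the sketch assumes <em>dStr</em> and <em>b + h</em> have their first significant digit in the same decimal position:</p>

```python
from decimal import Decimal

def scaling_exponent(d_str):
    """Power-of-ten exponent that moves the first significant digit of
    d_str to just left of the decimal point (hypothetical helper)."""
    # adjusted() is the decimal exponent of the most significant digit.
    return -Decimal(d_str).adjusted()


print(scaling_exponent("1.3694713649464322631e-11"))   # 11
print(scaling_exponent("9.3170532238714134438e+16"))   # -16
```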
<h4>3. Combine powers of two</h4>
<p>If we decompose the power of ten into a power of two and a power of five, we get</p>
<pre class="noborder">
Scaled <em>b + h</em> = 16953234353219897 x 2<sup>-90</sup> x 2<sup>11</sup> x 5<sup>11</sup>
</pre>
<p>We can now <a title="Read Rick Regan's Article &ldquo;Composing Powers of Two Using The Laws of Exponents&rdquo;" href="http://www.exploringbinary.com/composing-powers-of-two-using-the-laws-of-exponents/">combine powers of two</a> to reduce the size of the resulting big integers.</p>
<pre class="noborder">
Scaled <em>b + h</em> = 16953234353219897 x 2<sup>-79</sup> x 5<sup>11</sup>
</pre>
<p><em>In code</em>: The code is simple: just add the exponents of the power of two from <em>b + h</em> and the power of two from the power of ten scaling factor.</p>
<h4>4. Express scaled <em>b + h</em> as a fraction</h4>
<p>The negative power of two, 2<sup>-79</sup>, makes this a fraction:</p>
<pre class="noborder">
Scaled <em>b + h</em> = (16953234353219897 x 5<sup>11</sup>)/2<sup>79</sup>
             = 827794646153315283203125/604462909807314587353088
</pre>
<p><em>In code</em>: Compute the powers as big integers, using the absolute values of the exponents; then, multiply the terms out, as appropriate.</p>
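<p>Steps 3 and 4 together can be sketched in Python as follows. This is an illustration under my own naming (<em>scaled_fraction</em>, <em>half_int</em>, <em>half_exp</em>, and <em>k</em> are hypothetical names), not the big integer code in dtoa.c:</p>

```python
def scaled_fraction(half_int, half_exp, k):
    """Return (num, den) with num/den == half_int * 2**half_exp * 10**k.

    Hypothetical helper: half_int * 2**half_exp is b + h, and 10**k is
    the scaling factor from step 2.
    """
    e2 = half_exp + k     # combined power of two (step 3)
    e5 = k                # power of five left over from 10**k
    num, den = half_int, 1
    # A non-negative exponent multiplies the numerator; a negative one
    # multiplies the denominator by the power's absolute value (step 4).
    if e2 >= 0:
        num *= 2 ** e2
    else:
        den *= 2 ** -e2
    if e5 >= 0:
        num *= 5 ** e5
    else:
        den *= 5 ** -e5
    return num, den


print(scaled_fraction(16953234353219897, -90, 11))
# (827794646153315283203125, 604462909807314587353088)
```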
<h4>5. Generate and check digits</h4>
<p>Scaled <em>b + h</em> has a numerator <em>num</em> and denominator <em>den</em>:</p>
<pre class="noborder">
<em>num</em> = 827794646153315283203125
<em>den</em> = 604462909807314587353088
</pre>
<p>The significant digits are generated from these integers using the &ldquo;repeated multiplication by ten&rdquo; algorithm; this is what&#8217;s done at each iteration (<em>dig</em> stands for &lsquo;digit&rsquo;, and <em>rem</em> stands for &lsquo;remainder&rsquo;):</p>
<ol>
<li><em>dig</em> = <em>num</em>/<em>den</em> (just the integer part of the quotient)</li>
<li><em>rem</em> = <em>num</em> &#8211; <em>dig</em> x <em>den</em></li>
<li><em>num</em> = <em>rem</em> x 10</li>
</ol>
<p>In other words, at each iteration, a digit is peeled off and the rest of the number is shifted left one decimal place. This step is repeated until corresponding digits of <em>dStr</em> and <em>b + h</em> don&#8217;t match; here are the first three iterations:</p>
<pre class="noborder">
1. 827794646153315283203125/<em>den</em>  = 1 <em>rem</em> 223331736346000695850037
2. 2233317363460006958500370/<em>den</em> = 3 <em>rem</em> 419928634038063196441106
3. 4199286340380631964411060/<em>den</em> = 6 <em>rem</em> 572508881536744440292532
4. ...
</pre>
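<p>The iterations listed above can be reproduced with a few lines of Python. This is only a sketch of the &ldquo;repeated multiplication by ten&rdquo; loop (the name <em>digit_steps</em> is hypothetical), and it assumes the fraction has already been scaled so that every quotient is a single digit:</p>

```python
def digit_steps(num, den, count):
    """Return the first `count` (digit, remainder) pairs of num/den.

    Hypothetical helper; assumes num/den is at least 1 and below 10.
    """
    steps = []
    for _ in range(count):
        dig, rem = divmod(num, den)   # integer quotient and remainder
        steps.append((dig, rem))
        num = rem * 10                # shift the rest left one decimal place
    return steps


for dig, rem in digit_steps(827794646153315283203125,
                            604462909807314587353088, 3):
    print(dig, rem)
# 1 223331736346000695850037
# 3 419928634038063196441106
# 6 572508881536744440292532
```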
<p>This stops after step 19; that is, <em>dStr</em> and <em>b + h</em> differ at the 19th digit. Here are the digits of <em>dStr</em> and <em>b + h</em> aligned, to show where they differ:</p>
<pre class="noborder">
Digits of <em>dStr</em>  = 136947136494643226<span class="highlight_gray4">3</span>1
Digits of <em>b + h</em> = 136947136494643226<span class="highlight_gray4">0</span>444165412355907332584564750632
6940856524743139743804931640625
</pre>
<p>Since 3 is greater than 0, <em>dStr</em> is closer to <em>b + u</em> than it is to <em>b</em>, so <em>b + u</em> is the answer.</p>
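<p>The digit-by-digit comparison of step 5 might look like this in Python. It is a sketch, not the bigcomp() source: the name <em>compare_to_halfway</em> and its return convention are my own assumptions, and it relies on the scaled fraction having a terminating decimal expansion (true here, since the denominator has only factors of two and five):</p>

```python
def compare_to_halfway(d_digits, num, den):
    """Compare the significant digits of dStr with those of num/den.

    Hypothetical helper. Returns (sign, position): sign is 1 if dStr is
    greater, -1 if it is smaller, 0 if every digit matches; position is
    the 1-based index of the first differing digit (None if there is none).
    """
    i = 0
    # Keep going while either side still has digits to offer.
    while num != 0 or len(d_digits) > i:
        half_digit, rem = divmod(num, den)
        d_digit = int(d_digits[i]) if len(d_digits) > i else 0
        if d_digit != half_digit:
            return (1 if d_digit > half_digit else -1), i + 1
        num = rem * 10
        i += 1
    return 0, None


print(compare_to_halfway("13694713649464322631",
                         827794646153315283203125,
                         604462909807314587353088))
# (1, 19)
```

<p>A result of 1 corresponds to choosing <em>b + u</em>, and -1 to choosing <em>b</em>; a result of 0 is the halfway case, which would be settled by round-half-to-even rounding.</p>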
<h3>Example 2: Check 9.3170532238714134438e+16</h3>
<p>In this example, we want to check if <em>dStr</em> = 9.3170532238714134438e+16 is closer to <em>b</em> = 0&#120;1.4b021afd9f651p+56 or <em>b + u</em> = 0&#120;1.4b021afd9f652p+56. We do this by comparing <em>dStr</em> to <em>b + h</em> =  0&#120;1.4b021afd9f6518p+56.</p>
<h4>1. Represent <em>b + h</em> as an integer times a power of two</h4>
<pre class="noborder">
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup>
  = 5823158264919633 x 2<sup>4</sup>

<em>b + h</em> = (<em>bInt</em> x 2 + 1) x 2<sup><em>bExp</em>-1</sup>
      = (5823158264919633 x 2 + 1) x 2<sup>3</sup>
      = 11646316529839267 x 2<sup>3</sup>
</pre>
<h4>2. Scale <em>b + h</em> so its first significant digit is just to the left of the decimal point</h4>
<p>If you were to compute <em>b + h</em> from the expression above, you&#8217;d see its value is</p>
<p>93170532238714136</p>
<p>We want to shift it <em>right</em> 16 decimal places so it looks like</p>
<p>9.3170532238714136</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267 x 2<sup>3</sup> x 10<sup>-16</sup>
</pre>
<h4>3. Combine powers of two</h4>
<p>If we decompose the power of ten into a power of two and a power of five, we get</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267 x 2<sup>3</sup> x 2<sup>-16</sup> x 5<sup>-16</sup>
</pre>
<p>Combining powers of two we get</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267 x 2<sup>-13</sup> x 5<sup>-16</sup>
</pre>
<h4>4. Express scaled <em>b + h</em> as a fraction</h4>
<p>The negative powers of two and five make this a fraction:</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267/(2<sup>13</sup> x 5<sup>16</sup>)
             = 11646316529839267/1250000000000000
</pre>
<h4>5. Generate and check digits</h4>
<p>Scaled <em>b + h</em> has a numerator <em>num</em> and denominator <em>den</em>:</p>
<pre class="noborder">
<em>num</em> = 11646316529839267
<em>den</em> = 1250000000000000
</pre>
<p>Here are the first three iterations of the repeated multiplication by ten algorithm:</p>
<pre class="noborder">
1. 11646316529839267/<em>den</em> = 9 <em>rem</em> 396316529839267
2. 3963165298392670/<em>den</em>  = 3 <em>rem</em> 213165298392670
3. 2131652983926700/<em>den</em>  = 1 <em>rem</em> 881652983926700
4. ...
</pre>
<p>This stops after step 17; that is, <em>dStr</em> and <em>b + h</em> differ at the 17th digit. Here are the digits of <em>dStr</em> and <em>b + h</em> aligned, to show where they differ:</p>
<pre class="noborder">
Digits of <em>dStr</em>  = 9317053223871413<span class="highlight_gray4">4</span>438
Digits of <em>b + h</em> = 9317053223871413<span class="highlight_gray4">6</span>
</pre>
<p>Since 4 is less than 6, <em>dStr</em> is closer to <em>b</em> than it is to <em>b + u</em>, so <em>b</em> is the answer.</p>
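<p>The outcome of both examples can be cross-checked with exact rational arithmetic, using Python&#8217;s fractions module. This is not what bigcomp() does (it avoids building the full rational value of <em>dStr</em>); it is just an independent check, and the function name <em>closer_neighbor</em> is hypothetical:</p>

```python
from fractions import Fraction

def closer_neighbor(d_str, b_hex, b_plus_u_hex):
    """Return the hex string of whichever neighbor is closer to d_str,
    or "tie" if d_str is exactly halfway (hypothetical helper)."""
    d = Fraction(d_str)                               # exact value of the decimal string
    b = Fraction(float.fromhex(b_hex))                # exact value of b
    b_plus_u = Fraction(float.fromhex(b_plus_u_hex))  # exact value of b + u
    halfway = (b + b_plus_u) / 2                      # b + h
    if d == halfway:
        return "tie"
    return b_plus_u_hex if d > halfway else b_hex


print(closer_neighbor("1.3694713649464322631e-11",
                      "0x1.e1d703bb5749cp-37", "0x1.e1d703bb5749dp-37"))
print(closer_neighbor("9.3170532238714134438e+16",
                      "0x1.4b021afd9f651p+56", "0x1.4b021afd9f652p+56"))
```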
<h3>Discussion</h3>
<p>strtod() has an extra step, which I&#8217;ve not described. It scales the numerator and denominator by a common power of two, which is necessary for its &lsquo;quorem()&rsquo; function to compute the quotient and remainder properly. The power of two exponent is determined by a function called &lsquo;dshift().&rsquo;</p>
<p>If <em>dStr</em> represents a fractional value, step 2 (scale by power of ten) and step 3 (combine powers of two) are technically unnecessary. Without them, step 5 (generate digits) would generate leading zeros, which could be ignored. (Of course, the pre-scaling is more elegant, as it reduces everything to one form.)</p>
<h2>The 40-Digit Cutoff Is Too Low</h2>
<p>bigcomp() does a lot of work with big integers, which made me wonder why it performed better than the basic approximation check. I ran some tests, and found that it <em>doesn&#8217;t</em> perform better &#8212; not at the default STRTOD_DIGLIM value of 40! My tests indicate that bigcomp() only performs better starting at somewhere between 80 and 83 digits. (I don&#8217;t know why 40 was chosen as the default.) bigcomp() outperforms the basic check for really long inputs; for example, the difference in performance is great at 500 digits, and substantial at 1000 digits.</p>
<p>Let&#8217;s look at a 1000-digit input to see the difference in the big integers involved:</p>
<pre class="noborder">
<em>dStr</em> = 5.39393324717760434968487301635560570642059608762708229108908
24537754965408622951477281124864324524316631524646435056101989215441
13243584680626358335990970896315873327586087827333915843584231417344
18113410301502801068079096892041478634628161830670245841932386407800
12745803879948980583567722824049594060350336856210415679895612943123
50098271611832693583464221850190462604938332826437767915224673375308
74570398083131101089815004834148500912320646940469604931721018872877
13566715387976682236017863055187831755129573243931855002598816011501
73130862094578091926124182802567489337171671008075773833668718543916
59331687129078431660248055444327878892028873933579030052796118406453
59419797286387142281725295340184928613290361074947899276429780250293
85401861377815243939607988608126354001619571808095966555370152710852
57007714188421395421027623995592231545518458559072989336658831420578
08271261804364513313549623924640140003927894134813865873451090057841
21193123838848522389829029095139051820951352412966707477e+16
</pre>
<p>Without bigcomp, using the <a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">basic approximation check</a>, these are the variables involved:</p>
<pre class="noborder">
<em>dS</em> = 539393324717760434968487301635560570642059608762708229108908245
37754965408622951477281124864324524316631524646435056101989215441132
43584680626358335990970896315873327586087827333915843584231417344181
13410301502801068079096892041478634628161830670245841932386407800127
45803879948980583567722824049594060350336856210415679895612943123500
98271611832693583464221850190462604938332826437767915224673375308745
70398083131101089815004834148500912320646940469604931721018872877135
66715387976682236017863055187831755129573243931855002598816011501731
30862094578091926124182802567489337171671008075773833668718543916593
31687129078431660248055444327878892028873933579030052796118406453594
19797286387142281725295340184928613290361074947899276429780250293854
01861377815243939607988608126354001619571808095966555370152710852570
07714188421395421027623995592231545518458559072989336658831420578082
71261804364513313549623924640140003927894134813865873451090057841211
93123838848522389829029095139051820951352412966707477

<em>bS</em> = 539393324717760400000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000

<em>hS</em> = 400000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000

&#124;<em>dS-bS</em>&#124; = 3496848730163556057064205960876270822910890824537754965408
62295147728112486432452431663152464643505610198921544113243584680626
35833599097089631587332758608782733391584358423141734418113410301502
80106807909689204147863462816183067024584193238640780012745803879948
98058356772282404959406035033685621041567989561294312350098271611832
69358346422185019046260493833282643776791522467337530874570398083131
10108981500483414850091232064694046960493172101887287713566715387976
68223601786305518783175512957324393185500259881601150173130862094578
09192612418280256748933717167100807577383366871854391659331687129078
43166024805544432787889202887393357903005279611840645359419797286387
14228172529534018492861329036107494789927642978025029385401861377815
24393960798860812635400161957180809596655537015271085257007714188421
39542102762399559223154551845855907298933665883142057808271261804364
51331354962392464014000392789413481386587345109005784121193123838848
522389829029095139051820951352412966707477
</pre>
<p>&#124;<em>dS-bS</em>&#124; is less than <em>hS</em>, so <em>b</em> is chosen.</p>
<p><em>With</em> bigcomp, there are two sets of big integers involved: one set for the basic approximation check (on the truncated value), and one set for bigcomp. Here are the big integers used for the basic approximation check:</p>
<pre class="noborder">
<em>dS</em> = 539393324717760434
<em>bS</em> = 539393324717760400
<em>hS</em> = 40
&#124;<em>dS-bS</em>&#124; = 34
</pre>
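<p>These small integers can be reproduced with a short Python sketch. The function name <em>scaled_check_integers</em> is hypothetical, and the common scale factor it uses is simply one that clears all negative exponents; strtod() chooses its own scaling, so the values need not match the C code for every input:</p>

```python
import math

def scaled_check_integers(d_int, d_exp10, b):
    """Return (dS, bS, hS) for d = d_int * 10**d_exp10 and a positive,
    normal double b, all scaled by one common factor (hypothetical helper)."""
    mantissa, exponent = math.frexp(b)
    b_int = int(mantissa * 2**53)
    b_exp = exponent - 53              # b = b_int * 2**b_exp, h = 2**(b_exp - 1)
    p10 = max(0, -d_exp10)             # power of ten needed to make d an integer
    p2 = max(0, 1 - b_exp)             # power of two needed to make h an integer
    d_s = d_int * 10**(d_exp10 + p10) * 2**p2
    b_s = b_int * 2**(b_exp + p2) * 10**p10
    h_s = 2**(b_exp - 1 + p2) * 10**p10
    return d_s, b_s, h_s


# First 18 digits of the 1000-digit input, and the approximation b.
d_s, b_s, h_s = scaled_check_integers(539393324717760434, -1, 53939332471776040.0)
print(d_s, b_s, h_s, abs(d_s - b_s))
# 539393324717760434 539393324717760400 40 34
```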
<p>Here are the big integers used for bigcomp (not shown is the extra factor of 2<sup>8</sup> in the numerator and denominator due to dshift()):</p>
<pre class="noborder">
<em>num</em> = 13484833117944011
<em>den</em> = 2500000000000000
</pre>
<p>bigcomp() finds that digit 17 of <em>dStr</em> (3) is less than digit 17 of <em>b + h</em> (4), so it chooses <em>b</em>.</p>
<h2>Acknowledgement</h2>
<p>Thanks to Mark Dickinson for his commentary on bigcomp, both in <a title="Python dtoa.c" href="http://svn.python.org/view/python/branches/py3k-dtoa/Python/dtoa.c?revision=83813&#038;view=markup">Python&#8217;s version of David Gay&#8217;s dtoa.c</a>, and in private communication.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Integers to Check a Floating-Point Approximation</title>
		<link>http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/</link>
		<comments>http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/#comments</comments>
		<pubDate>Mon, 31 Oct 2011 14:49:32 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Exponents]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=349</guid>
		<description><![CDATA[For decimal inputs that don&#8217;t qualify for fast path conversion, David Gay&#8217;s strtod() function does three things: first, it uses IEEE double-precision floating-point arithmetic to calculate an approximation to the correct result; next, it uses arbitrary-precision integer arithmetic (AKA big integers) to check if the approximation is correct; finally, it adjusts the approximation, if necessary. [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">Using Integers to Check a Floating-Point Approximation</a></p>
]]></description>
			<content:encoded><![CDATA[<p>For decimal inputs that don&#8217;t qualify for <a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">fast path conversion</a>, David Gay&#8217;s strtod() function does three things: first, it uses IEEE double-precision floating-point arithmetic to <a title="Read Rick Regan's article &ldquo;strtod()&rsquo;s Initial Decimal to Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">calculate an approximation</a> to the correct result; next, it uses arbitrary-precision integer arithmetic (AKA big integers) to check if the approximation is correct; finally, it adjusts the approximation, if necessary. In this article, I&#8217;ll explain the second step &#8212; how the check of the approximation is done.</p>
<p><span id="more-349"></span></p>
<h2>Overview</h2>
<p>The goal of David Gay&#8217;s strtod() is to perform efficient, correctly rounded decimal to floating point conversion. To that end, it minimizes the overhead due to arbitrary-precision arithmetic, which is unavoidable for some conversions. In particular, strtod() avoids <em>division</em> of big integers, as would be required by a <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">full-blown big integer algorithm</a>. Instead of dividing, it multiplies, adds, subtracts, and compares.</p>
<p>The key to the algorithm is the floating-point approximation. It can be compared to the original decimal input to see if it&#8217;s too small, correct, or too big (<a title="Read Rick Regan's article &ldquo;strtod()&rsquo;s Initial Decimal to Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">it will be within 10 ULPs or so of correct</a>). The approximation can be adjusted accordingly.</p>
<p>(My discussion will deal only with normal double-precision numbers; strtod() handles subnormal numbers as well.)</p>
<h2>Correctly Rounded Conversions</h2>
<p>For a floating-point number <em>b</em> to be considered the correctly rounded conversion of a decimal number <em>d</em>, it must be the <em>closest</em> floating-point number to <em>d</em>. That definition sounds simple enough, but there&#8217;s a problem: the meaning of &ldquo;closest&rdquo; can depend on the value of <em>b</em>. In any case, the tests for closeness are all based on a measure called a ULP &#8212; a binary (for our purposes) <em>unit in the last place</em>. A ULP is the gap between adjacent floating-point numbers.</p>
<p>I&#8217;ll present the tests as three separate cases, although in code they must be considered together.</p>
<h3>Case 1: <em>d</em> is less than one-half ULP away from <em>b</em></h3>
<p>If <em>d</em> is less than one-half of a ULP from <em>b</em>, then <em>b</em> is the correctly rounded conversion of <em>d</em> (this test is ambiguous when <em>b</em> is a power of two &#8212; I&#8217;ll explain below). Mathematically, this is stated as</p>
<p><em>b</em> &#8211; <em>h</em> &lt; <em>d</em> &lt; <em>b</em> + <em>h</em> ,</p>
<p>where <em>h</em> = 1/2&middot;<em>u</em>, and where <em>u</em> equals one ULP.</p>
<p>Here&#8217;s what the relationship looks like:</p>
<div class="wp-caption aligncenter" style="width: 494px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinInterval.png" alt="Correct rounding when d is less than one-half ULP away from b" width="484" height="166"/><p class="wp-caption-text">Correct rounding when <em>d</em> is less than one-half ULP away from <em>b</em></p></div>
<p>Equivalently, this relationship can be written as</p>
<p>-<em>h</em> &lt; <em>d</em> &#8211; <em>b</em> &lt; <em>h</em> ,</p>
<p>or</p>
<p>&#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em> .</p>
<h3>Case 2: <em>d</em> is one-half ULP away from <em>b</em></h3>
<p>If <em>d</em> is one-half of a ULP away from <em>b</em>, then <em>b</em> is the correctly rounded conversion of <em>d</em> only if <em>b</em>&rsquo;s least significant bit is 0  (this test is ambiguous when <em>b</em> is a power of two &#8212; I&#8217;ll explain below). This is called <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round-half-to-even</a> rounding.</p>
<p>The relationship is written as</p>
<p>&#124;<em>d</em> &#8211; <em>b</em>&#124; = <em>h</em></p>
<div class="wp-caption aligncenter" style="width: 495px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinInterval.halfway.png" alt="Correct rounding when d is one-half ULP away from b" width="485" height="197"/><p class="wp-caption-text">Correct rounding when <em>d</em> is one-half ULP away from <em>b</em></p></div>
<h3>Case 3: <em>b</em> is a power of two</h3>
<p>If <em>b</em> is a power of two, two different ULP values come into play, making our half ULP test ambiguous. One ULP <em>below</em> <em>b</em> is half as much as one ULP <em>above</em> <em>b</em>, since a ULP doubles at each increasing power of two boundary. This means that if <em>b</em> is a power of two, it is the correctly rounded conversion of <em>d</em> only if <em>d</em> is no more than one-half ULP above <em>b</em> or no more than one-quarter of that same (upper) ULP below <em>b</em>; here is the inequality:</p>
<p>-<em>h</em>/2 &le; <em>d</em> &#8211; <em>b</em> &le; <em>h</em> .</p>
<p>This is what it looks like:</p>
<div class="wp-caption aligncenter" style="width: 499px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinInterval.PO2.png" alt="Correct rounding when b is a power of two" width="489" height="175"/><p class="wp-caption-text">Correct rounding when <em>b</em> is a power of two</p></div>
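<p>We use the &ldquo;&le;&rdquo; signs in the inequality because <em>b</em> is the correctly rounded result even if <em>d</em> = <em>b</em> &#8211; <em>h</em>/2 or <em>d</em> = <em>b</em> + <em>h</em>. This is because the fraction bits of the significands of (normal) powers of two are all zero, so the least significant bit of <em>b</em> is 0 and a halfway case rounds to <em>b</em> under round-half-to-even.</p>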
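<p>As a rough sketch of how the three cases fit together, here is a small C function that applies them to values that have already been scaled to integers. The function name, the result enum, and the use of 64-bit integers in place of big integers are assumptions made for illustration only; the power of two check compares 2(<em>d</em> &#8211; <em>b</em>) to -<em>h</em> so that no fraction is needed.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum { B_IS_CORRECT, B_IS_NOT_CORRECT } check_result;

/*
 * diff      : d - b, scaled to an integer (may be negative)
 * h         : one-half ULP (of the ULP above b), scaled by the same factor
 * b_is_even : true if b's least significant significand bit is 0
 * b_is_pow2 : true if b is a (normal) power of two
 * Values are assumed small enough that 2 * diff does not overflow.
 */
check_result check_half_ulp(int64_t diff, int64_t h, bool b_is_even, bool b_is_pow2)
{
    assert(h > 0);

    if (b_is_pow2) {
        /* Case 3: -h/2 <= d - b <= h, tested as -h <= 2(d - b) and d - b <= h */
        return (2 * diff >= -h && diff <= h) ? B_IS_CORRECT : B_IS_NOT_CORRECT;
    }

    int64_t mag = diff < 0 ? -diff : diff;

    if (mag < h)                      /* Case 1: strictly inside one-half ULP */
        return B_IS_CORRECT;
    if (mag == h)                     /* Case 2: exactly halfway, round half to even */
        return b_is_even ? B_IS_CORRECT : B_IS_NOT_CORRECT;
    return B_IS_NOT_CORRECT;          /* more than one-half ULP away */
}
```

<p>For example, with <em>h</em> scaled to 10, a difference of -5 is accepted when <em>b</em> is a power of two, but a difference of -6 is not.</p>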
<h2>Comparing Using Big Integers</h2>
<p>To explain how the algorithm works, I will use this example: <em>d</em> = 1.7864e-45 and <em>b</em> = 0&#120;1.465a72e467d88p-149 (the latter is expressed as a <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constant</a>).  Here&#8217;s how <em>d</em> and <em>b</em> are related:</p>
<div class="wp-caption aligncenter" style="width: 502px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinExample.png" alt="1.7864 x 10^-45 In Relation to Its Binary Floating-Point Approximation 0&#120;1.465a72e467d88p-149" width="492" height="215"/><p class="wp-caption-text">1.7864 x 10<sup>-45</sup> In Relation to Its Binary Floating-Point Approximation, 0&#120;1.465a72e467d88p-149</p></div>
<p>The diagram shows that <em>b</em> is the closest floating-point number to <em>d</em> &#8212; but how do we check that programmatically, using binary arithmetic? We need to apply the above test: &#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em>. (I know a priori this is not a special case &#8212; I wanted to keep my example simple.) But there&#8217;s a problem: how do we represent <em>d</em>? Answer: as a big integer. If we also represent <em>b</em> and <em>h</em> as big integers, such that they are proportional to <em>d</em>, we can apply the test given by the inequality.</p>
<p>I&#8217;ve broken the process into four steps:</p>
<h3>Step 1: Represent the significant digits of <em>d</em> and <em>b</em> as integers</h3>
<p>The first step is to represent <em>d</em> as an integer times a power of ten, and <em>b</em> as an integer times a power of two:</p>
<p><em>d</em> = <em>dInt</em> x 10<sup><em>dExp</em></sup><br />
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup></p>
<p><em>d</em> was already put in this form when the <a title="Read Rick Regan's article &ldquo;strtod()’s Initial Decimal to Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">floating-point approximation <em>b</em> was computed</a> (but for the purposes of the comparison, <em>all</em> digits of <em>d</em> are used, not just the first 16). <em>b</em> can be put in this form simply, by <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">accessing its underlying floating-point representation</a>: we &ldquo;unnormalize&rdquo; its significand into an integer by decreasing its power of two exponent. We represent the significant digits as a <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">53-bit integer</a>, which means we subtract 52 from its power of two exponent.</p>
<p>For our example, <em>d</em> = 1.7864e-45 is represented as 17864 x 10<sup>-49</sup>, and <em>b</em> = 0&#120;1.465a72e467d88p-149, which in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> is</p>
<p>1.0100011001011010011100101110010001100111110110001 x 2<sup>-149</sup> ,</p>
<p>is represented (in decimal numerals) as 5741268244528520 x 2<sup>-201</sup>. </p>
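<p>The decomposition of <em>b</em> can be sketched in C by reading the raw fields of the double. This is only an illustration: the function name is hypothetical, it assumes that double is an IEEE 754 binary64 value, and it handles positive normal numbers only.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Split a positive, normal double b into bInt x 2^bExp,
 * where bInt is a 53-bit integer (hypothetical helper for this sketch).
 */
void decompose_double(double b, int64_t *bInt, int *bExp)
{
    uint64_t bits;
    memcpy(&bits, &b, sizeof bits);                   /* raw IEEE 754 fields */

    int exp_field = (int)((bits >> 52) & 0x7FF);      /* biased exponent */
    uint64_t frac = bits & ((UINT64_C(1) << 52) - 1); /* 52 fraction bits */

    assert((bits >> 63) == 0);                        /* positive only */
    assert(exp_field != 0 && exp_field != 0x7FF);     /* normal only */

    *bInt = (int64_t)(frac | (UINT64_C(1) << 52));    /* restore the implicit leading 1 */
    *bExp = exp_field - 1023 - 52;                    /* unbias, then subtract 52 */
}
```

<p>Called with 0&#120;1.465a72e467d88p-149, it produces <em>bInt</em> = 5741268244528520 and <em>bExp</em> = -201, matching the values above.</p>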
<h3>Step 2: Represent <em>h</em> as a power of two</h3>
<p>Since <em>bInt</em> is a 53-bit integer, <em>u</em> = 2<sup><em>bExp</em></sup>, and <em>h</em> is half that: 2<sup><em>bExp</em>-1</sup>. For consistency in my discussion, I&#8217;ll substitute the variable <em>hExp</em> for <em>bExp</em>-1, giving</p>
<p><em>h</em> = 2<sup><em>hExp</em></sup> .</p>
<h3>Step 3: Scale the inequality</h3>
<p>To complete the conversion to integers, we need to get rid of any negative powers; we do this by scaling both sides of the inequality. Let&#8217;s look at the inequality again:</p>
<p>&#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em> .</p>
<p>Here are the three values as we expressed them above:</p>
<p><em>d</em> = <em>dInt</em> x 10<sup><em>dExp</em></sup><br />
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup><br />
<em>h</em> = 2<sup><em>hExp</em></sup></p>
<p>For efficiency, the scaling process will factor out common powers of two from the three powers; to facilitate this, we rewrite 10<sup><em>dExp</em></sup> as 2<sup><em>dExp</em></sup> x 5<sup><em>dExp</em></sup>; now we have</p>
<p><em>d</em> = <em>dInt</em> x 2<sup><em>dExp</em></sup> x 5<sup><em>dExp</em></sup><br />
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup><br />
<em>h</em> = 2<sup><em>hExp</em></sup></p>
<p>If any of the exponents <em>dExp</em>, <em>bExp</em>, or <em>hExp</em> are negative, we make them nonnegative by scaling. To reflect any scaling, let&#8217;s rename <em>d</em>, <em>b</em>, and <em>h</em> to <em>dS</em>, <em>bS</em>, and <em>hS</em>, respectively. The scaled inequality is now written</p>
<p>&#124;<em>dS</em> &#8211; <em>bS</em>&#124; &lt; <em>hS</em> ,</p>
<p>and the initial values of <em>dS</em>, <em>bS</em>, and <em>hS</em> are</p>
<p><em>dS</em> = <em>dInt</em> x 2<sup><em>dExp</em></sup> x 5<sup><em>dExp</em></sup><br />
<em>bS</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup><br />
<em>hS</em> = 2<sup><em>hExp</em></sup></p>
<p>To show how scaling works, let&#8217;s continue with our example above; here are the initial values of the variables:</p>
<p><em>dS</em> = 17864  x  2<sup>-49</sup> x 5<sup>-49</sup><br />
<em>bS</em> = 5741268244528520 x 2<sup>-201</sup><br />
<em>hS</em> = 2<sup>-202</sup></p>
<p>In this example, all three exponents &#8212; <em>dExp</em>, <em>bExp</em>, and <em>hExp</em>  &#8212; are negative; we scale the values with a common factor <em>S</em> that makes them all nonnegative. The least common factor that does this is <em>S</em> = 2<sup>202</sup> x 5<sup>49</sup>; our variables then become</p>
<p><em>dS</em> = 17864  x  2<sup>153</sup><br />
<em>bS</em> = 5741268244528520 x 2<sup>1</sup> x 5<sup>49</sup><br />
<em>hS</em> = 5<sup>49</sup></p>
<p>Multiplied out, these are</p>
<p><em>dS</em> = 203970822259994138521801764465966248930731085529088<br />
<em>bS</em> = 203970822259994122305215569213032722473144531250000<br />
<em>hS</em> = 17763568394002504646778106689453125</p>
<h3>Step 4: Apply the test</h3>
<p>Now all we have to do is check the inequality, &#124;<em>dS</em> &#8211; <em>bS</em>&#124; &lt; <em>hS</em>:</p>
<p>&#124;<em>dS</em> &#8211; <em>bS</em>&#124; = 16216586195252933526457586554279088, which is less than <em>hS</em> = 17763568394002504646778106689453125. This proves that <em>b</em> is the correct conversion of <em>d</em>.</p>
<p>(In the diagram for this example above, I&#8217;ve drawn the relationship of <em>d</em> and <em>b</em> proportionally: (<em>dS</em> &#8211; <em>bS</em>)/<em>hS</em> is approximately 0.91, which means that <em>d</em> is about  0.91<em>h</em> above <em>b</em>.)</p>
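<p>To make the four steps concrete, here is a minimal C sketch that rebuilds <em>dS</em>, <em>bS</em>, and <em>hS</em> for this example and applies the test. Standard C has no big integer type, so the sketch uses a tiny fixed-size one of its own; the type, the helper names, and the 256-bit size are assumptions chosen for this example and are not how strtod() implements its big integers.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LIMBS 8  /* 8 x 32 = 256 bits; the example needs about 168 */

typedef struct { uint32_t limb[LIMBS]; } big;  /* little-endian limbs */

static void big_from_u64(big *x, uint64_t v)
{
    memset(x, 0, sizeof *x);
    x->limb[0] = (uint32_t)v;
    x->limb[1] = (uint32_t)(v >> 32);
}

/* Multiply x by the small factor m, 'times' times in a row */
static void big_mul_small(big *x, uint32_t m, int times)
{
    while (times-- > 0) {
        uint64_t carry = 0;
        for (int i = 0; i < LIMBS; i++) {
            uint64_t t = (uint64_t)x->limb[i] * m + carry;
            x->limb[i] = (uint32_t)t;
            carry = t >> 32;
        }
        assert(carry == 0);  /* a carry out would mean LIMBS is too small */
    }
}

/* Returns -1, 0, or 1 as a is less than, equal to, or greater than b */
static int big_cmp(const big *a, const big *b)
{
    for (int i = LIMBS - 1; i >= 0; i--)
        if (a->limb[i] != b->limb[i])
            return a->limb[i] < b->limb[i] ? -1 : 1;
    return 0;
}

/* r = a - b; the caller must ensure a >= b */
static void big_sub(big *r, const big *a, const big *b)
{
    uint64_t borrow = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t t = (uint64_t)a->limb[i] - b->limb[i] - borrow;
        r->limb[i] = (uint32_t)t;
        borrow = (t >> 32) & 1;  /* set when the subtraction wrapped */
    }
}

/* Returns -1, 0, or 1 as |dS - bS| is less than, equal to, or greater than hS */
int compare_example(void)
{
    big dS, bS, hS, diff;

    big_from_u64(&dS, 17864);                  /* dS = 17864 x 2^153 */
    big_mul_small(&dS, 2, 153);

    big_from_u64(&bS, 5741268244528520ULL);    /* bS = bInt x 2^1 x 5^49 */
    big_mul_small(&bS, 2, 1);
    big_mul_small(&bS, 5, 49);

    big_from_u64(&hS, 1);                      /* hS = 5^49 */
    big_mul_small(&hS, 5, 49);

    if (big_cmp(&dS, &bS) >= 0)
        big_sub(&diff, &dS, &bS);
    else
        big_sub(&diff, &bS, &dS);

    return big_cmp(&diff, &hS);
}
```

<p>A return value of -1 corresponds to &#124;<em>dS</em> &#8211; <em>bS</em>&#124; &lt; <em>hS</em>, the outcome computed by hand above.</p>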
<h2>Computing the Scaling Exponents In Code</h2>
<p>In code, the scaling is done by operating on native integer variables representing the exponents of the powers of two and five in each of the three scaled values. After the exponents are created, common factors of two are removed to reduce the size of the resulting big integers. Here is how I coded it in C (this is essentially how it&#8217;s done in David Gay&#8217;s strtod() and Java’s FloatingDecimal class; I&#8217;ve renamed the variables and structured it a bit differently to make it clearer):</p>
<pre>
//Initialize scaling exponents
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 0;
bS_Exp5 = 0;
hS_Exp2 = 0;
hS_Exp5 = 0;

//Adjust for decimal exponent
if (dExp >= 0) {
   dS_Exp2 += dExp;
   dS_Exp5 += dExp;
   }
else {
   bS_Exp2 -= dExp;
   bS_Exp5 -= dExp;
   hS_Exp2 -= dExp;
   hS_Exp5 -= dExp;
   }

//Adjust for binary exponent
if (bExp >= 0) {
   bS_Exp2 += bExp;
   }
else {
   dS_Exp2 -= bExp;
   hS_Exp2 -= bExp;
   }

//Adjust for half ulp exponent
if (hExp >= 0) {
   hS_Exp2 += hExp;
   }
else {
   dS_Exp2 -= hExp;
   bS_Exp2 -= hExp;
   }

//Remove common power of two factor from all three scaled values
common_Exp2 = min(dS_Exp2, min(bS_Exp2, hS_Exp2));
dS_Exp2 -= common_Exp2;
bS_Exp2 -= common_Exp2;
hS_Exp2 -= common_Exp2;
</pre>
<p>Here is how the code executes for our example (I&#8217;ll show the state of the exponent variables after adjusting for each exponent):</p>
<pre class="indented">
//After initializing scaling exponents
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 0;
bS_Exp5 = 0;
hS_Exp2 = 0;
hS_Exp5 = 0;

//After adjusting for decimal exponent (dExp = -49)
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 49;
bS_Exp5 = 49;
hS_Exp2 = 49;
hS_Exp5 = 49;

//After adjusting for binary exponent (bExp = -201)
dS_Exp2 = 201;
dS_Exp5 = 0;
bS_Exp2 = 49;
bS_Exp5 = 49;
hS_Exp2 = 250;
hS_Exp5 = 49;

//After adjusting for half ulp exponent (hExp = -202)
dS_Exp2 = 403;
dS_Exp5 = 0;
bS_Exp2 = 251;
bS_Exp5 = 49;
hS_Exp2 = 250;
hS_Exp5 = 49;

//After removing common power of two factor of 2^250
dS_Exp2 = 153;
dS_Exp5 = 0;
bS_Exp2 = 1;
bS_Exp5 = 49;
hS_Exp2 = 0;
hS_Exp5 = 49;
</pre>
<p>After this code executes, the powers are multiplied out as big integers and used to create big integers <em>dS</em>, <em>bS</em>, and <em>hS</em>.</p>
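<p>For readers who want to run the scaling logic, here is the same calculation wrapped in a self-contained C function. The struct and function names are my own for this sketch, and <em>hExp</em> is derived as <em>bExp</em> &#8211; 1, as in Step 2.</p>

```c
#include <assert.h>

/* Hypothetical container for the six scaling exponents */
typedef struct {
    int dS_Exp2, dS_Exp5;
    int bS_Exp2, bS_Exp5;
    int hS_Exp2, hS_Exp5;
} scale_exps;

static int min_int(int a, int b) { return a < b ? a : b; }

scale_exps compute_scale_exps(int dExp, int bExp)
{
    scale_exps s = {0, 0, 0, 0, 0, 0};
    int hExp = bExp - 1;   /* h is half of one ULP, 2^(bExp-1) */

    /* Adjust for decimal exponent */
    if (dExp >= 0) {
        s.dS_Exp2 += dExp;
        s.dS_Exp5 += dExp;
    } else {
        s.bS_Exp2 -= dExp;
        s.bS_Exp5 -= dExp;
        s.hS_Exp2 -= dExp;
        s.hS_Exp5 -= dExp;
    }

    /* Adjust for binary exponent */
    if (bExp >= 0) {
        s.bS_Exp2 += bExp;
    } else {
        s.dS_Exp2 -= bExp;
        s.hS_Exp2 -= bExp;
    }

    /* Adjust for half ulp exponent */
    if (hExp >= 0) {
        s.hS_Exp2 += hExp;
    } else {
        s.dS_Exp2 -= hExp;
        s.bS_Exp2 -= hExp;
    }

    /* Remove common power of two factor from all three scaled values */
    int common_Exp2 = min_int(s.dS_Exp2, min_int(s.bS_Exp2, s.hS_Exp2));
    s.dS_Exp2 -= common_Exp2;
    s.bS_Exp2 -= common_Exp2;
    s.hS_Exp2 -= common_Exp2;

    /* Every exponent is nonnegative at this point */
    assert(s.dS_Exp2 >= 0 && s.bS_Exp2 >= 0 && s.hS_Exp2 >= 0);
    return s;
}
```

<p>Calling compute_scale_exps(-49, -201) reproduces the final state shown in the trace: 153 and 0 for <em>dS</em>, 1 and 49 for <em>bS</em>, and 0 and 49 for <em>hS</em>.</p>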
<h3>When <em>d</em>, <em>b</em>, and <em>h</em> are already integers</h3>
<p>When all three exponents are nonnegative, <em>d</em>, <em>b</em>, and <em>h</em> start out as integer values. In this case, they are scaled <em>down</em>, by their common power of two factor.</p>
<h2>Testing the Inequality In Code</h2>
<p>In strtod(), the checking centers around a comparison of &#124;<em>d</em> &#8211; <em>b</em>&#124; (actually, &#124;<em>b</em> &#8211; <em>d</em>&#124;, which is equivalent) to <em>h</em>, which results in three outcomes: &#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em>, &#124;<em>d</em> &#8211; <em>b</em>&#124; = <em>h</em>, or &#124;<em>d</em> &#8211; <em>b</em>&#124; &gt; <em>h</em>. The special cases are worked into the first two outcomes. (Java’s FloatingDecimal class does similar checks, but it is structured differently.)</p>
<p>Regarding the required check at a power of two boundary (comparing <em>d</em> &#8211; <em>b</em> to -<em>h</em>/2), strtod() scales <em>d</em> &#8211; <em>b</em> by two &#8212; to ensure that all values remain integers. This gives the equivalent comparison of 2(<em>d</em> &#8211; <em>b</em>) to -<em>h</em>. (In many cases, <em>h</em> is divisible by two; strtod() could just as easily divide <em>h</em> by two instead of multiplying <em>d</em> &#8211; <em>b</em> by two. Java’s FloatingDecimal class does the former.)</p>
<h3>Adjusting <em>b</em></h3>
<p>strtod() makes these checks in a loop, and adjusts <em>b</em> as necessary (that is beyond the scope of this article).</p>
<h2>strtod()&rsquo;s Optimization Called &ldquo;Bigcomp&rdquo;</h2>
<p>strtod() has an optimization called &ldquo;<a title="Read Rick Regan's Article &ldquo;Bigcomp: Deciding Truncated, Near Halfway Conversions&rdquo;" href="http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/">bigcomp</a>&rdquo;, which limits the big integer value of <em>d</em> to 40 decimal digits. For decimal inputs of 40 significant digits or fewer, all of the digits are used for <em>d</em>; for decimal inputs greater than 40 digits, only the first 18 are used. Compensating for this truncation of <em>d</em> leads to a more complicated test procedure. (This optimization is on by default; you have to &#35;define the macro &ldquo;NO_STRTOD_BIGCOMP&rdquo; to turn it off.)</p>
<h2>References</h2>
<ul>
<li>David Gay&#8217;s <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">dtoa.c</a> (see the strtod() function).</li>
<li>David Gay&#8217;s Technical Report <a title="David Gay's Technical Report &ldquo;Correctly Rounded Binary-Decimal and Decimal-Binary Conversions&rdquo;" href="http://www.ampl.com/REFS/rounding.pdf">&ldquo;Correctly Rounded Binary-Decimal and Decimal-Binary Conversions&rdquo;</a>.</li>
<li>William D. Clinger&#8217;s paper <a title="Read William D. Clinger's paper &ldquo;How to Read Floating Point Numbers Accurately&rdquo;" href="http://www.cesura17.net/~will/Professional/Research/Papers/howtoread.pdf">&ldquo;How to Read Floating Point Numbers Accurately&rdquo;</a> (see &lsquo;algorithmR&rsquo;).</li>
<li><a title="FloatingDecimal.java" href="http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html">FloatingDecimal.java</a> (see the doubleValue() method).</li>
<li><a title="Python dtoa.c" href="http://svn.python.org/view/python/branches/py3k-dtoa/Python/dtoa.c?revision=83813&#038;view=markup">Python&#8217;s version of David Gay&#8217;s dtoa.c</a>.</li>
</ul>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">Using Integers to Check a Floating-Point Approximation</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>strtod()’s Initial Decimal to Floating-Point Approximation</title>
		<link>http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/</link>
		<comments>http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/#comments</comments>
		<pubDate>Tue, 27 Sep 2011 15:13:05 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Exponents]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=347</guid>
		<description><![CDATA[David Gay&#8217;s strtod() function does decimal to floating-point conversion using both IEEE double-precision floating-point arithmetic and arbitrary-precision integer arithmetic. For some inputs, a simple IEEE floating-point calculation suffices to produce the correct result; for other inputs, a combination of IEEE arithmetic and arbitrary-precision arithmetic is required. In the latter case, IEEE arithmetic is used to [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">strtod()’s Initial Decimal to Floating-Point Approximation</a></p>
]]></description>
			<content:encoded><![CDATA[<p>David Gay&#8217;s strtod() function does decimal to floating-point conversion using both IEEE double-precision floating-point arithmetic and arbitrary-precision integer arithmetic. For some inputs, <a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">a simple IEEE floating-point calculation suffices</a> to produce the correct result; for other inputs, a combination of IEEE arithmetic and arbitrary-precision arithmetic is required. In the latter case, IEEE arithmetic is used to calculate an <em>approximation</em> to the correct result, which is then refined using arbitrary-precision arithmetic. In this article, I&#8217;ll describe the approximation calculation, which is based on a form of binary exponentiation.</p>
<p><span id="more-347"></span></p>
<h2>Overview of the Approximation Calculation</h2>
<p>Conceptually, the approximation calculation involves two floating-point values: the significant digits of the decimal number, expressed as an integer, and a corresponding power of ten. For example, for the decimal string 163.118762e+109, the calculation is 163118762 x 10<sup>103</sup>; for the decimal string 8.453127e-67, the calculation is 8453127 x 10<sup>-73</sup>. If there are more than 16 significant digits, only the first 16 are used. The decimal string 6.2187331579177550499956283e+100, for example, would be treated as 6.218733157917755e+100, making the calculation 6218733157917755 x 10<sup>85</sup>.</p>
<p>The powers of ten can be very small or very large. For double-precision floating-point (I&#8217;m considering normal numbers only), they range from 10<sup>-323</sup> (e.g., for DBL_MIN = 2.2250738585072014e-308) to 10<sup>308</sup> (e.g., for 1e+308). The goal is to compute them efficiently.</p>
<h2>Right-to-Left Binary Exponentiation</h2>
<p>An efficient way to compute powers of ten is using <a title="Read Dr. Math's description of binary exponentiation" href="http://mathforum.org/library/drmath/view/55603.html">right-to-left binary exponentiation</a>. Conceptually, the process is this: given a power b<sup>n</sup> to compute, write <em>n</em> as a sum of powers of two; then, using the <a href="http://www.exploringbinary.com/the-laws-of-exponents/" title="Read Rick Regan's Article &ldquo;The Laws of Exponents&rdquo;">laws of exponents</a>, rewrite b<sup>n</sup> as a product of <em>b</em> raised to each power of two. For example:</p>
<p>10<sup>103</sup> = 10<sup>1 + 2 + 4 + 32 + 64</sup> = 10<sup>1</sup> * 10<sup>2</sup> * 10<sup>4</sup> * 10<sup>32</sup> * 10<sup>64</sup> .</p>
<p>Evaluating this expression requires only ten multiplications &#8212; six to compute the powers of ten 10<sup>2</sup> through 10<sup>64</sup> (by successive squaring), and four to multiply the five required powers together. If 10<sup>103</sup> were computed naively, it would require 102 multiplications.</p>
<p>As an example for computing negative exponents, consider </p>
<p>10<sup>-73</sup> = 10<sup>-(1 + 8 + 64)</sup> = 10<sup>-1</sup> * 10<sup>-8</sup> * 10<sup>-64</sup> .</p>
<p>This requires eight multiplications &#8212; six to compute the powers of ten 10<sup>-2</sup> through 10<sup>-64</sup>, and two to multiply the three required powers together. The naive calculation would do 72 multiplications.</p>
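<p>The multiplication counts above can be checked with a small C sketch of right-to-left binary exponentiation for positive exponents. The function name and the counter parameter are assumptions made for this illustration; negative exponents would work the same way, starting from 10<sup>-1</sup> instead of 10<sup>1</sup>.</p>

```c
#include <assert.h>

/*
 * Compute 10^n for 1 <= n <= 308 by right-to-left binary exponentiation,
 * counting every floating-point multiplication in *mults.
 * (Hypothetical helper; results for large n are rounded doubles.)
 */
double pow10_binary(unsigned n, int *mults)
{
    double power = 10.0;    /* 10^(2^k) for the bit currently examined */
    double result = 1.0;
    int have_result = 0;    /* the first needed power is copied, not multiplied */

    assert(n >= 1 && n <= 308);
    *mults = 0;

    for (;;) {
        if (n & 1) {        /* this power of two is part of the exponent */
            if (have_result) {
                result *= power;
                (*mults)++;
            } else {
                result = power;
                have_result = 1;
            }
        }
        n >>= 1;            /* move to the next bit, right to left */
        if (n == 0)
            break;
        power *= power;     /* successive squaring */
        (*mults)++;
    }
    return result;
}
```

<p>With this counting, an exponent of 103 takes ten multiplications and an exponent of 73 takes eight, in agreement with the tallies above.</p>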
<h3>Required Set of Powers</h3>
<p>For the range of exponents of double-precision values, the required set of &ldquo;binary powers&rdquo; (powers of ten raised to powers of two) for right-to-left binary exponentiation is</p>
<ul>
<li>{10<sup>-1</sup>, 10<sup>-2</sup>, 10<sup>-4</sup>, 10<sup>-8</sup>, 10<sup>-16</sup>, 10<sup>-32</sup>, 10<sup>-64</sup>, 10<sup>-128</sup>, 10<sup>-256</sup>} for negative exponents, and</li>
<li>{10<sup>1</sup>, 10<sup>2</sup>, 10<sup>4</sup>, 10<sup>8</sup>, 10<sup>16</sup>, 10<sup>32</sup>, 10<sup>64</sup>, 10<sup>128</sup>, 10<sup>256</sup>} for positive exponents.</li>
</ul>
<p>Instead of computing these powers for each conversion, they can be computed once and stored in a table. This reduces the number of multiplications further. In the case of our examples, 10<sup>103</sup> would now require only four multiplications, and 10<sup>-73</sup> would require only two.</p>
<h2>The Approximation Calculation in strtod()</h2>
<p>The approximation calculation in strtod() is based on the calculations I just described, but there are a few differences. The significant digits and power of ten are not built in two variables; one variable is initialized with the significant digits, and then the power of ten is factored in incrementally. This allows for a two-phased calculation: the first phase is a <a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">fast-path</a> like calculation that handles any component power of ten between 10<sup>-15</sup> and 10<sup>15</sup>; the second phase uses binary exponentiation to finish the computation of the power of ten, for component binary powers 10<sup>-16</sup> and smaller and 10<sup>16</sup> and greater. </p>
<p>In the first phase of the calculation, only positive powers of ten, 10<sup>1</sup> to 10<sup>15</sup>, are used. When negative powers of ten 10<sup>-1</sup> to 10<sup>-15</sup> are called for, the calculation is changed to the equivalent calculation that <em>divides</em> the significant digits by the corresponding <em>positive</em> power of ten. A separate table contains each of the powers of ten from 10<sup>1</sup> to 10<sup>15</sup>, which are looked up directly. The one division or multiplication replaces the up to three multiplications that might be needed for binary exponentiation (10<sup>1</sup> * 10<sup>2</sup> * 10<sup>4</sup> * 10<sup>8</sup>). This not only is potentially more efficient, but reduces the number of possibly inexact operations (operations in which the result must be rounded).</p>
<p>Here are two examples: 163.118762e+109, which is parsed as 163118762 x 10<sup>103</sup>, is calculated as 163118762 * 10<sup>7</sup> * 10<sup>32</sup> * 10<sup>64</sup>; 8.453127e-67, which is parsed as 8453127 x 10<sup>-73</sup>, is calculated as 8453127/10<sup>9</sup> * 10<sup>-64</sup>.</p>
<h3>The Calculation In Code</h3>
<p>I&#8217;ve taken <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">strtod()</a>&#8217;s approximation code and distilled it into my own C program. The main difference between my code and strtod() is that strtod() handles overflow (for values near DBL_MAX) and underflow (for values that are near DBL_MIN or are subnormal). I&#8217;ve also renamed a few variables and made a few other small changes to make the code more readable.</p>
<pre>
double strtod_approx(char* decimal)
{
 int exponent, sign, i;
 long long sigDigits16;

 static const double fastPosTens[] = {
   1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
   1e10, 1e11, 1e12, 1e13, 1e14, 1e15};
 static const double binaryPosTens[] = {
   1e16, 1e32, 1e64, 1e128, 1e256};
 static const double binaryNegTens[] = {
   1e-16, 1e-32, 1e-64, 1e-128, 1e-256};

 double d;

 <span class="grayed">parseDecimalApprox(decimal,&amp;sign,&amp;sigDigits16,&amp;exponent);</span>

 <strong>d = sigDigits16;</strong>
 if (sign)
   d = -d;

 if (exponent &gt; 0)
   {
    if (i = exponent &amp; 0xF)
      <strong>d *= fastPosTens[i];</strong>
    if (exponent &gt;&gt;= 4)
      for(i = 0; exponent &gt; 0; i++, exponent &gt;&gt;= 1)
        if (exponent &amp; 1)
          <strong>d *= binaryPosTens[i];</strong>
   }
 else
   if (exponent &lt; 0)
     {
      exponent = -exponent;
      if (i = exponent &amp; 0xF)
        <strong>d /= fastPosTens[i];</strong>
      if (exponent &gt;&gt;= 4)
        for(i = 0; exponent &gt; 0; i++, exponent &gt;&gt;= 1)
          if (exponent &amp; 1)
            <strong>d *= binaryNegTens[i];</strong>
     }

 return d;
}
</pre>
<h4>Notes</h4>
<ul>
<li>This function calls an auxiliary function I wrote called parseDecimalApprox(), which I&#8217;ve not shown. It returns the sign of the number, the value of its first 16 significant digits, and its power of ten exponent.</li>
<li>There are three tables of powers of ten, all precomputed at compile time:
<ul>
<li><strong>fastPosTens</strong>: This holds the 15 <a title="Read Rick Regan's article &ldquo;Why Powers of Ten Up to 10^22 Are Exact As Doubles&rdquo;" href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">exactly representable powers of ten</a> needed for the first phase of the calculation (1e0 is not used &#8212; it&#8217;s just a placeholder).</li>
<li><strong>binaryPosTens</strong>: This holds the five positive powers of ten used for binary exponentiation.</li>
<li><strong>binaryNegTens</strong>: This holds the five negative powers of ten used for binary exponentiation.</li>
</ul>
</li>
<li>If the exponent is zero, no power of ten is factored in (it would be 10<sup>0</sup> = 1).</li>
</ul>
<p>Realized in code, you can see why the algorithm to compute the power of ten is called right-to-left binary exponentiation. Converting the exponent to a sum of powers of two is the same as converting it to binary; in code, the exponent is a binary integer. The bits of the exponent are accessed from right to left &#8212; least significant to most significant &#8212; using the right shift operator; a 1-bit means the corresponding power of two is part of the exponent. The exponent will be a maximum of 9 bits long.</p>
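<p>To see which table entries a given exponent selects, the bit manipulation can be isolated in a small C sketch. The function name and its output parameters are hypothetical; it takes the magnitude of the exponent and reports the powers as exponents of ten rather than performing any multiplications.</p>

```c
#include <assert.h>

/*
 * Split a decimal exponent magnitude (1..511) the way the two-phase
 * calculation does. The low four bits select one power from 10^1..10^15
 * (0 means none); each remaining 1-bit selects 10^16, 10^32, and so on.
 * Returns how many binary powers were written to binary_pows[].
 */
int split_exponent(int exponent, int *fast_pow, int binary_pows[5])
{
    int count = 0;

    assert(exponent > 0 && exponent < 512);

    *fast_pow = exponent & 0xF;            /* first phase: direct table lookup */
    exponent >>= 4;

    for (int i = 0; exponent > 0; i++, exponent >>= 1)
        if (exponent & 1)                  /* second phase: right-to-left scan */
            binary_pows[count++] = 16 << i;

    return count;
}
```

<p>For an exponent of 103 this gives 7 for the first phase and the binary powers 32 and 64; for 73 it gives 9 and the single binary power 64, matching the two examples worked earlier.</p>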
<h3>Analysis</h3>
<p>There are at most five floating-point multiplications and divisions in this calculation: three multiplications plus one multiplication or division for the worst case powers of ten, and one multiplication to combine the significant digits and the power of ten. If the significant digits start out inexact &#8212; either because they are truncated to 16 digits, or they are 16 digits (truncated or otherwise) but represent a value greater than 2<sup>53</sup> &#8211; 1 &#8212; that&#8217;s a maximum of six inexact operations.</p>
<p>Using my code above, I measured approximations as far as 10 ULPs off; 1.00431469722921494e-140 is one example. (This is consistent with my <a title="Read Rick Regan's article &ldquo;Properties of the Correction Loop in David Gay’s strtod()&rdquo;" href="http://www.exploringbinary.com/properties-of-the-correction-loop-in-david-gays-strtod/">analysis of David Gay&#8217;s code</a>, where the maximum difference I found was 11 ULPs. Why the difference? It could be due to the code that handles underflow and overflow, or it could be because I didn&#8217;t run the same exact tests in both cases.)</p>
<p>Contrast this with my <a title="Read Rick Regan's article &ldquo;Quick and Dirty Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">quick and dirty decimal to floating-point conversion program</a>, which does a multiplication or division for every digit. Using that program, I found an example that was 14 ULPs off.</p>
<h2>Discussion</h2>
<h3>A Program That Converts Decimal Values Uses Decimal Values?</h3>
<p>The code converts decimal numbers to floating-point, and yet it itself needs decimal numbers converted to floating-point &#8212; isn&#8217;t that circular? Not really. The compiler&#8217;s implementation would have to be different, or at least modified so that the decimal constants are specified in some other way &#8212; say as <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constants</a>.</p>
<p>This raises another issue: what if the <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">compiler converts these decimal literals incorrectly</a>? After all, this is part of a larger conversion routine whose goal is correct conversion. My guess is that it does not matter. The worst that can happen is that the approximation will be a little less accurate; it will ultimately be corrected &#8212; by the <a title="Read Rick Regan's Article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">arbitrary-precision based reconciliation process in strtod()</a>. (For what it&#8217;s worth, I checked &#8212; Visual C++ got all of the conversions right.)</p>
<h3>Java’s FloatingDecimal Class</h3>
<p>The same approximation calculation is done in the doubleValue() method of Java’s FloatingDecimal Class, which is modeled on David Gay’s strtod().</p>
<h3>Using Division For the Negative Exponent Case</h3>
<p>For the negative exponent case of the binary exponentiation calculation, I tried using division (by positive powers of ten) instead of multiplication (by negative powers of ten). The division-based calculation ran a little slower, which was expected. But interestingly, the approximations were a little less accurate, on average. I don&#8217;t know why.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">strtod()’s Initial Decimal to Floating-Point Approximation</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fast Path Decimal to Floating-Point Conversion</title>
		<link>http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/</link>
		<comments>http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/#comments</comments>
		<pubDate>Thu, 01 Sep 2011 13:18:08 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary integers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Fractions]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=344</guid>
		<description><![CDATA[In general, to convert an arbitrary decimal number into a binary floating-point number, arbitrary-precision arithmetic is required. However, a subset of decimal numbers can be converted correctly with just ordinary limited-precision IEEE floating-point arithmetic, taking what I call the fast path to conversion. Fast path conversion is an optimization used in practice: it&#8217;s in David [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">Fast Path Decimal to Floating-Point Conversion</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In general, to convert an arbitrary decimal number into a binary floating-point number, <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">arbitrary-precision arithmetic is required</a>. However, a subset of decimal numbers can be converted correctly with just ordinary limited-precision IEEE floating-point arithmetic, taking what I call the <em>fast path</em> to conversion. Fast path conversion is an optimization used in practice: it&#8217;s in David Gay&#8217;s strtod() function and in Java&#8217;s FloatingDecimal class. I will explain how fast path conversion works, and describe the set of numbers that qualify for it.</p>
<p><span id="more-344"></span></p>
<h2>An Unnormalized Form of Decimal Scientific Notation</h2>
<p>Any number can be expressed in <a title="Wikipedia article on scientific notation" href="http://en.wikipedia.org/wiki/Scientific_notation">decimal scientific notation</a>, a form where it is written as a product of its significant digits and a power of ten. Conventionally, the significant digits are expressed in normalized form, where the first digit is between 1 and 9, and any subsequent digits follow a decimal point. For example, 0.0926 would be written as 9.26 x 10<sup>-2</sup>, and 7390 would be written as 7.390 x 10<sup>3</sup>.</p>
<p>In the context of decimal to floating-point conversion, a nonstandard, <em>unnormalized</em> form of scientific notation is used, where the significant digits are written as an integer. For example, 0.0926 would be written as 926 x 10<sup>-4</sup>, and 7390 would be written as 739 x 10<sup>1</sup>.</p>
<p>In this unnormalized form, there are two cases: one where the integer containing the significant digits is multiplied by a nonnegative power of ten, and one where it&#8217;s multiplied by a <em>negative</em> power of ten. It&#8217;s convenient to think of the latter case as a fraction, where the significant digits integer is <em>divided</em> by the corresponding <em>positive</em> power of ten. For example, 926 x 10<sup>-4</sup> equals 926/10<sup>4</sup>. These two cases &#8212; multiplication or division of an integer by a power of ten that is an integer &#8212; form the basis of fast path conversion.</p>
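<p>As a rough illustration of the unnormalized form, here is a minimal Python sketch that splits a decimal literal into its significant digits integer and a power of ten exponent. The helper name to_unnormalized is hypothetical, it strips trailing zeros so that 7390 comes out as 739 x 10<sup>1</sup>, and it assumes well-formed input (no validation is done).</p>

```python
def to_unnormalized(literal):
    """Split a decimal literal into (s, e) so that its value is s x 10**e.

    Hypothetical helper: accepts an optional sign, one optional decimal
    point, and an optional exponent; it does no input validation.
    """
    mantissa, _, exp_part = literal.lower().partition("e")
    int_part, _, frac_part = mantissa.partition(".")
    s = int(int_part + frac_part)               # significant digits as an integer
    e = int(exp_part or "0") - len(frac_part)   # account for digits after the point
    while s != 0 and s % 10 == 0:               # move trailing zeros into the exponent
        s //= 10
        e += 1
    return s, e

print(to_unnormalized("0.0926"))       # (926, -4)
print(to_unnormalized("7390"))         # (739, 1)
print(to_unnormalized("9.11234e-17"))  # (911234, -22)
```

<p>A negative exponent corresponds to the division case and a nonnegative exponent to the multiplication case.</p>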
<h2>The Fast Path</h2>
<p>For the fast path conversion process, we take a decimal number, represented as a string, and convert it to the nearest IEEE double-precision binary floating-point number. You can view this process as having two main steps. In the first step, floating-point integers <em>s</em> and <em>p</em> are created: <em>s</em> represents the significant digits of the decimal number (positive or negative), and <em>p</em> represents the corresponding nonnegative power of ten. In the second step, an IEEE floating-point multiplication or division (the fmul or fdiv instruction on Intel processors) operates on <em>s</em> and <em>p</em>, producing the desired double-precision result.</p>
<p>This process is guaranteed to work only when <em>s</em> and <em>p</em> are exactly representable in double-precision; this happens when their binary representations have no 1 bits beyond bit 53. The values of <em>p</em> that meet this criterion are <a title="Read Rick Regan's article &ldquo;Why Powers of Ten Up to 10^22 Are Exact As Doubles&rdquo;" href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">10<sup>0</sup> &le; <em>p</em> &le; 10<sup>22</sup></a>. The values of <em>s</em> that meet this criterion are, by definition, all floating-point integers. (Think of it this way: Take any floating-point integer and print all the digits of its decimal equivalent; that would be a valid value for <em>s</em>. An example is 52882540679000003517910258026656900519026609372478984461456769024, which is the floating-point integer <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">0&#120;</a>1.0119c58f25bc6p+215.)</p>
<p>Practically speaking, there is no simple test to check all the valid values of <em>s</em>. But there is a simple test that identifies a <em>subset</em> of its values, <strong>0 &le; <em>s</em> &le; 2<sup>53</sup> &#8211; 1 = 9,007,199,254,740,991</strong>. I will consider only these 2<sup>53</sup> values as fast path cases.</p>
<p>This process relies implicitly on two things &#8212; that base conversion between integers is exact, and that IEEE multiplication and division produce correctly rounded results. Under these conditions, a maximum of one rounding is done; the result is a correctly rounded decimal to floating-point conversion.</p>
<p>Regarding the power of ten, it can be looked up in a table, or it can be computed. If it&#8217;s computed &#8212; say naively by repeated multiplication by ten &#8212; it will be done exactly. Also, if <em>p</em> = 1, a further optimization applies &#8212; the multiplication can be skipped.</p>
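<p>The two steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python floats are IEEE doubles, the rounding mode is round-to-nearest, and the names fast_path, MAX_S, and MAX_EXP are hypothetical (this is not the code in strtod() or FloatingDecimal).</p>

```python
MAX_S = 2**53 - 1   # largest significant digits integer treated as a fast path case
MAX_EXP = 22        # 10**22 is the largest power of ten that is exact as a double

def fast_path(s, e):
    """Return the double nearest s x 10**e, or None if this is not a fast path case.

    s is a nonnegative integer and e an integer exponent (hypothetical helper).
    """
    if s > MAX_S or abs(e) > MAX_EXP:
        return None
    if e == 0:
        return float(s)            # p = 1: the arithmetic can be skipped
    p = float(10 ** abs(e))        # exact, since 10**1 .. 10**22 fit in a double
    # One correctly rounded IEEE operation on two exact operands.
    return float(s) * p if e > 0 else float(s) / p

print(fast_path(314159, -5) == float("3.14159"))
print(fast_path(123, 34))
```

<p>Because both operands are exact and only one rounding occurs, the first comparison should agree with the correctly rounded library conversion; the second call falls outside the fast path and returns None.</p>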
<h3>Examples</h3>
<p>Here are some examples that qualify for fast path conversion:</p>
<table class="center" border="1" summary="Example Fast Path Cases">
<caption><strong>Example Fast Path Cases</strong></caption>
<tbody>
<tr>
<th class="center">Decimal Literal</th>
<th class="center"><em>s</em></th>
<th class="center"><em>p</em></th>
<th class="center">Mult or Div?</th>
</tr>
<tr>
<td class="left">3.14159</td>
<td class="right">314159</td>
<td class="right">100000</td>
<td class="center">Div</td>
</tr>
<tr>
<td class="left">0.0001256789876643</td>
<td class="right">1256789876643</td>
<td class="right">10000000000000000</td>
<td class="center">Div</td>
</tr>
<tr>
<td class="left">9.11234e-17</td>
<td class="right">911234</td>
<td class="right">10000000000000000000000</td>
<td class="center">Div</td>
</tr>
<tr>
<td class="left">537.81e8</td>
<td class="right">53781</td>
<td class="right">1000000</td>
<td class="center">Mult</td>
</tr>
<tr>
<td class="left">9.007199254740991e37</td>
<td class="right">9007199254740991</td>
<td class="right">10000000000000000000000</td>
<td class="center">Mult</td>
</tr>
<tr>
<td class="left">299792458</td>
<td class="right">299792458</td>
<td class="right">1</td>
<td class="center">NA</td>
</tr>
<tr>
<td class="left">0</td>
<td class="right">0</td>
<td class="right">1</td>
<td class="center">NA</td>
</tr>
</tbody>
</table>
<h3>Fast Path Cases In Disguise</h3>
<p>There is another set of decimal numbers that can be transformed easily into fast path multiplication cases: numbers with a <em>p</em> greater than 10<sup>22</sup> but with an <em>s</em> that can be multiplied by those excess powers of ten and still remain exact. For example, take the number 123e34. As written, it has <em>s</em> = 123 and <em>p</em> = 10<sup>34</sup>; <em>p</em> is too big. However, if you treat it as 123000000000000e22, it qualifies as a fast path case. You transform it by multiplying <em>s</em> by 10<sup>12</sup> and by using a <em>p</em> that&#8217;s 12 orders of magnitude smaller. In other words, if you can transfer the extra powers of ten to the significant digits, you have a fast path case.</p>
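<p>A minimal Python sketch of this transformation follows. The helper name undisguise is hypothetical, and it uses the 2<sup>53</sup> &#8211; 1 limit on <em>s</em> adopted in this article rather than any particular library's limit.</p>

```python
def undisguise(s, e):
    """Move excess powers of ten from the exponent into s.

    Returns an adjusted (s, e) with e at most 22 when the enlarged s still
    fits in 53 bits; returns None otherwise. Hypothetical helper.
    """
    if not e > 22:
        return s, e                  # already in range, nothing to transfer
    scaled = s * 10 ** (e - 22)      # exact big-integer multiplication
    if scaled > 2**53 - 1:
        return None                  # s would no longer be exact as a double
    return scaled, 22

s, e = undisguise(123, 34)
print(s, e)                          # 123000000000000 22
print(float(s) * float(10 ** e) == float("123e34"))
```

<p>After the transfer, the value is an ordinary fast path multiplication case.</p>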
<h2>The Fast Path in David Gay&#8217;s strtod()</h2>
<p><a title="David Gay's strtod() function in dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod()</a> function implements a fast path for significant digits 0 &le; <em>s</em> &le; 10<sup>15</sup> &#8211; 1 = 999,999,999,999,999 and powers of ten 10<sup>0</sup> &le; <em>p</em> &le; 10<sup>22</sup> (the powers of ten are precomputed and stored in a table). It also handles the &ldquo;disguised&rdquo; fast path cases, up to the same 15-digit limit.</p>
<p>Why is strtod()&#8217;s fast path limited to 15 digits? My guess is that it was the easiest check to make; all integers of 15 digits or fewer are exact. But what about the remaining fast path cases? At a minimum, we could easily allow any <em>s</em> up to 8,999,999,999,999,999, a ninefold increase in the number of fast path cases (9 x 10<sup>15</sup> values instead of 10<sup>15</sup>). All we&#8217;d need to do is add a check for &ldquo;number of digits = 16 and first digit &le; 8.&rdquo;</p>
<p>To cover the remaining cases, we could convert the entire 16-digit string to a double and then check whether its value is less than or equal to 2<sup>53</sup> &#8211; 1. The code currently converts the significant digits to a double anyhow, so why not use it for an additional fast path test? In fact, why not check the double value exclusively, skipping testing of the string altogether?</p>
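<p>To make the three candidate tests concrete, here is a hedged Python sketch. The function names are hypothetical, and each function assumes a string of significant digits with leading zeros already removed; none of this is taken from dtoa.c.</p>

```python
def qualifies_15(digits):
    """Digit-count test: at most 15 significant digits."""
    return 15 >= len(digits)

def qualifies_16_first_digit(digits):
    """Extended string test: also accept 16 digits when the first is 8 or less."""
    return qualifies_15(digits) or (len(digits) == 16 and "8" >= digits[0])

def qualifies_by_value(digits):
    """Value test: convert to a double and compare against 2**53 - 1."""
    return 9007199254740991.0 >= float(digits)

for d in ("999999999999999", "8999999999999999", "9007199254740991", "9007199254740992"):
    print(d, qualifies_15(d), qualifies_16_first_digit(d), qualifies_by_value(d))
```

<p>The value test accepts every <em>s</em> up to 2<sup>53</sup> &#8211; 1, while the two string tests accept progressively larger subsets.</p>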
<p>(<strike>I have a note out to Dave Gay asking him these questions; I&#8217;ll update this article if he responds.</strike> <strong><em>Update</em></strong>: Dave Gay agrees that the change to check for 16 digits/first digit &le; 8 would work and would be simple enough, but he&#8217;s not sure if it is worth making.)</p>
<h2>The Fast Path in Java&#8217;s FloatingDecimal Class</h2>
<p>Java does its decimal to floating-point conversion in the doubleValue() method of its <a title="FloatingDecimal.java" href="http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html">FloatingDecimal</a> class. It is modeled on David Gay&#8217;s strtod(), and covers exactly the same set of fast path cases.</p>
<h2>Discussion</h2>
<h3>Fast Path Vs. the Arbitrary-Precision Integer Algorithm</h3>
<p>Fast path conversion is simple because most of the work is done in IEEE floating-point. Contrast this with the <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">arbitrary-precision integer conversion algorithm</a>, which</p>
<ul>
<li>Scales the fraction to make rounding of the significand easier.</li>
<li>Normalizes the result.</li>
<li>Encodes the result in IEEE floating-point format.</li>
</ul>
<p>In IEEE floating-point, scaling is unnecessary; correct rounding is handled automatically. Also, there is no need for explicit normalization, nor any need to encode the result &#8212; it&#8217;s already encoded! Additionally, IEEE floating-point handles the edge case, where the significand would otherwise round up to 2<sup>53</sup>.</p>
<p>The fast path reduces decimal to floating-point conversion to this: parsing of the decimal string, creation of two binary floating-point integers to represent it, and delegation of the hard work to a single floating-point instruction.</p>
<h3>Processor Rounding Mode</h3>
<p>Because the fast path algorithm uses IEEE arithmetic, it will be sensitive to the processor rounding mode. Typically, the rounding mode is set to round-to-nearest, but it could be set to one of three other modes. A conversion routine must ensure that both its <a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/#rounding-mode-detection-in-david-gays-strtod">fast path and arbitrary-precision path round the same way</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why Powers of Ten Up to 10^22 Are Exact As Doubles</title>
		<link>http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/</link>
		<comments>http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/#comments</comments>
		<pubDate>Fri, 12 Aug 2011 02:30:14 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=342</guid>
		<description><![CDATA[The fast path in David Gay&#8217;s decimal to floating-point conversion routine relies on this property of the first twenty-three nonnegative powers of ten: they have exact representations in double-precision floating-point. While it&#8217;s easy to see why powers of ten up to 10<sup>15</sup> are exact, it&#8217;s less clear why the powers of ten from 10<sup>16</sup> to [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">Why Powers of Ten Up to 10<sup>22</sup> Are Exact As Doubles</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">The fast path in David Gay&#8217;s decimal to floating-point conversion routine</a> relies on this property of the first twenty-three nonnegative powers of ten: they have exact representations in double-precision floating-point. While it&#8217;s easy to see why powers of ten up to 10<sup>15</sup> are exact, it&#8217;s less clear why the powers of ten from 10<sup>16</sup> to 10<sup>22</sup> are. To see why, you have to look at their binary representations.</p>
<p><span id="more-342"></span></p>
<h2>Trailing Zeros Become &ldquo;Significantly Insignificant&rdquo;</h2>
<p>10<sup>15</sup> is the largest power of ten less than 2<sup>53</sup> &#8211; 1, the largest integer representable in 53 bits. 10<sup>15</sup> has 50 bits, and the next larger power of ten, 10<sup>16</sup>, has 54 bits. So how can 10<sup>16</sup> (and the 74-bit 10<sup>22</sup>, for that matter) have an exact double-precision representation, when doubles only store a 53-bit significand? The answer is simple: each has a binary representation in which every bit after bit 53 is a trailing zero.</p>
<p>A nonnegative power of ten 10<sup>n</sup> equals 5<sup>n</sup> * 2<sup>n</sup>; <a title="Read Rick Regan's article &ldquo;A Pattern in Powers of Ten and Their Binary Equivalents&rdquo;" href="http://www.exploringbinary.com/a-pattern-in-powers-of-ten-and-their-binary-equivalents/">in binary,  that&#8217;s a power of five followed by n zeros</a> (a power of five always ends in &lsquo;1&rsquo;). For example, <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">10<sup>22</sup> (10000000000000000000000) is</a> </p>
<pre class="small  noborder">
10000111100001100111100000110010011011101010110010010000000000000000000000
</pre>
<p>This is 5<sup>22</sup> (2384185791015625), which is 52 bits long, followed by 22 zeros. Here it is in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>, showing only its leading nonzero significant bits &#8212; the bits of its power of five factor:</p>
<p>1.000011110000110011110000011001001101110101011001001 x 2<sup>73</sup></p>
<p>Since this is 52 bits, it maps directly to a double &#8212; no rounding is required.</p>
<p>So what happened to the 22 significant trailing zeros? They were preserved in the exponent, as the factor 2<sup>22</sup>. (The remaining factor, 2<sup>51</sup>, undoes the normalization.) </p>
<h3>Why 10<sup>22</sup> Is the Maximum</h3>
<p>To see why 10<sup>22</sup> is the maximum exact power of ten, look at 10<sup>23</sup>. Its power of five factor is 5<sup>23</sup> (11920928955078125), which is 54 bits long; this is too long for exact representation in a double.</p>
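<p>The bit counts quoted above are easy to check with Python's arbitrary-precision integers. The sketch below is only an illustration (the helper name power_of_ten_report is hypothetical, and it assumes Python floats are IEEE doubles); for each n it reports the bit length of 10<sup>n</sup>, the bit length of its power of five factor, and whether converting 10<sup>n</sup> to a double and back recovers it exactly.</p>

```python
def power_of_ten_report(n):
    """Return (n, bits in 10**n, bits in 5**n, exact as a double?)."""
    ten = 10 ** n
    exact = int(float(ten)) == ten        # round trip through a double
    return n, ten.bit_length(), (5 ** n).bit_length(), exact

for n in (15, 16, 22, 23):
    print(power_of_ten_report(n))
# (15, 50, 35, True)
# (16, 54, 38, True)
# (22, 74, 52, True)
# (23, 77, 54, False)
```

<p>Exactness tracks the length of the power of five factor, not the length of the power of ten itself.</p>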
<h2>10<sup>16</sup> to 10<sup>22</sup></h2>
<p>Here are the powers of ten from 10<sup>16</sup> to 10<sup>22</sup>, shown in binary and in binary scientific notation (in the binary representation, bits 54 and beyond are highlighted):</p>
<table border="1" summary="10^16 in Binary">
<tbody>
<tr>
<th class="center">10<sup>16</sup></th>
</tr>
<tr>
<td class="left small">10001110000110111100100110111111000001000000000000000<span class="highlight_gray4">0</span></td>
</tr>
<tr>
<td class="left small">1.0001110000110111100100110111111000001 x 2<sup>53</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^17 in Binary">
<tbody>
<tr>
<th class="center">10<sup>17</sup></th>
</tr>
<tr>
<td class="left small">10110001101000101011110000101110110001010000000000000<span class="highlight_gray4">0000</span></td>
</tr>
<tr>
<td class="left small">1.011000110100010101111000010111011000101 x 2<sup>56</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^18 in Binary">
<tbody>
<tr>
<th class="center">10<sup>18</sup></th>
</tr>
<tr>
<td class="left small">11011110000010110110101100111010011101100100000000000<span class="highlight_gray4">0000000</span></td>
</tr>
<tr>
<td class="left small">1.10111100000101101101011001110100111011001 x 2<sup>59</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^19 in Binary">
<tbody>
<tr>
<th class="center">10<sup>19</sup></th>
</tr>
<tr>
<td class="left small">10001010110001110010001100000100100010011110100000000<span class="highlight_gray4">00000000000</span></td>
</tr>
<tr>
<td class="left small">1.00010101100011100100011000001001000100111101 x 2<sup>63</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^20 in Binary">
<tbody>
<tr>
<th class="center">10<sup>20</sup></th>
</tr>
<tr>
<td class="left small">10101101011110001110101111000101101011000110001000000<span class="highlight_gray4">00000000000000</span></td>
</tr>
<tr>
<td class="left small">1.0101101011110001110101111000101101011000110001 x 2<sup>66</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^21 in Binary">
<tbody>
<tr>
<th class="center">10<sup>21</sup></th>
</tr>
<tr>
<td class="left small">11011000110101110010011010110111000101110111101010000<span class="highlight_gray4">00000000000000000</span></td>
</tr>
<tr>
<td class="left small">1.101100011010111001001101011011100010111011110101 x 2<sup>69</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^22 in Binary">
<tbody>
<tr>
<th class="center">10<sup>22</sup></th>
</tr>
<tr>
<td class="left small">10000111100001100111100000110010011011101010110010010<span class="highlight_gray4">000000000000000000000</span></td>
</tr>
<tr>
<td class="left small">1.000011110000110011110000011001001101110101011001001 x 2<sup>73</sup></td>
</tr>
</tbody>
</table>
<h3>On Binary Scientific Notation for Exact Integers</h3>
<p>In decimal scientific notation, when trailing zeros are significant, they are displayed. For example, if you know a value to be exactly 300,000, you would write it as 3.00000 x 10<sup>5</sup>. In binary scientific notation, however, this convention is not followed. I don&#8217;t know if there&#8217;s an official explanation, but I have my own. Binary scientific notation is used to represent binary floating-point numbers, and binary floating-point numbers are, in general, approximations to decimal numbers. Displaying trailing zeros, which could be the result of rounding, might imply an exactness that&#8217;s not there. Also, binary floating-point numbers have a limited length significand; keeping the number of significant bits displayed within this limit makes the mapping clearer.</p>
<h2>Other Large Numbers With Exact Representations</h2>
<p>For the same reason, other large numbers &#8212; like 14<sup>18</sup> (greater than 10<sup>20</sup>), 6<sup>33</sup> (greater than 10<sup>25</sup>), and <a title="Read Rick Regan's article &ldquo;A Simple C Program That Prints 2,098 Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-simple-c-program-that-prints-2098-powers-of-two/">2<sup>1023</sup></a> (greater than 10<sup>307</sup>) &#8212; are exact as doubles. In fact, any integer with a power of two factor that takes it beyond 53 bits can be represented exactly (as long as the maximum exponent isn&#8217;t exceeded). In this case, <strong>a double can represent more than 53 significant bits!</strong></p>
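<p>This observation suggests a simple exactness test for nonnegative integers: remove the power of two factor and see whether what remains fits in 53 bits. The Python sketch below is a hypothetical helper (is_exact_double is not a library function), and it treats 2<sup>1024</sup> and beyond as exceeding the maximum exponent.</p>

```python
def is_exact_double(n):
    """True if the nonnegative integer n is exactly representable as a double."""
    if n == 0:
        return True
    if n.bit_length() > 1024:
        return False                 # beyond the maximum exponent
    while n % 2 == 0:
        n //= 2                      # strip the power of two factor
    return 53 >= n.bit_length()      # the rest must fit in the significand

for label, n in (("14^18", 14**18), ("6^33", 6**33), ("2^1023", 2**1023), ("10^23", 10**23)):
    print(label, is_exact_double(n))
```

<p>The first three values pass because their odd factors (7<sup>18</sup>, 3<sup>33</sup>, and 1) fit in 53 bits; 10<sup>23</sup> fails because 5<sup>23</sup> needs 54.</p>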
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Correct Decimal To Floating-Point Using Big Integers</title>
		<link>http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/</link>
		<comments>http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/#comments</comments>
		<pubDate>Thu, 04 Aug 2011 02:44:19 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary integers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Fractions]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=340</guid>
		<description><![CDATA[Producing correctly rounded decimal to floating-point conversions is hard, but only because it is made to be done efficiently. There is a simple algorithm that produces correct conversions, but it&#8217;s too slow &#8212; it&#8217;s based entirely on arbitrary-precision integer arithmetic. Nonetheless, you should know this algorithm, because it will help you understand the highly-optimized conversion [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">Correct Decimal To Floating-Point Using Big Integers</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.exploringbinary.com/topics/#correctly-rounded-decimal-to-floating-point" title="Read Articles from Rick Regan's Topics Page: &ldquo;Correctly Rounded Decimal to Floating-Point Conversion&rdquo;">Producing correctly rounded decimal to floating-point conversions is hard</a>, but only because it is made to be done efficiently. There is a simple algorithm that produces correct conversions, but it&#8217;s too slow &#8212; it&#8217;s based entirely on arbitrary-precision integer arithmetic. Nonetheless, you should know this algorithm, because it will help you understand the highly-optimized conversion routines used in practice, like <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>. I will outline the algorithm, which is easily implemented in a language like C, using a &#8220;big integer&#8221; library like <a href="http://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/" title="Read Rick Regan's Article &ldquo;How to Install and Run GMP on Windows Using MPIR&rdquo;">GMP</a>.</p>
<div class="wp-caption aligncenter" style="width: 535px"><img src="http://www.exploringbinary.com/wp-content/uploads/BigIntegers.1eminus20.png" alt="Ratio of Big Integers (2^119/10^20) Producing the 53-Bit Significand of 1e-20" width="525" height="107"/><p class="wp-caption-text">Ratio of Big Integers (2<sup>119</sup>/10<sup>20</sup>) Producing the 53-Bit Significand of 1e-20</p></div>
<p><span id="more-340"></span></p>
<h2>Arbitrary-Precision is Needed</h2>
<p>Our task is to write a computer program that uses <em><strong>binary arithmetic</strong></em> to convert a decimal number &#8212; represented as a character string in standard or scientific notation &#8212; into an <a title="Wikipedia article on IEEE double-precision format" href="http://en.wikipedia.org/wiki/Double_precision">IEEE double-precision binary floating-point number</a>. What makes this task difficult is that a given decimal number may not have a double-precision equivalent; an approximation &#8212; the double-precision number <em>closest</em> to it &#8212; must be used in its place.</p>
<p>An IEEE double-precision binary floating-point number, or <em>double</em>, represents a number in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>, with a 53-bit significand (subnormal numbers have fewer than 53 bits, but I won&#8217;t be discussing them) and an 11-bit power of two exponent. The limited precision of the significand is what makes approximation necessary. For example, a decimal fraction like 0.1 has an infinite binary representation, so it must be rounded to 53 significant bits. A large integer like 9007199254740997 has a binary representation that, although finite, is more than 53 bits long; it must also be rounded to 53 bits.</p>
<p>To round correctly, in general, you need to compute the approximation using more than 53 bits of precision; <em>arbitrary-precision</em> is required, an amount which depends on the input. An arbitrary-precision floating point or arbitrary-precision integer based algorithm could be used, but the algorithm I&#8217;ll discuss uses arbitrary-precision integers &#8212; AKA <em>big integers</em>. This makes for a more elegant solution.</p>
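<p>Both kinds of approximation mentioned above can be observed directly. The short Python sketch below is an illustration only; it assumes Python floats are IEEE doubles and uses the standard fractions module to read back the exact value a double holds.</p>

```python
from fractions import Fraction

# 0.1 has an infinite binary expansion, so the stored double is not exactly 1/10.
tenth = Fraction(float("0.1"))            # exact rational value of the double
print(tenth == Fraction(1, 10))           # False
print(tenth > Fraction(1, 10))            # True: this double is slightly above 0.1

# 9007199254740997 = 2**53 + 5 needs 54 bits, so it must be rounded as well.
n = 9007199254740997
print(n.bit_length())                     # 54
print(int(float(n)))                      # 9007199254740996
```

<p>The integer example happens to be a halfway case: it lies midway between the doubles 2<sup>53</sup> + 4 and 2<sup>53</sup> + 6, and rounds to the one with the even significand.</p>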
<h2>Big Integer Based Algorithm</h2>
<p>The big integer based algorithm I will show you is essentially <em>AlgorithmM</em>, as described in William D. Clinger&#8217;s paper &ldquo;<a title="Read William D. Clinger's paper &ldquo;How to Read Floating Point Numbers Accurately&rdquo;" href="http://www.cesura17.net/~will/Professional/Research/Papers/howtoread.pdf">How to Read Floating Point Numbers Accurately</a>.&rdquo; He presents it in the programming language Scheme, but I will present it to you in words, and by example. (Like Clinger, I will deal only with positive, normal numbers.)</p>
<h3>53-Bit Integer Significands</h3>
<p>The significand of a normalized double-precision floating-point number is 53 bits, with its most significant bit equal to 1. For the purposes of our conversion algorithm, we&#8217;ll think of it as a 53-bit integer; that is, an integer between 2<sup>52</sup> (4503599627370496) and 2<sup>53</sup> &#8211; 1 (9007199254740991).</p>
<h3>Overview</h3>
<p>The idea of the algorithm is to represent the decimal number as a fraction, one that produces a 53-bit integer quotient and an arbitrary-precision integer remainder. The significand is the quotient, rounded up or down according to the remainder; the power of two exponent is derived from the <a title="Wikipedia article on binary scaling" href="http://en.wikipedia.org/wiki/Binary_scaling">binary scale factor</a> used to produce the fraction.</p>
<p>There are two basic cases the algorithm must handle: one where the fraction needs to be <em>scaled up</em> to produce a 53-bit quotient, and one where the fraction needs to be <em>scaled down</em> to produce a 53-bit quotient. I will demonstrate the algorithm with an example of each case.</p>
<h3>Converting 3.14159 to Double-Precision</h3>
<p>Here&#8217;s how to convert the decimal literal 3.14159 to double-precision floating-point:</p>
<ol>
<li><strong>Express as an integer times a power of ten</strong>
<p>3.14159 = 314159 x 10<sup>-5</sup></p>
<p><em>In code</em>: This step is done as the decimal string is parsed. The significant digits are collected (sans the decimal point) and then converted to binary as a big integer. The exponent of the power of ten is determined based on the correct placement of the decimal point. (For strings expressed in decimal scientific notation, the exponent must be converted to binary, and then adjusted for the decimal point accordingly.)</p>
</li>
<li><strong>Express as a fraction</strong>
<p>The negative power of ten, 10<sup>-5</sup>, makes this a fraction, with 314159 as the numerator and 10<sup>5</sup> = 100000 as the denominator:</p>
<p>314159/100000</p>
<p><em>In code</em>: The positive power of ten is computed as a big integer,  raising ten to the absolute value of the exponent.</p>
</li>
<li><strong>Scale into the range [2<sup>52</sup>,2<sup>53</sup>)</strong>
<p>314159/100000 is a number <em>r</em>, 2<sup>1</sup> &le; <em>r</em> &lt; 2<sup>2</sup>. We need to scale it <em>up</em>, so that we have a scaled value <em>r&#8217;</em> in the range 2<sup>52</sup> &le; <em>r&#8217;</em> &lt; 2<sup>53</sup>; we do this by multiplying the numerator by 2<sup>51</sup>:</p>
<p>314159 x 2<sup>51</sup>/100000 = 707423177667543826432/100000</p>
<p>(We scale the numerator because we need the scaling done <em>before</em> the division, not after!)</p>
<p><em>In code</em>: You could use logarithms to compute the scale factor, but you&#8217;d have to use care. You would need arbitrary precision to properly distinguish values near powers of two. A simpler approach, in the spirit of sticking with integer arithmetic, is to use a loop: multiply the numerator by two until the quotient is in range. (For conversions where the quotient starts out <em>above</em> range, multiply the <em>denominator</em> by two until the quotient is in range.)</p>
</li>
<li><strong>Divide and round</strong>
<p>707423177667543826432/100000 = 7074231776675438 remainder 26432</p>
<p>The quotient is an integer <em>q</em> in the range 2<sup>52</sup> &le; <em>q</em> &le; 2<sup>53</sup> &#8211; 1. The remainder is an integer <em>r</em> in the range 0 &le; <em>r</em> &lt; 100000, and it tells us which way to round <em>q</em> &#8212; up or down to the next integer, which we do according to IEEE 754 <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round-half-to-even</a> rounding: if <em>r</em> is less than half the denominator, we round <em>q</em> down; if <em>r</em> is more than half the denominator, we round <em>q</em> up; if <em>r</em> equals half the denominator, we round <em>q</em> down if <em>q</em> is even, and round <em>q</em> up if <em>q</em> is odd. In our example, <em>r</em> is less than half the denominator, so <em>q</em> rounds down to <strong>7074231776675438</strong>; this is our significand.</p>
<p>(In the event that <em>q</em> rounds up to 2<sup>53</sup>, set it to 2<sup>52</sup> instead, and decrement the exponent of the scale factor.)</p>
<p><em>In code</em>: The remainder is easily calculated as <em>r</em> = numerator &#8211; <em>q</em> x denominator. For the &ldquo;half the denominator&rdquo; comparison, note that the denominator will either be 1 (for integers that start out in range and thus don&#8217;t need scaling) or divisible by 2: if it&#8217;s 1, there&#8217;s no remainder and no rounding; if it&#8217;s divisible by 2, then half the denominator is an integer, and the remainder can be compared to it with integer operations. For the test for evenness, just divide the quotient by two and check for a remainder of zero.</p>
</li>
<li><strong>Express in normalized binary scientific notation</strong>
<p>For this step, it&#8217;s easiest to think in terms of binary. The significand, 7074231776675438, <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">converts to</a></p>
<p>11001001000011111100111110000000110111000011001101110</p>
<p>This value is scaled, by 2<sup>51</sup>; we <strong>unscale</strong> it by multiplying it by 2<sup>-51</sup>. Actually, we don&#8217;t multiply; we just express the number in unnormalized <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>:</p>
<p>11001001000011111100111110000000110111000011001101110 x 2<sup>-51</sup></p>
<p>To be represented in the double-precision format, the significand must be <strong>normalized</strong>; this means we must move the radix point 52 bits to the left, and multiply the resulting binary fraction by 2<sup>52</sup>. The unscaled and normalized value becomes:</p>
<p>1.100100100001111110011111000000011011100001100110111 x 2<sup>-51</sup> x 2<sup>52</sup></p>
<p>By the <a title="Read Rick Regan's Article &ldquo;Composing Powers of Two Using The Laws of Exponents&rdquo;" href="http://www.exploringbinary.com/composing-powers-of-two-using-the-laws-of-exponents/">laws of exponents</a>, 2<sup>-51</sup> x 2<sup>52</sup> = 2<sup>1</sup>, so our resulting double-precision value, in normalized binary scientific notation, is</p>
<p>1.100100100001111110011111000000011011100001100110111 x 2<sup>1</sup></p>
<p><em>In code</em>: The code for this step is trivial: all we do is compute a power of two exponent, by adding 52 to the negative of the exponent of the scale factor.</p>
</li>
<li><strong>Encode as a double</strong>
<p>Normalized binary scientific notation maps straightforwardly to the IEEE double-precision format; here are the values of the fields for our example, in decimal:</p>
<ul>
<li>The sign field is 0, since the number is positive.</li>
<li>The exponent field is 1024, which equals the power of two exponent (1) plus the exponent bias (1023).</li>
<li>The fraction field is 2570632149304942, which is 7074231776675438 &#8211; 2<sup>52</sup>, the trailing 52 bits of the significand (the leading 1 bit is implicit).</li>
</ul>
<p>Here is the encoded double, in binary:</p>
<p><span class="highlight_gray4">0</span>10000000000<span class="highlight_gray4">1001001000011111100111110000000110111000011001101110</span></p>
<p>The sign field is 0, the exponent field is 10000000000, and the fraction field is 1001001000011111100111110000000110111000011001101110.</p>
<p>This double has the exact decimal value of</p>
<p>3.14158999999999988261834005243144929409027099609375</p>
<p><em>In code</em>: In C, you can set these values directly by <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">casting the double as a 64-bit integer</a> and treating the fields as integer subfields within it.</p>
</li>
</ol>
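<p>The steps above can be sketched in a few lines of code. What follows is a minimal illustration in Python, chosen only because its built-in integers are arbitrary-precision; it is not the C++/GMP program described later. The function name <em>decimal_to_double</em> and its two parameters are assumptions made for this example, and it handles only positive inputs that convert to normal doubles.</p>

```python
import struct

def decimal_to_double(digits: int, exp10: int) -> float:
    """Convert digits * 10**exp10 (positive, normal range only) to the nearest double."""
    # Express the decimal number as a fraction of two big integers.
    if exp10 >= 0:
        num, den = digits * 10**exp10, 1
    else:
        num, den = digits, 10**-exp10

    # Scale by a power of two so that 2**52 <= num/den < 2**53.
    exp2 = 0                        # power of two that undoes the scaling
    while num < (den << 52):        # quotient too small: double the numerator
        num <<= 1
        exp2 -= 1
    while num >= (den << 53):       # quotient too big: double the denominator
        den <<= 1
        exp2 += 1

    # Divide, then round half to even using the integer remainder.
    # Comparing 2*r with den is the same as comparing r with half of den.
    q, r = divmod(num, den)
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1
    if q == (1 << 53):              # rounding carried into a 54th bit
        q >>= 1
        exp2 += 1

    # Normalizing moves the radix point 52 places to the left.
    exp2 += 52

    # Encode: sign 0, biased exponent, and the trailing 52 bits of the significand.
    bits = ((exp2 + 1023) << 52) | (q - (1 << 52))
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

print(decimal_to_double(314159, -5) == float("3.14159"))
```

<p>The comparison against Python&#8217;s own float() is just a convenient cross-check, on the assumption that float() is itself correctly rounded.</p>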
<h3>Converting 1.2345678901234567e22 to Double-Precision</h3>
<p>Here’s how to convert the decimal literal 1.2345678901234567e22 to double-precision floating-point:</p>
<ol>
<li><strong>Express as an integer times a power of ten</strong>
<p>1.2345678901234567e22 = 12345678901234567 x 10<sup>6</sup></p>
</li>
<li><strong>Express as a fraction</strong>
<p>The positive power of ten makes this an integer, but for the purposes of our algorithm, we still want to express it as a fraction; the numerator is 12345678901234567 x 10<sup>6</sup> = 12345678901234567 x 1000000 = 12345678901234567000000, and the denominator is 1:</p>
<p>12345678901234567000000/1</p>
</li>
<li><strong>Scale into the range [2<sup>52</sup>,2<sup>53</sup>)</strong>
<p>12345678901234567000000/1 is a number <em>r</em>, 2<sup>73</sup> &le; <em>r</em> &lt; 2<sup>74</sup>. We need to scale it <em>down</em>, so that we have a scaled value <em>r&#8217;</em> in the range 2<sup>52</sup> &le; <em>r&#8217;</em> &lt; 2<sup>53</sup>; we do this by multiplying the denominator by 2<sup>21</sup>:</p>
<p>12345678901234567000000/2<sup>21</sup> = 12345678901234567000000/2097152</p>
</li>
<li><strong>Divide and round</strong>
<p>12345678901234567000000/2097152 = 5886878443352969 remainder 1355712</p>
<p>The remainder is <em>more</em> than half the denominator, so the quotient &#8212; our significand &#8212; rounds up to <strong>5886878443352970</strong>.</p>
</li>
<li><strong>Express in normalized binary scientific notation</strong>
<p>In binary, 5886878443352970 is</p>
<p>10100111010100001010110110010011100111011001110001010</p>
<p>Unscaled, it is</p>
<p>10100111010100001010110110010011100111011001110001010 x 2<sup>21</sup></p>
<p>Normalized, it is</p>
<p>1.010011101010000101011011001001110011101100111000101 x 2<sup>21</sup> x 2<sup>52</sup></p>
<p>Simplified, it is the resulting double-precision value</p>
<p>1.010011101010000101011011001001110011101100111000101 x 2<sup>73</sup></p>
</li>
<li><strong>Encode as a double</strong>
<p>Here are the values of the fields for our example, in decimal:</p>
<ul>
<li>The sign field is 0, since the number is positive.</li>
<li>The exponent field is 1096, which equals the power of two exponent (73) plus the exponent bias (1023).</li>
<li>The fraction field is 1383278815982474, which is 5886878443352970 &#8211; 2<sup>52</sup>.</li>
</ul>
<p>Here is the encoded double, in binary:</p>
<p><span class="highlight_gray4">0</span>10001001000<span class="highlight_gray4">0100111010100001010110110010011100111011001110001010</span></p>
<p>The sign field is 0, the exponent field is 10001001000, and the fraction field is 0100111010100001010110110010011100111011001110001010.</p>
<p>This double has the exact decimal value of</p>
<p>12345678901234567741440</p>
</li>
</ol>
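<p>The arithmetic in this example is easy to check with any big integer calculator. Here is one way to do it in Python (a sketch for verification only; the variable names are mine):</p>

```python
import struct

numerator = 12345678901234567 * 10**6
denominator = 2**21

# Divide and round: the remainder is compared with half the denominator.
q, r = divmod(numerator, denominator)
significand = q + 1 if 2 * r > denominator else q   # this example is not a tie

# Encode: biased exponent and the trailing 52 bits of the significand.
exponent_field = 73 + 1023
fraction_field = significand - 2**52
bits = (exponent_field << 52) | fraction_field
value = struct.unpack("<d", struct.pack("<Q", bits))[0]

print(q, r)
print(exponent_field, fraction_field)
print(format(bits, "064b"))
print(int(value))       # exact decimal value of the encoded double
```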
<h2>The Integers Get Huge</h2>
<p>For very large and very small decimal numbers, the scaled fractions have very large numerators and denominators. For example, for DBL_MAX, which is defined as the decimal literal 1.7976931348623158e308, the algorithm produces this scaled fraction:</p>
<p>179769313486231580000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000<strong>/</strong>19958403095347198116563727130368385<br />
660674512604354575415025472424372118918689640657849579654926357010893<br />
424468441924952439724379883935936607391717982848314203200056729510856<br />
765175377214443629871826533567445439239933308104551208703888888552684<br />
480441575071209068757560416423584952303440099278848</p>
<p>(The numerator is 1.7976931348623158 x 10<sup>308</sup>; the denominator is 2<sup>971</sup>. The quotient rounds to 9007199254740991, which in binary is 53 bits of 1s.)</p>
<p>Another example is DBL_MIN, which is defined as 2.2250738585072014e-308; it results in a fraction with even larger integers.</p>
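<p>To get a feel for the sizes involved, the scaled fractions for both limits can be rebuilt and measured. This is a small Python sketch; the helper names <em>scaled_fraction</em> and <em>rounded_quotient</em> are made up for the illustration.</p>

```python
def scaled_fraction(digits: int, exp10: int):
    """Scale digits * 10**exp10 into [2**52, 2**53); return (numerator, denominator)."""
    if exp10 >= 0:
        num, den = digits * 10**exp10, 1
    else:
        num, den = digits, 10**-exp10
    while num < (den << 52):    # quotient too small: double the numerator
        num <<= 1
    while num >= (den << 53):   # quotient too big: double the denominator
        den <<= 1
    return num, den

def rounded_quotient(num: int, den: int) -> int:
    q, r = divmod(num, den)
    return q + 1 if 2 * r > den else q    # ties do not occur in these two examples

for name, digits, exp10 in [("DBL_MAX", 17976931348623158, 292),
                            ("DBL_MIN", 22250738585072014, -324)]:
    num, den = scaled_fraction(digits, exp10)
    print(name, len(str(num)), "digits /", len(str(den)), "digits ->",
          rounded_quotient(num, den))
```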
<h2>Handling Remaining Cases</h2>
<p>The algorithm as presented only deals with positive, normal numbers. If you were to implement this according to the IEEE 754 specification, you&#8217;d have to handle zero, negative numbers, subnormal numbers, underflow, and overflow:</p>
<ul>
<li>To handle zero, check for it as a special case when parsing the input string.</li>
<li>To handle negative numbers, treat the number as positive and then just negate the converted result.</li>
<li>To handle subnormal numbers, reduce the precision of the significand &#8212; to between 1 and 52 bits &#8212; when scaling to 53 bits would take the power of two exponent into the subnormal range.</li>
<li>To handle underflow, return zero when the result would be below the subnormal range.</li>
<li>To handle overflow, return &ldquo;infinity&rdquo; when the result would be bigger than DBL_MAX.</li>
</ul>
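<p>As one illustration of these cases, here is a rough Python sketch of the subnormal and underflow handling. It assumes the input is positive and below DBL_MIN, and it relies on the fact that every subnormal double is an integer multiple of 2<sup>-1074</sup>: the scale factor is fixed, so the quotient simply comes out with fewer than 53 bits. The function name <em>to_subnormal</em> is hypothetical.</p>

```python
import math
from fractions import Fraction

def to_subnormal(text: str) -> float:
    """Round a positive decimal string below DBL_MIN to the nearest double."""
    x = Fraction(text)                          # exact numerator and denominator
    num, den = x.numerator << 1074, x.denominator

    # The quotient counts units of 2**-1074; it has between 0 and 52 bits.
    q, r = divmod(num, den)
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1

    # q == 0 is underflow to zero; q == 2**52 means rounding reached DBL_MIN.
    return math.ldexp(q, -1074)

for s in ["2.171e-308", "4.9406564584124654e-324", "2e-324", "1e-400"]:
    print(s, to_subnormal(s).hex())
```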
<h2>Discussion</h2>
<h3>Big Integers vs. Big Floats</h3>
<p>As I mentioned, <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">correct conversion can be done using arbitrary-precision floating point arithmetic</a> &#8212; AKA <em>big floats</em>. However, I think the big integer based algorithm is cleaner. Its main advantage comes from the &ldquo;baselessness&rdquo; of integers. You can work with them conceptually as decimal numbers, <a title="Read Rick Regan's article &ldquo;The Four Stages of Floating-Point Competence&rdquo;" href="http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/">blissfully ignorant</a> that they are implemented in binary &#8212; conversion of integers to and from binary is exact.</p>
<p>A big advantage of &ldquo;baselessness&rdquo; is in rounding. To round correctly, there must be enough precision to ensure the accuracy of the rounding bits &#8212; significant bit 54 and an arbitrary number of bits beyond. In the integer algorithm, the required precision comes automatically, through scaling. Each conversion gets the rounding precision it needs, as reflected in the integer remainder. And the calculation is simple: the remainder is compared to one-half of the denominator.</p>
<p>In a floating-point algorithm, precision needs to be set explicitly (either on a case-by-case basis, or once, covering the worst case). Also, the calculation is not as straightforward, requiring knowledge of the absolute location of the rounding bits. For example, consider <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/#308984926168550152811">3.08984926168550152811e-32</a>, a decimal number very nearly halfway between two floating-point numbers. The calculation a floating-point algorithm must make is to compare 2<sup>-158</sup> (one-half ULP) to 2<sup>-158</sup> + 2<sup>-234</sup> (the value of the 77 required rounding bits).</p>
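<p>For comparison, here is how the same halfway decision looks when done with exact rational arithmetic. This Python sketch uses the standard Fraction type as a stand-in for big integers; the exponent range in the comment is worked out by hand for this one input, not computed.</p>

```python
from fractions import Fraction

text = "3.08984926168550152811e-32"
x = Fraction(text)                      # the exact decimal value

# x lies in [2**-105, 2**-104), so one ULP is 2**-157 and half an ULP is 2**-158.
ulp = Fraction(1, 2**157)
lower = (x // ulp) * ulp                # the double just below x
excess = (x - lower) - ulp / 2          # how far x sits past the halfway point

nearest = lower + ulp if excess > 0 else lower
print(excess > 0, float(excess) / 2.0**-234)
print(float(nearest) == float(text))
```

<p>No precision has to be chosen in advance here; the fraction carries whatever it needs.</p>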
<h3>Sometimes Arbitrary-Precision Is Overkill</h3>
<p>The algorithm I presented is simple and works for every case, but sometimes it is overkill. For many conversions, a simple IEEE double-precision floating-point calculation does the trick. For example, 3.14159 can be converted by this line of standard C code: <em>double d = 314159.0/100000;</em>. This works because both 314159 and 100000 are exactly representable in double-precision, and a standard IEEE floating-point division is guaranteed to produce a correctly rounded result. (<a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">This is one optimization used in David Gay&#8217;s strtod()</a>.)</p>
<p>Arbitrary-precision <em>is</em> required for my second example though, since 12345678901234567 is not exactly representable in double-precision; <em>double d = 12345678901234567.0*1000000;</em> produces a result that is 1 ULP off.</p>
<p>(After publishing this I realized that the C statements above are processed by the compiler &#8212; constant folding. There&#8217;s no guarantee that the compiler will use floating-point division and multiplication instructions; it could use its own algorithms. My point is still valid though. In a real program, the two integer arguments will not be constants, so floating-point instructions will be used &#8212; at runtime.)</p>
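<p>The shortcut can be sketched as a runtime check. The Python function below is an illustration only, not strtod()&#8217;s actual code; its name and its limits (digits below 2<sup>53</sup>, powers of ten up to 10<sup>22</sup>, the largest that is exact in a double) are stated here as assumptions.</p>

```python
def fast_path(digits: int, exp10: int):
    """Return the double for digits * 10**exp10, or None if the shortcut does not apply."""
    # Both operands must be exactly representable in double-precision.
    if digits >= 2**53 or abs(exp10) > 22:
        return None
    power = float(10 ** abs(exp10))
    # A single IEEE multiplication or division of exact operands is correctly rounded.
    return digits * power if exp10 >= 0 else digits / power

print(fast_path(314159, -5) == float("3.14159"))
print(fast_path(12345678901234567, 6))      # digits do not fit in 53 bits
```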
<h3>Implemented In a C++ Program</h3>
<p>I implemented the big integer algorithm in a C++ program using GMP (<a href="http://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/" title="Read Rick Regan's Article &ldquo;How to Install and Run GMP on Windows Using MPIR&rdquo;">MPIR</a>). I validated it by comparing its results to David Gay&#8217;s strtod(). (I have used strtod() as a benchmark against which to compare other conversion routines; writing this program gave me an independent way to confirm that strtod() in fact produces correct results.)</p>
<h2>Summary</h2>
<p>I showed you a simple algorithm that uses binary integer arithmetic to convert a decimal string to an IEEE double-precision binary floating-point number. The algorithm starts by representing a decimal number as a fraction, with a numerator and denominator that are arbitrary-precision integers. The fraction is scaled by a power of two &#8212; positive or <a title="Read Rick Regan's Article &ldquo;A Table of Negative Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-table-of-negative-powers-of-two/">negative</a> &#8212; so that the integer part of the resulting quotient will contain exactly 53 bits. The 53-bit quotient is rounded to the nearest integer, unambiguously according to the value of the integer remainder. The rounded quotient &#8212; the 53 bits of the floating-point number &#8212; is unscaled and normalized by setting the power of two exponent of the floating-point number accordingly.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">Correct Decimal To Floating-Point Using Big Integers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Pi and e In Binary</title>
		<link>http://www.exploringbinary.com/pi-and-e-in-binary/</link>
		<comments>http://www.exploringbinary.com/pi-and-e-in-binary/#comments</comments>
		<pubDate>Wed, 18 May 2011 18:52:36 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Convert to decimal]]></category>
		<category><![CDATA[Convert to hexadecimal]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=333</guid>
		<description><![CDATA[Some people are curious about the binary representations of the mathematical constants pi and e. Mathematically, they&#8217;re like every other irrational number &#8212; infinite strings of 0s and 1s (with no discernible pattern). In a computer, they&#8217;re finite, making them only approximations to their true values. I will show you what their approximations look like [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/pi-and-e-in-binary/">Pi and e In Binary</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Some people are curious about the binary representations of the mathematical constants <a title="Wikipedia article on the mathematical constant &ldquo;pi&rdquo;" href="http://en.wikipedia.org/wiki/Pi">pi</a> and <a title="Wikipedia article on the mathematical constant &ldquo;e&rdquo;" href="http://en.wikipedia.org/wiki/E_%28mathematical_constant%29">e</a>. Mathematically, they&#8217;re like every other irrational number &#8212; infinite strings of 0s and 1s (with no discernible pattern). In a computer, they&#8217;re finite, making them only approximations to their true values. I will show you what their approximations look like in <a title="Wikipedia article on floating-point precision" href="http://en.wikipedia.org/wiki/Floating_point#Internal_representation">five different levels of binary floating-point precision</a>.</p>
<div class="wp-caption aligncenter" style="width: 450px"><img src="http://www.exploringbinary.com/wp-content/uploads/pi.e.png" alt="The first 43 bits of pi and e" width="440" height="77"/><p class="wp-caption-text">The first 43 bits of pi and e</p></div>
<p><span id="more-333"></span> </p>
<h2>Binary Floating-Point Formats</h2>
<p>In binary floating-point, infinitely precise values are rounded to finite precision. Here&#8217;s how rounding works in five different levels of precision:</p>
<ul>
<li>In <em>half-precision</em>, values are rounded to 11 significant bits.</li>
<li>In <em>single-precision</em>, values are rounded to 24 significant bits.</li>
<li>In <em>double-precision</em>, values are rounded to 53 significant bits.</li>
<li>In <em>extended-precision</em>, values are rounded to 64 significant bits.</li>
<li>In <em>quadruple-precision</em>, values are rounded to 113 significant bits.</li>
</ul>
<p>The rounding rule used most often in practice is <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round-to-nearest, round-half-to-even</a>; that&#8217;s the rule I will use. For pi and <em>e</em>, there are no &ldquo;half to even&rdquo; cases, since their binary expansions are infinite. This makes the rounding rule simple: if the rounding bit is 0, round down; if the rounding bit is 1, round up.</p>
<p>I will show the correctly rounded approximations of pi and <em>e</em> in these five formats.</p>
<h2>Pi (&pi;)</h2>
<p>Here are the first 50 decimal digits of pi:</p>
<p>3.1415926535897932384626433832795028841971693993751&#8230;</p>
<p>Here are the first 128 <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">bits of pi</a>:</p>
<pre class="noborder">
11.00100100001111110110101010001000100001011010001100001000110100
1100010011000110011000101000101110000000110111000001110011010001...
</pre>
<p>Here are the 128 bits again, with the rounding bit for each level of precision highlighted (bits 12, 25, 54, 65, and 114):</p>
<pre class="noborder">
11.001001000<span class="highlight_gray4">0</span>111111011010<span class="highlight_gray4">1</span>0100010001000010110100011000<span class="highlight_gray4">0</span>1000110100
<span class="highlight_gray4">1</span>100010011000110011000101000101110000000110111000<span class="highlight_gray4">0</span>01110011010001...
</pre>
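<p>The rounding rule can be applied mechanically to the bit string. Here is a small Python sketch using the first 64 bits shown above (the function name <em>round_bits</em> is an assumption of this example, and it ignores the case where rounding up carries out of the top bit, which does not happen for pi):</p>

```python
import math

# First 64 bits of pi, copied from the listing above (two integer bits, then the fraction).
PI_BITS = "11" "00100100001111110110101010001000100001011010001100001000110100"

def round_bits(bits: str, precision: int) -> int:
    """Round an infinite binary expansion to `precision` bits.

    bits[precision] is the rounding bit: 0 rounds down, 1 rounds up.
    """
    kept = int(bits[:precision], 2)
    return kept + 1 if bits[precision] == "1" else kept

for name, p in [("half", 11), ("single", 24), ("double", 53)]:
    print(name, format(round_bits(PI_BITS, p), "b"))

# Cross-check: the 53-bit result is the significand of the platform's double pi.
print(round_bits(PI_BITS, 53) == int(math.pi * 2**51))
```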
<p>Here are the correctly rounded values of pi in each of the five levels of precision, shown in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">normalized binary scientific notation</a> and as <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constants</a>:</p>
<h3>Half-Precision</h3>
<pre class="noborder">
pi = 1.1001001 x 2<sup>1</sup>
   = 0&#120;1.92p+1
</pre>
<p><a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">This equals 3.140625 in decimal</a> (all binary floating-point numbers have exact decimal representations), which approximates pi accurately to about 4 decimal digits.</p>
<p>(You can verify this conversion by hand, by adding the powers of two corresponding to the positions of the 1 bits: 11.001001 = 2<sup>1</sup> + 2<sup>0</sup> + 2<sup>-3</sup> + 2<sup>-6</sup> = 3.140625.)</p>
<h3>Single-Precision</h3>
<pre class="noborder">
pi = 1.10010010000111111011011 x 2<sup>1</sup>
   = 0&#120;1.921fb6p+1
</pre>
<p>This equals 3.1415927410125732421875, which approximates pi accurately to about 8 decimal digits.</p>
<h3>Double-Precision</h3>
<pre class="noborder">
pi = 1.1001001000011111101101010100010001000010110100011 x 2<sup>1</sup>
   = 0&#120;1.921fb54442d18p+1
</pre>
<p>This equals 3.141592653589793115997963468544185161590576171875, which approximates pi accurately to about 16 decimal digits.</p>
<h3>Extended-Precision</h3>
<pre class="noborder">
pi = 1.100100100001111110110101010001000100001011010001100001000110
     101 x 2<sup>1</sup>
   = 0&#120;1.921fb54442d1846ap+1
</pre>
<p>This equals</p>
<p>3.14159265358979323851280895940618620443274267017841339111328125</p>
<p>which approximates pi accurately to about 20 decimal digits.</p>
<h3>Quadruple-Precision</h3>
<pre class="noborder">
pi = 1.100100100001111110110101010001000100001011010001100001000110
     1001100010011000110011000101000101110000000110111 x 2<sup>1</sup>
   = 0&#120;1.921fb54442d18469898cc51701b8p+1
</pre>
<p>This equals</p>
<p>3.141592653589793238462643383279502797479068098137295573004504331<br />
874296718662975536062731407582759857177734375</p>
<p>which approximates pi accurately to about 35 decimal digits.</p>
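<p>Three of these values can be reproduced with nothing more than Python&#8217;s standard library. This is a sketch under one assumption worth stating: it rounds the <em>double</em> value of pi down to half- and single-precision, rather than rounding pi itself, which gives the same answer here because pi is nowhere near a halfway case. (Extended- and quadruple-precision are not covered; the struct module has no codes for them.)</p>

```python
import math
import struct
from decimal import Decimal

def round_trip(fmt: str, x: float) -> float:
    """Round x to the format named by a struct code ('<e' half, '<f' single)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

half = round_trip("<e", math.pi)
single = round_trip("<f", math.pi)

# float.hex() shows the hexadecimal constant; Decimal(x) shows the exact decimal value.
print(half.hex(), Decimal(half))
print(single.hex(), Decimal(single))
print(math.pi.hex(), Decimal(math.pi))
```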
<h2>Euler&#8217;s Number (e)</h2>
<p>Here are the first 50 decimal digits of <em>e</em>:</p>
<p>2.7182818284590452353602874713526624977572470936999&#8230;</p>
<p>Here are the first 128 bits of <em>e</em>:</p>
<pre class="noborder">
10.10110111111000010101000101100010100010101110110100101010011010
1010111111011100010101100010000000100111001111010011110011110001...
</pre>
<p>Here are the 128 bits again, with the rounding bits for each level of precision highlighted:</p>
<pre class="noborder">
10.101101111<span class="highlight_gray4">1</span>100001010100<span class="highlight_gray4">0</span>1011000101000101011101101001<span class="highlight_gray4">0</span>1010011010
<span class="highlight_gray4">1</span>010111111011100010101100010000000100111001111010<span class="highlight_gray4">0</span>11110011110001...
</pre>
<p>Here are the correctly rounded values of <em>e</em> in each of the five levels of precision:</p>
<h3>Half-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010111 x 2<sup>1</sup>
  = 0&#120;1.5cp+1
</pre>
<p>This equals 2.71875, which approximates <em>e</em> accurately to about 4 decimal digits.</p>
<h3>Single-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010110111111000010101 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8p+1
</pre>
<p>This equals 2.71828174591064453125, which approximates <em>e</em> accurately to about 8 decimal digits.</p>
<h3>Double-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.0101101111110000101010001011000101000101011101101001 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8b145769p+1
</pre>
<p>This equals 2.718281828459045090795598298427648842334747314453125, which approximates <em>e</em> accurately to about 16 decimal digits.</p>
<h3>Extended-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010110111111000010101000101100010100010101110110100101010011
    011 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8b145769536p+1
</pre>
<p>This equals</p>
<p>2.71828182845904523542816810799394033892895095050334930419921875</p>
<p>which approximates <em>e</em> accurately to about 20 decimal digits.</p>
<h3>Quadruple-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010110111111000010101000101100010100010101110110100101010011
    010101011111101110001010110001000000010011100111101 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8b1457695355fb8ac404e7ap+1
</pre>
<p>This equals</p>
<p>2.718281828459045235360287471352662314358421867193548862669230860<br />
32766716801933881697550532408058643341064453125</p>
<p>which approximates <em>e</em> accurately to about 34 decimal digits.</p>
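<p>The bits of <em>e</em> can also be generated from scratch. The Python sketch below sums the series 1/0! + 1/1! + 1/2! + &#8230; with exact fractions; stopping after 59 terms is my own choice, made so that the neglected tail is far smaller than the last bit shown. The function name <em>e_bits</em> is hypothetical.</p>

```python
from fractions import Fraction

def e_bits(count: int) -> str:
    """Return the first `count` bits of e (two integer bits, then fraction bits)."""
    total, term = Fraction(0), Fraction(1)
    for k in range(1, 60):          # adds 1/0! through 1/58!
        total += term
        term /= k
    # e is in [2, 4), so truncating e * 2**(count - 2) leaves exactly `count` bits.
    return format(int(total * 2 ** (count - 2)), "b")

bits = e_bits(128)
print(bits[:2] + "." + bits[2:])

# Round to quadruple-precision: keep 113 bits, add the rounding bit (bit 114).
quad = int(bits[:113], 2) + int(bits[113])
print(hex(quad))
```

<p>The printed hexadecimal integer is the 113-bit significand, that is, the digits of the hexadecimal constant above with the radix point removed.</p>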
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/pi-and-e-in-binary/">Pi and e In Binary</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/pi-and-e-in-binary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Four Stages of Floating-Point Competence</title>
		<link>http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/</link>
		<comments>http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/#comments</comments>
		<pubDate>Wed, 20 Apr 2011 16:35:19 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=332</guid>
		<description><![CDATA[The four stages of competence model describes the phases you go through when acquiring a skill: Unconscious incompetence: You don&#8217;t know what you don&#8217;t know. Conscious incompetence: You know what you don&#8217;t know. Conscious competence: You know what you know. Unconscious competence: You don&#8217;t know what you know. I&#8217;ve applied this model to assess competence [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/">The Four Stages of Floating-Point Competence</a></p>
]]></description>
			<content:encoded><![CDATA[<p>The <a title="Wikipedia article on the the four stages of competence" href="http://en.wikipedia.org/wiki/Four_stages_of_competence">four stages of competence model</a> describes the phases you go through when acquiring a skill:</p>
<ol>
<li>Unconscious incompetence: You don&#8217;t know what you don&#8217;t know.</li>
<li>Conscious incompetence: You know what you don&#8217;t know.</li>
<li>Conscious competence: You know what you know.</li>
<li>Unconscious competence: You don&#8217;t know what you know.</li>
</ol>
<p>I&#8217;ve applied this model to assess competence in binary floating-point arithmetic; let&#8217;s see where you stand:</p>
<p><span id="more-332"></span>$19.24 plus $6.95 equals $26.189999999999998; you are</p>
<ol>
<li><em>Unconsciously incompetent</em> when you think there&#8217;s a bug in your programming language.</li>
<li><em>Consciously incompetent</em> when you realize that most decimal fractions are infinite in binary.</li>
<li><em>Consciously competent</em> when you study and apply the principles discussed in &ldquo;<a title="Read &ldquo;What Every Computer Scientist Should Know About Floating-Point Arithmetic&rdquo;" href="http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html">What Every Computer Scientist Should Know About Floating-Point Arithmetic</a>.&rdquo;</li>
<li><em>Unconsciously competent</em> when you know to use decimal or integer arithmetic instead of binary floating-point.</li>
</ol>
<p class="break"> <img src='http://www.exploringbinary.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
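<p>For the record, here is the sum computed three ways in Python: in binary floating-point, in decimal arithmetic, and in integer cents. This is only a small illustration of stage four; the variable names are mine.</p>

```python
from decimal import Decimal

binary_total = 19.24 + 6.95                          # binary floating-point
decimal_total = Decimal("19.24") + Decimal("6.95")   # decimal arithmetic
cents_total = 1924 + 695                             # integer arithmetic, in cents

print(binary_total)
print(decimal_total)
print(f"{cents_total // 100}.{cents_total % 100:02d}")
```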
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/">The Four Stages of Floating-Point Competence</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Properties of the Correction Loop in David Gay&#8217;s strtod()</title>
		<link>http://www.exploringbinary.com/properties-of-the-correction-loop-in-david-gays-strtod/</link>
		<comments>http://www.exploringbinary.com/properties-of-the-correction-loop-in-david-gays-strtod/#comments</comments>
		<pubDate>Wed, 23 Mar 2011 17:59:50 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=329</guid>
		<description><![CDATA[The infinite loop I discovered in PHP was caused by a bug in its decimal to floating-point conversion routine, which is based on David Gay&#8217;s widely used strtod() function. strtod() has a &#8220;correction loop,&#8221; the purpose of which is to refine an initial estimate of a converted double-precision value to its correctly rounded result. This [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/properties-of-the-correction-loop-in-david-gays-strtod/">Properties of the Correction Loop in David Gay&#8217;s strtod()</a></p>
]]></description>
			<content:encoded><![CDATA[<p>The <a title="Read Rick Regan's Article &ldquo;PHP Hangs On Numeric Value 2.2250738585072011e-308&rdquo;" href="http://www.exploringbinary.com/php-hangs-on-numeric-value-2-2250738585072011e-308/">infinite loop I discovered in PHP</a> was caused by a bug in its decimal to floating-point conversion routine, which is based on <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s widely used strtod() function</a>. strtod() has a &ldquo;correction loop,&rdquo; the purpose of which is to refine an initial estimate of a converted double-precision value to its correctly rounded result. This got me thinking: infinite loops notwithstanding, how many times <em>should</em> the loop execute? Does it depend on the accuracy of the initial estimate? I instrumented strtod() and gathered some data to help answer these questions. </p>
<p>The most interesting thing I discovered was this: in my tests, strtod()&#8217;s correction procedure never executed more than <em>three</em> times. So why was it coded as an infinite loop?</p>
<p><span id="more-329"></span></p>
<h2>Methodology</h2>
<p>I ran two sets of tests: one set to test conversions to normal numbers, and one set to test conversions to subnormal numbers. For each set, I generated random decimal strings <em>d</em> representing positive decimal numbers in scientific notation, in the appropriate exponent range:</p>
<ol>
<li>Normal numbers: DBL_MIN &le; <em>d</em> &le; DBL_MAX</li>
<li>Subnormal numbers: <a title="Read Rick Regan's article &ldquo;What Powers of Two Look Like Inside a Computer&rdquo;" href="http://www.exploringbinary.com/what-powers-of-two-look-like-inside-a-computer/#double-powers-of-two">2<sup>-1074</sup></a> &le; <em>d</em> &lt; DBL_MIN</li>
</ol>
<p>(DBL_MIN is approximately 2.2250738585072014e-308; DBL_MAX is approximately 1.7976931348623158e+308; 2<sup>-1074</sup> is approximately 4.9406564584124654e-324.)</p>
<p>For each set of tests, I did 40 runs of 10 million conversions each, one run for each length from 1 to 40 significant digits. (I stopped at 40 digits because, beyond that, conversions can require &ldquo;<a title="Read Rick Regan's Article &ldquo;Bigcomp: Deciding Truncated, Near Halfway Conversions&rdquo;" href="http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/">bigcomp()</a>&rdquo;, an extra adjustment step that is done <em>after</em> the loop.)</p>
<p>I modified strtod() in dtoa.c to count the number of loops and to record the sequence of conversions made for each call. I compared each intermediate conversion to the correct conversion, computing the difference in units-in-the-last-place (ULPs) by <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">treating each double-precision conversion as a 64-bit integer</a>.</p>
<h2>Results for Normal Numbers</h2>
<h3>Number of Passes Through the Correction Loop</h3>
<p>Decimal strings that convert to normal double-precision binary floating-point numbers may require up to two passes of the loop, although in the overwhelming majority of cases just one pass suffices. For decimal numbers of 20 significant digits or less, some conversions require no passes of the loop (the number of such conversions in the 17-20 digit range is so low that they&#8217;re not visible in the chart).</p>
<div class="wp-caption aligncenter" style="width: 554px"><img src="http://www.exploringbinary.com/wp-content/uploads/strtod.loops.normal.png" alt="Number of Times Through Correction Loop (as percentage of total conversions attempted) -- normal" width="544" height="392"/><p class="wp-caption-text">Number of Times Through Correction Loop (as percentage of total conversions)</p></div>
<h3>Amount of Error in the Initial Estimate</h3>
<p>I measured conversions that were inaccurate by as much as 11 ULPs in their initial estimate, although the overwhelming majority were either correct or off by just one ULP. For decimal numbers of 17 significant digits or more, the number of correct initial estimates decreases, and the number that are off by two ULPs or more increases. </p>
<div class="wp-caption aligncenter" style="width: 558px"><img src="http://www.exploringbinary.com/wp-content/uploads/strtod.ULPs.normal.png" alt="Number of ULPs of Error in Initial Estimate (as percentage of total conversions attempted) -- normal" width="548" height="392"/><p class="wp-caption-text">Number of ULPs of Error in Initial Estimate (as percentage of total conversions)</p></div>
<p>Looking at the two charts you can infer that a lot of initial estimates are correct yet require one pass of the correction loop. This pass is required to confirm the initial estimate. At 21 significant digits or more, <em>all</em> conversions require at least one pass, whether initially correct or not. </p>
<p>It&#8217;s interesting that two passes of the correction loop suffice no matter how far off the initial estimate is. I observed that after one pass, a correct estimate is never changed, and an incorrect estimate is either made correct or brought within one ULP.</p>
<h2>Results for Subnormal Numbers</h2>
<h3>Number of Passes Through the Correction Loop</h3>
<p>Decimal strings that convert to <em>subnormal</em> double-precision binary floating-point numbers require at least one pass of the loop, with a tiny percentage requiring up to <em>three</em> passes. The number of three-loop cases jumps starting at 17 significant digits (but still not enough so that they are visible on my chart).</p>
<div class="wp-caption aligncenter" style="width: 553px"><img src="http://www.exploringbinary.com/wp-content/uploads/strtod.loops.subnormal.png" alt="Number of Times Through Correction Loop (as percentage of total conversions attempted) -- subnormal" width="543" height="381"/><p class="wp-caption-text">Number of Times Through Correction Loop (as percentage of total conversions)</p></div>
<p>2.171e-308, which converts to <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">0&#120;</a>0.f9c7573d7fe52p-1022, is an example that requires 3 loops:</p>
<ul>
<li>Initial estimate: 0&#120;0.f9c7573d7fe50p-1022</li>
<li>Start of pass 1: 0&#120;0.f9c7573d7fe50p-1022</li>
<li>Start of pass 2: 0&#120;0.f9c7573d7fe51p-1022</li>
<li>Start of pass 3: 0&#120;0.f9c7573d7fe52p-1022</li>
</ul>
<p>The initial estimate was two ULPs off, and each pass (but the last) improved it by one ULP. </p>
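<p>The ULP counts quoted in this article can be checked mechanically. The following is a minimal sketch, not part of strtod(); the class name UlpDistance and the method ulpsApart() are hypothetical. It relies on the fact that finite doubles of the same sign have bit patterns in the same order as their values, so the difference of the raw bits is the distance in ULPs.</p>

```java
// Sketch: count the ULPs between two finite doubles of the same sign.
// UlpDistance and ulpsApart are hypothetical names used for illustration.
class UlpDistance {
    // Adjacent doubles of the same sign have adjacent bit patterns,
    // so subtracting the raw bits gives the distance in ULPs.
    static long ulpsApart(double a, double b) {
        return Math.abs(Double.doubleToLongBits(a) - Double.doubleToLongBits(b));
    }

    public static void main(String[] args) {
        double estimate = 0x0.f9c7573d7fe50p-1022; // initial estimate listed above
        double correct  = 0x0.f9c7573d7fe52p-1022; // value at the start of pass 3
        System.out.println(ulpsApart(estimate, correct) + " ULPs");
    }
}
```

<p>For the two values above this prints a distance of 2 ULPs, matching the description of the 2.171e-308 example.</p>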
<h3>Amount of Error in the Initial Estimate</h3>
<p>The overwhelming majority of conversions were either correct or off by just one ULP, with some inaccurate by as much as 4 ULPs.</p>
<div class="wp-caption aligncenter" style="width: 555px"><img src="http://www.exploringbinary.com/wp-content/uploads/strtod.ULPs.subnormal.png" alt="Number of ULPs of Error in Initial Estimate (as percentage of total conversions attempted) -- subnormal" width="545" height="384"/><p class="wp-caption-text">Number of ULPs of Error in Initial Estimate (as percentage of total conversions)</p></div>
<p>I found that, for 16 significant digits or fewer, the following relationships hold:</p>
<ul>
<li>One loop is required if and only if the initial conversion is correct.</li>
<li>Two loops are required if and only if the initial conversion is one ULP off.</li>
<li>Three loops are required if and only if the initial conversion is two ULPs off.</li>
</ul>
<p>For 17 significant digits or more, only the first relationship holds.</p>
<h2>What&#8217;s the Theoretical Limit On the Number of Passes?</h2>
<p>David Gay wrote a paper in 1990 describing the <a title="David Gay's Technical Report &ldquo;Correctly Rounded Binary-Decimal and Decimal-Binary Conversions&rdquo;" href="http://www.ampl.com/REFS/rounding.pdf">theory behind strtod()</a>; on page 3 it says</p>
<blockquote><p>
&ldquo;..it is possible to use the native floating-point arithmetic together with high-precision integer calculations in a correction procedure that yields b(m) = b with m&lt;=3, and that usually has m&lt;=2.&rdquo;</p></blockquote>
<p>In other words, it says the limit is two passes through the loop: b(1) is the initial estimate, computed outside the loop, and the adjusted values b(2) and b(3) are computed inside the loop. This is consistent with my measurements of normal values, but doesn&#8217;t explain the three-loop subnormal conversions.</p>
<p>In the examples I looked at, the third pass didn&#8217;t adjust the conversion, so is it really needed? Was it added intentionally or is it the byproduct of other changes? (I have a note out to Dave Gay; I&#8217;ll update this article when he responds.)</p>
<h2>A Look At Correction Loops in Practice</h2>
<h3>Python</h3>
<p><a title="Python's dtoa.c" href="http://svn.python.org/view/python/branches/py3k-dtoa/Python/dtoa.c?view=markup&#038;pathrev=83813">Python uses dtoa.c</a>, but it has changed it considerably from David Gay&#8217;s version. For one, it has removed the infinite loop from strtod(), which is strong evidence that a loop is not required.</p>
<h3>PHP</h3>
<p>I instrumented PHP&#8217;s zend_strtod() to count passes through its correction loop. After doing tens of millions of random conversions, I discovered that the loop executes at most <em>two</em> times. I also tried a few specific three-loop strtod() examples; each required only one loop. <a title="Read Rick Regan's Article &ldquo;A Better Fix for the PHP 2.2250738585072011e-308 Bug&rdquo;" href="http://www.exploringbinary.com/a-better-fix-for-the-php-2-2250738585072011e-308-bug/">zend_strtod() is based on a very old copy of dtoa.c</a>, a copy from a time when, I&#8217;m guessing, strtod() required only two loops, as advertised.</p>
<p>A hardcoded limit would have prevented the 2.2250738585072011e-308 infinite loop bug. A forced exit from the loop would have resulted in a conversion with an error of one ULP &#8212; an imperfect but much better outcome!</p>
<h3>Java</h3>
<p>Java had an <a title="Read Rick Regan's Article &ldquo;Java Hangs When Converting 2.2250738585072012e-308&rdquo;" href="http://www.exploringbinary.com/java-hangs-when-converting-2-2250738585072012e-308/">infinite loop conversion bug</a> of its own, in the doubleValue() method of its FloatingDecimal class. doubleValue() is not strtod(), but is clearly modeled on it. It has a correction loop too, so I instrumented it and <a title="Read Rick Regan's Article &ldquo;A Closer Look at the Java 2.2250738585072012e-308 Bug&rdquo;" href="http://www.exploringbinary.com/a-closer-look-at-the-java-2-2250738585072012e-308-bug/">traced it using Eclipse</a>. I discovered that doubleValue() loops much more than strtod(). I found a normal conversion that required 11 loops, and a subnormal conversion that required 8 loops. 1.0020284025808569e-134, which converts to 0&#120;1.d21ecf36d4a22p-446, is an example that requires 10 loops:</p>
<ul>
<li>Initial estimate: 0&#120;1.d21ecf36d4a19p-446</li>
<li>Start of pass 1: 0&#120;1.d21ecf36d4a19p-446</li>
<li>Start of pass 2: 0&#120;1.d21ecf36d4a1ap-446</li>
<li>Start of pass 3: 0&#120;1.d21ecf36d4a1bp-446</li>
<li>Start of pass 4: 0&#120;1.d21ecf36d4a1cp-446</li>
<li>Start of pass 5: 0&#120;1.d21ecf36d4a1dp-446</li>
<li>Start of pass 6: 0&#120;1.d21ecf36d4a1ep-446</li>
<li>Start of pass 7: 0&#120;1.d21ecf36d4a1fp-446</li>
<li>Start of pass 8: 0&#120;1.d21ecf36d4a20p-446</li>
<li>Start of pass 9: 0&#120;1.d21ecf36d4a21p-446</li>
<li>Start of pass 10: 0&#120;1.d21ecf36d4a22p-446</li>
</ul>
<p>The initial estimate was 9 ULPs off, and each pass (but the last) improved it by one ULP.</p>
<p>I don&#8217;t know what doubleValue()&#8217;s theoretical limit is, but I suspect it&#8217;s not much higher than 11. A hardcoded limit would have prevented the 2.2250738585072012e-308 infinite loop bug, resulting in a <a title="Read Rick Regan's Article &ldquo;A Closer Look at the Java 2.2250738585072012e-308 Bug&rdquo;" href="http://www.exploringbinary.com/a-closer-look-at-the-java-2-2250738585072012e-308-bug/">conversion within one ULP of correct</a>.</p>
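<p>To make the idea of a hardcoded limit concrete, here is a simplified sketch of a correction loop with a cap on the number of passes. It is not the code in FloatingDecimal or dtoa.c: the names BoundedCorrection, MAX_PASSES, and correct() are hypothetical, the cap of 16 is an assumed value chosen only for illustration, and the exact comparison uses BigDecimal rather than the big integer arithmetic the real routines use. It also ignores exact halfway ties and the change in ULP size at power of two boundaries.</p>

```java
import java.math.BigDecimal;

// Sketch only: a simplified correction loop with a hard cap on passes.
// BoundedCorrection, MAX_PASSES, and correct() are hypothetical names;
// this is not the code in FloatingDecimal.doubleValue() or strtod().
class BoundedCorrection {
    static final int MAX_PASSES = 16; // assumed cap, chosen for illustration

    static double correct(String decimal, double estimate) {
        BigDecimal target = new BigDecimal(decimal);
        BigDecimal two = BigDecimal.valueOf(2);
        double b = estimate;
        for (int pass = 1; pass <= MAX_PASSES; pass++) {
            // Exact error of the current estimate, and half an ULP at b.
            BigDecimal error = target.subtract(new BigDecimal(b));
            BigDecimal halfUlp = new BigDecimal(Math.ulp(b)).divide(two);
            if (error.abs().compareTo(halfUlp) <= 0) {
                return b; // this pass confirms the estimate
            }
            // Otherwise move one ULP toward the decimal value and try again.
            b = error.signum() > 0 ? Math.nextUp(b) : Math.nextDown(b);
        }
        return b; // forced exit: return the best estimate so far
    }

    public static void main(String[] args) {
        double start = 0x1.d21ecf36d4a19p-446; // initial estimate listed above
        double result = correct("1.0020284025808569e-134", start);
        System.out.println(Double.toHexString(result));
    }
}
```

<p>The forced exit after the loop is the point of the sketch: whatever goes wrong inside the loop, the routine returns after a bounded number of passes instead of hanging.</p>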
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/properties-of-the-correction-loop-in-david-gays-strtod/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Better Fix for the PHP 2.2250738585072011e-308 Bug</title>
		<link>http://www.exploringbinary.com/a-better-fix-for-the-php-2-2250738585072011e-308-bug/</link>
		<comments>http://www.exploringbinary.com/a-better-fix-for-the-php-2-2250738585072011e-308-bug/#comments</comments>
		<pubDate>Thu, 24 Feb 2011 19:59:06 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=325</guid>
		<description><![CDATA[Recently I discovered a bug in PHP&#8217;s decimal to floating-point conversion routine, zend_strtod(): it went into an infinite loop trying to convert the decimal string 2.2250738585072011e-308 to floating-point. zend_strtod() is based on David Gay&#8217;s strtod() function in dtoa.c, as are the decimal to floating-point conversion routines of many other open source projects. So why hasn&#8217;t [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/a-better-fix-for-the-php-2-2250738585072011e-308-bug/">A Better Fix for the PHP 2.2250738585072011e-308 Bug</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Recently I discovered a bug in PHP&#8217;s decimal to floating-point conversion routine, zend_strtod(): it went into an infinite loop trying to convert the decimal string <a title="Read Rick Regan's Article &ldquo;PHP Hangs On Numeric Value 2.2250738585072011e-308&rdquo;" href="http://www.exploringbinary.com/php-hangs-on-numeric-value-2-2250738585072011e-308/">2.2250738585072011e-308</a> to floating-point. zend_strtod() is based on David Gay&#8217;s strtod() function in dtoa.c, as are the decimal to floating-point conversion routines of many other open source projects. So why hasn&#8217;t this bug affected these other projects?</p>
<p>zend_strtod() is based on a very old copy of dtoa.c. The current version of dtoa.c is immune to the 2.2250738585072011e-308 bug &#8212; and has been since 1997 by my reckoning. So while the <a title="Read Rick Regan's Article &ldquo;Why &ldquo;Volatile&rdquo; Fixes the 2.2250738585072011e-308 Bug&rdquo;" href="http://www.exploringbinary.com/why-volatile-fixes-the-2-2250738585072011e-308-bug/">&lsquo;volatile&rsquo; keyword fixes the PHP problem</a>, I think there&#8217;s a better solution: upgrade zend_strtod() to the latest dtoa.c.</p>
<p><span id="more-325"></span></p>
<h2>Volatile Fixes It, But It Shouldn&#8217;t Have Executed In the First Place</h2>
<p>As <a title="Read Rick Regan's Article &ldquo;Why &ldquo;Volatile&rdquo; Fixes the 2.2250738585072011e-308 Bug&rdquo;" href="http://www.exploringbinary.com/why-volatile-fixes-the-2-2250738585072011e-308-bug/">I showed</a>, this section of code in <a title="PHP's dtoa.c" href="http://svn.php.net/viewvc/php/php-src/branches/PHP_5_3/Zend/zend_strtod.c?revision=307192&#038;view=markup">zend_strtod()</a> caused the problem:</p>
<pre class="indented">
/* Compute adj so that the IEEE rounding rules will
 * correctly round rv + adj in some half-way cases.
 * If rv * ulp(rv) is denormalized (i.e.,
 * y &lt;= (P-1)*Exp_msk1), we must adjust aadj to avoid
 * trouble from bits lost to denormalization;
 * example: 1.2e-307 .
 */
if (y &lt;= (P-1)*Exp_msk1 &#038;&#038; aadj &gt;= 1.) {
	aadj1 = (double)(int)(aadj + 0.5);
	if (!dsign)
		aadj1 = -aadj1;
}
<strong>adj = aadj1 * ulp(value(rv));</strong>
<strong>value(rv) += adj;</strong>
</pre>
<p>In the current version of <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">dtoa.c on netlib</a>, that section of code now appears in the <em>#else</em> of a &ldquo;new&rdquo; (circa <a title="Change Log for David Gay's dtoa.c" href="http://www.netlib.org/fp/changes">1997</a>) <em>#ifdef</em>:</p>
<pre class="indented">
#ifdef Avoid_Underflow
...
#else
... the zend_strtod() code shown above, which triggers the bug, is now here ...
#endif /*Avoid_Underflow*/
</pre>
<p>In the netlib copy, <em>Avoid_Underflow</em> is defined by default (indirectly by <em>not</em> defining another flag, <em>NO_IEEE_Scale</em>). In other words, this code &#8212; the cause of the 2.2250738585072011e-308 bug &#8212; is no longer supposed to be executed!</p>
<h2>Some Other Projects That Use dtoa.c</h2>
<p>There are many projects that use dtoa.c; here are some of them, with links to their copies of dtoa.c:</p>
<ul>
<li>Mozilla (Firefox and others): <a title="Mozilla dtoa.c" href="http://mxr.mozilla.org/js/source/js/src/jsdtoa.c">js/source/js/src/jsdtoa.c</a> and <a title="Mozilla dtoa.c" href="http://mxr.mozilla.org/comm-central/source/mozilla/js/src/dtoa.c">source/mozilla/js/src/dtoa.c</a>.</li>
<li>WebKit (Safari): <a title="WebKit dtoa.c" href="http://trac.webkit.org/browser/releases/Apple/Safari%205.0.3/JavaScriptCore/wtf/dtoa.cpp">JavaScriptCore/wtf/dtoa.cpp</a>.</li>
<li>V8 JavaScript Engine (Chrome): <a title="V8 JavaScript Engine dtoa.c" href="http://code.google.com/p/v8/issues/attachmentText?id=266&#038;aid=-5781349513196502249&#038;name=dtoa.c.no-alias&#038;token=f856542b1159035074740370aeb409aa">dtoa.c</a>.</li>
<li>NetBSD: <a title="NetBSD dtoa.c" href="http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/gdtoa/strtod.c?annotate=1.5&#038;only_with_tag=MAIN">src/lib/libc/gdtoa/strtod.c</a>.</li>
<li>Python: <a title="Python dtoa.c" href="http://svn.python.org/view/python/trunk/Python/dtoa.c?revision=80813&#038;view=markup">python/trunk/Python/dtoa.c</a>.</li>
</ul>
<p>The NetBSD, WebKit, V8, and Mozilla &ldquo;js/src/dtoa.c&rdquo; copies of dtoa.c are relatively current; they contain the <em>#ifdef</em>s that exclude the problematic section of code. Python&#8217;s copy has removed most of the <em>#ifdef</em>s, and with them, that section of code. </p>
<p>The Mozilla &ldquo;js/src/jsdtoa.c&rdquo; copy of dtoa.c is a little different. The problem section of code is not <em>#ifdef</em>ed out, although there are <em>Avoid_Underflow</em> <em>#ifdef</em>s elsewhere. I took a copy of jsdtoa.c and modified it slightly so I could run it standalone in Visual C++. Even with full optimization, I couldn&#8217;t get that section of code to execute, at least not for input 2.2250738585072011e-308 or nearby inputs. I do not understand the differences in that code well enough to say why.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/a-better-fix-for-the-php-2-2250738585072011e-308-bug/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nondeterministic Floating-Point Conversions in Java</title>
		<link>http://www.exploringbinary.com/nondeterministic-floating-point-conversions-in-java/</link>
		<comments>http://www.exploringbinary.com/nondeterministic-floating-point-conversions-in-java/#comments</comments>
		<pubDate>Mon, 21 Feb 2011 23:05:53 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=324</guid>
		<description><![CDATA[Recently I discovered that Java converts some very small decimal numbers to double-precision floating-point incorrectly. While investigating that bug, I stumbled upon something very strange: Java&#8217;s decimal to floating-point conversion routine, Double.parseDouble(), sometimes returns two different results for the same decimal string. The culprit appears to be just-in-time compilation of Double.parseDouble() into SSE instructions, which [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/nondeterministic-floating-point-conversions-in-java/">Nondeterministic Floating-Point Conversions in Java</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Recently I discovered that <a title="Read Rick Regan's Article &ldquo;Incorrectly Rounded Subnormal Conversions in Java&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-subnormal-conversions-in-java/">Java converts some very small decimal numbers to double-precision floating-point incorrectly</a>. While investigating that bug, I stumbled upon something very strange: Java&#8217;s decimal to floating-point conversion routine, <em>Double.parseDouble()</em>, sometimes returns two different results for the same decimal string. The culprit appears to be just-in-time compilation of <em>Double.parseDouble()</em> into SSE instructions, which exposes an architecture-dependent bug in Java&#8217;s conversion algorithm &#8212; and another real-world example of a <a title="Read Rick Regan's Article &ldquo;Why &ldquo;Volatile&rdquo; Fixes the 2.2250738585072011e-308 Bug&rdquo;" href="http://www.exploringbinary.com/why-volatile-fixes-the-2-2250738585072011e-308-bug/">double rounding on underflow error</a>. I&#8217;ll describe the problem, and take you through the detective work to find its cause.</p>
<p><span id="more-324"></span></p>
<h2>Nondeterministic Conversion</h2>
<p>This program converts the decimal string representing the value 2<sup>-1047</sup> + 2<sup>-1075</sup> repeatedly, in a loop (I inserted 11 newlines in the string so it would fit here; you&#8217;ll have to delete them if you want to run this):</p>
<pre>
class diffconvert {
 public static void main(String[] args) {
   double conversion, conversion_prev = 0;

   //2^-1047 + 2^-1075
   String decimal = &quot;6.6312368714697582767853966302759672433990999
473553031442499717587362866301392654396180682007880487441059604205
526018528897150063763256665955396033303618005191075917832333584923
372080578494993608994251286407188566165030934449228547591599881603
044399098682919739314266256986631577498362522745234853124423586512
070512924530832781161439325697279187097860044978723221938561502254
152119972830784963194121246401117772161481107528151017752957198119
743384519360959074196224175384736794951486324803914359317679811223
967034438033355297560033532098300718322306892013830155987921841729
099279241763393155074022348361207309147831684007154624400538175927
027662135590421159867638194826541287705957668068727833491469671712
93949598850675682115696218943412532098591327667236328125E-316&quot;;

   for(int i = 1; i &lt;= 10000; i++) {
     conversion = <strong>Double.parseDouble(decimal)</strong>;
     if (i != 1 &#038;&#038; conversion != conversion_prev) {
        System.out.printf(&quot;Iteration %d converts as %a%n&quot;,
                          i-1,conversion_prev);
        System.out.printf(&quot;Iteration %d converts as %a%n&quot;,
                          i,conversion);
     }
     conversion_prev = conversion;
   }
 }
}
</pre>
<p>When the program runs, it should print nothing, because the conversion should be the same every time. However, at some point during the loop (a point that varies from run to run), the result of the conversion changes, from correct (0&#120;0.0000008p-1022) to incorrect (0&#120;0.0000008000001p-1022). And once the conversion changes, it never changes back (for as long as I&#8217;ve cared to extend the loop, that is).</p>
<p>Here&#8217;s an example of three consecutive runs of the program:</p>
<p><strong>First Run:</strong></p>
<pre class="indented">
Iteration 103 converts as 0x0.0000008p-1022
Iteration 104 converts as 0x0.0000008000001p-1022
</pre>
<p><strong>Second Run:</strong></p>
<pre class="indented">
Iteration 105 converts as 0x0.0000008p-1022
Iteration 106 converts as 0x0.0000008000001p-1022
</pre>
<p><strong>Third Run:</strong></p>
<pre class="indented">
Iteration 90 converts as 0x0.0000008p-1022
Iteration 91 converts as 0x0.0000008000001p-1022
</pre>
<h2>Environment</h2>
<p>I ran this on a 32-bit Intel machine. The problem occurs under both Windows and Linux. These are the versions of Java I&#8217;m using:</p>
<p><strong>Windows</strong></p>
<pre class="indented">
<strong>java -version</strong>
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)

<strong>javac -version</strong>
javac 1.6.0_24
</pre>
<p><strong>Linux</strong></p>
<pre class="indented">
<strong>java -version</strong>
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.4) (6b20-1.9.4-0ubuntu1)
OpenJDK Client VM (build 19.0-b09, mixed mode, sharing)

<strong>javac -version</strong>
javac 1.6.0_20
</pre>
<h2>The Failing Code In doubleValue()</h2>
<p>I traced the execution of the FloatingDecimal class with Eclipse and zeroed in on this section of code in <em>doubleValue()</em>:</p>
<pre class="indented">
<strong>double t = dValue * tiny10pow[j];</strong>
if ( t == 0.0 ){
  /*
   * It did underflow.
   * Look more closely at the result.
   * If the exponent is just one too small,
   * then use the minimum finite as our estimate
   * value. Else call the result 0.0
   * and punt it.
   * ( I presume this could happen because
   * rounding forces the result here to be
   * an ULP or two less than
   * Double.MIN_VALUE ).
   */
  t = dValue * 2.0;
  t *= tiny10pow[j];
  if ( t == 0.0 ){
    return (isNegative)? -0.0 : 0.0;
  }
  t = Double.MIN_VALUE;
}
dValue = t;
</pre>
<p>Here are the values of the relevant variables after the highlighted line of code is executed, for both a correct conversion and an incorrect conversion:</p>
<p>For a correct conversion:</p>
<pre class="indented">
dValue:       0x1.54fdd80c8bd14p-197
tiny10pow[j]: 0x1.8062864ac6f43p-851
t:            0x0.0000008p-1022
</pre>
<p>For an incorrect conversion:</p>
<pre class="indented">
dValue:       0x1.54fdd80c8bd14p-197
tiny10pow[j]: 0x1.8062864ac6f43p-851
t:            0x0.0000008<span class="highlight_gray4">000001</span>p-1022
</pre>
<p><em>dValue</em> and <em>tiny10pow[j]</em> are the same in each case &#8212; so how could <em>t</em> be different?</p>
<h2>My x87 Theory</h2>
<p>My first theory was that x87 extended-precision arithmetic was coming into play. I thought that the processor had switched from 53-bit to 64-bit precision mode (although I had no theory as to how).</p>
<h3>Calculating dValue * tiny10pow[j] in Extended-Precision Registers</h3>
<p>Let&#8217;s take the double-precision values of <em>dValue</em> and <em>tiny10pow[j]</em> and multiply them by hand:</p>
<p><em>dValue</em> = 0&#120;1.54fdd80c8bd14p-197 =</p>
<p>1.01010100111111011101100000001100100010111101000101 x 2<sup>-197</sup> </p>
<p><em>tiny10pow[j]</em> = 0&#120;1.8062864ac6f43p-851 =</p>
<p>1.1000000001100010100001100100101011000110111101000011 x 2<sup>-851</sup> </p>
<p><em>dValue * tiny10pow[j]</em> = </p>
<p>1.0000000000000000000000000001000000000000000000000000<span class="highlight_gray4">0</span>0001111001<span class="highlight_gray4">0</span>1101<br />
01010100101111111010100101000001111 x 2<sup>-1047</sup></p>
<p>Depending on which precision mode the processor is in, this result has to be rounded, to either 53 or 64 bits (I&#8217;ve highlighted bits 54 and 65, the rounding bits for each case). Let&#8217;s round it both ways.</p>
<h4>53-bit mode</h4>
<p>If the processor is in 53-bit mode, the full-precision result would round to</p>
<p>1.0000000000000000000000000001 x 2<sup>-1047</sup></p>
<p>Since this is subnormal relative to double-precision, it has to be rounded again &#8212; to 52 bits. To see better where bit 53 (the rounding bit) is, let&#8217;s rewrite the 53-bit result in unnormalized <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>:</p>
<p>0.0000000000000000000000001000000000000000000000000000<span class="highlight_gray4">1</span> x 2<sup>-1022</sup></p>
<p>Because of round-half-to-even rounding &#8212; the mode the processor is most certainly in &#8212; this would round to </p>
<p>0.0000000000000000000000001 x 2<sup>-1022</sup></p>
<p>This is the correct answer, so I concluded that 53-bit mode is the normal mode of operation.</p>
<h4>64-bit mode</h4>
<p>If the processor is in 64-bit mode, the full-precision result would round to</p>
<p>1.000000000000000000000000000100000000000000000000000000001111001 x 2<sup>-1047</sup></p>
<p>In unnormalized binary scientific notation this is</p>
<p>0.0000000000000000000000001000000000000000000000000000<span class="highlight_gray4">1</span>000000000000000<br />
00000000000001111001 x 2<sup>-1022</sup></p>
<p>Because the bits from bit 53 onward amount to more than 1/2 ULP, this would round up, to 52 bits, as</p>
<p>0.0000000000000000000000001000000000000000000000000001 x 2<sup>-1022</sup></p>
<p>This matches the <em>incorrect</em> answer, so I concluded that the processor had switched from 53-bit mode to 64-bit mode sometime during the loop.</p>
<h2>Other Pieces to The Puzzle</h2>
<h3>32-Bit vs. 64-Bit, x87 vs. SSE</h3>
<p>I asked <a title="Konstantin Preisser&rsquo;s Web Site" href="http://www.preisser-it.de">Konstantin Prei&szlig;er</a> to run my testcase. He confirmed seeing the same problem on a 32-bit JRE. He thinks that at some point during the loop, just-in-time compilation kicks in, causing native instructions to be executed. I took that as a plausible explanation for how the processor could switch modes.</p>
<p>In subsequent correspondence, Konstantin told me that on a 64-bit JRE, the conversion never changes. That&#8217;s what I expected, since I assumed the 64-bit JRE uses SSE instructions. But what I didn&#8217;t expect was that the conversion is always <em>incorrect</em>: 0&#120;0.0000008000001p-1022.</p>
<p>Konstantin further told me that he ran my testcase on a 32-bit JRE <em>without</em> SSE:</p>
<pre class="indented">
java -XX:UseSSE=0 diffconvert
</pre>
<p>The conversion was always correct and never changed.</p>
<h3>The 53-bit Mode Answer is Correct &#8212; But For the Wrong Reason</h3>
<p>Looking closer at my computation of the 53-bit mode answer, 0.0000000000000000000000001 x 2<sup>-1022</sup>, I realized it was the result of a double rounding on underflow error. If the full-precision result were rounded directly to 52 subnormal bits, it would have rounded up to the <em>incorrect</em> answer:</p>
<p>0.0000000000000000000000001000000000000000000000000001 x 2<sup>-1022</sup>.</p>
<p>This directly rounded result is the same as the result computed with SSE instructions.</p>
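<p>The two rounding paths can be reproduced with exact integer arithmetic. The sketch below is an illustration, not code from FloatingDecimal; the class DoubleRoundingDemo and its helper methods are hypothetical names. It forms the exact product of the two significands, which equals the true product scaled by 2<sup>1152</sup>, then rounds it once (straight to the subnormal grid of multiples of 2<sup>-1074</sup>) and twice (to 53 significant bits first, then to the subnormal grid), using round-half-to-even each time.</p>

```java
import java.math.BigInteger;

// Sketch: reproduce single and double rounding with exact integers.
// DoubleRoundingDemo and its helpers are hypothetical names for illustration.
class DoubleRoundingDemo {
    // 53-bit integer significand (implicit leading 1 included) of a normal double.
    static BigInteger significand(double d) {
        long bits = Double.doubleToLongBits(d);
        return BigInteger.valueOf((bits & ((1L << 52) - 1)) | (1L << 52));
    }

    // Drop the low 'shift' bits of v, rounding half to even.
    static BigInteger roundShift(BigInteger v, int shift) {
        BigInteger q = v.shiftRight(shift);
        BigInteger rem = v.subtract(q.shiftLeft(shift));
        int cmp = rem.compareTo(BigInteger.ONE.shiftLeft(shift - 1));
        boolean roundUp = cmp > 0 || (cmp == 0 && q.testBit(0));
        return roundUp ? q.add(BigInteger.ONE) : q;
    }

    // dValue = s1 * 2^-249 and tiny10pow[j] = s2 * 2^-903, so the exact
    // product is s1 * s2 * 2^-1152.
    static BigInteger exactProduct() {
        return significand(0x1.54fdd80c8bd14p-197)
                .multiply(significand(0x1.8062864ac6f43p-851));
    }

    // One rounding, straight to the subnormal grid (multiples of 2^-1074).
    static double roundedOnce() {
        BigInteger q = roundShift(exactProduct(), 1152 - 1074);
        return Double.longBitsToDouble(q.longValue());
    }

    // Two roundings: to 53 significant bits, then to the subnormal grid.
    static double roundedTwice() {
        BigInteger p = exactProduct();
        int drop = p.bitLength() - 53;
        BigInteger r53 = roundShift(p, drop); // value is r53 * 2^(drop - 1152)
        BigInteger q = roundShift(r53, 1152 - 1074 - drop);
        return Double.longBitsToDouble(q.longValue());
    }

    public static void main(String[] args) {
        System.out.println("rounded once:  " + Double.toHexString(roundedOnce()));
        System.out.println("rounded twice: " + Double.toHexString(roundedTwice()));
    }
}
```

<p>The sketch assumes both inputs are normal doubles and that the 53-bit rounding does not carry into a 54th bit, which holds for this example. The second rounding in roundedTwice() lands on an exact tie, which is why round-half-to-even pulls it down to the other answer.</p>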
<h2>My x87/SSE Theory</h2>
<p>So now my theory is that both x87 and SSE come into play, with 53-bit mode x87 computations being done before a switchover to SSE &ldquo;hot spot&rdquo; computations. (I no longer think that 64-bit mode is involved; the 64-bit mode result is just coincidentally the same as the SSE result &#8212; it avoids the double rounding error as well.)</p>
<h3>Further Support for My New Theory</h3>
<p>If there were a way to step through assembler code and inspect registers with Eclipse for Java, I could verify my theory directly. Not being able to do that, I wrote two small C programs to do the example computation of dValue * tiny10pow[j]: one to do it with x87 instructions in 53-bit mode, and the other to do it with SSE instructions. I ran both programs on Linux, and traced them using the gdb debugger.</p>
<h4>Computation In x87 53-bit Mode (conv_x87_53.c)</h4>
<p>This program, when compiled with this command </p>
<pre class="indented">
gcc -o conv_x87_53 conv_x87_53.c
</pre>
<p>produces the correct conversion, 0&#120;0.0000008p-1022:</p>
<pre class="indented">
#include &lt;stdio.h&gt;
int main (void)
{
 double dValue, tiny10pow, t;
 volatile short oldcw, newcw;

 /* Set 53-bit mode */
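 asm (&quot;fstcw %0\n&quot; : &quot;=m&quot;(oldcw));  /* fstcw writes its operand, so it is an output */
 newcw = (oldcw &#038; 0xCFF) | 0x200;      /* precision control (bits 8-9) = 10b: 53 bits */
 asm (&quot;fldcw %0\n&quot; : : &quot;m&quot;(newcw));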

 dValue = 0x1.54fdd80c8bd14p-197;
 tiny10pow = 0x1.8062864ac6f43p-851;
 t = dValue * tiny10pow;

 printf(&quot;%a\n&quot;,t);
}
</pre>
<p>The result of the x87 multiplication, 0&#120;3be88000000800000000, is put in FPU register st0. This is the IEEE 754 extended-precision encoding of the 53-bit rounded result. It gets rounded again, to 52 bits, when stored to the variable <em>t</em> in memory. This results in a double rounding on underflow error.</p>
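<p>The 80-bit register value can be split into its fields to confirm that it holds the 53-bit rounded result. This is a small sketch with a hypothetical class name, X87Decode; it assumes the standard x87 double extended layout of a sign bit, a 15-bit exponent biased by 16383, and a 64-bit significand with an explicit integer bit.</p>

```java
import java.math.BigInteger;

// Sketch: split an 80-bit x87 extended-precision value into its fields.
// X87Decode is a hypothetical helper; RAW is the st0 value quoted above.
class X87Decode {
    static final BigInteger RAW = new BigInteger("3be88000000800000000", 16);

    // Bits 64-78 hold the exponent, biased by 16383.
    static int unbiasedExponent() {
        return (RAW.shiftRight(64).intValue() & 0x7fff) - 16383;
    }

    // Bits 0-63 hold the significand; bit 63 is the explicit integer bit.
    static long significand() {
        return RAW.longValue(); // keeps the low 64 bits
    }

    public static void main(String[] args) {
        System.out.println("exponent:    " + unbiasedExponent());
        System.out.println("significand: " + Long.toBinaryString(significand()));
    }
}
```

<p>The exponent decodes to -1047, and the significand has only two bits set, bit 63 and bit 35, which is 1 + 2<sup>-28</sup>. That is the value 1.0000000000000000000000000001 x 2<sup>-1047</sup> computed by hand in the 53-bit mode section.</p>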
<h4>Computation With SSE (conv_sse.c)</h4>
<p>This program, when compiled with this command</p>
<pre class="indented">
gcc -msse -mfpmath=sse -march=pentium4 -o conv_sse conv_sse.c
</pre>
<p>produces the incorrect conversion, 0&#120;0.0000008000001p-1022:</p>
<pre class="indented">
#include &lt;stdio.h&gt;
int main (void)
{
 double dValue, tiny10pow, t;

 dValue = 0x1.54fdd80c8bd14p-197;
 tiny10pow = 0x1.8062864ac6f43p-851;
 t = dValue * tiny10pow;

 printf(&quot;%a\n&quot;,t);
}
</pre>
<p>(This is the same program as above, but without the code to set the precision mode.)</p>
<p>The result of the multiplication is put in SSE register xmm0. It represents the final result, as no further rounding can occur when it&#8217;s stored to variable <em>t</em> in memory.</p>
<h2>Summary</h2>
<p>There is a bug in Java&#8217;s decimal to floating-point conversion algorithm, as demonstrated by my example. Interestingly, the bug is masked when the multiplication is done with x87 instructions; the correct result comes accidentally from a double rounding error in hardware. With SSE instructions, the bug surfaces; the hardware faithfully does the calculation it was asked to do.</p>
<p>This bug makes Java&#8217;s decimal to floating-point conversion algorithm sensitive to the floating-point architecture on which it runs. And with just-in-time compilation on a 32-bit processor, both architectures come into play in one program, giving two conversion results for the same decimal string.</p>
<h3>Discussion</h3>
<p>I described an anomaly in the conversion of one tiny decimal number; no one will lose sleep over that. But are there implications for Java&#8217;s floating-point calculations in general?</p>
<h2>How I Found This</h2>
<p>In my article &ldquo;<a title="Read Rick Regan's Article &ldquo;Incorrectly Rounded Subnormal Conversions in Java&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-subnormal-conversions-in-java/">Incorrectly Rounded Subnormal Conversions in Java</a>&rdquo; I used 2<sup>-1047</sup> + 2<sup>-1075</sup> as an example of incorrect conversion. I found it using an <a title="OpenJDK JDK6 Changeset" href="http://hg.openjdk.java.net/jdk6/jdk6/jdk/rev/d00080320e43">OpenJDK testcase</a>, which I modified to generate conversions randomly. When I looked back at the unmodified testcase&#8217;s output, I realized that the conversion of 2<sup>-1047</sup> + 2<sup>-1075</sup> was not listed as incorrect.</p>
<p>To see what was going on, I hardcoded the decimal string in a small test program and ran it; it converted correctly as well. Then I put the conversion in a loop, to see if it changed &#8212; and it did!</p>
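<p>The 760-digit decimal string does not have to be typed or pasted by hand; it can be generated exactly. The sketch below is one way to do it, with hypothetical names (TestString, twoToMinus(), decimal()). It uses the identity 2<sup>-n</sup> = 5<sup>n</sup> / 10<sup>n</sup>, so each power of two is a BigDecimal with unscaled value 5<sup>n</sup> and scale n.</p>

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Sketch: build the exact decimal string for 2^-1047 + 2^-1075.
// TestString, twoToMinus, and decimal are hypothetical names for illustration.
class TestString {
    // 2^-n equals 5^n / 10^n: unscaled value 5^n with scale n.
    static BigDecimal twoToMinus(int n) {
        return new BigDecimal(BigInteger.valueOf(5).pow(n), n);
    }

    // BigDecimal.toString() uses scientific notation for a value this small.
    static String decimal() {
        return twoToMinus(1047).add(twoToMinus(1075)).toString();
    }

    public static void main(String[] args) {
        System.out.println(decimal());
    }
}
```

<p>The resulting string can be passed directly to Double.parseDouble() in place of the hardcoded literal.</p>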
<h2>Bug Reports</h2>
<p>To address the nondeterministic conversion, I wrote this bug report, calling for the use of the &lsquo;strictfp&rsquo; keyword in the FloatingDecimal class (see the discussion in the comments below): <a title="Java Bug Report &ldquo;Double.parseDouble() returns architecture dependent results&rdquo;" href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7021568">Bug ID 7021568: Double.parseDouble() returns architecture dependent results</a>.</p>
<p>As for the incorrect conversion, I think it is covered by the bug report I wrote previously (I added a comment singling out this case): <a title="Java Bug Report &ldquo;Double.parseDouble() converts some subnormal numbers incorrectly&rdquo;" href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7019078">Bug ID 7019078: Double.parseDouble() converts some subnormal numbers incorrectly</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/nondeterministic-floating-point-conversions-in-java/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Incorrectly Rounded Subnormal Conversions in Java</title>
		<link>http://www.exploringbinary.com/incorrectly-rounded-subnormal-conversions-in-java/</link>
		<comments>http://www.exploringbinary.com/incorrectly-rounded-subnormal-conversions-in-java/#comments</comments>
		<pubDate>Wed, 16 Feb 2011 18:59:43 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=322</guid>
		<description><![CDATA[While verifying the fix to the Java 2.2250738585072012e-308 bug I found an OpenJDK testcase for verifying conversions of edge case subnormal double-precision numbers. I ran the testcase, expecting it to work &#8212; but it failed! I determined it fails because Java converts some subnormal numbers incorrectly. (By the way, this bug exists in prior versions [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-subnormal-conversions-in-java/">Incorrectly Rounded Subnormal Conversions in Java</a></p>
]]></description>
			<content:encoded><![CDATA[<p>While verifying the <a title="Read Rick Regan's Article &ldquo;FPUpdater Fixes the Java 2.2250738585072012e-308 Bug&rdquo;" href="http://www.exploringbinary.com/fpupdater-fixes-the-java-2-2250738585072012e-308-bug/">fix</a> to the <a title="Read Rick Regan's Article &ldquo;Java Hangs When Converting 2.2250738585072012e-308&rdquo;" href="http://www.exploringbinary.com/java-hangs-when-converting-2-2250738585072012e-308/">Java 2.2250738585072012e-308 bug</a>, I found an <a title="OpenJDK JDK6 Changeset" href="http://hg.openjdk.java.net/jdk6/jdk6/jdk/rev/d00080320e43">OpenJDK testcase</a> for verifying conversions of edge case subnormal double-precision numbers. I ran the testcase, expecting it to work &#8212; but <a title="Read Rick Regan's Comment on His Article &ldquo;FPUpdater Fixes the Java 2.2250738585072012e-308 Bug&rdquo;" href="http://www.exploringbinary.com/fpupdater-fixes-the-java-2-2250738585072012e-308-bug/#comment-4800">it failed</a>! I determined that it fails because Java converts some subnormal numbers incorrectly.</p>
<p>(By the way, this bug exists in prior versions of Java &#8212; it has nothing to do with the fix.)</p>
<p><span id="more-322"></span></p>
<h2>The OpenJDK Testcase</h2>
<p>The OpenJDK testcase is a Java class that tries to verify that a decimal number &ldquo;just above&rdquo; or &ldquo;just below&rdquo; a subnormal power of two rounds to that subnormal power of two (the subnormal powers of two are <a title="Read Rick Regan's article &ldquo;What Powers of Two Look Like Inside a Computer&rdquo;" href="http://www.exploringbinary.com/what-powers-of-two-look-like-inside-a-computer/#double-powers-of-two">2<sup>-1023</sup> through 2<sup>-1074</sup></a>). Precisely, it tests the numbers exactly 2<sup>-1075</sup> above and below each of those <a title="Read Rick Regan's article &ldquo;The Powers of Two&rdquo;" href="http://www.exploringbinary.com/the-powers-of-two/">negative powers of two</a>. 2<sup>-1075</sup> is half a unit in the last place (ULP) &#8212; half of 2<sup>-1074</sup>, the value of a ULP for all subnormal numbers.</p>
<p>The testcase uses <em>Math.scalb()</em> to generate a double representing each power of two. From each double, it uses the BigDecimal class to generate the decimal strings representing the numbers 1/2 ULP above and 1/2 ULP below the power of two. It then calls <em>Double.parseDouble()</em> to convert each pair of strings, making sure they both convert to the original power of two.</p>
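<p>The procedure just described can be sketched roughly as follows. This is a minimal illustration, not the OpenJDK testcase itself: the class name <em>HalfUlpStrings</em> and its helper methods are hypothetical, and the loop stops at 2<sup>-1073</sup> because 2<sup>-1074</sup> has an odd significand (1), so its half-ULP neighbors do not round back to it under round-half-to-even.</p>

```java
import java.math.BigDecimal;

// Hypothetical sketch; names are illustrative, not taken from the OpenJDK testcase.
class HalfUlpStrings {
    // 2^-1075: half of Double.MIN_VALUE (2^-1074), the ULP of every subnormal double.
    static final BigDecimal HALF_ULP =
        new BigDecimal(Double.MIN_VALUE).multiply(new BigDecimal("0.5"));

    // Exact decimal string for the number half a ULP above d.
    static String halfUlpAbove(double d) {
        return new BigDecimal(d).add(HALF_ULP).toString();
    }

    // Exact decimal string for the number half a ULP below d.
    static String halfUlpBelow(double d) {
        return new BigDecimal(d).subtract(HALF_ULP).toString();
    }

    public static void main(String[] args) {
        // Subnormal powers of two with an even significand: 2^-1023 through 2^-1073.
        for (int e = -1023; e >= -1073; e--) {
            double pow2 = Math.scalb(1.0, e);
            boolean aboveOk = Double.parseDouble(halfUlpAbove(pow2)) == pow2;
            boolean belowOk = Double.parseDouble(halfUlpBelow(pow2)) == pow2;
            // Report only the powers of two whose neighbors convert incorrectly.
            if (!aboveOk || !belowOk) {
                System.out.println("2^" + e + ": above ok = " + aboveOk
                    + ", below ok = " + belowOk);
            }
        }
    }
}
```

<p>Whether this prints anything depends on the Java version it runs on; a correctly rounding <em>Double.parseDouble()</em> prints nothing.</p>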
<p>Subnormal double-precision numbers have from <a title="Read Rick Regan's article &ldquo;A Simple C Program That Prints 2,098 Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-simple-c-program-that-prints-2098-powers-of-two/">1023 to 1074 decimal places</a>. The BigDecimal strings, which are expressed in scientific notation, represent numbers with 1075 decimal places. They are one decimal place too long and must be rounded to 1074 places.</p>
<p>In terms of binary, <a title="Read Rick Regan's Article &ldquo;Why &ldquo;Volatile&rdquo; Fixes the 2.2250738585072011e-308 Bug&rdquo;" href="http://www.exploringbinary.com/why-volatile-fixes-the-2-2250738585072011e-308-bug/">subnormal numbers have 1 to 52 significant bits</a>, unlike normalized numbers, which have 53. The BigDecimal strings, when converted to binary, are 53 significant bits long, with bit 53 always 1; they must be rounded to 52 bits. These are tough <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/#halfway-case">halfway cases</a> that Java sometimes rounds incorrectly.</p>
<h2>My Testcase</h2>
<p>I modified the OpenJDK testcase to generate 1/2 ULP tests for randomly generated subnormal numbers, not just powers of two. I discovered that any decimal number halfway between any two subnormal numbers may be rounded incorrectly.</p>
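<p>A randomized check along these lines might look like the sketch below. It is an assumption-laden illustration rather than my actual testcase: the class <em>RandomSubnormalHalfway</em>, its methods, and the fixed seed are all hypothetical. It relies on the fact that the raw bits of a nonnegative subnormal double are just its 52-bit significand m, so the double's value is m &times; 2<sup>-1074</sup>. Only &ldquo;halfway above&rdquo; strings are generated, since the point half a ULP below m is the same as the point half a ULP above m - 1.</p>

```java
import java.math.BigDecimal;
import java.util.Random;

// Hypothetical sketch; names and the seed are illustrative assumptions.
class RandomSubnormalHalfway {
    static final BigDecimal ULP = new BigDecimal(Double.MIN_VALUE);          // 2^-1074
    static final BigDecimal HALF_ULP = ULP.multiply(new BigDecimal("0.5"));  // 2^-1075

    // Exact decimal string for (m + 1/2) * 2^-1074, where m is a 52-bit significand.
    static String halfwayAbove(long m) {
        return ULP.multiply(BigDecimal.valueOf(m)).add(HALF_ULP).toString();
    }

    // Significand that round-half-to-even selects for the tie at m + 1/2.
    static long roundHalfEven(long m) {
        return (m & 1) == 0 ? m : m + 1;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);   // fixed seed so runs are repeatable
        long mask = (1L << 52) - 1;    // low 52 bits: the subnormal significand
        int failures = 0;
        for (int i = 0; i < 1000; i++) {
            long m = rng.nextLong() & mask;
            // If m + 1 reaches 2^52, the bits denote the smallest normal double,
            // which is still the correct neighbor to round to.
            double expected = Double.longBitsToDouble(roundHalfEven(m));
            double actual = Double.parseDouble(halfwayAbove(m));
            if (actual != expected) {
                failures++;
                System.out.println("significand " + m + ": got "
                    + Double.toHexString(actual) + ", expected "
                    + Double.toHexString(expected));
            }
        }
        System.out.println(failures + " incorrect conversions out of 1000");
    }
}
```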
<h2>Examples of Incorrect Conversions</h2>
<p>I ran my testcase on a 32-bit Windows XP system, using Java SE 6 &#8212; both updates 23 and 24. Below, I&#8217;ll show four examples of incorrectly rounded conversions.</p>
<h3>Example 1: Half-ULP Above a Power of Two</h3>
<p>Java converts this 1075-digit decimal number (315 leading zeros plus 760 significant digits) incorrectly:</p>
<pre class="indented">
6.6312368714697582767853966302759672433990999473553031442499717587
362866301392654396180682007880487441059604205526018528897150063763
256665955396033303618005191075917832333584923372080578494993608994
251286407188566165030934449228547591599881603044399098682919739314
266256986631577498362522745234853124423586512070512924530832781161
439325697279187097860044978723221938561502254152119972830784963194
121246401117772161481107528151017752957198119743384519360959074196
224175384736794951486324803914359317679811223967034438033355297560
033532098300718322306892013830155987921841729099279241763393155074
022348361207309147831684007154624400538175927027662135590421159867
638194826541287705957668068727833491469671712939495988506756821156
96218943412532098591327667236328125E-316
</pre>
<p>This is the number 2<sup>-1047</sup> + 2<sup>-1075</sup>. Shown in <em>unnormalized</em> <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>, it looks like</p>
<p>0.0000000000000000000000001000000000000000000000000000<span class="highlight_gray4">1</span> x 2<sup>-1022</sup>.</p>
<p>(Bit 53, the rounding bit, is highlighted.)</p>
<p>By the <a title="Wikipedia article on rounding algorithms" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round-half-to-even</a> rule, it should round down to 2<sup>-1047</sup>, which equals</p>
<p>0.0000000000000000000000001000000000000000000000000000 x 2<sup>-1022</sup>.</p>
<p>(I included superfluous trailing zeros to pad everything out to 52 bits so that the alignment is obvious.)</p>
<p>Instead, Java converts it to this number, one ULP <em>above</em> its correctly rounded value:</p>
<p>0.0000000000000000000000001000000000000000000000000001 x 2<sup>-1022</sup>.</p>
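<p>The 52-bit patterns shown above can be reproduced from a double&#8217;s raw bits. The following is a small sketch under that assumption; the class <em>SubnormalBits</em> and the method <em>toUnnormalizedBinary()</em> are hypothetical names introduced here for illustration.</p>

```java
// Hypothetical helper; prints a subnormal double as 0.<52 bits> x 2^-1022.
class SubnormalBits {
    static String toUnnormalizedBinary(double d) {
        long bits = Double.doubleToLongBits(d);
        // A nonnegative subnormal (or zero) has a clear sign bit and a zero exponent field.
        if ((bits >>> 52) != 0) {
            throw new IllegalArgumentException("not a nonnegative subnormal: " + d);
        }
        String frac = Long.toBinaryString(bits);
        StringBuilder sb = new StringBuilder("0.");
        // Left-pad with zeros so that all 52 fraction bits are shown.
        for (int i = frac.length(); i < 52; i++) {
            sb.append('0');
        }
        return sb.append(frac).append(" x 2^-1022").toString();
    }

    public static void main(String[] args) {
        double correct = Math.scalb(1.0, -1047);   // the correctly rounded result
        double oneUlpAbove = Math.nextUp(correct); // the result one ULP too high
        System.out.println(toUnnormalizedBinary(correct));
        System.out.println(toUnnormalizedBinary(oneUlpAbove));
    }
}
```

<p>Passing the result of <em>Double.parseDouble()</em> to this method shows directly which of the two patterns a given Java version produces.</p>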
<h3>Example 2: Half-ULP Below a Power of Two</h3>
<p>Java converts this 1075-digit decimal number (318 leading zeros plus 757 significant digits) incorrectly:</p>
<pre class="indented">
3.2378839133029012895883524125015321748630376694231080599012970495
523019706706765657868357425877995578606157765598382834355143910841
531692526891905643964595773946180389283653051434639551003566966656
292020173313440317300443693602052583458034314716600326995807313009
548483639755486900107515300188817581841745696521731104736960227499
346384253806233697747365600089974040609674980283891918789639685754
392222064169814626901133425240027243859416510512935526014211553334
302252372915238433223313261384314778235911424088000307751706259156
707286570031519536642607698224949379518458015308952384398197084033
899378732414634842056080000272705311068273879077914449185347715987
501628125488627684932015189916680282517302999531439241685457086639
13273994694463908672332763671875E-319
</pre>
<p>This is the number 2<sup>-1058</sup> &#8211; 2<sup>-1075</sup>:</p>
<p>0.0000000000000000000000000000000000001111111111111111<span class="highlight_gray4">1</span> x 2<sup>-1022</sup>.</p>
<p>Again, by half-to-even rounding, it should round up to 2<sup>-1058</sup>, which equals</p>
<p>0.0000000000000000000000000000000000010000000000000000 x 2<sup>-1022</sup>.</p>
<p>Instead, Java converts it to this number, one ULP <em>below</em> its correctly rounded value:</p>
<p>0.0000000000000000000000000000000000001111111111111111 x 2<sup>-1022</sup>.</p>
<h3>Example 3: Half-ULP Above a Non-Power of Two</h3>
<p>Java converts this 1075-digit decimal number (309 leading zeros plus 766 significant digits) incorrectly:</p>
<pre class="indented">
6.9533558078476771059728052155218916902221198171459507544162056079
800301315496366888061157263994418800653863998640286912755395394146
528315847956685600829998895513577849614468960421131982842131079351
102171626549398024160346762138294097205837595404767869364138165416
212878432484332023692099166122496760055730227032447997146221165421
888377703760223711720795591258533828013962195524188394697705149041
926576270603193728475623010741404426602378441141744972109554498963
891803958271916028866544881824524095839813894427833770015054620157
450178487545746683421617594966617660200287528887833870748507731929
971029979366198762266880963149896457660004790090837317365857503352
620998601508967187744019647968271662832256419920407478943826987518
09812609536720628966577351093292236328125E-310
</pre>
<p>This is the number (2<sup>-1027</sup> + 2<sup>-1066</sup>) + 2<sup>-1075</sup>:</p>
<p>0.0000100000000000000000000000000000000000000100000000<span class="highlight_gray4">1</span> x 2<sup>-1022</sup>.</p>
<p>It should round down, converting to 2<sup>-1027</sup> + 2<sup>-1066</sup>, which equals</p>
<p>0.0000100000000000000000000000000000000000000100000000 x 2<sup>-1022</sup>.</p>
<p>Instead, Java converts it to this number, one ULP <em>above</em> its correctly rounded value:</p>
<p>0.0000100000000000000000000000000000000000000100000001 x 2<sup>-1022</sup>.</p>
<h3>Example 4: Half-ULP Below a Non-Power of Two</h3>
<p>Java converts this 1075-digit decimal number (318 leading zeros plus 757 significant digits) incorrectly:</p>
<pre class="indented">
3.3390685575711885818357137012809439119234019169985217716556569973
284403145596153181688491490746626090999981130094655664268081703784
340657229916596426194677060348844249897410807907667784563321682004
646515939958173717821250106683466529959122339932545844611258684816
333436749050742710644097630907080178565840197768788124253120088123
262603630354748115322368533599053346255754042160606228586332807443
018924703005556787346899784768703698535494132771566221702458461669
916553215355296238706468887866375289955928004361779017462862722733
744717014529914330472578638646014242520247915673681950560773208853
293843223323915646452641434007986196650406080775491621739636492640
497383622906068758834568265867109610417379088720358034812416003767
05491726170293986797332763671875E-319
</pre>
<p>This is the number (2<sup>-1058</sup> + 2<sup>-1063</sup>) &#8211; 2<sup>-1075</sup>:</p>
<p>0.0000000000000000000000000000000000010000011111111111<span class="highlight_gray4">1</span> x 2<sup>-1022</sup>.</p>
<p>It should round up, converting to 2<sup>-1058</sup> + 2<sup>-1063</sup>, which equals</p>
<p>0.0000000000000000000000000000000000010000100000000000 x 2<sup>-1022</sup>.</p>
<p>Instead, Java converts it to this number, one ULP <em>below</em> its correctly rounded value:</p>
<p>0.0000000000000000000000000000000000010000011111111111 x 2<sup>-1022</sup>.</p>
<h2>Bug Report</h2>
<p>I wrote a Java bug report for this problem: <a title="Java Bug Report &ldquo;Double.parseDouble() converts some subnormal numbers incorrectly&rdquo;" href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7019078">Bug ID 7019078: Double.parseDouble() converts some subnormal numbers incorrectly</a>.</p>
<h3>A Related Existing Bug</h3>
<p><a title="Java Bug Report &ldquo;Parsing doubles fails to follow IEEE for largest decimal that should yield 0&rdquo;" href="http://bugs.sun.com/view_bug.do;jsessionid=25194dc266f1b860256a012b3405?bug_id=4396272">Java Bug 4396272</a> reports an error in the conversion of 2<sup>-1075</sup>: it should round to zero (by half-to-even rounding, since it lies exactly halfway between 0 and 2<sup>-1074</sup>, and 0 is the neighbor with the even significand), but instead rounds to 2<sup>-1074</sup>.</p>
<h2>Discussion</h2>
<p>All of the incorrect conversions I found were off by one ULP. This is consistent with other languages &#8212; like <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Visual C++</a> and <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in GCC and GLIBC&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">GCC C</a> &#8212; that don&#8217;t use <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s correctly rounding strtod() function</a>. However, Java&#8217;s <a title="FloatingDecimal.java" href="http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html">FloatingDecimal class</a> is clearly modeled on David Gay&#8217;s code, so I assume it was Java&#8217;s intent to do all its conversions correctly.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-subnormal-conversions-in-java/">Incorrectly Rounded Subnormal Conversions in Java</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/incorrectly-rounded-subnormal-conversions-in-java/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

