# The Answer is One (Unless You Use Floating-Point)

Copyright © 2008-2016 Exploring Binary

http://www.exploringbinary.com/the-answer-is-one-unless-you-use-floating-point/

What does this C function do?

double f(double a) { double b, c; b = 10*a - 10; c = a - 0.1*b; return (c); }

Based solely on reading the code, you’ll conclude that it always returns 1: c = a – 0.1*(10*a – 10) = a – (a-1) = 1. But if you *execute* the code, you’ll find that it may or may not return 1, depending on the input. If you know anything about binary floating-point arithmetic, that won’t surprise you; what *might* surprise you is how far from 1 the answer can be — as far away as a large negative number!

## Examples of Incorrect Results

I tested my function in Visual C++; here’s a sampling of calls that gave an answer other than 1:

Function Call | Result (to 17 Digits) |
---|---|

f(5.1) | 0.99999999999999911 |

f(91.3) | 0.99999999999998579 |

f(451.9) | 0.99999999999994316 |

f(7341.4) | 0.99999999999909051 |

f(51367.7) | 0.99999999999272404 |

f(897451.7) | 0.99999999988358468 |

f(1923556.4) | 0.99999999976716936 |

f(59567771.9) | 0.9999999925494194 |

f(176498634.7) | 0.99999997019767761 |

f(2399851001.7) | 0.9999995231628418 |

f(60098761442.7) | 0.99999237060546875 |

f(772555211114.1) | 0.9998779296875 |

f(1209983553611.9) | 0.999755859375 |

f(59871426773404.9) | 0.9921875 |

f(190776306245331.2) | 0.96875 |

f(2987154209766221.6) | 0.5 |

f(19843566622234755.9) | 0 |

f(719525522284533115.3) | -128 |

f(8399871243556765103.9) | 0 |

f(39847765103525225199.1) | 0 |

f(553774558711019983333.9) | -65536 |

(I got similar, but not identical results, when using Linux gcc and MinGW gcc.)

You can see that, as the input number gets bigger, the result gets farther from 1. When the input number gets big enough though, things get really wacky: zero and negative integers are returned!

## Why the Calculations Go Wrong

The incorrect answers are due to the limits of finite floating-point precision: decimal input numbers, as well as the results of binary calculations done on them, are approximated with nearby floating-point numbers. Let’s see how this plays out in three of the example function calls:

- f(5.1)
- f(19843566622234755.9)
- f(553774558711019983333.9)

For each example, I’ll show the double-precision value of each variable and intermediate result, as printed by David Gay’s dtoa() function.

### f(5.1)

a = 5.0999999999999996447286321199499070644378662109375 b = 51 - 10 = 41 c = 5.0999999999999996447286321199499070644378662109375 - 4.10000000000000053290705182007513940334320068359375 ---------------------------------------------------- 0.99999999999999911182158029987476766109466552734375

The computed answer is close to 1, accurate to 15 decimal places.

### f(19843566622234755.9)

a = 19843566622234756 b = 198435666222347552 - 10 = 198435666222347552 c = 19843566622234756 - 19843566622234756 = 0

The computed answer is 0.

The thing to notice is that the input has such a large integer part — it would require 55 bits — that it exceeds the limit of double-precision, which is 53 bits. This means that *a*’s integer part is represented inexactly, and its fractional part is not represented at all! This inaccuracy carries over into the calculations, which result in large, inaccurate integers — and a 0 result caused by catastrophic cancellation.

### f(553774558711019983333.9)

a = 553774558711019995136 b = 5537745587110200475648 - 10 = 5537745587110200475648 c = 553774558711019995136 - 553774558711020060672 = -65536

The computed answer is -65536.

As in the prior example, the large integer part is responsible for the inaccurate calculations. This time, however, the floating-point representation of 0.1**b* is greater than the floating-point representation of *a*, causing a negative integer result.

## Is Getting an Answer of One Just Luck?

In some sense, getting an answer of 1 with floating-point is just luck; the inaccuracies have to cancel out. Let’s look at a function call that gives the *correct* answer, f(2.1):

a = 2.100000000000000088817841970012523233890533447265625 b = 21 - 10 = 11 c = 2.100000000000000088817841970012523233890533447265625 - 1.100000000000000088817841970012523233890533447265625 ----------------------------------------------------- 1

*a* and 0.1**b* have the same fractional part, which cancels out in the subtraction.

## Discussion

The function I analyzed is admittedly contrived, but it illustrates the inherent limits of floating-point calculations (even simple ones).

November 23rd, 2010 at 3:46 pm

It is not luck. It is as well defined as any computer arithmetic, but most programmers do not understand why or how floating point works.

This is the reference everyone who does not understand this blog post should read.

http://docs.sun.com/source/806-3568/ncg_goldberg.html

It is a tough read on your first pass, but every programmer should understand these issues as they advance in skill.

November 23rd, 2010 at 5:07 pm

You probably got different results with Linux GCC and MinGW because of the different default rounding precision of the x87 FPU between Linux vs. Windows (or SSE2 vs. x87).

November 23rd, 2010 at 5:13 pm

Chris,

I didn’t mean luck as in random or non-deterministic. I meant it loosely, as in if you get 1, you’re lucky, because you shouldn’t have expected it as the answer in the first place (it is only one of many possible answers).

As for “WECSSKAFPA,” I’d say the second and third passes are not much easier.

November 23rd, 2010 at 6:25 pm

Matt,

It’s the rounding mode. I just re-ran the examples in Visual C++ using 64-bit mode and it gave the same results as gcc (both Linux and MinGW).

Thanks for pointing this out. (I hadn’t looked into it, but I had thought it might have been due to the compilers optimizing the function in different ways.)

November 24th, 2010 at 9:26 am

to a physicist, this is actually reasonably sensible. if you want a function that returns 1.0, write “return (double)1.0;”. If you want a function that compares a number of unknown order of magnitude to 1, this is what you get (i.e. undefined behaviour.)

10^something – 1 does (basically) evaluate to 10^something, to all intents and purposes.

You only need about 100 significant figures of accuracy to do the most accurate calculations EVER (e.g. how long is the belt that goes around the universe only needs 100 s.f.s of pi), so if you write a sensible program (which only ever +/-s numbers of similar orders of magnitude) then you will end up with sensible answers.

Floating point numbers are not very useful most of the time, and you basically normally want a “decimal” number for most FP calculations in the real world anyway, which you can represent as an int trivially.

This is an example of “look we divided by zero and weird stuff happened”, but with floating point numbers; funny, but not all that surprising or useful.

November 24th, 2010 at 10:06 am

tehwalrus,

A lot of people understand this behavior (you visitors from reddit) and a lot of people don’t (users on stackoverflow asking “why does 0.1 + 0.2 = 0.30000000000000004?”). I’m hoping the latter group finds this article and learns something from it.

November 25th, 2010 at 2:58 pm

This is indeed something each computer programmer should understand and know about. I’m a bit disappointed though that GCC & co (which are arguably more advanced compilers) did not just optimize the function to “return 1.0″. Did you compile with optimisations turned on?

tehwalrus,

The problem here is not really a linear loss of some accuracy, which, as you describe, would be rather non-critical. However, if you use loops to model say, recurrence relations or similar things, the error might accumulate so that in the end your result is not only wrong by a few decimal places, it’s infinitely wrong (when your error grows so large you get an overflow), so it becomes completely unusable. Thus error analysis is an important topic in computational science, and it is often crucial for real-world applications to do algebraic transformations to formulas to make them less susceptible to rounding errors, and/or to use other methods like turning a recurrence relation into a closed-form function to improve numerical stability (though it’s not guarantueed that a closed-form function actually does have a better numerical stability when compared to non-closed form variants)

November 25th, 2010 at 5:49 pm

Amadiro,

I hadn’t turned on optimizations, but I just tried it now (on MinGW); o1, o2, o3, ofast, and -ffast-math had no apparent effect (all got the same results as I reported).

February 1st, 2011 at 6:23 am

FYI – your link to WECSSKAFPA has rotted, due to Oracle’s crapping of Sun’s website. There appears to be a copy available at:

http://www.validlab.com/goldberg/paper.pdf

February 1st, 2011 at 7:30 am

Thanks Glenn. (I wonder why they would kill the link to one of the most cited papers on the Web.)

October 18th, 2011 at 10:54 am

Amadiro, you should be glad, not disappointed, that gcc did not “optimize” the function to return 1.0, since that would be wrong and non-conformant…