I have been writing about the spacing of decimal and binary floating-point numbers, and about how their relative spacing determines whether numbers round-trip between those two bases. I’ve stated an inequality that captures the required spacing, and from it I have derived formulas that specify the number of digits required for round-trip conversions. I have not proven that this inequality holds, but I will prove “half” of it here. (I’m looking for help to prove the other half.)
Base Conversion Theorem
David W. Matula states and proves this theorem, which governs round-trip conversions: Given n base B digits and m base v digits, conversions from base B round-trip through base v if and only if Bn < vm-1.
The inequality specifies the precision required of both bases to make the gaps in the original base (B) greater than or equal to the gaps in the intermediate base (v), the required condition for round-trip conversions.
Proof of the theorem requires two steps:
- If Bn < vm-1, then conversions from B round-trip through v.
- If conversions from B round-trip through v, then Bn < vm-1.
The first step shows that the inequality is a sufficient condition for round-tripping, and the second step shows that the inequality is a necessary condition for round-tripping. In other words: if the inequality holds, numbers round-trip; if numbers round-trip, the inequality holds. The full proof can be found here.
(Matula states his theorem in different terms and symbols than I’ve used. For example, he uses the phrase one-to-one conversion mapping instead of round-trip, and he defines gap in a relative, not absolute way. Also, he covers a second class of conversions, called truncation conversions.)
For our purposes, we want to apply this theorem to decimal/binary conversions, so I will restate it concretely as two separate theorems:
- Decimal-binary-decimal: d-digit decimal numbers round-trip through b-bit binary numbers if and only if 10d < 2b-1.
- Binary-decimal-binary: b-bit binary numbers round-trip through d-digit decimal numbers if and only if 2b < 10d-1.
Existing Proofs of The Sufficient Condition
Several authors have done partial proofs of the theorem, for the sufficient condition only: I. Goldberg, Kahan, D. Goldberg, and Muller et al. (p. 41-3). I. Goldberg does it for decimal-binary-decimal round-trips, and Kahan, D. Goldberg, and Muller et al. do it for binary-decimal-binary round-trips. D. Goldberg’s proof is limited to IEEE single-precision binary floating-point. Muller et al., though they state that the inequality is “a necessary and sufficient condition”, only prove the sufficient condition.
I like the proofs of D. Goldberg and Muller et al; they use a simple gap-based argument. (Frankly, I could not follow the other proofs in detail.) I will give my own proof, based on D. Goldberg’s and Muller et al.’s technique, for both decimal-binary-decimal and binary-decimal-binary round-trips, and for any precision.
Gap Size Determines Round-Trip
A number is said to round-trip if it converts back to itself after being converted to an intermediate base. This can only happen when the gaps in the original base are greater than or equal to the gaps in the intermediate base. (Gap size can only be equal for integers, where the gaps line up exactly and thus conversion is exact; I won’t consider this special case of equal gap size any further.) To see why, consider what happens during a conversion; let’s use decimal-binary-decimal conversion as an example.
Let gapd be the size of the gap between decimal numbers, and let gapb be the size of the gap between binary numbers. Conversion from decimal to binary will incur a maximum error of gapb/2. If gapb/2 > gapd/2, the closest binary number, in the worst case, will be beyond the midpoints sandwiching the original decimal number; that means it would convert back to a different decimal number. If gapd/2 > gapb/2, then the binary number has to round back to the original decimal number. (Note that in this latter case, the decimal number can never convert to a decimal midpoint; even if a binary number lines up with the decimal midpoint, the binary number at the other end of the binary gap would be closer. This prevents ties, and thus keeps tie-breaking rounding mode out of the picture, which could send the returning conversion back to the wrong number.)
What Matula Has To Say About Gap Size and Round-Trips
In his paper Base Conversion Mappings Matula says
“…a sufficient condition for a base conversion mapping to be one-to-one is that the minimum gap in the old base be at least as large as the maximum gap in the new base. …if this condition does not hold … there will exist some range of numbers where the gap function in the new base is larger than the gap function in the old base. If this region is sufficiently long, a conversion mapping of the old base into the new base could not be one-to-one over this region.”
My Proof of The Sufficient Condition: Decimal-Binary-Decimal
Let d be the number of significant decimal digits, and b be the number of significant bits.
Consider the interval [10e,10e+1). In this interval, there are three or four powers of two, so there are four or five different binary gap sizes. Let 2f be the largest power of two in the interval. Since gap size increases with exponent, the binary gaps in the power of ten interval are at their largest in [2f,10e+1). Thus all we need to check is that this maximum binary gap size is smaller than the decimal gap size in the power of ten interval.
In [10e,10e+1), the decimal gap size is 10e+1-d. In [2f,10e+1), which is a subinterval of [2f,2f+1), the binary gap size is 2f+1-b. We must show that if 10d < 2b-1, then 10e+1-d > 2f+1-b.
Starting with 10d < 2b-1, and given that 2f < 10e+1, we can multiply as follows, preserving the inequality: 10d·2f < 2b-1·10e+1.
Combining powers, this simplifies to 2f/2b-1 < 10e+1/10d, or 2f+1-b < 10e+1-d. This proves the sufficient condition.
My Proof of The Sufficient Condition: Binary-Decimal-Binary
This proof is symmetric to the one above, so will read similarly.
Consider the interval [10e,10e+1). In this interval, there are three or four powers of two, so there are four or five different binary gap sizes. Let 2f be the smallest power of two in the interval. Since gap size increases with exponent, the binary gaps in the power of ten interval are at their smallest in [10e,2f). Thus all we need to check is that this minimum binary gap size is larger than the decimal gap size in the power of ten interval.
In [10e,10e+1), the decimal gap size is 10e+1-d. In [10e,2f), which is a subinterval of [2f-1,2f), the binary gap size is 2f-b. We must show that if 2b < 10d-1, then 2f-b > 10e+1-d.
Starting with 2b < 10d-1, and given that 2f > 10e, we can multiply as follows, preserving the inequality: 2b·10e < 10d-1·2f.
Combining powers, this simplifies to 10e/10d-1 < 2f/2b, or 10e+1-d < 2f-b. This proves the sufficient condition.
Open Question: Proving The Necessary Condition
I would like to see a proof of the necessary condition in the style of the sufficient condition proofs above. For example, for decimal-binary-decimal conversions, I’d like to prove that if 10e+1-d > 2f+1-b, then 10d < 2b-1. I attempted to prove it by reversing the steps of the sufficient condition proof, but I did not see a way to complete it. I also tried proving the contrapositive: if 10d > 2b-1, then 10e+1-d < 2f+1-b. Still no luck. (Technically the contrapositive would make the signs ≥ and ≤, respectively, but d is greater than zero, and we are not considering the equal gap size case.)
D. Goldberg uses a counterexample to prove the necessary condition for binary-decimal-binary conversion from IEEE single-precision binary floating-point, but I am looking for a general proof (Matula’s proof is general but not in the above style). What seems to make this direction harder is that there are some ranges where less precision than called for by the inequality will work.
Oh my gosh, you got me wanting to do this proof. I’ll take a crack at it this week. No promises.
[Edited by RR to fix formatting]
[ Meta: how do I mark up mathematics in these comments? I can’t see any information about what sort of markup is permitted. I’ll use LaTeX-ish markup in what follows. ]
This comment is a followup from a discussion in the comments section of a StackOverflow question: http://stackoverflow.com/a/35708911/270986. The aim is to give a short proof of the condition for roundtripping of conversions between two floating-point formats.
Setup. We’ll work with an idealised form of floating-point that ignores exponent bounds, NaNs, infinities and zeros. Roundtripping for negative floats doesn’t behave substantially differently than for positive floats, so we also ignore negative floats and just work with positive floats. We’ll only consider round-to-nearest rounding modes between the two formats, and we won’t make any assumptions about which direction ties round in.
To be precise, here’s what we’ll mean by a floating-point number in the following text.
Definition. Given integers p ≥ 1 and B ≥ 2, a precision-p base-B floating-point number is a positive rational number x that can be expressed in the form m·Be for some integers m and e, with 0 ≤ m < Bp.
Now we consider conversions from precision-p base-B floating-point to precision-q base-D floating-point and back again, using round-to-nearest in both directions.
Proposition 1. Suppose that Bp ≤ Dq-1. Let x be a precision-p base-B float, and let y be a closest precision-q base-D float to x (choose either one in the case of a tie). Then x is the unique closest precision-p base-B float to y. Hence precision-p base-B floats roundtrip through precision-q base-D floating-point.
Proof. There are integers e and f such that Be-1 < x ≤ Be and Df-1 < x ≤ Df. (Concretely, e = ceiling(log_B(x)), and f = ceiling(log_D(x)).) Since the nearest float to x is at least Be-p away, it's enough to show that
|x – y| < 1/2 Be-p,
since then y is closer to x than to any other precision-p base-B float. (Note: that this is still true in the corner case where x = Be; this argument wouldn't have worked if we'd chosen e so that Be-1 ≤ x < Be.)
So that's the inequality we're aiming to prove. What we *know* at this point is that since y is a closest base-D float to x, and the precision-q base-D floats have spacing Df-q in the region (Df-1, Df], y can't be more than 1/2 Df-q away from x:
|x – y| ≤ 1/2 Df-q
Now we just combine all the inequalities: we have
|x – y| ≤ 1/2 x Df-q (from above)
== 1/2 x Df-1 D1-q
< 1/2 x D1-q (by choice of f)
≤ 1/2 x B-p (from Bp ≤ Dq-1)
≤ 1/2 x Be B-p (by choice of e)
== 1/2 x Be-p
which gives us the inequality we need: |x – y| Dq-1. Then there are infinitely many precision-p base-B floats x such that the unique nearest precision-q base-D float y to x does *not* round back to x.
Proof. A consequence of the fact that B and D are not powers of a common base is that rational numbers of the form Df / Be are dense on the positive real line; that is, any open interval (x, y) in the positive reals contains at least one number (in fact, infinitely many numbers) of the form Df / Be.
Next, observe that the open interval
(1 / (1 + 1/2 D1-q), 1 – 1/2 B-p)
is nonempty; that is, we have 1 / (1 + 1/2 D1-q) < 1 – 1/2 B-p. To see this, rearrange the terms to see that the above inequality is exactly equivalent to
So we can find e and f such that Df / Be lies in this interval; that is,
1 / (1 + 1/2 D1-q) < Df / Be < 1 – 1/2 B-p < 1.
Now consider x = Be. From the left-hand inequality above, and the fact that Df / Be < 1, we have
Df < Be < Df + 1/2 Df-q+1
and so in our first rounding operation of the attempted roundtrip, x must round *down* to Df. But from the right-hand inequality,
Df < Be – 1/2 Be-p
which means that in the second rounding of the roundtrip operation, Df must round *down* again (it's closer to Be – Be-p than to Be). So x fails to roundtrip. And because we can find infinitely many pairs (e, f) so that Df / Be lies in our desired interval, there are infinitely many floats x that don't roundtrip.
In the case that B and D *are* powers of a common base (let's call it C), Proposition 2 is not valid. In that case, the analysis is rather easier, and we have the following general result:
Proposition 3. Suppose that B and D are powers of a common base C, and assume that C is the maximal possible common base. Then conversion from precision-p base-B floating-point through precision-q base-D floating-point roundtrips if and only if
Bp ≤ C·Dq-1
Note: in the special case where B = D above, we have C = B = D, and the proposition tells us that we have roundtripping if and only if p ≤ q, which is what we'd expect.
Hmm. I left a long comment giving a proof of both directions, but I’m not seeing it show up. Did I get something wrong, or am I just being impatient?
Rick, sorry; it looks as though the markup got messed up in the long comment; bits of the text are missing, which makes some of the proof a bit nonsensical.
What are the markup rules for the comments?
I have no idea why WordPress allowed your second comment through without moderation but not your first (if I could whitelist you I would!).
Unfortunately the way I set this blog up Latex is allowed in posts but not comments. You should be able to use html tags for the simple stuff in comments though (subscripts and superscripts). But perhaps we can do something else: a) I can edit your comment b) I could turn it into a blog post and use Latex (and maybe we can go back and forth via email to get it right before posting).
| I edited your comment to format it. Please let me know if it is correct (I used
| · for \times, ≤ for <=, ≥ for >=, and <sup>n</sup>
| for exponents).
@Rick: Thanks for the edits! I’ve taken a few moments to write my comment up in proper LaTeX. A PDF is available here:
Comments are closed.