In my article “7 Bits Are Not Enough for 2-Digit Accuracy” I showed how the relative spacing of decimal and binary floating-point numbers dictates when all conversions between those two bases round-trip. There are two formulas that capture this relationship, and I will derive them in this article. I will also show that it takes one more digit (or bit) of precision to round-trip a floating-point number than it does to round-trip an integer of equal significance.

## Background

In his paper “27 Bits Are Not Enough for 8-Digit Accuracy”, I. Bennett Goldberg states and proves the following about decimal to floating-point to decimal round-trip conversions: “if 10^{p} < 2^{q-1}, then *q* bits are enough for *p*-digit accuracy.” This says that every *p*-digit decimal floating-point number will round-trip through *q* bits or more.

In his paper “In-and-Out Conversions”, David Matula makes and proves a more general statement. He shows that conversions from base *B* round-trip through base *v* when *B*^{n} < *v*^{m-1}, where *n* is the number of base *B* digits, and *m* is the number of base *v* digits. In this form, it’s clear that the formula covers both decimal-binary-decimal and binary-decimal-binary conversions:

- Decimal-binary-decimal:
**10**^{d}< 2^{b-1} - Binary-decimal-binary:
**2**^{b}< 10^{d-1}

(I used *b* and *d* in both formulas, but they are unrelated across the two.)

If you’re given one of *b* or *d*, you can solve for the other by trial and error. But you can also solve for the unknown variable directly, using logarithms. The formulas involving logarithms are well known (e.g., Kahan, Muller et al.), but I will show how they are derived. (I am not going to prove the inequalities; I am just going to derive the formulas from them.)

## Formula for Decimal-Binary-Decimal Round-Trips

For decimal-binary-decimal round-trips, solve for *b* to find the minimum number of bits you need to round-trip a d-digit decimal number:

- Start with original inequality:
**10**^{d}< 2^{b-1} - Turn it around so b is on left side:
**2**^{b-1}> 10^{d} - Take base two logarithm to pull b from exponent:
**log**_{2}(2^{b-1}) > log_{2}(10^{d}) - Simplify using laws of logarithms:
**b-1 > d·log**_{2}(10) - Use ceiling to convert right hand side to integer:
**b-1 ≥ ⌈d·log**_{2}(10)⌉ - Complete the isolation of b:
**b ≥ ⌈d·log**_{2}(10)⌉ + 1 - Take minimum b:
**b = ⌈d·log**_{2}(10)⌉ + 1

For example, to round-trip two digits, you need at least eight bits (for the purposes of this example, log_{2}(10) ≈ 3.322 is accurate enough):

b = ⌈2·log_{2}(10)⌉ + 1 ≈ ⌈2 · 3.322⌉ + 1 = 8.

### Here’s The Formula You’ll Use In Practice

When you’re dealing with a specific binary floating-point format, like IEEE 754, the number of bits is fixed. You don’t want to know how many bits *b* are needed to round-trip a decimal number of *d* digits — you want to know how many digits *d* can round trip given *b* bits. To get *d*, do a different derivation from the same inequality:

- Start with original inequality:
**10**^{d}< 2^{b-1} - Take
*base ten*logarithm to pull d from exponent:**log**_{10}(10^{d}) < log_{10}(2^{b-1}) - Simplify using laws of logarithms:
**d < (b-1)·log**_{10}(2) - Use
*floor*to convert right hand side to integer:**d ≤ ⌊(b-1)·log**_{10}(2)⌋ - Find
*maximum*d:**d = ⌊(b-1)·log**_{10}(2)⌋

For example, at most 15 digits will round-trip through IEEE double-precision (for the purposes of this example, log_{10}(2) ≈ 0.301 is accurate enough):

d = ⌊(53-1)·log_{10}(2)⌋ = ⌊52 · 0.301⌋ = ⌊15.65⌋ = 15.

Here’s the formula in the form you would most likely enter it in a computer:

d = floor((b-1)*(log(2)/log(10)))

**Formula DBD: Maximum digits d that will round-trip through b bits.**

(The parentheses around the division are unnecessary; I just like thinking of log(2)/log(10) as a separate constant.)

## Formula for Binary-Decimal-Binary Round-Trips

For binary-decimal-binary round-trips, solve for *d* to find the minimum number of digits you need to round-trip a b-bit number:

- Start with original inequality:
**2**^{b}< 10^{d-1} - Turn it around so d is on left side:
**10**^{d-1}> 2^{b} - Take base ten logarithm to pull d from exponent:
**log**_{10}(10^{d-1}) > log_{10}(2^{b}) - Simplify using laws of logarithms:
**d-1 > b·log**_{10}(2) - Use ceiling to convert right hand side to integer:
**d-1 ≥ ⌈b·log**_{10}(2)⌉ - Complete the isolation of d:
**d ≥ ⌈b·log**_{10}(2)⌉ + 1 - Take minimum d:
**d = ⌈b·log**_{10}(2)⌉ + 1

For example, to round-trip seven bits, you need at least four digits:

d = ⌈7·log_{10}(2)⌉ + 1 ≈ ⌈7 · 0.301⌉ + 1 = 4.

Here’s the formula in the form you would most likely enter it in a computer:

d = ceil(b*(log(2)/log(10))) + 1

**Formula BDB: Minimum digits d that will round-trip b bits.**

## Round-Trips To and From IEEE Binary Floating-Point

Here’s a table summarizing the values of those formulas for the various IEEE 754 binary floating-point formats:

Name | Bits | Max digits for decimal-binary-decimal round-trip |
Min digits for binary-decimal-binary round-trip |
---|---|---|---|

Half | 11 | 3 | 5 |

Single | 24 | 6 | 9 |

Double | 53 | 15 | 17 |

Extended | 64 | 18 | 21 |

Quadruple | 113 | 33 | 36 |

### Subnormal Numbers

The values in the table above apply only to *normal* floating-point numbers.

Consider the 15-digit, 13-bit subnormal number 2.02345678901234e-320, which converts to 2.0236928853657458449…e-320. Only three of its fifteen digits would survive a rounding back to decimal (2.02e-320). Also, only five digits (2.0235e-320) are needed to preserve its floating-point value (17 digits always works though, so you don’t really have to think about it).

Since you don’t really know how many bits of precision a subnormal number will have until you convert it, it’s not practical to apply the digit formulas on a case-by-case basis. (Is there anyone out there round-tripping subnormal numbers?)

## Relating To The Digit Count For Integers

As stated above, the minimum number of bits *b* required to round-trip every *d*-digit decimal floating-point number is

b = ⌈d·log_{2}(10)⌉ + 1 .

Recall that the minimum number of bits *b* required to represent every *d*-digit decimal integer is

As you can see, it takes one more bit to represent a decimal floating-point number than it does to represent a decimal integer of equal significance. (Remember, integers are represented exactly; floating-point numbers are in general not, but are made to appear so with the extra bit of precision).

Similarly, the minimum number of digits *d* required to round-trip every *b*-bit binary floating-point number is

d = ⌈b·log_{10}(2)⌉ + 1 ,

and the minimum number of digits *d* required to represent every *b*-bit binary integer is

This means it takes one more digit to represent a binary floating-point number than it does to represent a binary integer of equal significance.

Think of it like this: you need an extra digit (or bit) for rounding, because some regions won’t have enough precision otherwise.

Great read Rick! However, I found an inconsistency at the end.

I followed their respective links and they say,

This article says they are minimums and the other articles say they are maximums. Could you clarify this please?

@Wandering Fool,

Both statements are correct. If the maximum number of bits in a d digit number is b, you need a minimum of b bits to represent any d-digit number.

Take 4-digit decimal numbers. The biggest 4-digit number, 9999, needs 14 bits (the smallest, 1000, needs 10 bits). So you need at least 14-bits to represent every 4-digit number.

Very good article. My only suggestion is to define what ‘b’ means (i.e: “number of mantissa/significand bits including the implicit hidden bit”). This hint can be very helpful if you come here after having read too much, like I did 🙂

@Cesar,

You are thinking in terms of implementation (“hidden bit”, “mantissa”, “significand”), but the concept of precision transcends implementation. In fact, the main background work I reference predates IEEE 754. (And in any case, extended-precision, for example, does not use a hidden bit.)

In general, I know many people get confused about the hidden bit and whether precision is 23 or 24 bits (for a float) or 52 or 53 bits (for a double), for example. You’ll be best served by forgetting about the hidden bit and just remembering it’s 24 and 53 (for example).