String To Floating Point Number Conversion
atof(), strtod() – how they work, and how you can write your own, better ones

1. Introduction

For most C/C++ programmers, converting strings to floats is a trivial task: they just call atof() or strtod() from the standard library and don't think about the internals. In some cases, however, linking with the standard library is not possible, or not wanted. Implementations of the standard functions may be very inefficient (as was the case with Visual C++ 2008), and they may also be huge in terms of executable size. On my PowerPC machine running the MorphOS operating system, strtod() adds 40 kB to the executable. Investigating the problem, I found the source code of the GCC strtod() to be an unreadable mess. Then I asked myself whether I could write something shorter and more straightforward. The code described below compiles to 1.5 kB of object file. It may be easily modified, for example to handle Unicode strings or to support extended-precision or nonstandard-format floats. The base code assumes IEEE 754 double precision numbers and ASCII strings. It also assumes a big-endian processor, which is important for the multiword integer shifts used. These shifts are implemented in pure C, but I have not tested them on a little-endian machine.

2. Double Precision Floats

A double precision floating point number, called just a double for short, occupies 8 bytes (64 bits) of memory. It is discussed in detail in the Wikipedia article linked above. The first, most significant bit is the sign bit for the whole number. Then we have an 11-bit exponent and 52 bits of significand. The exponent is an integer power of two stored with an offset of 1023, so 1023 means 2^0. The exponent can then represent values from 2^−1022 (00000000001) to 2^1023 (11111111110). The significand represents a fractional number between 1.0 inclusive and 2.0 exclusive. The integer part of the significand is therefore always equal to 1 and is implied (not stored), so all 52 bits are used for the fractional part. To obtain the value of a double, one raises 2 to the power of the exponent and multiplies it by the significand. Formally, denoting the value of the number as v, the exponent bits as e and the significand bits as s, we have:

v = 2^(e − 1023) · (1 + s / 2^52)

Of course we should not forget to negate the final value if the sign bit is set. The maximum possible value of a double has exponent 11111111110 and all ones in the significand. This maximum value is 1.79769313486231570815... · 10^308; the minimum normalized value has exponent 00000000001 and significand 1.0, and is equal to 2.22507385850720138309... · 10^−308.

2.1. Special Exponent Values

Two of the 2048 possible exponent values have special meaning. An exponent of 0 represents plain 0 if the significand is 0, or so-called subnormal numbers when the significand is not zero. A subnormal number extends the range of small numbers at the cost of degraded precision. The exponent of a subnormal is taken to be −1022 (instead of the −1023 that the general formula gives), but the significand has no implied 1.0 added. My base code does not support subnormals, so any number whose modulus is less than the minimum given above is converted to 0. While we are at zero: there are two double zeros, a positive and a negative one. This makes no sense in the domain of real numbers, where there is just one zero. In the double domain it does make sense, however, as the domain is not continuous. Zero in this domain means "really zero, or any number smaller in magnitude than the lowest representable one", and as such it can have a sign. My code always assigns a proper sign: it produces positive zero for strings representing the "true" 0, for example "0", "+0", "0e+3". It produces negative zero, however, when the minus sign is given explicitly, as in "−0.0000E−6".

The second special exponent value is all ones (2047). It means infinity (positive or negative) if the significand is 0, or "not a number" (NaN for short) for a non-zero significand. NaN is used to represent results of mathematical operations that make no sense in the real-number domain, like the square root of a negative number. Conversion from a string never generates NaNs; it simply stops at the first wrong character, and if even the first character is wrong, it returns 0. It can generate infinities, however, when the number is too big. Below you can find the binary values of the 4 special numbers generated by my code (given as hexadecimals):

$0000 0000 0000 0000 - positive zero
$8000 0000 0000 0000 - negative zero
$7FF0 0000 0000 0000 - positive infinity
$FFF0 0000 0000 0000 - negative infinity

3. Divide et impera

This old Roman proverb applies to computer science as well. If a problem can be split into two smaller ones, half of the work is already done. Our conversion problem can be split too. The first stage converts the string into an internal representation; the second converts this representation into a proper double number. The internal representation I've chosen is very simple:

v = 10^e · x

where e and x are integers (x is a positive integer). The size of e may be estimated from the minimum and maximum values representable as a double. It must hold numbers roughly from −327 to +308, so even a 16-bit variable is enough; the best choice is to just use int. The variable x, however, must be bigger. A double provides 16 significant decimal digits, while a 32-bit int provides only 9 digits. We therefore have to use a 64-bit int (long long int in GCC). Fortunately, most compilers have built-in 64-bit integer arithmetic.

3.1. String Parser

As I have a background in electronics, I like to implement string parsers as finite state machines. Such parsers are easy to understand, implement, debug and modify. In this case the parser takes one character at a time and need not look either ahead or back. Thanks to this, the implementation of fstrtod(), for example, is straightforward. The parser has to support plain number notation (just digits and a decimal point) and scientific notation with a power-of-ten base. It also has to support common variations such as an optional sign, leading whitespace, leading zeros and an optional exponent sign.

It looks messy, but a finite state machine can sort it out easily. Let's start with the state chart (states A–I, described one by one in the next section):

The operation denoted getc() reads the next character from the input stream. For a plain string it may be defined as an inline function (or even a macro) dereferencing a pointer and incrementing it (the well-known *ptr++ construct). It is worth noting that in this case the code does not use standard library calls at all. For the fstrtod() implementation, getting a char is just an fgetc() call.
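For in-memory strings, such a character source might look like this (an illustrative sketch, not the article's exact code):

```c
/* Character source for in-memory strings: dereference and advance,
   the *ptr++ idea mentioned above. For fstrtod() the same slot would
   be filled by an fgetc() call instead. */
static inline int next_char(const char **pp)
{
    return (unsigned char)*(*pp)++;
}
```

Returning the character as an unsigned value avoids sign-extension surprises with characters above 127.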

3.2. Description of Parser States

A – skipping leading whitespace. In my code whitespace is hardcoded as the ASCII space and tab characters. One can call the standard isspace() or do whatever is needed. This state can even be omitted in special cases where leading spaces are known not to occur; I've added it to keep compatibility with strtod(). When the current character is whitespace, the next one is fetched and the machine stays in state A. Otherwise the machine goes to state B.

B – reading the number sign. "+" causes the next char to be fetched. "−" does the same, but also sets the "negative" flag. Any digit (including 0) switches the machine to state C. Other characters are not allowed here, so they make the machine finish parsing. This short path improves performance when the code is run on a string that is not a valid number.

C – skipping leading zeros in the integer part of the mantissa. These zeros are meaningless, so they are simply skipped. A zero digit here causes the next character to be fetched while staying in this state. A decimal point fetches a new character and switches the machine into state D (the integer part of the mantissa is just 0: either explicit, or the mantissa starts with the decimal point, as in ".384"). Any other character changes the state to E.

D – reading leading zeros in the fractional part of the mantissa. Such zeros are not significant digits, because state D is entered only if the integer part of the mantissa was 0. They do change the exponent, however: every zero makes the machine fetch the next char and decrease the exponent by 1. The decrease operation is protected against overflow. It is hard to imagine two billion zeros being fed to the algorithm, but it is theoretically possible, hence the protection. Any other character immediately changes the state to F.

E – reading the integer part of the mantissa. If fewer than 18 significant digits have been stored so far, the current digit is stored. This is done by multiplying the mantissa by 10 and adding the digit to it. Digits after the 18th are not stored, but every such digit shifts the whole number one decimal position left; this is accounted for by increasing the exponent by 1. The operation is protected against overflow in the same way as in state D. After a digit is stored, the next character is fetched and the machine stays in the current state. Encountering a decimal point causes the next character to be fetched and a transition to state F. Any other character changes the state to F immediately.

F – reading the fractional part of the mantissa. It works similarly to state E: digits are stored in the same way until 18 have been collected. Unlike in state E, storing a digit moves the imagined decimal point to the left, so to keep the value the exponent is decremented by 1 with each stored digit. On the other hand, digits after the 18th significant one do not change the value at all, so they are just skipped. If the letter "E" or "e" is encountered, it is accepted and the machine goes to state G. Any other character switches to G immediately.

G – reading sign of exponent. "+" is accepted and skipped, "−" additionally sets "negative exponent" flag. Any other character implies positive exponent and advances the state machine to state H.

H – skipping leading zeros of exponent. They are meaningless.

I – reading exponent digits. Digits are accumulated in the same way as mantissa digits. Overflow protection is applied here too: a digit is not stored if it would create an overflow. Exponent overflow happens well above the upper limit of a double, so the final result will be infinity anyway. Any non-digit character stops the parser.
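To illustrate the style of such a machine, here is a miniature parser covering only states G–I (exponent sign, leading zeros, digits). It is a hypothetical sketch in the spirit of the description above, not the article's actual parser:

```c
/* Parse an exponent like "16", "-006" or "+308", in the style of
   states G..I above: read the sign, skip leading zeros, accumulate
   digits with a crude overflow clamp, stop at the first non-digit. */
static int parse_exponent(const char *p)
{
    enum { G, H, I } state = G;
    int neg = 0, e = 0;
    int c;

    while ((c = (unsigned char)*p) != '\0')
    {
        switch (state)
        {
        case G:                          /* read exponent sign */
            if (c == '+') p++;
            else if (c == '-') { neg = 1; p++; }
            state = H;
            break;
        case H:                          /* skip meaningless leading zeros */
            if (c == '0') p++;
            else state = I;
            break;
        case I:                          /* accumulate digits, clamp overflow */
            if (c >= '0' && c <= '9')
            {
                if (e < 10000) e = e * 10 + (c - '0');
                p++;
            }
            else
                return neg ? -e : e;     /* first non-digit stops parsing */
            break;
        }
    }
    return neg ? -e : e;
}
```

Each state examines one character and either consumes it or hands it over to the next state untouched, exactly the "no lookahead, no lookback" property the text describes.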

3.3. Parser Final Part

After the state machine finishes, a few operations remain. First, the exponent resulting from the decimal point position is merged with the exponent specified explicitly. For example, the number "3.753e16" is parsed to 3753 with exponent −3 and explicit exponent 16. The final exponent is then 13, as the number is equal to 3753 · 10^13. The second operation is preliminary range checking. Every number with a final exponent above 308 will end up as infinity. Every number with a final exponent below −328 will end up as zero, even if the integer mantissa is the highest possible 999 999 999 999 999 999. The parser signals this fact through its return value, so the converter, the second part of the code, can be skipped.
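The merge and the preliminary check might be sketched as follows (the helper name and return codes are mine; the limits 308 and −328 are from the text):

```c
#define PRE_INF   1   /* exponent too high: result will be infinity */
#define PRE_ZERO -1   /* exponent too low: result will be zero */
#define PRE_OK    0   /* in range: the converter must run */

/* Merge the decimal-point exponent with the explicitly given one and
   apply the preliminary range check (hypothetical helper). */
static int merge_exponents(int point_exp, int explicit_exp, int *final_exp)
{
    int e = point_exp + explicit_exp;

    *final_exp = e;
    if (e > 308)  return PRE_INF;
    if (e < -328) return PRE_ZERO;
    return PRE_OK;
}
```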

NOTE added 16 Oct 2014

Alexander Guzhva noticed a bug triggered by strings denoting zero with any number of fractional zeros, like "0.0" or "0.000000". In such a case the parser finishes with a zero mantissa and a negative exponent. The converter then tries to normalize the mantissa, which means shifting it left until the most significant bit is "1". Of course this never happens when the mantissa is zero, so the code falls into an infinite loop. It has been fixed in the parser by an explicit check for a zero mantissa in the final stage. For a zero mantissa either PARSER_PZERO or PARSER_MZERO (when an explicit minus sign was given in the string) is returned, so the converter part is skipped completely.

4. Conversion by Compensation

The parser discussed above converts the input string to the normalized base-ten notation expressed as:

v = s · 10^e

where s is an integer significand ranging from 0 to 10^18 − 1, and e is an integer exponent already limited to [−328, +309]. We have to convert it to another form:

v = x · 2^y

where x will be the 52-bit fractional part of a real number from the [1.0, 2.0) interval and y will be an integer from [−1022, +1023]. The conversion should not use floating point arithmetic itself. How can it be done? The idea is to use an intermediate formula:

v = s · 10^e · 2^y

where y is initially zero, so it does not change the value v (as 2^0 is simply 1). Our goal is then to remove the 10^e factor by reducing e to zero. Every change of e must be compensated by a simultaneous change of y and, as we want y to stay an integer, also a change of s. When the initial decimal exponent e is positive, it has to be decremented. Every decrement by 1 is equivalent to dividing the whole number by 10, so the significand s has to be multiplied by 10 and then renormalized. Renormalization is just repeated division by 2, protecting the significand against overflow. Fortunately, dividing an integer by 2 is just a bitwise shift right by one bit. Every such shift is compensated by incrementing the binary exponent y.

Reduction of a negative decimal exponent is done in a similar way. The exponent is increased by 1 in a loop. Every increase means multiplying the whole number by 10, so the significand is divided by 10 to compensate. Renormalization is performed with a bitwise left shift, compensated by subtracting 1 from the binary exponent.

4.1. Binary Trickery

As stated at the start of chapter 3, our significand is an integer occupying a 64-bit variable. The conversion involves division by 2 and by 10. Integer division usually means rounding errors. When the significand is divided by 10, the fractional part would be truncated, and renormalization would then shift in zeros instead of the lost fractional part. Similarly, division by 2 causes a rounding error whenever the bit shifted out is "1", and the following multiplication by 10 would "poison" the integer part of the significand with that error. This is unacceptable. We need to add some bits for the fractional part. How many? Fortunately for us, renormalization keeps the magnitude of the significand more or less constant, so 5 or 6 bits would probably suffice. Either way, adding those bits means crossing the 64-bit boundary, so we will have to implement our own arithmetic routines. As most of today's processors use 32-bit words, there is no point in adding fewer than 32 bits. The significand will therefore use 96 bits.

The significand is then placed in the lower two 32-bit words of a three-word, 96-bit number. Now it is time for a virtual operation. We can imagine a decimal (or rather "binary" in this case) point in this number. As the number is an integer, this point sits just right of the least significant bit. We move it and place it after the 4 most significant bits of the whole 96-bit number. This operation means division by 2^92, so we have to change y from 0 to +92.
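The placement step might look like this (a sketch; variable names follow the later code snippets, the function name is mine):

```c
#include <stdint.h>

/* Place the parser's 64-bit integer significand into the two lower
   32-bit words of the 96-bit value s2:s1:s0, and account for the
   virtual binary point after bit 92 by starting the binary exponent
   at +92 (sketch). */
static void place_significand(unsigned long long m,
                              uint32_t *s2, uint32_t *s1, uint32_t *s0,
                              int *y)
{
    *s0 = (uint32_t)(m & 0xFFFFFFFFu);
    *s1 = (uint32_t)(m >> 32);
    *s2 = 0;   /* top word free: 4 guard bits plus room below them */
    *y  = 92;  /* compensates moving the point from bit 0 to bit 92 */
}
```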

Before starting the conversion, we need a basic set of arithmetic operations for 96-bit numbers. The first two operations are addition and subtraction. This is how they may be implemented in pure C:

#define add96(s2, s1, s0, d2, d1, d0) { \
unsigned int _x, _c; \
_x = (s0); (s0) += (d0); \
if ((s0) < _x) _c = 1; else _c = 0; \
_x = (s1); (s1) += (d1) + _c; \
if (((s1) < _x) || (((s1) == _x) && _c)) _c = 1; else _c = 0; \
(s2) += (d2) + _c; }

#define sub96(s2, s1, s0, d2, d1, d0) { \
unsigned int _x, _c; \
_x = (s0); (s0) -= (d0); \
if ((s0) > _x) _c = 1; else _c = 0; \
_x = (s1); (s1) -= (d1) + _c; \
if (((s1) > _x) || (((s1) == _x) && _c)) _c = 1; else _c = 0; \
(s2) -= (d2) + _c; }

NOTE added 26 Sep 2013

Raúl Gutiérrez found that the two macros above had a rare bug related to carry/borrow propagation. For add96() it happened when s1 was 0xFFFFFFFF and s0 + d0 generated a carry: s1 + d1 + c should then generate a carry as well for any non-zero d1, but it did not. Assuming an even distribution of possible argument values, this bug happened in 1/2^33 of cases. Similarly for sub96(): if s1 was 0x00000001 and s0 − d0 generated a borrow, s1 − d1 − c did not generate a borrow for any non-zero d1. The old, buggy versions of these macros are given below for reference:

#define add96(s2, s1, s0, d2, d1, d0) { \
unsigned int _x, _c; \
_x = (s0); (s0) += (d0); _c = 0; \
if ((s0) < _x) _c = 1; \
_x = (s1); (s1) += (d1) + _c; _c = 0; \
if ((s1) < _x) _c = 1; \
(s2) += (d2) + _c; }

#define sub96(s2, s1, s0, d2, d1, d0) {\
unsigned int _x, _c; \
_x = (s0); (s0) -= (d0); _c = 0; \
if ((s0) > _x) _c = 1; \
_x = (s1); (s1) -= (d1) + _c; _c = 0; \
if ((s1) > _x) _c = 1; \
(s2) -= (d2) + _c; }

There is nothing special about them: both operations are performed from right to left in 32-bit chunks with carry/borrow propagation. These macros have some disadvantages, however. They contain conditional instructions, which are not very good for today's deeply pipelined processors. They also ignore the fact that almost every processor has multiword ("add with carry" style) addition and subtraction instructions. If one uses a compiler supporting 64-bit integers (the long long type), better macros may be used:

#define add96(s2, s1, s0, d2, d1, d0) { \
long long w; \
w = (long long)(s0) + (long long)(d0); \
(s0) = w; \
w >>= 32; \
w += (long long)(s1) + (long long)(d1); \
(s1) = w; \
w >>= 32; \
w += (long long)(s2) + (long long)(d2); \
(s2) = w; }

#define sub96(s2, s1, s0, d2, d1, d0) { \
long long w; \
w = (long long)(s0) - (long long)(d0); \
(s0) = w; \
w >>= 32; \
w += (long long)(s1) - (long long)(d1); \
(s1) = w; \
w >>= 32; \
w += (long long)(s2) - (long long)(d2); \
(s2) = w; }

These two macros have the following advantages: they contain no conditional branches, and a compiler can usually map the 64-bit additions onto the processor's native add-with-carry instructions.
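A quick check of carry and borrow propagation through the middle word (the scenario from the note above); the macro definitions are repeated from the listing so the sketch is self-contained:

```c
/* 64-bit-based 96-bit add/sub, exactly as in the listing above. */
#define add96(s2, s1, s0, d2, d1, d0) { \
long long w; \
w = (long long)(s0) + (long long)(d0); \
(s0) = w; \
w >>= 32; \
w += (long long)(s1) + (long long)(d1); \
(s1) = w; \
w >>= 32; \
w += (long long)(s2) + (long long)(d2); \
(s2) = w; }

#define sub96(s2, s1, s0, d2, d1, d0) { \
long long w; \
w = (long long)(s0) - (long long)(d0); \
(s0) = w; \
w >>= 32; \
w += (long long)(s1) - (long long)(d1); \
(s1) = w; \
w >>= 32; \
w += (long long)(s2) - (long long)(d2); \
(s2) = w; }
```

With s1 = 0xFFFFFFFF and a carry out of the low word, the carry now propagates all the way into s2, and the matching sub96() undoes the addition exactly.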

Another two simple macros are for bitwise shift left and right (by one bit):

#define lsr96(s2, s1, s0, d2, d1, d0) \
d0 = ((s0) >> 1) | (((s1) & 1) << 31); \
d1 = ((s1) >> 1) | (((s2) & 1) << 31); \
d2 = (s2) >> 1;

#define lsl96(s2, s1, s0, d2, d1, d0) \
d2 = ((s2) << 1) | (((s1) & (1U << 31)) >> 31); \
d1 = ((s1) << 1) | (((s0) & (1U << 31)) >> 31); \
d0 = (s0) << 1;

Again, looking into a particular CPU's instruction set may provide a faster solution. There is also a caveat in these macros: the C language standard leaves the bits shifted into the most significant positions during a right shift of a signed value implementation-defined. In the code I use only unsigned variables, for which the standard does guarantee that zeros are shifted in.
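A round-trip check of the shift macros (repeated from the listing, with the bit-31 masks written as 1u << 31 so the shift happens on an unsigned operand):

```c
#include <stdint.h>

/* One-bit right and left shifts of a 96-bit value, as in the listing. */
#define lsr96(s2, s1, s0, d2, d1, d0) \
d0 = ((s0) >> 1) | (((s1) & 1) << 31); \
d1 = ((s1) >> 1) | (((s2) & 1) << 31); \
d2 = (s2) >> 1;

#define lsl96(s2, s1, s0, d2, d1, d0) \
d2 = ((s2) << 1) | (((s1) & (1u << 31)) >> 31); \
d1 = ((s1) << 1) | (((s0) & (1u << 31)) >> 31); \
d0 = (s0) << 1;
```

Shifting right and then left again restores the original value as long as the bit shifted out of the low word was zero.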

4.2. Decimal Exponent Reduction

Equipped with this nice 96-bit arithmetic, we are ready to reduce the decimal exponent. Let's start with the case where the exponent is positive. In every turn of the reduction we decrease it by 1. Then we have to compensate by multiplying the significand by 10. This can be done easily once we notice that 10 = 8 + 2, and multiplication by 2 and by 8 can be done by bit shifting. What about overflow? When setting the implicit "binary point", we left the 4 most significant bits free. These 4 bits ensure the multiplication by 10 will never overflow. The next step is renormalization: if the integer part (those 4 bits) of the significand is not 0, we keep shifting the significand right and incrementing the binary exponent y. As this clears the 4 bits again, we can be sure the next multiplication will not overflow. After the cycle finishes, the total value v is unchanged. The reduction cycle is repeated until the decimal exponent e is 0. Here is the code:

while (e > 0)
{
  lsl96(s2, s1, s0, q2, q1, q0); // q = s * 2
  lsl96(q2, q1, q0, r2, r1, r0); // r = s * 4
  lsl96(r2, r1, r0, s2, s1, s0); // s = s * 8
  add96(s2, s1, s0, q2, q1, q0); // s = (s * 8) + (s * 2)

  e--;

  while (s2 & 0xF0000000)
  {
    lsr96(s2, s1, s0, q2, q1, q0);
    y++;
    s2 = q2;
    s1 = q1;
    s0 = q0;
  }
}
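The 10 = 8 + 2 decomposition behind the three lsl96() calls and the add96() can be checked on an ordinary 64-bit integer:

```c
#include <stdint.h>

/* Multiply by 10 using only shifts and an add, as in the reduction
   loop above: x * 10 == (x << 3) + (x << 1). */
static uint64_t times10(uint64_t x)
{
    return (x << 3) + (x << 1);
}
```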

The second case is reduction of a negative decimal exponent. In every turn of the reduction it is increased by 1; compensation then requires dividing the significand by 10. To keep the division rounding error small, we renormalize before dividing: the significand is shifted left until the first "1" bit hits the most significant position, and every shift is compensated by decrementing the binary exponent y. Some tricky algorithms for division by 10 can be found, but they are usually only for 32-bit (or even 16-bit) numbers. For the 96-bit number I applied a plain long division algorithm adapted to the binary system and 32-bit chunks. It requires four 32-bit divisions and three modulo operations. Many compilers can optimize those divisions (all by 10) into multiplications by a prescaled reciprocal of 10; at least GCC 4 does it for PowerPC processors. My guess is that no "magic" bit-juggling division algorithm would be faster than that for 96-bit numbers.

while (e < 0)
{
  while (!(s2 & 0x80000000))
  {
    lsl96(s2, s1, s0, q2, q1, q0);
    y--;
    s2 = q2;
    s1 = q1;
    s0 = q0;
  }

  q2 = s2 / 10;
  r1 = s2 % 10;
  r2 = (s1 >> 8) | (r1 << 24);
  q1 = r2 / 10;
  r1 = r2 % 10;
  r2 = ((s1 & 0xFF) << 16) | (s0 >> 16) | (r1 << 24);
  r0 = r2 / 10;
  r1 = r2 % 10;
  q1 = (q1 << 8) | ((r0 & 0x00FF0000) >> 16);
  q0 = r0 << 16;
  r2 = (s0 & 0xFFFF) | (r1 << 16);
  q0 |= r2 / 10;
  s2 = q2;
  s1 = q1;
  s0 = q0;

  e++;
}
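The chunked long division can be verified against ordinary 64-bit division for values that fit in 64 bits. The wrapper function below is mine; the chunk logic follows the listing, with the last partial remainder r1 fed into the final 16-bit chunk:

```c
#include <stdint.h>

/* Divide the 96-bit value w2:w1:w0 by 10 in place, using the chunked
   long division from the text: one 32-bit chunk, two 24-bit chunks
   and one 16-bit chunk, each prefixed with the previous remainder. */
static void div96by10(uint32_t *w2, uint32_t *w1, uint32_t *w0)
{
    uint32_t s2 = *w2, s1 = *w1, s0 = *w0;
    uint32_t q0, q1, q2, r0, r1, r2;

    q2 = s2 / 10;                                  /* top 32 bits */
    r1 = s2 % 10;
    r2 = (s1 >> 8) | (r1 << 24);                   /* next 24 bits */
    q1 = r2 / 10;
    r1 = r2 % 10;
    r2 = ((s1 & 0xFF) << 16) | (s0 >> 16) | (r1 << 24); /* next 24 bits */
    r0 = r2 / 10;
    r1 = r2 % 10;
    q1 = (q1 << 8) | ((r0 & 0x00FF0000) >> 16);    /* pack quotient bits */
    q0 = r0 << 16;
    r2 = (s0 & 0xFFFF) | (r1 << 16);               /* final 16 bits */
    q0 |= r2 / 10;

    *w2 = q2; *w1 = q1; *w0 = q0;
}
```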

4.3. Final Normalization And Limiting

At this stage of the algorithm the decimal exponent e has been reduced to 0, so the 10^e term is eliminated from our expression. Before we are ready to assemble the double precision number, the significand should be brought into the [1.0, 2.0) range. This means its integer part (the 4 most significant bits) must be equal to 1. Previous stages of the algorithm ensure the upper bound, so we only have to take care of the significand being below 1.0. A loop keeps shifting it left until a "1" bit arrives at position 92; every shift is compensated by decreasing the binary exponent. A special case is a significand equal to 0: shifting would then go on endlessly, so this case is detected before the normalization loop.

if (s2 || s1 || s0)
{
  while (!(s2 & 0xF0000000))
  {
    lsl96(s2, s1, s0, q2, q1, q0);
    y--;
    s2 = q2;
    s1 = q1;
    s0 = q0;
  }
}

Now the significand and binary exponent are ready. It's time for the final, precise limit checking of the double. If the exponent is out of the allowed range, either zero (for y < −1022) or infinity (for y > +1023) is generated, its sign taken from the sign flag set by the parser. If the limit check passes, the final number is assembled with some bit masking:

(Figure: double number assembling.)
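A sketch of that assembly step, assuming the normalized significand sits in s2:s1 with its leading "1" at bit 92 (that implied bit is dropped) and y already range-checked; the function name is mine, and rounding of the discarded low bits is ignored here:

```c
#include <stdint.h>
#include <string.h>

/* Assemble an IEEE 754 double from the sign flag, binary exponent y
   and the normalized 96-bit significand s2:s1:s0 (leading 1 at bit 92).
   Sketch only: the bits below the 52 kept ones are simply dropped. */
static double assemble_double(int negative, int y,
                              uint32_t s2, uint32_t s1, uint32_t s0)
{
    (void)s0; /* low bits discarded in this sketch (no rounding) */

    /* The 52 fraction bits are bits 91..40 of the 96-bit number:
       the 28 low bits of s2 and the 24 high bits of s1. */
    uint64_t frac = ((uint64_t)(s2 & 0x0FFFFFFF) << 24) | (s1 >> 8);
    uint64_t bits = ((uint64_t)(negative != 0) << 63)
                  | ((uint64_t)(y + 1023) << 52)   /* biased exponent */
                  | frac;
    double v;
    memcpy(&v, &bits, sizeof v);
    return v;
}
```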

5. The Code

The algorithm base consists of two internal functions: parser() and converter(). Using them, standard library functions like atof() or strtod() can be written. A full implementation of the latter would require adding support for hexadecimal floats (very rarely used) and for the "Inf", "Infinity" and "NaN" strings. The code is licensed under the BSD Two-Clause License.