## 1. Introduction

For most C/C++ programmers converting strings to floats is a trivial task. They just call
*atof()* or *strtod()* from the standard library and don't think about internals. In some cases
however linking with the standard library is not possible, or not wanted. Implementation of standard
functions may be very inefficient (as it was the case with Visual C++ 2008). It may be also huge in terms
of executable size. On my PowerPC machine and MorphOS operating system, *strtod()* adds 40 kB of
executable. Investigating the problem, I've found the source code of GCC *strtod()* an uneradable mess.
Then I've asked myself if I can write something shorter and more straightforward. The code described below
compiles to 1.5 kB of object file.
The code may be easily modified, for example to handle Unicode strings or to support extended precision
or nonstandard format floats. The base code assumes
IEEE754 double precision
numbers and ASCII strings. It also assumes big-endian processor, which is important for multiword integer
shifts used. These shifts are implemented in pure C, but I've not tested them on little-endian machine.

## 2. Double Precision Floats

A double precision floating point number, called just *double* in short, occupies 8 bytes (64 bits) of
memory. It is discussed in details in the Wikipedia article linked above. The first, most significant bit is a
sign bit for the whole number. Then we have 11-bit exponent and 52 bits of significand. The exponent is an
integer power of two, but with offset of 1023, so 1023 means 2^{0}. Then exponent can represent values from
2^{−1022} (00000000001) to 2^{1023} (11111111110). Significand represents a fractional number between 1 including
and 2 excluding. Integer part of significand is then always equal to 1 and is implied (not stored), so all the 52
bits are used for fractional part. To obtain the value of a *double* one raises 2 to the power of exponent and
multiplies it by significand. Formally, denoting the value of number as *v*, exponent bits as *e* and
singificand bits as *s*, we have:

v = 2^{e − 1023} · (1 + s / 2^{52})

Of course we should not forget to negate the final value if the sign bit is set. Maximum possible value of
a double number has exponent 11111111110 and all ones in significand. This maximum value is
1.79769313486231570815... · 10^{308}, minumum value has 00000000001 exponent,
1 as significand and is equal to 2.22507385850720138309... · 10^{−308}.

### 2.1. Special Exponent Values

Two of possible 2048 exponent values have special meaning. Exponent of 0 is used to represent just 0 if the significand
is 0, or so called subnormal numbers when the significand is not zero. A subnormal number extends small numbers range
at a cost of degrading precision. The exponent of a subnormal is considered being −1022 (instead of −1023, what general
formula gives), but significand has no implied 1.0 added. My base code does not support subnormals, so any number, which
module is less than the minimum number given above, is converted to 0. While we are at zero – there are two *double* zeros:
a positive and negative one. It makes no sense in the domain of real numbers, as there is just one zero. In the
*double* domain it makes sense however, as the domain is not continuous. Zero in the domain means "really zero, or any
number less than the lowest representable one", and as such it can have a sign. My code always assigns a proper sign, it
assigns positive zero for strings representing the "true" 0, for example "0", "+0", "0e+3". It creates a negative zero
however, when the minus sign is given explicitly, like in "−0.0000E−6".

The second special exponent value is all ones (2047). This value has meaning of infinity (positive or negative)
if the significand is 0, or "not a number" (*NaN* in short) for non-zero significand. *NaN* is used to
represent results of mathematical operations making no sense in the real numbers domain, like square root of a
negative number. Conversion from string does not generate *NaN*-s, it simply stops at the first wrong character,
and even if the first character is wrong, it returns 0. It can generate infinities however, when the number is too
big. Below you can find binary values for 4 special numbers generated by my code (given as hexadecimals):

$0000 0000 0000 0000 - positive zero

$8000 0000 0000 0000 - negative zero

$7FF0 0000 0000 0000 - positive infinity

$FFF0 0000 0000 0000 - negative infinity

## 3. Divide et impera

This old Roman proverb applies also to computer science. If some problem can be splitted into two smaller problems,
half of the work is already done. Problem of our conversion may be splitted too. The first stage would be to convert
the string to internal representation, the second will convert this representation into a proper *double* number.
The internal representation I've choosen is very simple:

v = 10^{e} · x

where *e* and *x* are integers (*x* is positive integer). Size of
*e* may be estimated from the minimum and maximum values representable as *double*. It must hold numbers
roughly from −327 to +308, so even 16-bit variable is enough, the best would be to just use *int*. The variable *x*
however must be bigger. Double number provides 16 significant decimal digits. 32-bit *int* provides 9 digits only.
Then we have to use 64-bit *int* (*long long int* in GCC). Fortunately most compilers have built-in 64-bit
integer arithmetic.

### 3.1. String Parser

As I have an electronic education background, I like to implement string parsers as finite state machines. Such
parsers are easy to understand, implement, debug and modify. In this case such a parser takes one character at a
time and need not to look either ahead or back. Thanks to this, implementation of *fstrtod()* for example is
straightforward. The parser has to support plain number notation (just digits and decimal point) and scientific
notation with power of ten base. It has to support also common things like:

- any number of leading zeros (0000032, 0000.73),
- non-significant trailing zeros in fractional part (3.474650000),
- explicit "+" sign in front of the number (+32.746),
- ommitted leading zero before the decimal point (.43),
- small "e" and capital "E" as exponent part delimiter,
- leading zeros in exponent (3.6E00000004),
- explicit "+" sign in exponent (7E+2).

It looks messy, but a finite state machine can sort it out easily. Let's start from the state chart:

The operation denoted as *getc()* reads next character from input stream. For plain string it may be defined as an inline
function (or even a macro) deferencing a pointer and incrementing it (well known **ptr++;* construct). It is worth noting, that
in this case the code does not use standard library calls at all. For *fstrtod()* implementation, getting a char will be just
*fgetc()* call.

### 3.2. Description of Parser States

A – skipping leading whitespaces. In my code whitespaces are hardcoded to ASCII space
and tabulator. One can call standard *isspace()* or do whatever is needed. This state can be even ommitted in special
cases, where leading spaces are known to be not existing. I've added this feature to keep compatibility with *strtod()*.
When the current character is a whitespace, next one is fetched and the machine stays in state **A**. Otherwise the machine
goes to state **B**.

B – reading number sign. "+" causes next char to be fetched. "−" does the
same, but also sets the "negative" flag. Any digit (including 0) switches the machine to state **C**. Other characters
are not allowed here, so they make the machine to finish parsing. This short path increases code performance, when it is
run on a string not being a valid number.

C – skipping leading zeros in the integer part of mantissa. These zeros are meaningless,
so they are simply skipped. Zero digit here causes getting next character and staying in the state. Decimal point fetches
new character and switches machine into state **D** (the integer part of mantissa is just 0, either explicit, or mantissa
starts from the decimal point, for example ".384"). Any other character changes the state to **E**.

D – reading leading zeros in the fractional part of mantissa. Such zeros aren't significant
digits, because state **D** is entered only if integer mantissa part was 0. These zeros change the exponent however. Every zero makes
the machine to fecth next char and decrease exponent by 1. The decrease operation is protected against overflow. It is hard to
imagine two billions of zeros fed to the algorithm, but it is theoretically possible, hence the protection. Any other character
immediately changes the state to **F**.

E – reading integer part of mantissa. If less than 18 significant digits have been stored
so far, current digit is stored. It is done by multiplying the mantissa by 10 and adding the digit to it. Digits after the
18-th one are not stored, but every such digit shifts the whole number one decimal position left. It is accounted by increasing
the exponent by 1. This operation is protected against overflow similarly as in the state **D**. After digit storage, the next
character is fetched, and the machine stays in the current state. Encountering a decimal point causes next character fetching
and going to state **F**. Any other character changes the state to **F** immediately.

F – reading fractional part of mantissa. Works similar to state **E**. Digits are stored in
the same way, until reaching 18. Unlike in the state **E** storing digits moves the imagined decimal point to the left, so to keep
the value, exponent is decremented by 1 with each stored digit. On the other hand, digits after 18-th significant digit do not change
the value at all, so these digits are just skipped. If letter "E" or "e" is encountered, it is accepted, then machine goes to state
**G**. Any other character switches to **G** immediately.

G – reading sign of exponent. "+" is accepted and skipped, "−" additionally sets "negative
exponent" flag. Any other character implies positive exponent and advances the state machine to state **H**.

H – skipping leading zeros of exponent. They are meaningless.

I – reading exponent digits. Digits are registered in the same way as mantissa digits.
Overflow protection is applied here, a digit is not stored if it can create overflow. Exponent overflow happens well above
the upper limit of *double*, so the final result will be infinity anyway. Any non-digit character stops the parser.

### 3.3. Parser Final Part

After the state machine finishes, some operations have to be done. First the exponent resulting from decimal point position
is merged with exponent specified explicitly. For example number "3.753e16" will be parsed to 3753 with exponent −3 and
explicit exponent 16. The final exponent will be 13 then, as the number is equal to 3753 · 10^{13}.
The second operation is preliminary range checking. Every number with final exponent above 308 will end as infinity. Every number
with the final exponent below −328 will end as zero, even if the integer mantissa is the highest possible 999 999 999 999.
The parser signals this fact by its return value, so converter, the second part of code, can be skipped.

**NOTE** added 16 Oct 2014

Alexander Guzhva noticed a bug triggered by strings denoting the zero number, using any number of fractional zeros, like "0.0" or "0.000000". In such a case the parser finishes with zero mantissa and negative exponent. Then the converter tries to normalize the mantissa. It means shifting it left until the most significant bit is "1". Of course it never happens when the mantissa is zero. Then the code falls into an infinite loop. It has been fixed in the parser by explicit check for zero mantissa in the final stage. For zero mantissa either PARSER_PZERO or PARSER_MZERO (when explicit minus sign has been given in the string) is returned, so the converter part is skipped completely.

## 4. Conversion by Compensation

The parser discussed above converts the input string to the normalized base-ten notation expressed as:

v = s · 10^{e}

where *s* is an integer significand ranging from 0 to 10^{18} −1, and *e* is an integer exponent
already limited to [−328, +309]. We have to convert it to another form:

v = x · 2^{y}

where *x* will be a 52-bit fractional part of a real number from [1.0, 2.0) interval and *y* will be an integer number
from [−1022, +1023]. The conversion should not use floating point arithmethic itself. How to do it? The idea is to use an intermediate
formula:

v = s · 10^{e} · 2^{y}

where *y* is initially zero, so it does not change the value *v* (as 2^{0} is simply 1). Then our goal is to remove
the 10^{e} factor. We can do it by reducing *e* to zero. Every change of *e* must be **compensated** by simultaneous
change of *y* and, as we want *y* to stay integer, also change of *s*. When initial decimal exponent *e* is positive,
it has to be decremented. Every decrementation by 1 equals to division of the whole number by 10. Then, the significand *s* has to be
multiplied by 10 and after that renormalized. Renormalization is just repeated division by 2 to protect the significand against overflow.
Fortunately dividing an integer by 2 is just bitwise shift right by one bit. Every such shift is compensated by incrementing the binary exponent
*y*.

Reduction of negative decimal exponent is done in similar way. Exponent is increased by 1 in a loop. Every increase means multiplying the whole number by 10, then significand is divided by 10 to compensate. Renormalization is performed with bitwise left shift, compensated by subtracting 1 from the binary exponent.

### 4.1. Binary Trickery

As stated at the start of chapter 3., our significand is an integer and occupies a 64-bit variable. The conversion involves division by 2 and by 10. Integer division usually means rounding errors. When significand is divided by 10, fractional part would be truncated, then renormalization would shift in zeros instead of lost fractional part. Similarly division by 2 will cause rounding errors when a bit shifted out will be "1". Following multiplication by 10 will "poison" the integer part of significand with the rounding error. This is unacceptable. We need to add some bits for fractional part. How many of them? Fortunately for us, the renormalization keeps the magnitude of significand more or less constant, so probably 5 or 6 bits would suffice. Anyway, adding those bits will mean crossing 64-bit boundary. We will have to implement own arithmetic routines then. As most nowadays processors use 32-bit words, there is no point in adding less than 32 bits. It means significand will use 96 bits.

Significand is then placed in the lower two 32-bit words of three-word 96-bit number. Now is the time for some virtual operation.
We can imagine a decimal (or rather "binary" in this case) point in this number. Of course, as the number is integer, this
point is placed right of the least significant bit. We move it then and place after 4 most significant bits of the whole 96-bit
number. This operation means division by 2^{92}, so then we have to change *y* from 0 to +92.

Before starting with conversion, we need a basic set of arithmetic operations for 96-bit numbers. Two first operations are addition and subtraction. This is how they may be implemented in pure C:

#define add96(s2, s1, s0, d2, d1, d0) { \

unsigned int _x, _c; \

_x = (s0); (s0) += (d0); \

if ((s0) < _x) _c = 1; else _c = 0; \

_x = (s1); (s1) += (d1) + _c; \

if (((s1) < _x) || (((s1) == _x)) && _c) _c = 1; else _c = 0; \

(s2) += (d2) + _c; }

#define sub96(s2, s1, s0, d2, d1, d0) { \

unsigned int _x, _c; \

_x = (s0); (s0) -= (d0); \

if ((s0) > _x) _c = 1; else _c = 0; \

_x = (s1); (s1) -= (d1) + _c; \

if (((s1) > _x) || (((s1) == _x) && _c)) _c = 1; else _c = 0; \

(s2) -= (d2) + _c; }

**NOTE** added 26 Sep 2013

Raúl Gutiérrez has found, that the two above macros had a rare bug related to carry/borrow propagation. For *add96()* it happened when *s1* was 0xFFFFFFFF and *s0* + *d0* generated a carry, *s1* + *d1* + *c*, should generate a carry as well for any non-zero *d1*, but it did not. Assuming even distribution of possible values of arguments, this bug happened in 1/2^{33} of cases. Similarly for *sub96()*, if *s1* was 0x00000001 and *s0* − *d0* generated a borrow, *s1* − *d1* − *c* did not generate a borrow for any non-zero *d1*. Old, buggy versions of these macros are given below for reference:

#define add96(s2, s1, s0, d2, d1, d0) { \

unsigned int _x, _c; \

_x = (s0); (s0) += (d0); _c = 0; \

if ((s0) < _x) _c = 1; \

_x = (s1); (s1) += (d1) + _c; _c = 0; \

if ((s1) < _x) _c = 1; \

(s2) += (d2) + _c; }

#define sub96(s2, s1, s0, d2, d1, d0) {\

unsigned int _x, _c; \

_x = (s0); (s0) -= (d0); _c = 0; \

if ((s0) > _x) _c = 1; \

_x = (s1); (s1) -= (d1) + _c; _c = 0; \

if ((s1) > _x) _c = 1; \

(s2) -= (d2) + _c; }

There is nothing special in them. Both operations are performed from right to left in 32-bit chunks with carry/borrow propagation.
These macros have some disadvantages however. They contain conditional instructions, which are not very good for nowadays processors with deep pipelines. They also ignore the fact, that almost any processor has multiword precision (also known as "carrying") additions and subtractions instructions. If one uses a compiler supporting 64-bit integers (*long long* type), better macros may be used:

#define add96(s2, s1, s0, d2, d1, d0) { \

long long w; \

w = (long long)(s0) + (long long)(d0); \

(s0) = w; \

w >>= 32; \

w += (long long)(s1) + (long long)(d1); \

(s1) = w; \

w >>= 32; \

w += (long long)(s2) + (long long)(d2); \

(s2) = w; }

#define sub96(s2, s1, s0, d2, d1, d0) { \

long long w; \

w = (long long)(s0) - (long long)(d0); \

(s0) = w; \

w >>= 32; \

w += (long long)(s1) - (long long)(d1); \

(s1) = w; \

w >>= 32; \

w += (long long)(s2) - (long long)(d2); \

(s2) = w; }

These two macros have following advantages:

- No conditional expressions, so no branches in the code.
- They use extended precision arithmetic instruction on 32-bit processors (for example
**addc**and**adde**on PowerPC,**adc**and**add**on x86). One macro uses 8 such instructions. - On 64-bit processors, these macros are compiled using 64-bit additions and subtractions (4 instructions).

Another two simple macros are for bitwise shift left and right (by one bit):

#define lsr96(s2, s1, s0, d2, d1, d0) \

d0 = ((s0) >> 1) | (((s1) & 1) << 31); \

d1 = ((s1) >> 1) | (((s2) & 1) << 31); \

d2 = (s2) >> 1;

#define lsl96(s2, s1, s0, d2, d1, d0) \

d2 = ((s2) << 1) | (((s1) & (1 << 31)) >> 31); \

d1 = ((s1) << 1) | (((s0) & (1 << 31)) >> 31); \

d0 = (s0) << 1;

Again looking into particular CPU list of instructions may provide a faster solution. There is also some caveat in these macros.
C language standard does not specify what is shifted to the most significant bits during a right shift. In the code I use only **unsigned**
variables and therefore assume that right shift shifts in zeros. Most compilers on most platforms do exactly that.

### 4.2. Decimal Exponent Reduction

Equipped with this nice 96-bit arithmetic, we are ready to reduce the decimal exponent. Let's start from the case, where the exponent
is positive. In every turn of reduction, we will decrease it by 1. Then we have to compensate this by multiplying significand by 10. It can be
easily done when we notice that 10 = 8 + 2 and multiplication by 2 and by 8 can be done by bitshifting. What about
overflow? When setting implicit "binary point", we've left 4 most significant bits free. These 4 bits ensure multiplication by 10 will never
overflow. The next step is renormalization. If the integer part (those 4 bits) of significand is not 0, we keep shifting the significand
right and incrementing the binary exponent *y*. As we cleared those 4 bits again, we are sure, the next multiplication will not
overflow. After the cycle is finished, the total value *v* is kept unchanged. Reduction cycle is repeated
until the decimal exponent *e* is 0. Here is the code:

while (e > 0)

{

lsl96(s2, s1, s0, q2, q1, q0); *// q = s * 2*

lsl96(q2, q1, q0, r2, r1, r0); *// r = s * 4*

lsl96(r2, r1, r0, s2, s1, s0); *// s = s * 8*

add96(s2, s1, s0, q2, q1, q0); *// s = (s * 8) + (s * 2)*

e--;

while (s2 & 0xF0000000)

{

lsr96(s2, s1, s0, q2, q1, q0);

y++;

s2 = q2;

s1 = q1;

s0 = q0;

}

}

The second case is reduction of negative decimal exponent. In every turn of reduction it is increased by 1. Compensation requires
us to divide the significand by 10 then. To keep the division rounding error small, we renormalize **before** dividing. The significand
is shifted left until the first "1" bit hits the most significant position. Every shift is compensated by decrementation
of the binary exponent *y*. There can be found some tricky algorithms for
division by 10, but they are usually only for 32-bit (or even 16-bit) numbers. For 96-bit number I've applied a plain
long division algorithm adopted to binary system and 32-bit chunks. It requires
4 32-bit divisions and 3 modulo operations. Many compilers will be able to optimize those divisions (all are by 10) into multiplications
by prescaled 10 reciprocal. At least GCC 4 does it for PowerPC processors. My guess is any "magic" bit-juggling division algorithm won't
be faster than that for 96-bit numbers.

while (e < 0)

{

while (!(s2 & 0x80000000))

{

lsl96(s2, s1, s0, q2, q1, q0);

y--;

s2 = q2;

s1 = q1;

s0 = q0;

}

q2 = s2 / 10;

r1 = s2 % 10;

r2 = (s1 >> 8) | (r1 << 24);

q1 = r2 / 10;

r1 = r2 % 10;

r2 = ((s1 & 0xFF) << 16) | (s0 >> 16) | (r1 << 24);

r0 = r2 / 10;

r1 = r2 % 10;

q1 = (q1 << 8) | ((r0 & 0x00FF0000) >> 16);

q0 = r0 << 16;

r2 = (s0 & 0xFFFF) | (s1 << 16);

q0 |= r2 / 10;

s2 = q2;

s1 = q1;

s0 = q0;

e++;

}

### 4.3. Final Normalization And Limiting

At this stage of algorithm, decimal exponent *e* has been reduced to 0, so 10^{e} term is eliminated from
our expression. Before we are ready to assemble double precision number, the significand should be brought down to [1.0, 2.0) range.
It means its integer part (4 most significant bits) must be equal to 1. Previous stages of the algorithm ensure the higher bound,
so we only have to take care of significand being below 1.0. A loop keeps shifting it left, until "1" bit arrives at position 92.
Every shift is compensated by decreasing the binary exponent. A special case is significand being just 0. Shifting would be done
endlessly then, so this case is detected before the normalization loop.

if (p2 || p1 || p0)

{

while (!(s2 & 0xF00000000))

{

lsl96(s2, s1, s0, q2, q1, q0);

binexp--;

s2 = q2;

s1 = q1;

s0 = q0;

}

}

Now significand and binary exponent are ready. It's time for final, precise *double* number limit checking.
If the exponent is out of allowed range, either zero (for *y* < −1022) or infinity (for *y* >
+1023) is generated. Sign of it is taken from the sign flag set by parser. If limit check is passed, the final number
is assembled with some bit masking:

## 5. The Code

The algorithm base consists of two internal functions: *parser()* and *converter()*. Using them, standard
library functions like *atof()* or *strtod()* can be written. Full implementation of the later one will require
adding support for hexadecimal floats (very rarely used) and for "Inf", "Infinity" and "NaN" strings. The
code is licensed on
BSD Two Clause License.