Floating point and all that

Since physics is all about numbers we had better develop some understanding of how computers represent them, and the limitations of this representation. Hopefully this example is sufficiently motivating:

0.1 + 0.2 == 0.3
False

Ah…

1 Integers

Let’s begin with something simpler

1 + 1 == 2
True

which is a bit more reassuring. Integers can be represented in binary

3 == 0b11
True

or octal or hexadecimal (with a prefix 0o or 0x). You can get the binary string representing an integer using the bin function

bin(-2)
'-0b10'

Python allows for arbitrarily large integers, so there is no possibility of overflow or rounding error

2**100
1267650600228229401496703205376

The only limitation is the memory required to store it.
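
You can watch that memory cost grow with the size of the integer using sys.getsizeof from the standard library (a quick sketch; the exact byte counts are CPython implementation details):

import sys
# CPython stores big integers in fixed-size chunks, so the footprint in bytes
# grows with the number of bits (the precise sizes vary between versions)
for n in (2**10, 2**100, 2**1000):
    print(n.bit_length(), sys.getsizeof(n))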

NumPy integers are a different story

import numpy as np
np.int64(2**100)
OverflowError: Python int too large to convert to C long

Since NumPy is implemented in C, its integers are fixed-width C types. The range of integers that can be represented with 32 bit numpy.int32s is \(\approx\pm 2^{31} \approx \pm 2.1 \times 10^9\) (one bit is for the sign) and with 64 bit numpy.int64s is \(\approx\pm 2^{63} \approx \pm 9.2 \times 10^{18}\). Apart from the risk of overflow when working with NumPy's integers, there are no other gotchas to worry about.
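
To see what overflow looks like in practice, here is a small sketch. Depending on the NumPy version and on whether you use array or scalar operations you may also get a RuntimeWarning, but either way the result wraps around rather than growing:

a = np.array([2**31 - 1], dtype=np.int32)   # the largest value an int32 can hold
print(a + 1)                                # wraps around to -2**31 = -2147483648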

2 Floating point numbers

The reason why \(0.1 + 0.2 \neq 0.3\) in Python is that specifying a generic real number exactly would require an infinite number of bits, so any finite representation is necessarily approximate. In particular, 0.1, 0.2, and 0.3 all have non-terminating binary expansions, so none of them is stored exactly.

The representation computers use for the reals is called floating point arithmetic. It is essentially a form of scientific notation, in which a significand (containing the significant figures) is multiplied by the base raised to some exponent. The name floating point reflects the fact that the number of digits after the decimal point is not fixed (I'm using the base ten terms for convenience).

This representation requires the choice of a base, and Python’s floating point numbers use binary. Numbers with finite binary representations therefore behave nicely

0.125 + 0.25 == 0.375
True
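
By contrast, 0.1 has no finite binary expansion, so what actually gets stored is the nearest 64 bit float. Asking for more digits (or for the exact integer ratio) makes this visible:

print(f"{0.1:.20f}")             # 0.10000000000000000555..., not exactly 1/10
print((0.1).as_integer_ratio())  # the exact (huge) fraction that is actually stored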

For decimal numbers to be represented exactly we’d have to use base ten. This can be achieved with the decimal module. Our \(0.1+0.2\) example then works as expected

from decimal import *
Decimal('0.1') + Decimal('0.2')
Decimal('0.3')

Since there is nothing to single out the decimal representation in physics (as opposed to, say, finance) we won’t have any need for it.

A specification for floating point numbers must give

  1. A base (or radix) \(b\)
  2. A precision \(p\), the number of digits in the significand \(c\). Thus \(0\leq c \leq b^{p}-1\).
  3. A range of exponents \(q\) specified by \(\text{emin}\) and \(\text{emax}\) with \(\text{emin}\leq q+p-1 \leq \text{emax}\).

Including one bit \(s\) for the overall sign, a number then has the form \((-1)^s\times c \times b^q\). The smallest positive nonzero number that can be represented is therefore \(b^{1 + \text{emin} - p}\) (corresponding to the smallest significand and exponent) and the largest is \((b^{p}-1)\times b^{1 + \text{emax} - p}\approx b^{1+\text{emax}}\) (the largest significand and exponent).

The above representation isn’t unique: for some numbers you could make the significand smaller and the exponent bigger. A unique representation is fixed by choosing the exponent to be as small as possible.

Representing numbers smaller than \(b^{\text{emin}}\) involves a loss of precision: the exponent has already reached its minimum value, so the number of digits in the significand falls below \(p\). These are called subnormal numbers. For binary floats, if we stick with the normal numbers the leading bit of a \(p\)-bit significand is always 1, and so can be dropped from the representation, which then only requires \(p-1\) bits.
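
For the IEEE binary64 floats described below, \(b=2\), \(p=53\), \(\text{emin}=-1022\) and \(\text{emax}=1023\). Here is a quick check of the formulas above against what the standard library reports; all three lines should print True on an IEEE 754 platform:

import math, sys

p, emin, emax = 53, -1022, 1023
print(sys.float_info.min == math.ldexp(1.0, emin))               # smallest normal, 2**-1022
print(5e-324 == math.ldexp(1.0, 1 + emin - p))                   # smallest subnormal, 2**-1074
print(sys.float_info.max == math.ldexp(2**p - 1, 1 + emax - p))  # largest finite float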

The specification for the floating point numbers used by Python (and many other languages) is contained in the IEEE Standard for Floating-Point Arithmetic (IEEE 754). The default Python float uses the 64 bit binary64 representation (often called double precision). Here's how those 64 bits are used (a short sketch after the two lists below unpacks these fields from an actual float)

  • \(p=53\) for the significand, encoded in 52 bits
  • 11 bits for the exponent
  • 1 bit for the sign

Another common representation is the 32 bit binary32 (single precision) with

  • \(p=24\) for the significand, encoded in 23 bits
  • 8 bits for the exponent
  • 1 bit for the sign
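
To make the binary64 layout concrete, here is a small sketch that unpacks the raw bits of a Python float using the standard struct module; the field widths are the ones listed above (1 sign bit, 11 exponent bits, 52 stored significand bits):

import struct

def float64_fields(x):
    """Split the 64 bits of a float into sign, exponent and fraction bit strings."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))   # reinterpret the 8 bytes as an integer
    s = f"{bits:064b}"
    return s[0], s[1:12], s[12:]

# for 0.1 the fraction field is the infinitely repeating pattern 1001...,
# rounded off after 52 bits -- the source of the 0.1 + 0.2 != 0.3 surprise
print(float64_fields(0.1))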

2.1 Floating point numbers in NumPy

If this is all a bit theoretical, you can just get NumPy's finfo function to tell you all about the machine precision

np.finfo(np.float64)
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)

Note that \(2^{-52}=2.22\times 10^{-16}\), which accounts for the quoted resolution of \(10^{-15}\). This can be checked by finding how small a number must be before adding it to 1.0 no longer changes the result.

x = 1.0
# keep shrinking x until adding it to 1.0 no longer changes the result
while 1.0 + x != 1.0:
    x /= 1.01
print(x)
1.099427563084686e-16

For binary32 we have a resolution of \(10^{-6}\).

np.finfo(np.float32)
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
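
The same experiment as above, repeated at single precision, confirms this (a quick sketch; the loop should stop at roughly \(6\times 10^{-8}\approx 2^{-24}\), half the float32 spacing at 1.0):

x = np.float32(1.0)
# keep shrinking x until adding it to 1.0 makes no difference at single precision
while np.float32(1.0) + x != np.float32(1.0):
    x = np.float32(x / 1.01)
print(x)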

One lesson from this is that taking small differences between numbers is a potential source of rounding error, as in this somewhat mean exam question

Solution: \(x-x'=x(1-\gamma^{-1})\sim x\beta^2/2\sim 4.2\text{mm}\).

import numpy as np
from scipy.constants import c
# 384400 km (the Earth-Moon distance) covered in 76 hours, as a fraction of c
beta = 384400e3 / (76 * 3600) / c
gamma = 1 / np.sqrt(1 - beta**2)
# at single precision the tiny difference from 1 is rounded away entirely
print(1 - np.float32(1/gamma), 1 - np.float64(1/gamma))
0.0 1.0981660025777273e-11
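
One standard cure is to rearrange the expression so that the cancellation is done algebraically rather than numerically. Here \(1-\gamma^{-1} = 1-\sqrt{1-\beta^2} = \beta^2/\bigl(1+\sqrt{1-\beta^2}\bigr)\), a form with no subtraction of nearly equal numbers. A sketch reusing beta from above:

# the rearranged form stays accurate even at single precision
beta32 = np.float32(beta)
print(beta32**2 / (1 + np.sqrt(1 - beta32**2)))
print(beta**2 / (1 + np.sqrt(1 - beta**2)))   # double precision, for comparison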

2.2 The dreaded NaN

As well as a floating point system, IEEE 754 defines Infinity and NaN (Not a Number)

np.array([1, -1, 0]) / 0
RuntimeWarning: divide by zero encountered in true_divide

RuntimeWarning: invalid value encountered in true_divide
array([ inf, -inf,  nan])

They behave more or less as you might guess; in particular, any comparison involving NaN evaluates to False

2 * np.inf, 0 * np.inf, np.inf > np.nan
(inf, nan, False)

NaNs propagate through subsequent operations

2 * np.nan
nan

which means that if you get a NaN somewhere in your calculation, you’ll probably end up seeing it somewhere in the output (which is the idea).
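
If you do need to track the NaNs down, remember that a NaN is not equal to anything, including itself, so use np.isnan (or the nan-aware reductions) rather than ==. A quick sketch:

a = np.array([1.0, np.nan, 3.0])
print(np.nan == np.nan)   # False: NaN compares unequal to everything, even itself
print(np.isnan(a))        # [False  True False]
print(np.nansum(a))       # 4.0: nan-aware reductions simply skip the NaNs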