NumPy and friends

The NumPy package is the key building block of the Python scientific ecosystem.

In this chapter we introduce a few of the key concepts. You should refer to the documentation for details. As with any mature software ecosystem, you should first assume that what you want to achieve can be achieved in a highly optimised way within the existing framework, and only resort to creating your own solution if and when you satisfy yourself that this is not the case.

There are a huge number of resources for learning NumPy online. Here are a couple of good ones:

1 Preamble: objects in Python

Everything in Python is an object. For example [1,2,3] is a list:

my_list = [1, 2, 3]
type(my_list)
list

You can think of an object as a container for properties and methods, the latter being functions associated with the object. Properties and methods are accessed with the . syntax. For example, lists have the append method, which adds an element to the end of the list:

my_list.append("boop")
my_list
[1, 2, 3, 'boop']

With IPython you can see all the available methods by hitting tab:

You can list all of an objects properties and methods using dir:

dir(my_list)
['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

Notice that lots of these are methods have a name sandwiched between double underscores and for this reason are called dunder methods (or magic methods, or just special methods). This is to indicate that they are not to be used by you, but by the Python interpreter to implement certain standard functions that apply to many different classes of objects. For instance, when you write len(my_list) to find the length of my_list Python is actually calling the dunder method my_list.__len__ which does the job of actually finding the length.

my_list.__len__()
4

In this way the same function (len in this case) can operate on many different objects, an example of what is called polymorphism in object oriented programming.

2 Arrays

import numpy as np

The fundamental object in NumPy is the Array (or ndarray), which you can think of as a multidimensional version of a list. If we were representing two dimensional data — or a matrix — in plain old Python you would use a list of lists.

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

Here data[i] represents each row:

data[1]
[4, 5, 6]

If I wanted to multiply every element by a number I would have to do something like this:

for row in data:
    for j, _ in enumerate(row):
        row[j] *= 2
data
[[2, 4, 6], [8, 10, 12], [14, 16, 18], [20, 22, 24]]

If the details here are a bit unfamiliar don’t worry. The point I want to emphasize is: don’t do this.

NumPy provides us with the tools to perform this kind of operation with minimum code and maximum efficiency. First we create our data as an array. There are numerous NumPy functions that produce arrays. The simplest is numpy.array. It takes data in the “Pythonic” list-of-lists(-of-lists-of… etc.) form and produces the corresponding ndarray

my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
type(my_array)
numpy.ndarray

Now, how do we multiply the whole array by a number? Like this, of course

2 * my_array
array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18],
       [20, 22, 24]])

It even prints nicely

print(my_array)
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]

Arrays can be indexed, similar to lists

print(my_array[0], my_array[1], my_array[3][1])
[1 2 3] [4 5 6] 11

but — different from a ordinary list of lists — the last one can be much more pleasantly achieved with the syntax

my_array[3,1]
11

We also have a generalization of the slice syntax

my_array[1:, 1:]
array([[ 5,  6],
       [ 8,  9],
       [11, 12]])

Slicing can be mixed with integer indexing

my_array[1:, 1]
array([ 5,  8, 11])

NumPy offers all sorts of fancy indexing options for slicing and dicing your data: see the documentation for details.

A fundamental property of an array is its shape:

# [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
my_array.shape
(4, 3)

The way to read off the shape of an array is as follows. To begin with you encounter a number of [ corresponding to the rank of the array (two in the above example). You then scan over a number of entries that give the rightmost (innermost) dimension in the shape tuple before closing ] (3 here). After a number of 1D arrays [...] equal to the next innermost dimension (4 here), we have another closing ], and so on.

It’s definitely something that will take a bit of time getting used to!

Notice that slicing does not change the rank of the array

my_array[1:, 1:].shape
(3, 2)

but integer indexing does

my_array[1:, 1].shape
(3,)

It’s to distinguish the tuple (3,) giving the shape from (3) which is just the number 3 in brackets

type((3,))
tuple
type((3))
int

It wouldn’t be very convenient if you always had to create your arrays from nested lists using np.array. Fortunately NumPy has lots of methods to create arrays with a given shape and populated in different ways:

a = np.zeros((2,2))
print(a)

b = np.ones((2,2))
print(b)

c = np.full((2,2), 5)
print(c)

d = np.random.random((2,2)) # random numbers uniformly in [0.0, 1.0)
print(d)
[[0. 0.]
 [0. 0.]]
[[1. 1.]
 [1. 1.]]
[[5 5]
 [5 5]]
[[0.09186552 0.47712133]
 [0.16653795 0.87510057]]

There are also lots of methods to change the shape of arrays, for example

Let’s see an example of the first one:

my_array.reshape(2, 2, 3)
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Reshaping only works if the shapes are compatible. In this case it’s OK because the original shape was \((4,3)\) and \(4\times 3 = 2\times 2\times 3\). If the shapes aren’t compatible, we’ll get an error

my_array.reshape(2, 3, 3)
ValueError: cannot reshape array of size 12 into shape (2,3,3)

A NumPy array has a dtype property that gives the datatype. If the array was created from data, this will be inferred

my_array.dtype
dtype('int64')

Functions that construct arrays also have an optional argument to specify the datatype

my_float_array = np.array([1,2,3], dtype=np.float64)
my_float_array.dtype
dtype('float64')

Importantly, complex numbers are supported

my_float_array = np.array([1.1 + 2.3j,2.2,3.6])
my_float_array.dtype
dtype('complex128')

3 Examples of array-like data

It’s not hard to come up with examples of array-like data in physics. The position, velocity, or acceleration of a particle will be three dimensional vectors, so have shape (3,). If we had \(N\) particles you could either put them in a \(3N\) dimensional vector, but for reasons that will become clear it would be better to store them in an array of shape (N,3), so that the first index indexes the particle number and the second the particle coordinate.

An \(N\times M\) matrix has shape (N,M). The Riemann curvature tensor in General Relativity \(R_{abcd}\) has shape (4,4,4,4).

Often we deal with fields which are functions of space and time e.g. the electric potential \(\phi(\mathbf{r},t)\). On a computer we often approximate these using a grid of space-time points \(N_x\times N_y \times N_z\times N_t\). Thus the values of a scalar field could be stored in an array of shape (N_x,N_y,N_z,N_t). If we had a vector field like \(\mathbf{E}(\mathbf{r},t)\) this would be (N_x,N_y,N_z,N_t,3). You get the idea.

A very useful method to create a grid of coordinate values (at which you can evaluate a function, say) is as follows

# Grid of x, y points
nx, ny = 64, 64
x = np.linspace(-2, 2, nx)
y = np.linspace(-2, 2, ny)
X, Y = np.meshgrid(x, y)
Dipole electric field
# Example from https://scipython.com/blog/visualizing-a-vector-field-with-matplotlib/
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def E(q, r0, x, y):
    """Return the electric field vector E=(Ex,Ey) due to charge q at r0."""
    den = np.hypot(x-r0[0], y-r0[1])**3
    return q * (x - r0[0]) / den, q * (y - r0[1]) / den

# Dipole
charges = [(1, (1, 0)), (-1, (-1, 0))]

# Electric field vector, E=(Ex, Ey), as separate components
Ex, Ey = np.zeros((ny, nx)), np.zeros((ny, nx))
for charge in charges:
    ex, ey = E(*charge, x=X, y=Y)
    Ex += ex
    Ey += ey

fig = plt.figure()
ax = fig.add_subplot(111)

# Plot the streamlines with an appropriate arrow style
color = 2 * np.log(np.hypot(Ex, Ey))
ax.streamplot(x, y, Ex, Ey, color=color, linewidth=1,
              density=2, arrowstyle='->', arrowsize=1.5)

# Add filled circles for the charges themselves
charge_colors = {True: '#aa0000', False: '#0000aa'}
for q, pos in charges:
    ax.add_artist(Circle(pos, 0.05, color=charge_colors[q>0]))

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim(-2,2)
ax.set_ylim(-2,2)
ax.set_aspect('equal')
plt.show()

4 Mathematical operations with arrays

I’ve already hinted at the payoff from all of this. On lists, multiplication by an integer concatentates multiple copies

2 * [1, 2, 3]
[1, 2, 3, 1, 2, 3]

which is sometimes useful. But in numerical applications what we really want is this

2 * np.array([1, 2, 3])
array([2, 4, 6])

This illustrates a general feature of NumPy that all mathematical operations are performed elementwise on arrays!

print(np.array([1, 2, 3]) + np.array([4, 5, 6]))
print(np.array([1, 2, 3])**2)
print(np.sqrt(np.array([1, 2, 3])))
[5 7 9]
[1 4 9]
[1.         1.41421356 1.73205081]

This avoids the need to write nested loops to perform some operation on each element of some multidimensional data. Of course, the loops are still there, it’s just that NumPy handles them in highly optimized C rather than Python. Code which operates in this way – rather than with explicit loops – is often described as vectorized, and in NumPy-speak vectorized functions are called ufuncs, short for universal functions (you can write your own if you need to). As a basic principle you should never use a Python loop to access your data in NumPy code. Loops may appear at a high level in stepping through time steps in a simulation, for example.

5 Broadcasting

Vectorization is even more versatile than the above examples might suggest. Broadcasting is a powerful protocol that allows us to combine arrays of different shapes. Thus we can add a number to an array

np.array([1, 2, 3]) + 2.3
array([3.3, 4.3, 5.3])

More generally, elementwise operations can be performed on two arrays of the same rank if in each dimension the sizes either match or one array has size 1.

# These have shape (2, 3) and (1, 3)
np.array([[1, 2, 3], [4, 5, 6]]) + np.array([[4, 3, 2]])
array([[5, 5, 5],
       [8, 8, 8]])

In fact, we can simplify this last example

# These have shape (2, 3) and (3,)
np.array([[1, 2, 3], [4, 5, 6]]) + np.array([4, 3, 2])
array([[5, 5, 5],
       [8, 8, 8]])

Broadcasting two arrays follows these rules:

  1. If the arrays do not have the same rank, prepend the shape of the lower rank array with 1s until both shapes have the same length.

  2. The two arrays are said to be compatible in a dimension if they have the same size in the dimension, or if one of the arrays has size 1 in that dimension.

  3. The arrays can be broadcast together if they are compatible in all dimensions. After broadcasting, each array behaves as if it had shape equal to the elementwise maximum of shapes of the two input arrays.

  4. In any dimension where one array had size 1 and the other array had size greater than 1, the first array behaves as if it were copied along that dimension.

The documentation has more detail.

Recall our earlier example of an \(N\)-particle system described by a position array of shape (N,3). If we want to shift the entire system by a vector, we can just add a vector of shape (3,) and broadcasting will ensure that this applied correctly to each particle.

6 Plotting with Matplotlib

There are various specialized Python plotting libraries but the entry-level option is the catchily named Matplotlib. The pyplot module provides a plotting system that is similar to MATLAB (I’m told)

import matplotlib.pyplot as plt

This is probably the second most common import you will make. Here’s a simple example of the plot function, used to plot 2D data

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()

Note: you must call plt.show() to make graphics appear. Here’s a fancier example with some labelling

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

Often you’ll want to make several related plots and present them together, which can be achieved using the subplot function

import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()

6.1 Example: playing with images

Pixels in an image are encoded as a triple of RGB values in the range [0,255] i.e. 8 bits of type uint8 (the “u” is for “unsigned”). Tinting an image gives us a nice example of broadcasting

img = plt.imread('../assets/lucian.jpeg')

img_tinted = img * [1, 0.55, 1]

# Show the original image
plt.subplot(1, 2, 1)
plt.imshow(img)
plt.title("Lucian")

# Show the tinted image
plt.subplot(1, 2, 2)
plt.title("Pink Panther")
# Having multiplied by floats, 
# we must cast the image to uint8 before displaying it.
plt.imshow(np.uint8(img_tinted))

plt.show()
img.shape, img.dtype

((4032, 3024, 3), dtype('uint8'))

This is a standard 12 megapixel image.

7 Saving and loading data

In the course of your work you are likely to produce, as well as consume lots of data. While it’s good practice to keep notebooks capable of reproducing any of your analyses, this could be time consuming and resource heavy for larger computations. Thus at some point you’ll probably want to save and load data. For example, after saving the data of a large scale simulation you’d like to load it and perform some analysis.

NumPy comes with its own save and load functions and associated binary format .npy. The benefit of using these is that after loading you get back a NumPy array ready to be used.

A related function savez allows several arrays to be saved and then loaded as a dictionary-like object.

random_matrix_1 = np.random.rand(4, 4)
random_matrix_2 = np.random.rand(4, 4)
np.savez("../assets/my-matrices", first_matrix=random_matrix_1, second_matrix=random_matrix_2)
%ls ../assets
fibonacci.png     ising.js          metropolis.png    tab-complete.png
hard-spheres.png  ising.py          my-matrices.npz
ia-question.png   lucian.jpeg       tab-complete.gif
my_matrix_file = np.load("../assets/my-matrices.npz")
my_matrix_file['first_matrix']
array([[0.74247601, 0.5018754 , 0.94228006, 0.42209143],
       [0.71437331, 0.54953203, 0.50018677, 0.87390299],
       [0.70960527, 0.43038589, 0.18670472, 0.08482399],
       [0.15953534, 0.67654528, 0.38712039, 0.24182652]])