2. Numpy

Read more in the Python Data Science Handbook (PDSH), Ch. 2.

Numpy introduces a new data type, the array. In many ways, the numpy array is like a list, and we’ll see many similarities when it comes to indexing and slicing arrays. But there are some key differences that make arrays particularly useful for data analysis. First is a restriction to ensure homogeneity.

  • Numpy arrays typically contain numerical or text data, or nested arrays (lists) of such data, and all elements must be of the same type.

  • Mixed numerical data (ints and floats) are upcast to the most general type (here, float) unless a data type is explicitly specified.
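A minimal sketch of the upcasting rule, using made-up values (the variable names are illustrative, not from the notes). The default integer width depends on the platform, so the sketch checks the dtype kind rather than an exact integer type.

```python
import numpy as np

ints = np.array([1, 2, 3])                  # all ints -> integer dtype
mixed = np.array([1, 2, 3.5])               # one float -> every element becomes a float
forced = np.array([1, 2, 3.5], dtype=int)   # explicit dtype: 3.5 is truncated to 3
text = np.array([1, "a"])                   # number mixed with text -> everything becomes text

print(ints.dtype.kind)    # i  (integer)
print(mixed.dtype)        # float64
print(forced)             # [1 2 3]
print(text.dtype.kind)    # U  (unicode text)
```

Note that forcing a narrower type silently discards information (the 0.5 above), so an explicit dtype is best used when the loss is intended.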

2.1. Creating a Numpy array

Let’s create a generic Numpy array and some special arrays.

  • generic array

  • empty array

  • array of ones or zeros

  • array of all one value

  • array of regularly spaced values

  • array of random numbers

import numpy as np
A = np.array([1, 2, 10, 4.1, 11])
B = np.array([])
C = np.zeros(7)
C2 = np.zeros_like(A)
D = np.ones(8)
D2 = np.ones_like(A)
E = np.full(20, 3)  # size of array, then fill value; like np.ones(20)*3, but keeps the integer dtype
F = np.arange(0, 100.1, 5)
G = np.random.randint(0, 10, 20)

display(A)
display(B)
display(C)
display(C2)
display(D)
display(D2)
display(E)
display(F)
display(G)
array([ 1. ,  2. , 10. ,  4.1, 11. ])
array([], dtype=float64)
array([0., 0., 0., 0., 0., 0., 0.])
array([0., 0., 0., 0., 0.])
array([1., 1., 1., 1., 1., 1., 1., 1.])
array([1., 1., 1., 1., 1.])
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
array([  0.,   5.,  10.,  15.,  20.,  25.,  30.,  35.,  40.,  45.,  50.,
        55.,  60.,  65.,  70.,  75.,  80.,  85.,  90.,  95., 100.])
array([3, 6, 7, 7, 0, 0, 3, 5, 0, 6, 4, 6, 1, 4, 6, 1, 8, 8, 8, 2])
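Two details of the cell above are easy to miss. np.arange and np.random.randint both exclude their upper bound, which is why F uses a stop of 100.1 to include 100. Random arrays also change on every run unless the generator is seeded. A short sketch, with illustrative variable names and an arbitrary seed:

```python
import numpy as np

stop_excluded = np.arange(0, 100, 5)     # stops before 100
stop_included = np.arange(0, 100.1, 5)   # stop nudged past 100 so that 100 is included

print(stop_excluded[-1], len(stop_excluded))   # 95 20
print(stop_included[-1], len(stop_included))   # 100.0 21

np.random.seed(0)                        # arbitrary seed, chosen for this sketch
first = np.random.randint(0, 10, 20)     # integers from 0 to 9; 10 is never drawn
np.random.seed(0)                        # re-seeding replays the same sequence
second = np.random.randint(0, 10, 20)

print(np.array_equal(first, second))     # True
print(first.max() <= 9)                  # True
```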

2.2. Multi-dimensional arrays

Similar to how lists can contain lists, arrays can have multiple dimensions.

Consider the matrix \(X\):

\[\begin{split} X = \begin{bmatrix} x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3}\\ x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3}\\ x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3}\\ \end{bmatrix} \end{split}\]

The dimensions of a matrix are \(num\_rows \times num\_columns\), for the matrix above \(3 \times 4\). The location of an element in the matrix is \((row\_index, column\_index)\).

For a 2-dimensional numpy array, we can treat the matrix as a list containing an individual list for each row.

\[\begin{split} X = \begin{bmatrix} \begin{bmatrix}x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3}\end{bmatrix}\\ \begin{bmatrix}x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3}\end{bmatrix}\\ \begin{bmatrix}x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3}\end{bmatrix}\\ \end{bmatrix} \end{split}\]
  • To get the dimensions of an array, we can query the array property .shape.

  • To index an element in an array, X[row_idx, col_idx] (and for higher dimensional arrays, just keep adding idx)

  • All the same slicing that is performed on lists can be performed on arrays, but now in any direction!
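As a small sketch of the list-of-rows idea, a nested list can be turned into a 2-D array. The values are placeholders chosen so that each entry encodes its own row and column, and X_demo is an illustrative name. The array accepts a single [row, col] index where the nested list needs two separate ones.

```python
import numpy as np

rows = [[0, 1, 2, 3],
        [10, 11, 12, 13],
        [20, 21, 22, 23]]      # 3 rows, 4 columns, like the matrix X above
X_demo = np.array(rows)

print(X_demo.shape)     # (3, 4)
print(rows[1][2])       # 12  nested list: index the row, then the column
print(X_demo[1, 2])     # 12  array: one index per dimension
print(X_demo[:, 2])     # [ 2 12 22]  a whole column, which a nested list cannot slice directly
```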

'''create a 4 x 5 array of random integers'''
np.random.seed(42)
X = np.random.randint(1, 101, [4,5])
X
array([[ 52,  93,  15,  72,  61],
       [ 21,  83,  87,  75,  75],
       [ 88, 100,  24,   3,  22],
       [ 53,   2,  88,  30,  38]])
'''get the shape of the array'''
X.shape
(4, 5)
'''extract entries of an array'''
X[2,1]
np.int64(100)
'''extract individual rows and columns'''
display(X)
first_row = X[0,:]
first_col = X[:,0]
array([[ 52,  93,  15,  72,  61],
       [ 21,  83,  87,  75,  75],
       [ 88, 100,  24,   3,  22],
       [ 53,   2,  88,  30,  38]])
'''all the slicing like lists'''
display(X)
# every other column
X[:,::2]

# every other column starting at the one-th column
X[:,1::2]

# the last three rows of the last two columns
X[-3:,-2:]

# the 1-th and 3-th rows of the 1-th and 4-th columns
X[1::2,1::3]
array([[ 52,  93,  15,  72,  61],
       [ 21,  83,  87,  75,  75],
       [ 88, 100,  24,   3,  22],
       [ 53,   2,  88,  30,  38]])
array([[83, 75],
       [ 2, 38]])
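Only the last expression in a notebook cell is displayed, so the cell above shows just the final slice. A sketch that prints each kind of slice on a small array with known values (M is an illustrative name) might look like this:

```python
import numpy as np

M = np.arange(12).reshape(3, 4)
# M is
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(M[0, :])        # first row: 0, 1, 2, 3
print(M[:, 0])        # first column: 0, 4, 8
print(M[:, ::2])      # every other column, starting at column 0
print(M[:, 1::2])     # every other column, starting at column 1
print(M[-2:, -2:])    # last two rows of the last two columns
```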

2.3. Masking

A mask is a matrix of boolean values. You can either 1) use a mask as an index to an array or 2) multiply an array by a mask.

  • As an index, the result will be a 1-D array of the values wherever the mask was True.

  • Multiplying by the mask, the result is an array of the same shape with 0 everywhere the mask is False and the original value where the mask is True.

np.random.seed(33)
Y = np.random.randint(0, 10, (10,10))
display(Y)
array([[4, 7, 8, 2, 2, 9, 9, 3, 6, 3],
       [3, 1, 7, 6, 0, 0, 6, 6, 0, 4],
       [8, 8, 3, 7, 9, 3, 3, 7, 3, 7],
       [2, 1, 3, 6, 9, 0, 0, 4, 9, 2],
       [5, 7, 1, 1, 4, 1, 1, 8, 4, 8],
       [3, 5, 8, 0, 9, 7, 7, 9, 6, 9],
       [1, 5, 0, 3, 0, 0, 9, 0, 3, 4],
       [8, 5, 9, 5, 8, 3, 2, 5, 5, 3],
       [8, 7, 0, 1, 2, 7, 6, 2, 4, 5],
       [9, 2, 1, 1, 2, 2, 6, 6, 7, 6]])
Ymask = Y>=5
Ymask*1
array([[0, 1, 1, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 1, 1, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 1, 1, 0, 1, 1, 1, 1, 1, 1],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
# multiply by a mask: keep the multiples of 3, zero out everything else
Y_maskindex = Y*(Y%3==0)
display(Y_maskindex)

# Y * Ymask
array([[0, 0, 0, 0, 0, 9, 9, 3, 6, 3],
       [3, 0, 0, 6, 0, 0, 6, 6, 0, 0],
       [0, 0, 3, 0, 9, 3, 3, 0, 3, 0],
       [0, 0, 3, 6, 9, 0, 0, 0, 9, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [3, 0, 0, 0, 9, 0, 0, 9, 6, 9],
       [0, 0, 0, 3, 0, 0, 9, 0, 3, 0],
       [0, 0, 9, 0, 0, 3, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0, 6, 0, 0, 0],
       [9, 0, 0, 0, 0, 0, 6, 6, 0, 6]])
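The cells above show the multiply form. The index form, along with assignment through a mask, can be sketched on a small array (the values and names here are illustrative):

```python
import numpy as np

Y_small = np.array([[4, 7, 8],
                    [3, 1, 6]])
mask = Y_small >= 5

selected = Y_small[mask]     # index form: 1-D array of the values where mask is True
zeroed = Y_small * mask      # multiply form: same shape, 0 where mask is False
print(selected)              # [7 8 6]
print(zeroed)                # rows [0 7 8] and [0 0 6]

capped = Y_small.copy()      # work on a copy so Y_small stays unchanged
capped[mask] = 5             # assign to every position where mask is True
print(capped)                # rows [4 5 5] and [3 1 5]
```

The index form loses the 2-D layout but is handy for statistics on the selected values; the multiply form keeps the layout at the cost of filling the rest with zeros.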

Challenge question: Create a 10x10 array of random integers between -50 and 150. Then create a new array in which any number below 0 is replaced with 1 and any number greater than 100 is replaced with 100.

np.random.seed(6)
A = np.random.randint(-50,151, [10,10])
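# caution: the strict inequalities send values exactly equal to 1 or 100 to 0, and (A<1) also maps 0 to 1
B = (A<1)*1 + (A>100) * 100 + ((1<A) & (A<100)) * A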

B
array([[ 88,  56,  59,  29,  56,  30,  12,   1,   1,  25],
       [ 27, 100,   1, 100,  60,  18, 100,   1,  80,   1],
       [100,  83,  75, 100,   1,  36,  64, 100, 100, 100],
       [ 82,  12,  30, 100,  77,  16,  13,  97,  35,  74],
       [  1,   1,   7,  83,   1,   1,  28,  69,   1, 100],
       [ 20,  13,   1,  17,  35,  89,  79,  32, 100,   0],
       [  1,  73,  55,   1, 100,   1, 100,  85,   1,  25],
       [100,   1,  64,  18, 100,   1, 100,   1,  75,  99],
       [ 20,  79,   1,  74, 100,  51,   1,   1,  27,  23],
       [100,  13,  16, 100,  99,  18, 100,  92,   1,   1]])
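A boundary-safe alternative can be sketched with np.where, which chooses between two values element by element. The test values below are hand-picked to sit on the edges of the rule, and this is one possible approach rather than the intended solution.

```python
import numpy as np

edge = np.array([-50, -1, 0, 1, 50, 100, 101, 150])

# below 0 -> 1, above 100 -> 100, everything else unchanged
fixed = np.where(edge < 0, 1, np.where(edge > 100, 100, edge))
print(fixed)        # [  1   1   0   1  50 100 100 100]

# the mask-sum expression from the cell above, applied to the same values
summed = (edge < 1)*1 + (edge > 100)*100 + ((1 < edge) & (edge < 100))*edge
print(summed)       # [  1   1   1   0  50   0 100 100]
```

The two results differ only at the inputs 0, 1 and 100, which is where the strict inequalities in the mask-sum version leave gaps.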

Create a 7 x 10 array of random integers. Extract an array of the first 3 elements of every other row.

np.random.seed(1)
C = np.random.randint(1,101, [7,10])
display(C)
C[::2,:3]
array([[38, 13, 73, 10, 76,  6, 80, 65, 17,  2],
       [77, 72,  7, 26, 51, 21, 19, 85, 12, 29],
       [30, 15, 51, 69, 88, 88, 95, 97, 87, 14],
       [10,  8, 64, 62, 23, 58,  2,  1, 61, 82],
       [ 9, 89, 14, 48, 73, 31, 72,  4, 71, 22],
       [50, 58,  4, 69, 25, 44, 77, 27, 53, 81],
       [42, 83, 16, 65, 69, 26, 99, 88,  8, 27]])
array([[38, 13, 73],
       [30, 15, 51],
       [ 9, 89, 14],
       [42, 83, 16]])

2.4. Views vs Copies

When you slice an array, the resulting sub-array is a view into the main array. This is true even if you save the sub-array as a new variable. What does this mean?

No new memory is allocated for the data in a view, so any change made to the sub-array is also made to the original array, and vice versa.

If we want to slice a sub-array and have it exist as an array independent of the original array, we must call .copy() on the slice.

While views might be confusing, they are incredibly useful for breaking up large data arrays to work with manageable chunks.
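One way to check whether two arrays share data is np.shares_memory. The sketch below (illustrative names and values) contrasts a slice, an explicit copy, and a boolean-mask selection, which returns a copy rather than a view.

```python
import numpy as np

big = np.arange(16).reshape(4, 4)
view = big[:2, 2:]            # basic slice -> view
indep = big[2:, 2:].copy()    # explicit copy
picked = big[big > 10]        # boolean-mask indexing -> copy

print(np.shares_memory(big, view))     # True
print(np.shares_memory(big, indep))    # False
print(np.shares_memory(big, picked))   # False

view[0, 0] = -1               # writes through to big[0, 2]
indep[0, 0] = -2              # leaves big[2, 2] untouched
print(big[0, 2], big[2, 2])   # -1 10
```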

Demo Exercise

  • Let’s create a 10 x 10 matrix of random integers from 1 to 5, call it Z.

  • Then let’s extract the upper-right quadrant as a view and the lower-right quadrant as a copy, Z_tr and Z_br respectively.

  • Now, let’s fill Z_tr with ones and Z_br with twos.

How do these changes affect the original array?

'''creating a big matrix (10x10)'''
np.random.seed(1)
Z = np.random.randint(1, 6, [10,10])
Z
array([[4, 5, 1, 2, 4, 1, 1, 2, 5, 5],
       [2, 3, 5, 3, 5, 4, 5, 3, 5, 3],
       [5, 2, 2, 1, 2, 2, 2, 2, 1, 5],
       [2, 1, 1, 4, 3, 2, 1, 4, 2, 2],
       [4, 5, 1, 2, 4, 5, 3, 5, 1, 4],
       [2, 3, 1, 5, 2, 3, 3, 2, 1, 2],
       [4, 5, 4, 2, 4, 1, 1, 3, 3, 2],
       [4, 5, 3, 1, 1, 2, 2, 4, 1, 1],
       [5, 3, 5, 4, 4, 1, 4, 5, 4, 5],
       [5, 5, 2, 1, 5, 3, 1, 3, 5, 2]])
'''slice the top-right quadrant as a view'''
Z_tr = Z[:5, 5:]

'''slice the bottom-right quadrant as a copy'''
Z_br = Z[5:, 5:].copy()

display(Z)
display(Z_tr)
display(Z_br)
array([[4, 5, 1, 2, 4, 1, 1, 2, 5, 5],
       [2, 3, 5, 3, 5, 4, 5, 3, 5, 3],
       [5, 2, 2, 1, 2, 2, 2, 2, 1, 5],
       [2, 1, 1, 4, 3, 2, 1, 4, 2, 2],
       [4, 5, 1, 2, 4, 5, 3, 5, 1, 4],
       [2, 3, 1, 5, 2, 3, 3, 2, 1, 2],
       [4, 5, 4, 2, 4, 1, 1, 3, 3, 2],
       [4, 5, 3, 1, 1, 2, 2, 4, 1, 1],
       [5, 3, 5, 4, 4, 1, 4, 5, 4, 5],
       [5, 5, 2, 1, 5, 3, 1, 3, 5, 2]])
array([[1, 1, 2, 5, 5],
       [4, 5, 3, 5, 3],
       [2, 2, 2, 1, 5],
       [2, 1, 4, 2, 2],
       [5, 3, 5, 1, 4]])
array([[3, 3, 2, 1, 2],
       [1, 1, 3, 3, 2],
       [2, 2, 4, 1, 1],
       [1, 4, 5, 4, 5],
       [3, 1, 3, 5, 2]])
Z_tr.fill(1)
Z_br.fill(2)

display(Z_tr)
display(Z_br)
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])
array([[2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2]])
display(Z)
array([[4, 5, 1, 2, 4, 1, 1, 1, 1, 1],
       [2, 3, 5, 3, 5, 1, 1, 1, 1, 1],
       [5, 2, 2, 1, 2, 1, 1, 1, 1, 1],
       [2, 1, 1, 4, 3, 1, 1, 1, 1, 1],
       [4, 5, 1, 2, 4, 1, 1, 1, 1, 1],
       [2, 3, 1, 5, 2, 3, 3, 2, 1, 2],
       [4, 5, 4, 2, 4, 1, 1, 3, 3, 2],
       [4, 5, 3, 1, 1, 2, 2, 4, 1, 1],
       [5, 3, 5, 4, 4, 1, 4, 5, 4, 5],
       [5, 5, 2, 1, 5, 3, 1, 3, 5, 2]])
Z[:,7] = 0
Z
array([[4, 5, 1, 2, 4, 1, 1, 0, 1, 1],
       [2, 3, 5, 3, 5, 1, 1, 0, 1, 1],
       [5, 2, 2, 1, 2, 1, 1, 0, 1, 1],
       [2, 1, 1, 4, 3, 1, 1, 0, 1, 1],
       [4, 5, 1, 2, 4, 1, 1, 0, 1, 1],
       [2, 3, 1, 5, 2, 3, 3, 0, 1, 2],
       [4, 5, 4, 2, 4, 1, 1, 0, 3, 2],
       [4, 5, 3, 1, 1, 2, 2, 0, 1, 1],
       [5, 3, 5, 4, 4, 1, 4, 0, 4, 5],
       [5, 5, 2, 1, 5, 3, 1, 0, 5, 2]])
display(Z_tr) # remember this is a view
display(Z_br) # and this is a copy
array([[1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1]])
array([[2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2]])
Exam1 = Z[:,3]  # a slice: a view into Z
A = Z.copy()    # an independent copy of Z
B = Z           # a second name for the same array, not a copy

2.5. Math on arrays

The nicest thing about numpy arrays is that they have been optimized for performing vectorized math operations. What does that mean? A math operation can be applied to every element of an array without writing a Python loop, and these vectorized operations are typically much, much faster because the looping happens in compiled code.
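A rough way to see the difference is to time both versions. The sketch below uses the standard-library timeit module, an arbitrary array size, and two hypothetical helper functions; exact timings depend on the machine, so none are quoted here.

```python
import timeit
import numpy as np

values = np.arange(10_000)

def square_with_loop(arr):
    """Square each element one at a time in a Python loop."""
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = arr[i] ** 2
    return out

def square_vectorized(arr):
    """Square every element in a single vectorized operation."""
    return arr ** 2

# both approaches give the same result
same = np.array_equal(square_with_loop(values), square_vectorized(values))
print(same)   # True

# total time for 10 repetitions of each approach
loop_time = timeit.timeit(lambda: square_with_loop(values), number=10)
vec_time = timeit.timeit(lambda: square_vectorized(values), number=10)
print(f"loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```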

A = np.random.randint(0,10, [10, 5])
display(A)
array([[7, 9, 8, 4, 0],
       [1, 9, 8, 2, 3],
       [1, 2, 7, 2, 6],
       [0, 9, 2, 6, 6],
       [2, 7, 7, 0, 6],
       [5, 1, 4, 6, 0],
       [6, 5, 1, 2, 1],
       [5, 4, 0, 7, 8],
       [9, 5, 7, 0, 9],
       [3, 9, 1, 4, 4]])
np.sin(A*np.pi/3)  # computed element-wise, but only the last expression in a cell is displayed
A**2
array([[49, 81, 64, 16,  0],
       [ 1, 81, 64,  4,  9],
       [ 1,  4, 49,  4, 36],
       [ 0, 81,  4, 36, 36],
       [ 4, 49, 49,  0, 36],
       [25,  1, 16, 36,  0],
       [36, 25,  1,  4,  1],
       [25, 16,  0, 49, 64],
       [81, 25, 49,  0, 81],
       [ 9, 81,  1, 16, 16]])

We can also perform operations that aggregate results over columns or rows (e.g. sum, mean, min, max) by passing the axis to aggregate along.

display(A.mean())   # mean of all the values in A
display(A.mean(0)) # mean of every column
display(A.mean(1)) # mean of every row
np.float64(4.4)
array([3.9, 6. , 4.5, 3.3, 4.3])
array([5.6, 4.6, 3.6, 4.6, 4.4, 3.2, 3. , 4.8, 6. , 4.2])
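The positional argument in A.mean(0) is the axis. Passing it by keyword makes the direction explicit: axis=0 collapses the rows and leaves one value per column, while axis=1 collapses the columns and leaves one value per row. A sketch on a small array with illustrative values:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

print(M.sum())          # 21       all values
print(M.sum(axis=0))    # [5 7 9]  one sum per column
print(M.mean(axis=1))   # [2. 5.]  one mean per row
print(M.max(axis=0))    # [4 5 6]  column maxima
print(M.min(axis=1))    # [1 4]    row minima
```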