FALLSEM2023-24 CSI3007 ETH VL2023240104352 2023-09-27 Reference-Material-I

numpy
February 28, 2022
1 NumPy
NumPy is the foundational package for scientific computing in Python.
It provides a high-performance multi-dimensional array object called ‘ndarray’, with which you can
perform a wide range of mathematical operations efficiently. That is, you write less code, yet the code runs fast.
1.1 What Problems Does NumPy Solve?

The main problem NumPy solves is the efficient storage of tabular and multi-dimensional data
and fast computation on it.
Drawback of storing the data in a Python list:
Lists are inefficient for numeric work. For example, suppose you have a list of numbers [1,2,3,4,5] and you want to
multiply each item in the list by 2.
[21]: [1, 2, 3, 4, 5] * 2
[21]: [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
If you multiply the list by 2, it simply repeats the list 2 times. This is not what we want. You need a loop or a list comprehension instead:
[22]: [i*2 for i in [1, 2, 3, 4, 5]]
[22]: [2, 4, 6, 8, 10]
Now imagine holding two-dimensional, tabular data and wanting to add or multiply two
columns of the dataset. You would have to write a for loop for something that basic.
Lists are not built for mathematical operations by design.
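As a small illustration of the two-column case, the sketch below adds two columns of a list-of-lists table with a comprehension and then does the same with a NumPy array. The table values and the names `table`, `sums_list` and `sums_np` are made up for this example.

```python
import numpy as np

# Hypothetical 3-row table with two numeric columns
table = [[1, 10],
         [2, 20],
         [3, 30]]

# Plain Python: visit every row to add column 0 and column 1
sums_list = [row[0] + row[1] for row in table]

# NumPy: select whole columns and add them in one vectorized expression
arr = np.array(table)
sums_np = arr[:, 0] + arr[:, 1]

print(sums_list)         # [11, 22, 33]
print(sums_np.tolist())  # [11, 22, 33]
```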
And that is where NumPy arrays come into the picture!
The core of the numpy package is the powerful n-dimensional array (‘ndarray’) object.
1. Vectorized math operations enable more work with less code.
2. Operations are performed in pre-compiled C code, enabling fast performance.
3. Capabilities for performing:
• Linear algebra
• Matrix operations
• Random number generation
• Probability distributions
• Sequence generation
• Data wrangling
• Fast Fourier transforms
• Financial functions (moved to the separate numpy-financial package as of NumPy 1.20)
• Handling dates
• Missing values
• Fitting polynomials
• Solving linear equations
1.1.1 Why Is NumPy Fast?

NumPy is written mostly in C. An array stores items of a single type in one contiguous block
of memory, which makes them fast to access and manipulate. This means
that you get performance close to that of C code with the ease of writing a Python program.
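The contiguous layout can be inspected directly. The following is a minimal sketch, assuming a small int64 array created just for this purpose; `flags`, `itemsize` and `strides` are standard ndarray attributes.

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

# True: the rows are laid out one after another in a single memory block
print(a.flags['C_CONTIGUOUS'])

# 8 bytes per int64 item
print(a.itemsize)

# (24, 8): bytes to skip to reach the next row / the next column
print(a.strides)
```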
1.2 Creating NumPy Arrays

1.2.1 What Is a NumPy Array?
NumPy arrays are data structures for storing multi-dimensional data. They are homogeneous and
perform vectorized operations by default.
1.2.2 Create NumPy array from a list
[23]: import numpy as np
[24]: L = [1, 2, 3, 4]
arr = np.array(L)
arr
[24]: array([1, 2, 3, 4])
Check Type
[25]: type(arr)
[25]: numpy.ndarray
Vectorized Multiplication works

[26]: # A list replicates
L * 2
[26]: [1, 2, 3, 4, 1, 2, 3, 4]
[27]: arr * 2
[27]: array([2, 4, 6, 8])
[28]: # possible with list comprehension or For loop
[2*i for i in L]
[28]: [2, 4, 6, 8]
Subset array
[29]: print(arr)
arr[1:3]
[1 2 3 4]
[29]: array([2, 3])
Negative indexing is supported

[30]: arr[-3:-1]
[30]: array([2, 3])
Reversing
[31]: arr[::-1]
[31]: array([4, 3, 2, 1])
Arrays are homogeneous

Creating an array from a list containing both numbers and characters converts the numbers to strings,
because NumPy arrays are homogeneous: every item has the same dtype.
[32]: L = [1, 2, "A", "B"]
arr = np.array(L)
arr
[32]: array(['1', '2', 'A', 'B'], dtype='<U21')
Check data type

[33]: arr.dtype
[33]: dtype('<U21')
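The same rule applies to other mixtures. As a hedged sketch (the variable names are illustrative), mixing integers with a float promotes everything to float, and explicitly requesting a dtype that cannot hold an item raises an error rather than converting silently.

```python
import numpy as np

ints_and_float = np.array([1, 2, 3.5])   # integers are promoted to float
ints_and_text = np.array([1, 2, "A"])    # numbers are converted to strings

print(ints_and_float.dtype)       # float64
print(ints_and_text.dtype.kind)   # U (Unicode string)

# Requesting an incompatible dtype fails instead of converting silently
conversion_failed = False
try:
    np.array([1, 2, "A"], dtype=np.int64)
except ValueError as exc:
    conversion_failed = True
    print("ValueError:", exc)
```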
Integer array
[34]: L = [1, 2, 3, 4]
arr = np.array(L)
arr.dtype
[34]: dtype('int64')
1.3 Two Dimensional Arrays
[35]: L2 = [[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
arr_L2 = np.array(L2)
[36]: arr_L2
arr_L2[:, 1:3]
[36]: array([[2, 3],
             [5, 6],
             [8, 9]])
Array Indexing
[37]: arr_L2[0, :]
[37]: array([1, 2, 3])
[38]: arr_L2[:, 1]
[38]: array([2, 5, 8])
Convert back to list

[39]: arr_L2.tolist()
[39]: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
1.4 Change DataType

[40]: # to string
arr_L2.astype('str')
[40]: array([['1', '2', '3'],
             ['4', '5', '6'],
             ['7', '8', '9']], dtype='<U21')
[41]: # to float
arr_L2.astype('float32')
[41]: array([[1., 2., 3.],
             [4., 5., 6.],
             [7., 8., 9.]], dtype=float32)
1.5 Math Operations
[42]: print(arr_L2)
# Multiply
arr_L2 * 2
[[1 2 3]
[4 5 6]
[7 8 9]]
[42]: array([[ 2,  4,  6],
             [ 8, 10, 12],
             [14, 16, 18]])
[43]: # Divide
quarter = arr_L2 / 4
quarter
[43]: array([[0.25, 0.5 , 0.75],
             [1.  , 1.25, 1.5 ],
             [1.75, 2.  , 2.25]])
[44]: # Subtract
arr_L2 - quarter
[44]: array([[0.75, 1.5 , 2.25],
             [3.  , 3.75, 4.5 ],
             [5.25, 6.  , 6.75]])
[45]: # Add
arr_L2 - quarter + arr_L2
[45]: array([[ 1.75,  3.5 ,  5.25],
             [ 7.  ,  8.75, 10.5 ],
             [12.25, 14.  , 15.75]])
1.6 Create Zeros and Ones Arrays

Zeros
[46]: np.zeros_like(arr_L2)
[46]: array([[0, 0, 0],
             [0, 0, 0],
             [0, 0, 0]])
[47]: np.zeros((3,3))
[47]: array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
Ones
[48]: a= np.ones((3, 3))
print(a)
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[49]: np.ones_like(arr_L2)
[49]: array([[1, 1, 1],
             [1, 1, 1],
             [1, 1, 1]])
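[50]: # np.empty allocates memory without initializing it, so the contents are arbitrary
# (here it happens to show leftover ones)
np.empty((3,3))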
[50]: array([[1., 1., 1.],
             [1., 1., 1.],
             [1., 1., 1.]])
Diagonal
[51]: np.diag(arr_L2)
[51]: array([1, 5, 9])
1.7 Inspecting Arrays

Let’s first create one
[53]: L = [[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16],
[17,18,19,20]]
arr = np.array(L)
arr
[53]: array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
Shape of the array - Number of items in each dimension (rows, columns)
[54]: arr.shape
[54]: (5, 4)
ndim - Number of dimensions

[55]: arr.ndim
[55]: 2
Size - Total number of items

[56]: arr.size
[56]: 20
Datatype
[57]: arr.dtype
Create in another dtype.

[58]: arr_int64 = np.array(L, dtype='int64')
arr_int64
[58]: array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
1.8 Copy vs Reference

Assigning an array to a new name creates a reference to the same object, so a change made through
either name is visible through both. arr.copy() creates an independent array, so changing
the copy leaves the original unaffected.
[59]: arr_r = arr # reference
arr_c = arr.copy() # copy
[60]: arr_c
[60]: array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
[61]: arr_r
[61]: array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
[62]: arr
[62]: array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
Change value in arr_c

[63]: arr_c[0, 0] = 100
arr_c
[63]: array([[100, 2, 3, 4],

[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[ 13, 14, 15, 16],
[ 17, 18, 19, 20]])
arr remains unaffected.

[64]: arr
[64]: array([[ 1, 2, 3, 4],

[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
Let’s try the same on arr_r. Changes to arr_r will affect arr.
[65]: arr_r[0, 0] = 100
arr_r
[65]: array([[100, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[ 13, 14, 15, 16],
[ 17, 18, 19, 20]])
[66]: arr
[66]: array([[100, 2, 3, 4],

[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[ 13, 14, 15, 16],
[ 17, 18, 19, 20]])
Check the id of objects

arr_r is actually arr.
[67]: print(id(arr))
print(id(arr_c))
print(id(arr_r))
140420946102576
140420946289936
140420946102576
[68]: arr_r is arr
[68]: True
[69]: arr_c is arr
[69]: False
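A related case is slicing. A basic slice is a new Python object, so the `is` check returns False, yet it still shares its data with the original. The sketch below, using a small array built just for illustration, shows this with `np.shares_memory`.

```python
import numpy as np

arr = np.arange(1, 21).reshape(5, 4)

view = arr[:2, :]          # basic slicing returns a view on the same data
copy = arr[:2, :].copy()   # .copy() allocates independent storage

print(view is arr)                  # False: a different object...
print(np.shares_memory(arr, view))  # True: ...backed by the same memory
print(np.shares_memory(arr, copy))  # False

view[0, 0] = 100           # writes through to arr
copy[0, 1] = -1            # arr is not affected
print(arr[0, :2].tolist())  # [100, 2]
```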
1.9 Why Does the Datatype Matter?

NumPy provides different datatypes to hold various forms of data. You can explicitly control which
datatype holds your data.
[70]: arr = np.array([1,2,3,4])
arr
[70]: array([1, 2, 3, 4])
[71]: arr.dtype
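By default, NumPy assigns the platform's default integer type. In this notebook that is int64 (see
the dtype('int64') reported earlier), so each item of this array consumes 64 bits = 64/8 = 8 bytes
of memory and the four items occupy 32 bytes in total. On platforms where the default is int32,
each item would take 4 bytes.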
[72]: arr.nbytes
[72]: 32
That means there is a certain maximum and minimum value an integer type can hold. For int32, for example:
[73]: np.iinfo('int32')
[73]: iinfo(min=-2147483648, max=2147483647, dtype=int32)
But you might not need that much range. If this variable is supposed to represent the month of the year, the
max value needed is just 12. In such a case, int8 would be sufficient to handle this data, freeing up
memory for much needed computations.
So, when creating the variable, explicitly mention the datatype. This will matter more when the
data size gets larger.
[74]: arr = np.array([1,2,3,4], dtype=np.int8)
arr
[74]: array([1, 2, 3, 4], dtype=int8)
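To make the saving concrete, here is a minimal sketch using the month-of-year example from the text. The variable names are illustrative.

```python
import numpy as np

months = list(range(1, 13))   # 1..12

as_int64 = np.array(months, dtype=np.int64)
as_int8 = np.array(months, dtype=np.int8)

print(as_int64.nbytes)  # 96 = 12 items * 8 bytes
print(as_int8.nbytes)   # 12 = 12 items * 1 byte
print(np.array_equal(as_int64, as_int8))  # True: the values are identical
```

The ratio stays the same at any size, so an array of a million such values would need roughly 8 MB as int64 but about 1 MB as int8.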
1.10 Supported Data Types

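The primary datatypes supported by numpy are as follows. Note that the aliases np.int, np.float,
np.bool, np.object and np.str used in this cell were deprecated in NumPy 1.20, as the warnings in
the output show, and removed in NumPy 1.24. Use the Python builtins (int, float, ...) or sized
types such as np.int64 instead.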
[75]: np.int # integer
np.uint # unsigned integer
np.float # float
np.bool # boolean
np.object # python object
np.str # string
/tmp/ipykernel_44644/1652948259.py:1: DeprecationWarning: `np.int` is a
deprecated alias for the builtin `int`. To silence this warning, use `int` by
itself. Doing this will not modify any behavior and is safe. When replacing
`np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the
precision. If you wish to review your current use, check the release note link
for additional information.
Deprecated in NumPy 1.20; for more details and guidance:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
np.int # integer
/tmp/ipykernel_44644/1652948259.py:3: DeprecationWarning: `np.float` is a
deprecated alias for the builtin `float`. To silence this warning, use `float`
by itself. Doing this will not modify any behavior and is safe. If you
specifically wanted the numpy scalar type, use `np.float64` here.
np.float # float
/tmp/ipykernel_44644/1652948259.py:4: DeprecationWarning: `np.bool` is a
deprecated alias for the builtin `bool`. To silence this warning, use `bool` by
itself. Doing this will not modify any behavior and is safe. If you specifically
wanted the numpy scalar type, use `np.bool_` here.
np.bool # boolean
/tmp/ipykernel_44644/1652948259.py:5: DeprecationWarning: `np.object` is a
deprecated alias for the builtin `object`. To silence this warning, use `object`
by itself. Doing this will not modify any behavior and is safe.
np.object # python object
/tmp/ipykernel_44644/1652948259.py:6: DeprecationWarning: `np.str` is a
deprecated alias for the builtin `str`. To silence this warning, use `str` by
itself. Doing this will not modify any behavior and is safe. If you specifically
wanted the numpy scalar type, use `np.str_` here.
np.str # string
[75]: str
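Since the aliases above produce deprecation warnings, the following sketch lists one explicit NumPy scalar type per category. The pairing of each category with a specific type (for example int64 for "integer") is a choice made for this example, not the only option.

```python
import numpy as np

# Explicit NumPy scalar types, one per category listed above
dtype_examples = [("integer", np.int64),
                  ("unsigned integer", np.uint64),
                  ("float", np.float64),
                  ("boolean", np.bool_),
                  ("python object", np.object_),
                  ("string", np.str_)]

for name, scalar_type in dtype_examples:
    print(f"{name:17s} kind={np.dtype(scalar_type).kind}")

# The Python builtins are also accepted wherever a dtype is expected
from_builtin = np.array([1, 2, 3], dtype=float)
print(from_builtin.dtype)  # float64
```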
To find out the minimum and the maximum value a given integer type can store, use the np.iinfo
function.
[76]: # int
print("int8", np.iinfo(np.int8))
# unsigned int
print("uint8", np.iinfo(np.uint8))
int8 Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
uint8 Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------
Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------
Machine parameters for uint32
---------------------------------------------------------------
min = 0
max = 4294967295
---------------------------------------------------------------
Machine parameters for uint64
---------------------------------------------------------------
min = 0
max = 18446744073709551615
---------------------------------------------------------------
1.11 Creating Arrays That Mix Numbers, Characters, and Arbitrary Python Objects
[77]: arr = np.array(['a', 'b', 'c', 1], dtype='object')
arr.dtype
[77]: dtype('O')
Other Python objects can go into the list as well.

[78]: arr = np.array(['a', 'b', 'c', 1, None, [21]], dtype='object')
arr.dtype
[78]: dtype('O')
[79]: arr
[79]: array(['a', 'b', 'c', 1, None, list([21])], dtype=object)
1.12 Exercise
1. Convert the following numpy array to optimal datatype (one that requires least space).
   import numpy as np
   arr = np.array([1,20,300,4000,50000])
   arr
2. Create a numpy array that contains the following tuples. What is the difference between the
two?
   T1 = [(1, 10, 10), (2,20), (3,30)]
   T2 = [(1, 10), (2,20), (3,30)]
Solution 1
[80]: arr = np.array([1,20,300,4000,50000])
arr
[80]: array([ 1, 20, 300, 4000, 50000])
[81]: print(np.iinfo('int8'))
print(np.iinfo('int16'))
print(np.iinfo('int32'))
Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------
[82]: # Loss of data: values above the int8 maximum of 127 wrap around
arr.astype(np.int8)
[82]: array([ 1, 20, 44, -96, 80], dtype=int8)
[83]: # Loss of data again: 50000 exceeds the int16 maximum of 32767
arr.astype(np.int16)
[83]: array([ 1, 20, 300, 4000, -15536], dtype=int16)
[84]: # Ok
arr.astype(np.int32)
[84]: array([ 1, 20, 300, 4000, 50000], dtype=int32)
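The solution above stops at int32, but since every value is non-negative, an unsigned type is also worth checking. The sketch below is one possible way to search for the smallest fitting integer type; the candidate order and the names `candidates`, `best` and `small` are choices made for this example.

```python
import numpy as np

arr = np.array([1, 20, 300, 4000, 50000])

# Try candidate types from smallest to largest and keep the first that fits
candidates = [np.uint8, np.int8, np.uint16, np.int16,
              np.uint32, np.int32, np.int64]
lo, hi = int(arr.min()), int(arr.max())
best = next(t for t in candidates
            if np.iinfo(t).min <= lo and hi <= np.iinfo(t).max)

small = arr.astype(best)
print(small.dtype, small.nbytes)   # uint16 10
print(np.array_equal(small, arr))  # True: no values were lost
```

Here uint16 (range 0 to 65535) holds all five values in 2 bytes each, half of what int32 needs.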
Solution 2
[85]: T1 = [(1, 10, 10), (2, 20), (3, 30)]
T2 = [(1, 10), (2, 20), (3, 30)]
[86]: a1 = np.array(T1, dtype='object')

a1
[86]: array([(1, 10, 10), (2, 20), (3, 30)], dtype=object)
[87]: # forms 1d array of tuples

a1.shape
[87]: (3,)
[88]: a2 = np.array(T2, dtype='object')

a2
[88]: array([[1, 10],

[2, 20],
[3, 30]], dtype=object)
[89]: # forms 2d array intuitively.

a2.shape
[89]: (3, 2)
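If you want T2 to stay a 1-D array of tuples, like a1, one option is to allocate the object array first and fill it slot by slot. This is a sketch of that workaround, not part of the original solution; the name `a2_1d` is made up here. Note also that recent NumPy versions (1.24 and later) raise a ValueError for ragged input such as T1 unless dtype=object is given explicitly.

```python
import numpy as np

T2 = [(1, 10), (2, 20), (3, 30)]

# Allocate a 1-D object array first, then store one tuple per slot
a2_1d = np.empty(len(T2), dtype=object)
for i, t in enumerate(T2):
    a2_1d[i] = t

print(a2_1d.shape)     # (3,)
print(type(a2_1d[0]))  # <class 'tuple'>
```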
1.13 Import and Export Data

Numpy provides useful functions to load data from an external file and save it as well.
2 Import Data
The main import methods are:
1. numpy.loadtxt()
2. numpy.genfromtxt()
Use np.loadtxt when there is no missing data.
[90]: data = np.loadtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/data.txt', delimiter="\t")
data
[90]: array([[ 1. , 87. , 57.54435],

[ 2. , 8. , 7.31704],
[ 3. , 56. , 56.82095],
[ 4. , 63. , 64.15579],
[ 5. , 2. , 5.74522],
[ 6. , 45. , 19.56758],
[ 7. , 43. , 39.62271],
[ 8. , 47. , 34.95107],
[ 9. , 2. , 9.38692],
[10. , 79. , 36.41022],
[11. , 67. , 49.83894],
[12. , 24. , 23.47974],
[13. , 61. , 72.55357],
[14. , 85. , 39.24693],
[15. , 63. , 53.6279 ],
[16. , 2. , 16.72441],
[17. , 29. , 37.25533],
[18. , 45. , 18.78498],
[19. , 33. , 19.8089 ],
[20. , 28. , 46.03384],
[21. , 21. , 23.7864 ],
[22. , 27. , 44.42627],
[23. , 65. , 34.94804],
[24. , 61. , 53.49576],
[25. , 10. , 25.98564]])
[91]: data[0:4, 0:2]
[91]: array([[ 1., 87.],
[ 2., 8.],
[ 3., 56.],
[ 4., 63.]])
[92]: type(data)
[92]: numpy.ndarray
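The dataset file used above is not included here, so the following sketch reproduces the call on an in-memory stand-in. The text content is made up; `usecols` and `skiprows` are existing `np.loadtxt` parameters shown for illustration.

```python
import io
import numpy as np

# Hypothetical tab-separated text standing in for a file on disk
text = "1\t87\t57.5\n2\t8\t7.25\n3\t56\t56.75\n"

data = np.loadtxt(io.StringIO(text), delimiter="\t")
print(data.shape)   # (3, 3)
print(data.dtype)   # float64

# Read only the second column and skip the first row
second_col = np.loadtxt(io.StringIO(text), delimiter="\t",
                        usecols=1, skiprows=1)
print(second_col.tolist())  # [8.0, 56.0]
```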
When there are missing values, np.loadtxt raises an error.

[93]: data = np.loadtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/data_miss.txt', delimiter="\t")
data
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_44644/82168248.py in <module>
----> 1 data = np.loadtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/data_miss.txt', delimiter="\t")
2 data
~/.conda/envs/ml/lib/python3.9/site-packages/numpy/lib/npyio.py in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, like)
1144 # converting the data

1145 X = None
-> 1146 for x in read_data(_loadtxt_chunksize):
1147 if X is None:
1148 X = np.array(x, dtype)
~/.conda/envs/ml/lib/python3.9/site-packages/numpy/lib/npyio.py in read_data(chunk_size)
995
996 # Convert each value according to its column and store
--> 997 items = [conv(val) for (conv, val) in zip(converters, vals)]
998
999 # Then pack it according to the dtype's nesting
~/.conda/envs/ml/lib/python3.9/site-packages/numpy/lib/npyio.py in <listcomp>(.0)
995
996 # Convert each value according to its column and store
--> 997 items = [conv(val) for (conv, val) in zip(converters, vals)]
998
999 # Then pack it according to the dtype's nesting
~/.conda/envs/ml/lib/python3.9/site-packages/numpy/lib/npyio.py in floatconv(x)
732 if '0x' in x:
733 return float.fromhex(x)
--> 734 return float(x)
735
736 typ = dtype.type
ValueError: could not convert string to float: ''
In such situations, use np.genfromtxt(). It fills in missing data with nan.
[94]: data = np.genfromtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/data_miss.

data
[94]: array([[ 1. , 87. , 57.54435],

[ 2. , 8. , 7.31704],
[ 3. , 56. , 56.82095],
[ 4. , 63. , 64.15579],
[ 5. , 2. , 5.74522],
[ 6. , 45. , 19.56758],
[ 7. , 43. , 39.62271],
[ 8. , 47. , 34.95107],
[ 9. , 2. , nan],
[10. , 79. , 36.41022],
[11. , 67. , 49.83894],
[12. , 24. , inf],
[13. , 61. , 72.55357],
[14. , 85. , 39.24693],
[15. , 63. , 53.6279 ],
[16. , 2. , 16.72441],
[17. , 29. , nan],
[18. , 45. , 18.78498],
[19. , 33. , 19.8089 ],
[20. , 28. , 46.03384],
[21. , 21. , 23.7864 ],
[22. , 27. , 44.42627],
[23. , 65. , 34.94804],
[24. , 61. , 53.49576],
[25. , 10. , 25.98564]])
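As a self-contained illustration of this behaviour, the sketch below parses an in-memory string with one empty field. The text is made up, and `filling_values` is an existing `np.genfromtxt` parameter that substitutes a chosen value for missing fields.

```python
import io
import numpy as np

# Hypothetical data: the second row has an empty middle field
text = "1\t87\t57.5\n2\t\t7.25\n3\t56\t56.75\n"

data = np.genfromtxt(io.StringIO(text), delimiter="\t")
print(np.isnan(data[1, 1]))   # True: the empty field became nan

# Replace missing fields while reading instead of afterwards
filled = np.genfromtxt(io.StringIO(text), delimiter="\t", filling_values=0.0)
print(filled[1, 1])           # 0.0
```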
2.1 CSV File

Loading a CSV file with column names.
By default, np.genfromtxt uses dtype ‘float’, so text fields (including the header row) cannot be converted and come back as nan.
[95]: data = np.genfromtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/Mall_Customers.csv', delimiter=",")
data
[95]: array([[ nan, nan, nan, nan, nan],
[ 1., nan, 19., 15., 39.],
[ 2., nan, 21., 15., 81.],
…,
[198., nan, 32., 126., 74.],
[199., nan, 32., 137., 18.],
[200., nan, 30., 137., 83.]])
So, explicitly mention the datatype.

[96]: # Change dtype and skip header
data = np.genfromtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/Mall_Customers.csv',
                     delimiter=",",
                     dtype='object',
                     skip_header=1)
data[:5, :]
[96]: array([[b'1', b'Male', b'19', b'15', b'39'],

[b'2', b'Male', b'21', b'15', b'81'],
[b'3', b'Female', b'20', b'16', b'6'],
[b'4', b'Female', b'23', b'16', b'77'],
[b'5', b'Female', b'31', b'17', b'40']], dtype=object)
The problem with this is that the numbers are read as bytes objects rather than as numbers, so
arithmetic on them fails.
[97]: # Divide 3rd col by 2nd col. ERROR!
data[:, 3] / data[:, 2]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_44644/244581620.py in <module>
1 # Divide 3rd col by 2nd col. ERROR!
----> 2 data[:, 3] / data[:, 2]
TypeError: unsupported operand type(s) for /: 'bytes' and 'bytes'
Convert to float and then divide. Works!

[98]: output = data[:, 3].astype('float') / data[:, 2].astype('float')
output[:10]
[98]: array([0.78947368, 0.71428571, 0.8 , 0.69565217, 0.5483871 ,

0.77272727, 0.51428571, 0.7826087 , 0.296875 , 0.63333333])
[99]: dt = np.dtype({'names': ["CustomerID", "Genre", "Age", "Annual_Income", "Spending_Score"],
                     'formats': [np.int16, 'U16', np.int16, np.int16, np.int16]})
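[100]: # Apply the structured dtype. skip_header=0 keeps the header row, which is parsed as (-1, 'Genre', -1, -1, -1)
data = np.genfromtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/Mall_Customers.csv',
                     delimiter=",",
                     dtype=dt,
                     skip_header=0)
data[:15]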
[100]: array([(-1, 'Genre', -1, -1, -1), ( 1, 'Male', 19, 15, 39),
( 2, 'Male', 21, 15, 81), ( 3, 'Female', 20, 16, 6),
( 4, 'Female', 23, 16, 77), ( 5, 'Female', 31, 17, 40),
( 6, 'Female', 22, 17, 76), ( 7, 'Female', 35, 18, 6),
( 8, 'Female', 23, 18, 94), ( 9, 'Male', 64, 19, 3),
(10, 'Female', 30, 19, 72), (11, 'Male', 67, 19, 14),
(12, 'Female', 35, 19, 99), (13, 'Female', 58, 20, 15),
(14, 'Female', 24, 20, 77)],
dtype=[('CustomerID', '<i2'), ('Genre', '<U16'), ('Age', '<i2'),
('Annual_Income', '<i2'), ('Spending_Score', '<i2')])
[101]: data.shape
[101]: (201,)
[102]: data[0]['Age']
[102]: -1
[103]: data[0]['Genre']
[103]: 'Genre'
[104]: data['Age']
[104]: array([-1, 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35, 58, 24, 37, 22,
35, 20, 52, 35, 35, 25, 46, 31, 54, 29, 45, 35, 40, 23, 60, 21, 53,
18, 49, 21, 42, 30, 36, 20, 65, 24, 48, 31, 49, 24, 50, 27, 29, 31,
49, 33, 31, 59, 50, 47, 51, 69, 27, 53, 70, 19, 67, 54, 63, 18, 43,
68, 19, 32, 70, 47, 60, 60, 59, 26, 45, 40, 23, 49, 57, 38, 67, 46,
21, 48, 55, 22, 34, 50, 68, 18, 48, 40, 32, 24, 47, 27, 48, 20, 23,
49, 67, 26, 49, 21, 66, 54, 68, 66, 65, 19, 38, 19, 18, 19, 63, 49,
51, 50, 27, 38, 40, 39, 23, 31, 43, 40, 59, 38, 47, 39, 25, 31, 20,
29, 44, 32, 19, 35, 57, 32, 28, 32, 25, 28, 48, 32, 34, 34, 43, 39,
44, 38, 47, 27, 37, 30, 34, 30, 56, 29, 19, 31, 50, 36, 42, 33, 36,
32, 40, 28, 36, 36, 52, 30, 58, 27, 59, 35, 37, 32, 46, 29, 41, 30,
54, 28, 41, 36, 34, 32, 33, 38, 47, 35, 45, 32, 32, 30],
dtype=int16)
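In the run above the header row was read as data, which is why the first record is (-1, 'Genre', -1, -1, -1). The sketch below uses the same structured dtype on a few made-up rows and skips the header with skip_header=1. The in-memory CSV text and the explicit encoding="utf-8" are assumptions made for this example.

```python
import io
import numpy as np

# Hypothetical rows in the same layout as Mall_Customers.csv
csv_text = (
    "CustomerID,Genre,Age,Annual_Income,Spending_Score\n"
    "1,Male,19,15,39\n"
    "2,Male,21,15,81\n"
    "3,Female,20,16,6\n"
)

dt = np.dtype({'names': ["CustomerID", "Genre", "Age",
                         "Annual_Income", "Spending_Score"],
               'formats': [np.int16, 'U16', np.int16, np.int16, np.int16]})

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=dt,
                     skip_header=1, encoding="utf-8")

print(data.shape)            # (3,): one record per data row, no header record
print(data['Age'].tolist())  # [19, 21, 20]
print(data['Genre'][2])      # Female
```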
2.2 Export Data and Load it back

If it’s a single array, save it in .npy format. If you have multiple arrays to save in same file, use
.npz format.
[105]: # Store the arrays to disk
# Single array
np.save('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/TEMP/output.npy', output)
# Multiple arrays: arrays will be saved with names "arr_0", "arr_1", ...
np.savez('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/TEMP/outputs.npz', output, data)
Load it back
[106]: # Single array
a = np.load('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/TEMP/output.npy')
a[:5]
[106]: array([0.78947368, 0.71428571, 0.8 , 0.69565217, 0.5483871 ])
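allow_pickle=True is required only when the file contains object arrays (dtype=object), which NumPy
stores using pickle. It is not needed for ordinary multidimensional arrays. Only load pickled data
from sources you trust.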

[107]: # Multiple arrays
b = np.load('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/TEMP/outputs.npz', allow_pickle=True)
b
[107]: <numpy.lib.npyio.NpzFile at 0x7fb64c9b9400>
See the arrays stored in it.

[108]: b.files
[108]: ['arr_0', 'arr_1']
[109]: b['arr_0'][:5]
[109]: array([0.78947368, 0.71428571, 0.8 , 0.69565217, 0.5483871 ])
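The default names "arr_0" and "arr_1" are easy to mix up. As a sketch under the assumption of two small made-up arrays and a temporary directory, `np.savez` also accepts keyword arguments, which become the names stored in the archive.

```python
import os
import tempfile
import numpy as np

ratios = np.array([0.79, 0.71, 0.80])
ages = np.array([19, 21, 20], dtype=np.int16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outputs.npz")

    # Keyword arguments become the names stored in the archive
    np.savez(path, ratios=ratios, ages=ages)

    # No allow_pickle needed: neither array has dtype=object
    with np.load(path) as archive:
        names = sorted(archive.files)
        ages_back = archive["ages"]

print(names)               # ['ages', 'ratios']
print(ages_back.tolist())  # [19, 21, 20]
```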
[110]: b['arr_1']
[110]: array([( -1, 'Genre', -1, -1, -1), ( 1, 'Male', 19, 15, 39),
( 2, 'Male', 21, 15, 81), ( 3, 'Female', 20, 16, 6),
( 4, 'Female', 23, 16, 77), ( 5, 'Female', 31, 17, 40),
( 6, 'Female', 22, 17, 76), ( 7, 'Female', 35, 18, 6),
( 8, 'Female', 23, 18, 94), ( 9, 'Male', 64, 19, 3),
( 10, 'Female', 30, 19, 72), ( 11, 'Male', 67, 19, 14),
( 12, 'Female', 35, 19, 99), ( 13, 'Female', 58, 20, 15),
( 14, 'Female', 24, 20, 77), ( 15, 'Male', 37, 20, 13),
( 16, 'Male', 22, 20, 79), ( 17, 'Female', 35, 21, 35),
( 18, 'Male', 20, 21, 66), ( 19, 'Male', 52, 23, 29),
( 20, 'Female', 35, 23, 98), ( 21, 'Male', 35, 24, 35),
( 22, 'Male', 25, 24, 73), ( 23, 'Female', 46, 25, 5),
( 24, 'Male', 31, 25, 73), ( 25, 'Female', 54, 28, 14),
( 26, 'Male', 29, 28, 82), ( 27, 'Female', 45, 28, 32),
( 28, 'Male', 35, 28, 61), ( 29, 'Female', 40, 29, 31),
( 30, 'Female', 23, 29, 87), ( 31, 'Male', 60, 30, 4),
( 32, 'Female', 21, 30, 73), ( 33, 'Male', 53, 33, 4),
( 34, 'Male', 18, 33, 92), ( 35, 'Female', 49, 33, 14),
( 36, 'Female', 21, 33, 81), ( 37, 'Female', 42, 34, 17),
( 38, 'Female', 30, 34, 73), ( 39, 'Female', 36, 37, 26),
( 40, 'Female', 20, 37, 75), ( 41, 'Female', 65, 38, 35),
( 42, 'Male', 24, 38, 92), ( 43, 'Male', 48, 39, 36),
( 44, 'Female', 31, 39, 61), ( 45, 'Female', 49, 39, 28),
( 46, 'Female', 24, 39, 65), ( 47, 'Female', 50, 40, 55),
( 48, 'Female', 27, 40, 47), ( 49, 'Female', 29, 40, 42),
( 50, 'Female', 31, 40, 42), ( 51, 'Female', 49, 42, 52),
( 52, 'Male', 33, 42, 60), ( 53, 'Female', 31, 43, 54),
( 54, 'Male', 59, 43, 60), ( 55, 'Female', 50, 43, 45),
( 56, 'Male', 47, 43, 41), ( 57, 'Female', 51, 44, 50),
( 58, 'Male', 69, 44, 46), ( 59, 'Female', 27, 46, 51),
( 60, 'Male', 53, 46, 46), ( 61, 'Male', 70, 46, 56),
( 62, 'Male', 19, 46, 55), ( 63, 'Female', 67, 47, 52),
( 64, 'Female', 54, 47, 59), ( 65, 'Male', 63, 48, 51),
( 66, 'Male', 18, 48, 59), ( 67, 'Female', 43, 48, 50),
( 68, 'Female', 68, 48, 48), ( 69, 'Male', 19, 48, 59),
( 70, 'Female', 32, 48, 47), ( 71, 'Male', 70, 49, 55),
( 72, 'Female', 47, 49, 42), ( 73, 'Female', 60, 50, 49),
( 74, 'Female', 60, 50, 56), ( 75, 'Male', 59, 54, 47),
( 76, 'Male', 26, 54, 54), ( 77, 'Female', 45, 54, 53),
( 78, 'Male', 40, 54, 48), ( 79, 'Female', 23, 54, 52),
( 80, 'Female', 49, 54, 42), ( 81, 'Male', 57, 54, 51),
( 82, 'Male', 38, 54, 55), ( 83, 'Male', 67, 54, 41),
( 84, 'Female', 46, 54, 44), ( 85, 'Female', 21, 54, 57),
( 86, 'Male', 48, 54, 46), ( 87, 'Female', 55, 57, 58),
( 88, 'Female', 22, 57, 55), ( 89, 'Female', 34, 58, 60),
( 90, 'Female', 50, 58, 46), ( 91, 'Female', 68, 59, 55),
( 92, 'Male', 18, 59, 41), ( 93, 'Male', 48, 60, 49),
( 94, 'Female', 40, 60, 40), ( 95, 'Female', 32, 60, 42),
( 96, 'Male', 24, 60, 52), ( 97, 'Female', 47, 60, 47),
( 98, 'Female', 27, 60, 50), ( 99, 'Male', 48, 61, 42),
(100, 'Male', 20, 61, 49), (101, 'Female', 23, 62, 41),
(102, 'Female', 49, 62, 48), (103, 'Male', 67, 62, 59),
(104, 'Male', 26, 62, 55), (105, 'Male', 49, 62, 56),
(106, 'Female', 21, 62, 42), (107, 'Female', 66, 63, 50),
(108, 'Male', 54, 63, 46), (109, 'Male', 68, 63, 43),
(110, 'Male', 66, 63, 48), (111, 'Male', 65, 63, 52),
(112, 'Female', 19, 63, 54), (113, 'Female', 38, 64, 42),
(114, 'Male', 19, 64, 46), (115, 'Female', 18, 65, 48),
(116, 'Female', 19, 65, 50), (117, 'Female', 63, 65, 43),
(118, 'Female', 49, 65, 59), (119, 'Female', 51, 67, 43),
(120, 'Female', 50, 67, 57), (121, 'Male', 27, 67, 56),
(122, 'Female', 38, 67, 40), (123, 'Female', 40, 69, 58),
(124, 'Male', 39, 69, 91), (125, 'Female', 23, 70, 29),
(126, 'Female', 31, 70, 77), (127, 'Male', 43, 71, 35),
(128, 'Male', 40, 71, 95), (129, 'Male', 59, 71, 11),
(130, 'Male', 38, 71, 75), (131, 'Male', 47, 71, 9),
(132, 'Male', 39, 71, 75), (133, 'Female', 25, 72, 34),
(134, 'Female', 31, 72, 71), (135, 'Male', 20, 73, 5),
(136, 'Female', 29, 73, 88), (137, 'Female', 44, 73, 7),
(138, 'Male', 32, 73, 73), (139, 'Male', 19, 74, 10),
(140, 'Female', 35, 74, 72), (141, 'Female', 57, 75, 5),
(142, 'Male', 32, 75, 93), (143, 'Female', 28, 76, 40),
(144, 'Female', 32, 76, 87), (145, 'Male', 25, 77, 12),
(146, 'Male', 28, 77, 97), (147, 'Male', 48, 77, 36),
(148, 'Female', 32, 77, 74), (149, 'Female', 34, 78, 22),
(150, 'Male', 34, 78, 90), (151, 'Male', 43, 78, 17),
(152, 'Male', 39, 78, 88), (153, 'Female', 44, 78, 20),
(154, 'Female', 38, 78, 76), (155, 'Female', 47, 78, 16),
(156, 'Female', 27, 78, 89), (157, 'Male', 37, 78, 1),
(158, 'Female', 30, 78, 78), (159, 'Male', 34, 78, 1),
(160, 'Female', 30, 78, 73), (161, 'Female', 56, 79, 35),
(162, 'Female', 29, 79, 83), (163, 'Male', 19, 81, 5),
(164, 'Female', 31, 81, 93), (165, 'Male', 50, 85, 26),
(166, 'Female', 36, 85, 75), (167, 'Male', 42, 86, 20),
(168, 'Female', 33, 86, 95), (169, 'Female', 36, 87, 27),
(170, 'Male', 32, 87, 63), (171, 'Male', 40, 87, 13),
(172, 'Male', 28, 87, 75), (173, 'Male', 36, 87, 10),
(174, 'Male', 36, 87, 92), (175, 'Female', 52, 88, 13),
(176, 'Female', 30, 88, 86), (177, 'Male', 58, 88, 15),
(178, 'Male', 27, 88, 69), (179, 'Male', 59, 93, 14),
(180, 'Male', 35, 93, 90), (181, 'Female', 37, 97, 32),
(182, 'Female', 32, 97, 86), (183, 'Male', 46, 98, 15),
(184, 'Female', 29, 98, 88), (185, 'Female', 41, 99, 39),
(186, 'Male', 30, 99, 97), (187, 'Female', 54, 101, 24),
(188, 'Male', 28, 101, 68), (189, 'Female', 41, 103, 17),
(190, 'Female', 36, 103, 85), (191, 'Female', 34, 103, 23),
(192, 'Female', 32, 103, 69), (193, 'Male', 33, 113, 8),
(194, 'Female', 38, 113, 91), (195, 'Female', 47, 120, 16),
(196, 'Female', 35, 120, 79), (197, 'Female', 45, 126, 28),
(198, 'Male', 32, 126, 74), (199, 'Male', 32, 137, 18),
(200, 'Male', 30, 137, 83)],
dtype=[('CustomerID', '<i2'), ('Genre', '<U16'), ('Age', '<i2'),
('Annual_Income', '<i2'), ('Spending_Score', '<i2')])
Missing Data
In plain Python, a missing value is usually represented as None.
NumPy uses a floating-point marker for missing numeric values instead: np.nan.
Infinity is represented as np.inf.
[112]: np.nan
[112]: nan
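Take care when comparing with missing values (np.nan): nan is never equal to anything, including
itself. The `in` and `is` checks below return True only because they test object identity.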
[113]: np.nan == np.nan
[113]: False
[114]: np.nan in [np.nan]
[114]: True
[115]: np.nan is np.nan
[115]: True
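Because equality never matches nan, arrays are tested with `np.isnan` instead. The sketch below, on a small made-up array, also shows that ordinary reductions propagate nan while the nan-aware variants such as `np.nanmean` skip it.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

print((values == np.nan).any())   # False: == never matches nan
print(np.isnan(values).tolist())  # [False, True, False]

print(np.isnan(values.mean()))    # True: nan propagates through mean()
print(np.nanmean(values))         # 2.0: nan entries are ignored
```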
2.3 Import Data with missing value

[122]: data = np.genfromtxt('/home/jaisakthi/JS/CSI3007_adv_python/Datasets/data_miss.
data
[122]: array([[ 1. , 87. , 57.54435],

[ 2. , 8. , 7.31704],
[ 3. , 56. , 56.82095],
[ 4. , 63. , 64.15579],
[ 5. , 2. , 5.74522],
[ 6. , 45. , 19.56758],
[ 7. , 43. , 39.62271],
[ 8. , 47. , 34.95107],
[ 9. , 2. , nan],
[10. , 79. , 36.41022],
[11. , 67. , 49.83894],
[12. , 24. , inf],
[13. , 61. , 72.55357],
[14. , 85. , 39.24693],
[15. , 63. , 53.6279 ],
[16. , 2. , 16.72441],
[17. , 29. , nan],
[18. , 45. , 18.78498],
[19. , 33. , 19.8089 ],
[20. , 28. , 46.03384],
[21. , 21. , 23.7864 ],
[22. , 27. , 44.42627],
[23. , 65. , 34.94804],
[24. , 61. , 53.49576],
[25. , 10. , 25.98564]])
2.4 Check for missing data in array

[123]: np.isnan(data)
[123]: array([[False, False, False],

[False, False, False],
[False, False, True],
[False, False, False]])
Check for infinity

[124]: np.isinf(data)

Missing or Infinity
[125]: np.isnan(data) | np.isinf(data)

Fill missing or infinite entries with some value

[126]: data[np.isnan(data) | np.isinf(data)] = 0
[127]: data
[127]: array([[ 1. , 87. , 57.54435],

[ 2. , 8. , 7.31704],
[ 3. , 56. , 56.82095],
[ 4. , 63. , 64.15579],
[ 5. , 2. , 5.74522],
[ 6. , 45. , 19.56758],
[ 7. , 43. , 39.62271],
[ 8. , 47. , 34.95107],
[ 9. , 2. , 0. ],
[10. , 79. , 36.41022],
[11. , 67. , 49.83894],
[12. , 24. , 0. ],
[13. , 61. , 72.55357],
[14. , 85. , 39.24693],
[15. , 63. , 53.6279 ],
[16. , 2. , 16.72441],
[17. , 29. , 0. ],
[18. , 45. , 18.78498],
[19. , 33. , 19.8089 ],
[20. , 28. , 46.03384],
[21. , 21. , 23.7864 ],
[22. , 27. , 44.42627],
[23. , 65. , 34.94804],
[24. , 61. , 53.49576],
[25. , 10. , 25.98564]])
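Filling with 0 can distort later statistics. One alternative, sketched here on a small made-up array, is to fill each bad entry with the mean of the finite values in its column. This is an illustrative choice, not the method used above.

```python
import numpy as np

data = np.array([[1.0, 87.0, 57.5],
                 [2.0,  2.0, np.nan],
                 [3.0, 24.0, np.inf],
                 [4.0, 61.0, 72.5]])

# True where the entry is nan or +/-inf
bad = ~np.isfinite(data)

# Column means computed from the finite entries only
col_means = np.nanmean(np.where(bad, np.nan, data), axis=0)

# col_means has one value per column and is broadcast across the rows
filled = np.where(bad, col_means, data)

print(filled[:, 2].tolist())  # [57.5, 65.0, 65.0, 72.5]
```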
2.5 Extract Specific Items

2.6 Import Data from csv file
,→Mall_Customers_Int.csv',
delimiter=",",
skip_header=1)
[156]: # CustomerID, Genre, Age, Annual_Income, Spending_Score

data
[156]: array([[ 1., 1., 19., 15., 39.],

[ 2., 1., 21., 15., 81.],
[ 3., 0., 20., 16., 6.],
[ 4., 0., 23., 16., 77.],
[ 5., 0., 31., 17., 40.],
[ 6., 0., 22., 17., 76.],
[ 7., 0., 35., 18., 6.],
[ 8., 0., 23., 18., nan],
[ 9., 1., 64., 19., 3.],
[ 10., 0., 30., 19., 72.],
[ 11., 1., 67., 19., 14.],
[ 12., 0., 35., 19., 99.],
[ 13., 0., 58., nan, 15.],
[ 14., 0., 24., 20., 77.],
[ 15., 1., 37., 20., 13.],
[ 16., 1., 22., 20., 79.],
[ 17., 0., 35., 21., nan],
[ 18., 1., 20., 21., 66.],
[ 19., 1., nan, 23., 29.],
[ 20., 0., 35., 23., 98.],
[ 21., 1., 35., 24., 35.],
[ 22., 1., 25., 24., 73.],
[ 23., 0., 46., 25., 5.],
[ 24., 1., 31., 25., 73.],
[ 25., 0., 54., 28., 14.],
[ 26., 1., 29., 28., 82.],
[ 27., 0., 45., nan, 32.],
[ 28., 1., 35., 28., 61.],
[ 29., 0., 40., 29., 31.],
[ 30., 0., 23., 29., 87.],
[ 31., 1., 60., 30., 4.],
[ 32., 0., 21., 30., 73.],
[ 33., 1., 53., 33., 4.],
[ 34., 1., 18., 33., 92.],
[ 35., 0., 49., 33., 14.],
[ 36., 0., 21., 33., 81.],
[ 37., 0., nan, 34., 17.],
[ 38., 0., 30., 34., 73.],
[ 39., 0., 36., 37., 26.],
[ 40., 0., 20., 37., 75.],
[ 41., 0., 65., 38., 35.],
[ 42., 1., 24., 38., 92.],
[ 43., 1., 48., 39., 36.],
[ 44., 0., 31., 39., 61.],
[ 45., 0., 49., 39., 28.],
[ 46., 0., 24., 39., 65.],
[ 47., 0., 50., 40., 55.],
[ 48., 0., 27., 40., 47.],
[ 49., 0., 29., 40., 42.],
[ 50., 0., 31., 40., 42.],
[ 51., 0., 49., 42., 52.],
[ 52., 1., 33., 42., 60.],
[ 53., 0., 31., 43., 54.],
[ 54., 1., 59., 43., 60.],
[ 55., 0., 50., 43., 45.],
[ 56., 1., 47., 43., 41.],
[ 57., 0., 51., nan, 50.],
[ 58., 1., 69., 44., 46.],
[ 59., 0., 27., 46., 51.],
[ 60., 1., 53., 46., 46.],
[ 61., 1., 70., 46., 56.],
[ 62., 1., 19., 46., 55.],
[ 63., 0., 67., 47., 52.],
[ 64., 0., 54., 47., 59.],
[ 65., 1., 63., 48., 51.],
[ 66., 1., 18., 48., 59.],
[ 67., 0., 43., 48., 50.],
[ 68., 0., 68., nan, 48.],
[ 69., 1., 19., 48., 59.],
[ 70., 0., 32., 48., 47.],
[ 71., 1., 70., 49., 55.],
[ 72., 0., 47., 49., 42.],
[ 73., 0., 60., 50., 49.],
[ 74., 0., 60., 50., 56.],
[ 75., 1., 59., 54., 47.],
[ 76., 1., 26., 54., 54.],
[ 77., 0., 45., 54., 53.],
[ 78., 1., 40., 54., 48.],
[ 79., 0., 23., 54., 52.],
[ 80., 0., 49., 54., 42.],
[ 81., 1., 57., 54., 51.],
[ 82., 1., 38., 54., 55.],
[ 83., 1., 67., 54., 41.],
[ 84., 0., 46., 54., 44.],
[ 85., 0., 21., 54., 57.],
[ 86., 1., 48., 54., 46.],
[ 87., 0., 55., 57., 58.],
[ 88., 0., 22., 57., 55.],
[ 89., 0., 34., 58., 60.],
[ 90., 0., 50., 58., 46.],
[ 91., 0., 68., 59., 55.],
[ 92., 1., 18., 59., 41.],
[ 93., 1., 48., 60., 49.],
[ 94., 0., 40., 60., 40.],
[ 95., 0., 32., 60., 42.],
[ 96., 1., 24., 60., 52.],
[ 97., 0., 47., 60., 47.],
[ 98., 0., 27., 60., 50.],
[ 99., 1., 48., 61., 42.],
[100., 1., 20., 61., 49.],
[101., 0., 23., 62., 41.],
[102., 0., 49., 62., 48.],
[103., 1., 67., 62., 59.],
[104., 1., 26., 62., 55.],
[105., 1., 49., 62., 56.],
[106., 0., 21., 62., 42.],
[107., 0., 66., 63., 50.],
[108., 1., 54., 63., 46.],
[109., 1., 68., 63., 43.],
[110., 1., 66., 63., 48.],
[111., 1., 65., 63., 52.],
[112., 0., 19., 63., 54.],
[113., 0., 38., 64., 42.],
[114., 1., 19., 64., 46.],
[115., 0., 18., 65., 48.],
[116., 0., 19., 65., 50.],
[117., 0., 63., 65., 43.],
[118., 0., 49., 65., 59.],
[119., 0., 51., 67., 43.],
[120., 0., 50., 67., 57.],
[121., 1., 27., 67., 56.],
[122., 0., 38., 67., 40.],
[123., 0., 40., 69., 58.],
[124., 1., 39., 69., 91.],
[125., 0., 23., 70., 29.],
[126., 0., 31., 70., 77.],
[127., 1., 43., 71., 35.],
[128., 1., 40., 71., 95.],
[129., 1., 59., 71., 11.],
[130., 1., 38., 71., 75.],
[131., 1., 47., 71., 9.],
[132., 1., 39., 71., 75.],
[133., 0., 25., 72., 34.],
[134., 0., 31., 72., 71.],
[135., 1., 20., 73., 5.],
[136., 0., 29., 73., 88.],
[137., 0., 44., 73., 7.],
[138., 1., 32., 73., 73.],
[139., 1., 19., 74., 10.],
[140., 0., 35., 74., 72.],
[141., 0., 57., 75., 5.],
[142., 1., 32., 75., 93.],
[143., 0., 28., 76., 40.],
[144., 0., 32., 76., 87.],
[145., 1., 25., 77., 12.],
[146., 1., 28., 77., 97.],
[147., 1., 48., 77., 36.],
[148., 0., 32., 77., 74.],
[149., 0., 34., 78., 22.],
[150., 1., 34., 78., 90.],
[151., 1., 43., 78., 17.],
[152., 1., 39., 78., 88.],
[153., 0., 44., 78., 20.],
[154., 0., 38., 78., 76.],
[155., 0., 47., 78., 16.],
[156., 0., 27., 78., 89.],
[157., 1., 37., 78., 1.],
[158., 0., 30., 78., 78.],
[159., 1., 34., 78., 1.],
[160., 0., 30., 78., 73.],
[161., 0., 56., 79., 35.],
[162., 0., 29., 79., 83.],
[163., 1., 19., 81., 5.],
[164., 0., 31., 81., 93.],
[165., 1., 50., 85., 26.],
[166., 0., 36., 85., 75.],
[167., 1., 42., 86., 20.],
[168., 0., 33., 86., 95.],
[169., 0., 36., 87., 27.],
[170., 1., 32., 87., 63.],
[171., 1., 40., 87., 13.],
[172., 1., 28., 87., 75.],
[173., 1., 36., 87., 10.],
[174., 1., 36., 87., 92.],
[175., 0., 52., 88., 13.],
[176., 0., 30., 88., 86.],
[177., 1., 58., 88., 15.],
[178., 1., 27., 88., 69.],
[179., 1., 59., 93., 14.],
[180., 1., 35., 93., 90.],
[181., 0., 37., 97., 32.],
[182., 0., 32., 97., 86.],
[183., 1., 46., 98., 15.],
[184., 0., 29., 98., 88.],
[185., 0., 41., 99., 39.],
[186., 1., 30., 99., 97.],
[187., 0., 54., 101., 24.],
[188., 1., 28., 101., 68.],
[189., 0., 41., 103., 17.],
[190., 0., 36., 103., 85.],
[191., 0., 34., 103., 23.],
[192., 0., 32., 103., 69.],
[193., 1., 33., 113., 8.],
[194., 0., 38., 113., 91.],
[195., 0., 47., 120., 16.],
[196., 0., 35., 120., 79.],
[197., 0., 45., 126., 28.],
[198., 1., 32., 126., 74.],
[199., 1., 32., 137., 18.],
[200., 1., 30., 137., 83.]])
[157]: data.shape
[157]: (200, 5)
Filter rows where the second column equals 1.

Create a boolean mask and use it as the row index (the same idea works for the column index).
[131]: mask = data[:, 1] == 1

mask[:10] # first 10
[131]: array([ True, True, False, False, False, False, False, False, True,
False])
[132]: data[mask, :]
[132]: array([[ 1., 1., 19., 15., 39.],
[ 2., 1., 21., 15., 81.],
[ 9., 1., 64., 19., 3.],
[ 11., 1., 67., 19., 14.],
[ 15., 1., 37., 20., 13.],
[ 16., 1., 22., 20., 79.],
[ 18., 1., 20., 21., 66.],
[ 19., 1., nan, 23., 29.],
[ 21., 1., 35., 24., 35.],
[ 22., 1., 25., 24., 73.],
[ 24., 1., 31., 25., 73.],
[ 26., 1., 29., 28., 82.],
[ 28., 1., 35., 28., 61.],
[ 31., 1., 60., 30., 4.],
[ 33., 1., 53., 33., 4.],
[ 34., 1., 18., 33., 92.],
[ 42., 1., 24., 38., 92.],
[ 43., 1., 48., 39., 36.],
[ 52., 1., 33., 42., 60.],
[ 54., 1., 59., 43., 60.],
[ 56., 1., 47., 43., 41.],
[ 58., 1., 69., 44., 46.],
[ 60., 1., 53., 46., 46.],
[ 61., 1., 70., 46., 56.],
[ 62., 1., 19., 46., 55.],
[ 65., 1., 63., 48., 51.],
[ 66., 1., 18., 48., 59.],
[ 69., 1., 19., 48., 59.],
[ 71., 1., 70., 49., 55.],
[ 75., 1., 59., 54., 47.],
[ 76., 1., 26., 54., 54.],
[ 78., 1., 40., 54., 48.],
[ 81., 1., 57., 54., 51.],
[ 82., 1., 38., 54., 55.],
[ 83., 1., 67., 54., 41.],
[ 86., 1., 48., 54., 46.],
[ 92., 1., 18., 59., 41.],
[ 93., 1., 48., 60., 49.],
[ 96., 1., 24., 60., 52.],
[ 99., 1., 48., 61., 42.],
[100., 1., 20., 61., 49.],
[103., 1., 67., 62., 59.],
[104., 1., 26., 62., 55.],
[105., 1., 49., 62., 56.],
[108., 1., 54., 63., 46.],
[109., 1., 68., 63., 43.],
[110., 1., 66., 63., 48.],
[111., 1., 65., 63., 52.],
[114., 1., 19., 64., 46.],
[121., 1., 27., 67., 56.],
[124., 1., 39., 69., 91.],
[127., 1., 43., 71., 35.],
[128., 1., 40., 71., 95.],
[129., 1., 59., 71., 11.],
[130., 1., 38., 71., 75.],
[131., 1., 47., 71., 9.],
[132., 1., 39., 71., 75.],
[135., 1., 20., 73., 5.],
[138., 1., 32., 73., 73.],
[139., 1., 19., 74., 10.],
[142., 1., 32., 75., 93.],
[145., 1., 25., 77., 12.],
[146., 1., 28., 77., 97.],
[147., 1., 48., 77., 36.],
[150., 1., 34., 78., 90.],
[151., 1., 43., 78., 17.],
[152., 1., 39., 78., 88.],
[157., 1., 37., 78., 1.],
[159., 1., 34., 78., 1.],
[163., 1., 19., 81., 5.],
[165., 1., 50., 85., 26.],
[167., 1., 42., 86., 20.],
[170., 1., 32., 87., 63.],
[171., 1., 40., 87., 13.],
[172., 1., 28., 87., 75.],
[173., 1., 36., 87., 10.],
[174., 1., 36., 87., 92.],
[177., 1., 58., 88., 15.],
[178., 1., 27., 88., 69.],
[179., 1., 59., 93., 14.],
[180., 1., 35., 93., 90.],
[183., 1., 46., 98., 15.],
[186., 1., 30., 99., 97.],
[188., 1., 28., 101., 68.],
[193., 1., 33., 113., 8.],
[198., 1., 32., 126., 74.],
[199., 1., 32., 137., 18.],
[200., 1., 30., 137., 83.]])
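The mask-based filtering above can be tried on a much smaller array. The following is a minimal sketch; the array `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up example array: columns are id, flag, age
toy = np.array([[1., 1., 19.],
                [2., 0., 21.],
                [3., 1., 20.],
                [4., 0., 23.]])

mask = toy[:, 1] == 1      # boolean array with one entry per row
selected = toy[mask, :]    # keeps only the rows where mask is True

print(mask)                # [ True False  True False]
print(selected)
print(mask.sum())          # True counts as 1, so this is the number of matching rows: 2
```

Because `True` is counted as 1, `mask.sum()` is a quick way to count how many rows satisfy the condition.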
Rows that contain a missing value in the 4th column

[133]: # Extract all nan values
# This does not work: nan never compares equal to anything, not even to itself
data[:, 3] == np.nan
[133]: array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
[134]: data[np.isnan(data[:, 3]), :]
[134]: array([[13., 0., 58., nan, 15.],
[27., 0., 45., nan, 32.],
[57., 0., 51., nan, 50.],
[68., 0., 68., nan, 48.]])
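Why the `==` comparison fails while `np.isnan` succeeds can be seen on a tiny array. This is a small sketch with made-up values; the name `col` is only illustrative.

```python
import numpy as np

# Made-up column with two missing values
col = np.array([15., np.nan, 17., np.nan])

# NaN is never equal to anything, including itself,
# so an equality test can never find it.
eq_result = col == np.nan
# np.isnan tests each element for NaN explicitly.
isnan_result = np.isnan(col)

print(np.nan == np.nan)   # False
print(eq_result)          # [False False False False]
print(isnan_result)       # [False  True False  True]
```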
[135]: np.isnan(data[:, 2]) | np.isnan(data[:, 3]) | np.isnan(data[:, 4])
[135]: array([False, False, False, False, False, False, False, True, False,
False, False, False, True, False, False, False, True, False,
True, False, False, False, False, False, False, False, True,
True, False, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
False, False, False, False, True, False, False, False, False,
False, False])
Rows that contain a missing value in any of the columns

The any method returns True if at least one of the values is True. Setting axis=1 applies it
row-wise (one result per row); axis=0 applies it column-wise (one result per column).
[136]: data[np.isnan(data).any(axis=1), :]
[136]: array([[ 8., 0., 23., 18., nan],
[13., 0., 58., nan, 15.],
[17., 0., 35., 21., nan],
[19., 1., nan, 23., 29.],
[27., 0., 45., nan, 32.],
[37., 0., nan, 34., 17.],
[57., 0., 51., nan, 50.],
[68., 0., 68., nan, 48.]])
[137]: np.isnan(data).any(axis=1)
[137]: array([False, False, False, False, False, False, False, True, False,
False, False, False, True, False, False, False, True, False,
True, False, False, False, False, False, False, False, True,
True, False, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
False, False, False, False, True, False, False, False, False,
False, False])
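The difference between `axis=1` and `axis=0` in `any` is easier to see on a small array. A minimal sketch, using a made-up 4 x 3 array named `toy`:

```python
import numpy as np

# Made-up array with missing values only in the last column
toy = np.array([[1.,  2.,  np.nan],
                [4.,  5.,  6.],
                [7.,  8.,  np.nan],
                [10., 11., 12.]])

nan_mask = np.isnan(toy)               # same shape as toy, True where a value is NaN
rows_with_nan = nan_mask.any(axis=1)   # one result per row (4 values)
cols_with_nan = nan_mask.any(axis=0)   # one result per column (3 values)

print(rows_with_nan)   # [ True False  True False]
print(cols_with_nan)   # [False False  True]
```

The length of the result shows which axis was collapsed: `axis=1` leaves one value per row, `axis=0` leaves one value per column.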
[138]: ~np.isnan(data).any(axis=1)
[138]: array([ True, True, True, True, True, True, True, False, True,
True, True, True, False, True, True, True, False, True,
False, True, True, True, True, True, True, True, False,
True, True, True, True, True, True, True, True, True,
False, True, True, True, True, True, True, True, True,
True, True, False, True, True, True, True, True, True,
True, True, True, True, False, True, True, True, True,
True, True])
Drop all rows that contain one or more missing values

[139]: data[~np.isnan(data).any(axis=1), :]
[139]: array([[ 1., 1., 19., 15., 39.],
[ 2., 1., 21., 15., 81.],
[ 3., 0., 20., 16., 6.],
[ 4., 0., 23., 16., 77.],
[ 5., 0., 31., 17., 40.],
[ 6., 0., 22., 17., 76.],
[ 7., 0., 35., 18., 6.],
[ 9., 1., 64., 19., 3.],
[ 10., 0., 30., 19., 72.],
[ 11., 1., 67., 19., 14.],
[ 12., 0., 35., 19., 99.],
[ 14., 0., 24., 20., 77.],
[ 15., 1., 37., 20., 13.],
[ 16., 1., 22., 20., 79.],
[ 18., 1., 20., 21., 66.],
[ 20., 0., 35., 23., 98.],
[ 21., 1., 35., 24., 35.],
[ 22., 1., 25., 24., 73.],
[ 23., 0., 46., 25., 5.],
[ 24., 1., 31., 25., 73.],
[ 25., 0., 54., 28., 14.],
[ 26., 1., 29., 28., 82.],
[ 28., 1., 35., 28., 61.],
[ 29., 0., 40., 29., 31.],
[ 30., 0., 23., 29., 87.],
[ 31., 1., 60., 30., 4.],
[ 32., 0., 21., 30., 73.],
[ 33., 1., 53., 33., 4.],
[ 34., 1., 18., 33., 92.],
[ 35., 0., 49., 33., 14.],
[ 36., 0., 21., 33., 81.],
[ 38., 0., 30., 34., 73.],
[ 39., 0., 36., 37., 26.],
[ 40., 0., 20., 37., 75.],
[ 41., 0., 65., 38., 35.],
[ 42., 1., 24., 38., 92.],
[ 43., 1., 48., 39., 36.],
[ 44., 0., 31., 39., 61.],
[ 45., 0., 49., 39., 28.],
[ 46., 0., 24., 39., 65.],
[ 47., 0., 50., 40., 55.],
[ 48., 0., 27., 40., 47.],
[ 49., 0., 29., 40., 42.],
[ 50., 0., 31., 40., 42.],
[ 51., 0., 49., 42., 52.],
[ 52., 1., 33., 42., 60.],
[ 53., 0., 31., 43., 54.],
[ 54., 1., 59., 43., 60.],
[ 55., 0., 50., 43., 45.],
[ 56., 1., 47., 43., 41.],
[ 58., 1., 69., 44., 46.],
[ 59., 0., 27., 46., 51.],
[ 60., 1., 53., 46., 46.],
[ 61., 1., 70., 46., 56.],
[ 62., 1., 19., 46., 55.],
[ 63., 0., 67., 47., 52.],
[ 64., 0., 54., 47., 59.],
[ 65., 1., 63., 48., 51.],
[ 66., 1., 18., 48., 59.],
[ 67., 0., 43., 48., 50.],
[ 69., 1., 19., 48., 59.],
[ 70., 0., 32., 48., 47.],
[ 71., 1., 70., 49., 55.],
[ 72., 0., 47., 49., 42.],
[ 73., 0., 60., 50., 49.],
[ 74., 0., 60., 50., 56.],
[ 75., 1., 59., 54., 47.],
[ 76., 1., 26., 54., 54.],
[ 77., 0., 45., 54., 53.],
[ 78., 1., 40., 54., 48.],
[ 79., 0., 23., 54., 52.],
[ 80., 0., 49., 54., 42.],
[ 81., 1., 57., 54., 51.],
[ 82., 1., 38., 54., 55.],
[ 83., 1., 67., 54., 41.],
[ 84., 0., 46., 54., 44.],
[ 85., 0., 21., 54., 57.],
[ 86., 1., 48., 54., 46.],
[ 87., 0., 55., 57., 58.],
[ 88., 0., 22., 57., 55.],
[ 89., 0., 34., 58., 60.],
[ 90., 0., 50., 58., 46.],
[ 91., 0., 68., 59., 55.],
[ 92., 1., 18., 59., 41.],
[ 93., 1., 48., 60., 49.],
[ 94., 0., 40., 60., 40.],
[ 95., 0., 32., 60., 42.],
[ 96., 1., 24., 60., 52.],
[ 97., 0., 47., 60., 47.],
[ 98., 0., 27., 60., 50.],
[ 99., 1., 48., 61., 42.],
[100., 1., 20., 61., 49.],
[101., 0., 23., 62., 41.],
[102., 0., 49., 62., 48.],
[103., 1., 67., 62., 59.],
[104., 1., 26., 62., 55.],
[105., 1., 49., 62., 56.],
[106., 0., 21., 62., 42.],
[107., 0., 66., 63., 50.],
[108., 1., 54., 63., 46.],
[109., 1., 68., 63., 43.],
[110., 1., 66., 63., 48.],
[111., 1., 65., 63., 52.],
[112., 0., 19., 63., 54.],
[113., 0., 38., 64., 42.],
[114., 1., 19., 64., 46.],
[115., 0., 18., 65., 48.],
[116., 0., 19., 65., 50.],
[117., 0., 63., 65., 43.],
[118., 0., 49., 65., 59.],
[119., 0., 51., 67., 43.],
[120., 0., 50., 67., 57.],
[121., 1., 27., 67., 56.],
[122., 0., 38., 67., 40.],
[123., 0., 40., 69., 58.],
[124., 1., 39., 69., 91.],
[125., 0., 23., 70., 29.],
[126., 0., 31., 70., 77.],
[127., 1., 43., 71., 35.],
[128., 1., 40., 71., 95.],
[129., 1., 59., 71., 11.],
[130., 1., 38., 71., 75.],
[131., 1., 47., 71., 9.],
[132., 1., 39., 71., 75.],
[133., 0., 25., 72., 34.],
[134., 0., 31., 72., 71.],
[135., 1., 20., 73., 5.],
[136., 0., 29., 73., 88.],
[137., 0., 44., 73., 7.],
[138., 1., 32., 73., 73.],
[139., 1., 19., 74., 10.],
[140., 0., 35., 74., 72.],
[141., 0., 57., 75., 5.],
[142., 1., 32., 75., 93.],
[143., 0., 28., 76., 40.],
[144., 0., 32., 76., 87.],
[145., 1., 25., 77., 12.],
[146., 1., 28., 77., 97.],
[147., 1., 48., 77., 36.],
[148., 0., 32., 77., 74.],
[149., 0., 34., 78., 22.],
[150., 1., 34., 78., 90.],
[151., 1., 43., 78., 17.],
[152., 1., 39., 78., 88.],
[153., 0., 44., 78., 20.],
[154., 0., 38., 78., 76.],
[155., 0., 47., 78., 16.],
[156., 0., 27., 78., 89.],
[157., 1., 37., 78., 1.],
[158., 0., 30., 78., 78.],
[159., 1., 34., 78., 1.],
[160., 0., 30., 78., 73.],
[161., 0., 56., 79., 35.],
[162., 0., 29., 79., 83.],
[163., 1., 19., 81., 5.],
[164., 0., 31., 81., 93.],
[165., 1., 50., 85., 26.],
[166., 0., 36., 85., 75.],
[167., 1., 42., 86., 20.],
[168., 0., 33., 86., 95.],
[169., 0., 36., 87., 27.],
[170., 1., 32., 87., 63.],
[171., 1., 40., 87., 13.],
[172., 1., 28., 87., 75.],
[173., 1., 36., 87., 10.],
[174., 1., 36., 87., 92.],
[175., 0., 52., 88., 13.],
[176., 0., 30., 88., 86.],
[177., 1., 58., 88., 15.],
[178., 1., 27., 88., 69.],
[179., 1., 59., 93., 14.],
[180., 1., 35., 93., 90.],
[181., 0., 37., 97., 32.],
[182., 0., 32., 97., 86.],
[183., 1., 46., 98., 15.],
[184., 0., 29., 98., 88.],
[185., 0., 41., 99., 39.],
[186., 1., 30., 99., 97.],
[187., 0., 54., 101., 24.],
[188., 1., 28., 101., 68.],
[189., 0., 41., 103., 17.],
[190., 0., 36., 103., 85.],
[191., 0., 34., 103., 23.],
[192., 0., 32., 103., 69.],
[193., 1., 33., 113., 8.],
[194., 0., 38., 113., 91.],
[195., 0., 47., 120., 16.],
[196., 0., 35., 120., 79.],
[197., 0., 45., 126., 28.],
[198., 1., 32., 126., 74.],
[199., 1., 32., 137., 18.],
[200., 1., 30., 137., 83.]])
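Dropping incomplete rows can be checked on a small array where the result is easy to verify by eye. A minimal sketch with made-up values; the names `toy`, `keep` and `clean` are illustrative only.

```python
import numpy as np

# Made-up array: two of the four rows contain a missing value
toy = np.array([[1.,     1., 19.],
                [2.,     0., np.nan],
                [3.,     1., 20.],
                [np.nan, 0., 23.]])

keep = ~np.isnan(toy).any(axis=1)   # True for rows without any NaN
clean = toy[keep]                   # boolean indexing returns a new array

print(keep)          # [ True False  True False]
print(clean.shape)   # (2, 3)
print(toy.shape)     # (4, 3), the original array is unchanged
```

Boolean indexing produces a copy, so the original array keeps all of its rows; assign the result back to the same name if the incomplete rows are no longer needed.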
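Get the maximum value in each row

axis=1 computes the maximum row-wise (one value per row). Note that the cell below passes
axis=0, so its output is the column-wise maximum (one value per column).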
[158]: data.max(axis=0)
[158]: array([200., 1., nan, nan, nan])
If the row or column being reduced contains a missing value, max returns nan. Use np.nanmax instead to ignore missing values.
[159]: np.nanmax(data, axis=0)
[159]: array([200., 1., 70., 137., 99.])
Similarly, NumPy provides the np.nanmin, np.nanmean, np.nanmedian and np.nanpercentile functions.
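The effect of the axis argument and of the nan-aware functions can be compared side by side. The sketch below uses a made-up 3 x 3 array, not the dataset.

```python
import numpy as np

# Made-up array with one missing value in the last column
toy = np.array([[1., 50., np.nan],
                [2., 70., 30.],
                [3., 60., 10.]])

col_max = toy.max(axis=0)               # NaN propagates: [ 3. 70. nan]
col_nanmax = np.nanmax(toy, axis=0)     # NaN ignored, per column: [ 3. 70. 30.]
row_nanmax = np.nanmax(toy, axis=1)     # NaN ignored, per row: [50. 70. 60.]
col_nanmean = np.nanmean(toy, axis=0)   # mean of the non-missing values: [ 2. 60. 20.]

print(col_max)
print(col_nanmax)
print(row_nanmax)
print(col_nanmean)
```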

Get the maximum value in each column
[142]: data.max(axis=0)
[142]: array([200., 1., nan, nan, nan])
Ignore missing data with nanmax.

[143]: np.nanmax(data, axis=0)
[143]: array([200., 1., 70., 137., 99.])
Writing if-else logic using np.where()

np.where(condition, x, y) takes each element from x where the condition is True and from y otherwise.
Example logic: if the second column equals 1, keep the score (5th column) as is; otherwise divide it by 2.
[144]: np.where(data[:, 1] == 1, data[:, 4], data[:, 4]/2)
[144]: array([39. , 81. , 3. , 38.5, 20. , 38. , 3. , nan, 3. , 36. , 14. ,
49.5, 7.5, 38.5, 13. , 79. , nan, 66. , 29. , 49. , 35. , 73. ,
2.5, 73. , 7. , 82. , 16. , 61. , 15.5, 43.5, 4. , 36.5, 4. ,
92. , 7. , 40.5, 8.5, 36.5, 13. , 37.5, 17.5, 92. , 36. , 30.5,
14. , 32.5, 27.5, 23.5, 21. , 21. , 26. , 60. , 27. , 60. , 22.5,
41. , 25. , 46. , 25.5, 46. , 56. , 55. , 26. , 29.5, 51. , 59. ,
25. , 24. , 59. , 23.5, 55. , 21. , 24.5, 28. , 47. , 54. , 26.5,
48. , 26. , 21. , 51. , 55. , 41. , 22. , 28.5, 46. , 29. , 27.5,
30. , 23. , 27.5, 41. , 49. , 20. , 21. , 52. , 23.5, 25. , 42. ,
49. , 20.5, 24. , 59. , 55. , 56. , 21. , 25. , 46. , 43. , 48. ,
52. , 27. , 21. , 46. , 24. , 25. , 21.5, 29.5, 21.5, 28.5, 56. ,
20. , 29. , 91. , 14.5, 38.5, 35. , 95. , 11. , 75. , 9. , 75. ,
17. , 35.5, 5. , 44. , 3.5, 73. , 10. , 36. , 2.5, 93. , 20. ,
43.5, 12. , 97. , 36. , 37. , 11. , 90. , 17. , 88. , 10. , 38. ,
8. , 44.5, 1. , 39. , 1. , 36.5, 17.5, 41.5, 5. , 46.5, 26. ,
37.5, 20. , 47.5, 13.5, 63. , 13. , 75. , 10. , 92. , 6.5, 43. ,
15. , 69. , 14. , 90. , 16. , 43. , 15. , 44. , 19.5, 97. , 12. ,
68. , 8.5, 42.5, 11.5, 34.5, 8. , 45.5, 8. , 39.5, 14. , 74. ,
18. , 83. ])
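The same if-else rule can be checked by hand on a few values. A minimal sketch with made-up `flag` and `score` arrays standing in for the second and fifth columns:

```python
import numpy as np

# Made-up stand-ins for the second column (flag) and fifth column (score)
flag = np.array([1., 0., 1., 0.])
score = np.array([39., 6., 81., 77.])

# Where flag == 1 take score as is, elsewhere take score / 2.
adjusted = np.where(flag == 1, score, score / 2)

print(adjusted)   # [39.   3.  81.  38.5]
```

Both `score` and `score / 2` are computed in full before `np.where` selects from them, so the function picks between two ready-made arrays rather than skipping work for one branch.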
Get the position of the maximum value in each row

[145]: max_pos = np.argmax(data, axis=1)
max_pos
[145]: array([4, 4, 2, 4, 4, 4, 2, 4, 2, 4, 2, 4, 3, 4, 2, 4, 4, 4, 2, 4, 2, 4,
2, 4, 2, 4, 3, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 0, 4, 2, 4, 2, 4,
2, 4, 4, 0, 0, 0, 4, 4, 4, 4, 0, 0, 3, 2, 0, 0, 2, 0, 2, 0, 0, 0,
0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
[146]: data[:5]
[146]: array([[ 1., 1., 19., 15., 39.],
[ 2., 1., 21., 15., 81.],
[ 3., 0., 20., 16., 6.],
[ 4., 0., 23., 16., 77.],
[ 5., 0., 31., 17., 40.]])
[147]: # max values in each row.

[data[row, i] for row, i in enumerate(max_pos)]
[147]: [39.0,
81.0,
20.0,
77.0,
40.0,
76.0,
35.0,
nan,
64.0,
72.0,
67.0,
99.0,
nan,
77.0,
37.0,
79.0,
nan,
66.0,
nan,
98.0,
35.0,
73.0,
46.0,
73.0,
54.0,
82.0,
nan,
61.0,
40.0,
87.0,
60.0,
73.0,
53.0,
92.0,
49.0,
81.0,
nan,
73.0,
39.0,
75.0,
65.0,
92.0,
48.0,
61.0,
49.0,
65.0,
55.0,
48.0,
49.0,
50.0,
52.0,
60.0,
54.0,
60.0,
55.0,
56.0,
nan,
69.0,
59.0,
60.0,
70.0,
62.0,
67.0,
64.0,
65.0,
66.0,
67.0,
nan,
69.0,
70.0,
71.0,
72.0,
73.0,
74.0,
75.0,
76.0,
77.0,
78.0,
79.0,
80.0,
81.0,
82.0,
83.0,
84.0,
85.0,
86.0,
87.0,
88.0,
89.0,
90.0,
91.0,
92.0,
93.0,
94.0,
95.0,
96.0,
97.0,
98.0,
99.0,
100.0,
101.0,
102.0,
103.0,
104.0,
105.0,
106.0,
107.0,
108.0,
109.0,
110.0,
111.0,
112.0,
113.0,
114.0,
115.0,
116.0,
117.0,
118.0,
119.0,
120.0,
121.0,
122.0,
123.0,
124.0,
125.0,
126.0,
127.0,
128.0,
129.0,
130.0,
131.0,
132.0,
133.0,
134.0,
135.0,
136.0,
137.0,
138.0,
139.0,
140.0,
141.0,
142.0,
143.0,
144.0,
145.0,
146.0,
147.0,
148.0,
149.0,
150.0,
151.0,
152.0,
153.0,
154.0,
155.0,
156.0,
157.0,
158.0,
159.0,
160.0,
161.0,
162.0,
163.0,
164.0,
165.0,
166.0,
167.0,
168.0,
169.0,
170.0,
171.0,
172.0,
173.0,
174.0,
175.0,
176.0,
177.0,
178.0,
179.0,
180.0,
181.0,
182.0,
183.0,
184.0,
185.0,
186.0,
187.0,
188.0,
189.0,
190.0,
191.0,
192.0,
193.0,
194.0,
195.0,
196.0,
197.0,
198.0,
199.0,
200.0]
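The nan entries in the list above appear because np.argmax treats nan as the maximum and returns its position. One possible alternative, sketched here on a made-up array, is np.nanargmax combined with integer-array indexing in place of the list comprehension. Note that np.nanargmax raises a ValueError if a row consists entirely of nan.

```python
import numpy as np

# Made-up array with one missing value in the second row
toy = np.array([[1., 19.,    39.],
                [2., np.nan, 81.],
                [3., 20.,    6.]])

max_pos = np.argmax(toy, axis=1)            # NaN wins in row 1: [2 1 1]
nan_safe_pos = np.nanargmax(toy, axis=1)    # NaN ignored: [2 2 1]

# Pair each row index with its column position to pull out the values.
rows = np.arange(toy.shape[0])
row_max = toy[rows, nan_safe_pos]

print(max_pos)        # [2 1 1]
print(nan_safe_pos)   # [2 2 1]
print(row_max)        # [39. 81. 20.]
```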
Get the position of values that satisfy a given condition

[148]: # Positions where a given condition is satisfied
pos = np.argwhere(data[:, 1] == 1)
pos[:5]
[148]: array([[ 0],
[ 1],
[ 8],
[10],
[14]])
[149]: data[:5, :]
[149]: array([[ 1., 1., 19., 15., 39.],
[ 2., 1., 21., 15., 81.],
[ 3., 0., 20., 16., 6.],
[ 4., 0., 23., 16., 77.],
[ 5., 0., 31., 17., 40.]])
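np.argwhere returns a 2-D array with one row per match, which is why each position above is wrapped in its own brackets. When a flat list of row positions is more convenient, np.flatnonzero is one option. A minimal sketch on a made-up `flag` array:

```python
import numpy as np

# Made-up stand-in for the second column
flag = np.array([1., 1., 0., 0., 1.])

pos_2d = np.argwhere(flag == 1)      # shape (3, 1): one row per matching position
pos_1d = np.flatnonzero(flag == 1)   # shape (3,): flat array of positions

print(pos_2d.shape)   # (3, 1)
print(pos_1d)         # [0 1 4]
```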
2.7 Exercise
1. From Mall_Customers_Int.csv, find the row positions where the 2nd column is 1 and the 3rd
column has a value < 21. Extract the values of these two columns for those rows. How many such rows exist?
2. Create a new array from Mall_Customers_Int.csv that has the value 1 where the 2nd column = 1
and the 3rd column < 21, and the value 0 otherwise.
import numpy as np
data = np.genfromtxt('Datasets/Mall_Customers_Int.csv',
                     delimiter=",",
                     skip_header=1)
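As a hint for both exercises, two conditions can be combined element-wise with `&`, with each condition in its own parentheses. The sketch below runs on a small made-up array rather than the CSV file, so the numbers it prints are not the answer to the exercise.

```python
import numpy as np

# Made-up array: columns are id, flag, age
toy = np.array([[1., 1., 19.],
                [2., 1., 21.],
                [3., 0., 20.],
                [4., 1., 18.]])

# Parentheses are required: & binds more tightly than == and <.
both = (toy[:, 1] == 1) & (toy[:, 2] < 21)

positions = np.flatnonzero(both)    # row positions where both conditions hold
count = int(both.sum())             # number of such rows
indicator = np.where(both, 1, 0)    # 1 where both hold, 0 otherwise

print(positions)   # [0 3]
print(count)       # 2
print(indicator)   # [1 0 0 1]
```

A comparison involving nan evaluates to False, so rows with a missing value in the compared column do not satisfy the condition.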