Data Science Handwritten Notes - 3
UNIT: III
Lecture No.   Topics Covered
1             Arrays and Vectorized Computation
UNIT- 3 LECTURE-1
NumPy, short for Numerical Python, is one of the most important foundational packages
for numerical computing in Python.
Consider a NumPy array of one million integers, and the equivalent Python list:
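The comparison can be sketched as follows. This is a minimal sketch: the names my_arr and my_list and the use of time.perf_counter for timing are assumptions made for illustration, and the measured times depend on the machine.

```python
import time
import numpy as np

# Hypothetical names; one million integers in each container.
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Double every element with a single vectorized expression.
start = time.perf_counter()
arr_doubled = my_arr * 2
arr_seconds = time.perf_counter() - start

# Double every element with a Python-level loop (list comprehension).
start = time.perf_counter()
list_doubled = [x * 2 for x in my_list]
list_seconds = time.perf_counter() - start

print(f"ndarray: {arr_seconds:.4f} s, list: {list_seconds:.4f} s")
```

On most machines the array version is expected to run considerably faster, since the loop happens in compiled code rather than in the Python interpreter.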
A Multidimensional Array
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is
a fast, flexible container for large datasets in Python.
In [14]: data
Out[14]:
array([[-0.2047, 0.4789, -0.5194],
[-0.5557, 1.9658, 1.3934]])
I then write mathematical operations with data:
In [15]: data * 10
Out[15]:
array([[-2.0471, 4.7894, -5.1944],
[-5.5573, 19.6578, 13.9341]])
An ndarray is a generic multidimensional container for homogeneous data; that is, all of
the elements must be the same type. Every array has a shape, a tuple indicating the size of
each dimension, and a dtype, an object describing the data type of the array:
In [17]: data.shape
Out[17]: (2, 3)
In [18]: data.dtype
Out[18]: dtype('float64')
Creating nd-arrays
The easiest way to create an array is to use the array function. This accepts any sequence-
like object (including other arrays) and produces a new NumPy array containing the
passed data.
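A minimal sketch of this, assuming input lists named data1 and data2 (data2 is mentioned in these notes; the list contents are inferred from the printed arrays):

```python
import numpy as np

# data1 and data2 are assumed inputs inferred from the printed arrays.
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)   # 1-D array; mixed ints and floats become float64

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)   # nested equal-length lists become a 2-D array

print(arr1)
print(arr2)
```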
In [21]: arr1
Out[21]: array([6. , 7.5, 8. , 0. , 1. ])
In [24]: arr2
Out[24]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape
inferred from the data. We can confirm this by inspecting the ndim and shape attributes:
In [25]: arr2.ndim
Out[25]: 2
In [26]: arr2.shape
Out[26]: (2, 4)
In addition to np.array, there are a number of other functions for creating new arrays. As
examples, zeros and ones create arrays of 0s or 1s, respectively:
In [29]: np.zeros(10)
Out[29]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [30]: np.zeros((3, 6))
Out[30]:
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
In [32]: np.arange(15)
Out[32]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
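The ones function works the same way as zeros, and arange also accepts an explicit start, stop, and step. A short sketch (the shapes and ranges are arbitrary illustrations, not taken from the notes):

```python
import numpy as np

ones_2d = np.ones((2, 3))      # 2x3 array filled with 1.0 (float64 by default)
evens = np.arange(0, 10, 2)    # start, stop (exclusive), step

print(ones_2d)
print(evens)                   # [0 2 4 6 8]
```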
Vectorization
Arrays are important because they enable you to express batch operations on data without
writing any for loops. NumPy users call this vectorization.
In [52]: arr
Out[52]:
array([[1., 2., 3.],
[4., 5., 6.]])
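Arithmetic between two arrays of the same shape is applied element by element. A minimal sketch using the arr values shown above (the names product and difference are illustrative):

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

product = arr * arr      # element-wise multiplication, not a matrix product
difference = arr - arr   # element-wise subtraction gives all zeros

print(product)
print(difference)
```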
Arithmetic operations with scalars propagate the scalar argument to each element in the
array:
In [55]: 1 / arr
Out[55]:
array([[1. , 0.5 , 0.3333],
[0.25 , 0.2 , 0.1667]])
In [58]: arr2
Out[58]:
array([[ 0., 4., 1.],
[ 7., 2., 12.]])
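Comparisons between arrays of the same shape are also element-wise and yield boolean arrays. A sketch using the arr and arr2 values shown above (the name greater is illustrative):

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

# True wherever the element of arr2 exceeds the matching element of arr.
greater = arr2 > arr
print(greater)
```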
NumPy array indexing is a rich topic, as there are many ways you may want to select a
subset of your data or individual elements. One-dimensional arrays are simple; on the
surface they act similarly to Python lists:
In [61]: arr
Out[61]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [62]: arr[5]
Out[62]: 5
In [63]: arr[5:8]
Out[63]: array([5, 6, 7])
In [64]: arr[5:8] = 12
In [65]: arr
Out[65]: array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
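One difference from Python lists is that an array slice is a view on the original data rather than a copy, so changes made through the slice show up in the source array. A minimal sketch (the names arr_slice and independent and the value 12345 are illustrative):

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[5:8]            # a view on arr, not a copy
arr_slice[1] = 12345            # writes through to arr[6]
print(arr)

independent = arr[5:8].copy()   # an explicit copy decouples the data
independent[:] = 0
print(arr)                      # not affected by the write to the copy
```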
UNIT- 3 LECTURE-2
NumPy has some extra data types and refers to data types with a single character, such as i for integers and u for unsigned integers. Below is a list of the data types in NumPy and the characters used to represent them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type ( void )
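These characters can be read back from an array through the kind attribute of its dtype. A small sketch, with sample arrays chosen only for illustration:

```python
import numpy as np

# Small arrays chosen only to illustrate each kind code.
samples = [
    np.array([1, 2, 3]),                              # signed integer
    np.array([1, 2, 3], dtype=np.uint8),              # unsigned integer
    np.array([True, False]),                          # boolean
    np.array([1.5, 2.5]),                             # float
    np.array([1 + 2j]),                               # complex float
    np.array(['apple', 'banana']),                    # unicode string
    np.array(['2024-01-01'], dtype='datetime64[D]'),  # datetime
]

kinds = [sample.dtype.kind for sample in samples]
for sample, kind in zip(samples, kinds):
    print(sample.dtype, kind)
```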
import numpy as np
arr = np.array([1, 2, 3, 4])  # example values; the original definition is missing
print(arr.dtype)
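import numpy as np
# Example values; the original array definitions are missing.
arr = np.array([1, 2, 3, 4], dtype='S')   # stored as byte strings
print(arr)
print(arr.dtype)
arr = np.array([1, 2, 3, 4], dtype='i4')  # stored as 4-byte integers
print(arr)
print(arr.dtype)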
Change data type from float to integer by using 'i' as parameter value:
import numpy as np
arr = np.array([1.1, 2.1, 3.1])  # example float array; the original definition is missing
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
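When astype converts floats to integers, the fractional part is discarded rather than rounded. The same method can also convert strings that hold numbers. A brief sketch with illustrative values:

```python
import numpy as np

floats = np.array([3.7, -1.2, 0.5])
ints = floats.astype('i')        # fractional part is dropped (truncation toward zero)
print(ints)                      # [ 3 -1  0]

numeric_strings = np.array(['1.25', '-9.6', '42'])
as_floats = numeric_strings.astype(float)   # strings parsed as float64
print(as_floats)
```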
UNIT- 3 LECTURE-3
Element-wise arithmetic operations are possible only if the arrays have the same shape or shapes that NumPy can broadcast together. We carry out the operations following the rules of array manipulation. NumPy provides both functions and operators to perform these operations.
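The shape requirement can be checked with a short sketch. The array c and the variable names are hypothetical and chosen only to show a mismatch:

```python
import numpy as np

a = np.array([10, 20, 100, 200, 500])
b = np.array([3, 4, 5, 6, 7])
c = np.array([1, 2, 3])       # length 3: incompatible with length 5

same_shape_sum = a + b        # shapes (5,) and (5,) match
scalar_sum = a + 1            # a scalar is applied to every element

try:
    a + c
    mismatch_error = None
except ValueError as exc:     # NumPy raises ValueError for incompatible shapes
    mismatch_error = exc

print(same_shape_sum)
print(scalar_sum)
print(mismatch_error)
```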
import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
np.add(a, b)
Output
array([ 13,  24, 105, 206, 507])
import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
print(a+b)
Output
[ 13  24 105 206 507]
import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
np.subtract(a, b)
Output
array([ 7, 16, 95, 194, 493])
NumPy Subtract Operator
We can also use the subtract operator “-” to produce the difference of two arrays.
import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
print(a-b)
Output
[ 7 16 95 194 493]
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.multiply(a, b)
Output
array([21, 12, 20, 30,  7])
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
print(a*b)
Output
[21 12 20 30 7]
We use the np.divide function to output the element-wise division of two arrays. We cannot divide arrays whose shapes are incompatible.
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.divide(a,b)
Output
array([2.33333333, 0.75      , 0.8       , 0.83333333, 0.14285714])
We can also use the divide operator “/” to divide two arrays.
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
print(a/b)
Output
[2.33333333 0.75       0.8        0.83333333 0.14285714]
We use either the np.remainder or the np.mod function to output the remainder of the element-wise division of two arrays.
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.remainder(a,b)
Output
array([1, 3, 4, 5, 1])
NumPy Mod Function
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.mod(a,b)
Output
array([1, 3, 4, 5, 1])
The np.power function treats the elements of the first array as bases and raises each one to the power of the corresponding element of the second array.
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.power(a,b)
Output
array([  343,    81,  1024, 15625,     1])
import numpy as np
a = np.array([7,3,4,5,1])
# With an integer array the result is also integer, so 1/7, 1/3, 1/4 and 1/5 become 0.
np.reciprocal(a)
Output
array([0, 0, 0, 0, 1])
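The zeros above come from integer arithmetic. To get fractional reciprocals, one option is to convert the array to a float type first. A minimal sketch (the names float_recip and by_division are illustrative):

```python
import numpy as np

a = np.array([7, 3, 4, 5, 1])

float_recip = np.reciprocal(a.astype(float))   # float input keeps the fractions
by_division = 1 / a                            # true division gives the same values

print(float_recip)
print(by_division)
```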
UNIT- 3 LECTURE-4
Contents of an ndarray object can be accessed and modified by indexing or slicing, just like Python's built-in container objects.
As mentioned earlier, items in an ndarray object follow a zero-based index. Three types of indexing methods are available: field access, basic slicing and advanced indexing.
Basic slicing is an extension of Python's basic concept of slicing to n dimensions. A Python slice object is constructed by giving start, stop, and step parameters to the built-in slice function. This slice object is passed to the array to extract a part of the array.
Example 1
import numpy as np
a = np.arange(10)
s = slice(2,7,2)
print(a[s])
Its output is as follows −
[2 4 6]
In the above example, an ndarray object is prepared by arange() function. Then a slice
object is defined with start, stop, and step values 2, 7, and 2 respectively. When this slice
object is passed to the ndarray, a part of it starting with index 2 up to 7 with a step of 2 is
sliced.
The same result can also be obtained by giving the slicing parameters separated by a
colon : (start:stop:step) directly to the ndarray object.
Example 2
import numpy as np
a = np.arange(10)
b = a[2:7:2]
print(b)
Here, we will get the same output −
[2 4 6]
If only one parameter is given, a single item corresponding to the index is returned. If a : follows it, all items from that index onwards are extracted. If two parameters (with : between them) are used, items between the two indexes (not including the stop index) are sliced with a default step of one.
Example 3
# slice single item
import numpy as np
a = np.arange(10)
b = a[5]
print(b)
Its output is as follows −
5
Example 4
# slice items starting from index
import numpy as np
a = np.arange(10)
print(a[2:])
Now, the output would be −
[2 3 4 5 6 7 8 9]
Example 5
# slice items between indexes
import numpy as np
a = np.arange(10)
print(a[2:5])
Here, the output would be −
[2 3 4]
The above description applies to multi-dimensional ndarray too.
Example 6
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a)
[[1 2 3]
[3 4 5]
[4 5 6]]
Example 7
# array to begin with
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
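Slicing this two-dimensional array along rows and columns can be sketched as follows. The particular slices and variable names are illustrative choices, and the ellipsis (...) stands for all remaining dimensions:

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

rows_from_1 = a[1:]        # rows from index 1 onward
second_col = a[..., 1]     # all rows, column 1
second_row = a[1, ...]     # row 1, all columns
cols_from_1 = a[..., 1:]   # all rows, columns from index 1 onward

print(rows_from_1)
print(second_col)          # [2 4 5]
print(second_row)          # [3 4 5]
print(cols_from_1)
```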
Boolean Indexing
Let’s consider an example where we have some data in an array and an array of names with duplicates. Here I use the randn function in numpy.random to generate some random normally distributed data:
In [100]: names
Out[100]: array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
In [101]: data
Out[101]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 1.669 , -0.4386, -0.5397, 0.477 ],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
Suppose each name corresponds to a row in the data array and we wanted to select all the
rows with corresponding name 'Bob'. Like arithmetic operations, comparisons (such
as ==) with arrays are also vectorized. Thus, comparing names with the
string 'Bob' yields a boolean array:
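A minimal sketch of the comparison and of using its result as an index. The names array matches the one shown above; the random values will differ from the printed data, and the names mask and bob_rows are illustrative:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)   # one row of random values per name

mask = names == 'Bob'          # vectorized comparison, one boolean per name
print(mask)

bob_rows = data[mask]          # keeps only the rows where mask is True
print(bob_rows)
```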
The boolean array must be of the same length as the array axis it’s indexing. You can
even mix and match boolean arrays with slices or integers (or sequences of integers; more
on this later).
In these examples, I select from the rows where names == 'Bob' and index the columns,
too:
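A sketch of combining a boolean row selection with column indexing. To keep the results easy to follow, the data array here holds the integers 0 to 27 instead of random values (an assumption made for illustration):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)   # stand-in for the random data

mask = names == 'Bob'
last_two_cols = data[mask, 2:]   # rows for 'Bob', columns 2 onward
last_col = data[mask, 3]         # rows for 'Bob', column 3 only

print(last_two_cols)
print(last_col)                  # [ 3 15]
```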
To select everything but 'Bob', you can either use != or negate the condition using ~:
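Both forms can be sketched as follows, again with the integers 0 to 27 standing in for the random data (an assumption made for illustration):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)   # stand-in for the random data

cond = names == 'Bob'
not_bob_negated = data[~cond]               # negate the boolean array with ~
not_bob_compared = data[names != 'Bob']     # or compare with != directly

print(not_bob_negated)
```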
In [138]: arr
Out[138]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [139]: np.sqrt(arr)
Out[139]:
array([0.    , 1.    , 1.4142, 1.7321, 2.    , 2.2361, 2.4495, 2.6458,
       2.8284, 3.    ])
In [140]: np.exp(arr)
Out[140]:
array([   1.    ,    2.7183,    7.3891,   20.0855,   54.5982,  148.4132,
        403.4288, 1096.6332, 2980.958 , 8103.0839])
Functions such as sqrt and exp take a single array and are referred to as unary ufuncs (universal functions). Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:
In [141]: x = np.random.randn(8)
In [142]: y = np.random.randn(8)
In [143]: x
Out[143]:
array([-0.0119, 1.0048, 1.3272, -0.9193, -1.5491, 0.0222, 0.7584,
-0.6605])
In [144]: y
Out[144]:
array([ 0.8626, -0.01  ,  0.05  ,  0.6702,  0.853 , -0.9559, -0.0235,
       -2.3042])
In [145]: np.maximum(x, y)
Out[145]:
array([ 0.8626, 1.0048, 1.3272, 0.6702, 0.853 , 0.0222, 0.7584,
-0.6605])
In [147]: arr
Out[147]: array([-3.2623, -6.0915, -6.663 ,  5.3731,  3.6182,  3.45  ,  5.0077])
In [148]: remainder, whole_part = np.modf(arr)
In [149]: remainder
Out[149]: array([-0.2623, -0.0915, -0.663 ,  0.3731,  0.6182,  0.45  ,  0.0077])
In [150]: whole_part
Out[150]: array([-3., -6., -6.,  5.,  3.,  3.,  5.])
Ufuncs accept an optional out argument that allows them to operate in-place on arrays:
In [151]: arr
Out[151]: array([-3.2623, -6.0915, -6.663 ,  5.3731,  3.6182,  3.45  ,  5.0077])
In [152]: np.sqrt(arr)
Out[152]: array([   nan,    nan,    nan, 2.318 , 1.9022, 1.8574, 2.2378])
In [153]: np.sqrt(arr, out=arr)
Out[153]: array([   nan,    nan,    nan, 2.318 , 1.9022, 1.8574, 2.2378])
In [154]: arr
Out[154]: array([   nan,    nan,    nan, 2.318 , 1.9022, 1.8574, 2.2378])
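A self-contained sketch of the out argument, using a small array of perfect squares (illustrative values) so that the in-place result is easy to verify:

```python
import numpy as np

arr = np.array([1.0, 4.0, 9.0, 16.0])

result = np.sqrt(arr, out=arr)   # results are written into arr itself
print(arr)                       # [1. 2. 3. 4.]
print(result is arr)             # True: the returned array is the out array
```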
UNIT- 3 LECTURE-7
A set of mathematical functions that compute statistics about an entire array or about the
data along an axis are accessible as methods of the array class. You can use aggregations
(often called reductions) like sum, mean, and std (standard deviation) either by calling
the array instance method or using the top-level NumPy function.
Here I generate some normally distributed random data and compute some aggregate
statistics:
In [178]: arr
Out[178]:
array([[ 2.1695, -0.1149, 2.0037, 0.0296],
[ 0.7953, 0.1181, -0.7485, 0.585 ],
[ 0.1527, -1.5657, -0.5625, -0.0327],
[-0.929 , -0.4826, -0.0363, 1.0954],
[ 0.9809, -0.5895, 1.5817, -0.5287]])
In [179]: arr.mean()
Out[179]: 0.19607051119998253
In [180]: np.mean(arr)
Out[180]: 0.19607051119998253
In [181]: arr.sum()
Out[181]: 3.9214102239996507
Functions like mean and sum take an optional axis argument that computes the statistic
over the given axis, resulting in an array with one fewer dimension:
In [182]: arr.mean(axis=1)
Out[182]: array([ 1.022 , 0.1875, -0.502 , -0.0881, 0.3611])
In [183]: arr.sum(axis=0)
Out[183]: array([ 3.1693, -2.6345, 2.2381, 1.1486])
Here, arr.mean(1) means “compute mean across the columns,” whereas arr.sum(0) means “compute sum down the rows.”
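The effect of the axis argument is easier to check on a small array with known values. A sketch using the numbers 0 to 11 (an illustrative choice, not the random data above):

```python
import numpy as np

arr = np.arange(12.0).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

row_means = arr.mean(axis=1)   # one mean per row
col_sums = arr.sum(axis=0)     # one sum per column

print(row_means)               # [1.5 5.5 9.5]
print(col_sums)                # [12. 15. 18. 21.]
```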
Other methods like cumsum and cumprod do not aggregate, instead producing an array of
the intermediate results:
In [185]: arr.cumsum()
Out[185]: array([ 0, 1, 3, 6, 10, 15, 21, 28])
In [187]: arr
Out[187]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
In [188]: arr.cumsum(axis=0)
Out[188]:
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15]])
In [189]: arr.cumprod(axis=1)
Out[189]:
array([[ 0, 0, 0],
[ 3, 12, 60],
[ 6, 42, 336]])
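The cumulative examples above can be reproduced with the following sketch. The array definitions are inferred from the printed outputs, and the variable names are illustrative:

```python
import numpy as np

flat = np.array([0, 1, 2, 3, 4, 5, 6, 7])
running_total = flat.cumsum()          # running sum over the whole array
print(running_total)                   # [ 0  1  3  6 10 15 21 28]

grid = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
down_rows = grid.cumsum(axis=0)        # running sum down each column
across_cols = grid.cumprod(axis=1)     # running product along each row
print(down_rows)
print(across_cols)
```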