Data Science Handwritten Notes - 3


MODERN INSTITUTE OF TECHNOLOGY AND RESEARCH CENTRE, ALWAR

NAME OF FACULTY: DR. AWANIT KUMAR

SUB : Data Science Using Python 3ADS-07

SEMESTER : V SESSION : 2021-22 (ODD SEM.)

BRANCH : AI&DS BATCH : A

UNIT : III

Lecture No.   Topics Covered                                                            Total Pages
1             Arrays and Vectorized Computation                                         4
2             The NumPy ND array- Creating ND arrays                                    4
3             Data Types for ND arrays                                                  3
4             Arithmetic with NumPy Arrays & Basic Indexing and Slicing                 3
5             Boolean Indexing-Transposing Arrays and Swapping Axes                     4
6             Universal Functions: Fast Element-Wise Array Functions                    4
7             Mathematical and Statistical Methods-Sorting Unique and Other Set Logic   4

References:
1. Data Analysis with Python: A Modern Approach, David Taieb, Packt Publishing, ISBN: 9781789950069
2. Python Data Analysis, Second Ed., Armando Fandango, Packt Publishing, ISBN: 9781787127487
UNIT- 3 LECTURE-1

NumPy Basics: Arrays & Vectorization

NumPy, short for Numerical Python, is one of the most important foundational packages
for numerical computing in Python.

Consider a NumPy array of one million integers, and the equivalent Python list:

In [7]: import numpy as np

In [8]: my_arr = np.arange(1000000)

In [9]: my_list = list(range(1000000))
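
A rough way to see the difference is to time the same operation, doubling every element, on both containers. The sketch below uses the standard-library timeit module; the repeat count of 10 is an arbitrary choice, and the timings depend on the machine, so no figures are quoted here.

```python
import timeit

import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Double every element ten times with each container and time it.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy array: {numpy_seconds:.4f} s")
print(f"Python list: {list_seconds:.4f} s")
```

The NumPy version is usually much faster, because the loop over the elements runs in compiled code instead of the Python interpreter.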

A Multidimensional Array

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is
a fast, flexible container for large datasets in Python.

In [12]: import numpy as np

# Generate some random data


In [13]: data = np.random.randn(2, 3)

In [14]: data
Out[14]:
array([[-0.2047, 0.4789, -0.5194],
[-0.5557, 1.9658, 1.3934]])
I then write mathematical operations with data:

In [15]: data * 10
Out[15]:
array([[-2.0471, 4.7894, -5.1944],
[-5.5573, 19.6578, 13.9341]])

In [16]: data + data


Out[16]:
array([[-0.4094, 0.9579, -1.0389],
[-1.1115, 3.9316, 2.7868]])

An ndarray is a generic multidimensional container for homogeneous data; that is, all of
the elements must be the same type. Every array has a shape, a tuple indicating the size of
each dimension, and a dtype, an object describing the data type of the array:

In [17]: data.shape
Out[17]: (2, 3)

In [18]: data.dtype
Out[18]: dtype('float64')

Creating nd-arrays

The easiest way to create an array is to use the array function. This accepts any sequence-
like object (including other arrays) and produces a new NumPy array containing the
passed data.

In [19]: data1 = [6, 7.5, 8, 0, 1]


In [20]: arr1 = np.array(data1)

In [21]: arr1
Out[21]: array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a
multidimensional array:

In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

In [23]: arr2 = np.array(data2)

In [24]: arr2
Out[24]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape
inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

In [25]: arr2.ndim
Out[25]: 2

In [26]: arr2.shape
Out[26]: (2, 4)

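In addition to np.array, there are a number of other functions for creating new arrays. As
examples, zeros and ones create arrays of 0s or 1s, respectively, with a given length or
shape. empty creates an array without initializing its values, so its contents are arbitrary
(the zeros shown for np.empty below are not guaranteed):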

In [29]: np.zeros(10)
Out[29]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [30]: np.zeros((3, 6))
Out[30]:
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])

In [31]: np.empty((2, 3, 2))


Out[31]:
array([[[0., 0.],
[0., 0.],
[0., 0.]],
[[0., 0.],
[0., 0.],
[0., 0.]]])
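
The text mentions ones but shows no example of it. A short sketch follows; it inspects only the shape of the array returned by np.empty, because its values are uninitialized and should not be relied upon. The names ones_2d and scratch are illustrative.

```python
import numpy as np

ones_2d = np.ones((2, 3))        # every element is 1.0
print(ones_2d)

scratch = np.empty((2, 3, 2))    # contents are arbitrary until assigned
print(scratch.shape)             # (2, 3, 2)

scratch[:] = 7.0                 # fill the array before using it
print(scratch.sum())             # 84.0
```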

arange is an array-valued version of the built-in Python range function:

In [32]: np.arange(15)
Out[32]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14])

Vectorization

Arrays are important because they enable you to express batch operations on data without
writing any for loops. NumPy users call this vectorization.

In [51]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [52]: arr
Out[52]:
array([[1., 2., 3.],
[4., 5., 6.]])

In [53]: arr * arr


Out[53]:
array([[ 1., 4., 9.],
[16., 25., 36.]])

In [54]: arr - arr


Out[54]:
array([[0., 0., 0.],
[0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the
array:

In [55]: 1 / arr
Out[55]:
array([[1. , 0.5 , 0.3333],
[0.25 , 0.2 , 0.1667]])

In [56]: arr ** 0.5


Out[56]:
array([[1. , 1.4142, 1.7321],
[2. , 2.2361, 2.4495]])

Comparisons between arrays of the same size yield boolean arrays:

In [57]: arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

In [58]: arr2
Out[58]:
array([[ 0., 4., 1.],
[ 7., 2., 12.]])

In [59]: arr2 > arr


Out[59]:
array([[False, True, False],
[ True, False, True]])

Basic Indexing and Slicing

NumPy array indexing is a rich topic, as there are many ways you may want to select a
subset of your data or individual elements. One-dimensional arrays are simple; on the
surface they act similarly to Python lists:

In [60]: arr = np.arange(10)

In [61]: arr
Out[61]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [62]: arr[5]
Out[62]: 5

In [63]: arr[5:8]
Out[63]: array([5, 6, 7])

In [64]: arr[5:8] = 12

In [65]: arr
Out[65]: array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
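
The assignment above changed arr itself because a slice of a NumPy array is a view on the original data, not a copy. A minimal sketch of the difference, using the illustrative names view and independent:

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                  # a view: shares memory with arr
view[0] = 99
print(arr)                       # [ 0  1  2  3  4 99  6  7  8  9]

independent = arr[5:8].copy()    # an explicit copy owns its own data
independent[:] = -1
print(arr)                       # not affected by changes to the copy
```

When a slice must not affect the original array, call copy() on it as shown.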
UNIT- 3 LECTURE-2

Data Types in NumPy

NumPy has some extra data types and refers to data types with one-character codes, like i
for integers, u for unsigned integers, etc. Below is a list of the data type kinds in NumPy
and the characters used to represent them.

• i - integer
• b - boolean
• u - unsigned integer
• f - float
• c - complex float
• m - timedelta
• M - datetime
• O - object
• S - string
• U - unicode string
• V - fixed chunk of memory for other type ( void )

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr.dtype)

Create an array with data type string

import numpy as np

arr = np.array([1, 2, 3, 4], dtype='S')

print(arr)
print(arr.dtype)

For i, u, f, S and U we can define size as well.


import numpy as np

arr = np.array([1, 2, 3, 4], dtype='i4')

print(arr)
print(arr.dtype)
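
The number after the type character is the size in bytes for i, u and f, and the number of characters for S and U. The following sketch inspects the itemsize attribute to confirm this; the particular codes are only examples.

```python
import numpy as np

# Print each code with the dtype name and the bytes used per element.
for code in ("i4", "u2", "f8", "S3", "U5"):
    dt = np.dtype(code)
    print(code, dt.name, dt.itemsize)
```

A U code stores 4 bytes per character, so U5 reports an itemsize of 20.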

Change data type from float to integer by using 'i' as parameter value:

import numpy as np

arr = np.array([1.1, 2.1, 3.1])

newarr = arr.astype('i')

print(newarr)
print(newarr.dtype)
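
Note that converting floats to integers with astype discards the fractional part (truncation toward zero) instead of rounding. A short sketch with made-up values, including negatives, makes this visible; np.rint is used here as one possible way to round before casting.

```python
import numpy as np

values = np.array([1.9, -1.9, 2.5, -0.4])

truncated = values.astype("i")           # fractional part is dropped
rounded = np.rint(values).astype("i")    # round half to even, then cast

print(truncated)   # [ 1 -1  2  0]
print(rounded)     # [ 2 -2  2  0]
```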
UNIT- 3 LECTURE-3

NumPy Arithmetic Operations

Element-wise arithmetic operations require the arrays to have the same shape, or shapes
that are compatible under NumPy's broadcasting rules. NumPy provides both functions and
operators to perform these operations.

NumPy Add function


This function adds two arrays element by element. If the shapes are incompatible (they
cannot be broadcast together), NumPy raises a ValueError.

import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
np.add(a, b)

Output

array([ 13, 24, 105, 206, 507])
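
Shapes do not have to be identical as long as NumPy can broadcast them; a ValueError is raised only when they are incompatible. The sketch below, with illustrative arrays, shows one case of each.

```python
import numpy as np

a = np.array([10, 20, 100, 200, 500])

# A scalar is broadcast to every element.
print(np.add(a, 1))                   # [ 11  21 101 201 501]

# A (2, 1) column and a (5,) row broadcast to shape (2, 5).
column = np.array([[0], [1000]])
print(np.add(column, a).shape)        # (2, 5)

# Shapes (5,) and (3,) cannot be broadcast together.
try:
    np.add(a, np.array([1, 2, 3]))
except ValueError as exc:
    print("ValueError:", exc)
```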


NumPy Add Operator
We can also use the add operator “+” to perform addition of two arrays.

import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
print(a+b)
Output

[ 13 24 105 206 507]

NumPy Subtract function


This function returns the element-wise difference of two arrays. If the shapes are
incompatible (they cannot be broadcast together), NumPy raises a ValueError.

import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
np.subtract(a, b)

Output

array([ 7, 16, 95, 194, 493])

NumPy Subtract Operator
We can also use the subtract operator “-” to produce the difference of two arrays.

import numpy as np
a = np.array([10,20,100,200,500])
b = np.array([3,4,5,6,7])
print(a-b)

Output

[ 7 16 95 194 493]

NumPy Multiply function


This function returns the element-wise product of two arrays. The shapes must be identical
or compatible under broadcasting.

import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.multiply(a, b)

Output

array([21, 12, 20, 30, 7])


NumPy Multiply Operator
We can also use the multiplication operator “*” to get the product of two arrays.

import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
print(a*b)

Output

[21 12 20 30 7]

NumPy Divide Function

This function divides the first array by the second, element by element. The shapes must
be identical or compatible under broadcasting.

import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.divide(a,b)

Output

array([2.33333333, 0.75 , 0.8 , 0.83333333, 0.14285714])

NumPy Divide Operator

We can also use the divide operator “/” to divide two arrays.

import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
print(a/b)
Output

[2.33333333 0.75 0.8 0.83333333 0.14285714]

NumPy Mod and Remainder function

Both functions return the element-wise remainder of the division of two arrays; np.mod is an alias of np.remainder.

NumPy Remainder Function

import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.remainder(a,b)

Output

array([1, 3, 4, 5, 1])
NumPy Mod Function
import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.mod(a,b)

Output

array([1, 3, 4, 5, 1])
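
With positive operands, as above, all remainder conventions agree. They differ for negative values: np.remainder (and np.mod) give a result with the sign of the divisor, like Python's % operator, while np.fmod follows the sign of the dividend. A sketch with illustrative values:

```python
import numpy as np

a = np.array([-7, 7])
b = np.array([3, -3])

print(np.remainder(a, b))   # [ 2 -2]  sign follows the divisor
print(np.fmod(a, b))        # [-1  1]  sign follows the dividend
```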

NumPy Power Function

This function treats each element of the first array as a base and raises it to the power
of the corresponding element of the second array.

import numpy as np
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
np.power(a,b)

Output

array([ 343, 81, 1024, 15625, 1])

NumPy Reciprocal Function

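This function returns the reciprocal of each array element. For an integer array the result is computed with integer arithmetic, so every element with magnitude greater than 1 gives 0, as in the output below.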

import numpy as np
a = np.array([7,3,4,5,1])
np.reciprocal(a)

Output
array([0, 0, 0, 0, 1])
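
The zeros above come from integer arithmetic. A sketch of the usual workaround, converting the array to a floating-point dtype first (or using true division):

```python
import numpy as np

a = np.array([7, 3, 4, 5, 1])

as_float = np.reciprocal(a.astype(float))   # reciprocal of float values
by_division = 1 / a                         # true division returns floats

print(as_float)
print(by_division)
```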
UNIT- 3 LECTURE-4

NumPy - Indexing & Slicing

Contents of an ndarray object can be accessed and modified by indexing or slicing, just like
Python's built-in container objects.
Items in an ndarray object follow a zero-based index. Three types of
indexing methods are available − field access, basic slicing and advanced indexing.
Basic slicing is an extension of Python's basic concept of slicing to n dimensions. A
Python slice object is constructed by giving start, stop, and step parameters to the built-in
slice function. This slice object is passed to the array to extract a part of the array.

Example 1

import numpy as np
a = np.arange(10)
s = slice(2,7,2)
print(a[s])
Its output is as follows −

[2 4 6]
In the above example, an ndarray object is prepared by arange() function. Then a slice
object is defined with start, stop, and step values 2, 7, and 2 respectively. When this slice
object is passed to the ndarray, a part of it starting with index 2 up to 7 with a step of 2 is
sliced.

The same result can also be obtained by giving the slicing parameters separated by a
colon : (start:stop:step) directly to the ndarray object.

Example 2
import numpy as np
a = np.arange(10)
b = a[2:7:2]
print(b)
Here, we will get the same output −

[2 4 6]
If only one parameter is given, the single item at that index is returned. If
a : is placed after it, all items from that index onwards are extracted. If two
parameters (with : between them) are used, the items between the two indexes (not including
the stop index) are sliced with the default step of one.

Example 3
# slice single item
import numpy as np

a = np.arange(10)
b = a[5]
print(b)
Its output is as follows −

5
Example 4
# slice items starting from index
import numpy as np
a = np.arange(10)
print(a[2:])
Now, the output would be −

[2 3 4 5 6 7 8 9]
Example 5
# slice items between indexes
import numpy as np
a = np.arange(10)
print(a[2:5])
Here, the output would be −

[2 3 4]
The above description applies to multi-dimensional ndarray too.

Example 6
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a)

# slice items starting from index 1
print('Now we will slice the array from the index a[1:]')
print(a[1:])
The output is as follows −

[[1 2 3]
[3 4 5]
[4 5 6]]

Now we will slice the array from the index a[1:]


[[3 4 5]
[4 5 6]]
Slicing can also include an ellipsis (...) to make a selection tuple of the same length as the
number of dimensions of the array. If the ellipsis is used at the row position, the result is
an ndarray comprising the items selected from every row.

Example 7
# array to begin with
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])

print('Our array is:')
print(a)
print('\n')

# this returns an array of the items in the second column
print('The items in the second column are:')
print(a[...,1])
print('\n')

# now we will slice all items from the second row
print('The items in the second row are:')
print(a[1,...])
print('\n')

# now we will slice all items from column 1 onwards
print('The items column 1 onwards are:')
print(a[...,1:])
The output of this program is as follows −

Our array is:


[[1 2 3]
[3 4 5]
[4 5 6]]

The items in the second column are:


[2 4 5]

The items in the second row are:


[3 4 5]

The items column 1 onwards are:


[[2 3]
[4 5]
[5 6]]
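
For a two-dimensional array the ellipsis is interchangeable with a plain colon, and row and column slices can be combined in one index. A brief sketch using the same array; the name lower_left is illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

# a[..., 1] and a[:, 1] select the same column.
print(np.array_equal(a[..., 1], a[:, 1]))   # True

# Rows 1 onwards, first two columns.
lower_left = a[1:, :2]
print(lower_left)
```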
UNIT- 3 LECTURE-5

Boolean Indexing

Let’s consider an example where we have some data in an array and an array of names
with duplicates. The randn function in numpy.random is used here to generate
some random normally distributed data:

In [98]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [99]: data = np.random.randn(7, 4)

In [100]: names
Out[100]: array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],
dtype='<U4')

In [101]: data
Out[101]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 1.669 , -0.4386, -0.5397, 0.477 ],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])

Suppose each name corresponds to a row in the data array and we wanted to select all the
rows with corresponding name 'Bob'. Like arithmetic operations, comparisons (such
as ==) with arrays are also vectorized. Thus, comparing names with the
string 'Bob' yields a boolean array:

In [102]: names == 'Bob'


Out[102]: array([ True, False, False, True, False, False, False])

This boolean array can be passed when indexing the array:

In [103]: data[names == 'Bob']


Out[103]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.669 , -0.4386, -0.5397, 0.477 ]])

The boolean array must be of the same length as the array axis it’s indexing. You can
even mix and match boolean arrays with slices or integers (or sequences of integers; more
on this later).

In these examples, I select from the rows where names == 'Bob' and index the columns,
too:

In [104]: data[names == 'Bob', 2:]


Out[104]:
array([[ 0.769 , 1.2464],
[-0.5397, 0.477 ]])

In [105]: data[names == 'Bob', 3]


Out[105]: array([1.2464, 0.477 ])

To select everything but 'Bob', you can either use != or negate the condition using ~:

In [106]: names != 'Bob'


Out[106]: array([False, True, True, False, True, True, True])

In [107]: data[~(names == 'Bob')]


Out[107]:
array([[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
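
Conditions can also be combined with & (and) and | (or); the Python keywords and and or do not work on boolean arrays. A boolean mask can appear on the left side of an assignment as well. The sketch below uses a small fixed array in place of the random data so that the results are reproducible.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)      # stand-in for the random data

# Select rows whose name is either 'Bob' or 'Will'.
mask = (names == 'Bob') | (names == 'Will')
print(mask)
print(data[mask])                       # rows 0, 2, 3 and 4

# Assign through a boolean mask: set every 'Joe' row to zero.
data[names == 'Joe'] = 0
print(data.sum(axis=1))
```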
UNIT- 3 LECTURE-6

Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on
data in ndarrays. You can think of them as fast vectorized wrappers for simple functions
that take one or more scalar values and produce one or more scalar results.

Many ufuncs are simple element-wise transformations, like sqrt or exp:

In [137]: arr = np.arange(10)

In [138]: arr
Out[138]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [139]: np.sqrt(arr)
Out[139]:
array([0. , 1. , 1.4142, 1.7321, 2. , 2.2361, 2.4495,
2.6458,
2.8284, 3. ])

In [140]: np.exp(arr)
Out[140]:
array([ 1. , 2.7183, 7.3891, 20.0855, 54.5982,
148.4132,
403.4288, 1096.6332, 2980.958 , 8103.0839])

These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays
(thus, binary ufuncs) and return a single array as the result:

In [141]: x = np.random.randn(8)

In [142]: y = np.random.randn(8)
In [143]: x
Out[143]:
array([-0.0119, 1.0048, 1.3272, -0.9193, -1.5491, 0.0222, 0.7584,
-0.6605])

In [144]: y
Out[144]:
array([ 0.8626, -0.01 , 0.05 , 0.6702, 0.853 , -0.9559, -0.0235,
-2.3042])

In [145]: np.maximum(x, y)
Out[145]:
array([ 0.8626, 1.0048, 1.3272, 0.6702, 0.853 , 0.0222, 0.7584,
-0.6605])

Here, numpy.maximum computed the element-wise maximum of the elements in x and y.


While not common, a ufunc can return multiple arrays. modf is one example, a vectorized
version of math.modf from the Python standard library; it returns the fractional and
integral parts of a floating-point array:

In [146]: arr = np.random.randn(7) * 5

In [147]: arr
Out[147]: array([-3.2623, -6.0915, -6.663 , 5.3731, 3.6182, 3.45 , 5.0077])

In [148]: remainder, whole_part = np.modf(arr)

In [149]: remainder
Out[149]: array([-0.2623, -0.0915, -0.663 , 0.3731, 0.6182, 0.45 , 0.0077])

In [150]: whole_part
Out[150]: array([-3., -6., -6., 5., 3., 3., 5.])

Ufuncs accept an optional out argument that allows them to operate in-place on arrays:

In [151]: arr
Out[151]: array([-3.2623, -6.0915, -6.663 , 5.3731, 3.6182, 3.45 , 5.0077])

In [152]: np.sqrt(arr)
Out[152]: array([ nan, nan, nan, 2.318 , 1.9022, 1.8574,
2.2378])

In [153]: np.sqrt(arr, arr)


Out[153]: array([ nan, nan, nan, 2.318 , 1.9022, 1.8574,
2.2378])

In [154]: arr
Out[154]: array([ nan, nan, nan, 2.318 , 1.9022, 1.8574,
2.2378])
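
The nan values above arise because arr contains negative numbers (NumPy also emits a RuntimeWarning for them). A sketch of the out argument on non-negative, made-up data, which avoids that distraction:

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

# Write the square roots back into the same array instead of a new one.
result = np.sqrt(values, out=values)

print(values)             # [1. 2. 3. 4.]
print(result is values)   # True
```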
UNIT- 3 LECTURE-7

Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the
data along an axis are accessible as methods of the array class. You can use aggregations
(often called reductions) like sum, mean, and std (standard deviation) either by calling
the array instance method or using the top-level NumPy function.

Here I generate some normally distributed random data and compute some aggregate
statistics:

In [177]: arr = np.random.randn(5, 4)

In [178]: arr
Out[178]:
array([[ 2.1695, -0.1149, 2.0037, 0.0296],
[ 0.7953, 0.1181, -0.7485, 0.585 ],
[ 0.1527, -1.5657, -0.5625, -0.0327],
[-0.929 , -0.4826, -0.0363, 1.0954],
[ 0.9809, -0.5895, 1.5817, -0.5287]])

In [179]: arr.mean()
Out[179]: 0.19607051119998253

In [180]: np.mean(arr)
Out[180]: 0.19607051119998253

In [181]: arr.sum()
Out[181]: 3.9214102239996507

Functions like mean and sum take an optional axis argument that computes the statistic
over the given axis, resulting in an array with one fewer dimension:

In [182]: arr.mean(axis=1)
Out[182]: array([ 1.022 , 0.1875, -0.502 , -0.0881, 0.3611])

In [183]: arr.sum(axis=0)
Out[183]: array([ 3.1693, -2.6345, 2.2381, 1.1486])

Here, arr.mean(1) means “compute mean across the columns” whereas arr.sum(0) means
“compute sum down the rows.”
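
The axis argument is easier to follow on a small array whose sums can be checked by hand. A sketch with a 2 x 3 array of made-up values:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.sum(axis=0))    # [5 7 9]   one value per column
print(arr.sum(axis=1))    # [ 6 15]   one value per row
print(arr.mean(axis=1))   # [2. 5.]
```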

Other methods like cumsum and cumprod do not aggregate, instead producing an array of
the intermediate results:

In [184]: arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])

In [185]: arr.cumsum()
Out[185]: array([ 0, 1, 3, 6, 10, 15, 21, 28])

In multidimensional arrays, accumulation functions like cumsum return an array of the
same size, but with the partial aggregates computed along the indicated axis according to
each lower dimensional slice:

In [186]: arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

In [187]: arr
Out[187]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])

In [188]: arr.cumsum(axis=0)
Out[188]:
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15]])

In [189]: arr.cumprod(axis=1)
Out[189]:
array([[ 0, 0, 0],
[ 3, 12, 60],
[ 6, 42, 336]])
