Unit-03: Capturing, Preparing and Working With Data


Python for Data Science (PDS) (3150713)

Unit-03
Capturing, Preparing
and Working with data
Topics
• Looping
• Basic File IO in Python
• NumPy V/S Pandas (what to use?)
• NumPy
• Pandas
• Accessing text, CSV, Excel files using pandas
• Accessing SQL Database
• Web Scraping using BeautifulSoup

Unit-03.02
Let's Learn
Pandas
Pandas
• Pandas is an open-source library built on top of NumPy.
• It allows for fast data cleaning, preparation and analysis.
• It excels in performance and productivity.
• It also has built-in visualization features.
• It can work with data from a wide variety of sources.
• Install:
    conda install pandas
  OR
    pip install pandas
Topics
• Series
• Data Frames
• Accessing text, CSV, Excel files using pandas
• Accessing SQL Database
• Missing Data
• Group By
• Merging, Joining & Concatenating
• Operations
Series
• A Series is a one-dimensional array with axis labels.
• It supports both integer- and label-based indexing, but the index must be of a hashable type.
• If we do not specify an index, it will assign an integer zero-based index.

Syntax:
    import pandas as pd
    s = pd.Series(data, index, dtype, copy=False)

Parameters:
    data  = array-like iterable
    index = array-like index
    dtype = data type
    copy  = bool, default is False

pandasSeries.py
    import pandas as pd
    s = pd.Series([1, 3, 5, 7, 9, 11])
    print(s)

Output
    0     1
    1     3
    2     5
    3     7
    4     9
    5    11
    dtype: int64
Series (Cont.)
• We can access the elements inside a Series just like an array, using square-bracket notation.

pdSeriesEle.py
    import pandas as pd
    s = pd.Series([1, 3, 5, 7, 9, 11])
    print("S[0] = ", s[0])
    b = s[0] + s[1]
    print("Sum = ", b)

Output
    S[0] =  1
    Sum =  4

• We can specify the data type of a Series using the dtype parameter.

pdSeriesdtype.py
    import pandas as pd
    s = pd.Series([1, 3, 5, 7, 9, 11], dtype='str')
    print("S[0] = ", s[0])
    b = s[0] + s[1]
    print("Sum = ", b)

Output
    S[0] =  1
    Sum =  13

Note: with dtype='str' the elements are strings, so s[0] + s[1] concatenates '1' and '3', giving '13'.
Series (Cont.)
• We can specify an index for a Series with the help of the index parameter.

pdSeriesIndex.py
    import pandas as pd
    i = ['name', 'address', 'phone', 'email', 'website']
    d = ['dahod', 'dh', '123', '[email protected]', 'gecdahod.ac.in']
    s = pd.Series(data=d, index=i)
    print(s)

Output
    name                  dahod
    address                  dh
    phone                   123
    email      [email protected]
    website      gecdahod.ac.in
    dtype: object
Creating Time Series
• We can use some of pandas' built-in date functions to create a time series.

pdTimeSeries.py
    import numpy as np
    import pandas as pd
    dates = pd.to_datetime("27th of July, 2020")
    i = dates + pd.to_timedelta(np.arange(5), unit='D')
    d = [50, 53, 25, 70, 60]
    time_series = pd.Series(data=d, index=i)
    print(time_series)

Output
    2020-07-27    50
    2020-07-28    53
    2020-07-29    25
    2020-07-30    70
    2020-07-31    60
    dtype: int64
Data Frames
• A Data Frame is a two-dimensional data structure, i.e. data is aligned in a tabular format in rows and columns.
• A Data Frame also contains labelled axes on rows and columns.
• Features of a Data Frame:
    ◦ It is size-mutable
    ◦ It has labelled axes
    ◦ Columns can be of different data types
    ◦ We can perform arithmetic operations on rows and columns
• Structure:

          PDS   Algo   SE   INS
    101
    102
    103
    ….
    160
Data Frames (Cont.)
• Syntax:
    import pandas as pd
    df = pd.DataFrame(data, index, columns, dtype, copy=False)

Parameters:
    data    = array-like iterable
    index   = array-like row index
    columns = array-like column index
    dtype   = data type
    copy    = bool, default is False

• Example:

pdDataFrame.py
    import numpy as np
    import pandas as pd
    randArr = np.random.randint(0, 100, 20).reshape(5, 4)
    df = pd.DataFrame(randArr, np.arange(101, 106, 1), ['PDS', 'Algo', 'SE', 'INS'])
    print(df)

Output
         PDS  Algo  SE  INS
    101    0    23  93   46
    102   85    47  31   12
    103   35    34   6   89
    104   66    83  70   50
    105   65    88  87   87
Data Frames (Cont.)
• Grabbing a column

dfGrabCol.py
    import numpy as np
    import pandas as pd
    randArr = np.random.randint(0, 100, 20).reshape(5, 4)
    df = pd.DataFrame(randArr, np.arange(101, 106, 1), ['PDS', 'Algo', 'SE', 'INS'])
    print(df['PDS'])

Output
    101     0
    102    85
    103    35
    104    66
    105    65
    Name: PDS, dtype: int32

• Grabbing multiple columns

dfGrabMulCol.py
    print(df[['PDS', 'SE']])   # note the double brackets: we pass a list of column names

Output
         PDS  SE
    101    0  93
    102   85  31
    103   35   6
    104   66  70
    105   65  87
Data Frames (Cont.)
• Grabbing a row

dfGrabRow.py
    print(df.loc[101])   # using labels
    # OR
    print(df.iloc[0])    # using zero-based index

Output
    PDS      0
    Algo    23
    SE      93
    INS     46
    Name: 101, dtype: int32

• Grabbing a single value

dfGrabSingle.py
    print(df.loc[101, 'PDS'])   # using labels

Output
    0

• Deleting a row

dfDelRow.py
    df.drop(103, inplace=True)   # 103 is an integer label here, not the string '103'
    print(df)

Output
         PDS  Algo  SE  INS
    101    0    23  93   46
    102   85    47  31   12
    104   66    83  70   50
    105   65    88  87   87
Data Frames (Cont.)
• Creating a new column

dfCreateCol.py
    df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']
    print(df)

Output
         PDS  Algo  SE  INS  total
    101    0    23  93   46    162
    102   85    47  31   12    175
    103   35    34   6   89    164
    104   66    83  70   50    269
    105   65    88  87   87    327

• Deleting a column

dfDelCol.py
    df.drop('total', axis=1, inplace=True)
    print(df)

Output
         PDS  Algo  SE  INS
    101    0    23  93   46
    102   85    47  31   12
    103   35    34   6   89
    104   66    83  70   50
    105   65    88  87   87
Data Frames (Cont.)
• Getting a subset of a Data Frame

dfGrabSubSet.py
    print(df.loc[[101, 104], ['PDS', 'INS']])

Output
         PDS  INS
    101    0   46
    104   66   50

• Selecting all columns except one

dfGrabExcept.py
    print(df.loc[:, df.columns != 'Algo'])

Output
         PDS  SE  INS
    101    0  93   46
    102   85  31   12
    103   35   6   89
    104   66  70   50
    105   65  87   87
Conditional Selection
• Similar to NumPy, we can do conditional selection in pandas.

dfCondSel.py
    import numpy as np
    import pandas as pd
    np.random.seed(121)
    randArr = np.random.randint(0, 100, 20).reshape(5, 4)
    df = pd.DataFrame(randArr, np.arange(101, 106, 1), ['PDS', 'Algo', 'SE', 'INS'])
    print(df)
    print(df > 50)

Output
         PDS  Algo  SE  INS
    101   66    85   8   95
    102   65    52  83   96
    103   46    34  52   60
    104   54     3  94   52
    105   57    75  88   39
           PDS   Algo     SE    INS
    101   True   True  False   True
    102   True   True   True   True
    103  False  False   True   True
    104   True  False   True   True
    105   True   True   True  False

• Note: we have used the np.random.seed() method with the seed 121, so that the random numbers you generate match the ones generated here.
Conditional Selection (Cont.)
• We can then use this boolean DataFrame to get the associated values.

dfCondSel.py
    dfBool = df > 50
    print(df[dfBool])

Output
         PDS  Algo  SE   INS
    101   66    85  NaN   95
    102   65    52   83   96
    103  NaN   NaN   52   60
    104   54   NaN   94   52
    105   57    75   88  NaN

• Note: it will set NaN (Not a Number) wherever the condition is False.

• We can also apply a condition on a specific column.

dfCondSel.py
    dfBool = df['PDS'] > 50
    print(df[dfBool])

Output
         PDS  Algo  SE  INS
    101   66    85   8   95
    102   65    52  83   96
    104   54     3  94   52
    105   57    75  88   39
Setting/Resetting index
• In our previous examples the index does not have a name; if we want to name our index, we can set it through the DataFrame.index.name property.

dfIndexName.py
    df.index.name = 'RollNo'   # index.name is a property, not a method
    print(df)

Output
             PDS  Algo  SE  INS
    RollNo
    101       66    85   8   95
    102       65    52  83   96
    103       46    34  52   60
    104       54     3  94   52
    105       57    75  88   39

    Note: our index now has a name.

• We can use pandas' built-in methods to set or reset the index:
    ◦ df.set_index('NewColumn', inplace=True) will set a new column as the index.
    ◦ df.reset_index() will reset the index to a zero-based numeric index.
Setting/Resetting index (Cont.)
• set_index(new_index)

dfSetIndex.py
    df.set_index('PDS')   # pass inplace=True to modify df itself

Output
         Algo  SE  INS
    PDS
    66     85   8   95
    65     52  83   96
    46     34  52   60
    54      3  94   52
    57     75  88   39

    Note: PDS is our index now.

• reset_index()

dfResetIndex.py
    df.reset_index()

Output
       RollNo  PDS  Algo  SE  INS
    0     101   66    85   8   95
    1     102   65    52  83   96
    2     103   46    34  52   60
    3     104   54     3  94   52
    4     105   57    75  88   39

    Note: RollNo (the old index) becomes a new column, and we now have a zero-based numeric index.
Multi-Index DataFrame
• Hierarchical indexes (a.k.a. multi-indexes) help us to organize, find, and aggregate information faster, at almost no cost.
• Example where we need hierarchical indexes:

Numeric/Single Index
       Col    Dep  Sem   RN  S1  S2  S3
    0  ABC    CE     5  101  50  60  70
    1  ABC    CE     5  102  48  70  25
    2  ABC    CE     7  101  58  59  51
    3  ABC    ME     5  101  30  35  39
    4  ABC    ME     5  102  50  90  48
    5  Dahod  CE     5  101  88  99  77
    6  Dahod  CE     5  102  99  84  76
    7  Dahod  CE     7  101  88  77  99
    8  Dahod  ME     5  101  44  88  99

Multi-Index
                    RN  S1  S2  S3
    Col   Dep Sem
    ABC   CE  5    101  50  60  70
              5    102  48  70  25
              7    101  58  59  51
          ME  5    101  30  35  39
              5    102  50  90  48
    Dahod CE  5    101  88  99  77
              5    102  99  84  76
              7    101  88  77  99
          ME  5    101  44  88  99
Multi-Index DataFrame (Cont.)
• Creating a multi-index is as simple as creating a single index using the set_index method; the only difference is that for a multi-index we need to provide a list of indexes instead of a single string index. Let's see an example:

dfMultiIndex.py
    dfMulti = pd.read_csv('MultiIndexDemo.csv')
    dfMulti.set_index(['Col', 'Dep', 'Sem'], inplace=True)
    print(dfMulti)

Output
                    RN  S1  S2  S3
    Col   Dep Sem
    ABC   CE  5    101  50  60  70
              5    102  48  70  25
              7    101  58  59  51
          ME  5    101  30  35  39
              5    102  50  90  48
    Dahod CE  5    101  88  99  77
              5    102  99  84  76
              7    101  88  77  99
          ME  5    101  44  88  99
Multi-Index DataFrame (Cont.)
• Now that we have a multi-indexed DataFrame, we can access data using multiple indexes.
• For example, a sub-DataFrame for all the students of Dahod:

dfGrabDahodStu.py
    print(dfMulti.loc['Dahod'])

Output (Dahod)
             RN  S1  S2  S3
    Dep Sem
    CE  5   101  88  99  77
        5   102  99  84  76
        7   101  88  77  99
    ME  5   101  44  88  99

• A sub-DataFrame for the Computer Engineering students from Dahod:

dfGrabDahodCEStu.py
    print(dfMulti.loc['Dahod', 'CE'])

Output (Dahod > CE)
         RN  S1  S2  S3
    Sem
    5   101  88  99  77
    5   102  99  84  76
    7   101  88  77  99
Reading in a Multi-Indexed DataFrame Directly from CSV
• The read_csv function of pandas provides an easy way to create a multi-indexed DataFrame directly while fetching the CSV file.

dfMultiIndexCSV.py
    dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0, 1, 2])
    # for a multi-index on columns we can use the header parameter
    print(dfMultiCSV)

Output
                    RN  S1  S2  S3
    Col   Dep Sem
    ABC   CE  5    101  50  60  70
              5    102  48  70  25
              7    101  58  59  51
          ME  5    101  30  35  39
              5    102  50  90  48
    Dahod CE  5    101  88  99  77
              5    102  99  84  76
              7    101  88  77  99
          ME  5    101  44  88  99
Cross Sections in DataFrame
• The xs() function is used to get a cross-section from the Series/DataFrame.
• This method takes a key argument to select data at a particular level of a MultiIndex.
• Syntax:

    DataFrame.xs(key, axis=0, level=None, drop_level=True)

Parameters:
    key        : label
    axis       : axis to retrieve the cross-section from
    level      : level of the key
    drop_level : False if you want to preserve the level

• Example:

dfMultiIndex.py
    dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0, 1, 2])
    print(dfMultiCSV)
    print(dfMultiCSV.xs('CE', axis=0, level='Dep'))

Output
                    RN  S1  S2  S3
    Col   Dep Sem
    ABC   CE  5    101  50  60  70
              5    102  48  70  25
              7    101  58  59  51
          ME  5    101  30  35  39
              5    102  50  90  48
    Dahod CE  5    101  88  99  77
              5    102  99  84  76
              7    101  88  77  99
          ME  5    101  44  88  99
               RN  S1  S2  S3
    Col   Sem
    ABC   5   101  50  60  70
          5   102  48  70  25
          7   101  58  59  51
    Dahod 5   101  88  99  77
          5   102  99  84  76
          7   101  88  77  99
Dealing with Missing Data
• There are many methods by which we can deal with missing data; some of the most common are listed below:
    ◦ dropna will drop (delete) the missing data (rows/columns).
    ◦ fillna will fill specified values in place of the missing data.
    ◦ interpolate will interpolate the missing data and fill the interpolated value in its place.
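As a quick illustration of the three methods (a minimal sketch using a made-up marks table; the column names and values are hypothetical, not from the slides' data files):

```python
import numpy as np
import pandas as pd

# A small DataFrame with missing marks (hypothetical data)
df = pd.DataFrame({'PDS': [50, np.nan, 70], 'Algo': [55, 80, np.nan]},
                  index=[101, 102, 103])

dropped = df.dropna()             # keeps only rows with no NaN (row 101)
filled = df.fillna(0)             # replaces every NaN with 0
interp = df['PDS'].interpolate()  # fills the NaN in PDS with 60.0 (linear)
print(dropped)
print(filled)
print(interp)
```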
Groupby in Pandas
• Any groupby operation involves one of the following operations on the original object:
    ◦ Splitting the object
    ◦ Applying a function
    ◦ Combining the results
• In many situations, we split the data into sets and apply some functionality on each subset.
• We can perform the following operations:
    ◦ Aggregation − computing a summary statistic
    ◦ Transformation − performing some group-specific operation
    ◦ Filtration − discarding the data based on some condition
• Basic ways to use the groupby method:
    ◦ df.groupby('key')
    ◦ df.groupby(['key1', 'key2'])
    ◦ df.groupby(key, axis=1)

• Example data, and the result of grouping by College and taking the mean CPI:

    College  Enno  CPI          College  Mean CPI
    Dahod    123   8.9          Dahod    8.65
    Dahod    124   9.2          ABC      4.8
    Dahod    125   7.8          XYZ      5.83
    Dahod    128   8.7
    ABC      211   5.6
    ABC      212   6.2
    ABC      215   3.2
    ABC      218   4.2
    XYZ      312   5.2
    XYZ      315   6.5
    XYZ      315   5.8
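The aggregation case is covered by the examples that follow; as a hedged sketch, the transformation and filtration operations look like this on a small made-up CPI table (the data mirrors the table above, since the slides' CSV files are not included here):

```python
import pandas as pd

# Hypothetical college/CPI data, mirroring the table on this slide
df = pd.DataFrame({'College': ['Dahod', 'Dahod', 'ABC', 'ABC', 'XYZ'],
                   'CPI': [8.9, 9.2, 5.6, 6.2, 5.2]})

# Transformation: a group-specific operation, here centering each CPI on its college mean
centered = df.groupby('College')['CPI'].transform(lambda x: x - x.mean())

# Filtration: keep only the colleges whose mean CPI is above 6
good = df.groupby('College').filter(lambda g: g['CPI'].mean() > 6)
print(centered)
print(good)
```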
Groupby in Pandas (Cont.)
• Example: Listing all the groups

dfGroup.py
    dfIPL = pd.read_csv('IPLDataSet.csv')
    print(dfIPL.groupby('Year').groups)

Output
    {2014: Int64Index([0, 2, 4, 9], dtype='int64'),
     2015: Int64Index([1, 3, 5, 10], dtype='int64'),
     2016: Int64Index([6, 8], dtype='int64'),
     2017: Int64Index([7, 11], dtype='int64')}
Groupby in Pandas (Cont.)
• Example: Group by multiple columns

dfGroupMul.py
    dfIPL = pd.read_csv('IPLDataSet.csv')
    print(dfIPL.groupby(['Year', 'Team']).groups)

Output
    {(2014, 'Devils'): Int64Index([2], dtype='int64'),
     (2014, 'Kings'): Int64Index([4], dtype='int64'),
     (2014, 'Riders'): Int64Index([0], dtype='int64'),
     ………
     (2016, 'Riders'): Int64Index([8], dtype='int64'),
     (2017, 'Kings'): Int64Index([7], dtype='int64'),
     (2017, 'Riders'): Int64Index([11], dtype='int64')}
Groupby in Pandas (Cont.)
• Example: Iterating through groups

dfGroupIter.py
    dfIPL = pd.read_csv('IPLDataSet.csv')
    groupIPL = dfIPL.groupby('Year')
    for name, group in groupIPL:
        print(name)
        print(group)

Output
    2014
         Team  Rank  Year  Points
    0  Riders     1  2014     876
    2  Devils     2  2014     863
    4   Kings     3  2014     741
    9  Royals     4  2014     701
    2015
          Team  Rank  Year  Points
    1   Riders     2  2015     789
    3   Devils     3  2015     673
    5    kings     4  2015     812
    10  Royals     1  2015     804
    2016
         Team  Rank  Year  Points
    6   Kings     1  2016     756
    8  Riders     2  2016     694
    2017
          Team  Rank  Year  Points
    7    Kings     1  2017     788
    11  Riders     2  2017     690
Groupby in Pandas (Cont.)
• Example: Aggregating groups

dfGroupAgg.py
    dfSales = pd.read_csv('SalesDataSet.csv')
    print(dfSales.groupby(['YEAR_ID']).count()['QUANTITYORDERED'])
    print(dfSales.groupby(['YEAR_ID']).sum()['QUANTITYORDERED'])
    print(dfSales.groupby(['YEAR_ID']).mean()['QUANTITYORDERED'])

Output
    YEAR_ID
    2003    1000
    2004    1345
    2005     478
    Name: QUANTITYORDERED, dtype: int64
    YEAR_ID
    2003    34612
    2004    46824
    2005    17631
    Name: QUANTITYORDERED, dtype: int64
    YEAR_ID
    2003    34.612000
    2004    34.813383
    2005    36.884937
    Name: QUANTITYORDERED, dtype: float64
Groupby in Pandas (Cont.)
• Example: Describe details

dfGroupDesc.py
    dfIPL = pd.read_csv('IPLDataSet.csv')
    print(dfIPL.groupby('Year').describe()['Points'])

Output
          count    mean        std    min    25%    50%     75%    max
    Year
    2014    4.0  795.25  87.439026  701.0  731.0  802.0  866.25  876.0
    2015    4.0  769.50  65.035888  673.0  760.0  796.5  806.00  812.0
    2016    2.0  725.00  43.840620  694.0  709.5  725.0  740.50  756.0
    2017    2.0  739.00  69.296465  690.0  714.5  739.0  763.50  788.0
Concatenation in Pandas
• Concatenation basically glues DataFrames together.
• Keep in mind that dimensions should match along the axis you are concatenating on.
• You can use pd.concat and pass in a list of DataFrames to concatenate together:

dfConcat.py
    dfCX = pd.read_csv('CX_Marks.csv', index_col=0)
    dfCY = pd.read_csv('CY_Marks.csv', index_col=0)
    dfCZ = pd.read_csv('CZ_Marks.csv', index_col=0)
    dfAllStudent = pd.concat([dfCX, dfCY, dfCZ])
    print(dfAllStudent)

Output
         PDS  Algo  SE
    101   50    55  60
    102   70    80  61
    103   55    89  70
    104   58    96  85
    201   77    96  63
    202   44    78  32
    203   55    85  21
    204   69    66  54
    301   11    75  88
    302   22    48  77
    303   33    59  68
    304   44    55  62

• Note: we can use the axis=1 parameter to concatenate columns side by side.
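To illustrate the axis=1 note with made-up frames (the marks CSVs from the slide are not included here): with axis=1 the frames are glued side by side and their rows are aligned on the index:

```python
import pandas as pd

# Two hypothetical single-column frames sharing the same roll-number index
dfMarks = pd.DataFrame({'PDS': [50, 70]}, index=[101, 102])
dfMore = pd.DataFrame({'Algo': [55, 80]}, index=[101, 102])

# axis=1 concatenates columns, aligning rows on the index
dfWide = pd.concat([dfMarks, dfMore], axis=1)
print(dfWide)
```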
Join in Pandas
• The df.join() method will efficiently join multiple DataFrame objects by index (or by a specified column).
• Some of the important parameters:
    ◦ dfOther : the right Data Frame
    ◦ on (not recommended) : the column on which we want to join (the default is the index)
    ◦ how : how to handle the operation of the two objects:
        ▪ left: use the calling frame's index (default).
        ▪ right: use dfOther's index.
        ▪ outer: form the union of the calling frame's index with the other's index (or column if on is specified), and sort it lexicographically.
        ▪ inner: form the intersection of the calling frame's index (or column if on is specified) with the other's index, preserving the order of the calling frame's.
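A minimal sketch of the outer and inner cases (hypothetical two-row frames, not the slides' CSV data):

```python
import pandas as pd

left = pd.DataFrame({'PDS': [50, 70]}, index=[101, 102])
right = pd.DataFrame({'INS': [66, 77]}, index=[102, 103])

# outer: union of the indexes (101, 102, 103), NaN where a side has no row
print(left.join(right, how='outer'))

# inner: intersection of the indexes (only 102)
print(left.join(right, how='inner'))
```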
Join in Pandas (Example)

dfJoin.py
    dfINS = pd.read_csv('INS_Marks.csv', index_col=0)
    dfLeftJoin = dfAllStudent.join(dfINS)
    print(dfLeftJoin)
    dfRightJoin = dfAllStudent.join(dfINS, how='right')
    print(dfRightJoin)

Output - 1 (left join)
         PDS  Algo  SE   INS
    101   50    55  60  55.0
    102   70    80  61  66.0
    103   55    89  70  77.0
    104   58    96  85  88.0
    201   77    96  63  66.0
    202   44    78  32   NaN
    203   55    85  21  78.0
    204   69    66  54  85.0
    301   11    75  88  11.0
    302   22    48  77  22.0
    303   33    59  68  33.0
    304   44    55  62  44.0

Output - 2 (right join)
         PDS  Algo  SE  INS
    301   11    75  88   11
    302   22    48  77   22
    303   33    59  68   33
    304   44    55  62   44
    101   50    55  60   55
    102   70    80  61   66
    103   55    89  70   77
    104   58    96  85   88
    201   77    96  63   66
    203   55    85  21   78
    204   69    66  54   85
Merge in Pandas
• Merge DataFrame or named Series objects with a database-style join.
• Similar to the join method, but used when we want to join/merge on columns instead of the index.
• Some of the important parameters:
    ◦ dfOther : the right Data Frame
    ◦ on : the column on which we want to join (it must be present in both frames; by default the merge uses the intersection of the columns)
    ◦ left_on : the column of the left DataFrame
    ◦ right_on : the column of the right DataFrame
    ◦ how : how to handle the operation of the two objects:
        ▪ left: use only the keys from the left frame.
        ▪ right: use only the keys from the right frame.
        ▪ outer: use the union of the keys from both frames, and sort it lexicographically.
        ▪ inner: use the intersection of the keys from both frames, preserving the order of the left keys (default).
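When the key column is named differently on each side, left_on/right_on come into play; a small sketch with hypothetical frames and column names:

```python
import pandas as pd

students = pd.DataFrame({'EnNo': [11112222, 11113333], 'Name': ['Abc', 'Xyz']})
marks = pd.DataFrame({'EnrollmentNo': [11112222, 11113333], 'PDS': [50, 60]})

# The enrollment-number column has a different name in each frame
merged = students.merge(marks, left_on='EnNo', right_on='EnrollmentNo')
print(merged[['EnNo', 'Name', 'PDS']])
```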
Merge in Pandas (Example)

dfMerge.py
    m1 = pd.read_csv('Merge1.csv')
    print(m1)
    m2 = pd.read_csv('Merge2.csv')
    print(m2)
    m3 = m1.merge(m2, on='EnNo')
    print(m3)

Output
       RollNo      EnNo Name
    0     101  11112222  Abc
    1     102  11113333  Xyz
    2     103  22224444  Def

           EnNo  PDS  INS
    0  11112222   50   60
    1  11113333   60   70

       RollNo      EnNo Name  PDS  INS
    0     101  11112222  Abc   50   60
    1     102  11113333  Xyz   60   70
Read CSV in Pandas
• read_csv() is used to read a Comma Separated Values (CSV) file into a pandas DataFrame.
• Some of the important parameters:
    ◦ filePath : str, path object, or file-like object
    ◦ sep : separator (the default is a comma)
    ◦ header : row number(s) to use as the column names
    ◦ index_col : index column(s) of the data frame

readCSV.py
    dfINS = pd.read_csv('Marks.csv', index_col=0, header=0)
    print(dfINS)

Output
         PDS  Algo  SE   INS
    101   50    55  60  55.0
    102   70    80  61  66.0
    103   55    89  70  77.0
    104   58    96  85  88.0
    201   77    96  63  66.0
Read Excel in Pandas
• Read an Excel file into a pandas DataFrame.
• Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local filesystem or URL, with an option to read a single sheet or a list of sheets.
• Some of the important parameters:
    ◦ excelFile : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
    ◦ sheet_name : sheet number (integer) or the name of the sheet; can also be a list of sheets
    ◦ index_col : index column of the data frame
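A round-trip sketch of read_excel (the file name, sheet name and data are made up, and writing/reading .xlsx assumes the openpyxl engine is installed alongside pandas):

```python
import pandas as pd

# Write a small hypothetical frame to an .xlsx file, then read it back
df = pd.DataFrame({'PDS': [50, 70], 'Algo': [55, 80]}, index=[101, 102])
df.to_excel('Marks.xlsx', sheet_name='CX')

dfBack = pd.read_excel('Marks.xlsx', sheet_name='CX', index_col=0)
print(dfBack)
```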
Read from MySQL Database
• We need two libraries for this:
    conda install sqlalchemy
    conda install pymysql
• After installing both libraries, import create_engine from sqlalchemy and import pymysql.

importsForDB.py
    from sqlalchemy import create_engine
    import pymysql

• Then create a database connection string and create an engine using it.

createEngine.py
    db_connection_str = 'mysql+pymysql://username:password@host/dbname'
    db_connection = create_engine(db_connection_str)
Read from MySQL Database (Cont.)
• After getting the engine, we can run any SQL query using the pd.read_sql method.
• read_sql is a generic method which can be used to read from any SQL database (MySQL, MSSQL, Oracle, etc.).

readSQLDemo.py
    df = pd.read_sql('SELECT * FROM cities', con=db_connection)
    print(df)

Output
       CityID   CityName             CityDescription CityCode
    0       1     Rajkot     Rajkot Description here      RJT
    1       2  Ahemdabad  Ahemdabad Description here      ADI
    2       3      Surat      Surat Description here      SRT
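Since a running MySQL server is not available here, the same pd.read_sql pattern can be sketched against an in-memory SQLite database from the standard library (the table name and rows below are made up):

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for the MySQL server
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cities (CityID INTEGER, CityName TEXT)')
conn.executemany('INSERT INTO cities VALUES (?, ?)',
                 [(1, 'Rajkot'), (2, 'Surat')])
conn.commit()

# read_sql works with any DBAPI connection that pandas supports (SQLite)
df = pd.read_sql('SELECT * FROM cities', con=conn)
print(df)
```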
Web Scraping using Beautiful Soup
• Beautiful Soup is a library that makes it easy to scrape information from web pages.
• It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

webScrap.py
    import requests
    import bs4
    req = requests.get('https://www.gecdahod.ac.in/Faculty')
    soup = bs4.BeautifulSoup(req.text, 'lxml')
    allFaculty = soup.select('body > main > section:nth-child(5) > div > div > div.col-lg-8.col-xl-9 > div > div')
    for fac in allFaculty:
        allSpans = fac.select('h2>a')
        print(allSpans[0].text.strip())

Output
    Dr. Gopi Sanghani
    Dr. Nilesh Gambhava
    Dr. Pradyumansinh Jadeja
    Prof. Hardik Doshi
    Prof. Maulik Trivedi
    Prof. Dixita Kagathara
    Prof. Firoz Sherasiya
    Prof. Rupesh Vaishnav
    Prof. Swati Sharma
    Prof. Arjun Bala
    Prof. Mayur Padia
    …..