Assignment-Data Preprocessing (All)

Lesson 2: Data Preprocessing Assignments
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
Assignment 1
Problem Statement
Suppose you are a public school administrator. Some schools in your state of Tennessee are performing below average academically. Your superintendent under pressure from frustrated parents and voters approached you with the task of understanding why these schools are under-
performing. To improve school performance, you need to learn more about these schools and their students, just as a business needs to understand its
Objective:
Perform exploratory data analysis which includes: determining the type of the data, correlation analysis over the same. You need to convert the data into useful information:
Read the data in pandas data frame
Describe the data to find more details
Find the correlation between ‘reduced_lunch’and‘school_rating’
In [4]:
df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/middle_tn_schools.csv")
In [5]:
df.head()
Out[5]: name school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic
0 Allendale Elementary School 5.0 851.0 10.0 90.2 95.8 15.7 Public 89.4 85.2 54.0 2.9 85.5 1.6 5.6
1 Anderson Elementary 2.0 412.0 71.0 32.8 37.3 12.8 Public 43.0 38.3 32.0 3.9 86.7 1.0 4.9
2 Avoca Elementary 4.0 482.0 43.0 78.4 83.6 16.6 Public 75.7 73.0 29.0 1.0 91.5 1.2 4.4
3 Bailey Middle 0.0 394.0 91.0 1.6 1.0 13.1 Public Magnet 2.1 4.4 30.0 80.7 11.7 2.3 4.3
4 Barfield Elementary 4.0 948.0 26.0 85.3 89.2 14.8 Public 81.3 79.6 64.0 11.8 71.2 7.1 6.0
In [11]:
print('Shape of data :', df.shape)
Shape of data : (347, 15)
In [12]:
print('Column Names are:',df.columns)
Column Names are: Index(['name', 'school_rating', 'size', 'reduced_lunch', 'state_percentile_16',
'state_percentile_15', 'stu_teach_ratio', 'school_type', 'avg_score_15',
'avg_score_16', 'full_time_teachers', 'percent_black', 'percent_white',
'percent_asian', 'percent_hispanic'],
dtype='object')
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 347 non-null object
1 school_rating 347 non-null float64
2 size 347 non-null float64
3 reduced_lunch 347 non-null float64
4 state_percentile_16 347 non-null float64
6 stu_teach_ratio 347 non-null float64
7 school_type 347 non-null object
8 avg_score_15 341 non-null float64
10 full_time_teachers 347 non-null float64
11 percent_black 347 non-null float64
12 percent_white 347 non-null float64
13 percent_asian 347 non-null float64
14 percent_hispanic 347 non-null float64
dtypes: float64(13), object(2)
memory usage: 40.8+ KB
In [14]:
df.isnull().sum()
Out[14]: name 0
school_rating 0
size 0
reduced_lunch 0
state_percentile_16 0
state_percentile_15 6
stu_teach_ratio 0
school_type 0
avg_score_15 6
avg_score_16 0
full_time_teachers 0
percent_black 0
percent_white 0
percent_asian 0
percent_hispanic 0
dtype: int64
In [15]:
df.school_type.unique()
Out[15]: array(['Public', 'Public Magnet', 'Public Charter', 'Public Virtual'],
dtype=object)
In [16]:
s_type={'Public':1, 'Public Magnet':2, 'Public Charter':3, 'Public Virtual':4}
In [18]:
df['school_type']=df['school_type'].map(s_type)
In [20]:
df['school_type']=df['school_type'].astype('float')
In [21]:
df.info()
--- ------ -------------- -----
0 name 347 non-null object
1 school_rating 347 non-null float64
2 size 347 non-null float64
3 reduced_lunch 347 non-null float64
6 stu_teach_ratio 347 non-null float64
7 school_type 347 non-null float64
10 full_time_teachers 347 non-null float64
11 percent_black 347 non-null float64
12 percent_white 347 non-null float64
13 percent_asian 347 non-null float64
14 percent_hispanic 347 non-null float64
dtypes: float64(14), object(1)
In [23]:
df.describe().round(decimals=2)
Out[23]: school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic
count 347.00 347.00 347.00 347.00 341.00 347.00 347.00 341.0 347.00 347.00 347.00 347.00 347.00 347.00
mean 2.97 699.47 50.28 58.80 58.25 15.46 1.19 57.0 57.05 44.94 21.20 61.67 2.64 11.16
std 1.69 400.60 25.48 32.54 32.70 5.73 0.47 26.7 27.97 22.05 23.56 27.27 3.11 12.03
min 0.00 53.00 2.00 0.20 0.60 4.70 1.00 1.5 0.10 2.00 0.00 1.10 0.00 0.00
25% 2.00 420.50 30.00 30.95 27.10 13.70 1.00 37.6 37.00 30.00 3.60 40.60 0.75 3.80
50% 3.00 595.00 51.00 66.40 65.80 15.00 1.00 61.8 60.70 40.00 13.50 68.70 1.60 6.40
75% 4.00 851.00 71.50 88.00 88.60 16.70 1.00 79.6 80.25 54.00 28.35 85.95 3.10 13.80
max 5.00 2314.00 98.00 99.80 99.80 111.00 4.00 99.0 98.90 140.00 97.40 99.70 21.10 65.20
In [24]:
cormat=df.corr()
cormat.round(decimals=2).style.background_gradient()
Out[24]: school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic
school_rating 1.000000 0.180000 -0.820000 0.990000 0.940000 0.200000 -0.110000 0.940000 0.980000 0.120000 -0.590000 0.640000 0.160000 -0.380000
size 0.180000 1.000000 -0.280000 0.170000 0.160000 0.140000 -0.140000 0.160000 0.140000 0.970000 -0.150000 0.100000 0.190000 -0.020000
reduced_lunch -0.820000 -0.280000 1.000000 -0.820000 -0.830000 -0.200000 0.180000 -0.840000 -0.820000 -0.210000 0.560000 -0.670000 -0.230000 0.490000
state_percentile_16 0.990000 0.170000 -0.820000 1.000000 0.950000 0.190000 -0.090000 0.950000 0.990000 0.120000 -0.570000 0.630000 0.150000 -0.380000
state_percentile_15 0.940000 0.160000 -0.830000 0.950000 1.000000 0.140000 -0.100000 0.990000 0.950000 0.110000 -0.560000 0.610000 0.180000 -0.370000
stu_teach_ratio 0.200000 0.140000 -0.200000 0.190000 0.140000 1.000000 0.290000 0.150000 0.180000 0.020000 -0.120000 0.130000 0.090000 -0.090000
school_type -0.110000 -0.140000 0.180000 -0.090000 -0.100000 0.290000 1.000000 -0.120000 -0.100000 -0.170000 0.490000 -0.430000 -0.040000 0.070000
avg_score_15 0.940000 0.160000 -0.840000 0.950000 0.990000 0.150000 -0.120000 1.000000 0.950000 0.110000 -0.600000 0.640000 0.190000 -0.370000
avg_score_16 0.980000 0.140000 -0.820000 0.990000 0.950000 0.180000 -0.100000 0.950000 1.000000 0.090000 -0.590000 0.640000 0.170000 -0.370000
full_time_teachers 0.120000 0.970000 -0.210000 0.120000 0.110000 0.020000 -0.170000 0.110000 0.090000 1.000000 -0.110000 0.060000 0.150000 0.030000
percent_black -0.590000 -0.150000 0.560000 -0.570000 -0.560000 -0.120000 0.490000 -0.600000 -0.590000 -0.110000 1.000000 -0.870000 -0.110000 0.090000
percent_white 0.640000 0.100000 -0.670000 0.630000 0.610000 0.130000 -0.430000 0.640000 0.640000 0.060000 -0.870000 1.000000 -0.090000 -0.540000
percent_asian 0.160000 0.190000 -0.230000 0.150000 0.180000 0.090000 -0.040000 0.190000 0.170000 0.150000 -0.110000 -0.090000 1.000000 0.190000
percent_hispanic -0.380000 -0.020000 0.490000 -0.380000 -0.370000 -0.090000 0.070000 -0.370000 -0.370000 0.030000 0.090000 -0.540000 0.190000 1.000000
In [25]:
fig, axe = plt.subplots(figsize=(12,8))
sns.heatmap(cormat,annot=True,cmap='YlGnBu',square=True)
plt.show()
In [27]:
y = df["school_rating"]
x = df["reduced_lunch"]
correlation = y.corr(x)
print('Correlation between school ratin & reduced lunch :',correlation)
Correlation between school ratin & reduced lunch : -0.8157567373058027
Assignment 2
Problem Statement:
Mtcars, an automobile company in Chambersburg, United States has recorded the production of its cars within a dataset. With respect to some of the feedback given by their customers they are coming up with a new model. As a result of it they have to explore the current dataset to derive
further insights out if it.
Objective:
Import the dataset, explore for dimensionality, type and average value of the horsepower across all the cars. Also, identify few of mostly correlated features which would help in modification.
In [28]:
mtcar = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/mtcars.csv")
In [29]:
mtcar.head()
Out[29]: model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
In [32]:
print("Dimensions of mtcar dataset are(Row:Column):",mtcar.shape)
Dimensions of mtcar dataset are(Row:Column): (32, 12)
In [33]:
print('Data types : \n', mtcar.dtypes)
Data types :
model object
mpg float64
cyl int64
disp float64
hp int64
drat float64
wt float64
qsec float64
vs int64
am int64
gear int64
carb int64
dtype: object
In [36]:
mtcar.hp.mean()
Out[36]: 146.6875
In [39]:
mtcar.groupby(['model']).hp.mean()
Out[39]: model
AMC Javelin 150
Cadillac Fleetwood 205
Camaro Z28 245
Chrysler Imperial 230
Datsun 710 93
Dodge Challenger 150
Duster 360 245
Ferrari Dino 175
Fiat 128 66
Fiat X1-9 66
Ford Pantera L 264
Honda Civic 52
Hornet 4 Drive 110
Hornet Sportabout 175
Lincoln Continental 215
Lotus Europa 113
Maserati Bora 335
Mazda RX4 110
Mazda RX4 Wag 110
Merc 230 95
Merc 240D 62
Merc 280 123
Merc 280C 123
Merc 450SE 180
Merc 450SL 180
Merc 450SLC 180
Pontiac Firebird 175
Porsche 914-2 91
Toyota Corolla 65
Toyota Corona 97
Valiant 105
Volvo 142E 109
Name: hp, dtype: int64
In [40]:
cormat=mtcar.corr()
Out[40]: mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.000000 -0.850000 -0.850000 -0.780000 0.680000 -0.870000 0.420000 0.660000 0.600000 0.480000 -0.550000
cyl -0.850000 1.000000 0.900000 0.830000 -0.700000 0.780000 -0.590000 -0.810000 -0.520000 -0.490000 0.530000
disp -0.850000 0.900000 1.000000 0.790000 -0.710000 0.890000 -0.430000 -0.710000 -0.590000 -0.560000 0.390000
hp -0.780000 0.830000 0.790000 1.000000 -0.450000 0.660000 -0.710000 -0.720000 -0.240000 -0.130000 0.750000
drat 0.680000 -0.700000 -0.710000 -0.450000 1.000000 -0.710000 0.090000 0.440000 0.710000 0.700000 -0.090000
wt -0.870000 0.780000 0.890000 0.660000 -0.710000 1.000000 -0.170000 -0.550000 -0.690000 -0.580000 0.430000
qsec 0.420000 -0.590000 -0.430000 -0.710000 0.090000 -0.170000 1.000000 0.740000 -0.230000 -0.210000 -0.660000
vs 0.660000 -0.810000 -0.710000 -0.720000 0.440000 -0.550000 0.740000 1.000000 0.170000 0.210000 -0.570000
am 0.600000 -0.520000 -0.590000 -0.240000 0.710000 -0.690000 -0.230000 0.170000 1.000000 0.790000 0.060000
gear 0.480000 -0.490000 -0.560000 -0.130000 0.700000 -0.580000 -0.210000 0.210000 0.790000 1.000000 0.270000
carb -0.550000 0.530000 0.390000 0.750000 -0.090000 0.430000 -0.660000 -0.570000 0.060000 0.270000 1.000000
In [41]:
fig, axe = plt.subplots(figsize=(12,8))
sns.heatmap(cormat,annot=True,cmap='YlGnBu',square=True)
plt.show()
Assignment 3
Problem Statement:
Mtcars, the automobile company in the United States have planned to rework on optimizing the horsepower of their cars, as most of the customers feedbacks were centred around horsepower. However, while developing a ML model with respect to horsepower, the efficiency of the model
was compromised. Irregularity might be one of the causes.
Objective:
Check for missing values and outliers within the horsepower column and remove them.
In [42]:
mtcar['hp'].isnull().sum()
Out[42]: 0
In [48]:
sns.boxplot(mtcar['hp']);
In [46]:
filter = mtcar['hp'].values <300
In [47]:
hp_filter = mtcar[filter]
In [49]:
sns.boxplot(hp_filter['hp']);
Assignment 4
Problem Statement:
Load the load_diabetes datasets internally from sklearn and check for any missing value or outlier data in the ‘data’ column. If any irregularities found treat them accordingly.
Objective:
Perform missing value and outlier data treatment.
In [50]:
from sklearn.datasets import load_diabetes
In [52]:
data = load_diabetes()
In [53]:
print(data.DESCR)
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, T-Cells (a type of white blood cells)
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone
- s5 ltg, lamotrigine
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
In [56]:
diabetes = pd.DataFrame(data.data, columns = data.feature_names)
diabetes.head()
Out[56]: age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641
In [58]:
diabetes.isna().sum()
Out[58]: age 0
sex 0
bmi 0
bp 0
s1 0
s2 0
s3 0
s4 0
s5 0
s6 0
dtype: int64
In [61]:
diabetes.describe().round(decimals=3)
Out[61]: age sex bmi bp s1 s2 s3 s4 s5 s6
count 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000
mean -0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000
std 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
min -0.107 -0.045 -0.090 -0.112 -0.127 -0.116 -0.102 -0.076 -0.126 -0.138
25% -0.037 -0.045 -0.034 -0.037 -0.034 -0.030 -0.035 -0.039 -0.033 -0.033
50% 0.005 -0.045 -0.007 -0.006 -0.004 -0.004 -0.007 -0.003 -0.002 -0.001
75% 0.038 0.051 0.031 0.036 0.028 0.030 0.029 0.034 0.032 0.028
max 0.111 0.051 0.171 0.132 0.154 0.199 0.181 0.185 0.134 0.136
In [69]:
diabetes.boxplot(figsize=(12,6),fontsize=14)
Out[69]: <AxesSubplot:>
In [83]:
cols=diabetes.columns
cols
Out[83]: Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')
In [84]:
df=diabetes.copy()
In [85]:
def iqr_capping(df,cols):
for col in cols:
q1=df[col].quantile(0.25)
q3=df[col].quantile(0.75)
iqr=q3-q1
UC=q3+(1.5*iqr)
LC=q1-(1.5*iqr)
df[col]=np.where(df[col]>UC,UC,np.where(df[col]<LC,LC,df[col]))
In [86]:
iqr_capping(df,cols)
In [87]:
df.boxplot(figsize=(12,6),fontsize=14)
Out[87]: <AxesSubplot:>
Assignment 5
Problem Statement:
As a macroeconomic analyst at the Organization for Economic Cooperation and Development (OECD), your job is to collect relevant data for analysis. It looks like you have three countries in thenorth_america
data frame and one country in thesouth_americadata frame. As these are in two
separate plots, it's hard to compare the average labor hours between North America and South America. If all the countries were into the same data frame, it would be much easier to do this comparison.
Objective:
Demonstrate concatenation.
In [3]:
df1 = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/north_america_2000_2010.csv")
df1.head()
Out[3]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Canada 1779.0 1771.0 1754.0 1740.0 1760.0 1747 1745.0 1741.0 1735 1701.0 1703.0
1 Mexico 2311.2 2285.2 2271.2 2276.5 2270.6 2281 2280.6 2261.4 2258 2250.2 2242.4
2 USA 1836.0 1814.0 1810.0 1800.0 1802.0 1799 1800.0 1798.0 1792 1767.0 1778.0
In [4]:
df2 = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/south_america_2000_2010.csv")
df2.head()
Out[4]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Chile 2263 2242 2250 2235 2232 2157 2165 2128 2095 2074 2069.6
In [14]:
df_concatnated=pd.concat([df1,df2],keys=['North_America','South_America'])
df_concatnated.head()
Out[14]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
North_America 0 Canada 1779.0 1771.0 1754.0 1740.0 1760.0 1747 1745.0 1741.0 1735 1701.0 1703.0
1 Mexico 2311.2 2285.2 2271.2 2276.5 2270.6 2281 2280.6 2261.4 2258 2250.2 2242.4
2 USA 1836.0 1814.0 1810.0 1800.0 1802.0 1799 1800.0 1798.0 1792 1767.0 1778.0
South_America 0 Chile 2263.0 2242.0 2250.0 2235.0 2232.0 2157 2165.0 2128.0 2095 2074.0 2069.6
In [28]:
df_concatnated.groupby(level=0).mean()
Out[28]: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
North_America 1975.4 1956.733333 1945.066667 1938.833333 1944.2 1942.333333 1941.866667 1933.466667 1928.333333 1906.066667 1907.8
South_America 2263.0 2242.000000 2250.000000 2235.000000 2232.0 2157.000000 2165.000000 2128.000000 2095.000000 2074.000000 2069.6
Assignment 6
Problem Statement:
SFO Public Department -referred to as SFO has captured all the salary data of its employees from year 2011-2014. Now in 2018 the organization is facing some financial crisis. As a first step HR wants to rationalize employee cost to save payroll budget. You have to do data manipulation
and answer the below questions:
1.How much total salary cost has increased from year 2011 to 2014?
2.Who was the top earning employee across all the years?
Objective:
Performdata manipulation and visualization techniques
In [12]:
Salary_df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/Salaries.csv")
Salary_df.head()
Out[12]: Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 2011 NaN San Francisco NaN
In [5]:
Salary_df['TotalPayBenefits']=(Salary_df.TotalPayBenefits)/100000
Salary_df.head()
Out[5]: Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 0.000057 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 0.000054 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 0.000034 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 0.000033 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 0.000033 2011 NaN San Francisco NaN
In [13]:
Salary_df['EmployeeName'].unique()
Out[13]: array(['NATHANIEL FORD', 'GARY JIMENEZ', 'ALBERT PARDINI', ...,
'Mark W Mcclure', 'Charlene D Mccully', 'Joe Lopez'], dtype=object)
In [14]:
Salary_df['EmployeeName']=Salary_df['EmployeeName'].str.lower().str.replace('bernard fatooh','bernard fatooh')
In [15]:
Salary_df.info()
--- ------ -------------- -----
0 Id 148648 non-null int64
1 EmployeeName 148648 non-null object
2 JobTitle 148648 non-null object
3 BasePay 148043 non-null float64
4 OvertimePay 148648 non-null float64
5 OtherPay 148648 non-null float64
6 Benefits 112490 non-null float64
7 TotalPay 148648 non-null float64
8 TotalPayBenefits 148648 non-null float64
9 Year 148648 non-null int64
10 Notes 0 non-null float64
11 Agency 148648 non-null object
12 Status 38119 non-null object
dtypes: float64(7), int64(2), object(4)
memory usage: 14.7+ MB
1.How much total salary cost has increased from year 2011 to 2014?
In [84]:
Salary_Group=Salary_df.groupby('Year')['TotalPayBenefits'].sum()
Salary_Group.head()
Out[84]: Year
2011 2.594113e+09
2012 3.696790e+09
2013 3.814772e+09
2014 3.821866e+09
Name: TotalPayBenefits, dtype: float64
In [85]:
Salary_2011=Salary_Group.iloc[0]
Salary_2014=Salary_Group.iloc[2]
Increase_Salary=(Salary_2014-Salary_2011)/Salary_2011*100
print(Increase_Salary)
47.05497174543298
2.Who was the top earning employee across all the years?
In [24]:
TopEarn_pivot=Salary_df.pivot_table(index='EmployeeName',columns='Year',values='TotalPayBenefits',aggfunc='mean',margins=True)
TopEarn_pivot.head()
Out[24]: Year 2011 2012 2013 2014 All
EmployeeName
a bernard fatooh 20039.91 23514.85 29379.24 30153.03 25771.7575
a elizabeth marchasin 26282.86 NaN NaN NaN 26282.8600
a jamil niazi 87496.21 NaN NaN NaN 87496.2100
a k finizio NaN NaN NaN 26113.37 26113.3700
a. james robertson ii NaN NaN 22601.80 NaN 22601.8000
In [30]:
TopEarn_pivot=TopEarn_pivot.sort_values(['All'],ascending=False)
TopEarn_pivot.head()
Out[30]: Year 2011 2012 2013 2014 All
EmployeeName
nathaniel ford 567595.43 NaN NaN NaN 567595.430
gary jimenez 538909.28 NaN NaN NaN 538909.280
william j coaker jr. NaN NaN NaN 436224.36 436224.360
amy p hart NaN NaN 383746.78 479652.21 431699.495
gregory p suhr NaN NaN 425815.28 418019.22 421917.250
In [44]:
print("Top Earning employee is",TopEarn_pivot.index[0])
Top Earning employee is nathaniel ford
Assignment 7 -Kaggle -Covid_south_america

Perform the EDA
Read the data
Size, Shape, Datatypes
Description statistical
Missing Values Detection and Treatment
Outlier Detection and Treatment
Correlation heatmap
In [45]:
covid_df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/covid_south_america_weekly_trend.csv")
covid_df.head()
Out[45]: Country/Other Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population
0 Argentina 70514 94792 -26 1537 863 1198 -28 19 45881008
1 Bolivia 3537 6862 -48 296 47 111 -58 4 11936049
2 Brazil 576463 741844 -22 2681 5051 5814 -13 23 215056109
3 Chile 194646 234595 -17 10040 874 732 19 45 19387329
4 Colombia 17132 29098 -41 331 602 1043 -42 12 51778918
In [46]:
print('Shape of data :',covid_df.shape)
Shape of data : (13, 10)
In [50]:
print('Size of data :',covid_df.size)
Size of data : 130
In [53]:
print('Data types : \n', covid_df.dtypes)
Data types :
Country/Other object
Cases in the last 7 days int64
Cases in the preceding 7 days int64
Weekly Case % Change int64
Cases in the last 7 days/1M pop int64
Deaths in the last 7 days int64
Deaths in the preceding 7 days int64
Weekly Death % Change int64
Deaths in the last 7 days/1M pop int64
Population int64
dtype: object
In [52]:
covid_df.info()
--- ------ -------------- -----
0 Country/Other 13 non-null object
1 Cases in the last 7 days 13 non-null int64
2 Cases in the preceding 7 days 13 non-null int64
3 Weekly Case % Change 13 non-null int64
4 Cases in the last 7 days/1M pop 13 non-null int64
5 Deaths in the last 7 days 13 non-null int64
6 Deaths in the preceding 7 days 13 non-null int64
7 Weekly Death % Change 13 non-null int64
8 Deaths in the last 7 days/1M pop 13 non-null int64
9 Population 13 non-null int64
dtypes: int64(9), object(1)
In [47]:
print('Column Names are:',covid_df.columns)
Column Names are: Index(['Country/Other', 'Cases in the last 7 days',
'Cases in the preceding 7 days', 'Weekly Case % Change',
'Cases in the last 7 days/1M pop', 'Deaths in the last 7 days',
'Deaths in the preceding 7 days', 'Weekly Death % Change',
'Deaths in the last 7 days/1M pop', 'Population'],
dtype='object')
In [54]:
covid_df.isna().sum()
Out[54]: Country/Other 0
Cases in the last 7 days 0
Cases in the preceding 7 days 0
Weekly Case % Change 0
Cases in the last 7 days/1M pop 0
Deaths in the last 7 days 0
Deaths in the preceding 7 days 0
Weekly Death % Change 0
Deaths in the last 7 days/1M pop 0
Population 0
dtype: int64
In [56]:
covid_df.describe()
Out[56]: Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population
count 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 1.300000e+01
mean 71469.384615 95625.076923 -39.846154 2005.076923 685.769231 824.846154 -27.538462 17.307692 3.358683e+07
std 160814.226454 204458.574605 13.637242 3107.834789 1370.556526 1569.117631 28.814393 12.565725 5.716988e+07
min 198.000000 290.000000 -61.000000 133.000000 4.000000 3.000000 -58.000000 1.000000 3.114600e+05
25% 3537.000000 5834.000000 -48.000000 331.000000 29.000000 64.000000 -50.000000 10.000000 3.493624e+06
50% 8999.000000 21531.000000 -41.000000 636.000000 96.000000 121.000000 -42.000000 13.000000 1.808557e+07
75% 25974.000000 55235.000000 -32.000000 1537.000000 863.000000 1043.000000 -13.000000 23.000000 3.373056e+07
max 576463.000000 741844.000000 -17.000000 10040.000000 5051.000000 5814.000000 33.000000 45.000000 2.150561e+08
In [57]:
cormat=covid_df.corr()
Out[57]: Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population
Cases in the last 7 days 1.000000 1.000000 0.580000 0.330000 0.960000 0.950000 0.310000 0.370000 0.920000
Cases in the preceding 7 days 1.000000 1.000000 0.550000 0.310000 0.970000 0.960000 0.310000 0.370000 0.930000
Weekly Case % Change 0.580000 0.550000 1.000000 0.600000 0.420000 0.410000 0.490000 0.360000 0.400000
Cases in the last 7 days/1M pop 0.330000 0.310000 0.600000 1.000000 0.150000 0.100000 0.510000 0.760000 0.030000
Deaths in the last 7 days 0.960000 0.970000 0.420000 0.150000 1.000000 1.000000 0.260000 0.330000 0.970000
Deaths in the preceding 7 days 0.950000 0.960000 0.410000 0.100000 1.000000 1.000000 0.210000 0.280000 0.990000
Weekly Death % Change 0.310000 0.310000 0.490000 0.510000 0.260000 0.210000 1.000000 0.640000 0.120000
Deaths in the last 7 days/1M pop 0.370000 0.370000 0.360000 0.760000 0.330000 0.280000 0.640000 1.000000 0.150000
Population 0.920000 0.930000 0.400000 0.030000 0.970000 0.990000 0.120000 0.150000 1.000000
In [ ]:

Assignment-Data Preprocessing (All)

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Assignment-Data Preprocessing (All)

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Assignment-Data Preprocessing (All)

Uploaded by

Copyright:

Available Formats

Lesson 2: Data Preprocessing Assignments

import matplotlib.pyplot as plt

import seaborn as sns

Read the data in pandas data frame

Describe the data to find more details

Find the correlation between ‘reduced_lunch’and‘school_rating’

Shape of data : (347, 15)

Column Names are: Index(['name', 'school_rating', 'size', 'reduced_lunch', 'state_percentile_16',

'state_percentile_15', 'stu_teach_ratio', 'school_type', 'avg_score_15',

'avg_score_16', 'full_time_teachers', 'percent_black', 'percent_white',

RangeIndex: 347 entries, 0 to 346

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 name 347 non-null object

1 school_rating 347 non-null float64

2 size 347 non-null float64

3 reduced_lunch 347 non-null float64

4 state_percentile_16 347 non-null float64

5 state_percentile_15 341 non-null float64

6 stu_teach_ratio 347 non-null float64

7 school_type 347 non-null object

8 avg_score_15 341 non-null float64

9 avg_score_16 347 non-null float64

10 full_time_teachers 347 non-null float64

11 percent_black 347 non-null float64

12 percent_white 347 non-null float64

13 percent_asian 347 non-null float64

14 percent_hispanic 347 non-null float64

dtypes: float64(13), object(2)

memory usage: 40.8+ KB

Out[15]: array(['Public', 'Public Magnet', 'Public Charter', 'Public Virtual'],

RangeIndex: 347 entries, 0 to 346

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 name 347 non-null object

1 school_rating 347 non-null float64

2 size 347 non-null float64

3 reduced_lunch 347 non-null float64

4 state_percentile_16 347 non-null float64

5 state_percentile_15 341 non-null float64

6 stu_teach_ratio 347 non-null float64

7 school_type 347 non-null float64

8 avg_score_15 341 non-null float64

9 avg_score_16 347 non-null float64

10 full_time_teachers 347 non-null float64

11 percent_black 347 non-null float64

12 percent_white 347 non-null float64

13 percent_asian 347 non-null float64

14 percent_hispanic 347 non-null float64

dtypes: float64(14), object(1)

memory usage: 40.8+ KB

print('Correlation between school ratin & reduced lunch :',correlation)

Correlation between school ratin & reduced lunch : -0.8157567373058027

Out[29]: model mpg cyl disp hp drat wt qsec vs am gear carb

0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

Data Set Characteristics: