Assignment-Data Preprocessing (All)
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # required by the plt.subplots calls later in the notebook
import warnings
Assignment 1
Problem Statement
Suppose you are a public school administrator. Some schools in your state of Tennessee are performing below average academically. Your superintendent, under pressure from frustrated parents and voters, has approached you with the task of understanding why these schools are underperforming. To improve school performance, you need to learn more about these schools and their students, just as a business needs to understand its
Perform exploratory data analysis, including determining the data types and running a correlation analysis on the data, so that the data is converted into useful information:
In [4]:
df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/middle_tn_schools.csv")
In [5]:
Out[5]: name school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic
0 Allendale Elementary School 5.0 851.0 10.0 90.2 95.8 15.7 Public 89.4 85.2 54.0 2.9 85.5 1.6 5.6
1 Anderson Elementary 2.0 412.0 71.0 32.8 37.3 12.8 Public 43.0 38.3 32.0 3.9 86.7 1.0 4.9
2 Avoca Elementary 4.0 482.0 43.0 78.4 83.6 16.6 Public 75.7 73.0 29.0 1.0 91.5 1.2 4.4
3 Bailey Middle 0.0 394.0 91.0 1.6 1.0 13.1 Public Magnet 2.1 4.4 30.0 80.7 11.7 2.3 4.3
4 Barfield Elementary 4.0 948.0 26.0 85.3 89.2 14.8 Public 81.3 79.6 64.0 11.8 71.2 7.1 6.0
In [11]:
print('Shape of data :', df.shape)
In [12]:
print('Column Names are:',df.columns)
'percent_asian', 'percent_hispanic'],
In [10]:
<class 'pandas.core.frame.DataFrame'>
In [14]:
Out[14]: name 0
school_rating 0
size 0
reduced_lunch 0
state_percentile_16 0
state_percentile_15 6
stu_teach_ratio 0
school_type 0
avg_score_15 6
avg_score_16 0
full_time_teachers 0
percent_black 0
percent_white 0
percent_asian 0
percent_hispanic 0
dtype: int64
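The counts above show six missing values each in state_percentile_15 and avg_score_15. The following is a minimal sketch of two common ways to treat such gaps, using a small toy frame rather than the real file; the values and the choice of median imputation are assumptions for illustration, not the treatment used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the schools data; values are illustrative only.
toy = pd.DataFrame({
    "avg_score_15": [89.4, np.nan, 75.7, 2.1],
    "avg_score_16": [85.2, 38.3, 73.0, 4.4],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill each numeric column with its own median.
filled = toy.fillna(toy.median(numeric_only=True))

print(missing_counts)
print(dropped.shape)
print(filled)
```

Dropping rows is simplest when only a few rows are affected; median imputation keeps every row but slightly shrinks the spread of the filled column.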
In [15]:
In [16]:
s_type={'Public':1, 'Public Magnet':2, 'Public Charter':3, 'Public Virtual':4}
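A mapping dictionary like s_type is typically applied with Series.map so that the text labels become integer codes. The sketch below assumes that usage and runs on a toy column; the toy frame is hypothetical.

```python
import pandas as pd

# Same label-to-code mapping as in the cell above.
s_type = {'Public': 1, 'Public Magnet': 2, 'Public Charter': 3, 'Public Virtual': 4}

# Toy column standing in for df['school_type'].
toy = pd.DataFrame({"school_type": ["Public", "Public Magnet", "Public", "Public Charter"]})

# map() looks each label up in the dictionary; labels missing from it become NaN.
toy["school_type"] = toy["school_type"].map(s_type)

print(toy)
```

Because unmapped labels turn into NaN, it is worth checking isnull() on the column after mapping.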
In [18]:
In [20]:
In [21]:
<class 'pandas.core.frame.DataFrame'>
In [23]:
Out[23]: school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic
count 347.00 347.00 347.00 347.00 341.00 347.00 347.00 341.0 347.00 347.00 347.00 347.00 347.00 347.00
mean 2.97 699.47 50.28 58.80 58.25 15.46 1.19 57.0 57.05 44.94 21.20 61.67 2.64 11.16
std 1.69 400.60 25.48 32.54 32.70 5.73 0.47 26.7 27.97 22.05 23.56 27.27 3.11 12.03
min 0.00 53.00 2.00 0.20 0.60 4.70 1.00 1.5 0.10 2.00 0.00 1.10 0.00 0.00
25% 2.00 420.50 30.00 30.95 27.10 13.70 1.00 37.6 37.00 30.00 3.60 40.60 0.75 3.80
50% 3.00 595.00 51.00 66.40 65.80 15.00 1.00 61.8 60.70 40.00 13.50 68.70 1.60 6.40
75% 4.00 851.00 71.50 88.00 88.60 16.70 1.00 79.6 80.25 54.00 28.35 85.95 3.10 13.80
max 5.00 2314.00 98.00 99.80 99.80 111.00 4.00 99.0 98.90 140.00 97.40 99.70 21.10 65.20
In [24]:
Out[24]: school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic
school_rating 1.000000 0.180000 -0.820000 0.990000 0.940000 0.200000 -0.110000 0.940000 0.980000 0.120000 -0.590000 0.640000 0.160000 -0.380000
size 0.180000 1.000000 -0.280000 0.170000 0.160000 0.140000 -0.140000 0.160000 0.140000 0.970000 -0.150000 0.100000 0.190000 -0.020000
reduced_lunch -0.820000 -0.280000 1.000000 -0.820000 -0.830000 -0.200000 0.180000 -0.840000 -0.820000 -0.210000 0.560000 -0.670000 -0.230000 0.490000
state_percentile_16 0.990000 0.170000 -0.820000 1.000000 0.950000 0.190000 -0.090000 0.950000 0.990000 0.120000 -0.570000 0.630000 0.150000 -0.380000
state_percentile_15 0.940000 0.160000 -0.830000 0.950000 1.000000 0.140000 -0.100000 0.990000 0.950000 0.110000 -0.560000 0.610000 0.180000 -0.370000
stu_teach_ratio 0.200000 0.140000 -0.200000 0.190000 0.140000 1.000000 0.290000 0.150000 0.180000 0.020000 -0.120000 0.130000 0.090000 -0.090000
school_type -0.110000 -0.140000 0.180000 -0.090000 -0.100000 0.290000 1.000000 -0.120000 -0.100000 -0.170000 0.490000 -0.430000 -0.040000 0.070000
avg_score_15 0.940000 0.160000 -0.840000 0.950000 0.990000 0.150000 -0.120000 1.000000 0.950000 0.110000 -0.600000 0.640000 0.190000 -0.370000
avg_score_16 0.980000 0.140000 -0.820000 0.990000 0.950000 0.180000 -0.100000 0.950000 1.000000 0.090000 -0.590000 0.640000 0.170000 -0.370000
full_time_teachers 0.120000 0.970000 -0.210000 0.120000 0.110000 0.020000 -0.170000 0.110000 0.090000 1.000000 -0.110000 0.060000 0.150000 0.030000
percent_black -0.590000 -0.150000 0.560000 -0.570000 -0.560000 -0.120000 0.490000 -0.600000 -0.590000 -0.110000 1.000000 -0.870000 -0.110000 0.090000
percent_white 0.640000 0.100000 -0.670000 0.630000 0.610000 0.130000 -0.430000 0.640000 0.640000 0.060000 -0.870000 1.000000 -0.090000 -0.540000
percent_asian 0.160000 0.190000 -0.230000 0.150000 0.180000 0.090000 -0.040000 0.190000 0.170000 0.150000 -0.110000 -0.090000 1.000000 0.190000
percent_hispanic -0.380000 -0.020000 0.490000 -0.380000 -0.370000 -0.090000 0.070000 -0.370000 -0.370000 0.030000 0.090000 -0.540000 0.190000 1.000000
In [25]:
fig, axe = plt.subplots(figsize=(12,8))
In [27]:
y = df["school_rating"]
x = df["reduced_lunch"]
correlation = y.corr(x)
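Series.corr computes the Pearson coefficient by default. As a rough check of what that number means, the sketch below compares it with the textbook formula on the five rows shown in the head of the dataset; five rows are far too few to reproduce the full-data value of -0.82, so this only illustrates the calculation.

```python
import numpy as np
import pandas as pd

# First five rows of school_rating and reduced_lunch from the table above.
head = pd.DataFrame({
    "school_rating": [5.0, 2.0, 4.0, 0.0, 4.0],
    "reduced_lunch": [10.0, 71.0, 43.0, 91.0, 26.0],
})
y = head["school_rating"]
x = head["reduced_lunch"]

# Pearson correlation as pandas computes it.
r_pandas = y.corr(x)

# Same quantity from the definition: covariance divided by the product of spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_pandas, 3), round(r_manual, 3))
```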
Assignment 2
Problem Statement:
Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars in a dataset. In response to feedback from customers, the company is developing a new model. To support this, it needs to explore the current dataset and derive further insights from it.
Import the dataset and explore its dimensionality, data types, and the average horsepower across all the cars. Also, identify a few of the most strongly correlated features that would help in the modification.
In [28]:
mtcar = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/mtcars.csv")
In [29]:
In [32]:
print("Dimensions of mtcar dataset are (rows, columns):", mtcar.shape)
In [33]:
print('Data types : \n', mtcar.dtypes)
Data types :
model object
mpg float64
cyl int64
disp float64
hp int64
drat float64
wt float64
qsec float64
vs int64
am int64
gear int64
carb int64
dtype: object
In [36]:
Out[36]: 146.6875
In [39]:
Out[39]: model
AMC Javelin 150
Datsun 710 93
Fiat 128 66
Fiat X1-9 66
Honda Civic 52
Merc 230 95
Merc 240D 62
Porsche 914-2 91
Toyota Corolla 65
Toyota Corona 97
Valiant 105
In [40]:
mpg 1.000000 -0.850000 -0.850000 -0.780000 0.680000 -0.870000 0.420000 0.660000 0.600000 0.480000 -0.550000
cyl -0.850000 1.000000 0.900000 0.830000 -0.700000 0.780000 -0.590000 -0.810000 -0.520000 -0.490000 0.530000
disp -0.850000 0.900000 1.000000 0.790000 -0.710000 0.890000 -0.430000 -0.710000 -0.590000 -0.560000 0.390000
hp -0.780000 0.830000 0.790000 1.000000 -0.450000 0.660000 -0.710000 -0.720000 -0.240000 -0.130000 0.750000
drat 0.680000 -0.700000 -0.710000 -0.450000 1.000000 -0.710000 0.090000 0.440000 0.710000 0.700000 -0.090000
wt -0.870000 0.780000 0.890000 0.660000 -0.710000 1.000000 -0.170000 -0.550000 -0.690000 -0.580000 0.430000
qsec 0.420000 -0.590000 -0.430000 -0.710000 0.090000 -0.170000 1.000000 0.740000 -0.230000 -0.210000 -0.660000
vs 0.660000 -0.810000 -0.710000 -0.720000 0.440000 -0.550000 0.740000 1.000000 0.170000 0.210000 -0.570000
am 0.600000 -0.520000 -0.590000 -0.240000 0.710000 -0.690000 -0.230000 0.170000 1.000000 0.790000 0.060000
gear 0.480000 -0.490000 -0.560000 -0.130000 0.700000 -0.580000 -0.210000 0.210000 0.790000 1.000000 0.270000
carb -0.550000 0.530000 0.390000 0.750000 -0.090000 0.430000 -0.660000 -0.570000 0.060000 0.270000 1.000000
In [41]:
fig, axe = plt.subplots(figsize=(12,8))
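To pick out the most strongly correlated features without reading the whole matrix by eye, one option is to rank the feature pairs by absolute correlation. The sketch below does this on a four-column excerpt of the matrix shown above; the approach is an illustration, not necessarily how the notebook identified the features.

```python
import numpy as np
import pandas as pd

# Excerpt of the correlation values shown above (mpg, cyl, disp, wt).
cols = ["mpg", "cyl", "disp", "wt"]
corr = pd.DataFrame(
    [[1.00, -0.85, -0.85, -0.87],
     [-0.85, 1.00, 0.90, 0.78],
     [-0.85, 0.90, 1.00, 0.89],
     [-0.87, 0.78, 0.89, 1.00]],
    index=cols, columns=cols,
)

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked.head(3))
```

Sorting on the absolute value keeps strong negative relationships, such as mpg against wt, near the top alongside strong positive ones.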
Assignment 3
Problem Statement:
Mtcars, the automobile company in the United States, plans to optimize the horsepower of its cars, as most customer feedback centred on horsepower. However, while developing an ML model with respect to horsepower, the efficiency of the model was compromised. Irregularities in the data might be one of the causes.
Check for missing values and outliers in the horsepower column and remove them.
In [42]:
Out[42]: 0
In [48]:
In [46]:
hp_mask = mtcar['hp'] < 300  # boolean mask; named hp_mask to avoid shadowing the built-in filter
In [47]:
hp_filter = mtcar[hp_mask]
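The cell above removes outliers with a fixed cut-off of 300 hp. A data-driven alternative is the 1.5 × IQR rule, sketched below on a handful of illustrative horsepower values; the values are a toy sample, not the full mtcars column, and the rule is offered as an assumption about how such a threshold could be derived.

```python
import pandas as pd

# Toy horsepower values; illustrative only.
hp = pd.Series([52, 66, 93, 110, 123, 150, 175, 180, 335], name="hp")

# Quartiles and interquartile range.
q1 = hp.quantile(0.25)
q3 = hp.quantile(0.75)
iqr = q3 - q1

# Tukey fences: anything beyond 1.5 * IQR from the quartiles counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (hp < lower) | (hp > upper)
kept = hp[~is_outlier]

print(lower, upper)
print(kept.tolist())
```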
In [49]:
Assignment 4
Problem Statement:
Load the diabetes dataset with load_diabetes from sklearn and check the ‘data’ array for missing values and outliers. If any irregularities are found, treat them accordingly.
Perform missing value and outlier treatment.
In [50]:
from sklearn.datasets import load_diabetes
In [52]:
data = load_diabetes()
In [53]:
.. _diabetes_dataset:
Diabetes dataset
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- sex
- s5 ltg, lamotrigine
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
In [56]:
diabetes = pd.DataFrame(data.data, columns=data.feature_names)  # feature matrix from the loaded Bunch
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641
In [58]:
Out[58]: age 0
sex 0
bmi 0
bp 0
s1 0
s2 0
s3 0
s4 0
s5 0
s6 0
dtype: int64
In [61]:
count 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000 442.000
mean -0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 -0.000
std 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
min -0.107 -0.045 -0.090 -0.112 -0.127 -0.116 -0.102 -0.076 -0.126 -0.138
25% -0.037 -0.045 -0.034 -0.037 -0.034 -0.030 -0.035 -0.039 -0.033 -0.033
50% 0.005 -0.045 -0.007 -0.006 -0.004 -0.004 -0.007 -0.003 -0.002 -0.001
75% 0.038 0.051 0.031 0.036 0.028 0.030 0.029 0.034 0.032 0.028
max 0.111 0.051 0.171 0.132 0.154 0.199 0.181 0.185 0.134 0.136
In [69]:
Out[69]: <AxesSubplot:>
In [83]:
Out[83]: Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')
In [84]:
In [85]:
def iqr_capping(df,cols):
In [86]:
In [87]:
Out[87]: <AxesSubplot:>
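The body of iqr_capping did not survive in this export. The sketch below shows one plausible form of IQR capping (winsorizing values to the 1.5 × IQR fences); the function name cap_with_iqr, the k parameter, and the toy values are hypothetical and should not be read as the original implementation.

```python
import pandas as pd

def cap_with_iqr(frame, cols, k=1.5):
    """Return a copy of frame with each column in cols clipped to its IQR fences."""
    capped = frame.copy()
    for col in cols:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        # Values below or above the fences are replaced by the fence value.
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped

# Toy column with one large value; illustrative only.
toy = pd.DataFrame({"bmi": [-0.03, -0.01, 0.00, 0.01, 0.02, 0.03, 0.17]})
capped = cap_with_iqr(toy, ["bmi"])

print(capped["bmi"].tolist())
```

Capping keeps all 442 rows in a case like the diabetes data, which is often preferable to dropping rows when the dataset is small.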
Assignment 5
Problem Statement:
As a macroeconomic analyst at the Organization for Economic Cooperation and Development (OECD), your job is to collect relevant data for analysis. It looks like you have three countries in the north_america data frame and one country in the south_america data frame. As these are in two separate plots, it's hard to compare the average labor hours between North America and South America. If all the countries were in the same data frame, it would be much easier to do this comparison.
Demonstrate concatenation.
In [3]:
df1 = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/north_america_2000_2010.csv")
Out[3]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Canada 1779.0 1771.0 1754.0 1740.0 1760.0 1747 1745.0 1741.0 1735 1701.0 1703.0
1 Mexico 2311.2 2285.2 2271.2 2276.5 2270.6 2281 2280.6 2261.4 2258 2250.2 2242.4
2 USA 1836.0 1814.0 1810.0 1800.0 1802.0 1799 1800.0 1798.0 1792 1767.0 1778.0
In [4]:
df2 = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/south_america_2000_2010.csv")
Out[4]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Chile 2263 2242 2250 2235 2232 2157 2165 2128 2095 2074 2069.6
In [14]:
Out[14]: Country 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
North_America 0 Canada 1779.0 1771.0 1754.0 1740.0 1760.0 1747 1745.0 1741.0 1735 1701.0 1703.0
1 Mexico 2311.2 2285.2 2271.2 2276.5 2270.6 2281 2280.6 2261.4 2258 2250.2 2242.4
2 USA 1836.0 1814.0 1810.0 1800.0 1802.0 1799 1800.0 1798.0 1792 1767.0 1778.0
South_America 0 Chile 2263.0 2242.0 2250.0 2235.0 2232.0 2157 2165.0 2128.0 2095 2074.0 2069.6
In [28]:
Out[28]: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
North_America 1975.4 1956.733333 1945.066667 1938.833333 1944.2 1942.333333 1941.866667 1933.466667 1928.333333 1906.066667 1907.8
South_America 2263.0 2242.000000 2250.000000 2235.000000 2232.0 2157.000000 2165.000000 2128.000000 2095.000000 2074.000000 2069.6
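The two outputs above are consistent with concatenating the frames under region keys and then averaging per region. The sketch below reproduces that idea on the 2000 and 2010 columns only; the use of pd.concat with keys and groupby(level=0) is an assumption about how the outputs were produced.

```python
import pandas as pd

# Two of the year columns from the tables above.
north = pd.DataFrame({
    "Country": ["Canada", "Mexico", "USA"],
    "2000": [1779.0, 2311.2, 1836.0],
    "2010": [1703.0, 2242.4, 1778.0],
})
south = pd.DataFrame({"Country": ["Chile"], "2000": [2263.0], "2010": [2069.6]})

# keys= adds an outer index level naming the source frame of each row.
combined = pd.concat([north, south], keys=["North_America", "South_America"])

# Average the numeric columns within each region (outer index level 0).
region_mean = combined.drop(columns="Country").groupby(level=0).mean()

print(combined)
print(region_mean)
```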
Assignment 6
Problem Statement:
SFO Public Department (referred to as SFO) has captured the salary data of all its employees for the years 2011 to 2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to rationalize employee costs to reduce the payroll budget. You have to do data manipulation and answer the questions below:
1. How much has the total salary cost increased from 2011 to 2014?
2. Who was the top-earning employee across all the years?
Perform data manipulation and visualization techniques.
In [12]:
Salary_df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/Demo Datasets/Lesson 3/Salaries.csv")
Out[12]: Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 2011 NaN San Francisco NaN
In [5]:
Out[5]: Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 0.000057 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 0.000054 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 0.000034 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 0.000033 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 0.000033 2011 NaN San Francisco NaN
In [13]:
In [14]:
Salary_df['EmployeeName']=Salary_df['EmployeeName'].str.lower().str.replace('bernard fatooh','bernard fatooh')
In [15]:
<class 'pandas.core.frame.DataFrame'>
1. How much has the total salary cost increased from 2011 to 2014?
In [84]:
Out[84]: Year
2011 2.594113e+09
2012 3.696790e+09
2013 3.814772e+09
2014 3.821866e+09
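Yearly totals like those above can be obtained by grouping on Year and summing a pay column, and the increase is then the difference between the 2014 and 2011 totals. The sketch below uses a toy frame; which pay column the notebook summed is not visible here, so TotalPay is an assumption.

```python
import pandas as pd

# Toy salary records; illustrative only.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2014, 2014],
    "TotalPay": [100.0, 200.0, 250.0, 300.0],
})

# Total pay per year.
totals = toy.groupby("Year")["TotalPay"].sum()

# Increase in total cost from 2011 to 2014.
increase = totals.loc[2014] - totals.loc[2011]

print(totals)
print("Increase 2011 to 2014:", increase)
```

Applied to the totals printed above, the same subtraction gives roughly 3.82e9 - 2.59e9, or about 1.23e9.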
In [85]:
2. Who was the top-earning employee across all the years?
In [24]:
In [30]:
In [44]:
print("Top Earning employee is",TopEarn_pivot.index[0])
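The cells that build TopEarn_pivot are not shown in this export. One way such a table could be built is a pivot that sums pay per employee and sorts it in descending order, so that index[0] is the top earner. The sketch below assumes that construction and uses hypothetical names and amounts.

```python
import pandas as pd

# Toy salary records; names and amounts are hypothetical.
toy = pd.DataFrame({
    "EmployeeName": ["employee a", "employee b", "employee a", "employee b", "employee c"],
    "Year": [2011, 2011, 2012, 2012, 2012],
    "TotalPay": [100.0, 150.0, 120.0, 160.0, 200.0],
})

# Sum pay per employee across all years, then sort with the largest total first.
top_earn = (
    toy.pivot_table(index="EmployeeName", values="TotalPay", aggfunc="sum")
       .sort_values("TotalPay", ascending=False)
)

print("Top earning employee is", top_earn.index[0])
```

Summing across years rewards employees present in every year; ranking by the single highest yearly pay instead would need max in place of sum.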
Descriptive statistics
Correlation heatmap
In [45]:
covid_df = pd.read_csv(r"C:/Users/dipan/Desktop/Simplilearn/Jupyter/Data files/covid_south_america_weekly_trend.csv")
Out[45]: Country/Other Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population
In [46]:
print('Shape of data :',covid_df.shape)
In [50]:
print('Size of data :',covid_df.size)
In [53]:
print('Data types : \n', covid_df.dtypes)
Data types :
Country/Other object
Population int64
dtype: object
In [52]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
In [47]:
print('Column Names are:',covid_df.columns)
In [54]:
Out[54]: Country/Other 0
Population 0
dtype: int64
In [56]:
Out[56]: Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population
count 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 13.000000 1.300000e+01
mean 71469.384615 95625.076923 -39.846154 2005.076923 685.769231 824.846154 -27.538462 17.307692 3.358683e+07
std 160814.226454 204458.574605 13.637242 3107.834789 1370.556526 1569.117631 28.814393 12.565725 5.716988e+07
min 198.000000 290.000000 -61.000000 133.000000 4.000000 3.000000 -58.000000 1.000000 3.114600e+05
25% 3537.000000 5834.000000 -48.000000 331.000000 29.000000 64.000000 -50.000000 10.000000 3.493624e+06
50% 8999.000000 21531.000000 -41.000000 636.000000 96.000000 121.000000 -42.000000 13.000000 1.808557e+07
75% 25974.000000 55235.000000 -32.000000 1537.000000 863.000000 1043.000000 -13.000000 23.000000 3.373056e+07
max 576463.000000 741844.000000 -17.000000 10040.000000 5051.000000 5814.000000 33.000000 45.000000 2.150561e+08
In [57]:
Out[57]: Cases in the last 7 days Cases in the preceding 7 days Weekly Case % Change Cases in the last 7 days/1M pop Deaths in the last 7 days Deaths in the preceding 7 days Weekly Death % Change Deaths in the last 7 days/1M pop Population
Cases in the last 7 days 1.000000 1.000000 0.580000 0.330000 0.960000 0.950000 0.310000 0.370000 0.920000
Cases in the preceding 7 days 1.000000 1.000000 0.550000 0.310000 0.970000 0.960000 0.310000 0.370000 0.930000
Weekly Case % Change 0.580000 0.550000 1.000000 0.600000 0.420000 0.410000 0.490000 0.360000 0.400000
Cases in the last 7 days/1M pop 0.330000 0.310000 0.600000 1.000000 0.150000 0.100000 0.510000 0.760000 0.030000
Deaths in the last 7 days 0.960000 0.970000 0.420000 0.150000 1.000000 1.000000 0.260000 0.330000 0.970000
Deaths in the preceding 7 days 0.950000 0.960000 0.410000 0.100000 1.000000 1.000000 0.210000 0.280000 0.990000
Weekly Death % Change 0.310000 0.310000 0.490000 0.510000 0.260000 0.210000 1.000000 0.640000 0.120000
Deaths in the last 7 days/1M pop 0.370000 0.370000 0.360000 0.760000 0.330000 0.280000 0.640000 1.000000 0.150000
Population 0.920000 0.930000 0.400000 0.030000 0.970000 0.990000 0.120000 0.150000 1.000000
In [ ]: