Advanced Statistics Jupyter File PDF
A research laboratory was developing a new compound for the relief of severe cases of hay fever. In an
experiment with 36 volunteers, the amounts of the two active ingredients (A & B) in the compound were
varied at three levels each. Randomization was used in assigning four volunteers to each of the nine
treatments. The data on hours of relief can be found in the following .csv file: Fever.csv
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import os
%matplotlib inline
In [2]:
df=pd.read_csv("C:\\Users\\Shubham\\Downloads\\Fever-1.csv")
df.head()
Out[2]:
A B Volunteer Relief
0 1 1 1 2.4
1 1 1 2 2.7
2 1 1 3 2.3
3 1 1 4 2.5
4 1 2 1 4.6
In [3]:
df.columns
Out[3]:
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 36 non-null int64
1 B 36 non-null int64
2 Volunteer 36 non-null int64
3 Relief 36 non-null float64
dtypes: float64(1), int64(3)
memory usage: 1.2 KB
In [5]:
df.describe()
Out[5]:
A B Volunteer Relief
In [6]:
df.isnull().sum()
Out[6]:
A 0
B 0
Volunteer 0
Relief 0
dtype: int64
In [7]:
df["A"].unique()
Out[7]:
In [8]:
df["B"].unique()
Out[8]:
In [9]:
df["Volunteer"].unique()
Out[9]:
In [10]:
df["Relief"].unique()
Out[10]:
array([ 2.4, 2.7, 2.3, 2.5, 4.6, 4.2, 4.9, 4.7, 4.8, 4.5, 4.4,
5.8, 5.2, 5.5, 5.3, 8.9, 9.1, 8.7, 9. , 9.3, 9.4, 6.1,
5.7, 5.9, 6.2, 9.9, 10.5, 10.6, 10.1, 13.5, 13. , 13.3, 13.2])
In [11]:
df.shape
Out[11]:
(36, 4)
1.1) State the Null and Alternate Hypothesis for conducting one-way ANOVA for both the variables ‘A’ and ‘B’
individually. [both statement and statistical form like Ho=mu, Ha>mu]
The null and alternate hypotheses for variable 'A' are: Null hypothesis H0: The mean hours of relief are the same at all three levels of active ingredient A. Alternate hypothesis H1: At least one level of active ingredient A has a different mean hours of relief.
H0: µ1 = µ2 = µ3
H1: At least one µi is different, where µ1, µ2, µ3 are the mean hours of relief at levels 1, 2 and 3 of ingredient A.
The null and alternate hypotheses for variable 'B' are: Null hypothesis H0: The mean hours of relief are the same at all three levels of active ingredient B. Alternate hypothesis H1: At least one level of active ingredient B has a different mean hours of relief.
H0: µ1 = µ2 = µ3
H1: At least one µi is different, where µ1, µ2, µ3 are the mean hours of relief at levels 1, 2 and 3 of ingredient B.
1.2) Perform one-way ANOVA for variable ‘A’ with respect to the variable ‘Relief’. State whether the Null
Hypothesis is accepted or rejected based on the ANOVA results.
In [12]:
plt.figure(figsize=(10,5))
sns.boxplot(x="A",y="Relief",data=df)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc1b2baf0>
As seen from the boxplot, there is a very large difference in the mean Relief provided at the different levels of active ingredient A; let's confirm this statistically by comparing the means with a one-way ANOVA.
H0: The mean hours of relief are the same at all three levels of ingredient A
H1: At least one level of ingredient A has a different mean hours of relief
In [13]:
alpha=0.05
formula='Relief~C(A)'
model=ols(formula,df).fit()
aov_table=anova_lm(model)
print(aov_table)
Since the p-value is less than alpha = 0.05, we reject the null hypothesis. At the 95% confidence level we can conclude that the relief provided at the different levels of compound A is not the same.
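As a quick programmatic check (a hypothetical follow-up cell, not part of the original notebook), the p-value can be read off the ANOVA table and compared with alpha directly:
In [ ]:
p_value = aov_table.loc['C(A)', 'PR(>F)']      # p-value of the factor A row
alpha = 0.05
print('Reject H0' if p_value < alpha else 'Fail to reject H0')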
1.3) Perform one-way ANOVA for variable ‘B’ with respect to the variable ‘Relief’. State whether the Null
Hypothesis is accepted or rejected based on the ANOVA results.
In [14]:
plt.figure(figsize=(10,5))
sns.boxplot(x="B",y="Relief",data=df)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc2326d60>
H0: The mean hours of relief are the same at all three levels of ingredient B
H1: At least one level of ingredient B has a different mean hours of relief
In [15]:
formula='Relief~C(B)'
model=ols(formula,df).fit()
aov_table=anova_lm(model)
print(aov_table)
Since the p-value is less than alpha = 0.05, we reject the null hypothesis. So at the 95% confidence level we can conclude that the mean relief is not the same across the levels of ingredient B.
1.4) Analyse the effects of one variable on another with the help of an interaction plot. What is the interaction
between the two treatments?
In [16]:
sns.pointplot(x='A',y='Relief',data=df,hue='B')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc2803dc0>
In [17]:
sns.pointplot(x='A',y='Relief',data=df,hue='B',ci=None)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc287eaf0>
The plot suggests that there is interaction between the levels of ingredient A and ingredient B.
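statsmodels also provides a dedicated interaction plot; a minimal alternative sketch of the same figure (not the author's code):
In [ ]:
from statsmodels.graphics.factorplots import interaction_plot
fig = interaction_plot(x=df['A'], trace=df['B'], response=df['Relief'],
                       colors=['red', 'blue', 'green'])    # one colour per level of B
plt.show()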
1.5) Perform a two-way ANOVA based on the different ingredients (variable ‘A’ & ‘B’ along with their
interaction 'A*B') with the variable 'Relief' and state your results.
H0: There is no interaction between the levels of ingredient A and ingredient B
H1: There is at least some interaction between the levels of ingredient A and ingredient B
In [18]:
#Interaction Effect:
model=ols('Relief~C(A)+C(B)+C(A):C(B)',data=df).fit()
aov_table=anova_lm(model)
print(aov_table)
1.6) Mention the business implications of performing ANOVA for this particular case study.
From the ANOVA results we can conclude that the hours of relief differ significantly across the levels of both ingredients, and the interaction plot indicates that the effect of one ingredient depends on the level of the other. For the business this means the two ingredients should not be tuned in isolation: the laboratory should choose the combination of levels of A and B that delivers the longest relief.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from statsmodels import multivariate
In [19]:
df1=pd.read_csv("C:\\Users\\Shubham\\Downloads\\Education+-+Post+12th+Standard.csv")
df1
Out[19]:
Names Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad ...
0 Abilene Christian University 1660 1232 721 23 52 2885 537
1 Adelphi University 2186 1924 512 16 29 2683 1227
2 Adrian College 1428 1097 336 22 50 1036 99
3 Agnes Scott College 417 349 137 60 89 510 63
4 Alaska Pacific University 193 146 55 16 44 249 869
... ... ... ... ... ... ... ...
772 Worcester State College 2197 1515 543 4 26 3089 2029
773 Xavier University 1959 1805 695 24 47 2849 1107
774 Xavier University of Louisiana 2097 1915 695 34 61 2793 166
775 Yale University 10705 2453 1317 95 99 5217 83
776 York College of Pennsylvania 2989 1855 691 28 63 2988 1726
777 rows x 18 columns (remaining columns not shown)
The dataset "Education+-+Post+12th+Standard" shows the data of different universities and each university
is judged on 17 different attributes.
In [105]:
df1.shape
Out[105]:
(777, 18)
In [21]:
df1.dtypes
Out[21]:
Names object
Apps int64
Accept int64
Enroll int64
Top10perc int64
Top25perc int64
F.Undergrad int64
P.Undergrad int64
Outstate int64
Room.Board int64
Books int64
Personal int64
PhD int64
Terminal int64
S.F.Ratio float64
perc.alumni int64
Expend int64
Grad.Rate int64
dtype: object
We can see that all the attributes are either float or int except for the Names attribute
df1.describe()
Out[22]:
In [123]:
df1["Names"].unique()
Out[123]:
df1.isnull().sum()
Out[24]:
Names 0
Apps 0
Accept 0
Enroll 0
Top10perc 0
Top25perc 0
F.Undergrad 0
P.Undergrad 0
Outstate 0
Room.Board 0
Books 0
Personal 0
PhD 0
Terminal 0
S.F.Ratio 0
perc.alumni 0
Expend 0
Grad.Rate 0
dtype: int64
dups=df1.duplicated().sum()
print("No od duplicated rows present is",dups)
2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. The
inferences drawn from this should be properly documented.
Uni-variate analysis
Uni-variate analysis except Names column
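The cells below index into a pre-built axes grid whose creation line did not survive the export. As a compact, self-contained alternative, the same histogram/boxplot pair can be drawn for every numeric column with a loop (this sketch uses sns.histplot in place of the deprecated sns.distplot):
In [ ]:
num_cols = df1.select_dtypes(include="number").columns          # every column except Names
fig, axes = plt.subplots(len(num_cols), 2, figsize=(14, 4 * len(num_cols)))
for i, col in enumerate(num_cols):
    sns.histplot(df1[col], kde=True, ax=axes[i][0])             # distribution
    axes[i][0].set_title(col + " distribution", fontsize=15)
    sns.boxplot(y=df1[col], ax=axes[i][1])                      # outlier view
    axes[i][1].set_title(col + " boxplot", fontsize=15)
plt.tight_layout()
plt.show()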
In [26]:
a = sns.distplot(df1['Enroll'] , ax=axes[2][0])
a.set_title("Total Enrolled Application Distribution",fontsize=15)
a = sns.boxplot(df1["Enroll"],orient = "v" , ax=axes[2][1])
a.set_title("Total Enrolled Application Distribution",fontsize=15)
a = sns.distplot(df1["Top10perc"] , ax=axes[3][0])
a.set_title("Top 10% Application Distribution",fontsize=15)
a = sns.boxplot(df1["Top10perc"],orient = "v" , ax=axes[3][1])
a.set_title("Top 10% Application Distribution",fontsize=15)
a = sns.distplot(df1["Top25perc"] , ax=axes[4][0])
a.set_title("Top 25% Application Distribution",fontsize=15)
a = sns.boxplot(df1["Top25perc"],orient = "v" , ax=axes[4][1])
a.set_title("Top 25% Application Distribution",fontsize=15)
Out[26]:
In [27]:
a = sns.distplot(df1['P.Undergrad'] , ax=axes[1][0])
a.set_title("Part time graduate Distribution",fontsize=15)
a = sns.boxplot(df1['P.Undergrad'],orient = "v" , ax=axes[1][1])
a.set_title("Part time graduate Distribution",fontsize=15)
a = sns.distplot(df1['Outstate'] , ax=axes[2][0])
a.set_title("Outstate graduate Distribution",fontsize=15)
a = sns.boxplot(df1['Outstate'],orient = "v" , ax=axes[2][1])
a.set_title("Outstate graduate Distribution",fontsize=15)
a = sns.distplot(df1['Room.Board'] , ax=axes[3][0])
a.set_title("Room Boarding Distribution",fontsize=15)
a = sns.boxplot(df1['Room.Board'],orient = "v" , ax=axes[3][1])
a.set_title("Room Boarding Distribution",fontsize=15)
a = sns.distplot(df1['Books'] , ax=axes[4][0])
a.set_title("Books Distribution",fontsize=15)
a = sns.boxplot(df1['Books'],orient = "v" , ax=axes[4][1])
a.set_title("Books Distribution",fontsize=15)
Out[27]:
In [28]:
a = sns.distplot(df1['PhD'] , ax=axes[1][0])
a.set_title("PhD Prof Distribution",fontsize=15)
a = sns.boxplot(df1['PhD'],orient = "v" , ax=axes[1][1])
a.set_title("PhD Proff Distribution",fontsize=15)
a = sns.distplot(df1['Terminal'] , ax=axes[2][0])
a.set_title("Terminal Degree Prof Distribution",fontsize=15)
a = sns.boxplot(df1['Terminal'],orient = "v" , ax=axes[2][1])
a.set_title("Terminal Degree Prof Distribution",fontsize=15)
a = sns.distplot(df1['S.F.Ratio'] , ax=axes[3][0])
a.set_title("students to Prof ratio Distribution",fontsize=15)
a = sns.boxplot(df1['S.F.Ratio'],orient = "v" , ax=axes[3][1])
a.set_title("students to Prof ratio Distribution",fontsize=15)
a = sns.distplot(df1['perc.alumni'] , ax=axes[4][0])
a.set_title("% of alumni who donated Distribution",fontsize=15)
a = sns.boxplot(df1['perc.alumni'],orient = "v" , ax=axes[4][1])
a.set_title("% of alumni who donated Distribution",fontsize=15)
Out[28]:
In [29]:
a = sns.distplot(df1['Grad.Rate'] , ax=axes[1][0])
a.set_title("Graduation rate Distribution",fontsize=15)
a = sns.boxplot(df1['Grad.Rate'],orient = "v" , ax=axes[1][1])
a.set_title("IGraduation rate Distribution",fontsize=15)
Out[29]:
From the plots above it can be noticed that only Apps (number of applications received) does not have any outliers; all the other attributes have outliers present.
Multi-Variate Analysis
In [30]:
df1.corr()
Out[30]:
We can see from the above matrix that many columns are highly correlated with each other. The maximum correlation is between F.Undergrad (full-time undergraduates) and Enroll (number of students enrolled). So this dataset is a good candidate for PCA-based data reduction.
In [31]:
plt.subplots(figsize=(15,15))
sns.heatmap(df1.corr(), annot=True) # plot the correlation coefficients as a heatmap
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc62b9250>
In [32]:
sns.pairplot(df1)
Out[32]:
<seaborn.axisgrid.PairGrid at 0x13fc6045490>
The correlation matrix and pairwise scatterplots indicate high correlation among F.Undergrad, Enroll, Apps, Accept, Top10perc, Top25perc, etc. Such pairs of high and moderate correlation indicate that dimension reduction should be considered for this data.
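A rough helper (hypothetical, not in the original notebook) to list the most strongly correlated variable pairs numerically:
In [ ]:
corr = df1.drop(columns='Names').corr()
pairs = corr.abs().unstack().sort_values(ascending=False)       # all variable pairs by |r|
pairs = pairs[pairs < 1].drop_duplicates()                      # drop the diagonal and mirrored pairs
print(pairs.head(10))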
2.2) Scale the variables and write the inference for using the type of scaling function for this case study.
In [33]:
df2=df1.drop(["Names"],axis=1)
df2
Out[33]:
In [110]:
Out[110]:
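The code of In [110] did not survive the export. A minimal sketch of the z-score scaling that the later outputs imply (the covariance diagonal of 1.00129 = n/(n-1) is consistent with standardised data), reusing the data_new name from the cells below:
In [ ]:
from scipy.stats import zscore
# Standardise every column to mean 0 and unit variance (z-scores), so that variables
# measured in large units (Apps, F.Undergrad, Expend) do not dominate the PCA.
data_new = df2.apply(zscore)
data_new.head()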
In [72]:
data_new.boxplot(figsize=(20,3))
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd4bb8550>
2.3) Comment on the comparison between covariance and the correlation matrix after scaling.
In [114]:
cov_matrix = np.cov(data_new.T)
Covariance Matrix
(17 x 17 covariance matrix of the z-scored data; every diagonal entry equals 1.00128866 = n/(n-1). The full printout appears under question 2.5 below.)
In [74]:
Out[74]:
In [75]:
#With standardisation (without standardisation also, the correlation matrix yields the same result)
data_new.corr()
Out[75]:
The covariance matrix of the scaled data and the correlation matrix (with or without standardisation) yield the same eigenvector and eigenvalue pairs. In other words, after scaling the covariance and correlation matrices contain essentially the same values; the only difference is the n/(n-1) factor used by np.cov, which is why the diagonal reads 1.00129 instead of exactly 1.
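A quick numerical check of this claim (a hypothetical cell, not in the original notebook; it uses a fresh z-scored copy so the comparison is exact up to the n/(n-1) factor):
In [ ]:
scaled = df2.apply(zscore)                    # fresh z-scored copy of the numeric data
cov_scaled = np.cov(scaled.T)                 # np.cov divides by n - 1
corr = scaled.corr().values                   # correlation is unaffected by scaling
n = len(scaled)
print(np.allclose(cov_scaled, corr * n / (n - 1)))   # expected to print True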
2.4) Check the dataset for outliers before and after scaling. Draw your inferences from this exercise.
In [116]:
df2.boxplot(figsize=(20,10))
Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd8ab5a00>
In [117]:
data_new.boxplot(figsize=(20,10))
Out[117]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd8d3be20>
The boxplots before and after scaling show the same outlier pattern: z-score scaling only changes the units on the axes, it does not remove outliers. They are therefore treated below by capping each variable at its IQR fences.
In [38]:
def remove_outlier(col):
    # Return the IQR-based lower and upper fences for a column
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
In [55]:
# Cap every variable at its IQR fences (equivalent to the original column-by-column code)
for col in data_new.columns:
    lratio, uratio = remove_outlier(data_new[col])
    data_new[col] = np.where(data_new[col] > uratio, uratio, data_new[col])
    data_new[col] = np.where(data_new[col] < lratio, lratio, data_new[col])
data_new.boxplot(figsize=(20,10))
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd4d3d400>
2.5) Build the covariance matrix and calculate the eigenvalues and the eigenvector.
In [76]:
cov_matrix = np.cov(data_new.T)
print('Covariance Matrix\n', cov_matrix)
Covariance Matrix
[[ 1.00128866 0.94466636 0.84791332 0.33927032 0.35209304 0.81554018
0.3987775 0.05022367 0.16515151 0.13272942 0.17896117 0.39120081
0.36996762 0.09575627 -0.09034216 0.2599265 0.14694372]
[ 0.94466636 1.00128866 0.91281145 0.19269493 0.24779465 0.87534985
0.44183938 -0.02578774 0.09101577 0.11367165 0.20124767 0.35621633
0.3380184 0.17645611 -0.16019604 0.12487773 0.06739929]
[ 0.84791332 0.91281145 1.00128866 0.18152715 0.2270373 0.96588274
0.51372977 -0.1556777 -0.04028353 0.11285614 0.28129148 0.33189629
0.30867133 0.23757707 -0.18102711 0.06425192 -0.02236983]
[ 0.33927032 0.19269493 0.18152715 1.00128866 0.89314445 0.1414708
-0.10549205 0.5630552 0.37195909 0.1190116 -0.09343665 0.53251337
0.49176793 -0.38537048 0.45607223 0.6617651 0.49562711]
[ 0.35209304 0.24779465 0.2270373 0.89314445 1.00128866 0.19970167
-0.05364569 0.49002449 0.33191707 0.115676 -0.08091441 0.54656564
0.52542506 -0.29500852 0.41840277 0.52812713 0.47789622]
[ 0.81554018 0.87534985 0.96588274 0.1414708 0.19970167 1.00128866
0.57124738 -0.21602002 -0.06897917 0.11569867 0.31760831 0.3187472
0.30040557 0.28006379 -0.22975792 0.01867565 -0.07887464]
[ 0.3987775 0.44183938 0.51372977 -0.10549205 -0.05364569 0.57124738
1.00128866 -0.25383901 -0.06140453 0.08130416 0.32029384 0.14930637
0.14208644 0.23283016 -0.28115421 -0.08367612 -0.25733218]
[ 0.05022367 -0.02578774 -0.1556777 0.5630552 0.49002449 -0.21602002
-0.25383901 1.00128866 0.65509951 0.03890494 -0.29947232 0.38347594
0.40850895 -0.55553625 0.56699214 0.6736456 0.57202613]
[ 0.16515151 0.09101577 -0.04028353 0.37195909 0.33191707 -0.06897917
-0.06140453 0.65509951 1.00128866 0.12812787 -0.19968518 0.32962651
0.3750222 -0.36309504 0.27271444 0.50238599 0.42548915]
[ 0.13272942 0.11367165 0.11285614 0.1190116 0.115676 0.11569867
0.08130416 0.03890494 0.12812787 1.00128866 0.17952581 0.0269404
0.10008351 -0.03197042 -0.04025955 0.11255393 0.00106226]
[ 0.17896117 0.20124767 0.28129148 -0.09343665 -0.08091441 0.31760831
0.32029384 -0.29947232 -0.19968518 0.17952581 1.00128866 -0.01094989
-0.03065256 0.13652054 -0.2863366 -0.09801804 -0.26969106]
[ 0.39120081 0.35621633 0.33189629 0.53251337 0.54656564 0.3187472
0.14930637 0.38347594 0.32962651 0.0269404 -0.01094989 1.00128866
0.85068186 -0.13069832 0.24932955 0.43331936 0.30543094]
[ 0.36996762 0.3380184 0.30867133 0.49176793 0.52542506 0.30040557
0.14208644 0.40850895 0.3750222 0.10008351 -0.03065256 0.85068186
1.00128866 -0.16031027 0.26747453 0.43936469 0.28990033]
[ 0.09575627 0.17645611 0.23757707 -0.38537048 -0.29500852 0.28006379
0.23283016 -0.55553625 -0.36309504 -0.03197042 0.13652054 -0.13069832
-0.16031027 1.00128866 -0.4034484 -0.5845844 -0.30710565]
[-0.09034216 -0.16019604 -0.18102711 0.45607223 0.41840277 -0.22975792
-0.28115421 0.56699214 0.27271444 -0.04025955 -0.2863366 0.24932955
0.26747453 -0.4034484 1.00128866 0.41825001 0.49153016]
[ 0.2599265 0.12487773 0.06425192 0.6617651 0.52812713 0.01867565
-0.08367612 0.6736456 0.50238599 0.11255393 -0.09801804 0.43331936
0.43936469 -0.5845844 0.41825001 1.00128866 0.39084571]
[ 0.14694372 0.06739929 -0.02236983 0.49562711 0.47789622 -0.07887464
-0.25733218 0.57202613 0.42548915 0.00106226 -0.26969106 0.30543094
0.28990033 -0.30710565 0.49153016 0.39084571 1.00128866]]
In [127]:
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Values\n', eig_vals)
print('Eigen Vectors\n', eig_vecs)
Eigen Values
[5.45052162 4.48360686 1.17466761 1.00820573 0.93423123 0.84849117
0.6057878 0.58787222 0.53061262 0.4043029 0.02302787 0.03672545
0.31344588 0.08802464 0.1439785 0.16779415 0.22061096]
Eigen Vectors
[[-2.48765602e-01 3.31598227e-01 6.30921033e-02 -2.81310530e-01
5.74140964e-03 1.62374420e-02 4.24863486e-02 1.03090398e-01
9.02270802e-02 -5.25098025e-02 3.58970400e-01 -4.59139498e-01
4.30462074e-02 -1.33405806e-01 8.06328039e-02 -5.95830975e-01
2.40709086e-02]
[-2.07601502e-01 3.72116750e-01 1.01249056e-01 -2.67817346e-01
5.57860920e-02 -7.53468452e-03 1.29497196e-02 5.62709623e-02
1.77864814e-01 -4.11400844e-02 -5.43427250e-01 5.18568789e-01
-5.84055850e-02 1.45497511e-01 3.34674281e-02 -2.92642398e-01
-1.45102446e-01]
[-1.76303592e-01 4.03724252e-01 8.29855709e-02 -1.61826771e-01
-5.56936353e-02 4.25579803e-02 2.76928937e-02 -5.86623552e-02
1.28560713e-01 -3.44879147e-02 6.09651110e-01 4.04318439e-01
-6.93988831e-02 -2.95896092e-02 -8.56967180e-02 4.44638207e-01
1.11431545e-02]
[-3.54273947e-01 -8.24118211e-02 -3.50555339e-02 5.15472524e-02
-3.95434345e-01 5.26927980e-02 1.61332069e-01 1.22678028e-01
-3.41099863e-01 -6.40257785e-02 -1.44986329e-01 1.48738723e-01
-8.10481404e-03 -6.97722522e-01 -1.07828189e-01 -1.02303616e-03
3.85543001e-02]
[-3.44001279e-01 -4.47786551e-02 2.41479376e-02 1.09766541e-01
-4.26533594e-01 -3.30915896e-02 1.18485556e-01 1.02491967e-01
-4.03711989e-01 -1.45492289e-02 8.03478445e-02 -5.18683400e-02
-2.73128469e-01 6.17274818e-01 1.51742110e-01 -2.18838802e-02
-8.93515563e-02]
[-1.54640962e-01 4.17673774e-01 6.13929764e-02 -1.00412335e-01
-4.34543659e-02 4.34542349e-02 2.50763629e-02 -7.88896442e-02
5.94419181e-02 -2.08471834e-02 -4.14705279e-01 -5.60363054e-01
-8.11578181e-02 -9.91640992e-03 -5.63728817e-02 5.23622267e-01
5.61767721e-02]
[-2.64425045e-02 3.15087830e-01 -1.39681716e-01 1.58558487e-01
3.02385408e-01 1.91198583e-01 -6.10423460e-02 -5.70783816e-01
-5.60672902e-01 2.23105808e-01 9.01788964e-03 5.27313042e-02
1.00693324e-01 -2.09515982e-02 1.92857500e-02 -1.25997650e-01
-6.35360730e-02]
[-2.94736419e-01 -2.49643522e-01 -4.65988731e-02 -1.31291364e-01
2.22532003e-01 3.00003910e-02 -1.08528966e-01 -9.84599754e-03
4.57332880e-03 -1.86675363e-01 5.08995918e-02 -1.01594830e-01
1.43220673e-01 -3.83544794e-02 -3.40115407e-02 1.41856014e-01
-8.23443779e-01]
[-2.49030449e-01 -1.37808883e-01 -1.48967389e-01 -1.84995991e-01
5.60919470e-01 -1.62755446e-01 -2.09744235e-01 2.21453442e-01
-2.75022548e-01 -2.98324237e-01 1.14639620e-03 2.59293381e-02
-3.59321731e-01 -3.40197083e-03 -5.84289756e-02 6.97485854e-02
3.54559731e-01]
[-6.47575181e-02 5.63418434e-02 -6.77411649e-01 -8.70892205e-02
-1.27288825e-01 -6.41054950e-01 1.49692034e-01 -2.13293009e-01
1.33663353e-01 8.20292186e-02 7.72631963e-04 -2.88282896e-03
3.19400370e-02 9.43887925e-03 -6.68494643e-02 -1.14379958e-02
-2.81593679e-02]
[ 4.25285386e-02 2.19929218e-01 -4.99721120e-01 2.30710568e-01
-2.22311021e-01 3.31398003e-01 -6.33790064e-01 2.32660840e-01
9.44688900e-02 -1.36027616e-01 -1.11433396e-03 1.28904022e-02
-1.85784733e-02 3.09001353e-03 2.75286207e-02 -3.94547417e-02
-3.92640266e-02]
[-3.18312875e-01 5.83113174e-02 1.27028371e-01 5.34724832e-01
2.6) Write the explicit form of the first PC (in terms of Eigen Vectors).
In order to decide which eigenvector(s) can be dropped without losing too much information when constructing a lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data and are the ones that can be dropped. The common approach is therefore to rank the eigenvalues from highest to lowest and choose the top k eigenvectors.
The first PC corresponds to the largest eigenvalue (5.45). Reading off the first column of the eigenvector matrix above, its explicit form in terms of the standardized (z-scored) variables is:
PC1 = -0.249*Apps - 0.208*Accept - 0.176*Enroll - 0.354*Top10perc - 0.344*Top25perc - 0.155*F.Undergrad - 0.026*P.Undergrad - 0.295*Outstate - 0.249*Room.Board - 0.065*Books + 0.043*Personal - 0.318*PhD - ... (the remaining coefficients are in the truncated part of the printout above).
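A short sketch of that ranking step, assuming the eig_vals and eig_vecs computed from the covariance matrix above:
In [ ]:
# Pair each eigenvalue with its eigenvector (columns of eig_vecs), then sort by eigenvalue.
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
first_pc = eig_pairs[0][1]        # eigenvector of the largest eigenvalue -> weights of PC1
print(np.round(first_pc, 3))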
In [79]:
2.7) Discuss the cumulative values of the eigenvalues. How does it help you to decide on the optimum
number of principal components? What do the eigenvectors indicate? Perform PCA and export the data of
the Principal Component scores into a data frame.
After sorting the eigenpairs, the next question is “how many principal components are we going to choose for
our new feature subspace?” A useful measure is the so-called “explained variance,” which can be calculated
from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to
each of the principal components.
In [94]:
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
print(var_exp)
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)
In [85]:
# Plotting the explained variance per component and its cumulative sum
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
The plot above clearly shows that most of the variance (32.02% to be precise) is explained by the first principal component. The second principal component explains 26.34%, while the third and fourth principal components explain 6.9% and 5.9% respectively. Together, the first two principal components contain 58.36% of the information.
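If a cut-off is needed, the cumulative curve can be queried directly; the 80% threshold below is purely illustrative:
In [ ]:
# Smallest number of components whose cumulative explained variance reaches 80%.
k = int(np.argmax(cum_var_exp >= 80) + 1)
print('Components needed for at least 80% of the variance:', k)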
In [135]:
Out[135]:
In [139]:
Out[139]:
Out[141]:
For each PC, the row of length 17 gives the weights with which the corresponding variables need to be
multiplied to get the PC. Note that the weights can be positive or negative.
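The cells that created pca, pc_comps and the loadings table did not come through in the export; a minimal sketch consistent with how those names are used in the next cell:
In [ ]:
pca = PCA(n_components=17)                     # PCA imported from sklearn.decomposition above
pca.fit(data_new)
pc_comps = ['PC' + str(i) for i in range(1, 18)]       # labels PC1 ... PC17
loadings = pd.DataFrame(pca.components_, index=pc_comps, columns=data_new.columns)
np.round(loadings.iloc[:5, :], 2)              # weights of the first five PCs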
In [144]:
# Find PC scores
pc = pca.fit_transform(data_new)
pca_df = pd.DataFrame(pc,columns=pc_comps)
np.round(pca_df.iloc[:6,:5],2)
Out[144]:
In [149]:
round(pca_df.corr(),5)
Out[149]:
(17 x 17 correlation matrix of the PC scores: all diagonal entries are 1.0 and every off-diagonal entry is 0 to five decimal places, confirming that the principal components are mutually uncorrelated.)
Let us now investigate the correlations among the first 5 PCs with the original 17 variables
In [155]:
Out[155]:
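The code of In [155] did not survive the export; a sketch of one way to compute these correlations, assuming data_new and pca_df from the cells above:
In [ ]:
# Correlate each original (scaled) variable with the first five PC scores.
combined = pd.concat([data_new, pca_df.iloc[:, :5]], axis=1)
corr_with_pcs = combined.corr().loc[data_new.columns, pca_df.columns[:5]]
np.round(corr_with_pcs, 2)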
Among the correlations between the PCs and the constituent variables, the following are considerably large:
• PC1 and (Top10perc, Top25perc)
• PC2 and (Enroll, F.Undergrad)
In [ ]: