Project 3

A research laboratory was developing a new compound for the relief of severe cases of hay fever. In an
experiment with 36 volunteers, the amounts of the two active ingredients (A & B) in the compound were
varied at three levels each. Randomization was used in assigning four volunteers to each of the nine
treatments. The data on hours of relief can be found in the following .csv file: Fever.csv

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import os
%matplotlib inline

In [2]:

df=pd.read_csv("C:\\Users\\Shubham\\Downloads\\Fever-1.csv")
df.head()

Out[2]:

A B Volunteer Relief

0 1 1 1 2.4

1 1 1 2 2.7

2 1 1 3 2.3

3 1 1 4 2.5

4 1 2 1 4.6

In [3]:

df.columns

Out[3]:

Index(['A', 'B', 'Volunteer', 'Relief'], dtype='object')

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 36 non-null int64
1 B 36 non-null int64
2 Volunteer 36 non-null int64
3 Relief 36 non-null float64
dtypes: float64(1), int64(3)
memory usage: 1.2 KB


In [5]:

df.describe()

Out[5]:

A B Volunteer Relief

count 36.000000 36.000000 36.000000 36.000000

mean 2.000000 2.000000 2.500000 7.183333

std 0.828079 0.828079 1.133893 3.272090

min 1.000000 1.000000 1.000000 2.300000

25% 1.000000 1.000000 1.750000 4.675000

50% 2.000000 2.000000 2.500000 6.000000

75% 3.000000 3.000000 3.250000 9.325000

max 3.000000 3.000000 4.000000 13.500000

In [6]:

#Checking for null values


df.isnull().sum()

Out[6]:

A 0
B 0
Volunteer 0
Relief 0
dtype: int64

In [7]:

df["A"].unique()

Out[7]:

array([1, 2, 3], dtype=int64)

In [8]:

df["B"].unique()

Out[8]:

array([1, 2, 3], dtype=int64)

In [9]:

df["Volunteer"].unique()

Out[9]:

array([1, 2, 3, 4], dtype=int64)


In [10]:

df["Relief"].unique()

Out[10]:

array([ 2.4, 2.7, 2.3, 2.5, 4.6, 4.2, 4.9, 4.7, 4.8, 4.5, 4.4,
5.8, 5.2, 5.5, 5.3, 8.9, 9.1, 8.7, 9. , 9.3, 9.4, 6.1,
5.7, 5.9, 6.2, 9.9, 10.5, 10.6, 10.1, 13.5, 13. , 13.3, 13.2])

In [11]:

df.shape

Out[11]:

(36, 4)

1.1) State the Null and Alternate Hypothesis for conducting one-way ANOVA for both the variables ‘A’ and ‘B’
individually. [both statement and statistical form like Ho=mu, Ha>mu]

The null and alternate hypotheses for variable 'A' are:
Null hypothesis H0: The mean hours of relief are equal across the three levels of active ingredient A.
Alternate hypothesis H1: The mean hours of relief differ for at least one level of active ingredient A.
H0: µ1 = µ2 = µ3
H1: at least one µi is different, where µ1, µ2, µ3 are the mean hours of relief at levels 1, 2 and 3 of ingredient A.

The null and alternate hypotheses for variable 'B' are:
Null hypothesis H0: The mean hours of relief are equal across the three levels of active ingredient B.
Alternate hypothesis H1: The mean hours of relief differ for at least one level of active ingredient B.
H0: µ1 = µ2 = µ3
H1: at least one µi is different, where µ1, µ2, µ3 are the mean hours of relief at levels 1, 2 and 3 of ingredient B.
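
As a quick cross-check of these hypotheses (a sketch, not part of the original notebook), scipy's f_oneway reports the same F statistic and p-value that the statsmodels ANOVA tables below produce:

# Hypothetical cross-check using scipy; assumes df is the Fever data loaded above.
from scipy.stats import f_oneway

groups_A = [g["Relief"].values for _, g in df.groupby("A")]
groups_B = [g["Relief"].values for _, g in df.groupby("B")]
print("Factor A:", f_oneway(*groups_A))
print("Factor B:", f_oneway(*groups_B))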

1.2) Perform one-way ANOVA for variable ‘A’ with respect to the variable ‘Relief’. State whether the Null
Hypothesis is accepted or rejected based on the ANOVA results.


In [12]:

plt.figure(figsize=(10,5))

sns.boxplot(x="A",y="Relief",data=df)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fc1b2baf0>

As seen from the boxplot, there is a large difference in the mean relief provided by the different levels of ingredient A; let us confirm the difference in means statistically using ANOVA.

Stating Null and Alternate Hypothesis


H0: The mean hours of relief are equal across the three levels of ingredient A.

H1: The mean hours of relief differ for at least one level of ingredient A.

Define significance level

alpha = 0.05 (i.e., a 95% confidence level)

Performing one-way ANOVA between categorical variable A and continuous variable Relief
In [13]:

formula='Relief~C(A)'
model=ols(formula,df).fit()
aov_table=anova_lm(model)
print(aov_table)

df sum_sq mean_sq F PR(>F)


C(A) 2.0 220.02 110.010000 23.465387 4.578242e-07
Residual 33.0 154.71 4.688182 NaN NaN

Since the p-value (4.58e-07) is less than alpha = 0.05, we reject the null hypothesis. At the 95% confidence level we conclude that the relief provided at the different levels of ingredient A is not the same.
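
Since at least one level of A differs, a post-hoc comparison (a sketch, not in the original notebook) can show which pairs of levels differ; Tukey's HSD from statsmodels is the standard follow-up:

# Hypothetical post-hoc step; assumes the usual ANOVA conditions hold.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_A = pairwise_tukeyhsd(endog=df["Relief"], groups=df["A"], alpha=0.05)
print(tukey_A.summary())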

1.3) Perform one-way ANOVA for variable ‘B’ with respect to the variable ‘Relief’. State whether the Null
Hypothesis is accepted or rejected based on the ANOVA results.

In [14]:

plt.figure(figsize=(10,5))

sns.boxplot(x="B",y="Relief",data=df)

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fc2326d60>

Stating Null and Alternate Hypothesis



H0: The mean hours of relief are equal across the three levels of ingredient B.

H1: The mean hours of relief differ for at least one level of ingredient B.

Define significance level

alpha = 0.05 (i.e., a 95% confidence level)

Performing one-way ANOVA between categorical variable B and continuous variable Relief
In [15]:

formula='Relief~C(B)'
model=ols(formula,df).fit()
aov_table=anova_lm(model)
print(aov_table)

df sum_sq mean_sq F PR(>F)


C(B) 2.0 123.66 61.830000 8.126777 0.00135
Residual 33.0 251.07 7.608182 NaN NaN

Since the p-value (0.00135) is less than alpha = 0.05, we reject the null hypothesis. At the 95% confidence level we conclude that the mean relief is not the same across the levels of categorical variable B.

1.4) Analyse the effects of one variable on another with the help of an interaction plot. What is the interaction
between the two treatments?

In [16]:

sns.pointplot(x='A',y='Relief',data=df,hue='B')

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fc2803dc0>


In [17]:

sns.pointplot(x='A',y='Relief',data=df,hue='B',ci=None)

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fc287eaf0>

The lines in the plot are not parallel, which suggests that there is interaction between the levels of ingredient A and ingredient B.
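
An equivalent view (a sketch, not in the original notebook) uses statsmodels' dedicated interaction plot, which draws the same mean-relief profiles of A for each level of B:

# Hypothetical alternative to the seaborn pointplot above.
from statsmodels.graphics.factorplots import interaction_plot

fig = interaction_plot(df["A"], df["B"], df["Relief"],
                       colors=["red", "blue", "green"])
plt.show()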

1.5) Perform a two-way ANOVA based on the different ingredients (variable ‘A’ & ‘B’ along with their
interaction 'A*B') with the variable 'Relief' and state your results.

Formulating the hypotheses of the two-way ANOVA with both "A" & "B" variables with respect to the "Relief" variable

H0: There is no interaction between the levels of ingredient A and ingredient B.

H1: There is an interaction between the levels of ingredient A and ingredient B.

In [18]:

#Interaction Effect:
model=ols('Relief~C(A)+C(B)+C(A):C(B)',data=df).fit()
aov_table=anova_lm(model)
print(aov_table)

df sum_sq mean_sq F PR(>F)


C(A) 2.0 220.020 110.010000 1827.858462 1.514043e-29
C(B) 2.0 123.660 61.830000 1027.329231 3.348751e-26
C(A):C(B) 4.0 29.425 7.356250 122.226923 6.972083e-17
Residual 27.0 1.625 0.060185 NaN NaN


The null hypothesis for the interaction is that "there is no interaction between the levels of ingredient A and ingredient B"; the alternative hypothesis is that "there is interaction". The test statistic is F = 122.2 and the p-value is less than 0.05. Therefore, at the 0.05 alpha level, we reject the null hypothesis and conclude that there is significant interaction between the levels of ingredient A and ingredient B.
It is possible for the interaction to be significant when the main effects are not, so it also makes sense to test the significance of the main effects. The null hypothesis for the main effect of A is that "the responses do not differ by the levels of factor A, while holding constant the levels of factor B and the interactions". The null hypothesis for the main effect of B is that "the responses do not differ by the levels of factor B, while holding constant the levels of factor A and the interactions". The test statistics for the main effects of A and B are F = 1827.9 and F = 1027.3, respectively; the p-values are less than 0.05 for each. We reject both null hypotheses and conclude that the responses differ significantly across the levels of the two ingredients, while holding the other factor and the interactions constant.
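
To quantify how much of the variation in relief each term explains (a sketch, not in the original notebook), eta squared can be read off the ANOVA table as each term's sum of squares divided by the total:

# Assumes aov_table is the two-way ANOVA table computed above.
eta_sq = aov_table["sum_sq"] / aov_table["sum_sq"].sum()
print(eta_sq)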

1.6) Mention the business implications of performing ANOVA for this particular case study.

Both ingredients and their interaction have a significant effect on hours of relief. Because the interaction is significant, the relief obtained from one ingredient depends on the level of the other, so the laboratory should choose the combination of levels of A and B that jointly maximizes hours of relief rather than optimizing each ingredient in isolation.

The dataset Education - Post 12th Standard.csv contains the names of various colleges. This particular case study is based on various parameters of various institutions. You are expected to do Principal Component Analysis for this case study according to the instructions given in the following rubric. The data dictionary of 'Education - Post 12th Standard.csv' can be found in the file Data Dictionary.xlsx.
In [131]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from statsmodels import multivariate


In [19]:

df1=pd.read_csv("C:\\Users\\Shubham\\Downloads\\Education+-+Post+12th+Standard.csv")
df1

Out[19]:

Names Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad ...

0 Abilene Christian University 1660 1232 721 23 52 2885 537 ...
1 Adelphi University 2186 1924 512 16 29 2683 1227 ...
2 Adrian College 1428 1097 336 22 50 1036 99 ...
3 Agnes Scott College 417 349 137 60 89 510 63 ...
4 Alaska Pacific University 193 146 55 16 44 249 869 ...
... ... ... ... ... ... ... ... ...
772 Worcester State College 2197 1515 543 4 26 3089 2029 ...
773 Xavier University 1959 1805 695 24 47 2849 1107 ...
774 Xavier University of Louisiana 2097 1915 695 34 61 2793 166 ...
775 Yale University 10705 2453 1317 95 99 5217 83 ...
776 York College of Pennsylvania 2989 1855 691 28 63 2988 1726 ...

777 rows × 18 columns

The dataset "Education+-+Post+12th+Standard" shows the data of different universities and each university
is judged on 17 different attributes.

In [105]:

df1.shape

Out[105]:

(777, 18)


In [21]:

df1.dtypes

Out[21]:

Names object
Apps int64
Accept int64
Enroll int64
Top10perc int64
Top25perc int64
F.Undergrad int64
P.Undergrad int64
Outstate int64
Room.Board int64
Books int64
Personal int64
PhD int64
Terminal int64
S.F.Ratio float64
perc.alumni int64
Expend int64
Grad.Rate int64
dtype: object

We can see that all the attributes are either float or int, except for the Names attribute.

Summary of the dataset


In [22]:

df1.describe()

Out[22]:

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Un

count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777

mean 3001.638353 2018.804376 779.972973 27.558559 55.796654 3699.907336 855

std 3870.201484 2451.113971 929.176190 17.640364 19.804778 4850.420531 1522

min 81.000000 72.000000 35.000000 1.000000 9.000000 139.000000 1

25% 776.000000 604.000000 242.000000 15.000000 41.000000 992.000000 95

50% 1558.000000 1110.000000 434.000000 23.000000 54.000000 1707.000000 353

75% 3624.000000 2424.000000 902.000000 35.000000 69.000000 4005.000000 967

max 48094.000000 26330.000000 6392.000000 96.000000 100.000000 31643.000000 21836


In [123]:

df1["Names"].unique()


Out[123]:

array(['Abilene Christian University', 'Adelphi University',


'Adrian College', 'Agnes Scott College',
'Alaska Pacific University', 'Albertson College',
'Albertus Magnus College', 'Albion College', 'Albright College',
'Alderson-Broaddus College', 'Alfred University',
'Allegheny College', 'Allentown Coll. of St. Francis de Sales',
'Alma College', 'Alverno College',
'American International College', 'Amherst College',
'Anderson University', 'Andrews University',
'Angelo State University', 'Antioch University',
'Appalachian State University', 'Aquinas College',
'Arizona State University Main campus',
'Arkansas College (Lyon College)', 'Arkansas Tech University',
'Assumption College', 'Auburn University-Main Campus',
'Augsburg College', 'Augustana College IL', 'Augustana College',
'Austin College', 'Averett College', 'Baker University',
'Baldwin-Wallace College', 'Barat College', 'Bard College',
'Barnard College', 'Barry University', 'Baylor University',
'Beaver College', 'Bellarmine College', 'Belmont Abbey College',
'Belmont University', 'Beloit College', 'Bemidji State University',
'Benedictine College', 'Bennington College', 'Bentley College',
'Berry College', 'Bethany College', 'Bethel College KS',
'Bethel College', 'Bethune Cookman College',
'Birmingham-Southern College', 'Blackburn College',
'Bloomsburg Univ. of Pennsylvania', 'Bluefield College',
'Bluffton College', 'Boston University', 'Bowdoin College',
'Bowling Green State University', 'Bradford College',
'Bradley University', 'Brandeis University', 'Brenau University',
'Brewton-Parker College', 'Briar Cliff College',
'Bridgewater College', 'Brigham Young University at Provo',
'Brown University', 'Bryn Mawr College', 'Bucknell University',
'Buena Vista College', 'Butler University', 'Cabrini College',
'Caldwell College', 'California Lutheran University',
'California Polytechnic-San Luis',
'California State University at Fresno', 'Calvin College',
'Campbell University', 'Campbellsville College',
'Canisius College', 'Capital University', 'Capitol College',
'Carleton College', 'Carnegie Mellon University',
'Carroll College', 'Carson-Newman College', 'Carthage College',
'Case Western Reserve University', 'Castleton State College',
'Catawba College', 'Catholic University of America',
'Cazenovia College', 'Cedar Crest College', 'Cedarville College',
'Centenary College', 'Centenary College of Louisiana',
'Center for Creative Studies', 'Central College',
'Central Connecticut State University',
'Central Missouri State University',
'Central Washington University', 'Central Wesleyan College',
'Centre College', 'Chapman University', 'Chatham College',
'Chestnut Hill College', 'Christendom College',
'Christian Brothers University', 'Christopher Newport University',
'Claflin College', 'Claremont McKenna College', 'Clark University',
'Clarke College', 'Clarkson University', 'Clemson University',
'Clinch Valley Coll. of the Univ. of Virginia', 'Coe College',
'Coker College', 'Colby College', 'Colgate University',
'College Misericordia', 'College of Charleston',
'College of Mount St. Joseph', 'College of Mount St. Vincent',
'College of Notre Dame', 'College of Notre Dame of Maryland',
'College of Saint Benedict', 'College of Saint Catherine',
'College of Saint Elizabeth', 'College of Saint Rose',
'College of Santa Fe', 'College of St. Joseph',
'College of St. Scholastica', 'College of the Holy Cross',
'College of William and Mary', 'College of Wooster',
'Colorado College', 'Colorado State University',
'Columbia College MO', 'Columbia College', 'Columbia University',
'Concordia College at St. Paul', 'Concordia Lutheran College',
'Concordia University CA', 'Concordia University',
'Connecticut College', 'Converse College', 'Cornell College',
'Creighton University', 'Culver-Stockton College',
'Cumberland College', "D'Youville College", 'Dana College',
'Daniel Webster College', 'Dartmouth College', 'Davidson College',
'Defiance College', 'Delta State University', 'Denison University',
'DePauw University', 'Dickinson College',
'Dickinson State University', 'Dillard University',
'Doane College', 'Dominican College of Blauvelt', 'Dordt College',
'Dowling College', 'Drake University', 'Drew University',
'Drury College', 'Duke University', 'Earlham College',
'East Carolina University', 'East Tennessee State University',
'East Texas Baptist University', 'Eastern College',
'Eastern Connecticut State University',
'Eastern Illinois University', 'Eastern Mennonite College',
'Eastern Nazarene College', 'Eckerd College',
'Elizabethtown College', 'Elmira College', 'Elms College',
'Elon College', 'Embry Riddle Aeronautical University',
'Emory & Henry College', 'Emory University',
'Emporia State University', 'Erskine College', 'Eureka College',
'Evergreen State College', 'Fairfield University',
'Fayetteville State University', 'Ferrum College',
'Flagler College', 'Florida Institute of Technology',
'Florida International University', 'Florida Southern College',
'Florida State University', 'Fontbonne College',
'Fordham University', 'Fort Lewis College',
'Francis Marion University',
'Franciscan University of Steubenville', 'Franklin College',
'Franklin Pierce College', 'Freed-Hardeman University',
'Fresno Pacific College', 'Furman University', 'Gannon University',
'Gardner Webb University', 'Geneva College', 'George Fox College',
'George Mason University', 'George Washington University',
'Georgetown College', 'Georgetown University',
'Georgia Institute of Technology', 'Georgia State University',
'Georgian Court College', 'Gettysburg College',
'Goldey Beacom College', 'Gonzaga University', 'Gordon College',
'Goshen College', 'Goucher College', 'Grace College and Seminary',
'Graceland College', 'Grand Valley State University',
'Green Mountain College', 'Greensboro College',
'Greenville College', 'Grinnell College', 'Grove City College',
'Guilford College', 'Gustavus Adolphus College',
'Gwynedd Mercy College', 'Hamilton College', 'Hamline University',
'Hampden - Sydney College', 'Hampton University',
'Hanover College', 'Hardin-Simmons University',
'Harding University', 'Hartwick College', 'Harvard University',
'Harvey Mudd College', 'Hastings College', 'Hendrix College',
'Hillsdale College', 'Hiram College',
'Hobart and William Smith Colleges', 'Hofstra University',
'Hollins College', 'Hood College', 'Hope College',
'Houghton College', 'Huntingdon College', 'Huntington College',
'Huron University', 'Husson College',
'Illinois Benedictine College', 'Illinois College',
'Illinois Institute of Technology', 'Illinois State University',
'Illinois Wesleyan University', 'Immaculata College',
'Incarnate Word College', 'Indiana State University',
'Indiana University at Bloomington', 'Indiana Wesleyan University',
'Iona College', 'Iowa State University', 'Ithaca College',
'James Madison University', 'Jamestown College',
'Jersey City State College', 'John Brown University',
'John Carroll University', 'Johns Hopkins University',
'Johnson State College', 'Judson College', 'Juniata College',
'Kansas State University', 'Kansas Wesleyan University',
'Keene State College', 'Kentucky Wesleyan College',
'Kenyon College', 'Keuka College', "King's College",
'King College', 'Knox College', 'La Roche College',
'La Salle University', 'Lafayette College', 'LaGrange College',
'Lake Forest College', 'Lakeland College', 'Lamar University',
'Lambuth University', 'Lander University', 'Lawrence University',
'Le Moyne College', 'Lebanon Valley College', 'Lehigh University',
'Lenoir-Rhyne College', 'Lesley College', 'LeTourneau University',
'Lewis and Clark College', 'Lewis University',
'Lincoln Memorial University', 'Lincoln University',
'Lindenwood College', 'Linfield College', 'Livingstone College',
'Lock Haven University of Pennsylvania', 'Longwood College',
'Loras College', 'Louisiana College',
'Louisiana State University at Baton Rouge',
'Louisiana Tech University', 'Loyola College',
'Loyola Marymount University', 'Loyola University',
'Loyola University Chicago', 'Luther College', 'Lycoming College',
'Lynchburg College', 'Lyndon State College', 'Macalester College',
'MacMurray College', 'Malone College', 'Manchester College',
'Manhattan College', 'Manhattanville College',
'Mankato State University', 'Marian College of Fond du Lac',
'Marietta College', 'Marist College', 'Marquette University',
'Marshall University', 'Mary Baldwin College',
'Mary Washington College', 'Marymount College Tarrytown',
'Marymount Manhattan College', 'Marymount University',
'Maryville College', 'Maryville University', 'Marywood College',
'Massachusetts Institute of Technology',
'Mayville State University', 'McKendree College',
'McMurry University', 'McPherson College', 'Mercer University',
'Mercyhurst College', 'Meredith College', 'Merrimack College',
'Mesa State College', 'Messiah College',
'Miami University at Oxford', 'Michigan State University',
'Michigan Technological University', 'MidAmerica Nazarene College',
'Millersville University of Penn.', 'Milligan College',
'Millikin University', 'Millsaps College',
'Milwaukee School of Engineering', 'Mississippi College',
'Mississippi State University', 'Mississippi University for Women',
'Missouri Southern State College', 'Missouri Valley College',
'Monmouth College IL', 'Monmouth College',
'Montana College of Mineral Sci. & Tech.',
'Montana State University', 'Montclair State University',
'Montreat-Anderson College', 'Moorhead State University',
'Moravian College', 'Morehouse College', 'Morningside College',
'Morris College', 'Mount Holyoke College', 'Mount Marty College',
'Mount Mary College', 'Mount Mercy College',
'Mount Saint Clare College', "Mount Saint Mary's College",
'Mount Saint Mary College', "Mount St. Mary's College",
'Mount Union College', 'Mount Vernon Nazarene College',
'Muhlenberg College', 'Murray State University',
'Muskingum College', 'National-Louis University',
'Nazareth College of Rochester',
'New Jersey Institute of Technology',
'New Mexico Institute of Mining and Tech.', 'New York University',
'Newberry College', 'Niagara University',
'North Adams State College',
'North Carolina A. & T. State University',
'North Carolina State University at Raleigh',
'North Carolina Wesleyan College', 'North Central College',
'North Dakota State University', 'North Park College',
'Northeast Missouri State University', 'Northeastern University',
'Northern Arizona University', 'Northern Illinois University',
'Northwest Missouri State University',
'Northwest Nazarene College', 'Northwestern College',
'Northwestern University', 'Norwich University',
'Notre Dame College', 'Oakland University', 'Oberlin College',
'Occidental College', 'Oglethorpe University',
'Ohio Northern University', 'Ohio University',
'Ohio Wesleyan University', 'Oklahoma Baptist University',
'Oklahoma Christian University', 'Oklahoma State University',
'Otterbein College', 'Ouachita Baptist University',
'Our Lady of the Lake University', 'Pace University',
'Pacific Lutheran University', 'Pacific Union College',
'Pacific University', 'Pembroke State University',
'Pennsylvania State Univ. Main Campus', 'Pepperdine University',
'Peru State College', 'Pfeiffer College',
'Philadelphia Coll. of Textiles and Sci.', 'Phillips University',
'Piedmont College', 'Pikeville College', 'Pitzer College',
'Point Loma Nazarene College', 'Point Park College',
'Polytechnic University', 'Prairie View A. and M. University',
'Presbyterian College', 'Princeton University',
'Providence College', 'Purdue University at West Lafayette',
'Queens College', 'Quincy University', 'Quinnipiac College',
'Radford University', 'Ramapo College of New Jersey',
'Randolph-Macon College', "Randolph-Macon Woman's College",
'Reed College', 'Regis College',
'Rensselaer Polytechnic Institute', 'Rhodes College',
'Rider University', 'Ripon College', 'Rivier College',
'Roanoke College', 'Rockhurst College', 'Rocky Mountain College',
'Roger Williams University', 'Rollins College', 'Rosary College',
'Rowan College of New Jersey', 'Rutgers at New Brunswick',
'Rutgers State University at Camden',
'Rutgers State University at Newark', 'Sacred Heart University',
'Saint Ambrose University', 'Saint Anselm College',
'Saint Cloud State University', 'Saint Francis College IN',
'Saint Francis College', "Saint John's University",
"Saint Joseph's College IN", "Saint Joseph's College",
"Saint Joseph's University", 'Saint Joseph College',
'Saint Louis University', "Saint Mary's College",
"Saint Mary's College of Minnesota",
'Saint Mary-of-the-Woods College', "Saint Michael's College",
'Saint Olaf College', "Saint Peter's College",
'Saint Vincent College', 'Saint Xavier University',
'Salem-Teikyo University', 'Salem College',
'Salisbury State University', 'Samford University',
'San Diego State University', 'Santa Clara University',
'Sarah Lawrence College', 'Savannah Coll. of Art and Design',
'Schreiner College', 'Scripps College',
'Seattle Pacific University', 'Seattle University',
'Seton Hall University', 'Seton Hill College',
'Shippensburg University of Penn.', 'Shorter College',
'Siena College', 'Siena Heights College', 'Simmons College',
'Simpson College', 'Sioux Falls College', 'Skidmore College',
'Smith College', 'South Dakota State University',
'Southeast Missouri State University',
'Southeastern Oklahoma State Univ.', 'Southern California College',
'Southern Illinois University at Edwardsville',
'Southern Methodist University', 'Southwest Baptist University',
'Southwest Missouri State University',
'Southwest State University', 'Southwestern Adventist College',
'Southwestern College', 'Southwestern University',
'Spalding University', 'Spelman College', 'Spring Arbor College',
'St. Bonaventure University', "St. John's College",
'St. John Fisher College', 'St. Lawrence University',
"St. Martin's College", "St. Mary's College of California",
"St. Mary's College of Maryland",
"St. Mary's University of San Antonio", 'St. Norbert College',
"St. Paul's College", 'St. Thomas Aquinas College',
'Stephens College', 'Stetson University',
'Stevens Institute of Technology',
'Stockton College of New Jersey', 'Stonehill College',
'SUNY at Albany', 'SUNY at Binghamton', 'SUNY at Buffalo',
'SUNY at Stony Brook', 'SUNY College at Brockport',
'SUNY College at Oswego', 'SUNY College at Buffalo',
'SUNY College at Cortland', 'SUNY College at Fredonia',
'SUNY College at Geneseo', 'SUNY College at New Paltz',
'SUNY College at Plattsburgh', 'SUNY College at Potsdam',
'SUNY College at Purchase', 'Susquehanna University',
'Sweet Briar College', 'Syracuse University', 'Tabor College',
'Talladega College', 'Taylor University',
'Tennessee Wesleyan College', 'Texas A&M Univ. at College Station',
'Texas A&M University at Galveston', 'Texas Christian University',
'Texas Lutheran College', 'Texas Southern University',
'Texas Wesleyan University', 'The Citadel', 'Thiel College',
'Tiffin University', 'Transylvania University',
'Trenton State College', 'Tri-State University',
'Trinity College CT', 'Trinity College DC', 'Trinity College VT',
'Trinity University', 'Tulane University', 'Tusculum College',
'Tuskegee University', 'Union College KY', 'Union College NY',
'Univ. of Wisconsin at OshKosh',
'University of Alabama at Birmingham',
'University of Arkansas at Fayetteville',
'University of California at Berkeley',
'University of California at Irvine',
'University of Central Florida', 'University of Charleston',
'University of Chicago', 'University of Cincinnati',
'University of Connecticut at Storrs', 'University of Dallas',
'University of Dayton', 'University of Delaware',
'University of Denver', 'University of Detroit Mercy',
'University of Dubuque', 'University of Evansville',
'University of Florida', 'University of Georgia',
'University of Hartford', 'University of Hawaii at Manoa',
'University of Illinois - Urbana',
'University of Illinois at Chicago', 'University of Indianapolis',
'University of Kansas', 'University of La Verne',
'University of Louisville', 'University of Maine at Farmington',
'University of Maine at Machias',
'University of Maine at Presque Isle',
'University of Maryland at Baltimore County',
'University of Maryland at College Park',
'University of Massachusetts at Amherst',
'University of Massachusetts at Dartmouth', 'University of Miami',
'University of Michigan at Ann Arbor',
'University of Minnesota at Duluth',
'University of Minnesota at Morris',
'University of Minnesota Twin Cities', 'University of Mississippi',
'University of Missouri at Columbia',
'University of Missouri at Rolla',
'University of Missouri at Saint Louis', 'University of Mobile',
'University of Montevallo', 'University of Nebraska at Lincoln',
'University of New England', 'University of New Hampshire',
'University of North Carolina at Asheville',
'University of North Carolina at Chapel Hill',
'University of North Carolina at Charlotte',
'University of North Carolina at Greensboro',
'University of North Carolina at Wilmington',
'University of North Dakota', 'University of North Florida',
'University of North Texas', 'University of Northern Colorado',
'University of Northern Iowa', 'University of Notre Dame',
'University of Oklahoma', 'University of Oregon',
'University of Pennsylvania',
'University of Pittsburgh-Main Campus', 'University of Portland',
'University of Puget Sound', 'University of Rhode Island',
'University of Richmond', 'University of Rochester',
'University of San Diego', 'University of San Francisco',
'University of Sci. and Arts of Oklahoma',
'University of Scranton', 'University of South Carolina at Aiken',
'University of South Carolina at Columbia',
'University of South Florida', 'University of Southern California',
'University of Southern Colorado',
'University of Southern Indiana',
'University of Southern Mississippi',
'University of St. Thomas MN', 'University of St. Thomas TX',
'University of Tennessee at Knoxville',
'University of Texas at Arlington',
'University of Texas at Austin',
'University of Texas at San Antonio', 'University of the Arts',
'University of the Pacific', 'University of the South',
'University of Tulsa', 'University of Utah',
'University of Vermont', 'University of Virginia',
'University of Washington', 'University of West Florida',
'University of Wisconsin-Stout',
'University of Wisconsin-Superior',
'University of Wisconsin-Whitewater',
'University of Wisconsin at Green Bay',
'University of Wisconsin at Madison',
'University of Wisconsin at Milwaukee', 'University of Wyoming',
'Upper Iowa University', 'Ursinus College', 'Ursuline College',
'Valley City State University', 'Valparaiso University',
'Vanderbilt University', 'Vassar College', 'Villanova University',
'Virginia Commonwealth University', 'Virginia State University',
'Virginia Tech', 'Virginia Union University',
'Virginia Wesleyan College', 'Viterbo College', 'Voorhees College',
'Wabash College', 'Wagner College', 'Wake Forest University',
'Walsh University', 'Warren Wilson College', 'Wartburg College',
'Washington and Jefferson College',
'Washington and Lee University', 'Washington College',
'Washington State University', 'Washington University',
'Wayne State College', 'Waynesburg College', 'Webber College',
'Webster University', 'Wellesley College', 'Wells College',
'Wentworth Institute of Technology', 'Wesley College',
'Wesleyan University', 'West Chester University of Penn.',
'West Liberty State College', 'West Virginia Wesleyan College',
'Western Carolina University', 'Western Maryland College',
'Western Michigan University', 'Western New England College',
'Western State College of Colorado',
'Western Washington University', 'Westfield State College',
'Westminster College MO', 'Westminster College',
'Westminster College of Salt Lake City', 'Westmont College',
'Wheaton College IL', 'Westminster College PA',
'Wheeling Jesuit College', 'Whitman College', 'Whittier College',
'Whitworth College', 'Widener University', 'Wilkes University',
'Willamette University', 'William Jewell College',
'William Woods University', 'Williams College', 'Wilson College',
'Wingate College', 'Winona State University',
'Winthrop University', 'Wisconsin Lutheran College',
'Wittenberg University', 'Wofford College',
'Worcester Polytechnic Institute', 'Worcester State College',
'Xavier University', 'Xavier University of Louisiana',
'Yale University', 'York College of Pennsylvania'], dtype=object)

Check for null values


In [24]:

df1.isnull().sum()

Out[24]:

Names 0
Apps 0
Accept 0
Enroll 0
Top10perc 0
Top25perc 0
F.Undergrad 0
P.Undergrad 0
Outstate 0
Room.Board 0
Books 0
Personal 0
PhD 0
Terminal 0
S.F.Ratio 0
perc.alumni 0
Expend 0
Grad.Rate 0
dtype: int64

There are no null values present in the dataset.

Checks for Duplicates


In [25]:

dups=df1.duplicated().sum()
print("No of duplicated rows present is",dups)

No of duplicated rows present is 0

2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. The
inferences drawn from this should be properly documented.


Uni-variate analysis (for all columns except Names)


In [26]:

fig, axes = plt.subplots(nrows=5,ncols=2)


fig.set_size_inches(12, 14)
a = sns.distplot(df1['Apps'] , ax=axes[0][0])
a.set_title("Application Distribution",fontsize=15)
a = sns.boxplot(df1['Apps'],orient = "v" , ax=axes[0][1])
a.set_title("Application Distribution",fontsize=15)
fig.tight_layout(pad=3.0)
a = sns.distplot(df1['Accept'] , ax=axes[1][0])
a.set_title("AcceptedApplication Distribution",fontsize=15)
a = sns.boxplot(df1['Accept'],orient = "v" , ax=axes[1][1])
a.set_title("Accepted Application Distribution",fontsize=15)

a = sns.distplot(df1['Enroll'] , ax=axes[2][0])
a.set_title("Total Enrolled Application Distribution",fontsize=15)
a = sns.boxplot(df1["Enroll"],orient = "v" , ax=axes[2][1])
a.set_title("Total Enrolled Application Distribution",fontsize=15)

a = sns.distplot(df1["Top10perc"] , ax=axes[3][0])
a.set_title("Top 10% Application Distribution",fontsize=15)
a = sns.boxplot(df1["Top10perc"],orient = "v" , ax=axes[3][1])
a.set_title("Top 10% Application Distribution",fontsize=15)

a = sns.distplot(df1["Top25perc"] , ax=axes[4][0])
a.set_title("Top 25% Application Distribution",fontsize=15)
a = sns.boxplot(df1["Top25perc"],orient = "v" , ax=axes[4][1])
a.set_title("Top 25% Application Distribution",fontsize=15)


Out[26]:

Text(0.5, 1.0, 'Top 25% Application Distribution')

[Figure: distribution and boxplot panels for Apps, Accept, Enroll, Top10perc and Top25perc]
In [27]:

fig, axes = plt.subplots(nrows=5,ncols=2)


fig.set_size_inches(12, 14)
a = sns.distplot(df1['F.Undergrad'] , ax=axes[0][0])
a.set_title("Full time graduate Distribution",fontsize=15)
a = sns.boxplot(df1['F.Undergrad'],orient = "v" , ax=axes[0][1])
a.set_title("Full time graduate Distribution",fontsize=15)
fig.tight_layout(pad=3.0)

a = sns.distplot(df1['P.Undergrad'] , ax=axes[1][0])
a.set_title("Part time graduate Distribution",fontsize=15)
a = sns.boxplot(df1['P.Undergrad'],orient = "v" , ax=axes[1][1])
a.set_title("Part time graduate Distribution",fontsize=15)

a = sns.distplot(df1['Outstate'] , ax=axes[2][0])
a.set_title("Outstate graduate Distribution",fontsize=15)
a = sns.boxplot(df1['Outstate'],orient = "v" , ax=axes[2][1])
a.set_title("Outstate graduate Distribution",fontsize=15)

a = sns.distplot(df1['Room.Board'] , ax=axes[3][0])
a.set_title("Room Boarding Distribution",fontsize=15)
a = sns.boxplot(df1['Room.Board'],orient = "v" , ax=axes[3][1])
a.set_title("Room Boarding Distribution",fontsize=15)

a = sns.distplot(df1['Books'] , ax=axes[4][0])
a.set_title("Books Distribution",fontsize=15)
a = sns.boxplot(df1['Books'],orient = "v" , ax=axes[4][1])
a.set_title("Books Distribution",fontsize=15)


Out[27]:

Text(0.5, 1.0, 'Books Distribution')

[Figure: distribution and boxplot panels for F.Undergrad, P.Undergrad, Outstate, Room.Board and Books]

In [28]:

fig, axes = plt.subplots(nrows=5,ncols=2)


fig.set_size_inches(12, 14)
a = sns.distplot(df1['Personal'] , ax=axes[0][0])
a.set_title("Personal Expenditure for students Distribution",fontsize=15)
a = sns.boxplot(df1['Personal'],orient = "v" , ax=axes[0][1])
a.set_title("Personal Expenditure for students Distribution",fontsize=15)
fig.tight_layout(pad=3.0)

a = sns.distplot(df1['PhD'] , ax=axes[1][0])
a.set_title("PhD Prof Distribution",fontsize=15)
a = sns.boxplot(df1['PhD'],orient = "v" , ax=axes[1][1])
a.set_title("PhD Proff Distribution",fontsize=15)

a = sns.distplot(df1['Terminal'] , ax=axes[2][0])
a.set_title("Terminal Degree Prof Distribution",fontsize=15)
a = sns.boxplot(df1['Terminal'],orient = "v" , ax=axes[2][1])
a.set_title("Terminal Degree Prof Distribution",fontsize=15)

a = sns.distplot(df1['S.F.Ratio'] , ax=axes[3][0])
a.set_title("students to Prof ratio Distribution",fontsize=15)
a = sns.boxplot(df1['S.F.Ratio'],orient = "v" , ax=axes[3][1])
a.set_title("students to Prof ratio Distribution",fontsize=15)

a = sns.distplot(df1['perc.alumni'] , ax=axes[4][0])
a.set_title("% of alumni who donated Distribution",fontsize=15)
a = sns.boxplot(df1['perc.alumni'],orient = "v" , ax=axes[4][1])
a.set_title("% of alumni who donated Distribution",fontsize=15)


Out[28]:

Text(0.5, 1.0, '% of alumni who donated Distribution')

[Figure: distribution and boxplot panels for Personal, PhD, Terminal, S.F.Ratio and perc.alumni]

In [29]:

fig, axes = plt.subplots(nrows=2,ncols=2)


fig.set_size_inches(12,7)
a = sns.distplot(df1['Expend'] , ax=axes[0][0])
a.set_title("Instructional expend per student Distribution",fontsize=15)
a = sns.boxplot(df1['Expend'],orient = "v" , ax=axes[0][1])
a.set_title("Instructional expend per student Distribution",fontsize=15)
fig.tight_layout(pad=3.0)

a = sns.distplot(df1['Grad.Rate'] , ax=axes[1][0])
a.set_title("Graduation rate Distribution",fontsize=15)
a = sns.boxplot(df1['Grad.Rate'],orient = "v" , ax=axes[1][1])
a.set_title("IGraduation rate Distribution",fontsize=15)

Out[29]:

Text(0.5, 1.0, 'IGraduation rate Distribution')

From the plots above it can be noticed that almost all the attributes have outliers present ('Top25perc', for example, has none, since its minimum of 9 and maximum of 100 lie within the 1.5×IQR whiskers).

Multi-Variate Analysis


In [30]:

# Check for correlation of variable


df1.corr(method='pearson')

Out[30]:

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergra

Apps 1.000000 0.943451 0.846822 0.338834 0.351640 0.814491 0.39826

Accept 0.943451 1.000000 0.911637 0.192447 0.247476 0.874223 0.44127

Enroll 0.846822 0.911637 1.000000 0.181294 0.226745 0.964640 0.51306

Top10perc 0.338834 0.192447 0.181294 1.000000 0.891995 0.141289 -0.10535

Top25perc 0.351640 0.247476 0.226745 0.891995 1.000000 0.199445 -0.05357

F.Undergrad 0.814491 0.874223 0.964640 0.141289 0.199445 1.000000 0.57051

P.Undergrad 0.398264 0.441271 0.513069 -0.105356 -0.053577 0.570512 1.00000

Outstate 0.050159 -0.025755 -0.155477 0.562331 0.489394 -0.215742 -0.25351

Room.Board 0.164939 0.090899 -0.040232 0.371480 0.331490 -0.068890 -0.06132

Books 0.132559 0.113525 0.112711 0.118858 0.115527 0.115550 0.08120

Personal 0.178731 0.200989 0.280929 -0.093316 -0.080810 0.317200 0.31988

PhD 0.390697 0.355758 0.331469 0.531828 0.545862 0.318337 0.14911

Terminal 0.369491 0.337583 0.308274 0.491135 0.524749 0.300019 0.14190

S.F.Ratio 0.095633 0.176229 0.237271 -0.384875 -0.294629 0.279703 0.23253

perc.alumni -0.090226 -0.159990 -0.180794 0.455485 0.417864 -0.229462 -0.28079

Expend 0.259592 0.124717 0.064169 0.660913 0.527447 0.018652 -0.08356

Grad.Rate 0.146755 0.067313 -0.022341 0.494989 0.477281 -0.078773 -0.25700

We can see from the above matrix that many columns are highly correlated with each other. The maximum correlation (0.96) is between 'F.Undergrad' (full-time undergraduates) and 'Enroll' (students enrolled), so this dataset is a good candidate for PCA-based dimension reduction.
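
A quick programmatic check (a sketch, not in the original notebook) lists the most correlated attribute pairs, confirming the candidates for dimension reduction:

# Rank attribute pairs by absolute correlation; each pair appears twice in the
# unstacked matrix with the same value, so keep every second entry after sorting.
corr_pairs = df1.corr(method='pearson').abs().unstack().sort_values(ascending=False)
corr_pairs = corr_pairs[corr_pairs < 1.0]  # drop the diagonal self-correlations
print(corr_pairs[::2].head(10))            # top pairs, each listed once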


In [31]:

plt.subplots(figsize=(15,15))
sns.heatmap(df1.corr(), annot=True) # plot the correlation coefficients as a heatmap

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fc62b9250>

[Figure: correlation heatmap of all attributes]

In [32]:

#Let us check for pair plots


sns.pairplot(df1,diag_kind='kde')

Out[32]:

<seaborn.axisgrid.PairGrid at 0x13fc6045490>

The correlation matrix and pairwise scatterplots indicate high correlation among F.Undergrad, Enroll, Apps, Accept, Top10perc, Top25perc, etc. Such pairs of high and moderate correlation indicate that dimension reduction should be considered for this data.

2.2) Scale the variables and write the inference for using the type of scaling function for this case study.


In [33]:

df2=df1.drop(["Names"],axis=1)
df2

Out[33]:

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room

0 1660 1232 721 23 52 2885 537 7440

1 2186 1924 512 16 29 2683 1227 12280

2 1428 1097 336 22 50 1036 99 11250

3 417 349 137 60 89 510 63 12960

4 193 146 55 16 44 249 869 7560

... ... ... ... ... ... ... ... ...

772 2197 1515 543 4 26 3089 2029 6797

773 1959 1805 695 24 47 2849 1107 11520

774 2097 1915 695 34 61 2793 166 6900

775 10705 2453 1317 95 99 5217 83 19840

776 2989 1855 691 28 63 2988 1726 4990

777 rows × 17 columns

In [110]:

# Scaling the data using Z-score


# The variables are on very different scales, hence scaling is required before PCA.
# Standardization (scipy's zscore and sklearn's StandardScaler both standardize
# with the population standard deviation, ddof=0, and give identical results).
from scipy.stats import zscore
data_new=df2.apply(zscore)

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(df2)

data_new.to_excel("C:\\Users\\Shubham\\Downloads\\file.xlsx", index = False)


data_new.head()

Out[110]:

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstat

0 -0.346882 -0.321205 -0.063509 -0.258583 -0.191827 -0.168116 -0.209207 -0.74635

1 -0.210884 -0.038703 -0.288584 -0.655656 -1.353911 -0.209788 0.244307 0.45749

2 -0.406866 -0.376318 -0.478121 -0.315307 -0.292878 -0.549565 -0.497090 0.20130

3 -0.668261 -0.681682 -0.692427 1.840231 1.677612 -0.658079 -0.520752 0.62663

4 -0.726176 -0.764555 -0.780735 -0.655656 -0.596031 -0.711924 0.009005 -0.71650


In [72]:

data_new.boxplot(figsize=(20,3))

Out[72]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fd4bb8550>

2.3) Comment on the comparison between covariance and the correlation matrix after scaling.


In [114]:

# Step 1 - Create covariance matrix

cov_matrix = np.cov(data_new.T)

print('Covariance Matrix \n', cov_matrix)


Covariance Matrix
[[ 1.00128866 0.94466636 0.84791332 0.33927032 0.35209304 0.81554018
0.3987775 0.05022367 0.16515151 0.13272942 0.17896117 0.39120081
0.36996762 0.09575627 -0.09034216 0.2599265 0.14694372]
[ 0.94466636 1.00128866 0.91281145 0.19269493 0.24779465 0.87534985
0.44183938 -0.02578774 0.09101577 0.11367165 0.20124767 0.35621633
0.3380184 0.17645611 -0.16019604 0.12487773 0.06739929]
[ 0.84791332 0.91281145 1.00128866 0.18152715 0.2270373 0.96588274
0.51372977 -0.1556777 -0.04028353 0.11285614 0.28129148 0.33189629
0.30867133 0.23757707 -0.18102711 0.06425192 -0.02236983]
[ 0.33927032 0.19269493 0.18152715 1.00128866 0.89314445 0.1414708
-0.10549205 0.5630552 0.37195909 0.1190116 -0.09343665 0.53251337
0.49176793 -0.38537048 0.45607223 0.6617651 0.49562711]
[ 0.35209304 0.24779465 0.2270373 0.89314445 1.00128866 0.19970167
-0.05364569 0.49002449 0.33191707 0.115676 -0.08091441 0.54656564
0.52542506 -0.29500852 0.41840277 0.52812713 0.47789622]
[ 0.81554018 0.87534985 0.96588274 0.1414708 0.19970167 1.00128866
0.57124738 -0.21602002 -0.06897917 0.11569867 0.31760831 0.3187472
0.30040557 0.28006379 -0.22975792 0.01867565 -0.07887464]
[ 0.3987775 0.44183938 0.51372977 -0.10549205 -0.05364569 0.57124738
1.00128866 -0.25383901 -0.06140453 0.08130416 0.32029384 0.14930637
0.14208644 0.23283016 -0.28115421 -0.08367612 -0.25733218]
[ 0.05022367 -0.02578774 -0.1556777 0.5630552 0.49002449 -0.21602002
-0.25383901 1.00128866 0.65509951 0.03890494 -0.29947232 0.38347594
0.40850895 -0.55553625 0.56699214 0.6736456 0.57202613]
[ 0.16515151 0.09101577 -0.04028353 0.37195909 0.33191707 -0.06897917
-0.06140453 0.65509951 1.00128866 0.12812787 -0.19968518 0.32962651
0.3750222 -0.36309504 0.27271444 0.50238599 0.42548915]
[ 0.13272942 0.11367165 0.11285614 0.1190116 0.115676 0.11569867
0.08130416 0.03890494 0.12812787 1.00128866 0.17952581 0.0269404
0.10008351 -0.03197042 -0.04025955 0.11255393 0.00106226]
[ 0.17896117 0.20124767 0.28129148 -0.09343665 -0.08091441 0.31760831
0.32029384 -0.29947232 -0.19968518 0.17952581 1.00128866 -0.01094989
-0.03065256 0.13652054 -0.2863366 -0.09801804 -0.26969106]
[ 0.39120081 0.35621633 0.33189629 0.53251337 0.54656564 0.3187472
0.14930637 0.38347594 0.32962651 0.0269404 -0.01094989 1.00128866
0.85068186 -0.13069832 0.24932955 0.43331936 0.30543094]
[ 0.36996762 0.3380184 0.30867133 0.49176793 0.52542506 0.30040557
0.14208644 0.40850895 0.3750222 0.10008351 -0.03065256 0.85068186
1.00128866 -0.16031027 0.26747453 0.43936469 0.28990033]
[ 0.09575627 0.17645611 0.23757707 -0.38537048 -0.29500852 0.28006379
0.23283016 -0.55553625 -0.36309504 -0.03197042 0.13652054 -0.13069832
-0.16031027 1.00128866 -0.4034484 -0.5845844 -0.30710565]
[-0.09034216 -0.16019604 -0.18102711 0.45607223 0.41840277 -0.22975792
-0.28115421 0.56699214 0.27271444 -0.04025955 -0.2863366 0.24932955
0.26747453 -0.4034484 1.00128866 0.41825001 0.49153016]
[ 0.2599265 0.12487773 0.06425192 0.6617651 0.52812713 0.01867565
-0.08367612 0.6736456 0.50238599 0.11255393 -0.09801804 0.43331936
0.43936469 -0.5845844 0.41825001 1.00128866 0.39084571]
[ 0.14694372 0.06739929 -0.02236983 0.49562711 0.47789622 -0.07887464
-0.25733218 0.57202613 0.42548915 0.00106226 -0.26969106 0.30543094
0.28990033 -0.30710565 0.49153016 0.39084571 1.00128866]]

Comparing the Correlation and Covariance Matrices


In [74]:

# Now, without scaling, let us check the correlation matrix


df_corr =df2.copy()
df_corr.corr()

Out[74]:

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergra

Apps 1.000000 0.943451 0.846822 0.338834 0.351640 0.814491 0.39826

Accept 0.943451 1.000000 0.911637 0.192447 0.247476 0.874223 0.44127

Enroll 0.846822 0.911637 1.000000 0.181294 0.226745 0.964640 0.51306

Top10perc 0.338834 0.192447 0.181294 1.000000 0.891995 0.141289 -0.10535

Top25perc 0.351640 0.247476 0.226745 0.891995 1.000000 0.199445 -0.05357

F.Undergrad 0.814491 0.874223 0.964640 0.141289 0.199445 1.000000 0.57051

P.Undergrad 0.398264 0.441271 0.513069 -0.105356 -0.053577 0.570512 1.00000

Outstate 0.050159 -0.025755 -0.155477 0.562331 0.489394 -0.215742 -0.25351

Room.Board 0.164939 0.090899 -0.040232 0.371480 0.331490 -0.068890 -0.06132

Books 0.132559 0.113525 0.112711 0.118858 0.115527 0.115550 0.08120

Personal 0.178731 0.200989 0.280929 -0.093316 -0.080810 0.317200 0.31988

PhD 0.390697 0.355758 0.331469 0.531828 0.545862 0.318337 0.14911

Terminal 0.369491 0.337583 0.308274 0.491135 0.524749 0.300019 0.14190

S.F.Ratio 0.095633 0.176229 0.237271 -0.384875 -0.294629 0.279703 0.23253

perc.alumni -0.090226 -0.159990 -0.180794 0.455485 0.417864 -0.229462 -0.28079

Expend 0.259592 0.124717 0.064169 0.660913 0.527447 0.018652 -0.08356

Grad.Rate 0.146755 0.067313 -0.022341 0.494989 0.477281 -0.078773 -0.25700


In [75]:

# With standardisation (without standardisation also, the correlation matrix yields the same result)
data_new.corr()

Out[75]:

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergra

Apps 1.000000 0.943451 0.846822 0.338834 0.351640 0.814491 0.39826

Accept 0.943451 1.000000 0.911637 0.192447 0.247476 0.874223 0.44127

Enroll 0.846822 0.911637 1.000000 0.181294 0.226745 0.964640 0.51306

Top10perc 0.338834 0.192447 0.181294 1.000000 0.891995 0.141289 -0.10535

Top25perc 0.351640 0.247476 0.226745 0.891995 1.000000 0.199445 -0.05357

F.Undergrad 0.814491 0.874223 0.964640 0.141289 0.199445 1.000000 0.57051

P.Undergrad 0.398264 0.441271 0.513069 -0.105356 -0.053577 0.570512 1.00000

Outstate 0.050159 -0.025755 -0.155477 0.562331 0.489394 -0.215742 -0.25351

Room.Board 0.164939 0.090899 -0.040232 0.371480 0.331490 -0.068890 -0.06132

Books 0.132559 0.113525 0.112711 0.118858 0.115527 0.115550 0.08120

Personal 0.178731 0.200989 0.280929 -0.093316 -0.080810 0.317200 0.31988

PhD 0.390697 0.355758 0.331469 0.531828 0.545862 0.318337 0.14911

Terminal 0.369491 0.337583 0.308274 0.491135 0.524749 0.300019 0.14190

S.F.Ratio 0.095633 0.176229 0.237271 -0.384875 -0.294629 0.279703 0.23253

perc.alumni -0.090226 -0.159990 -0.180794 0.455485 0.417864 -0.229462 -0.28079

Expend 0.259592 0.124717 0.064169 0.660913 0.527447 0.018652 -0.08356

Grad.Rate 0.146755 0.067313 -0.022341 0.494989 0.477281 -0.078773 -0.25700

We can state that the following three approaches yield the same eigenvector and eigenvalue pairs:

1. Eigen decomposition of the covariance matrix after standardizing the data.

2. Eigen decomposition of the correlation matrix.

3. Eigen decomposition of the correlation matrix after standardizing the data.

Finally, we can say that after scaling, the covariance and the correlation matrices have the same values.
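
A numerical confirmation of this (a sketch, not in the original notebook): zscore standardizes with the population standard deviation (ddof=0) while np.cov divides by n-1, which is why the diagonal of the covariance matrix above reads 1.00128866 (= 777/776) rather than exactly 1. Rescaling by (n-1)/n recovers the correlation matrix:

# Assumes data_new is the z-scored data and df2 the unscaled numeric data from above.
n = len(data_new)
print(np.allclose(np.cov(data_new.T) * (n - 1) / n, df2.corr().values))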

2.4) Check the dataset for outliers before and after scaling. Draw your inferences from this exercise.

Outliers before scaling:


In [116]:

df2.boxplot(figsize=(20,10))

Out[116]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fd8ab5a00>

Outliers after scaling:


In [117]:

data_new.boxplot(figsize=(20,10))

Out[117]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fd8d3be20>


In [38]:

def remove_outlier(col):
    # IQR-based whisker limits: values outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range


In [55]:

# Cap every column at its IQR whisker limits (the same treatment the original
# applied column by column, written as a loop)
for col in data_new.columns:
    lratio, uratio = remove_outlier(data_new[col])
    data_new[col] = np.where(data_new[col] > uratio, uratio, data_new[col])
    data_new[col] = np.where(data_new[col] < lratio, lratio, data_new[col])

Data after outlier treatment:


In [57]:

data_new.boxplot(figsize=(20,10))

Out[57]:

<matplotlib.axes._subplots.AxesSubplot at 0x13fd4d3d400>

The boxplot above shows the result after outlier treatment: the extreme values have been capped at the whisker limits.

2.5) Build the covariance matrix and calculate the eigenvalues and the eigenvector.

In [76]:

# Covariance matrix of the scaled features (transposed, since np.cov
# expects variables as rows)
cov_matrix = np.cov(data_new.T)
print('Covariance Matrix\n', cov_matrix)

Covariance Matrix
 [[ 1.00128866 0.94466636 0.84791332 0.33927032 0.35209304 0.81554018
0.3987775 0.05022367 0.16515151 0.13272942 0.17896117 0.39120081
0.36996762 0.09575627 -0.09034216 0.2599265 0.14694372]
[ 0.94466636 1.00128866 0.91281145 0.19269493 0.24779465 0.87534985
0.44183938 -0.02578774 0.09101577 0.11367165 0.20124767 0.35621633
0.3380184 0.17645611 -0.16019604 0.12487773 0.06739929]
[ 0.84791332 0.91281145 1.00128866 0.18152715 0.2270373 0.96588274
0.51372977 -0.1556777 -0.04028353 0.11285614 0.28129148 0.33189629
0.30867133 0.23757707 -0.18102711 0.06425192 -0.02236983]
[ 0.33927032 0.19269493 0.18152715 1.00128866 0.89314445 0.1414708
-0.10549205 0.5630552 0.37195909 0.1190116 -0.09343665 0.53251337
0.49176793 -0.38537048 0.45607223 0.6617651 0.49562711]
[ 0.35209304 0.24779465 0.2270373 0.89314445 1.00128866 0.19970167
-0.05364569 0.49002449 0.33191707 0.115676 -0.08091441 0.54656564
0.52542506 -0.29500852 0.41840277 0.52812713 0.47789622]
[ 0.81554018 0.87534985 0.96588274 0.1414708 0.19970167 1.00128866
0.57124738 -0.21602002 -0.06897917 0.11569867 0.31760831 0.3187472
0.30040557 0.28006379 -0.22975792 0.01867565 -0.07887464]
[ 0.3987775 0.44183938 0.51372977 -0.10549205 -0.05364569 0.57124738
1.00128866 -0.25383901 -0.06140453 0.08130416 0.32029384 0.14930637
0.14208644 0.23283016 -0.28115421 -0.08367612 -0.25733218]
[ 0.05022367 -0.02578774 -0.1556777 0.5630552 0.49002449 -0.21602002
-0.25383901 1.00128866 0.65509951 0.03890494 -0.29947232 0.38347594
0.40850895 -0.55553625 0.56699214 0.6736456 0.57202613]
[ 0.16515151 0.09101577 -0.04028353 0.37195909 0.33191707 -0.06897917
-0.06140453 0.65509951 1.00128866 0.12812787 -0.19968518 0.32962651
0.3750222 -0.36309504 0.27271444 0.50238599 0.42548915]
[ 0.13272942 0.11367165 0.11285614 0.1190116 0.115676 0.11569867
0.08130416 0.03890494 0.12812787 1.00128866 0.17952581 0.0269404
0.10008351 -0.03197042 -0.04025955 0.11255393 0.00106226]
[ 0.17896117 0.20124767 0.28129148 -0.09343665 -0.08091441 0.31760831
0.32029384 -0.29947232 -0.19968518 0.17952581 1.00128866 -0.01094989
-0.03065256 0.13652054 -0.2863366 -0.09801804 -0.26969106]
[ 0.39120081 0.35621633 0.33189629 0.53251337 0.54656564 0.3187472
0.14930637 0.38347594 0.32962651 0.0269404 -0.01094989 1.00128866
0.85068186 -0.13069832 0.24932955 0.43331936 0.30543094]
[ 0.36996762 0.3380184 0.30867133 0.49176793 0.52542506 0.30040557
0.14208644 0.40850895 0.3750222 0.10008351 -0.03065256 0.85068186
1.00128866 -0.16031027 0.26747453 0.43936469 0.28990033]
[ 0.09575627 0.17645611 0.23757707 -0.38537048 -0.29500852 0.28006379
0.23283016 -0.55553625 -0.36309504 -0.03197042 0.13652054 -0.13069832
-0.16031027 1.00128866 -0.4034484 -0.5845844 -0.30710565]
[-0.09034216 -0.16019604 -0.18102711 0.45607223 0.41840277 -0.22975792
-0.28115421 0.56699214 0.27271444 -0.04025955 -0.2863366 0.24932955
0.26747453 -0.4034484 1.00128866 0.41825001 0.49153016]
[ 0.2599265 0.12487773 0.06425192 0.6617651 0.52812713 0.01867565
-0.08367612 0.6736456 0.50238599 0.11255393 -0.09801804 0.43331936
0.43936469 -0.5845844 0.41825001 1.00128866 0.39084571]
[ 0.14694372 0.06739929 -0.02236983 0.49562711 0.47789622 -0.07887464
-0.25733218 0.57202613 0.42548915 0.00106226 -0.26969106 0.30543094
0.28990033 -0.30710565 0.49153016 0.39084571 1.00128866]]
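
A side note on this output: every diagonal entry is 1.00128866 rather than exactly 1. A minimal sketch of why, assuming the features were z-scaled with the population standard deviation (ddof=0) earlier, while np.cov applies the sample (ddof=1) correction:

In [ ]:

# The z-scaling divides by the population std (ddof=0), while np.cov
# uses the sample covariance (ddof=1), so every entry is inflated by
# the Bessel factor n/(n-1).
n = len(data_new)        # number of observations
print(n / (n - 1))       # -> 1.00128866... for n = 777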


In [127]:

# Compute the eigenvalues and eigenvectors of the covariance matrix
# (the .T is dropped: the covariance matrix is symmetric)
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

print('\nEigen Values\n', eig_vals)
print('Eigen Vectors\n', eig_vecs)


Eigen Values
 [5.45052162 4.48360686 1.17466761 1.00820573 0.93423123 0.84849117
0.6057878 0.58787222 0.53061262 0.4043029 0.02302787 0.03672545
0.31344588 0.08802464 0.1439785 0.16779415 0.22061096]
Eigen Vectors
 [[-2.48765602e-01 3.31598227e-01 6.30921033e-02 -2.81310530e-01
5.74140964e-03 1.62374420e-02 4.24863486e-02 1.03090398e-01
9.02270802e-02 -5.25098025e-02 3.58970400e-01 -4.59139498e-01
4.30462074e-02 -1.33405806e-01 8.06328039e-02 -5.95830975e-01
2.40709086e-02]
[-2.07601502e-01 3.72116750e-01 1.01249056e-01 -2.67817346e-01
5.57860920e-02 -7.53468452e-03 1.29497196e-02 5.62709623e-02
1.77864814e-01 -4.11400844e-02 -5.43427250e-01 5.18568789e-01
-5.84055850e-02 1.45497511e-01 3.34674281e-02 -2.92642398e-01
-1.45102446e-01]
[-1.76303592e-01 4.03724252e-01 8.29855709e-02 -1.61826771e-01
-5.56936353e-02 4.25579803e-02 2.76928937e-02 -5.86623552e-02
1.28560713e-01 -3.44879147e-02 6.09651110e-01 4.04318439e-01
-6.93988831e-02 -2.95896092e-02 -8.56967180e-02 4.44638207e-01
1.11431545e-02]
[-3.54273947e-01 -8.24118211e-02 -3.50555339e-02 5.15472524e-02
-3.95434345e-01 5.26927980e-02 1.61332069e-01 1.22678028e-01
-3.41099863e-01 -6.40257785e-02 -1.44986329e-01 1.48738723e-01
-8.10481404e-03 -6.97722522e-01 -1.07828189e-01 -1.02303616e-03
3.85543001e-02]
[-3.44001279e-01 -4.47786551e-02 2.41479376e-02 1.09766541e-01
-4.26533594e-01 -3.30915896e-02 1.18485556e-01 1.02491967e-01
-4.03711989e-01 -1.45492289e-02 8.03478445e-02 -5.18683400e-02
-2.73128469e-01 6.17274818e-01 1.51742110e-01 -2.18838802e-02
-8.93515563e-02]
[-1.54640962e-01 4.17673774e-01 6.13929764e-02 -1.00412335e-01
-4.34543659e-02 4.34542349e-02 2.50763629e-02 -7.88896442e-02
5.94419181e-02 -2.08471834e-02 -4.14705279e-01 -5.60363054e-01
-8.11578181e-02 -9.91640992e-03 -5.63728817e-02 5.23622267e-01
5.61767721e-02]
[-2.64425045e-02 3.15087830e-01 -1.39681716e-01 1.58558487e-01
3.02385408e-01 1.91198583e-01 -6.10423460e-02 -5.70783816e-01
-5.60672902e-01 2.23105808e-01 9.01788964e-03 5.27313042e-02
1.00693324e-01 -2.09515982e-02 1.92857500e-02 -1.25997650e-01
-6.35360730e-02]
[-2.94736419e-01 -2.49643522e-01 -4.65988731e-02 -1.31291364e-01
2.22532003e-01 3.00003910e-02 -1.08528966e-01 -9.84599754e-03
4.57332880e-03 -1.86675363e-01 5.08995918e-02 -1.01594830e-01
1.43220673e-01 -3.83544794e-02 -3.40115407e-02 1.41856014e-01
-8.23443779e-01]
[-2.49030449e-01 -1.37808883e-01 -1.48967389e-01 -1.84995991e-01
5.60919470e-01 -1.62755446e-01 -2.09744235e-01 2.21453442e-01
-2.75022548e-01 -2.98324237e-01 1.14639620e-03 2.59293381e-02
-3.59321731e-01 -3.40197083e-03 -5.84289756e-02 6.97485854e-02
3.54559731e-01]
[-6.47575181e-02 5.63418434e-02 -6.77411649e-01 -8.70892205e-02
-1.27288825e-01 -6.41054950e-01 1.49692034e-01 -2.13293009e-01
1.33663353e-01 8.20292186e-02 7.72631963e-04 -2.88282896e-03
3.19400370e-02 9.43887925e-03 -6.68494643e-02 -1.14379958e-02
-2.81593679e-02]
[ 4.25285386e-02 2.19929218e-01 -4.99721120e-01 2.30710568e-01
-2.22311021e-01 3.31398003e-01 -6.33790064e-01 2.32660840e-01
9.44688900e-02 -1.36027616e-01 -1.11433396e-03 1.28904022e-02
-1.85784733e-02 3.09001353e-03 2.75286207e-02 -3.94547417e-02
-3.92640266e-02]
 [-3.18312875e-01 5.83113174e-02 1.27028371e-01 5.34724832e-01
 1.40166326e-01 -9.12555212e-02 1.09641298e-03 7.70400002e-02
 1.85181525e-01 1.23452200e-01 1.38133366e-02 -2.98075465e-02
 4.03723253e-02 1.12055599e-01 -6.91126145e-01 -1.27696382e-01
 2.32224316e-02]
[-3.17056016e-01 4.64294477e-02 6.60375454e-02 5.19443019e-01
2.04719730e-01 -1.54927646e-01 2.84770105e-02 1.21613297e-02
2.54938198e-01 8.85784627e-02 6.20932749e-03 2.70759809e-02
-5.89734026e-02 -1.58909651e-01 6.71008607e-01 5.83134662e-02
1.64850420e-02]
[ 1.76957895e-01 2.46665277e-01 2.89848401e-01 1.61189487e-01
-7.93882496e-02 -4.87045875e-01 -2.19259358e-01 8.36048735e-02
-2.74544380e-01 -4.72045249e-01 -2.22215182e-03 2.12476294e-02
4.45000727e-01 2.08991284e-02 4.13740967e-02 1.77152700e-02
-1.10262122e-02]
[-2.05082369e-01 -2.46595274e-01 1.46989274e-01 -1.73142230e-02
-2.16297411e-01 4.73400144e-02 -2.43321156e-01 -6.78523654e-01
2.55334907e-01 -4.22999706e-01 -1.91869743e-02 -3.33406243e-03
-1.30727978e-01 8.41789410e-03 -2.71542091e-02 -1.04088088e-01
1.82660654e-01]
[-3.18908750e-01 -1.31689865e-01 -2.26743985e-01 -7.92734946e-02
7.59581203e-02 2.98118619e-01 2.26584481e-01 5.41593771e-02
4.91388809e-02 -1.32286331e-01 -3.53098218e-02 4.38803230e-02
6.92088870e-01 2.27742017e-01 7.31225166e-02 9.37464497e-02
3.25982295e-01]
[-2.52315654e-01 -1.69240532e-01 2.08064649e-01 -2.69129066e-01
-1.09267913e-01 -2.16163313e-01 -5.59943937e-01 5.33553891e-03
-4.19043052e-02 5.90271067e-01 -1.30710024e-02 5.00844705e-03
2.19839000e-01 3.39433604e-03 3.64767385e-02 6.91969778e-02
1.22106697e-01]]

We can see that only four eigenvalues are greater than 1 (5.45, 4.48, 1.17 and 1.01).
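
This count can be verified directly from the eigenvalues computed above (a minimal check, not part of the original run):

In [ ]:

# Kaiser-style count: how many eigenvalues exceed 1
print(np.sum(eig_vals > 1))   # -> 4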

2.6) Write the explicit form of the first PC (in terms of Eigen Vectors).

To decide which eigenvectors can be dropped without losing too much information when constructing the lower-dimensional subspace, we inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, so those are the ones that can be dropped. The common approach is therefore to rank the eigenvalues from highest to lowest and choose the top k eigenvectors. The explicit form of the first PC is written out in the sketch below.
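
As a sketch of the explicit form asked for in 2.6: the first PC is the linear combination of the scaled variables weighted by the entries of the eigenvector of the largest eigenvalue. The sign of an eigenvector returned by np.linalg.eig is arbitrary, so it is flipped here to agree with the pca.components_ output shown further below.

In [ ]:

# Explicit form of PC1 as a weighted sum of the scaled variables.
# eig_vals[0] = 5.45 is the largest eigenvalue, so its eigenvector is
# the first column; the sign is flipped to match pca.components_[0].
w1 = -eig_vecs[:, 0]
print('PC1 =', ' '.join(f'{w:+.2f}*{c}' for w, c in zip(w1, data_new.columns)))
# -> PC1 = +0.25*Apps +0.21*Accept +0.18*Enroll +0.35*Top10perc ...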


In [79]:

# Make a list of (eigenvalue, eigenvector) tuples


eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low


eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues


print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

Eigenvalues in descending order:


5.450521622150296
4.483606861940848
1.174667612947485
1.0082057299695018
0.9342312255505819
0.8484911715045005
0.6057878032794005
0.5878722195930824
0.530612624700581
0.4043028977516895
0.31344587981029803
0.22061096461638874
0.16779415216580862
0.14397849747566208
0.08802463699454342
0.03672544741045181
0.02302786863373005

2.7) Discuss the cumulative values of the eigenvalues. How does it help you to decide on the optimum
number of principal components? What do the eigenvectors indicate? Perform PCA and export the data of
the Principal Component scores into a data frame.

After sorting the eigenpairs, the next question is “how many principal components are we going to choose for
our new feature subspace?” A useful measure is the so-called “explained variance,” which can be calculated
from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to
each of the principal components.


In [94]:

# Percentage of variance explained by each component, largest first
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
print(var_exp)

# Cumulative explained variance
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)

[32.02062819886918, 26.340214436112486, 6.900916554222489, 5.922989222926289,
 5.488405110358481, 4.984700954557441, 3.558871491746649, 3.4536213369992588,
 3.1172336798217195, 2.375191525893793, 1.8414263209386883, 1.2960414001235347,
 0.9857541228001174, 0.8458423350830024, 0.5171255833731979,
 0.21575401007275807, 0.13528371610095027]
[ 32.0206282 58.36084263 65.26175919 71.18474841 76.67315352
81.65785448 85.21672597 88.67034731 91.78758099 94.16277251
96.00419883 97.30024023 98.28599436 99.13183669 99.64896227
99.86471628 100. ]

In [85]:

# Plotting the individual and cumulative explained variance
plt.figure(figsize=(10, 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha=0.5, align='center',
        label='Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid',
         label='Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

The plot above shows that most of the variance (32.02%, to be precise) is explained by the first principal component. The second principal component explains 26.34%, while the third and fourth explain 6.9% and 5.9% respectively. Together, the first two principal components contain 58.36% of the information.
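
If a variance threshold is the selection criterion, the optimum number of components follows directly from the cumulative values printed above; a minimal sketch, assuming a 90% target:

In [ ]:

# Smallest number of components whose cumulative explained variance
# reaches 90% (cum_var_exp holds percentages, largest PCs first)
k = int(np.argmax(cum_var_exp >= 90)) + 1
print(k)   # -> 9, since the 9th cumulative value is 91.79%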


In [135]:

from sklearn.decomposition import PCA   # needed if not imported earlier

pca = PCA(n_components=17)
pca.fit_transform(data_new)

pc_comps = ['PC' + str(i) for i in range(1, 18)]   # 'PC1' ... 'PC17'
prop_var = np.round(pca.explained_variance_ratio_, 2)
std_dev = np.round(np.sqrt(pca.explained_variance_), 2)
cum_var = np.round(np.cumsum(pca.explained_variance_ratio_), 2)
temp = pd.DataFrame(pc_comps, columns=['PCs'])
temp['Proportion Of Variance'] = prop_var
temp['Standard Deviation'] = std_dev
temp['Cumulative Proportion'] = cum_var
temp

Out[135]:

PCs Proportion Of Variance Standard Deviation Cumulative Proportion

0 PC1 0.32 2.33 0.32

1 PC2 0.26 2.12 0.58

2 PC3 0.07 1.08 0.65

3 PC4 0.06 1.00 0.71

4 PC5 0.05 0.97 0.77

5 PC6 0.05 0.92 0.82

6 PC7 0.04 0.78 0.85

7 PC8 0.03 0.77 0.89

8 PC9 0.03 0.73 0.92

9 PC10 0.02 0.64 0.94

10 PC11 0.02 0.56 0.96

11 PC12 0.01 0.47 0.97

12 PC13 0.01 0.41 0.98

13 PC14 0.01 0.38 0.99

14 PC15 0.01 0.30 1.00

15 PC16 0.00 0.19 1.00

16 PC17 0.00 0.15 1.00


In [139]:

# Obtain the screeplot


plt.figure(figsize=(10,5))
plt.plot(temp['Proportion Of Variance'],marker = 'o')
plt.xticks(np.arange(0,17),labels=np.arange(1,18))
plt.xlabel('# of principal components')
plt.ylabel('Proportion of variance explained')

Out[139]:

Text(0, 0.5, 'Proportion of variance explained')

Structure of the principal components and the PC scores


In [141]:

# Print first 5 PCs
pc_df_pcafunc = pd.DataFrame(np.round(pca.components_, 2), index=pc_comps,
                             columns=data_new.columns)
pc_df_pcafunc.head(5)

Out[141]:

Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room

PC1 0.25 0.21 0.18 0.35 0.34 0.15 0.03 0.29

PC2 0.33 0.37 0.40 -0.08 -0.04 0.42 0.32 -0.25

PC3 -0.06 -0.10 -0.08 0.04 -0.02 -0.06 0.14 0.05

PC4 0.28 0.27 0.16 -0.05 -0.11 0.10 -0.16 0.13

PC5 0.01 0.06 -0.06 -0.40 -0.43 -0.04 0.30 0.22


For each PC, the corresponding row of length 17 gives the weights by which the variables must be multiplied to obtain that PC. Note that the weights can be positive or negative. The sketch below verifies this on one observation.
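
A minimal sanity check of that statement on a single observation (a sketch; pca.mean_ is subtracted because sklearn centres the data before projecting):

In [ ]:

# Recompute the PC1 score of the first row by hand: centre the row,
# then take its weighted sum with the PC1 loadings.
first_obs = data_new.iloc[0].values
manual_pc1 = (first_obs - pca.mean_) @ pca.components_[0]
print(np.round(manual_pc1, 2))   # matches the PC1 score of row 0 below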

In [144]:

# Find PC scores
pc = pca.fit_transform(data_new)
pca_df = pd.DataFrame(pc,columns=pc_comps)
np.round(pca_df.iloc[:6,:5],2)

Out[144]:

PC1 PC2 PC3 PC4 PC5

0 -1.59 0.77 -0.10 -0.92 -0.74

1 -2.19 -0.58 2.28 3.59 1.06

2 -1.43 -1.09 -0.44 0.68 -0.37

3 2.86 -2.63 0.14 -1.30 -0.18

4 -2.21 0.02 2.39 -1.11 0.68

5 -0.57 -1.50 0.02 0.07 -0.38

To check that the PCs are orthogonal, the correlation matrix of the PC scores is computed:


In [149]:

round(pca_df.corr(),5)

Out[149]:

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14

PC1 1.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0 0.0 -0.0 0.0 0.0 0.0 0.0

PC2 0.0 1.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0

PC3 -0.0 0.0 1.0 0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 0.0

PC4 -0.0 0.0 0.0 1.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0

PC5 0.0 -0.0 0.0 0.0 1.0 0.0 -0.0 0.0 -0.0 0.0 0.0 0.0 -0.0 -0.0

PC6 -0.0 -0.0 0.0 0.0 0.0 1.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 -0.0 0.0

PC7 0.0 -0.0 -0.0 -0.0 -0.0 0.0 1.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0 0.0

PC8 -0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 1.0 0.0 0.0 -0.0 -0.0 -0.0 0.0

PC9 0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 1.0 0.0 -0.0 0.0 0.0 0.0

PC10 -0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0 1.0 -0.0 0.0 -0.0 0.0

PC11 0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0 -0.0 1.0 -0.0 -0.0 0.0

PC12 0.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 1.0 -0.0 0.0

PC13 0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 1.0 -0.0

PC14 0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.0 1.0

PC15 -0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0

PC16 0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0

PC17 0.0 -0.0 -0.0 0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 0.0 0.0 -0.0 -0.0

Let us now investigate the correlations between the first 5 PCs and the original 17 variables.


In [155]:

# Correlations between the original variables (rows) and the first 5 PCs
result = pd.concat((data_new, pca_df), axis=1).corr()
np.round(result.iloc[0:17, 17:22], 2)

Out[155]:

PC1 PC2 PC3 PC4 PC5

Apps 0.58 0.70 -0.07 0.28 0.01

Accept 0.48 0.79 -0.11 0.27 0.05

Enroll 0.41 0.85 -0.09 0.16 -0.05

Top10perc 0.83 -0.17 0.04 -0.05 -0.38

Top25perc 0.80 -0.09 -0.03 -0.11 -0.41

F.Undergrad 0.36 0.88 -0.07 0.10 -0.04

P.Undergrad 0.06 0.67 0.15 -0.16 0.29

Outstate 0.69 -0.53 0.05 0.13 0.21

Room.Board 0.58 -0.29 0.16 0.19 0.54

Books 0.15 0.12 0.73 0.09 -0.12

Personal -0.10 0.47 0.54 -0.23 -0.21

PhD 0.74 0.12 -0.14 -0.54 0.14

Terminal 0.74 0.10 -0.07 -0.52 0.20

S.F.Ratio -0.41 0.52 -0.31 -0.16 -0.08

perc.alumni 0.48 -0.52 -0.16 0.02 -0.21

Expend 0.74 -0.28 0.25 0.08 0.07

Grad.Rate 0.59 -0.36 -0.23 0.27 -0.11

Among the correlations between the PCs and the original variables, the following are considerably large: PC1 with Top10perc (0.83) and Top25perc (0.80), and PC2 with Enroll (0.85) and F.Undergrad (0.88). These pairs can also be extracted programmatically, as shown below.
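
A sketch of that extraction, using an |r| >= 0.8 cutoff on the correlation block computed above:

In [ ]:

# List variable/PC pairs whose absolute correlation is at least 0.8
high = result.iloc[0:17, 17:22].stack()
print(high[high.abs() >= 0.8].round(2))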
