Advanced Statistics Jupyter File PDF
A research laboratory was developing a new compound for the relief of severe cases of hay fever. In an
experiment with 36 volunteers, the amounts of the two active ingredients (A & B) in the compound were
varied at three levels each. Randomization was used in assigning four volunteers to each of the nine
treatments. The data on hours of relief can be found in the following .csv file: Fever.csv
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import os
%matplotlib inline
In [2]:
df=pd.read_csv("C:\\Users\\Shubham\\Downloads\\Fever-1.csv")
df.head()
Out[2]:
A B Volunteer Relief
0 1 1 1 2.4
1 1 1 2 2.7
2 1 1 3 2.3
3 1 1 4 2.5
4 1 2 1 4.6
In [3]:
df.columns
Out[3]:
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 36 non-null int64
1 B 36 non-null int64
2 Volunteer 36 non-null int64
3 Relief 36 non-null float64
dtypes: float64(1), int64(3)
memory usage: 1.2 KB
In [5]:
df.describe()
Out[5]:
A B Volunteer Relief
In [6]:
df.isnull().sum()
Out[6]:
A 0
B 0
Volunteer 0
Relief 0
dtype: int64
In [7]:
df["A"].unique()
Out[7]:
In [8]:
df["B"].unique()
Out[8]:
In [9]:
df["Volunteer"].unique()
Out[9]:
In [10]:
df["Relief"].unique()
Out[10]:
array([ 2.4, 2.7, 2.3, 2.5, 4.6, 4.2, 4.9, 4.7, 4.8, 4.5, 4.4,
5.8, 5.2, 5.5, 5.3, 8.9, 9.1, 8.7, 9. , 9.3, 9.4, 6.1,
5.7, 5.9, 6.2, 9.9, 10.5, 10.6, 10.1, 13.5, 13. , 13.3, 13.2])
In [11]:
df.shape
Out[11]:
(36, 4)
1.1) State the Null and Alternate Hypothesis for conducting one-way ANOVA for both the variables ‘A’ and ‘B’
individually. [both statement and statistical form like Ho=mu, Ha>mu]
The null and alternate hypotheses for variable 'A' are: Null hypothesis H0: The mean hours of relief are the same at all three levels of active ingredient A. Alternate hypothesis H1: At least one level of active ingredient A has a different mean hours of relief.
H0: µ1 = µ2 = µ3
H1: At least one µi is different, where µ1, µ2, µ3 are the mean hours of relief at levels 1, 2 and 3 of ingredient A.
The null and alternate hypotheses for variable 'B' are: Null hypothesis H0: The mean hours of relief are the same at all three levels of active ingredient B. Alternate hypothesis H1: At least one level of active ingredient B has a different mean hours of relief.
H0: µ1 = µ2 = µ3
H1: At least one µi is different, where µ1, µ2, µ3 are the mean hours of relief at levels 1, 2 and 3 of ingredient B.
1.2) Perform one-way ANOVA for variable ‘A’ with respect to the variable ‘Relief’. State whether the Null
Hypothesis is accepted or rejected based on the ANOVA results.
In [12]:
plt.figure(figsize=(10,5))
sns.boxplot(x="A",y="Relief",data=df)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc1b2baf0>
As seen from the boxplot, there is a very large difference in the mean Relief provided at the different levels of active ingredient A; let's confirm this statistically by comparing the means with a one-way ANOVA.
H0: The mean hours of relief are the same at all three levels of ingredient A
H1: At least one level of ingredient A has a different mean hours of relief
In [13]:
alpha=0.05
formula='Relief~C(A)'
model=ols(formula,df).fit()
aov_table=anova_lm(model)
print(aov_table)
Since the p-value is less than alpha = 0.05, we reject the null hypothesis. At the 95% confidence level we can conclude that the relief provided at the different levels of compound A is not the same.
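As a quick programmatic check (a hypothetical follow-up cell, not part of the original notebook), the p-value can be read off the ANOVA table and compared with alpha directly:
In [ ]:
p_value = aov_table.loc['C(A)', 'PR(>F)']      # p-value of the factor A row
alpha = 0.05
print('Reject H0' if p_value < alpha else 'Fail to reject H0')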
1.3) Perform one-way ANOVA for variable ‘B’ with respect to the variable ‘Relief’. State whether the Null
Hypothesis is accepted or rejected based on the ANOVA results.
In [14]:
plt.figure(figsize=(10,5))
sns.boxplot(x="B",y="Relief",data=df)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc2326d60>
H0: The mean hours of relief are the same at all three levels of ingredient B
H1: At least one level of ingredient B has a different mean hours of relief
In [15]:
formula='Relief~C(B)'
model=ols(formula,df).fit()
aov_table=anova_lm(model)
print(aov_table)
Since the p-value is less than alpha = 0.05, we reject the null hypothesis. So at the 95% confidence level we can conclude that the mean relief is not the same across the levels of ingredient B.
1.4) Analyse the effects of one variable on another with the help of an interaction plot. What is the interaction
between the two treatments?
In [16]:
sns.pointplot(x='A',y='Relief',data=df,hue='B')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc2803dc0>
In [17]:
sns.pointplot(x='A',y='Relief',data=df,hue='B',ci=None)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc287eaf0>
The plot suggests that there is interaction between the levels of ingredient A and ingredient B.
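statsmodels also provides a dedicated interaction plot; a minimal alternative sketch of the same figure (not the author's code):
In [ ]:
from statsmodels.graphics.factorplots import interaction_plot
fig = interaction_plot(x=df['A'], trace=df['B'], response=df['Relief'],
                       colors=['red', 'blue', 'green'])    # one colour per level of B
plt.show()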
1.5) Perform a two-way ANOVA based on the different ingredients (variable ‘A’ & ‘B’ along with their
interaction 'A*B') with the variable 'Relief' and state your results.
H0: There is no interaction between the levels of ingredient A and ingredient B
H1: There is at least some interaction between the levels of ingredient A and ingredient B
In [18]:
#Interaction Effect:
model=ols('Relief~C(A)+C(B)+C(A):C(B)',data=df).fit()
aov_table=anova_lm(model)
print(aov_table)
1.6) Mention the business implications of performing ANOVA for this particular case study.
From the ANOVA results we can conclude that the hours of relief differ significantly across the levels of both ingredients, and the interaction plot indicates that the effect of one ingredient depends on the level of the other. For the business this means the two ingredients should not be tuned in isolation: the laboratory should choose the combination of levels of A and B that delivers the longest relief.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore
from sklearn.decomposition import PCA
from statsmodels import multivariate
In [19]:
df1=pd.read_csv("C:\\Users\\Shubham\\Downloads\\Education+-+Post+12th+Standard.csv")
df1
Out[19]:
Names Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad ...
0 Abilene Christian University 1660 1232 721 23 52 2885 537
1 Adelphi University 2186 1924 512 16 29 2683 1227
2 Adrian College 1428 1097 336 22 50 1036 99
3 Agnes Scott College 417 349 137 60 89 510 63
4 Alaska Pacific University 193 146 55 16 44 249 869
... ... ... ... ... ... ... ...
772 Worcester State College 2197 1515 543 4 26 3089 2029
773 Xavier University 1959 1805 695 24 47 2849 1107
774 Xavier University of Louisiana 2097 1915 695 34 61 2793 166
775 Yale University 10705 2453 1317 95 99 5217 83
776 York College of Pennsylvania 2989 1855 691 28 63 2988 1726
777 rows x 18 columns (remaining columns not shown)
The dataset "Education+-+Post+12th+Standard" shows the data of different universities and each university
is judged on 17 different attributes.
In [105]:
df1.shape
Out[105]:
(777, 18)
In [21]:
df1.dtypes
Out[21]:
Names object
Apps int64
Accept int64
Enroll int64
Top10perc int64
Top25perc int64
F.Undergrad int64
P.Undergrad int64
Outstate int64
Room.Board int64
Books int64
Personal int64
PhD int64
Terminal int64
S.F.Ratio float64
perc.alumni int64
Expend int64
Grad.Rate int64
dtype: object
We can see that all the attributes are either float or int except for the Names attribute
df1.describe()
Out[22]:
In [123]:
df1["Names"].unique()
Out[123]:
df1.isnull().sum()
Out[24]:
Names 0
Apps 0
Accept 0
Enroll 0
Top10perc 0
Top25perc 0
F.Undergrad 0
P.Undergrad 0
Outstate 0
Room.Board 0
Books 0
Personal 0
PhD 0
Terminal 0
S.F.Ratio 0
perc.alumni 0
Expend 0
Grad.Rate 0
dtype: int64
dups=df1.duplicated().sum()
print("No od duplicated rows present is",dups)
2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. The
inferences drawn from this should be properly documented.
Uni-variate analysis
Uni-variate analysis except Names column
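The cells below index into a pre-built axes grid whose creation line did not survive the export. As a compact, self-contained alternative, the same histogram/boxplot pair can be drawn for every numeric column with a loop (this sketch uses sns.histplot in place of the deprecated sns.distplot):
In [ ]:
num_cols = df1.select_dtypes(include="number").columns          # every column except Names
fig, axes = plt.subplots(len(num_cols), 2, figsize=(14, 4 * len(num_cols)))
for i, col in enumerate(num_cols):
    sns.histplot(df1[col], kde=True, ax=axes[i][0])             # distribution
    axes[i][0].set_title(col + " distribution", fontsize=15)
    sns.boxplot(y=df1[col], ax=axes[i][1])                      # outlier view
    axes[i][1].set_title(col + " boxplot", fontsize=15)
plt.tight_layout()
plt.show()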
In [26]:
a = sns.distplot(df1['Enroll'] , ax=axes[2][0])
a.set_title("Total Enrolled Application Distribution",fontsize=15)
a = sns.boxplot(df1["Enroll"],orient = "v" , ax=axes[2][1])
a.set_title("Total Enrolled Application Distribution",fontsize=15)
a = sns.distplot(df1["Top10perc"] , ax=axes[3][0])
a.set_title("Top 10% Application Distribution",fontsize=15)
a = sns.boxplot(df1["Top10perc"],orient = "v" , ax=axes[3][1])
a.set_title("Top 10% Application Distribution",fontsize=15)
a = sns.distplot(df1["Top25perc"] , ax=axes[4][0])
a.set_title("Top 25% Application Distribution",fontsize=15)
a = sns.boxplot(df1["Top25perc"],orient = "v" , ax=axes[4][1])
a.set_title("Top 25% Application Distribution",fontsize=15)
Out[26]:
In [27]:
a = sns.distplot(df1['P.Undergrad'] , ax=axes[1][0])
a.set_title("Part time graduate Distribution",fontsize=15)
a = sns.boxplot(df1['P.Undergrad'],orient = "v" , ax=axes[1][1])
a.set_title("Part time graduate Distribution",fontsize=15)
a = sns.distplot(df1['Outstate'] , ax=axes[2][0])
a.set_title("Outstate graduate Distribution",fontsize=15)
a = sns.boxplot(df1['Outstate'],orient = "v" , ax=axes[2][1])
a.set_title("Outstate graduate Distribution",fontsize=15)
a = sns.distplot(df1['Room.Board'] , ax=axes[3][0])
a.set_title("Room Boarding Distribution",fontsize=15)
a = sns.boxplot(df1['Room.Board'],orient = "v" , ax=axes[3][1])
a.set_title("Room Boarding Distribution",fontsize=15)
a = sns.distplot(df1['Books'] , ax=axes[4][0])
a.set_title("Books Distribution",fontsize=15)
a = sns.boxplot(df1['Books'],orient = "v" , ax=axes[4][1])
a.set_title("Books Distribution",fontsize=15)
Out[27]:
In [28]:
a = sns.distplot(df1['PhD'] , ax=axes[1][0])
a.set_title("PhD Prof Distribution",fontsize=15)
a = sns.boxplot(df1['PhD'],orient = "v" , ax=axes[1][1])
a.set_title("PhD Proff Distribution",fontsize=15)
a = sns.distplot(df1['Terminal'] , ax=axes[2][0])
a.set_title("Terminal Degree Prof Distribution",fontsize=15)
a = sns.boxplot(df1['Terminal'],orient = "v" , ax=axes[2][1])
a.set_title("Terminal Degree Prof Distribution",fontsize=15)
a = sns.distplot(df1['S.F.Ratio'] , ax=axes[3][0])
a.set_title("students to Prof ratio Distribution",fontsize=15)
a = sns.boxplot(df1['S.F.Ratio'],orient = "v" , ax=axes[3][1])
a.set_title("students to Prof ratio Distribution",fontsize=15)
a = sns.distplot(df1['perc.alumni'] , ax=axes[4][0])
a.set_title("% of alumni who donated Distribution",fontsize=15)
a = sns.boxplot(df1['perc.alumni'],orient = "v" , ax=axes[4][1])
a.set_title("% of alumni who donated Distribution",fontsize=15)
Out[28]:
In [29]:
a = sns.distplot(df1['Grad.Rate'] , ax=axes[1][0])
a.set_title("Graduation rate Distribution",fontsize=15)
a = sns.boxplot(df1['Grad.Rate'],orient = "v" , ax=axes[1][1])
a.set_title("IGraduation rate Distribution",fontsize=15)
Out[29]:
From the plots above it can be noticed that only Apps (number of applications received) does not have any outliers; all the other attributes have outliers present.
Multi-Variate Analysis
In [30]:
df1.corr()
Out[30]:
We can see from the above matrix that many columns are highly correlated with each other. The maximum correlation is between F.Undergrad (full-time undergraduates) and Enroll (number of students enrolled). So this dataset is a good candidate for PCA-based data reduction.
In [31]:
plt.subplots(figsize=(15,15))
sns.heatmap(df1.corr(), annot=True) # plot the correlation coefficients as a heatmap
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fc62b9250>
In [32]:
sns.pairplot(df1)
Out[32]:
<seaborn.axisgrid.PairGrid at 0x13fc6045490>
The correlation matrix and pairwise scatterplots indicate high correlation among F.Undergrad, Enroll, Apps, Accept, Top10perc, Top25perc, etc. Such pairs of high and moderate correlation indicate that dimension reduction should be considered for this data.
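A rough helper (hypothetical, not in the original notebook) to list the most strongly correlated variable pairs numerically:
In [ ]:
corr = df1.drop(columns='Names').corr()
pairs = corr.abs().unstack().sort_values(ascending=False)       # all variable pairs by |r|
pairs = pairs[pairs < 1].drop_duplicates()                      # drop the diagonal and mirrored pairs
print(pairs.head(10))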
2.2) Scale the variables and write the inference for using the type of scaling function for this case study.
In [33]:
df2=df1.drop(["Names"],axis=1)
df2
Out[33]:
In [110]:
Out[110]:
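The code of In [110] did not survive the export. A minimal sketch of the z-score scaling that the later outputs imply (the covariance diagonal of 1.00129 = n/(n-1) is consistent with standardised data), reusing the data_new name from the cells below:
In [ ]:
from scipy.stats import zscore
# Standardise every column to mean 0 and unit variance (z-scores), so that variables
# measured in large units (Apps, F.Undergrad, Expend) do not dominate the PCA.
data_new = df2.apply(zscore)
data_new.head()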
In [72]:
data_new.boxplot(figsize=(20,3))
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd4bb8550>
2.3) Comment on the comparison between covariance and the correlation matrix after scaling.
In [114]:
cov_matrix = np.cov(data_new.T)
Covariance Matrix
(17 x 17 covariance matrix of the z-scored data; every diagonal entry equals 1.00128866 = n/(n-1). The full printout appears under question 2.5 below.)
In [74]:
Out[74]:
In [75]:
#With standardisation (without standardisation also, the correlation matrix yields the same result)
data_new.corr()
Out[75]:
The covariance matrix of the scaled data and the correlation matrix (with or without standardisation) yield the same eigenvector and eigenvalue pairs. In other words, after scaling the covariance and correlation matrices contain essentially the same values; the only difference is the n/(n-1) factor used by np.cov, which is why the diagonal reads 1.00129 instead of exactly 1.
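A quick numerical check of this claim (a hypothetical cell, not in the original notebook; it uses a fresh z-scored copy so the comparison is exact up to the n/(n-1) factor):
In [ ]:
scaled = df2.apply(zscore)                    # fresh z-scored copy of the numeric data
cov_scaled = np.cov(scaled.T)                 # np.cov divides by n - 1
corr = scaled.corr().values                   # correlation is unaffected by scaling
n = len(scaled)
print(np.allclose(cov_scaled, corr * n / (n - 1)))   # expected to print True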
2.4) Check the dataset for outliers before and after scaling. Draw your inferences from this exercise.
In [116]:
df2.boxplot(figsize=(20,10))
Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd8ab5a00>
In [117]:
data_new.boxplot(figsize=(20,10))
Out[117]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd8d3be20>
The boxplots before and after scaling show the same outlier pattern: z-score scaling only changes the units on the axes, it does not remove outliers. They are therefore treated below by capping each variable at its IQR fences.
In [38]:
def remove_outlier(col):
    # Return the IQR-based lower and upper fences for a column
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
In [55]:
# Cap every variable at its IQR fences (equivalent to the original column-by-column code)
for col in data_new.columns:
    lratio, uratio = remove_outlier(data_new[col])
    data_new[col] = np.where(data_new[col] > uratio, uratio, data_new[col])
    data_new[col] = np.where(data_new[col] < lratio, lratio, data_new[col])
data_new.boxplot(figsize=(20,10))
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fd4d3d400>
2.5) Build the covariance matrix and calculate the eigenvalues and the eigenvector.
In [76]:
cov_matrix = np.cov(data_new.T)
print('Covariance Matrix\n', cov_matrix)
Covariance Matrix
[[ 1.00128866 0.94466636 0.84791332 0.33927032 0.35209304 0.81554018
0.3987775 0.05022367 0.16515151 0.13272942 0.17896117 0.39120081
0.36996762 0.09575627 -0.09034216 0.2599265 0.14694372]
[ 0.94466636 1.00128866 0.91281145 0.19269493 0.24779465 0.87534985
0.44183938 -0.02578774 0.09101577 0.11367165 0.20124767 0.35621633
0.3380184 0.17645611 -0.16019604 0.12487773 0.06739929]
[ 0.84791332 0.91281145 1.00128866 0.18152715 0.2270373 0.96588274
0.51372977 -0.1556777 -0.04028353 0.11285614 0.28129148 0.33189629
0.30867133 0.23757707 -0.18102711 0.06425192 -0.02236983]
[ 0.33927032 0.19269493 0.18152715 1.00128866 0.89314445 0.1414708
-0.10549205 0.5630552 0.37195909 0.1190116 -0.09343665 0.53251337
0.49176793 -0.38537048 0.45607223 0.6617651 0.49562711]
[ 0.35209304 0.24779465 0.2270373 0.89314445 1.00128866 0.19970167
-0.05364569 0.49002449 0.33191707 0.115676 -0.08091441 0.54656564
0.52542506 -0.29500852 0.41840277 0.52812713 0.47789622]
[ 0.81554018 0.87534985 0.96588274 0.1414708 0.19970167 1.00128866
0.57124738 -0.21602002 -0.06897917 0.11569867 0.31760831 0.3187472
0.30040557 0.28006379 -0.22975792 0.01867565 -0.07887464]
[ 0.3987775 0.44183938 0.51372977 -0.10549205 -0.05364569 0.57124738
1.00128866 -0.25383901 -0.06140453 0.08130416 0.32029384 0.14930637
0.14208644 0.23283016 -0.28115421 -0.08367612 -0.25733218]
[ 0.05022367 -0.02578774 -0.1556777 0.5630552 0.49002449 -0.21602002
-0.25383901 1.00128866 0.65509951 0.03890494 -0.29947232 0.38347594
0.40850895 -0.55553625 0.56699214 0.6736456 0.57202613]
[ 0.16515151 0.09101577 -0.04028353 0.37195909 0.33191707 -0.06897917
-0.06140453 0.65509951 1.00128866 0.12812787 -0.19968518 0.32962651
0.3750222 -0.36309504 0.27271444 0.50238599 0.42548915]
[ 0.13272942 0.11367165 0.11285614 0.1190116 0.115676 0.11569867
0.08130416 0.03890494 0.12812787 1.00128866 0.17952581 0.0269404
0.10008351 -0.03197042 -0.04025955 0.11255393 0.00106226]
[ 0.17896117 0.20124767 0.28129148 -0.09343665 -0.08091441 0.31760831
0.32029384 -0.29947232 -0.19968518 0.17952581 1.00128866 -0.01094989
-0.03065256 0.13652054 -0.2863366 -0.09801804 -0.26969106]
[ 0.39120081 0.35621633 0.33189629 0.53251337 0.54656564 0.3187472
0.14930637 0.38347594 0.32962651 0.0269404 -0.01094989 1.00128866
0.85068186 -0.13069832 0.24932955 0.43331936 0.30543094]
[ 0.36996762 0.3380184 0.30867133 0.49176793 0.52542506 0.30040557
0.14208644 0.40850895 0.3750222 0.10008351 -0.03065256 0.85068186
1.00128866 -0.16031027 0.26747453 0.43936469 0.28990033]
[ 0.09575627 0.17645611 0.23757707 -0.38537048 -0.29500852 0.28006379
0.23283016 -0.55553625 -0.36309504 -0.03197042 0.13652054 -0.13069832
-0.16031027 1.00128866 -0.4034484 -0.5845844 -0.30710565]
[-0.09034216 -0.16019604 -0.18102711 0.45607223 0.41840277 -0.22975792
-0.28115421 0.56699214 0.27271444 -0.04025955 -0.2863366 0.24932955
0.26747453 -0.4034484 1.00128866 0.41825001 0.49153016]
[ 0.2599265 0.12487773 0.06425192 0.6617651 0.52812713 0.01867565
-0.08367612 0.6736456 0.50238599 0.11255393 -0.09801804 0.43331936
0.43936469 -0.5845844 0.41825001 1.00128866 0.39084571]
[ 0.14694372 0.06739929 -0.02236983 0.49562711 0.47789622 -0.07887464
-0.25733218 0.57202613 0.42548915 0.00106226 -0.26969106 0.30543094
0.28990033 -0.30710565 0.49153016 0.39084571 1.00128866]]
In [127]:
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Values\n', eig_vals)
print('Eigen Vectors\n', eig_vecs)
Eigen Values
[5.45052162 4.48360686 1.17466761 1.00820573 0.93423123 0.84849117
0.6057878 0.58787222 0.53061262 0.4043029 0.02302787 0.03672545
0.31344588 0.08802464 0.1439785 0.16779415 0.22061096]
Eigen Vectors
[[-2.48765602e-01 3.31598227e-01 6.30921033e-02 -2.81310530e-01
5.74140964e-03 1.62374420e-02 4.24863486e-02 1.03090398e-01
9.02270802e-02 -5.25098025e-02 3.58970400e-01 -4.59139498e-01
4.30462074e-02 -1.33405806e-01 8.06328039e-02 -5.95830975e-01
2.40709086e-02]
[-2.07601502e-01 3.72116750e-01 1.01249056e-01 -2.67817346e-01
5.57860920e-02 -7.53468452e-03 1.29497196e-02 5.62709623e-02
1.77864814e-01 -4.11400844e-02 -5.43427250e-01 5.18568789e-01
-5.84055850e-02 1.45497511e-01 3.34674281e-02 -2.92642398e-01
-1.45102446e-01]
[-1.76303592e-01 4.03724252e-01 8.29855709e-02 -1.61826771e-01
-5.56936353e-02 4.25579803e-02 2.76928937e-02 -5.86623552e-02
1.28560713e-01 -3.44879147e-02 6.09651110e-01 4.04318439e-01
-6.93988831e-02 -2.95896092e-02 -8.56967180e-02 4.44638207e-01
1.11431545e-02]
[-3.54273947e-01 -8.24118211e-02 -3.50555339e-02 5.15472524e-02
-3.95434345e-01 5.26927980e-02 1.61332069e-01 1.22678028e-01
-3.41099863e-01 -6.40257785e-02 -1.44986329e-01 1.48738723e-01
-8.10481404e-03 -6.97722522e-01 -1.07828189e-01 -1.02303616e-03
3.85543001e-02]
[-3.44001279e-01 -4.47786551e-02 2.41479376e-02 1.09766541e-01
-4.26533594e-01 -3.30915896e-02 1.18485556e-01 1.02491967e-01
-4.03711989e-01 -1.45492289e-02 8.03478445e-02 -5.18683400e-02
-2.73128469e-01 6.17274818e-01 1.51742110e-01 -2.18838802e-02
-8.93515563e-02]
[-1.54640962e-01 4.17673774e-01 6.13929764e-02 -1.00412335e-01
-4.34543659e-02 4.34542349e-02 2.50763629e-02 -7.88896442e-02
5.94419181e-02 -2.08471834e-02 -4.14705279e-01 -5.60363054e-01
-8.11578181e-02 -9.91640992e-03 -5.63728817e-02 5.23622267e-01
5.61767721e-02]
[-2.64425045e-02 3.15087830e-01 -1.39681716e-01 1.58558487e-01
3.02385408e-01 1.91198583e-01 -6.10423460e-02 -5.70783816e-01
-5.60672902e-01 2.23105808e-01 9.01788964e-03 5.27313042e-02
1.00693324e-01 -2.09515982e-02 1.92857500e-02 -1.25997650e-01
-6.35360730e-02]
[-2.94736419e-01 -2.49643522e-01 -4.65988731e-02 -1.31291364e-01
2.22532003e-01 3.00003910e-02 -1.08528966e-01 -9.84599754e-03
4.57332880e-03 -1.86675363e-01 5.08995918e-02 -1.01594830e-01
1.43220673e-01 -3.83544794e-02 -3.40115407e-02 1.41856014e-01
-8.23443779e-01]
[-2.49030449e-01 -1.37808883e-01 -1.48967389e-01 -1.84995991e-01
5.60919470e-01 -1.62755446e-01 -2.09744235e-01 2.21453442e-01
-2.75022548e-01 -2.98324237e-01 1.14639620e-03 2.59293381e-02
-3.59321731e-01 -3.40197083e-03 -5.84289756e-02 6.97485854e-02
3.54559731e-01]
[-6.47575181e-02 5.63418434e-02 -6.77411649e-01 -8.70892205e-02
-1.27288825e-01 -6.41054950e-01 1.49692034e-01 -2.13293009e-01
1.33663353e-01 8.20292186e-02 7.72631963e-04 -2.88282896e-03
3.19400370e-02 9.43887925e-03 -6.68494643e-02 -1.14379958e-02
-2.81593679e-02]
[ 4.25285386e-02 2.19929218e-01 -4.99721120e-01 2.30710568e-01
-2.22311021e-01 3.31398003e-01 -6.33790064e-01 2.32660840e-01
9.44688900e-02 -1.36027616e-01 -1.11433396e-03 1.28904022e-02
-1.85784733e-02 3.09001353e-03 2.75286207e-02 -3.94547417e-02
-3.92640266e-02]
[-3.18312875e-01 5.83113174e-02 1.27028371e-01 5.34724832e-01
2.6) Write the explicit form of the first PC (in terms of Eigen Vectors).
In order to decide which eigenvector(s) can be dropped without losing too much information when constructing a lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data and are the ones that can be dropped. The common approach is therefore to rank the eigenvalues from highest to lowest and choose the top k eigenvectors.
The first PC corresponds to the largest eigenvalue (5.45). Reading off the first column of the eigenvector matrix above, its explicit form in terms of the standardized (z-scored) variables is:
PC1 = -0.249*Apps - 0.208*Accept - 0.176*Enroll - 0.354*Top10perc - 0.344*Top25perc - 0.155*F.Undergrad - 0.026*P.Undergrad - 0.295*Outstate - 0.249*Room.Board - 0.065*Books + 0.043*Personal - 0.318*PhD - ... (the remaining coefficients are in the truncated part of the printout above).
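A short sketch of that ranking step, assuming the eig_vals and eig_vecs computed from the covariance matrix above:
In [ ]:
# Pair each eigenvalue with its eigenvector (columns of eig_vecs), then sort by eigenvalue.
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
first_pc = eig_pairs[0][1]        # eigenvector of the largest eigenvalue -> weights of PC1
print(np.round(first_pc, 3))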
In [79]:
2.7) Discuss the cumulative values of the eigenvalues. How does it help you to decide on the optimum
number of principal components? What do the eigenvectors indicate? Perform PCA and export the data of
the Principal Component scores into a data frame.
After sorting the eigenpairs, the next question is “how many principal components are we going to choose for
our new feature subspace?” A useful measure is the so-called “explained variance,” which can be calculated
from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to
each of the principal components.
In [94]:
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
print(var_exp)
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)
In [85]:
# Plotting the explained variance per component and its cumulative sum
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
The plot above clearly shows that most of the variance (32.02% to be precise) is explained by the first principal component. The second principal component explains 26.34%, while the third and fourth principal components explain 6.9% and 5.9% respectively. Together, the first two principal components contain 58.36% of the information.
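If a cut-off is needed, the cumulative curve can be queried directly; the 80% threshold below is purely illustrative:
In [ ]:
# Smallest number of components whose cumulative explained variance reaches 80%.
k = int(np.argmax(cum_var_exp >= 80) + 1)
print('Components needed for at least 80% of the variance:', k)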
In [135]:
Out[135]:
In [139]:
Out[139]:
Out[141]:
For each PC, the row of length 17 gives the weights with which the corresponding variables need to be
multiplied to get the PC. Note that the weights can be positive or negative.
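The cells that created pca, pc_comps and the loadings table did not come through in the export; a minimal sketch consistent with how those names are used in the next cell:
In [ ]:
pca = PCA(n_components=17)                     # PCA imported from sklearn.decomposition above
pca.fit(data_new)
pc_comps = ['PC' + str(i) for i in range(1, 18)]       # labels PC1 ... PC17
loadings = pd.DataFrame(pca.components_, index=pc_comps, columns=data_new.columns)
np.round(loadings.iloc[:5, :], 2)              # weights of the first five PCs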
In [144]:
# Find PC scores
pc = pca.fit_transform(data_new)
pca_df = pd.DataFrame(pc,columns=pc_comps)
np.round(pca_df.iloc[:6,:5],2)
Out[144]:
In [149]:
round(pca_df.corr(),5)
Out[149]:
(17 x 17 correlation matrix of the PC scores: all diagonal entries are 1.0 and every off-diagonal entry is 0 to five decimal places, confirming that the principal components are mutually uncorrelated.)
Let us now investigate the correlations among the first 5 PCs with the original 17 variables
In [155]:
Out[155]:
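The code of In [155] did not survive the export; a sketch of one way to compute these correlations, assuming data_new and pca_df from the cells above:
In [ ]:
# Correlate each original (scaled) variable with the first five PC scores.
combined = pd.concat([data_new, pca_df.iloc[:, :5]], axis=1)
corr_with_pcs = combined.corr().loc[data_new.columns, pca_df.columns[:5]]
np.round(corr_with_pcs, 2)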
Among the correlations between the PCs and the constituent variables, the following are considerably large:
• PC1 and (Top10perc, Top25perc)
• PC2 and (Enroll, F.Undergrad)
In [ ]: