ANOVA and PCA
1. State the null and the alternate hypothesis for conducting one-way ANOVA for
both Education and Occupation individually.
Ans) For Education, the null hypothesis is that the mean Salary is the same across all
categories of Education. The alternative hypothesis is that the mean Salary differs for
at least one category of Education.
For Occupation, the null hypothesis is that the mean Salary is the same across all
categories of Occupation. The alternative hypothesis is that the mean Salary differs for
at least one category of Occupation.
Ans) The result of the one-way ANOVA on Salary with respect to Education is shown
below:
From the result I observed that the p-value is less than 0.05, so I reject the null
hypothesis and conclude that the mean Salary differs across the categories of
Education.
Ans) The result of the one-way ANOVA on Salary with respect to Occupation is shown
below:
From the result I observed that the p-value is less than 0.05, so I reject the null
hypothesis and conclude that the mean Salary differs across the categories of
Occupation.
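The one-way test described above can be sketched as follows. This is a minimal illustration, assuming `scipy.stats.f_oneway` is used for the F-test. The data is synthetic (made-up salaries, not the case-study data) and the helper name `one_way_anova` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the salary data (made-up numbers, NOT the case-study data).
rng = np.random.default_rng(0)
levels = {"HS-grad": 60_000, "Bachelors": 120_000, "Doctorate": 200_000}
df = pd.DataFrame({
    "Education": np.repeat(list(levels), 15),
    "Salary": np.concatenate([rng.normal(mu, 20_000, 15) for mu in levels.values()]),
})

def one_way_anova(data, factor, response="Salary"):
    """Return (F, p) for H0: the mean of `response` is equal in every group of `factor`."""
    groups = [g[response].to_numpy() for _, g in data.groupby(factor)]
    return stats.f_oneway(*groups)

f_stat, p_value = one_way_anova(df, "Education")
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"F = {f_stat:.2f}, p = {p_value:.3g} -> {decision}")
```

The same helper can be called with `"Occupation"` as the factor once that column is present in the data frame.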
Problem 1B
1. What is the interaction between two treatments? Analyze the effects of one
variable on the other (Education and Occupation) with the help of an interaction
plot. [hint: use the ‘pointplot’ function from the ‘seaborn’ library]
Ans) The interaction plot for Education and Occupation is shown below:
Since the lines in the interaction plot are not parallel, there is an interaction
between Education and Occupation. For Doctorate Education, the mean Salary is highest
when the Occupation is Prof-specialty. Similarly, for Bachelors Education the mean
Salary is highest when the Occupation is Sales, and for HS-grad Education the mean
Salary is highest for Prof-specialty.
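An interaction plot draws one line of cell means per Occupation across the Education levels, so the "not parallel" check can also be read from a table of cell means. The sketch below uses a tiny made-up data set (Salary in thousands, not the case-study data). `draw_interaction_plot` is a hypothetical helper that wraps seaborn's `pointplot`, as the hint suggests.

```python
import pandas as pd

# Tiny made-up data set (Salary in thousands), NOT the case-study data.
df = pd.DataFrame({
    "Education":  ["HS-grad", "HS-grad", "Bachelors", "Bachelors"] * 2,
    "Occupation": ["Sales", "Prof-specialty"] * 4,
    "Salary":     [50, 70, 130, 110, 54, 74, 134, 106],
})

# Mean Salary for every Education x Occupation cell: the points the plot would draw.
cell_means = df.pivot_table(index="Education", columns="Occupation",
                            values="Salary", aggfunc="mean")

# If the lines were parallel, this gap would be the same for every Education level.
gap = cell_means["Sales"] - cell_means["Prof-specialty"]
print(cell_means)
print(gap)

def draw_interaction_plot(data):
    """Draw the interaction plot; returns the matplotlib Axes."""
    # Imported here so the table above can be computed without plotting libraries.
    import seaborn as sns
    return sns.pointplot(data=data, x="Education", y="Salary", hue="Occupation")
```

In this toy data the gap changes sign between the two Education levels, which is the crossing-lines pattern that signals an interaction.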
2. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null
and alternative hypotheses and state your results. How will you interpret this
result?
Ans) The output for two-way ANOVA based on Salary with respect to both Education
and Occupation is shown below:
3. Explain the business implications of performing ANOVA for this particular case
study.
Ans) From a business point of view, performing ANOVA for this case study helps us
understand how Salary varies for people with different Education and Occupation
backgrounds. For example, we can identify the Education category with the highest
Salary and the one with the lowest. In the same way, we can identify the Occupation
category with the highest Salary and the one with the lowest.
Hypothesis 1
For Education, the null hypothesis is that the mean Salary is the same for all values of
Education. The alternative hypothesis is that the mean Salary differs for at least one
value of Education. Since the p-value is less than 0.05, I reject the null hypothesis
and conclude that the mean Salary differs across the values of Education.
Hypothesis 2
For Occupation, the null hypothesis is that the mean Salary is the same for all values of
Occupation. The alternative hypothesis is that the mean Salary differs for at least one
value of Occupation. Since the p-value is 0.08, which is greater than 0.05, I fail to
reject the null hypothesis: there is not enough evidence that the mean Salary differs
across the values of Occupation.
Hypothesis 3
For the interaction between Education and Occupation, the null hypothesis is that there
is no interaction between Education and Occupation. The alternative hypothesis is that
there is an interaction between Education and Occupation. Since the p-value is less than
0.05, I reject the null hypothesis and conclude that there is an interaction between
Education and Occupation.
Problem 2
Ans) The distribution of each feature for the univariate analysis is shown below:
From the above plots I observed that the Accept, Apps, Books, Enroll, Expend,
P.Undergrad, Personal, S.F.Ratio, Top10perc and F.Undergrad columns are right
skewed. Grad.Rate, Outstate, and Top25perc are approximately normally distributed, and
PhD and Terminal are left skewed.
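The visual reading of skewness above can be cross-checked numerically. The sketch below assumes the sample skewness from pandas' `DataFrame.skew` and a cut-off of 0.5, which is an arbitrary choice for illustration. The three columns are synthetic and only borrow names from the data set.

```python
import numpy as np
import pandas as pd

# Synthetic columns (NOT the case-study data); only the column names are borrowed.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Apps": rng.lognormal(mean=7, sigma=1, size=500),         # long right tail
    "Grad.Rate": rng.normal(65, 15, size=500),                # roughly symmetric
    "PhD": 100 - rng.lognormal(mean=3, sigma=0.6, size=500),  # long left tail
})

def skew_label(value, tol=0.5):
    """Classify a skewness value; `tol` is an assumed, arbitrary cut-off."""
    if value > tol:
        return "right-skewed"
    if value < -tol:
        return "left-skewed"
    return "approximately symmetric"

skewness = df.skew()
labels = skewness.apply(skew_label)
print(pd.DataFrame({"skewness": skewness.round(2), "shape": labels}))
```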
Several pairs of variables are worth exploring in a bivariate analysis, and the scatter
plots showing the relationships between them are given below.
The scatter plot for the relation between Apps and Accept is shown below:
From the above scatter plot it is evident that there is a strong positive relation
between Apps and Accept, as the value of Accept increases with an increase in Apps.
The scatter plot for the relation between Enroll and Accept is shown below:
From the above scatter plot it is evident that there is a strong positive relation
between Enroll and Accept, as the value of Accept increases with an increase in Enroll.
The scatter plot for the relation between Top10perc and Top25perc is shown below:
From the above scatter plot I observed that there is a strong positive relation between
Top10perc and Top25perc, as the value of Top25perc increases with an increase in
Top10perc.
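The strength of the relations read from the scatter plots can be quantified with a correlation matrix. The sketch below assumes the Pearson correlation from pandas' `DataFrame.corr` and uses synthetic columns (not the case-study data) in which Accept is constructed to follow Apps.

```python
import numpy as np
import pandas as pd

# Synthetic columns (NOT the case-study data).
rng = np.random.default_rng(3)
apps = rng.lognormal(mean=7, sigma=1, size=300)
df = pd.DataFrame({
    "Apps": apps,
    # Accept is built to track Apps, with some multiplicative noise.
    "Accept": 0.7 * apps * rng.uniform(0.8, 1.2, size=300),
    # Books is generated independently of the other two columns.
    "Books": rng.normal(550, 100, size=300),
})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))
```

A value close to +1 for a pair matches the "strong positive relation" reading of its scatter plot.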
2. Is scaling necessary for PCA in this case? Give justification and perform
scaling.
Ans) Almost all the numerical variables in the dataset are on different scales, so
scaling is necessary before PCA in this case; otherwise the principal components would
be biased towards the features with the largest variance. The scaling is performed in
the coding section, and the first few rows and columns of the scaled data are shown below:
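A minimal sketch of the scaling step, assuming z-score standardisation (subtract the column mean, divide by the column standard deviation). The source does not say which scaler was used, so this choice is an assumption. The data is synthetic, not the case-study data.

```python
import numpy as np
import pandas as pd

# Synthetic columns on very different scales (NOT the case-study data).
rng = np.random.default_rng(6)
df = pd.DataFrame({
    "Apps": rng.lognormal(7, 1, 200),
    "S.F.Ratio": rng.normal(14, 4, 200),
    "Grad.Rate": rng.normal(65, 15, 200),
})

# Before scaling, the variances differ by orders of magnitude.
print(df.var().round(2))

# z-score scaling with the population standard deviation (ddof=0).
scaled = (df - df.mean()) / df.std(ddof=0)

# After scaling, every column has mean 0 and standard deviation 1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
print(scaled.head())
```

scikit-learn's `StandardScaler` applies the same transformation, so either route gives the same scaled values.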
4.Check the dataset for outliers before and after scaling. What insight do you
derive here? [Please do not treat Outliers unless specifically asked to do so].
Ans) I used box plots of all the numerical features to detect outliers. The box plot for
the unscaled data is shown below:
From the above box plot it is evident that there are many outliers in the dataset, shown
as dots beyond the whiskers of the box plots.
The box plot for detecting outliers in the scaled data set is shown below:
From the above plot it is evident that many outliers remain even after scaling the data,
again shown as dots beyond the whiskers of the box plots. Scaling changes the units of
each variable but not the relative position of its points, so it does not remove
outliers.
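The before-and-after comparison can be checked by counting outliers with the same 1.5 × IQR rule that box plots use for their whiskers. The sketch below assumes z-score scaling and synthetic data (not the case-study data). `iqr_outlier_counts` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Synthetic columns (NOT the case-study data).
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "Apps": rng.lognormal(7, 1, 200),
    "Grad.Rate": rng.normal(65, 15, 200),
})
scaled = (df - df.mean()) / df.std(ddof=0)  # assumed z-score scaling

def iqr_outlier_counts(frame):
    """Count, per column, the points beyond 1.5 * IQR from the quartiles."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return outside.sum()

before = iqr_outlier_counts(df)
after = iqr_outlier_counts(scaled)
print(pd.DataFrame({"before scaling": before, "after scaling": after}))
```

Because the quartiles and the data are shifted and rescaled together, the same points fall outside the whiskers in both versions.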
6. Perform PCA and export the data of the Principal Component (eigenvectors)
into a data frame with the original features
Ans) PCA is performed in the coding section using two components, and the principal
component eigenvectors are exported into a data frame whose columns are the original
features. The data frame is shown below:
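The export step can be sketched as follows, assuming scikit-learn's `PCA` with two components as stated above. The data is synthetic (not the case-study data), and the `PC1`/`PC2` row labels are an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data (NOT the case-study data): three columns share a hidden "size" factor.
rng = np.random.default_rng(4)
size = rng.normal(size=200)
X = pd.DataFrame({
    "Apps": size + 0.1 * rng.normal(size=200),
    "Accept": size + 0.1 * rng.normal(size=200),
    "Enroll": size + 0.1 * rng.normal(size=200),
    "Grad.Rate": rng.normal(size=200),
})
X_scaled = (X - X.mean()) / X.std(ddof=0)  # z-score scaling before PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)  # data projected onto the two components

# Each row of components_ is one eigenvector, expressed over the original features.
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=["PC1", "PC2"])
print(loadings.round(2))
```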
7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use
values with two places of decimals only). [hint: write the linear equation of PC in
terms of eigenvectors and corresponding features]
Ans) For PCA we first calculate the covariance matrix of the data. Then from the
covariance matrix we calculate the eigenvalues and eigenvectors. The eigenvalue equation
is given below:
Ax - λx = 0
where A is the covariance matrix, λ is an eigenvalue and x is the corresponding
eigenvector. The eigenvector values for the first principal component, to two decimal
places, are shown below:
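The eigenvalue equation and the explicit linear form of the first PC can be sketched with NumPy. This is an illustration on synthetic data (not the case-study data), so the printed weights are not the answer to the question. The three feature names are borrowed only as labels.

```python
import numpy as np

# Synthetic, correlated columns (NOT the case-study data).
rng = np.random.default_rng(5)
features = ["Apps", "Accept", "Enroll"]
base = rng.normal(size=300)
X = np.column_stack([base + 0.3 * rng.normal(size=300) for _ in features])

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score scaling
A = np.cov(Z, rowvar=False)               # covariance matrix of the scaled data

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(A)
order = np.argsort(eigenvalues)[::-1]     # re-sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

v1 = eigenvectors[:, 0]                   # eigenvector of the largest eigenvalue
# Check the eigenvalue equation A x = lambda x for the first pair.
print(np.allclose(A @ v1, eigenvalues[0] * v1))

# Explicit linear form of PC1, weights rounded to two decimals.
equation = "PC1 = " + " + ".join(f"({w:.2f} * {name})" for w, name in zip(v1, features))
print(equation)
```

The sign of an eigenvector is arbitrary, so all weights of a component may come out negated without changing the result.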
8. Consider the cumulative values of the eigenvalues. How does it help you to
decide on the optimum number of principal components? What do the
eigenvectors indicate?
Ans) The cumulative sum of the eigenvalues, expressed as a fraction of their total, tells
us how much of the variance (information) in the original data is preserved by the first
few principal components. Using this value we can choose the smallest number of
principal components that preserves, say, 95% or 99% of the information in the original
data. The eigenvectors indicate the directions of the new features (the principal axes)
in the space of the original features.
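The selection rule described above can be sketched in a few lines. The eigenvalues below are made up for illustration and are not the eigenvalues of the case-study data. The 95% target is one of the thresholds mentioned in the answer.

```python
import numpy as np

# Made-up eigenvalues for illustration only, sorted from largest to smallest.
eigenvalues = np.array([4.0, 2.5, 1.5, 1.0, 0.6, 0.4])

# Share of the total variance carried by each component, then the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
print(np.round(cumulative, 2))  # [0.4  0.65 0.8  0.9  0.96 1.  ]

# Smallest number of components whose cumulative share reaches the target.
target = 0.95
n_components = int(np.argmax(cumulative >= target)) + 1
print(n_components)  # 5
```

With these numbers, four components preserve 90% of the variance and the fifth is needed to pass 95%.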