PCA
ANS:- Principal component analysis (PCA) refers to the process by which principal components are computed, and to the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features X1, X2, ..., Xp and no associated response Y. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (visualization of the observations or of the variables).
In simple words, principal component analysis is a method of extracting important variables (in the form of components) from a large set of variables available in a data set. It extracts a low-dimensional set of features from a high-dimensional data set with the aim of capturing as much information as possible. With fewer variables, visualization also becomes much more meaningful. PCA is most useful when dealing with data of three or more dimensions.
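As a quick illustration (a minimal sketch, assuming scikit-learn and a synthetic feature matrix rather than anything from the text above), PCA can be used to extract a handful of components from a high-dimensional, unlabeled data set:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, unlabeled data: 200 observations, 50 features, no response Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

pca = PCA(n_components=5)             # keep only the first 5 principal components
Z = pca.fit_transform(X)              # derived variables (scores) for each observation

print(Z.shape)                        # (200, 5)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```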
ANS:- PCA should be performed on a normalized (standardized) version of the original predictors. This is because the original predictors may be measured on very different scales. For example, imagine a data set with variables measured in units such as gallons, kilometers and light years; the variances of these variables will inevitably differ by orders of magnitude.
Performing PCA on un-normalized variables leads to extremely large loadings for the variables with high variance. In turn, the principal components end up depending mostly on those high-variance variables. This is undesirable.
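A minimal sketch of this point, assuming scikit-learn and two made-up columns whose scales differ wildly (standing in for gallons vs. light years): without standardization the first loading vector is dominated by the high-variance column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Two made-up columns on wildly different scales (a stand-in for mixed units).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(scale=1000.0, size=300),   # huge-variance column
                     rng.normal(scale=0.01, size=300)])    # tiny-variance column

# Without normalization, the first loading vector is dominated by column 0.
print(PCA(n_components=1).fit(X).components_)

# With standardization, both predictors contribute on a comparable scale.
scaled = make_pipeline(StandardScaler(), PCA(n_components=1)).fit(X)
print(scaled.named_steps["pca"].components_)
```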
ANS:- To reduce the dimensionality of the data, the following operations should be performed before feeding it into a machine learning model:
Remove the redundant dimensions.
Keep only the most important dimensions.
ANS:- Both Linear Discriminant Analysis (LDA) and PCA are linear transformation methods. PCA yields the directions (principal components) that maximize the variance of the data, whereas LDA aims to find the directions that maximize the separation (or discrimination) between different classes, which can be useful in pattern classification problems (PCA "ignores" class labels).
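A small sketch of the contrast, assuming scikit-learn and the iris data purely for illustration: PCA is fitted without the labels, while LDA requires them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

Z_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised: labels ignored
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: labels required

print(Z_pca.shape, Z_lda.shape)   # both (150, 2), but found with different objectives
```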
MCQs
1) Which of the following are good / recommended applications of PCA? Select all that apply.
A. To compress the data so it takes up less computer memory / disk space.
B. To reduce the dimension of the input data so as to speed up a learning algorithm.
C. Instead of using regularization, use PCA to reduce the number of features to reduce
overfitting.
D. To visualize high-dimensional data (by choosing k = 2 or k = 3).
Ans: A,B,D
2) Suppose you have a dataset {x(1), x(2), x(3), ..., x(m)} where x(i) ∈ Rn. In order to visualize it, we apply
dimensionality reduction and get {z(1), z(2), z(3), ..., z(m)} where z(i) ∈ Rk is k-dimensional. In a typical
setting, which of the following would you expect to be true? Check all that apply.
A. k > n
B. k ≤ n
C. k ≥ 4
D. k = 2 or k = 3 (since we can plot 2D or 3D data but don’t have ways to visualize higher
dimensional data)
Ans: B,D
3) Suppose you run PCA on the dataset below. Which of the following would be a reasonable
vector u(1) onto which to project the data? (By convention, we choose u(1) so that
||u(1)|| = √((u1(1))² + (u2(1))²), the length of the vector u(1), equals 1.)
A. u(1) = [ 1 0 ]
B. u(1) = [ 0 1 ]
C. u(1) = [1/√2  1/√2]
D. u(1) = [1/√2  −1/√2]
Ans: D
4) Suppose we run PCA with k = n, so that the dimension of the data is not reduced at all. (This is not
useful in practice but is a good thought exercise.) Recall that the percent / fraction of variance retained is given by (∑_{i=1}^{k} S_ii) / (∑_{i=1}^{n} S_ii). Which of the following will be true? Check all that apply.
A. Ureduce will be an n×n matrix.
B. xapprox = x for every example x.
C. The percentage of variance retained will be 100%.
D. We have that (∑_{i=1}^{k} S_ii) / (∑_{i=1}^{n} S_ii) > 1.
Ans: A,B,C
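A brief sketch of why A, B and C hold, using scikit-learn and synthetic data (both assumed here): with k = n the cumulative fraction of variance retained reaches exactly 1.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data; with k = n the retained fraction sum_{i=1..k} S_ii / sum_{i=1..n} S_ii is 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
n = X.shape[1]

pca = PCA(n_components=n).fit(X)
retained = np.cumsum(pca.explained_variance_) / np.sum(pca.explained_variance_)
print(retained[-1])   # 1.0, i.e. 100% of the variance is retained when k = n
```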
6) Imagine you have 1000 input features and 1 target feature in a machine learning problem, and you have to select the 100 most important features based on the relationship between the input features and the target feature. Do you think this is an example of a dimensionality reduction technique?
A. Yes
B. No
Solution: (A)
7) [ True or False ] It is not necessary to have a target variable for applying dimensionality
reduction algorithms.
A. TRUE
B. FALSE
Solution: (A)
8) I have 4 variables in the dataset, namely A, B, C and D. I have performed the following actions:
Step 1: Using the above variables, I have created two more variables, namely E = A + 3 * B and F = B + 5 * C + D.
Step 2: Then, using only the variables E and F, I have built a Random Forest model.
Could the steps performed above represent a dimensionality reduction technique?
A. True
B. False
Solution: (A)
Yes, because Step 1 represents the data using only 2 derived (lower-dimensional) variables.
9) Which of the following techniques would perform better for reducing dimensions of a data set?
D. None of these
Solution: (A)
If a column has too many missing values (say 99%), then we can remove such columns.
10) [ True or False ] Dimensionality reduction algorithms are one of the possible ways to reduce
the computation time required to build a model.
A. TRUE
B. FALSE
Solution: (A)
Reducing the dimensionality of the data means a model takes less time to train.
11) Which of the following algorithms cannot be used for reducing the dimensionality of data?
A. t-SNE
B. PCA
C. LDA
D. None of these
Solution: (D)
12) [ True or False ] PCA can be used for projecting and visualizing data in lower dimensions.
A. TRUE
B. FALSE
Solution: (A)
Sometimes it is very useful to plot the data in lower dimensions. We can take the first 2 principal components and then visualize the data using a scatter plot.
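A minimal sketch of such a plot, assuming scikit-learn, matplotlib and the iris data purely for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Project the iris observations onto the first 2 principal components and plot them.
X, y = load_iris(return_X_y=True)
Z = PCA(n_components=2).fit_transform(X)

plt.scatter(Z[:, 0], Z[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```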
13) The most popularly used dimensionality reduction algorithm is Principal Component Analysis
(PCA). Which of the following is/are true about PCA?
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
E. 1,2 and 4
Solution: (F)
C. Can’t Say
Solution: (B)
A higher value of k leads to less smoothing, since we are able to preserve more characteristics of the data; hence there is less regularization.
15) In which of the following scenarios is t-SNE better to use than PCA for dimensionality reduction
while working on a local machine with minimal computational power?
Solution: (C)
t-SNE has quadratic time and space complexity. Thus it is a very heavy algorithm in terms of system
resource utilization.
16) Which of the following statements is true for the t-SNE cost function?
A. It is asymmetric in nature.
B. It is symmetric in nature.
Solution: (B)
The cost function of SNE is asymmetric in nature, which makes it difficult to converge using gradient descent. A symmetric cost function is one of the major differences between SNE and t-SNE.
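For illustration only, a minimal numpy sketch of the symmetric t-SNE objective, assuming the high-dimensional joint distribution P has already been computed elsewhere:

```python
import numpy as np

def tsne_cost(P, Y):
    """KL(P || Q) for symmetric t-SNE, given a precomputed joint distribution P
    over high-dimensional pairs and the current low-dimensional map Y."""
    diff = Y[:, None, :] - Y[None, :, :]              # pairwise differences in the map
    num = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # Student-t kernel, 1 degree of freedom
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()                               # joint, symmetric: Q[i, j] == Q[j, i]
    P = np.maximum(P, 1e-12)
    Q = np.maximum(Q, 1e-12)
    return np.sum(P * np.log(P / Q))
```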
17) Imagine you are dealing with text data. To represent the words you are using word embeddings (Word2vec). In this word embedding, you end up with 1000 dimensions. Now you want to reduce the dimensionality of this high-dimensional data such that similar words have similar meanings in nearest-neighbour space. In such a case, which of the following algorithms are you most likely to choose?
A. t-SNE
B. PCA
C. LDA
D. None of these
Solution: (A)
t-SNE stands for t-Distributed Stochastic Neighbor Embedding, which considers the nearest neighbours when reducing the data.
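A minimal sketch, assuming scikit-learn's TSNE and random vectors standing in for real Word2vec embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random vectors stand in for real Word2vec embeddings: 500 "words", 1000 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 1000))

Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(Z.shape)   # (500, 2): similar words should land near each other in this map
```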
18) [ True or False ] t-SNE learns a non-parametric mapping.
A. TRUE
B. FALSE
Solution: (A)
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that
maps data from the input space to the map.
19) Which of the following statements is correct for t-SNE and PCA?
Solution: (D)
Option D is correct.
20) In the t-SNE algorithm, which of the following hyperparameters can be tuned?
A. Number of dimensions
Solution: (D)
21) Which of the following statements is true about t-SNE in comparison to PCA?
A. When the data is huge (in size), t-SNE may fail to produce better results.
B. t-SNE always produces better results regardless of the size of the data
C. PCA always performs better than t-SNE for smaller size data.
D. None of these
Solution: (A)
Option A is correct
22) Xi and Xj are two distinct points in the higher-dimensional representation, whereas Yi and Yj are the representations of Xi and Xj in a lower dimension.
Which of the following must be true for a perfect representation of Xi and Xj in the lower-dimensional space?
C. p (j|i) = q (j|i)
Solution: (C)
The conditional probabilities for similarity of two points must be equal because similarity between
the points must remain unchanged in both higher and lower dimension for them to be perfect
representations.
23) Which of the following is true about LDA?
A. LDA aims to maximize the distance between classes and minimize the distance within a class
B. LDA aims to minimize both the distance between classes and the distance within a class
C. LDA aims to minimize the distance between classes and maximize the distance within a class
D. LDA aims to maximize both the distance between classes and the distance within a class
Solution: (A)
Option A is correct.
24) In which of the following cases will LDA fail?
A. If the discriminatory information is not in the mean but in the variance of the data
B. If the discriminatory information is in the mean but not in the variance of the data
D. None of these
Solution: (A)
Option A is correct
25) Which of the following comparison(s) are true about PCA and LDA?
B. 2 and 3
C. 1 and 3
D. Only 3
E. 1, 2 and 3
Solution: (E)
C. Can’t Say
D.None of above
Solution: (B)
When all the eigenvectors are the same, you will not be able to select the principal components, because in that case all the principal components are equal.
B. 2 and 3
C. 1 and 3
D. 1 ,2 and 3
Solution: (C)
Option C is correct
28) What happens when you get features in lower dimensions using PCA?
B. 1 and 4
C. 2 and 3
D. 2 and 4
Solution: (D)
When you obtain the features in lower dimensions, you will usually lose some information about the data, and you will not be able to interpret the lower-dimensional features as easily.
29) Imagine you are given the following scatterplot between height and weight. Select the angle which will capture the maximum variability along a single axis.
A. ~ 0 degree
B. ~ 45 degree
C. ~ 60 degree
D. ~ 90 degree
Solution: (B)
B. 1 and 4
C. 2 and 3
D. 2 and 4
Solution: (D)
PCA is a deterministic algorithm which has no parameters to initialize, and it does not have the local-minima problem that most machine learning algorithms have.
Question Context 31
The snapshot below shows the scatter plot of two features (X1 and X2) with the class information (Red, Blue). You can also see the directions found by PCA and LDA.
32) Which of the following methods would result in better class prediction?
C. Can’t say
D. None of these
Solution: (B)
If our goal is to classify these points, the PCA projection does more harm than good: the majority of blue and red points would land on top of each other on the first principal component, and hence PCA would confuse the classifier.
33) Which of the following options are correct when you are applying PCA on an image dataset?
B. 2 and 3
C. 3 and 4
D. 1 and 4
Solution: (C)
Option C is correct
34) Under which condition do SVD and PCA produce the same projection result?
D. None of these
Solution: (B)
When the data has a zero mean vector; otherwise you have to center the data first before taking the SVD.
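A small numeric check of this statement, assuming numpy and scikit-learn with synthetic data: after centering, the right singular vectors from the SVD match PCA's components up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) + 5.0          # deliberately non-zero mean

Xc = X - X.mean(axis=0)                      # center the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pca = PCA(n_components=4).fit(X)             # scikit-learn centers internally

# The principal directions agree row by row, up to a sign flip.
for v_svd, v_pca in zip(Vt, pca.components_):
    print(np.allclose(v_svd, v_pca) or np.allclose(v_svd, -v_pca))
```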
Question Context 34
Consider 3 data points in the 2-d space: (-1, -1), (0,0), (1,1).
35) What will be the first principal component for this data?
1. (√2/2, √2/2)
2. (1/√3, 1/√3)
3. (−√2/2, −√2/2)
4. (−1/√3, −1/√3)
A. 1 and 2
B. 3 and 4
C. 1 and 3
D. 2 and 4
Solution: (C)
The first principal component is v = [√2/2, √2/2]ᵀ (you shouldn't really need to solve any SVD or eigenproblem to see this). Note that the principal component should be normalized to have unit length. (The negation v = [−√2/2, −√2/2]ᵀ is also correct.)
36) If we project the original data points onto the 1-d subspace spanned by the principal component [√2/2, √2/2]ᵀ, what are their coordinates in the 1-d subspace?
A. (−√2), (0), (√2)
B. (√2), (0), (√2)
C. (√2), (0), (−√2)
Solution: (A)
The coordinates of the three points after projection should be z1 = x1ᵀ v = [−1, −1][√2/2, √2/2]ᵀ = −√2, z2 = x2ᵀ v = 0, z3 = x3ᵀ v = √2.
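The same computation as a minimal numpy sketch (numpy assumed):

```python
import numpy as np

X = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
v = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2])   # first principal component

z = X @ v
print(z)   # [-1.414..., 0.0, 1.414...] i.e. -sqrt(2), 0, sqrt(2)
```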
37) For the projected data, you just obtained the projections (−√2), (0), (√2). Now, if we represent them in the original 2-d space and consider them as the reconstruction of the original data points, what is the reconstruction error?
A. 0%
B. 10%
C. 30%
D. 40%
Solution: (A)
The reconstruction error is 0, since all three points are perfectly located on the direction of the first principal component. Or, you can actually calculate the reconstruction zi · v for each point and compare it with the original data.
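Again as a minimal numpy sketch (numpy assumed), reconstructing the points from their 1-d coordinates:

```python
import numpy as np

X = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
v = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2])

z = X @ v                      # 1-d coordinates: -sqrt(2), 0, sqrt(2)
X_approx = np.outer(z, v)      # reconstruction z_i * v for each point

print(np.linalg.norm(X - X_approx))   # ~0: the points lie exactly on the component
```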
38) In LDA, the idea is to find the line that best separates the two classes. In the given image which
of the following is a good projection?
A. LD1
B. LD2
C. Both
D. None of these
Solution: (A)
Question Context 39
PCA is a good technique to try, because it is simple to understand and is commonly used to reduce the dimensionality of the data. Obtain the eigenvalues λ1 ≥ λ2 ≥ ... ≥ λD and plot f(M) = (λ1 + ... + λM) / (λ1 + ... + λD) to see how f(M) increases with M and takes its maximum value 1 at M = D. We have the two graphs given below:
40) Which of the above graphs shows better performance of PCA, where M is the number of principal components kept and D is the total number of features?
A. Left
B. Right
C. Any of A and B
D. None of these
Solution: (A)
PCA is good if f(M) asymptotes rapidly to 1. This happens if the first eigenvalues are big and the
remainder are small. PCA is bad if all the eigenvalues are roughly equal. See examples of both cases
in the figure.
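A minimal numpy sketch of f(M), with two made-up eigenvalue spectra illustrating the good and bad cases:

```python
import numpy as np

def f(eigenvalues):
    """f(M) = (lambda_1 + ... + lambda_M) / (lambda_1 + ... + lambda_D) for M = 1..D."""
    eigenvalues = np.sort(eigenvalues)[::-1]
    return np.cumsum(eigenvalues) / np.sum(eigenvalues)

good_case = np.array([10.0, 5.0, 0.5, 0.3, 0.2])   # a few large eigenvalues
bad_case = np.ones(5)                               # all eigenvalues roughly equal

print(f(good_case))   # rises quickly towards 1 -> PCA works well
print(f(bad_case))    # [0.2 0.4 0.6 0.8 1. ] -> PCA gains little
```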
41) Which of the following statements is true about PCA and LDA?
A. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class.
B. Both attempt to model the difference between the classes of data.
C. PCA explicitly attempts to model the difference between the classes of data. LDA on the other hand does not take into account any difference in class.
D. Both don't attempt to model the difference between the classes of data.
Solution: (A)
42) Which of the following can be the first 2 principal components after applying PCA?
B. 1 and 3
C. 2 and 4
D. 3 and 4
Solution: (D)
For the first two choices, the two loading vectors are not orthogonal.
43) Which of the following gives the difference(s) between logistic regression and LDA?
1. If the classes are well separated, the parameter estimates for logistic regression can be
unstable.
2. If the sample size is small and the distribution of features is normal for each class, then linear discriminant analysis is more stable than logistic regression.
A. 1
B. 2
C. 1 and 2
D. None of these
Solution: (C)
44) Which of the following offsets do we consider in PCA?
A. Vertical offset
B. Perpendicular offset
C. Both
D. None of these
Solution: (B)
In regression we consider residuals as vertical offsets; perpendicular offsets are used in the case of PCA.
45) Imagine you are dealing with a 10-class classification problem and you want to know at most how many discriminant vectors can be produced by LDA. What is the correct answer?
A. 20
B. 9
C. 21
D. 11
E. 10
Solution: (B)
LDA produces at most c − 1 discriminant vectors for c classes, so for 10 classes the maximum is 10 − 1 = 9.
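A quick check with scikit-learn (synthetic data assumed): with 10 classes, LDA returns at most 9 discriminant components no matter how many input features there are.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic 10-class problem with 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 10, size=500)

Z = LinearDiscriminantAnalysis().fit_transform(X, y)
print(Z.shape)   # (500, 9): at most n_classes - 1 = 9 discriminant components
```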