Practice Exam
1. The plot below shows data points for two predictor variables (x1 and x2) in a 2-class
classification problem (red vs. blue):
Select the method that should give the lowest misclassification error on this data set from
the following:
• Classification Tree
• Logistic Regression
2. You have been given a large data set with 200,000 data points and 50 predictor variables. The
response variable takes continuous values. You have implemented a Linear Regression model
and obtained some Mean Squared Error (MSE) value along with the following residual plot
(fitted values vs errors):
Now, you want to try the Decision Tree method. You assume that since it is a large data set, you
can speed up the computation by restricting the maximum depth of the tree. You split the data
into 70% training set and 30% test set. Next, you apply the Decision Tree method and obtain a
new MSE value, which is higher than the one obtained using the Linear Regression model. What
would you do next?
3. You are working on a classification project to identify whether an individual will have a certain
disease or not.
The predictors are the measurements obtained from the individual's blood test report. The
training data set contains 15,000 data samples and 10 predictor variables. You notice that 20
samples are missing values for some predictor variables at random.
Upon further inspection, you find the following information:
1) the data set is balanced (i.e., it has a similar proportion of both the classes),
2) the maximum number of predictor variable values that are missing for any of the 20 samples
is 5,
3) no predictor variable is missing values in more than 2 samples, and
4) 11 out of the 20 samples belong to the same class.
How would you handle the missing values?
4. Which of the following statements are true with respect to the Random Forests method?
• 0.57
• 1
• 0.47
7. You are working on a 2-class classification problem where the objective is to determine whether
a customer is going to default on a loan or not. There are 2 predictor variables: the customer's
credit history and the amount of loan. Below is the table with 7 data points.
You have decided to apply a tree-based model on this data set with Gini index as a measure to
select the variable to be split at the root node. Calculate and report the Gini index score for the
predictor variable: "Loan Amount"
• 0
• 0.29
• 0.24
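Since the 7-point table is not reproduced here, the Gini computation can only be sketched generically. A minimal sketch, assuming binary class labels; the split below is hypothetical, not the table's actual values:

```python
def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(groups):
    """Gini index of a split: node impurities weighted by node size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

# Hypothetical split of 7 samples (1 = default, 0 = no default):
left, right = [0, 0, 1], [1, 1, 1, 0]
score = weighted_gini([left, right])
print(round(score, 4))
```

The predictor whose best split yields the lowest weighted Gini score is chosen for the root node.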
8. You are working on a 2-class classification problem where the objective is to determine whether
a customer is going to default on a loan or not. There are 2 predictor variables: the customer's
credit history and the amount of loan. Below is the table with 7 data points.
You have decided to apply a tree-based model to this data set, with the Gini index as the measure to
select the variable to be split at the root node. Which predictor variable would you select?
• Credit History
• Loan Amount
9. Which of the following statements is not true about tree-based models?
• The relationship between the predictor variables and the response variable has to be
expressed in a parametric form to obtain high accuracy.
• Tree-based models can handle data with missing values.
• Tree-based models can handle predictors which take categorical values.
10. Below is a list of tasks involved in modeling a predictive analytics project represented by
respective alphabets:
a - read data
c - split data into 20% testing data set and 80% training data set
d - apply linear regression method
e - apply ridge regression method after standardizing the predictor variable values
f - apply the lasso method after standardizing the predictor variable values
You are given a predictive analytics project to estimate house prices given 4 predictors: the
number of rooms, school ratings, crime rate, and nitric oxides concentration. The training data
set consists of 50,000 data samples. You suspect there is a linear relationship between the
predictor variables and the response variable. Your objective is to obtain a high prediction
accuracy and also keep the model interpretable. Pick the correct list of tasks involved and the
order in which you will execute them for this project.
• a-b-c-f-i
• a-b-c-e-i
• a-b-c-d-i
11. Coding-based: The data set used in the python file has 10 numeric predictor variables. The
response variable is a quantitative measurement of the disease (diabetes) progression one year
after the baseline values of the predictor variables are recorded. For the given data set in the
python file, report the mean value observed for the response variable (target).
• 0
• 346
• 152.13
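The description (10 numeric predictors, diabetes progression one year after baseline) matches scikit-learn's built-in diabetes data set; a sketch under that assumption, since the python file itself is not reproduced here:

```python
from sklearn.datasets import load_diabetes

# Load the diabetes data set (442 samples, 10 predictors).
X, y = load_diabetes(return_X_y=True)

# Mean of the response variable (disease progression).
mean_target = y.mean()
print(round(mean_target, 2))
```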
12. Coding-based: For the given data set in the python file, report the predictor variable that has the
largest mean value. To identify this variable, compare the closest mean values up to 3 decimal
places.
• s4
• All the predictor variables have the same mean value (up to 3 decimal places).
• s2
• s1
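A sketch, again assuming scikit-learn's diabetes data set: note that load_diabetes returns predictors that are already mean-centered and scaled, so every column mean is essentially zero:

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
means = data.data.mean(axis=0)

# Each predictor's mean, rounded to 3 decimal places.
for name, m in zip(data.feature_names, means):
    print(f"{name}: {m:.3f}")
```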
13. Coding-based: Which of the following predictor variables seems to have a stronger linear
relationship with the response variable than the other 3 options?
• bmi
• s4
• age
• sex
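One way to compare the options, under the same scikit-learn diabetes assumption, is the Pearson correlation of each candidate predictor with the target:

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
y = data.target

# Pearson correlation of each option with the response.
corrs = {}
for name in ["bmi", "s4", "age", "sex"]:
    col = data.data[:, data.feature_names.index(name)]
    corrs[name] = np.corrcoef(col, y)[0, 1]
    print(f"{name}: {corrs[name]:.3f}")
```

A scatter plot of each predictor against the target would show the same ranking visually.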
14. Coding-based: For the given data set in the python file, which of the following pairs of predictor
variables are correlated the most:
• s3-s5
• s2-s3
• s1-s2
• s2-s4
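The pairwise comparison can be sketched with a correlation between each candidate pair of columns (still assuming scikit-learn's diabetes data set):

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
names = data.feature_names

def corr(a, b):
    """Pearson correlation between two named predictor columns."""
    ca = data.data[:, names.index(a)]
    cb = data.data[:, names.index(b)]
    return np.corrcoef(ca, cb)[0, 1]

pair_corrs = {p: corr(*p) for p in
              [("s3", "s5"), ("s2", "s3"), ("s1", "s2"), ("s2", "s4")]}
for p, r in pair_corrs.items():
    print(p, round(r, 3))
```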
15. Coding-based: For the given data set in the python file, do the following. In a cell, type:
random.seed(123). Then split the data into train and test data sets with test_size=0.20 and
random_state=1 as parameters. Report the number of data samples in the training data set.
• 89
• 442
• 353
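A sketch of the split, assuming scikit-learn's diabetes data set and its train_test_split helper (note that with 442 samples, 20% rounds to 89 test samples, leaving 353 for training):

```python
import random
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

random.seed(123)  # as instructed in the question

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)

print(len(X_train), len(X_test))
```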
16. Coding-based: For the given data set in the python file, fit a linear regression model to the training data set
obtained after splitting the data set with the same conditions as in the previous question
(random.seed(123), test_size=0.2 and random_state=1). Make predictions on the test data set
with the fit obtained on the training data set. Obtain the lowest mean squared error. Select the
closest value of the mean squared error that you obtained from the following:
• 20
• 5
• 2990
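A sketch of the fit-and-evaluate step, under the same diabetes-data-set assumption:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Fit on the training set, evaluate on the held-out test set.
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(round(mse))
```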
17. Coding-based: For the given data set in the python file, fit one of the shrinkage methods to the
training data set obtained after splitting the data set with the same conditions as in the previous
question (random.seed(123), test_size=0.2 and random_state=1). Make predictions on the test
data set with the fit obtained on the training data set. Obtain the lowest mean squared error. (In
order to obtain the lowest mean squared error, you may want to tune the 'alpha'
parameter, which can take values from decimal points to integers) Select the closest value of the
mean squared error that you obtained from the following:
• 2000
• 2930
• 3500
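A sketch using Ridge as the shrinkage method (Lasso is analogous), with the alpha grid below being an assumed tuning range, not one given in the question:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Standardize predictors before applying the shrinkage method.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Tune alpha over a coarse grid and keep the lowest test MSE.
best = min(
    mean_squared_error(
        y_test, Ridge(alpha=a).fit(X_train_s, y_train).predict(X_test_s))
    for a in np.logspace(-3, 2, 50))
print(round(best))
```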
18. Coding-based: For the given data set in the python file: Fit a decision tree regressor to the same
training data set obtained after splitting the data set with the same conditions as in an earlier
question (random.seed(123), test_size=0.2 and random_state=1). (Set the random_state=1 in
the regressor. You may tune the respective parameter: max_depth to attain the lowest error.)
Make predictions on the test data set with the fit obtained on the training data set. Obtain the
lowest mean squared error. Select the closest value of the mean squared error that you
obtained from the following:
• 6766
• 2990
• 4090
19. Coding-based: For the given data set in the python file: Fit a random forest regressor to the
same training data set obtained after splitting the data set with the same conditions as in an
earlier question (random.seed(123), test_size=0.2 and random_state=1 ) (Set the
random_state=1 in the regressor. You may tune the respective parameter: n_estimators to
attain the lowest error.) Make predictions on the test data set with the fit obtained on the
training data set. Obtain the lowest mean squared error. Select the closest value of the mean
squared error that you obtained from the following:
• 4284
• 3700
• 8232
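A sketch of the random-forest step; the n_estimators values are an assumed tuning grid:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Tune n_estimators and keep the lowest test MSE.
best = min(
    mean_squared_error(
        y_test,
        RandomForestRegressor(n_estimators=n, random_state=1)
        .fit(X_train, y_train).predict(X_test))
    for n in (10, 50, 100, 200, 500))
print(round(best))
```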
20. Based on the results obtained using the linear and tree-based methods on the diabetes data set
in the python file, is the below statement True or False? The relationship between the predictor
variables and the response variable can be well-explained using if-then statements in the
predictor space.
• False
• True