Question Bank Data Science & Its Applications
1. What are the applications of a word cloud generator? What are the prerequisite Python
packages for creating a word cloud? Demonstrate how to create a word cloud in Python.
2. Discuss the advantages and limitations of bigram, trigram and n-gram models. How are
these models implemented? Write a Python function.
3. Compare Gibbs sampling with traditional sampling and bagging methods, and explain its
advantages and applications.
4. Discuss Latent Dirichlet Allocation (LDA), a technique commonly used to identify common
topics in a set of documents. Write a Python function that assigns a topic to each document
and lists the most common words per topic.
5. What is a centrality measure? Discuss betweenness centrality and PageRank centrality. Use
an undirected graph to define eigenvector centrality, in which the centrality of a node
depends on both the number of its neighbors and their centrality.
6. How does a recommendation system use natural language processing? Explain the concept
of matrix factorisation in recommendation systems. Explain the concept of content-based
filtering and discuss its applications and limitations in the context of recommending items to
users. Provide examples of real-world applications and analyze the benefits they provide to
users.
7. Analyze the concept of matrix factorization and its application in the recommendation
systems. Discuss the advantages and limitations of matrix factorization compared to other
techniques and provide examples of real-world applications.
8. Explain how collaborative filtering works, and discuss the challenges and potential
solutions for improving its accuracy.
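For Question 2 above, which asks for a Python function, the following is a minimal sketch of one possible n-gram extractor, not a model answer. It assumes whitespace tokenization and lowercasing. The function names `ngrams` and `ngram_counts` and the sample sentence are illustrative choices, not part of the question bank.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token sequence as a list of tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the tokens; a sequence shorter
    # than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in a text (assumption: lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


# Illustrative sample sentence (hypothetical data).
sample = "the cat sat on the mat the cat slept"
bigram_counts = ngram_counts(sample, 2)
trigram_counts = ngram_counts(sample, 3)
```

Setting `n=2` gives bigrams and `n=3` gives trigrams. As `n` grows, each n-gram captures more context, but the counts become sparser, which is the trade-off the question asks you to discuss.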
Module 4
1. a. Consider the following dataset. Write a program to demonstrate the working of the
decision tree based ID3 algorithm.
b. Consider the dataset spiral.txt. The first two columns in the dataset correspond to the
coordinates of each data point. The third column corresponds to the actual cluster label.
Compute the Rand index for the following methods:
K-means clustering
Single-link hierarchical clustering
Complete-link hierarchical clustering
Also visualize the dataset and state which algorithm is able to recover the true clusters.
2. Analyze the importance of training data in the performance of machine learning models.
Discuss the different types of training data, the considerations to take into account when
collecting and preprocessing the training data and provide examples of real-world
applications where training data plays a critical role.
3. Examine the concept of Mean Absolute Error (MAE) and its application in the evaluation
of regression models. Discuss the advantages and disadvantages of MAE compared to other
evaluation metrics and provide examples of how MAE can be used to improve model
performance.
4. What is an Artificial Neural Network (ANN), and what are its applications? How do neural
networks work? Develop a Perceptron training algorithm. Explain how a Backpropagation
network learns to solve the XOR problem.
5. Define Perceptron. Write a Python program to create a multi-layer perceptron for the XOR
function.
6. What are tensors? How do you approach feature selection for a deep learning model? What
methods have you used in the past for hyperparameter tuning?
7. What are some common challenges found in deep learning models, and how would you
overcome them?
8. Describe your approach to handling overfitting and underfitting in a deep learning model
using regularization techniques.
9. Compute hierarchical clustering and explain how classification differs from
clustering.
10. Derive the Backpropagation rule, considering the training rule for output unit weights
and the training rule for hidden unit weights, with an example.
2. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the
associated hyperparameters. Train the model with the following set of hyperparameters: RBF
kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try
C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification
accuracy along with the total number of support vectors on the test data.
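For Question 4 of this module, which asks for a Perceptron training algorithm, the following is a minimal sketch of the classic perceptron learning rule. It assumes a step activation with threshold 0, binary 0/1 targets, and the logical AND function as illustrative training data. The names `train_perceptron` and `predict` and the default values are hypothetical choices, not part of the question bank.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum is non-negative, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Train a single perceptron with the perceptron learning rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(samples, labels):
            # The update is non-zero only when the prediction is wrong.
            update = learning_rate * (target - predict(weights, bias, x))
            if update != 0:
                weights = [w + update * xi for w, xi in zip(weights, x)]
                bias += update
                errors += 1
        if errors == 0:
            # Every sample is classified correctly, so training can stop.
            break
    return weights, bias


# Illustrative data: logical AND, which is linearly separable.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]
weights, bias = train_perceptron(and_inputs, and_targets)
predictions = [predict(weights, bias, x) for x in and_inputs]
```

A single perceptron trained this way can only separate linearly separable classes. Running the same loop on XOR targets would never reach an error-free epoch, which is why Questions 4 and 5 move on to multi-layer networks trained with backpropagation.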