Data Mining Exam Questions


lOMoARcPSD|15208953

Data mining exam questions

Data Mining (Assiut University)

Studocu is not sponsored or endorsed by any college or university


Downloaded by Heather Win ([email protected])

1. Data mining is best described as the process of


a. identifying patterns in data.
b. deducing relationships in data.
c. representing data.
d. simulating trends in data.
2. Data used to build a data mining model.
a. validation data
b. training data
c. test data
d. hidden data
3. Supervised learning and unsupervised clustering both require at least one
a. hidden attribute.
b. output attribute.
c. input attribute.
d. categorical attribute.
4. Supervised learning differs from unsupervised clustering in that supervised learning
requires
a. at least one input attribute.
b. input attributes to be categorical.
c. at least one output attribute.
d. output attributes to be categorical.
5. Which of the following is a valid production rule for the decision tree below?

Business Appointment?
    Yes: Decision = wear slacks
    No:  Temp above 70?
        No:  Decision = wear jeans
        Yes: Decision = wear shorts

a. IF Business Appointment = No & Temp above 70 = No
THEN Decision = wear slacks
b. IF Business Appointment = Yes & Temp above 70 = Yes
THEN Decision = wear shorts
c. IF Temp above 70 = No
THEN Decision = wear shorts
d. IF Business Appointment = No & Temp above 70 = No
THEN Decision = wear jeans

6. A statement to be tested.
a. theory
b. procedure
c. principle
d. hypothesis

7. Which statement about outliers is true?


a. Outliers should be identified and removed from a dataset.
b. Outliers should be part of the training dataset but should not be present in the test data.
c. Outliers should be part of the test dataset but should not be present in the training data.
d. The nature of the problem determines how outliers are used.
e. More than one of a, b, c or d is true.
8. Assume that we have a dataset containing information about 200 individuals. One
hundred of these individuals have purchased life insurance. A supervised data mining
session has discovered the following rule:

IF age < 30 & credit card insurance = yes


THEN life insurance = yes
Rule Accuracy: 70%
Rule Coverage: 63%

How many individuals in the class life insurance = no have credit card insurance and are less
than 30 years old?
a. 63
b. 70
c. 30
d. 27
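
The arithmetic behind this kind of question can be sketched as follows. The sketch assumes the usual textbook definitions: rule coverage is the share of the consequent class that satisfies the antecedent, and rule accuracy is the share of instances satisfying the antecedent that also belong to the consequent class. The variable names are illustrative.

```python
class_size = 100   # individuals with life insurance = yes
coverage = 0.63    # share of the "yes" class matched by the antecedent
accuracy = 0.70    # share of matched individuals that are in the "yes" class

# Individuals in the "yes" class that satisfy the antecedent.
covered_yes = round(coverage * class_size)

# All individuals satisfying the antecedent, in either class.
matching_total = round(covered_yes / accuracy)

# Individuals in the "no" class that still satisfy the antecedent.
covered_no = matching_total - covered_yes

print(covered_yes, matching_total, covered_no)  # 63 90 27
```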

9. Unlike traditional production rules, association rules


a. allow the same variable to be an input attribute in one rule and an output attribute
in another rule.
b. allow more than one input attribute in a single rule.
c. require input attributes to take on numeric values.
d. require each rule to have exactly one categorical output attribute.
10. Given a rule of the form IF X THEN Y, rule confidence is defined as the conditional
probability that
a. Y is true when X is known to be true.
b. X is true when Y is known to be true.
c. Y is false when X is known to be false.
d. X is false when Y is known to be false.
11. Association rule support is defined as
a. the percentage of instances that contain the antecedent conditional items listed in the
association rule.
b. the percentage of instances that contain the consequent conditions listed in the association
rule.
c. the percentage of instances that contain all items listed in the association rule.
d. the percentage of instances in the database that contain at least one of the antecedent
conditional items listed in the association rule.
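
A minimal sketch of how support and confidence could be computed over a toy transaction list. The transactions and the function names `support` and `confidence` are made up for illustration.

```python
# Toy market-basket data (hypothetical).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"milk", "bread"}, transactions))       # 0.5
print(confidence({"eggs"}, {"milk"}, transactions))   # 1.0
```

Support counts instances containing all items of the rule, while confidence conditions on the antecedent being present.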
12. This approach is best when we are interested in finding all possible interactions among a
set of attributes.
a. decision tree
b. association rules
c. K-Means algorithm
d. genetic learning
13. This step of the KDD process model deals with noisy data.
a. Creating a target dataset
b. data preprocessing
c. data transformation
d. data mining

14. This data transformation technique works well when minimum and maximum values for a
real-valued attribute are known.
a. min-max normalization
b. decimal scaling
c. z-score normalization
d. logarithmic normalization

15. This technique uses mean and standard deviation scores to transform real-valued
attributes.
a. decimal scaling
b. min-max normalization
c. z-score normalization
d. logarithmic normalization

16. A data normalization technique for real-valued attributes that divides each numerical
value by the same power of 10.
a. min-max normalization
b. z-score normalization
c. decimal scaling
d. decimal smoothing
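
The three normalization techniques named in questions 14 to 16 can be sketched as below. The function names are illustrative, and the z-score version assumes the population standard deviation; some texts use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale using the known minimum and maximum of the attribute."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Rescale using the mean and (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide every value by the smallest power of 10 that brings all |v| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [200, 300, 400, 600, 1000]
print(min_max(incomes))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(incomes))  # every value divided by 10**4
```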

17. The correlation coefficient for two real-valued attributes is –0.85. What does this value
tell you?
a. The attributes are not linearly related.
b. As the value of one attribute increases the value of the second attribute also increases.
c. As the value of one attribute decreases the value of the second attribute increases.
d. The attributes show a curvilinear relationship.
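
To illustrate what a strongly negative correlation coefficient looks like, here is a small sketch of the Pearson coefficient on made-up data. The function name `pearson_r` and the sample values are assumptions for illustration only.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: as one attribute rises, the other tends to fall.
hours_tv = [1, 2, 3, 4, 5]
exam_score = [90, 85, 70, 72, 60]

r = pearson_r(hours_tv, exam_score)
print(round(r, 2))  # a value close to -1
```

A value near -1 indicates a strong inverse linear relationship: when one attribute decreases, the other tends to increase.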

18. The most frequently occurring number in a set of values is called the ____.
a. Mean
b. Median
c. Mode
d. Range
19. ................... is an essential process where intelligent methods are applied to extract data patterns.

A) Data warehousing
B) Data mining
C) Text mining
D) Data selection
20. Which of the following is not a data mining functionality?

A) Characterization and Discrimination


B) Classification and regression
C) Selection and interpretation
D) Clustering and Analysis
21. The aspects of data mining methodology include ...................
i) Mining various and new kinds of knowledge
ii) Mining knowledge in multidimensional space
iii) Pattern evaluation and pattern- or constraint-guided mining
iv) Handling uncertainty, noise, or incompleteness of data
A) i, ii and iv only
B) ii, iii and iv only
C) i, ii and iii only
D) All i, ii, iii and iv

22. The task of inferring a model from labeled training data is called


A. Unsupervised learning
B. Supervised learning
C. Reinforcement learning

23. Discriminating between spam and ham e-mails is a classification task. True or
false?
A. True
B. False

24. Which of the following is true for Classification?


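a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples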
d) All of these

25. Data mining is?


a) A subject-oriented integrated time variant non-volatile collection of data in support of management
b) The actual discovery phase of a knowledge discovery process
c) The stage of selecting the right data for a KDD process
d) None of these

26. Which of the following is a summarization of the general characteristics or features of a
target class of data?
a) Data selection
b) Data discrimination
c) Data Classification
d) Data Characterization

27. What is noise?


a) component of a network
b) context of KDD and data mining
c) aspects of a data warehouse
d) None of these

28. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory is named as …
a) Bayesian classifiers
b) Dijkstra classifiers
c) doppler classifiers
d) all of these

29. Group of similar objects that differ significantly from other objects is named as …
a) classification
b) cluster
c) community
d) none of these

30. A Bayesian classifier is
A. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory.
B. Any mechanism employed by a learning system to constrain the search space of a hypothesis
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation.
D. None of these

31. Classification is
A. A subdivision of a set of examples into a number of classes
B. A measure of the accuracy of the classification of a concept that is given by a certain
theory
C. The task of assigning a classification to a set of examples
D. None of these

32. Binary attributes are

A. Attributes that take only two values. In general, these values will be 0 and 1, and they can
be coded as one bit
B. The natural environment of a certain species
C. Systems that can be used without knowledge of internal operations
D. None of these

33. Classification accuracy is


A. A subdivision of a set of examples into a number of classes
B. Measure of the accuracy of the classification of a concept that is given by a
certain theory
C. The task of assigning a classification to a set of examples
D. None of these

34. Cluster is
A. Group of similar objects that differ significantly from other objects
B. Operations on a database to transform or simplify data in order to prepare it for a
machine-learning algorithm
C. Symbolic representation of facts or ideas from which information can potentially be
extracted
D. None of these

35. Data selection is


A. The actual discovery phase of a knowledge discovery process
B. The stage of selecting the right data for a KDD process

C. A subject-oriented integrated time variant non-volatile collection of data in support of
management
D. None of these

36. Discovery is

A. It is hidden within a database and can only be recovered if one is given certain clues
(an example is encrypted information).
B. The process of extracting implicit, previously unknown and potentially useful
information from data

C. An extremely complex molecule that occurs in human chromosomes and that carries
genetic information in the form of genes.
D. None of these

37. KDD (Knowledge Discovery in Databases) refers to
A. Non-trivial extraction of implicit, previously unknown and potentially useful information
from data

B. Set of columns in a database table that can be used to identify each record within this table
uniquely.
C. collection of interesting and useful patterns in a database
D. none of these

38. Naive prediction is


A. A class of learning algorithms that try to derive a Prolog program from examples
B. A table with n independent attributes can be seen as an n- dimensional space.

C. A prediction made using an extremely simple method, such as always predicting the
same output.
D. None of these

39. Classification problems are distinguished from estimation problems in that


a. classification problems require the output attribute to be numeric.
b. classification problems require the output attribute to be categorical.
c. classification problems do not allow an output attribute.
d. classification problems are designed to predict future outcomes.

40. Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61
Which of the following numbers of bins is not possible when using equidepth bins?
A. 2
B. 4
C. 5
D. All of the above
41. Consider discretizing a continuous attribute whose values are listed below:
3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67
Using equal-width partitioning and four bins, how many values are there in the first
bin (the bin with small values)?
A. 1
B. 2
C. 3
D. 4
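
Questions 40 and 41 can be checked with a short sketch of the two binning schemes. The function names are illustrative. The equal-depth version assumes strict equal depth, meaning every bin must hold exactly the same number of values, which is the reading that makes some bin counts impossible.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding the same number of values."""
    if len(values) % k != 0:
        raise ValueError(f"{len(values)} values cannot be split evenly into {k} bins")
    size = len(values) // k
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


q41 = [3, 4, 5, 10, 21, 32, 43, 44, 46, 52, 59, 67]
print(equal_width_bins(q41, 4))
# [[3, 4, 5, 10], [21, 32], [43, 44, 46], [52, 59, 67]]

q40 = [3, 4, 5, 10, 20, 32, 43, 44, 46, 52, 59, 61]
print(equal_depth_bins(q40, 4))
# [[3, 4, 5], [10, 20, 32], [43, 44, 46], [52, 59, 61]]
```

With 12 values, strict equal-depth binning works for 2 or 4 bins but raises an error for 5 bins.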

42. A machine learning problem involves four attributes plus a class. The attributes
have 3, 2, 2, and 2 possible values each. The class has 3 possible values. How
many possible different examples are there?

A. 3
B. 6
C. 48
D. 72
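
The count in question 42 is a product of the number of values each attribute and the class can take. A small sketch, with illustrative variable names, that also confirms the product by brute-force enumeration:

```python
from itertools import product
from math import prod

attribute_sizes = [3, 2, 2, 2]  # possible values per attribute
class_size = 3                  # possible class values

# Every example is one choice per attribute plus one class value.
n_examples = prod(attribute_sizes) * class_size

# Brute-force check: enumerate every combination explicitly.
enumerated = sum(1 for _ in product(*(range(n) for n in attribute_sizes + [class_size])))

print(n_examples, enumerated)  # 72 72
```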
43. Which of the following statements about Naive Bayes is incorrect?
A. Attributes are equally important.
B. Attributes are statistically dependent on one another given the class value.
C. Attributes are statistically independent of one another given the class value.
D. All of the above

49. In association rule mining, the generation of the frequent itemsets is the
computationally intensive step.
a. True
b. False

1. Data Characterization is a summarization of the general characteristics or features
of a target class of data.

A) True
B) False

2. Data selection is a comparison of the general features of the target class data
objects against the general features of objects from one or multiple contrasting
classes.

A. True
B. False

50. The problem of finding hidden structure in unlabeled data is called


A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning

51. The choice of a data mining tool is made at this step of the KDD
process.
a. goal identification
b. creating a target dataset
c. data preprocessing
d. data mining
52. Attributes may be eliminated from the target dataset during this step of
the KDD process.
a. creating a target dataset
b. data preprocessing
c. data transformation
d. data mining

53. A common method used by some data mining techniques to deal
with missing data items during the learning process.
a) replace missing real-valued data items with class means
b) discard records with missing data
c) replace missing attribute values with the values found within
other similar instances
d) ignore missing attribute values

54. The term data mining was originally used to ______.


a. include most forms of data analysis in order to increase sales
b. describe the process through which previously unknown patterns
in data were discovered
c. describe the analysis of huge datasets stored in data
warehouses
d. All of the above
55. What is a major characteristic of data mining?
a. Because of the large amounts of data and massive search
efforts, it is sometimes necessary to use serial processing for
data mining
b. The miner needs sophisticated programming skills.
c. Data mining tools are readily combined with spreadsheets and
other software development tools
d. Data are often buried within numerous small large databases,
which sometimes contain data from several years.

56. What is true about data mining?

A. Data Mining is defined as the procedure of extracting information from huge sets of
data
B. Data mining also involves other processes such as Data Cleaning, Data Integration,
Data Transformation
C. Data mining is the procedure of mining knowledge from data.
D. All of the above
Ans : D

57. The mapping or classification of a class with some predefined group or class is
known as?

A. Data Characterization
B. Data Discrimination
C. Data Set
D. Data Sub Structure
Ans : B

58. The analysis performed to uncover interesting statistical correlations between
associated-attribute-value pairs is called?
A. Mining of Association
B. Mining of Clusters
C. Mining of Correlations
D. None of the above

Ans : C
Explanation: Mining of Correlations : It is a kind of additional analysis performed to
uncover interesting statistical correlations between associated-attribute-value pairs or
between two item sets, to analyze whether they have a positive, negative or no effect on each
other.

59. __________ may be defined as the data objects that do not comply with the
general behavior or model of the data available.

A. Outlier Analysis
B. Evolution Analysis
C. Prediction
D. Classification
Ans : A
Explanation: Outlier Analysis : Outliers may be defined as the data objects that do not
comply with the general behavior or model of the data available.

60. "Efficiency and scalability of data mining algorithms" issues come under?

A. Mining Methodology and User Interaction Issues
B. Performance Issues
C. Diverse Data Types Issues
D. None of the above
Ans : B
Explanation: In order to effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.

61. What is the use of data cleaning?

A. to remove the noisy data


B. correct the inconsistencies in data
C. transformations to correct the wrong data.
D. All of the above
Ans : D
Explanation: Data cleaning is a technique that is applied to remove the noisy data and
correct the inconsistencies in data. Data cleaning involves transformations to correct the
wrong data. Data cleaning is performed as a data preprocessing step while preparing the
data for a data warehouse.

62. Data Mining System Classification consists of?


A. Database Technology
B. Machine Learning
C. computer Vision
D. All of the above
Ans : D
Explanation: A data mining system can be classified according to the following criteria :
Database Technology, Statistics, Machine Learning, Information Science, Visualization,
Other Disciplines

63. Which of the following is correct application of data mining?


A. Market Analysis and Management
B. Corporate Analysis & Risk Management
C. Fraud Detection
D. All of the above
Ans : D
Explanation: Data mining is highly useful in the following domains : Market Analysis and
Management, Corporate Analysis & Risk Management, Fraud Detection

64. In Data Characterization, class under study is called as?

A. Study Class
B. Initial Class
C. Target Class
D. Final Class
Ans : C
Explanation: Data Characterization : This refers to summarizing data of class under study.
This class under study is called as Target Class.

65. A sequence of patterns that occur frequently is known as?


A. Frequent Item Set
B. Frequent Subsequence
C. Frequent Sub Structure
D. All of the above
Ans : B
Explanation: Frequent Subsequence : A sequence of patterns that occur frequently, such as
the purchase of a camera being followed by the purchase of a memory card.

66. __________ refers to the description and model regularities or trends for objects
whose behavior changes over time.

A. Outlier Analysis
B. Evolution Analysis
C. Prediction
D. Classification
Ans : B
Explanation: Evolution Analysis : Evolution analysis refers to the description and model
regularities or trends for objects whose behavior changes over time.

67. The first step involved in knowledge discovery is?


A. Data Integration
B. Data Selection
C. Data Transformation
D. Data Cleaning
Ans : D
Explanation: The first step involved in knowledge discovery is Data Cleaning.

68. In which step of Knowledge Discovery, multiple data sources are combined?
A. Data Cleaning
B. Data Integration
C. Data Selection
D. Data Transformation
Ans : B

69. The most commonly used algorithm to discover association rules by recursively
identifying frequent item sets
a. Apriori algorithm
b. Ordinal data
c. Nominal data
d. Categorical data

70. A process that uses statistical, mathematical, artificial intelligence, and machine-
learning techniques to extract and identify useful information and subsequent
knowledge from large databases.
a. RapidMiner
b. Gini index
c. Sequence mining
d. Data mining

71. A machine learning process that performs rule induction or a related procedure to
establish knowledge from large databases
a. Categorical data
b. K fold cross validation
c. Numeric data
d. Knowledge discovery in databases

72. Commonly co-occurring groupings of things. AKA market-basket analysis.


a. Associations
b. Ratio data
c. Prediction
d. Classification

73. A type of data that represents the numeric values of specific variables, for
example age, number of children, etc.
a. Ratio data
b. Numeric data
c. Nominal data
d. Interval data

74. The measure of how often products or services appear together in the same
transaction. The proportion of transactions in the dataset that contain all of the
products and/or services mentioned in a specific rule.
a. SEMMA
b. Entropy
c. Lift
d. Support
75. The act of telling about the future
a. Regression
b. Prediction
c. Classification
d. Associations
76. A metric that is used in economics to measure the diversity of a population
a. RapidMiner
b. Confidence
c. Gini index
d. Data mining
77. The splitting mechanism used in ID3
a. Ratio data
b. Associations
c. Interval data
d. Information gain
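
A minimal sketch of the information gain computation that ID3 uses to choose a split: the entropy of the class labels before the split minus the weighted entropy after it. The toy rows and the function names are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical data: "windy" separates the classes perfectly, "humid" not at all.
rows = [
    {"windy": "yes", "humid": "yes", "play": "no"},
    {"windy": "yes", "humid": "no", "play": "no"},
    {"windy": "no", "humid": "yes", "play": "yes"},
    {"windy": "no", "humid": "no", "play": "yes"},
]

print(information_gain(rows, "windy", "play"))  # 1.0
print(information_gain(rows, "humid", "play"))  # 0.0
```

ID3 would pick the attribute with the highest gain, here `windy`, as the test at the current node.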

78. Supervised induction used to analyze the historical data stored in a database and
to automatically generate a model that can predict future behavior
a. Classification
b. Associations
c. Clustering
d. Prediction

79. The number of iterations in Apriori ___________ Select one:
a. increases with the size of the data
b. decreases with the increase in size of the data
c. increases with the size of the maximum frequent set
d. decreases with increase in size of the maximum frequent set
Ans: c. increases with the size of the maximum frequent set

80. To determine association rules from frequent item sets Select one:
a. Only minimum confidence needed
b. Neither support nor confidence needed
c. Both minimum support and confidence are needed
d. Minimum support is needed
Feedback: Both minimum support and confidence are needed

81. If {A,B,C,D} is a frequent itemset, which candidate rule is not possible? Select one:
a. C –> A
b. D –> ABCD
c. A –> BC
d. B –> ADC
Feedback: D –> ABCD
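
As a sketch of why a rule whose consequent repeats its antecedent is not a valid candidate, the following enumerates the candidate rules X –> Y obtainable from a frequent itemset. It assumes the common convention that X and Y are non-empty and disjoint and that their union is any subset of the frequent itemset (every subset of a frequent itemset is itself frequent). The function name is illustrative.

```python
from itertools import combinations


def candidate_rules(itemset):
    """All (antecedent, consequent) pairs with disjoint, non-empty sides drawn from itemset."""
    items = sorted(itemset)
    rules = []
    for size in range(2, len(items) + 1):
        for subset in combinations(items, size):
            for k in range(1, size):
                for antecedent in combinations(subset, k):
                    consequent = frozenset(subset) - frozenset(antecedent)
                    rules.append((frozenset(antecedent), consequent))
    return rules


rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))                                         # 50
print((frozenset("C"), frozenset("A")) in rules)          # True
print((frozenset("D"), frozenset("ABCD")) in rules)       # False: D appears on both sides
```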
82. ______ is a random error or variance in a measured variable.
Ans: Noise

83. ______ routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
Ans: Data cleaning


84. ________ is used to refer to systems and technologies that provide the business
with the means for decision-makers to extract personalized meaningful
information about their business and industry.
Ans: Business Intelligence
85. In ______, each value in a bin is replaced by the mean value of
the bin.
Ans: Smoothing by bin means
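
A short sketch of smoothing by bin means, assuming the values are first sorted and partitioned into equal-depth bins. The function name, the bin size, and the sample values are illustrative.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of `bin_size`, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_values = ordered[i:i + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9, 9, 9, 22, 22, 22, 29, 29, 29]
```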
86. ______ regression involves finding the “best” line to fit two variables so that
one variable can be used to predict the other.
Ans: Linear
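
Finding the "best" line for two variables is commonly done by ordinary least squares. The sketch below assumes that criterion; the function name `fit_line` and the sample data are illustrative.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)        # 2.0 1.0
print(slope * 5 + intercept)   # predicted y for x = 5: 11.0
```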
87. _____ works to remove the noise from the data that includes techniques like
binning, clustering, and regression.
Ans: Smoothing
88. Redundancies can be detected by correlation analysis. (True/False)
Ans: True

89. The ______ technique uses encoding mechanisms to reduce the data set size.
Ans: Data compression
90. In which strategy of data reduction are redundant attributes detected?
A. Data cube aggregation
B. Numerosity reduction
C. Data compression
D. Dimension reduction
Ans: D. Dimension reduction
91. The _____ rule can be used to segment numeric data into relatively uniform,
“natural” intervals.
Ans: 3-4-5
92. Oracle, SQL Server, and DB2 are examples of _____________.
Ans: DBMS

93. Data Base Management System (DBMS) supports query languages. (True/False)
Ans: True
94. The _____ item sets find all sets of items (items sets) whose support is greater
than the user-specified minimum support, σ.
Ans: Frequent set
95. A frequent set is a _______ if it is a frequent set and no superset of this is a
frequent set.
Ans: Maximal frequent set
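
The definition in question 95 can be sketched directly: given a collection of frequent itemsets, keep those that have no frequent proper superset. The function name and the sample itemsets are illustrative.

```python
def maximal_frequent_sets(frequent_sets):
    """Return the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(s) for s in frequent_sets]
    # `s < other` is Python's proper-subset test for sets.
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical frequent itemsets found at some minimum support.
frequent = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}]
print(maximal_frequent_sets(frequent))
# the two itemsets {A, B} and {A, C}
```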
96. The ____________ technique is used to detect relationships or associations between
specific values of categorical variables in large data sets.
Ans: Association rule mining
97. A Decision Tree is a _____________ model.
Ans: Predictive model
98. Using a decision tree, only categorical variables would be modelled.
(True/False).
Ans: False
99. Clustering is an unsupervised learning method (True/False).
Ans: True
100. For a given transaction database T, a ____ is an expression of the form X
=> Y, where X and Y are subsets of the item set A and X => Y holds with confidence τ, if
τ% of the transactions in T that support X also support Y.
Ans: Association rule
101. The _______ rule describes associations between quantitative items or
attributes.
Ans: Quantitative association
102. The ____ step eliminates the extensions of (k-1)-itemsets, which are not
found to be frequent, from being considered for counting support.
Ans: Pruning
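
A simplified sketch of candidate generation with pruning in the style of Apriori. Instead of the usual pairwise join of (k-1)-itemsets, it enumerates every k-combination of the items that still occur and keeps only those whose (k-1)-subsets are all frequent; this is an assumption made for brevity and yields the same candidates after pruning. The function name is illustrative.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Candidate k-itemsets whose every (k-1)-subset is in `frequent_prev`."""
    frequent_prev = {frozenset(s) for s in frequent_prev}
    items = sorted(set().union(*frequent_prev))
    candidates = []
    for combo in combinations(items, k):
        # Prune: a k-itemset with any infrequent (k-1)-subset cannot be frequent.
        if all(frozenset(sub) in frequent_prev for sub in combinations(combo, k - 1)):
            candidates.append(frozenset(combo))
    return candidates


# Hypothetical frequent 2-itemsets.
frequent_2 = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}]
print(generate_candidates(frequent_2, 3))
# only {A, B, C} survives; {A, B, D} is pruned because {A, D} is not frequent
```

Only the surviving candidates need their support counted against the database.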
103. In the first phase of Partition algorithm, the algorithm logically divides the
database into a number of ______.
Ans: non-overlapping partitions.

104. The Apriori algorithm operates in a _________ and __________.
Ans: bottom-up, breadth-first search method.
A classification method is robust if it can handle noise and missing values.
105. The ___________ algorithm works like a train running over the data, with
stops at intervals M between transactions. When the train reaches the end of the
transaction file, it completes one pass.
Ans: DIC Algorithm
106. FP–Tree Growth Algorithm can be implemented in __________ Phases.
Ans: Two
107. FP – tree stands for ____________ .
Ans: Frequent pattern tree
108. Data mining systems should provide capabilities to mine association rules
at multiple levels of abstraction and traverse easily among different abstraction
spaces (True/False).
Ans: True
109. Which of the following are alternative search strategies for mining
multiple-level associations with reduced support?
a) Level-by-level independent
b) Level-cross filtering by a single item
c) Level-cross filtering by k-itemset
d) All the above
Ans: d) All the above
110. Which of the following is NOT a common binning strategy?
a) Equiwidth binning
b) Equidepth binning
c) Homogeneity-based binning
d) Equilength binning
Ans: d) Equilength binning
111. Association rules that involve two or more dimensions or predicates can be
referred to as _______.
Ans: Multidimensional association rules.
112. An algorithm that performs a series of “walks” through itemset space is
called a _________.
Ans: Random walk algorithm.

113. What are knowledge type constraints?


Ans: They specify the type of knowledge to be mined.
114. A standard measure of within-cluster similarity is _______.
Ans: variance
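
As an illustration of variance as a within-cluster similarity measure, the sketch below computes the mean squared Euclidean distance of a cluster's points to their centroid. The function name and the sample points are illustrative; texts differ on whether they report the mean or the sum of squared distances.

```python
def within_cluster_variance(points):
    """Mean squared Euclidean distance from each point to the cluster centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points) / n


tight = [(1, 1), (1, 2), (2, 1), (2, 2)]
loose = [(0, 0), (0, 10), (10, 0), (10, 10)]

print(within_cluster_variance(tight))  # 0.5
print(within_cluster_variance(loose))  # 50.0
```

A lower value means the members of the cluster are more similar to one another.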
115. The process of grouping a set of physical or abstract objects into classes of
similar objects is called clustering. (True/False)
Ans: True
116. Data classification is a two-step process. (True/False)
Ans: True
117. Prediction can be viewed as the construction and use of a model to assess
the class of an unlabeled sample, or to assess the value or value ranges of an
attribute that a given sample is likely to have.
Ans: True
118. Pre-processing of data includes removing or reducing noise (by applying smoothing
techniques) and treating missing values. (True/False)
Ans: True
119. Scalability refers to the ability to construct the model efficiently
given a large amount of data. (True/False)
120. The basic algorithm for decision tree induction is a greedy algorithm. (True/False)
121. The information gain measure is used to select the test attribute at each
node in the tree. (True/False)
122. Data mining can be used to help predict future patient behaviour and to
improve treatment programs (True/False).
Ans: True

123. Data mining can be used to improve ___________.

a) Efficiency
b) Quality of data
c) Marketing
d) All the above
Ans: d) All the above
124. To improve accuracy, data mining programs are used to analyze audit data
and extract features that can distinguish normal activities from intrusions.
(True/False)
Ans: True
125. Patient Rule Induction Method (PRIM) and Weighted Item Sets (WIS) are
types of association rule technique.
(True/False)
