CS 8031 Data Mining and Data Warehousing Tutorial
B.I.T., Mesra
Sub.: CS 8031
BE_VII>DMDW.M04
Tutorial Sheet-I
Module - 1
1. What is data mining? In your answer, address the following:
   a) Is it another hype?
   b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
   c) Explain how the evolution of database technology led to data mining.
   d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
2.
3.
4.
5. Define each of the following data mining functionalities: characterization, discrimination, association, classification, prediction, clustering, and evolution analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.
6.
7.
8.
9. … regarding data mining
Module - 2
10. State why, for the integration of multiple heterogeneous information
sources, many companies in industry prefer the update-driven approach
(which constructs and uses data warehouses), rather than the query-driven approach (which applies wrappers and integrators). Describe situations
where the query-driven approach is preferable over the update-driven
approach.
11. Briefly compare the following concepts. You may use an example to explain your point(s).
303166908.doc
13.
14.
15.
16.
17.
18. What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).
Module - 3
19.
20. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.
21. Suppose that the data for analysis include the attribute age. The values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
    a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
    b) How might you determine outliers in the data?
    c) What other methods are there for data smoothing?
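As a quick way to check an answer to 21(a), here is a small Python sketch (not part of the original sheet; the helper name smooth_by_bin_means is illustrative). It partitions the sorted values into equal-depth bins and replaces each value by its bin mean:

```python
# Illustrative sketch: smoothing by bin means with a bin depth of 3.

def smooth_by_bin_means(values, depth):
    """Partition sorted values into equal-depth bins and replace each
    value by the mean of its bin."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
print([round(v, 2) for v in smooth_by_bin_means(age, 3)])
```

The first bin (13, 15, 16) becomes 14.67 three times, and so on; smoothing dampens the small fluctuations but the one extreme value (70) still pulls up the last bin's mean.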
22.
23.
24. List and describe the 5 primitives for specifying a data mining task.
25. Describe why concept hierarchies are useful in data mining.
Module - 4
26. The four major types of concept hierarchies are: schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies.
    a)
    b)
27. … presented
Tutorial Sheet-II
Module - 4
28. Discuss the importance of establishing a standardized data mining query language. What are some of the potential benefits and challenges involved in such a task? List a few of the recent proposals in this area.
29. Describe the differences between the following architectures for the integration of a data mining system with a database or data warehouse system: no coupling, loose coupling, semi-tight coupling, and tight coupling. State which architecture you think is most popular, and why.
Module - 1
30. (a) What is a relational database?
    (b) What is a transactional database?
    (c) What is online analytical processing (OLAP)?
Module - 2
31. A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge yet very sparse multidimensional matrix.
    (a) Present an example illustrating such a huge and sparse data cube.
    (b) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that you need to explain your data structures in detail, and discuss the space needed as well as how to retrieve data from your structures.
    (c) Modify your design in (b) to handle incremental data updates. Give the reasoning behind your new design.
Module - 3
32. Use a flowchart to summarize the following procedures for attribute subset selection:
    (a) Stepwise forward selection.
    (b) Stepwise backward elimination.
    (c) A combination of forward selection and backward elimination.
Module - 5
(33) For class characterization, what are the major differences between a data cube-based implementation and a relational implementation such as attribute-oriented induction? Discuss which method is most efficient and under what conditions this is so.
(34) Suppose that the following table is derived by attribute-oriented induction.

     class        birth_place   count
     Programmer   Canada        180
     Programmer   others        120
     DBA          Canada        20
     DBA          others        80

(a) Transform the table into a crosstab, showing the associated t-weights and d-weights.
(b) Map the class Programmer into a quantitative descriptive rule, for example,
    ∀X, Programmer(X) ⇒ (birth_place(X) = "Canada" ∧ …) [t: x%, d: y%] ∨ (…) [t: w%, d: z%]
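For part (a) above, the t-weight of a (class, value) cell is its count divided by the total count of its class, and the d-weight is its count divided by the total count of that value across all classes. A short Python sketch (illustrative only, not part of the sheet) computes both for the table in exercise 34:

```python
# Illustrative computation of t-weights and d-weights for the
# generalized relation in exercise 34 (counts taken from the table).

rows = [  # (class, birth_place, count)
    ("Programmer", "Canada", 180),
    ("Programmer", "others", 120),
    ("DBA",        "Canada", 20),
    ("DBA",        "others", 80),
]

class_totals = {}
value_totals = {}
for cls, val, cnt in rows:
    class_totals[cls] = class_totals.get(cls, 0) + cnt
    value_totals[val] = value_totals.get(val, 0) + cnt

weights = {}
for cls, val, cnt in rows:
    t = cnt / class_totals[cls]   # t-weight: share within the class
    d = cnt / value_totals[val]   # d-weight: share within the value
    weights[(cls, val)] = (round(t, 2), round(d, 2))

print(weights)
```

For instance, (Programmer, Canada) gets t-weight 180/300 = 60% and d-weight 180/200 = 90%.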
(35) Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality.
(c) What is the midrange of the data?
(d) Can you find the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile plot different from a quantile-quantile (q-q) plot?
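A quick self-check for parts (a)-(e), using only the Python standard library (a sketch, not part of the sheet; it assumes the value list exactly as printed in exercise 35):

```python
# Descriptive statistics for the age data of exercise 35.
import statistics

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(age)
median = statistics.median(age)
mode = statistics.mode(age)                  # most frequent value
midrange = (min(age) + max(age)) / 2
q1, q2, q3 = statistics.quantiles(age, n=4)  # quartiles (default method)
five_number = (min(age), q1, median, q3, max(age))

print(round(mean, 2), median, mode, midrange)
print(five_number)
```

Note that `statistics.quantiles` supports both 'exclusive' (the default) and 'inclusive' interpolation methods, so hand-computed quartiles may differ slightly depending on the convention used in class.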
Module - 6
(36) The Apriori algorithm makes use of prior knowledge of subset support properties.
(a) Prove that all non-empty subsets of a frequent itemset must also be frequent.
(b) Prove that the support of any non-empty subset s' of an itemset s must be at least as great as the support of s.
(c) Given frequent itemset l and subset s of l, prove that the confidence of the rule "s' ⇒ (l − s')" cannot be more than the confidence of "s ⇒ (l − s)", where s' is a subset of s.
(d) A partitioning variation of Apriori subdivides the transactions of a database D into n non-overlapping partitions. Prove that any itemset that is frequent in D must be frequent in at least one partition of D.
(37) A database has four transactions. Let min_sup = 60% and min_conf = 80%.

     TID    date       items_bought
     T100   10/15/99   {K, A, D, B}
     T200   10/15/99   {D, A, C, E, B}
     T300   10/15/99   {C, A, B, E}
     T400   10/15/99   {B, A, D}

(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers and item_i denotes variables representing items (e.g., "A", "B", etc.):
    ∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
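To check an answer to 37(a), here is a minimal Apriori sketch in Python (an illustration, not part of the sheet; it omits the full subset-pruning step, which this tiny database does not need). With four transactions, min_sup = 60% means a minimum count of 3:

```python
# Minimal Apriori sketch for the four transactions of exercise 37.
from itertools import combinations

transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
min_count = 3  # 60% of 4 transactions, rounded up

def apriori(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    k, candidates = 1, [frozenset([i]) for i in items]
    while candidates:
        # count support of each candidate itemset
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # join frequent k-itemsets into (k+1)-itemset candidates
        keys = list(level)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

freq = apriori(transactions, min_count)
for itemset, count in sorted(freq.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
```

The run yields seven frequent itemsets: {A}, {B}, {D}, {A,B}, {A,D}, {B,D}, and {A,B,D}.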
(38) Suppose that frequent itemsets are saved for a large transaction database, DB. Discuss how to efficiently mine the (global) association rules under the same minimum support threshold if a set of new transactions, denoted as ΔDB, is (incrementally) added in.
(39) Propose and outline a level-shared mining approach to mining multilevel association rules in which each item is encoded by its level position, and an initial scan of the database collects the count for each item at each concept level, identifying frequent and subfrequent items. Comment on the processing cost of mining multilevel associations with this method in comparison to mining single-level associations.
(40) When mining cross-level association rules, suppose it is found that the itemset {IBM desktop computer, printer} does not satisfy minimum support. Can this information be used to prune the mining of a "descendant" itemset such as {IBM desktop computer, b/w printer}? Give a general rule explaining how this information may be used for pruning the search space.
(41) Prove that each entry in the following table correctly characterizes its corresponding rule constraint for frequent itemset mining.

     Rule constraint        Antimonotone    Monotone       Succinct
     (a) v ∈ S              no              yes            yes
     (b) S ⊆ V              yes             no             yes
     (c) min(S) ≤ v         no              yes            yes
     (d) range(S) ≤ v       yes             no             no
     (e) variance(S) ≤ v    convertible     convertible    no
Module - 7
(42) Briefly outline the major steps of decision tree classification.
(43) Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of samples to evaluate pruning?
(44) The following table shows the midterm and final exam grades obtained for students in a database course.

     X (midterm exam)   Y (final exam)
     72                 84
     50                 63
     81                 77
     74                 78
     94                 90
     86                 75
     59                 49
     83                 79
     65                 77
     33                 52
     88                 74
     81                 90

(a) Plot the data. Do X and Y seem to have a linear relationship?
(b) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's midterm grade in the course.
(c) Predict the final exam grade of a student who received an 86 on the midterm exam.
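A short Python sketch for checking 44(b) and 44(c) (illustrative, not part of the sheet), using the standard closed-form least-squares estimates for y = a + b·x:

```python
# Least-squares fit of final exam grade (y) on midterm grade (x),
# for the twelve students of exercise 44.
x = [72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81]
y = [84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
# slope b = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2)
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

def predict(midterm):
    return a + b * midterm

print(f"y = {a:.2f} + {b:.2f}x")
print(f"predicted final for midterm 86: {predict(86):.1f}")
```

The fit comes out to roughly y = 32.03 + 0.58x, predicting a final exam grade of about 82 for a midterm grade of 86.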
(45) What is boosting? State why it may improve the accuracy of decision tree induction.
(46) Show that accuracy is a function of sensitivity and specificity, that is, prove:

     accuracy = sensitivity × (pos / (pos + neg)) + specificity × (neg / (pos + neg))
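A numeric sanity check of the identity in exercise 46 (the counts below are made up for illustration, not from the sheet): with pos positive and neg negative samples, sensitivity = t_pos/pos, specificity = t_neg/neg, and accuracy = (t_pos + t_neg)/(pos + neg), so both sides must agree.

```python
# Check that accuracy equals the sensitivity/specificity combination
# of exercise 46 on example counts (illustrative values).
t_pos, pos = 50, 60   # true positives among all positives
t_neg, neg = 30, 40   # true negatives among all negatives

sensitivity = t_pos / pos
specificity = t_neg / neg
accuracy = (t_pos + t_neg) / (pos + neg)

rhs = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(accuracy, rhs)
```

The check works for any counts because sensitivity × pos recovers t_pos and specificity × neg recovers t_neg, which is the substance of the requested proof.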
Module - 8
(47) Briefly outline how to compute the dissimilarity between objects described by the following types of variables:
(a) Asymmetric binary variables.
(b) Nominal variables.
(c) Ratio-scaled variables.
(d) Numerical variables.
(48) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
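The three distances of exercise 48 can be checked with a few lines of Python (a sketch, not part of the sheet):

```python
# Euclidean, Manhattan, and Minkowski (q = 3) distances between the
# two tuples of exercise 48.
p = (22, 1, 42, 10)
q = (20, 0, 36, 8)

euclidean = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
manhattan = sum(abs(a - b) for a, b in zip(p, q))
minkowski3 = sum(abs(a - b) ** 3 for a, b in zip(p, q)) ** (1 / 3)

print(round(euclidean, 3), manhattan, round(minkowski3, 3))
```

With component differences (2, 1, 6, 2), this gives Euclidean distance √45 ≈ 6.708, Manhattan distance 11, and Minkowski distance 233^(1/3) ≈ 6.15.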
(49) Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into three clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
The distance function is Euclidean distance. Suppose we initially assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only:
(a) The three cluster centers after the first round of execution, and
(b) The final three clusters.
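A compact k-means sketch in Python for checking exercise 49 (illustrative, not part of the sheet; it assumes no cluster ever becomes empty, which holds for this data and these initial centers):

```python
# K-means for the eight points of exercise 49, with A1, B1, C1 as
# initial cluster centers and squared Euclidean distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}

def kmeans(points, centers):
    while True:
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for name, (x, y) in points.items():
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[d.index(min(d))].append(name)
        # update step: recompute each center as its cluster's mean
        new_centers = [
            (sum(points[n][0] for n in c) / len(c),
             sum(points[n][1] for n in c) / len(c))
            for c in clusters]
        if new_centers == centers:     # converged
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans(points, [(2, 10), (5, 8), (1, 2)])
print(centers)
print(clusters)
```

After the first round the centers move to (2, 10), (6, 6), and (1.5, 3.5); the loop then converges to the final clusters {A1, B1, C2}, {A3, B2, B3}, and {A2, C1}.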
(50) Data cubes and multidimensional databases contain categorical, ordinal, and numerical data in hierarchical or aggregate forms. Based on what you have learned about clustering methods, design a clustering method that finds clusters in large data cubes effectively and efficiently.
Module - 9
(51) Suppose that a chain restaurant would like to mine customers' consumption behavior related to major sport events, such as "Every time there is a Canucks hockey game on TV, the sales of Kentucky Fried Chicken will go up 20% one hour before the match."
(a) Describe a method to find such patterns efficiently.
(b) Most time-related association mining algorithms use Apriori-like algorithms to mine such patterns. An alternative, database-projection-based frequent pattern (FP) growth method is efficient at mining frequent itemsets. Can you extend the FP-growth method to find such time-related patterns efficiently?
(52) Suppose that a power station stores data about power consumption levels by time and by region, and power usage information per customer in each region. Discuss how to solve the following problems in such a time-series database.
(a) Find similar power consumption curve fragments for a given region on Fridays.
(b) Every time a power consumption curve rises sharply, what may happen within 20 minutes?
(c) How can we find the most influential features that distinguish a stable power consumption region from an unstable one?