Data Mining PDF
ODS (operational data store): To create an ODS, you periodically copy data into it from
the live OLTP system for use by reporting tools. An ODS can be an integration point or a real
operational system, but it is not enough for full enterprise processing.
Difference between ODS and data warehouse:
– OLTP is the major task of a traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration,
accounting, etc.
– The time horizon for the data warehouse is significantly longer than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
– Every key structure in the data warehouse contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain a "time element"
1b. What is ETL? Explain the steps in ETL (04 Marks) Jan 2015, (07 Marks)
Jan 2014, (06 Marks) June/ July 2015, (04 Marks) June/ July 2016
The process of extracting data from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL
refers to a broad process, and not three well-defined steps. The acronym ETL is perhaps too
simplistic, because it omits the transportation phase and implies that each of the other phases of
the process is distinct. Nevertheless, the entire process is known as ETL.
The methodology and tasks of ETL have been well known for many years, and are not
necessarily unique to data warehouse environments: a wide variety of proprietary applications
and database systems are the IT backbone of any enterprise. Data has to be shared between
applications or systems, trying to integrate them, giving at least two applications the same picture
of the world. This data sharing was mostly addressed by mechanisms similar to what we now
call ETL.
What happens during the ETL process? The following tasks are the main actions in the process.
Extraction of Data
During extraction, the desired data is identified and extracted from many different sources,
including database systems and applications. Very often, it is not possible to identify the specific
subset of interest, therefore more data than necessary has to be extracted, so the identification of
the relevant data will be done at a later point in time. Depending on the source system's
capabilities (for example, operating system resources), some transformations may take place
during this extraction process. The size of the extracted data varies from hundreds of kilobytes
up to gigabytes, depending on the source system and the business situation. The same is true for
the time delta between two (logically) identical extractions: the time span may vary between
days/hours and minutes to near real-time. Web server log files, for example, can easily grow to
hundreds of megabytes in a very short period of time.
Transportation of Data
The emphasis in many of the examples in this section is scalability. Many long-time users of
Oracle Database are experts in programming complex data transformation logic using
PL/SQL. The emphasis here is on implementations that take advantage of Oracle's new SQL
functionality, especially for ETL and the parallel query infrastructure.
1c. What are the guidelines for implementing the data warehouse?
(05 Marks) Jan 2014, (08 Marks) Jan 2015
• The limited reporting in the source systems
• The desire to use a better and more powerful reporting tool than what the source systems
offer
• Only a few people have the security to access the source systems and you want to allow
others to generate reports
• A company owns many retail stores, each of which tracks orders in its own database, and
you want to consolidate the databases to get real-time inventory levels throughout the day
• You need to gather data from various source systems to get a true picture of a customer
so you have the latest info if the customer calls customer service: customer data such as
customer info, support history, call logs, and order info. Or medical data to get a true
picture of a patient so the doctor has the latest info throughout the day: outpatient
department records, hospitalization records, diagnostic records, and pharmaceutical
purchase records
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data warehouse systems are
valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms
have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that
with competition mounting in every industry, data warehousing is the latest must-have marketing
weapon - a way to retain customers by learning more about their needs.
2a. Explain ODS (Operational Data Store) and its structure with a neat figure.
(07 Marks) June/ July 2014, (06 Marks) June/ July 2015
Data warehouses have been defined in many ways, making it difficult to formulate a rigorous
definition. Loosely speaking, a data warehouse refers to a database that is maintained separately
from an organization’s operational databases. Data warehouse systems allow for the integration
of a variety of application systems. They support information processing by providing a solid
platform of consolidated historical data for analysis.
• An ODS is targeted for the lowest granular queries whereas a data warehouse is usually
used for complex queries against summary-level or aggregated data
• An ODS is meant for operational reporting and supports current or near real-time
reporting requirements whereas a data warehouse is meant for historical and trend
analysis reporting, usually on a large volume of data
• An ODS contains only a short window of data, while a data warehouse contains the entire
history of data
• An ODS provides information for operational and tactical decisions on current or near
real-time data while a data warehouse delivers feedback for strategic decisions leading to
overall system improvements
• In an ODS the frequency of data load could be every few minutes or hourly whereas in a
data warehouse the frequency of data loads could be daily, weekly, monthly or quarterly
4. Discuss the benefits of implementing a data warehouse (04 Marks) June/ July 2016
• Historical Intelligence
Unit -2
1b. Explain the operation of data cube with suitable examples. (08 Marks) Jan 2014
Users of decision support systems often see data in the form of data cubes. The cube is used to
represent data along some measure of interest. Although called a "cube", it can be 2-dimensional,
3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database
and the cells in the data cube represent the measure of interest. For example, they could contain a
count for the number of times that attribute combination occurs in the database, or the minimum,
maximum, sum or average value of some attribute. Queries are performed on the cube to retrieve
decision support information.
Example: We have a database that contains transaction information relating company sales of a
part to a customer at a store location. The data cube formed from this database is a 3-dimensional
representation, with each cell (p,c,s) of the cube representing a combination of values from part,
customer and store-location. A sample data cube for this combination is shown in Figure 1. The
contents of each cell is the count of the number of times that specific combination of values
occurs together in the database. Cells that appear blank in fact have a value of zero. The cube can
then be used to retrieve information within the database about, for example, which store should
be given a certain part to sell in order to make the greatest sales.
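The count cube described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions, the identifiers `p1`, `c1`, `s1`, and the helper names `cell`, `sales_by_store` and `best_store` are hypothetical and do not come from Figure 1.

```python
from collections import Counter

# Hypothetical sales transactions: (part, customer, store_location).
transactions = [
    ("p1", "c1", "s1"),
    ("p1", "c1", "s1"),
    ("p1", "c2", "s2"),
    ("p2", "c1", "s1"),
    ("p2", "c3", "s2"),
    ("p2", "c3", "s2"),
    ("p2", "c3", "s2"),
]

# The cube maps each cell (p, c, s) to the number of times that
# combination occurs; combinations that never occur read as zero.
cube = Counter(transactions)


def cell(part, customer, store):
    """Return the count stored in cell (p, c, s); blank cells are 0."""
    return cube[(part, customer, store)]


def sales_by_store(part):
    """Aggregate over the customer dimension: count per store for one part."""
    totals = Counter()
    for (p, _customer, s), count in cube.items():
        if p == part:
            totals[s] += count
    return totals


def best_store(part):
    """Store location with the highest count for the given part."""
    return sales_by_store(part).most_common(1)[0][0]


print(cell("p1", "c1", "s1"), cell("p1", "c3", "s1"), best_store("p2"))
```

The last query mirrors the question in the text: which store should be given a certain part in order to make the greatest sales.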
OLAP (On-line Analytical Processing) deals with historical or archival data. OLAP is
characterized by a relatively low volume of transactions. Queries are often very complex and
involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP
applications are widely used together with data mining techniques. An OLAP database holds
aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
Sometimes a query needs to access a large amount of data in management records, for example:
what was the profit of your company last year?
3a. Why multidimensional views of data and data cubes are used? With a neat diagram,
explain data cube implementations. (10 Marks) Jan 2015, (06 Marks) June/ July 2016
The multidimensional view of data is in some ways a natural view of any enterprise for managers.
The triangle diagram in the figure shows that as we go higher in the triangle hierarchy, the
manager's need for detailed information declines.
1. Precompute and store all: This means that millions of aggregates will need to be computed and
stored. Although this is the best solution as far as query response time is concerned, the solution
is impractical since the resources required to compute the aggregates and to store them will be
prohibitively large for a large data cube. Indexing large amounts of data is also expensive.
2. Precompute (and store) none: This means that the aggregates are computed on the fly using
the raw data whenever a query is posed. This approach does not require additional space for
storing the cube but the query response time is likely to be very poor for large data cubes.
3. Precompute and store some: This means that we precompute and store the most frequently
queried aggregates and compute others as the need arises. Some of the remaining aggregates
may be derived from the precomputed aggregates, and it will be necessary to access the database
(e.g. the data warehouse) to compute the rest. The more aggregates we are able to
precompute, the better the query performance.
3b. What are data cube operations? Explain. (10 Marks) Jan 2015 (4 Marks) Jan 2016
Roll-up:
Takes the current aggregation level of fact values and does a further aggregation on one
or more of the dimensions.
Equivalent to doing GROUP BY on this dimension by using the attribute hierarchy.
Decreases the number of dimensions - removes row headers.
Drill-down:
Opposite of roll-up.
Summarizes data at a lower level of a dimension hierarchy, thereby viewing data at a
more specialized level within a dimension.
Increases the number of dimensions - adds new headers.
Slice and dice:
Sets one or more dimensions to specific values and keeps a subset of dimensions for selected
values.
Pivot:
Rotates the data axis to view the data from different perspectives.
Groups data with different dimensions.
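The cube operations listed here can be illustrated on a tiny in-memory cube. The sketch below is a simplified model, not the behaviour of any particular OLAP product: the cube is a plain dictionary, the dimension names and cell values are hypothetical, and roll-up is shown only in its simplest form (aggregating one dimension away entirely rather than climbing one level of a hierarchy).

```python
from collections import defaultdict

DIMENSIONS = ("part", "customer", "store")  # hypothetical dimension names

# Hypothetical cube: (part, customer, store) -> sales count.
cube = {
    ("p1", "c1", "s1"): 2,
    ("p1", "c2", "s2"): 1,
    ("p2", "c1", "s1"): 1,
    ("p2", "c3", "s2"): 3,
}


def roll_up(cube, dims, drop):
    """Aggregate away dimension `drop`, like GROUP BY on the remaining ones."""
    keep = [i for i, d in enumerate(dims) if d != drop]
    result = defaultdict(int)
    for key, value in cube.items():
        result[tuple(key[i] for i in keep)] += value
    return dict(result), tuple(dims[i] for i in keep)


def slice_cube(cube, dims, dim, value):
    """Fix one dimension to a single value and keep the other dimensions."""
    idx = dims.index(dim)
    keep = [i for i in range(len(dims)) if i != idx]
    result = {
        tuple(k[i] for i in keep): v for k, v in cube.items() if k[idx] == value
    }
    return result, tuple(dims[i] for i in keep)


def dice(cube, dims, allowed):
    """Keep the sub-cube whose coordinates fall inside the allowed value sets."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[dims.index(d)] in values for d, values in allowed.items())
    }


def pivot(cube, dims, new_order):
    """Reorder the axes; the cell values are unchanged."""
    perm = [dims.index(d) for d in new_order]
    return {tuple(k[i] for i in perm): v for k, v in cube.items()}, tuple(new_order)


by_part_store, _ = roll_up(cube, DIMENSIONS, "customer")
print(by_part_store)
```

Drill-down is not shown because it needs the finer-grained data to be available; in this model it amounts to going back from `by_part_store` to the original `cube`.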
the number of users. Full pre-computation of aggregates helps but is often not practical due to
the large number of aggregates. One approach is to pre-compute the most commonly queried
aggregates and compute the remaining on-the-fly.
Analytic: An OLAP system must provide rich analytic functionality and it is expected that most
OLAP queries can be answered without any programming. The system should be able to cope
with any relevant queries for the application and the user. Often the analysis will be done using
the vendor's own tools, although OLAP software capabilities differ widely between products in
the market.
Shared: An OLAP system is a shared resource, although it is unlikely to be shared by hundreds of
users. An OLAP system is likely to be accessed only by a select group of managers and may be
used merely by dozens of users. Being a shared system, an OLAP system should provide
adequate security for confidentiality as well as integrity.
Multidimensional: This is the basic requirement. Whatever OLAP software is being used, it
must provide a multidimensional conceptual view of the data. It is because of the
multidimensional view of data that we often refer to the data as a cube. A dimension often has
hierarchies that show parent / child relationships between the members of a dimension. The
multidimensional structure should allow such hierarchies.
Information: OLAP systems usually obtain information from a data warehouse. The system
should be able to handle a large amount of input data. The capacity of an OLAP system to handle
information and its integration with the data warehouse may be critical.
4b. Explain Codd's OLAP rules. (10 Marks) June/ July 2015
2. Accessibility (OLAP as a mediator): The OLAP software should be sitting between data
sources (e.g. data warehouse) and an OLAP front-end.
3. Batch extraction vs. interpretive: An OLAP system should provide multidimensional data
staging plus pre calculation of aggregates in large multidimensional databases.
4. Multi-user support: Since the OLAP system is shared, the OLAP software should provide
many normal database operations including retrieval, update, concurrency control, integrity and
security.
5. Storing OLAP results: OLAP results data should be kept separate from source data. Read-
write OLAP applications should not be implemented directly on live transaction data if OLAP
source systems are supplying information to the OLAP system directly.
6. Extraction of missing values: The OLAP system should distinguish missing values from zero
values. A large data cube may have a large number of zeros as well as some missing values. If a
distinction is not made between zero values and missing values, the aggregates are likely to be
computed incorrectly.
7. Treatment of missing values: An OLAP system should ignore all missing values regardless of
their source. Correct aggregate values will be computed once the missing values are ignored.
9. Generic dimensionality: An OLAP system should treat each dimension as equivalent in both
its structure and operational capabilities. Additional operational capabilities may be granted to
selected dimensions but such additional functions should be grantable to any dimension.
10. Unlimited dimensions and aggregation levels: An OLAP system should allow unlimited
dimensions and aggregation levels. In practice, the number of dimensions is rarely more than 10
and the number of hierarchies rarely more than six.
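Rules 6 and 7 (distinguishing missing values from zeros, and ignoring missing values when aggregating) can be illustrated with a short sketch. The cell values are hypothetical, and representing a missing cell as `None` is an assumption made for the example.

```python
def average(cells):
    """Average over recorded cells only; None marks a missing value, 0 is real."""
    recorded = [v for v in cells if v is not None]
    return sum(recorded) / len(recorded) if recorded else None


def average_treating_missing_as_zero(cells):
    """The incorrect variant: a missing value is silently counted as zero."""
    return sum(0 if v is None else v for v in cells) / len(cells)


sales = [10, 0, None, 20]  # hypothetical cell values; one cell is missing

print(average(sales))                           # 10.0
print(average_treating_missing_as_zero(sales))  # 7.5
```

The genuine zero still contributes to the average, while the missing cell is left out; conflating the two changes the aggregate.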
Unit -3
While large-scale information technology has been evolving separate transaction and analytical
systems, data mining provides the link between the two. Data mining software analyzes
relationships and patterns in stored transaction data based on open-ended user queries. Several
types of analytical software are available: statistical, machine learning, and neural networks.
Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a
restaurant chain could mine customer purchase data to determine when customers visit
and what they typically order. This information could be used to increase traffic by
having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer
preferences. For example, data can be mined to identify market segments or consumer
affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an
example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack being
purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
– duplicate much or all of the information contained in one or more other attributes
– Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
– contain no information that is useful for the data mining task at hand
– Example: students' ID is often irrelevant to the task of predicting students' GPA
Feature Subset Selection techniques
– Brute-force approach:
Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches:
Features are selected before data mining algorithm is run
– Wrapper approaches:
Use the data mining algorithm as a black box to find best subset of attributes
Feature Creation
Create new attributes that can capture the important information in a data set much more
efficiently than the original attributes
– Feature Extraction
– Mapping Data to New Space
– Feature Construction: combining features
2a. Explain four types of attributes with statistical operations and examples.
(06 Marks) June/ July 2014
There are four types of attributes:
– Nominal
Examples: ID numbers, eye color, zip codes
– Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall,
medium, short}
– Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
Examples: temperature in Kelvin, length, time, counts
2b. Two binary vectors are given below: (04 Marks) June/ July 2014
X = (1,0,0,0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
Calculate (i) SMC (ii) Jaccard similarity coefficient and Hamming distance.
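A short sketch can be used to check the hand calculation for these two vectors. It assumes the usual definitions: SMC = (f11 + f00) / (f11 + f10 + f01 + f00), Jaccard = f11 / (f11 + f10 + f01), and Hamming distance = number of positions in which the vectors differ. The function names are illustrative only.

```python
def binary_counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))
    f10 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
    f01 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
    f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
    return f11, f10, f01, f00


def smc(x, y):
    """Simple matching coefficient: matches (1-1 and 0-0) over all positions."""
    f11, f10, f01, f00 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches over positions that are not 0-0."""
    f11, f10, f01, _ = binary_counts(x, y)
    denom = f11 + f10 + f01
    return f11 / denom if denom else 0.0


def hamming(x, y):
    """Number of positions in which the two vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


X = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(X, Y), jaccard(X, Y), hamming(X, Y))  # 0.7 0.0 3
```

Here f11 = 0, f10 = 1, f01 = 2 and f00 = 7, so SMC = 7/10, Jaccard = 0/3 and the Hamming distance is 3.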
Automated prediction of trends and behaviors: Data mining automates the process of finding
predictive information in a large database. Questions that traditionally required extensive hands-
on analysis can now be directly answered from the data. A typical example of a predictive
problem is targeted marketing. Data mining uses data on past promotional mailings to identify
the targets most likely to maximize return on investment in future mailings. Other predictive
problems include forecasting bankruptcy and other forms of default, and identifying segments of
a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns: Data mining tools sweep through
databases and identify previously hidden patterns. An example of pattern discovery is the
analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.
Using massively parallel computers, companies dig through volumes of data to discover patterns
about their customers and products. For example, grocery chains have found that when men go
to a supermarket to buy diapers, they sometimes walk out with a six-pack of beer as well. Using
that information, it's possible to lay out a store so that these items are closer.
AT&T, A.C. Nielson, and American Express are among the growing ranks of companies
implementing data mining techniques for sales and marketing. These systems are crunching
through terabytes of point-of-sale data to aid analysts in understanding consumer behavior and
promotional strategies. Why? To gain a competitive advantage and increase profitability!
Similarly, financial analysts are plowing through vast sets of financial records, data feeds, and
other information sources in order to make investment decisions. Health-care organizations are
examining medical records to understand trends of the past so they can reduce costs in the future.
• Remove outliers
– Find and remove those values that are significantly different from the others
• Resolve conflicts
– Merge information from different data sources
– Find duplicate records and identify the correct record
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Data transformation
• Data reduction
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
– E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
– equipment malfunction
• Hierarchical clustering is often performed but tends to define partitions of data sets rather
than "clusters"
• Hierarchical aggregation
– An index tree hierarchically divides a data set into partitions by value range of
some attributes
– Thus an index tree with aggregates stored at each node is a hierarchical histogram
i) Euclidean distance
5. For the following vectors X & Y, calculate the Cosine, Correlation, Euclidean and
Jaccard similarity. X = (1, 1, 0, 1, 0, 1) ; Y = (1, 1, 1, 0, 0, 1). (10 Marks) June/ July 2016
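The four measures asked for here can be checked with a minimal sketch, assuming the standard definitions (cosine = dot product over the product of norms, Pearson correlation, Euclidean distance, and the binary Jaccard coefficient). The function names are illustrative.

```python
import math


def cosine(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def correlation(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard(x, y):
    """1-1 matches over the positions where at least one vector has a 1."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    nonzero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return f11 / nonzero if nonzero else 0.0


X = (1, 1, 0, 1, 0, 1)
Y = (1, 1, 1, 0, 0, 1)
for name, fn in [("cosine", cosine), ("correlation", correlation),
                 ("euclidean", euclidean), ("jaccard", jaccard)]:
    print(name, round(fn(X, Y), 4))
```

By hand: the dot product is 3 and both vectors have length 2, so the cosine is 3/4; the vectors differ in two positions, so the Euclidean distance is the square root of 2; f11 = 3 out of 5 non-zero positions gives a Jaccard coefficient of 3/5; the correlation works out to 1/4.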
Unit -4
For example: "70% of customers who purchase 2% milk will also purchase whole wheat bread."
Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. most frequent combinations of items) Most frequently used
algorithm: Apriori algorithm.
2. Generate association rules for the above itemsets.
How to measure the strength of an association rule?
1. Using support/confidence
2. Using dependence framework
Support shows the frequency of the patterns in the rule; it is the percentage of transactions that
contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions).
Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B
if they contain A, i.e. Confidence = Probability(B if A) = P(B|A)
Confidence = (# of transactions involving A and B) / (total number of transactions that have A).
Customer   Item purchased   Item purchased
1          pizza            beer
2          salad            soda
3          pizza            soda
4          salad            tea
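The support and confidence formulas can be applied to the four-customer table with a minimal sketch. The function names are illustrative, and each transaction is modelled as a Python set.

```python
# One set of purchased items per customer, taken from the table.
transactions = [
    {"pizza", "beer"},
    {"salad", "soda"},
    {"pizza", "soda"},
    {"salad", "tea"},
]


def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(a, b):
    """Fraction of the transactions containing A that also contain B."""
    with_a = sum(1 for t in transactions if a <= t)
    with_both = sum(1 for t in transactions if (a | b) <= t)
    return with_both / with_a


print(support({"pizza", "beer"}))       # 0.25
print(confidence({"pizza"}, {"beer"}))  # 0.5
```

For the rule pizza => beer, one of the four transactions contains both items (support 1/4), and one of the two pizza transactions also contains beer (confidence 1/2).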
sometimes we should not ignore them, for example, if the transaction is valuable and generates a
large revenue, or if the products repel each other.
2. Consider the following transaction data set 'D', which shows 9 transactions and a list of items.
Find the frequent itemsets using the Apriori algorithm with minimum support = 2. (10 Marks) June/ July 2014
3a.Explain FP - growth algorithm for discovering frequent item sets. What are its limitations?
(08 Marks) Jan 2015, (8 Marks) June/ July 2016
For each item, construct its conditional pattern-base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path (a single path will
generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree
2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
If the conditional FP-tree contains a single path, simply enumerate all the patterns
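The mining steps above can be sketched compactly in Python. This is a simplified illustration rather than a reference implementation: the class and function names are hypothetical, the header table is a plain dictionary of node lists instead of linked node chains, and the single-path shortcut is omitted (the recursion simply continues until the conditional tree is empty). The four-transaction database and the minimum support of 2 are example inputs.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a count, a parent link and child links."""

    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}


def build_tree(weighted_transactions, min_sup):
    """Build an FP-tree from (items, count) pairs; return (header, item counts)."""
    counts = defaultdict(int)
    for items, n in weighted_transactions:
        for item in set(items):
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, n in weighted_transactions:
        # Keep frequent items only, ordered by descending support (ties by name).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += n
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_sup, suffix=frozenset()):
    """Yield (itemset, support count) for every frequent itemset."""
    header, frequent = build_tree(weighted_transactions, min_sup)
    for item, support in frequent.items():
        pattern = suffix | {item}
        yield pattern, support
        # Conditional pattern base: the prefix path above each node for `item`.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recursively mine the conditional FP-tree built from that base.
        yield from fp_growth(base, min_sup, pattern)


tdb = [("ACD", 1), ("BCE", 1), ("ABCE", 1), ("BE", 1)]  # example database
patterns = dict(fp_growth(tdb, min_sup=2))
for itemset, sup in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

No candidate itemsets are generated or tested: each recursive call only counts items and builds a smaller tree from the conditional pattern base.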
3b. What is Apriori algorithm? How it is used to find frequent item sets? Explain.
(08 Marks) Jan 2015
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
Apriori algorithm:
o Uses prior knowledge of frequent itemset properties.
o It is an iterative algorithm known as level-wise search.
o The search proceeds level-by-level as follows:
First determine the set of frequent 1-itemset; L1
Second determine the set of frequent 2-itemset using L1:
o The complexity of computing Li is O(n) where n is the number of
transactions in the transaction database.
o Reduction of search space:
In the worst case what is the number of itemsets in a level Li?
Apriori uses the "Apriori Property":
Apriori Property:
It is an anti-monotone property: if a set cannot pass a test, all
of its supersets will fail the same test as well.
It is called anti-monotone because the property is monotonic
in the context of failing a test.
All nonempty subsets of a frequent itemset must also be
frequent.
An itemset I is not frequent if it does not satisfy the minimum
support threshold:
P(I) < min_sup
Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
1st scan: C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3
L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3
C3
Itemset
{B, C, E}
3rd scan: L3
Itemset      sup
{B, C, E}    2
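The level-wise search can be traced on the database TDB with a small sketch. It is a simplified illustration, not an optimized implementation: the join step is written as a union of pairs of frequent (k-1)-itemsets rather than the usual prefix-based join, and the minimum support count of 2 is an assumption consistent with the L1 and L3 tables shown.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Scan the database to count the surviving candidates (Ck -> Lk).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent


TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(TDB, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

In the third pass the candidates {A, B, C} and {A, C, E} are removed by the prune step because {A, B} and {A, E} are not frequent, which leaves {B, C, E} as the only member of C3.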
3c. List the measures used for evaluating association patterns. (04 Marks) Jan 2015
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} should be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate association rules.
4b. What is FP - Growth algorithm? In what way it is used to find frequent itemsets?
(03 Marks) June/ July 2015
No candidate generation, no candidate test
Use compact data structure
Eliminate repeated database scan
Basic operation is counting and FP-tree building
never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining
Unit -5
1b. What is rule based classifier? Explain how a rule based classifier works.
(8 Marks) Jan 2014
Rule-Based Classifier
Classify records by using a collection of if…then…rules
Rule: (Condition) → y
– where
• Condition is a conjunction of attribute conditions
• y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Name    Blood Type    Give Birth    Can Fly    Live in Water    Class
hawk    warm          no            yes        no               ?
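How rules R1 to R5 cover a record can be sketched as follows. The rule encoding and function names are illustrative. The hawk's attribute values come from the table; the grizzly bear's values are an assumption chosen to be consistent with the statement that R3 covers it.

```python
# Each rule: (name, condition as attribute -> required value, class label).
RULES = [
    ("R1", {"Give Birth": "no", "Can Fly": "yes"}, "Birds"),
    ("R2", {"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),
    ("R3", {"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),
    ("R4", {"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),
    ("R5", {"Live in Water": "sometimes"}, "Amphibians"),
]


def covers(condition, record):
    """A rule covers a record if every attribute test in its condition holds."""
    return all(record.get(attr) == value for attr, value in condition.items())


def classify(record, default=None):
    """Return (rule name, class) of the first covering rule, else the default."""
    for name, condition, label in RULES:
        if covers(condition, record):
            return name, label
    return None, default


hawk = {"Blood Type": "warm", "Give Birth": "no",
        "Can Fly": "yes", "Live in Water": "no"}
# Assumed attribute values for the grizzly bear (not listed in the table).
grizzly_bear = {"Blood Type": "warm", "Give Birth": "yes",
                "Can Fly": "no", "Live in Water": "no"}

print(classify(hawk))          # ('R1', 'Birds')
print(classify(grizzly_bear))  # ('R3', 'Mammals')
```

Taking the first covering rule is one possible conflict-resolution strategy (an ordered rule list); a record covered by no rule falls back to the default class.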
1c. Write the algorithm for k-nearest neighbour classification. (4 Marks) Jan 2014
Requires three things
– The set of stored records
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by
taking majority vote)
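The procedure above can be written as a minimal sketch, assuming Euclidean distance as the distance metric and a simple majority vote. The training records and labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training records.

    `train` is a list of (feature_vector, class_label) pairs.
    """
    # Compute the distance to every stored record and keep the k nearest.
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    # Use the neighbors' class labels to decide the label of the unknown record.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.0), "B"), ((6.0, 7.0), "B"),
]

print(knn_classify(train, (1.5, 1.5), k=3))  # A
print(knn_classify(train, (6.5, 6.5), k=3))  # B
```

An odd k avoids tied votes in two-class problems; weighting votes by distance is a common variation that is not shown here.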
2a. Define classification. Draw a neat figure and explain general approach for solving
classification model. (06 Marks) June/ July 2014
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified result from
the model
Accuracy rate is the percentage of test set samples that are correctly
classified by the model
Test set is independent of training set, otherwise over-fitting will occur
2b. Mention the three impurity measures for selecting best splits.
(04 Marks) June/ July 2014
Consider a training set that contains 60 +ve examples and 100 -ve examples, for each of the
following candidate rules.
Rule r1: Covers 50 +ve examples and 5 -ve examples.
Rule r2: Covers 2 +ve examples and no -ve examples.
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
Depends on number of ways to split
2-way split
Multi-way split
Multi-way split: Use as many partitions as distinct values.
Binary split: Divides values into two subsets. Need to find optimal partitioning.
2c. Determine which is the best and worst candidate rule according to,
i) Rule accuracy
ii) Likelihood ratio statistic.
iii) Laplace measure. (10 Marks) June/ July 2014
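For the candidate rules whose coverage is stated in these notes (r1 and r2, on a training set of 60 positive and 100 negative examples), the three measures can be computed with a short sketch. It assumes the common textbook definitions: accuracy = f+ / n, Laplace = (f+ + 1) / (n + k) with k = 2 classes, and likelihood ratio R = 2 * sum of f_i * log2(f_i / e_i), where e_i is the expected frequency of class i among the covered records. Function names are illustrative.

```python
import math


def rule_accuracy(pos, neg):
    """Fraction of covered examples that are positive."""
    return pos / (pos + neg)


def laplace(pos, neg, k=2):
    """Laplace estimate with k classes: (f+ + 1) / (n + k)."""
    return (pos + 1) / (pos + neg + k)


def likelihood_ratio(pos, neg, total_pos, total_neg):
    """R = 2 * sum f_i * log2(f_i / e_i); a class with f_i = 0 contributes 0."""
    covered = pos + neg
    total = total_pos + total_neg
    stat = 0.0
    for observed, class_total in ((pos, total_pos), (neg, total_neg)):
        expected = covered * class_total / total
        if observed > 0:
            stat += observed * math.log2(observed / expected)
    return 2 * stat


TOTAL_POS, TOTAL_NEG = 60, 100
rules = {"r1": (50, 5), "r2": (2, 0)}  # (covered +ve, covered -ve)

for name, (pos, neg) in rules.items():
    print(name,
          round(rule_accuracy(pos, neg), 3),
          round(likelihood_ratio(pos, neg, TOTAL_POS, TOTAL_NEG), 2),
          round(laplace(pos, neg), 3))
```

Rule accuracy favours r2 (2/2 against 50/55), while the likelihood ratio and the Laplace measure (51/57 against 3/4) favour r1, because both take the rule's coverage into account.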
3a. How decision trees are used for classification? Explain decision tree induction
algorithm for classification. (10 Marks) Jan 2015
3b. How to improve accuracy of classification? Explain. (05 Marks) Jan 2015
Boosting increases classification accuracy
Applicable to decision trees or Bayesian classifier
Learn a series of classifiers, where each classifier in the series pays more attention to the
examples misclassified by its predecessor
Boosting requires only linear time and constant space
Assign every example an equal weight 1/N
For t = 1, 2, …, T Do
Obtain a hypothesis (classifier) h(t) under w(t)
Calculate the error of h(t) and re-weight the examples based on the error
Normalize w(t+1) to sum to 1
Output a weighted sum of all the hypothesis, with each hypothesis weighted according to
its accuracy on the training set
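The boosting loop above can be made concrete with a small sketch. It uses the AdaBoost weighting scheme as one specific choice (hypothesis weight alpha = 0.5 * ln((1 - error) / error), exponential re-weighting), which the steps above do not spell out, and it uses one-dimensional decision stumps as the base classifier. The data set and all names are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict `polarity` for x <= threshold and the opposite class otherwise."""
    return polarity if x <= threshold else -polarity


def best_stump(xs, ys, w):
    """Return (weighted error, threshold, polarity) of the best 1-D stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` stumps; labels must be +1 or -1."""
    n = len(xs)
    w = [1.0 / n] * n  # assign every example an equal weight 1/N
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this hypothesis
        ensemble.append((alpha, threshold, polarity))
        # Re-weight: misclassified examples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]             # normalize to sum to 1
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of all hypotheses."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify perfectly.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

Each round concentrates on the examples the previous stump got wrong, so the weighted vote can represent a decision boundary that no individual stump can.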
Accuracy − The accuracy of a classifier refers to its ability to predict the class label
correctly; the accuracy of a predictor refers to how well a given predictor can guess the
value of the predicted attribute for new data.
Speed − This refers to the computational cost of generating and using the classifier or
predictor.
Robustness − It refers to the ability of the classifier or predictor to make correct predictions from
given noisy data.
Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently,
given a large amount of data.
Classification is the task of assigning objects to one of several predefined categories. It is the
task of mapping an input attribute set x into a class label y.
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
1. Decision tree induction is a non-parametric approach for building classification models.
2. Finding an optimal decision tree is an NP-complete problem, so practical algorithms use a
heuristic approach, i.e. greedy, top-down, recursive partitioning.
3. Computationally inexpensive.
4. Smaller trees are easy to interpret.
5. Expressive representation for learning discrete-valued functions.
6. Robust to the presence of noise.
7. Redundant attributes do not affect the accuracy of decision trees.
8. At leaf nodes, the number of records may be too small to make decisions about the class
representation of the nodes; this is known as the data fragmentation problem. The solution is
to disallow further splitting.
4c. Explain sequential covering algorithm in rule -based classifier.
Name            Blood Type   Give Birth   Can Fly   Live in Water   Class
human           warm         yes          no        no              mammals
python          cold         no           no        no              reptiles
salmon          cold         no           no        yes             fishes
whale           warm         yes          no        yes             mammals
frog            cold         no           no        sometimes       amphibians
komodo          cold         no           no        no              reptiles
bat             warm         yes          yes       no              mammals
pigeon          warm         no           yes       no              birds
cat             warm         yes          no        no              mammals
leopard shark   cold         yes          no        yes             fishes
turtle          cold         no           no        sometimes       reptiles
penguin         warm         no           no        sometimes       birds
porcupine       warm         yes          no        no              mammals
eel             cold         no           no        yes             fishes
salamander      cold         no           no        sometimes       amphibians
gila monster    cold         no           no        no              reptiles
platypus        warm         no           no        no              mammals
owl             warm         no           yes       no              birds
dolphin         warm         yes          no        yes             mammals
eagle           warm         no           yes       no              birds
5. Consider a training data set that contains 100 positive examples and 400 negative
examples. (10 Marks) June/ July 2016
i) Rule accuracy
Unit -6
1a. What is Bayes' Theorem? Show how it is used for classification. (06 Marks) Jan 2014
Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability that
a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Standard: Even when Bayesian methods are computationally intractable, they can provide a
standard of optimal decision making against which other methods can be measured
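Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes'
theorem:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
MAP (maximum a posteriori) hypothesis:
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
Practical difficulty: requires initial knowledge of many probabilities and has significant
computational cost.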
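The posterior and MAP computations can be illustrated with a small sketch. The hypothesis names, priors and likelihoods below are hypothetical numbers chosen only for illustration; the functions `posteriors` and `map_hypothesis` are likewise names introduced here.

```python
def posteriors(priors, likelihoods):
    """Return P(h|D) for every hypothesis h using Bayes' theorem."""
    # Unnormalised scores P(D|h) * P(h)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # P(D), the normalising constant
    return {h: score / evidence for h, score in joint.items()}


def map_hypothesis(priors, likelihoods):
    """Return the hypothesis maximising P(D|h) * P(h); P(D) is not needed."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])


# Hypothetical prior P(h) and likelihood P(D|h) for two hypotheses
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.8, "ham": 0.1}

post = posteriors(priors, likelihoods)
best = map_hypothesis(priors, likelihoods)
print(post, best)
```

Because P(D) is the same for every hypothesis, the MAP choice only needs the product P(D|h) P(h), which is why `map_hypothesis` skips the normalisation step.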
1b. Discuss the methods for estimating predictive accuracy of classification method.
(10 marks) Jan 2014,(10 Marks) June/ July 2016
1c. What are the two approaches for extending binary classifiers to handle
multi-class problems? (4 marks) Jan 2014, (10 Marks) June/ July 2014
Each training point belongs to one of N different classes. The goal is to construct a function
which, given a new data point, will correctly predict the class to which the new point belongs.
Some classification algorithms/models have been adapted to the multi-label task, without
requiring problem transformations. Examples of these include:
• boosting: AdaBoost.MH and AdaBoost.MR are extended versions of AdaBoost for
multi-label data.
• k-nearest neighbors: the ML-kNN algorithm extends the k-NN classifier to multi-label data.
• decision trees: "Clare" is an adapted C4.5 algorithm for multi-label classification; the
modification involves the entropy calculations. MMC, MMDT, and SSC refined MMDT
can classify multi-labeled data based on multi-valued attributes without transforming the
attributes into single values. They are also named multi-valued and multi-labeled
decision tree classification methods.
• kernel methods for vector output
• neural networks: BP-MLL is an adaptation of the popular back-propagation algorithm for
multi-label learning.
2a. For the given confusion matrix below for three classes. Find sensitivity and specificity
metrics to estimate predictive accuracy of classification methods.
(10 Marks) June/ July 2014
3a. What are Bayesian classifiers? Explain Bayes' theorem for classification.
(10 Marks) Jan 2015
3b. How are rule-based classifiers used for classification? Explain. (10 Marks) Jan 2015
Characteristics of Rule-Based Classifier
• Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Every record is covered by at most one rule
• Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of
attribute values
– Each record is covered by at least one rule
• Rules are rank ordered according to their priority
– An ordered rule set is known as a decision list
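A decision list can be sketched as an ordered list of rules in which the first rule that covers a record decides its class. The rules below are illustrative assumptions written against the vertebrate attributes (blood type, give birth, can fly, live in water); they are not claimed to be the rule set any particular algorithm would learn, and the names `RULES`, `covers` and `classify` are introduced here.

```python
# Each rule is (conditions, class label); list order is rule priority.
RULES = [
    ({"give_birth": "no", "can_fly": "yes"}, "birds"),
    ({"give_birth": "no", "live_in_water": "yes"}, "fishes"),
    ({"give_birth": "yes", "blood_type": "warm"}, "mammals"),
    ({"give_birth": "no", "can_fly": "no"}, "reptiles"),
    ({"live_in_water": "sometimes"}, "amphibians"),
]


def covers(conditions, record):
    """A rule covers a record when every condition matches the record."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def classify(record, rules=RULES, default=None):
    """Return the label of the highest-priority rule that covers the record."""
    for conditions, label in rules:
        if covers(conditions, record):
            return label
    return default  # no rule fired: the rule set is not exhaustive


human = {"blood_type": "warm", "give_birth": "yes", "can_fly": "no", "live_in_water": "no"}
turtle = {"blood_type": "cold", "give_birth": "no", "can_fly": "no", "live_in_water": "sometimes"}
shark = {"blood_type": "cold", "give_birth": "yes", "can_fly": "no", "live_in_water": "yes"}

print(classify(human), classify(turtle), classify(shark))
```

The turtle record is covered by both the fourth and the fifth rule, so these rules are not mutually exclusive and the ordering resolves the conflict. The leopard shark record is covered by no rule, so the set is not exhaustive and a default class is needed.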
Accuracy simply measures how often the classifier makes the correct prediction. It’s the ratio
between the number of correct predictions and the total number of predictions (the number of
test data points).
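As a rough illustration, accuracy can be read directly off a confusion matrix, and the same matrix also gives per-class sensitivity TP / (TP + FN) and specificity TN / (TN + FP) when one class is treated as positive and the rest as negative. The 3-class matrix below is hypothetical, and the function names are introduced here.

```python
def accuracy(confusion):
    """confusion[i][j] = number of records of true class i predicted as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def sensitivity_specificity(confusion, k):
    """Treat class k as positive and all other classes as negative."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                        # class k predicted as something else
    fp = sum(confusion[i][k] for i in range(n)) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical confusion matrix (rows: true class, columns: predicted class)
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

acc = accuracy(cm)
sens0, spec0 = sensitivity_specificity(cm, 0)
print(acc, sens0, spec0)
```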
4c. Consider the following training set for predicting the loan default problem: Find the
conditional independence for given training set using Bayes theorem for classification.
(08 Marks) June/ July 2015
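• Normal distribution:
$$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}}\, e^{-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$$
– For (Income, Class=No):
– If Class=No, the sample mean is $\mu_{ij} = 110$, so for Income = 120 the exponent becomes
$-\frac{(120 - 110)^2}{2\sigma_{ij}^2}$, where $\sigma_{ij}^2$ is the sample variance of Income over the
Class=No records.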
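The normal-distribution estimate of P(A_i | c_j) for a continuous attribute such as Income can be evaluated numerically. This is a minimal sketch: the mean 110 matches the (120 - 110)^2 term in the formula, while the variance 2975 is an assumed value, since the training set itself is not reproduced here. The function name `gaussian_likelihood` is introduced for illustration.

```python
import math


def gaussian_likelihood(x, mean, variance):
    """Class-conditional density P(A_i = x | c_j) under a normal distribution."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * variance))


# Income = 120 for Class=No, with mean 110 and an assumed variance of 2975
p_income_given_no = gaussian_likelihood(120, mean=110, variance=2975)
print(round(p_income_given_no, 4))
```

In a naive Bayes classifier this density is multiplied with the conditional probabilities of the other attributes and the class prior to score each class.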
Unit -7
1a. List and explain four distance measures to compute the distance between a pair of
points and find out the distance between two objects represented by attribute.
(08 marks) Jan 2014
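Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$
For binary variables, let a be the number of variables equal to 1 for both objects, b the number
equal to 1 for i and 0 for j, c the number equal to 0 for i and 1 for j, and d the number equal to 0
for both.
Simple matching coefficient (invariant, if the binary variable is symmetric):
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
Jaccard coefficient (noninvariant, if the binary variable is asymmetric):
$$d(i, j) = \frac{b + c}{a + b + c}$$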
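The distance measures above can be sketched in a few lines. Here a, b, c and d are the usual contingency counts for two binary vectors (a: both 1, b: 1 in i and 0 in j, c: 0 in i and 1 in j, d: both 0); the function names and the sample vectors are made up for illustration.

```python
def manhattan(x, y):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(x, y))


def binary_counts(x, y):
    """Contingency counts (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def simple_matching_distance(x, y):
    """(b + c) / (a + b + c + d): suitable for symmetric binary variables."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_distance(x, y):
    """(b + c) / (a + b + c): ignores 0/0 matches, for asymmetric binary variables."""
    a, b, c, _ = binary_counts(x, y)
    denom = a + b + c
    return (b + c) / denom if denom else 0.0  # two all-zero vectors: treat as identical


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]

print(manhattan([1, 2, 3], [4, 0, 3]))
print(simple_matching_distance(obj_i, obj_j), jaccard_distance(obj_i, obj_j))
```

The two binary measures differ only in whether the 0/0 matches (d) count as agreement, which is why the Jaccard distance for the sample vectors is larger than the simple matching distance.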
1b. Explain the cluster analysis methods briefly.
(08 marks) Jan 2014, (08 Marks) June/ July 2015
Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the idea is to find the
best fit of the data to that model
2a. Explain K-means clustering method and algorithm. (10 Marks) June/ July 2014,
(10 Marks) Jan 2015, (12 Marks) June/ July 2015, (5 Marks) June/ July 2016
2b. What is Hierarchical clustering method? Explain the algorithms for computing
distances between clusters. (10 Marks) June/ July 2014, (10 Marks) June/ July 2016
Use distance matrix as clustering criteria. This method does not require the number of clusters k
as an input, but needs a termination condition
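A minimal sketch of agglomerative (bottom-up) hierarchical clustering follows. It assumes single-link distance between clusters (the smallest pairwise distance) and uses a distance threshold as the termination condition; both choices, along with the function name and the sample points, are assumptions made for illustration, and complete-link or average-link would only change how `link` is computed.

```python
def single_link_clusters(points, max_distance):
    """Merge the two closest clusters repeatedly until they are farther apart than max_distance."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    n = len(points)
    # Distance matrix used as the clustering criterion
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with every point in its own cluster

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members
                link = min(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        if best[0] > max_distance:  # termination condition
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = single_link_clusters(points, max_distance=2)
print(result)
```

No value of k is supplied; the number of clusters falls out of where the merging stops.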
3. How are density-based methods used for clustering? Explain with example.
(10 Marks) Jan 2015
Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p and DBSCAN visits
the next point of the database.
Continue the process until all of the points have been processed.
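The DBSCAN steps above can be sketched in plain Python. This is a simplified illustration, not a tuned implementation: it uses Euclidean distance and a linear scan for the neighbourhood query, and the sample points and parameter values are made up.

```python
def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], including idx itself."""
    p = points[idx]
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means not yet visited
    cluster_id = -1
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # not a core point; may later be claimed as a border point
            continue
        cluster_id += 1  # p is a core point, so a new cluster is formed
        labels[p] = cluster_id
        seeds = [q for q in neighbors if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id  # border point reached from a core point
            if labels[q] is not None:
                continue  # already assigned; border points are not expanded
            labels[q] = cluster_id
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:
                seeds.extend(q_neighbors)  # q is also a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Here `eps` and `min_pts` play the roles of Eps and MinPts; the isolated point at (50, 50) has too few neighbours to be a core point and is not reachable from one, so it is labelled as noise.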
Unit -8
b. Text mining (5 Marks) Jan 2015, (05 Marks) June/ July 2016
Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to
the process of deriving high-quality information from text. High-quality information is typically
derived through the devising of patterns and trends through means such as statistical pattern
learning. Text mining usually involves the process of structuring the input text (usually parsing,
along with the addition of some derived linguistic features and the removal of others, and
subsequent insertion into a database), deriving patterns within the structured data, and finally
evaluation and interpretation of the output. 'High quality' in text mining usually refers to some
combination of relevance, novelty, and interestingness. Typical text mining tasks include text
categorization, text clustering, concept/entity extraction, production of granular taxonomies,
sentiment analysis, document summarization, and entity relation modeling (i.e., learning
relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency
distributions, pattern recognition, tagging/annotation, information extraction, data mining
techniques including link and association analysis, visualization, and predictive analytics. The
overarching goal is, essentially, to turn text into data for analysis, via application of natural
language processing (NLP) and analytical methods.
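As a small illustration of turning text into data, the sketch below tokenizes a document and counts word frequencies, one of the lexical analysis steps mentioned above. The stop-word list and the sample sentence are made up for the example, and real text mining pipelines typically add much richer linguistic processing.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text):
    """Count how often each non-stop-word token occurs."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


doc = "Text mining turns text into data. Mining the text reveals patterns in the data."
tf = term_frequencies(doc)
print(tf.most_common(3))
```

The resulting counts are structured data that later steps such as categorization or clustering can operate on.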
implemented on top of the DBMS Illustra and are being ported to Informix Universal Server.
Algorithms for Spatial Data Mining
New algorithms for spatial characterization and spatial trend analysis were developed.
For spatial characterization it is important that class membership of a database object is not only
determined by its non-spatial attributes but also by the attributes of objects in its neighborhood.
In spatial trend analysis, patterns of change of some non-spatial attributes in the neighborhood of
a database object are determined.
Spatial Trend Detection in GIS
Spatial trends describe a regular change of non-spatial attributes when moving away from
certain start objects. Global and local trends can be distinguished. To detect and explain such
spatial trends, e.g. with respect to economic power, is an important issue in economic geography.
Spatial Characterization of Interesting Regions
Another important task of economic geography is to characterize certain target regions
such as areas with a high percentage of retirees. Spatial characterization does not only consider
the attributes of the target regions but also neighboring regions and their properties.
Temporal Data Mining (TDM) deals with the problem of mining patterns from temporal data,
which can be either symbolic sequences or numerical time series. It has the capability to look for
interesting correlations or rules in large sets of temporal data, which might be overlooked when
the temporal component is ignored or treated as a simple numeric attribute. Currently TDM is a
fast expanding field with many research results reported and many new temporal data mining
analysis methods or prototypes developed recently. There are two factors that contribute to the
popularity of temporal data mining. The first factor is an increase in the volume of temporal data
stored, as many real world applications deal with huge amounts of temporal data. The second
factor is the mounting recognition of the value of temporal data.
In many application domains, temporal data are now being viewed as invaluable assets from
which hidden knowledge can be derived, so as to help understand the past and/or plan for the
future. TDM covers a wide spectrum of paradigms for knowledge modeling and discovery. Since
temporal data mining is a relatively new field of research, there is no widely accepted taxonomy
The web content mining approach of using the traditional search engine has migrated into
intelligent agent-based mining and database-driven mining, where intelligent software agents for
specific tasks support the search for more relevant web contents by taking domain characteristics
and user profiles into consideration more intelligently. They also help users interpret the
discovered web contents.
Web content mining is related to, but different from, data mining and text mining. It is related to data
mining because many data mining techniques can be applied in Web content mining. It is related
to text mining because much of the web contents are texts. However, it is also quite different
from data mining because Web data are mainly semi-structured and/or unstructured, while data
mining deals primarily with structured data. Web content mining is also different from text
mining because of the semi-structured nature of the Web, while text mining focuses on
unstructured texts. Web content mining thus requires creative applications of data mining and/or
text mining techniques and also its own unique approaches. In the past few years, there was a
rapid expansion of activities in the Web content mining area. This is not surprising because of
the phenomenal growth of the Web contents and significant economic benefit of such mining.
However, due to the heterogeneity and the lack of structure of Web data, automated discovery of
targeted or unexpected knowledge still presents many challenging research problems.
We will examine the following important Web content mining problems and discuss existing
techniques for solving these problems. Some other emerging problems will also be surveyed.
internal structure, they are still considered "unstructured" because the data they contain doesn't
fit neatly in a database.
Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the
amount of unstructured data in enterprises is growing significantly — often many times faster
than structured databases are growing.
Mining Unstructured Data
Many organizations believe that their unstructured data stores include information that could
help them make better business decisions. Unfortunately, it's often very difficult to analyze
unstructured data. To help with the problem, organizations have turned to a number of different
software solutions designed to search unstructured data and extract important information. The
primary benefit of these tools is the ability to glean actionable information that can help a
business succeed in a competitive environment.
Because the volume of unstructured data is growing so rapidly, many enterprises also turn to
technological solutions to help them better manage and store their unstructured data. These can
include hardware or software solutions that enable them to make the most efficient use of their
available storage space.
Unstructured Data and Big Data
As mentioned above, unstructured data is the opposite of structured data. Structured data
generally resides in a relational database, and as a result, it is sometimes called relational data.
This type of data can be easily mapped into pre-designed fields. For example, a database
designer may set up fields for phone numbers, zip codes and credit card numbers that accept a
certain number of digits. Structured data has been or can be placed in fields like these. By
contrast, unstructured data is not relational and doesn't fit into these sorts of pre-defined data
Semi-Structured Data
In addition to structured and unstructured data, there's also a third category: semi-structured data.
Semi-structured data is information that doesn't reside in a relational database but that does have
some organizational properties that make it easier to analyze. Examples of semi-structured data
might include XML documents and NoSQL databases.
The term big data is closely associated with unstructured data. Big data refers to extremely large
datasets that are difficult to analyze with traditional tools. Big data can include both structured
and unstructured data, but IDC estimates that 90 percent of big data is unstructured data. Many
of the tools designed to analyze big data can handle unstructured data.
Unstructured Data Management
Organizations use a variety of different software tools to help them organize and manage
unstructured data. These can include the following:
Big data tools
Software like Hadoop can process stores of both unstructured and structured data that are
extremely large, very complex and changing rapidly.
Business intelligence software
Also known as BI, business intelligence is a broad category of analytics, data mining, dashboards
and reporting tools that help companies make sense of their structured and unstructured data for
the purpose of making better business decisions.
Data integration tools
These tools combine data from disparate sources so that they can be viewed or analyzed from a
single application. They sometimes include the capability to unify structured and unstructured data.
Document management systems
Also called enterprise content management systems, a DMS can track, store and share
unstructured data that is saved in the form of document files.
Information management solutions
This type of software tracks structured and unstructured enterprise data throughout its lifecycle.
Search and indexing tools
These tools retrieve information from unstructured data files such as documents, Web pages and
Unstructured Data Technology
A group called the Organization for the Advancement of Structured Information Standards
(OASIS) has published the Unstructured Information Management Architecture (UIMA)
standard. The UIMA "defines platform-independent data representations and interfaces for
software components or services called analytics, which analyze unstructured information and
assign semantics to regions of that unstructured information."
Many industry watchers say that Hadoop has become the de facto industry standard for
managing Big Data. This open source project is managed by the Apache Software Foundation.
Data mining has been used in a wide range of applications. However, the possible objectives of
data mining, which are often called tasks of data mining, can be classified into some broad
categories: prediction, classification, clustering, search and retrieval, and pattern discovery. This
categorization follows the categorization of data mining tasks extended to temporal data mining.
Prediction: Prediction is the task of explicitly modelling variable dependencies to predict a
Classification: Classification is the task of assigning class labels to the data according to a model
learned from the training data where the classes are known. Classification is one of the most
common tasks in supervised learning, but it has not received much attention in temporal data
mining. In sequence classification, each sequence presented to the system is assumed to belong
to one of a set of predefined classes and the goal is to automatically determine the corresponding
category for a given input sequence.
Clustering: Clustering is the process of finding intrinsic groups, called clusters, in the data.
Clustering of time series is concerned with grouping a collection of time series (or sequences)
based on their similarity. Time series clustering has been shown effective in providing useful
information in various domains. Clustering of sequences is relatively less explored but is
becoming increasingly important in data mining applications such as web usage
mining and bioinformatics.
Searching and Retrieval: Searching and retrieval are concerned with efficiently locating
subsequences or sub series in large databases of sequences or time series. In data mining, query
based searches are more concerned with the problem of efficiently locating approximate
matches than exact matches, known as content-based retrieval.
3. Explain the concept of finding similar web pages and finger printing in detail
(10 Marks) June/ July 2016
likelihood of jumping to a page chosen at random from the entire web, they will reach Page E
through the collection to adjust approximate PageRank values to more closely reflect the
theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly
expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is
a 50% chance that a person clicking on a random link will be directed to the document with the
0.5 PageRank
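The idea of repeatedly passing through the collection to refine approximate PageRank values can be sketched as a simple iteration. The damping factor 0.85 (the probability of following a link rather than jumping to a random page) is a commonly used value and is an assumption here, as are the function name and the four-page link graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to; returns page -> rank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the random-jump share first
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page with no outlinks: spread its rank over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B to C, C to A, D to C
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(ranks)
```

The ranks sum to 1, which matches the reading of PageRank as a probability distribution over pages.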
Fingerprinting
In short, fingerprinting is the capability of a site to identify or re-identify a visiting user, user
agent or device via configuration settings or other observable characteristics.
Fingerprinting can be used as a security measure (e.g. as means of authenticating the user).
However, fingerprinting is also a potential threat to users' privacy on the Web. This document
does not attempt to provide a single unifying definition of "privacy" or "personal data", but we
highlight how browser fingerprinting might impact users' privacy. For example, browser
fingerprinting can be used to:
identify a user
correlate a user’s browsing activity within and across sessions
track users without transparency or control
The privacy implications associated with each use case are discussed below. Following from the
practice of security threat model analysis, we note that there are distinct models of privacy
threats for fingerprinting. Defenses against these threats differ, depending on the particular
privacy implication and the threat model of the user.
Passive fingerprinting is browser fingerprinting based on characteristics observable in the
contents of Web requests, without the use of any code executing on the client side.
For active fingerprinting, we also consider techniques where a site runs JavaScript or other code
on the local client to observe additional characteristics about the browser. Techniques for active
fingerprinting might include accessing the window size, enumerating fonts or plug-ins,
evaluating performance characteristics, or rendering graphical patterns. Key to this distinction is
that active fingerprinting takes place in a way that is potentially detectable on the client.
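Passive fingerprinting can be illustrated by hashing characteristics that are visible in the request itself. The sketch below is a hypothetical, server-side illustration: the choice of headers, the function name and the sample header values are all assumptions, and real fingerprinting schemes combine many more signals.

```python
import hashlib

# Request headers used for the fingerprint (an illustrative selection)
FINGERPRINT_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding")


def passive_fingerprint(headers):
    """Hash selected request headers into a stable identifier; no client-side code runs."""
    parts = [f"{name}={headers.get(name, '')}" for name in FINGERPRINT_HEADERS]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


visit_1 = {
    "User-Agent": "ExampleBrowser/1.0",
    "Accept": "text/html",
    "Accept-Language": "en-GB,en;q=0.8",
    "Accept-Encoding": "gzip",
}
visit_2 = dict(visit_1)                             # same configuration on a later visit
visit_3 = dict(visit_1, **{"Accept-Language": "de-DE"})  # a different configuration

print(passive_fingerprint(visit_1) == passive_fingerprint(visit_2))
print(passive_fingerprint(visit_1) == passive_fingerprint(visit_3))
```

Because nothing executes on the client, this kind of fingerprinting is not detectable from the browser side, in contrast to active techniques that run JavaScript.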