Data Mining Notes
There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information, so it is necessary to analyze this huge amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; knowledge discovery also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, the resulting information can be used in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.
• Data Characterization:
This refers to the summary of the general characteristics or features of the class under study. For example, to study the characteristics of software products whose sales increased by 15% two years ago, one can collect data related to such products by running SQL queries.
• Data Discrimination:
It compares the features of the class under study (the target class) with those of one or more contrasting classes. The output of this process can be represented in many forms, e.g., bar charts, curves and pie charts.
2. Mining Frequent Patterns, Associations, and Correlations:
Frequent patterns are nothing but things that are found to be most common in the data.
There are different kinds of frequency that can be observed in the dataset.
• Frequent item set:
This refers to a set of items that are frequently seen together, e.g., milk and sugar.
• Frequent Subsequence:
This refers to a sequence of events that frequently occurs, such as purchasing a phone followed by a back cover.
• Frequent Substructure:
It refers to the different kinds of data structures such as trees and graphs that may
be combined with the itemset or subsequence.
Association Analysis:
The process involves uncovering the relationship between data items and deriving association rules. It is a way of discovering relationships between various items; for example, it can be used to determine the sales of items that are frequently purchased together.
Correlation Analysis:
Correlation is a mathematical technique that can show whether and how strongly pairs of attributes are related to each other. For example, taller people tend to weigh more.
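As a small illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient for a made-up sample of heights and weights; the values and function name are purely illustrative and are not data from these notes.

# A minimal sketch: Pearson correlation between height (cm) and weight (kg).
# The sample values below are illustrative, not real data.
import math

heights = [150, 160, 165, 170, 180, 185]
weights = [52, 58, 63, 68, 77, 82]

def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

print(round(pearson(heights, weights), 3))   # close to 1.0 => strong positive correlation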
Classification of data mining
The information or knowledge extracted in this way can be used in any of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of the day or week, etc. It also analyzes patterns that deviate from expected norms.
Performance Issues
There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update the mining results when the database changes, without mining the data again from scratch.
1. Binning Method:
This method works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or the boundary values of the segment can be used, to complete the smoothing; a small sketch is given below.
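The following is a minimal sketch of the binning method just described: the data is sorted, split into equal-size bins, and each bin is smoothed either by its mean or by its boundaries. The data values and bin size are illustrative.

# A minimal sketch of the binning method: sort the data, split it into
# equal-size bins, then smooth each bin by its mean or by its boundaries.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative values
bin_size = 3

sorted_data = sorted(data)
bins = [sorted_data[i:i + bin_size] for i in range(0, len(sorted_data), bin_size)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]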
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having several independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch appears after this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
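The sketch below illustrates the min-max normalization mentioned in point 1 above, rescaling an attribute into the range 0.0 to 1.0. The salary values are illustrative.

# A minimal sketch of min-max normalization: rescale each value of an
# attribute into the range [0.0, 1.0]. The salary values are illustrative.
salaries = [30000, 45000, 60000, 90000]

lo, hi = min(salaries), max(salaries)
normalized = [(v - lo) / (hi - lo) for v in salaries]
print(normalized)   # [0.0, 0.25, 0.5, 1.0]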
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such large volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: attributes whose p-value is greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
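As a small illustration of dimensionality reduction, the sketch below applies PCA to a tiny made-up matrix. It assumes the NumPy and scikit-learn libraries are available (they are not part of these notes); the data values and the choice of two components are illustrative.

# A minimal sketch of dimensionality reduction with PCA, assuming the
# scikit-learn and NumPy libraries are installed. The 4-dimensional data
# below is illustrative.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5, 1.1],
              [0.5, 0.7, 2.2, 2.9],
              [2.2, 2.9, 0.8, 1.0],
              [1.9, 2.2, 1.1, 1.5],
              [3.1, 3.0, 0.2, 0.7]])

pca = PCA(n_components=2)          # keep the 2 most informative components
X_reduced = pca.fit_transform(X)   # 5 x 4 matrix compressed to 5 x 2
print(X_reduced.shape)             # (5, 2)
print(pca.explained_variance_ratio_)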
Unit II
Data Mining Task Primitives:
Each user will have a data mining task in mind, that is, some form of data
analysis that he or she would like to have performed. A data mining task can be specified in the form of a
data mining query, which is input to the data mining system. A data mining query is defined in terms of
data mining task primitives. These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process, or examine the findings from different
angles or depths. The data mining primitives specify the following, as illustrated in Figure 1.13.
The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in
which the user is interested. This includes the database attributes or data warehouse dimensions of interest
(referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain to be
mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept
hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels
of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is shown in Figure
1.14. User beliefs regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining
process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have
different interestingness measures. For example, interestingness measures for association rules include
support and confidence. Rules whose support and confidence values are below user-specified thresholds
are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which
discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and
cubes
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner
data mining system. The Data Mining Query Language is actually based on the Structured Query
Language (SQL). Data Mining Query Languages can be designed to support ad hoc and interactive
data mining. This DMQL provides commands for specifying primitives. The DMQL can work with
databases and data warehouses as well. DMQL can be used to define data mining tasks.
Particularly we examine how to define data warehouses and data marts in DMQL.
Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, to mine a description describing customer purchasing habits −
mine characteristics as customerPurchasing
analyze count%
Discrimination
The syntax for Discrimination is −
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
For example, a user may define big spenders as customers who purchase items that cost $100 or
more on an average; and budget spenders as customers who purchase items at less than $100 on
an average. The mining of discriminant descriptions for customers from each of these categories
can be specified in the DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
Association
The syntax for Association is−
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For Example −
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is key of customer relation; P and Q are predicate variables; and W, Y, and Z are object
variables.
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating, the task can be specified as −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
The syntax for prediction is −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
-operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
Predictive mining: Based on data and analysis, constructs models for the database, and predicts the
trend and properties of unknown data
• Concept description: Characterization: provides a concise and succinct summarization of the
given collection of data
Concept description:
– Can handle complex data types of attributes and their aggregations.
OLAP:
– User-controlled process.
Data generalization – A process which abstracts a large set of task-relevant data in a database from
a low conceptual level to higher ones.
• Strength:
– Generalization and specialization can be performed on a data cube by roll-up and drill-down.
• Limitations:
– It handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values.
– It lacks intelligent analysis: it cannot tell which dimensions should be used or to what levels the generalization should reach.
Attribute-Oriented Induction
• How is it done?
– Collect the task-relevant data (the initial relation) using a relational database query.
– Perform generalization by attribute removal or by attribute generalization (climbing a concept hierarchy).
– Apply aggregation by merging identical, generalized tuples and accumulating their respective counts (a small sketch follows this list).
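The sketch below is a minimal illustration of these steps: generalize a low-level attribute (age) to a higher-level concept (an age range), then merge identical generalized tuples while accumulating their counts. The relation, attribute names and concept hierarchy are illustrative.

# A minimal sketch of attribute-oriented induction: generalize the age
# attribute to an age range, then merge identical generalized tuples and
# accumulate their counts. The initial relation below is illustrative.
from collections import Counter

initial_relation = [          # (major, age)
    ("CS", 21), ("CS", 23), ("CS", 34),
    ("Physics", 22), ("Physics", 36),
]

def generalize_age(age):      # a tiny concept hierarchy for age
    return "20-29" if age < 30 else "30-39"

generalized = Counter((major, generalize_age(age)) for major, age in initial_relation)
for tup, count in generalized.items():
    print(tup, "count =", count)
# ('CS', '20-29') count = 2, ('CS', '30-39') count = 1, ...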
“What if I am not sure which attributes to include for class characterization and class comparison? I may end up specifying too many attributes, which could slow down the system considerably.” Measures of attribute relevance analysis can be used to help identify irrelevant or weakly relevant attributes that can be excluded from the concept description process. The incorporation of this preprocessing step into class characterization or comparison is referred to as analytical characterization or analytical comparison, respectively. This section describes a general method of attribute relevance analysis and its integration with attribute-oriented induction.
The first limitation of class characterization for multidimensional data analysis in data warehouses and OLAP tools is the handling of complex objects. The second limitation is the lack of an automated generalization process: the user must explicitly tell the system which dimensions should be included in the class characterization and to how high a level each dimension should be generalized. Actually, the user must specify each step of generalization or specialization on any dimension.
Usually, it is not difficult for a user to instruct a data mining system regarding how high a level each dimension should be generalized. For example, users can set attribute generalization thresholds for this, or specify which level a given dimension should reach, such as with the command “generalize dimension location to the country level”. Even without explicit user instruction, a default value such as 2 to 8 can be set by the data mining system, which would allow each dimension to be generalized to a level that contains only 2 to 8 distinct values. If the user is not satisfied with the current level of generalization, she can specify dimensions on which drill-down or roll-up operations should be applied.
It is nontrivial, however, for users to determine which dimensions should be included in the analysis of class characteristics. Data relations often contain 50 to 100 attributes, and a user may have little knowledge regarding which attributes or dimensions should be selected for effective data mining. A user may include too few attributes in the analysis, causing the resulting mined descriptions to be incomplete. On the other hand, a user may introduce too many attributes for analysis (e.g., by indicating “in relevance to *”, which includes all the attributes in the specified relations).
Methods should be introduced to perform attribute (or dimension) relevance analysis in order to filter out statistically irrelevant or weakly relevant attributes, and retain or even rank the most relevant attributes for the descriptive mining task at hand. Class characterization that includes the analysis of attribute/dimension relevance is called analytical characterization. Class comparison that includes such analysis is called analytical comparison.
Intuitively, an attribute or dimension is considered highly relevant with respect to a given class if it is likely that the values of the attribute or dimension can be used to distinguish the class from others. For example, it is unlikely that the color of an automobile can be used to distinguish expensive from cheap cars, but the model, make, style, and number of cylinders are likely to be more relevant attributes. Moreover, even within the same dimension, different levels of concepts may have dramatically different powers for distinguishing a class from others.
For example, in the birth_date dimension, birth_day and birth_month are unlikely to be relevant to the salary of employees. However, the birth_decade (i.e., age interval) may be highly relevant to the salary of employees. This implies that the analysis of dimension relevance should be performed at multiple levels of abstraction, and only the most relevant levels of a dimension should be included in the analysis. Above we said that attribute/dimension relevance is evaluated based on the ability of the attribute/dimension to distinguish objects of a class from others. When mining a class comparison (or discrimination), the target class and the contrasting classes are explicitly given in the mining query. The relevance analysis should be performed by comparison of these classes, as we shall see below. However, when mining class characteristics, there is only one class to be characterized; that is, no contrasting class is specified. It is therefore not obvious what the contrasting class should be. In this case, the contrasting class is typically taken to be the set of comparable data in the database that excludes the set of data to be characterized. For example, to characterize graduate students, the contrasting class can be composed of the set of undergraduate students.
There have been many studies in machine learning, statistics, fuzzy and rough set theories, and so on, on attribute relevance analysis. The general idea behind attribute relevance analysis is to compute some measure that quantifies the relevance of an attribute with respect to a given class or concept. Such measures include information gain, the Gini index, uncertainty, and correlation coefficients. Here we introduce a method that integrates an information gain analysis technique with a dimension-based data analysis method. The resulting method removes the less informative attributes, collecting the more informative ones for use in concept description analysis.
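The sketch below illustrates the information gain measure mentioned above: the entropy of the class distribution minus the expected entropy after partitioning the data on one attribute. The tiny dataset and attribute names are illustrative.

# A minimal sketch of information gain: expected reduction in entropy of the
# class label when the data is partitioned by one attribute. Data is illustrative.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, label_index=-1):
    labels = [r[label_index] for r in rows]
    parts = {}
    for r in rows:
        parts.setdefault(r[attr_index], []).append(r[label_index])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# (credit_rating, buys_computer)
rows = [("fair", "yes"), ("fair", "yes"), ("excellent", "no"),
        ("excellent", "yes"), ("fair", "no"), ("excellent", "no")]
print(round(info_gain(rows, 0), 3))   # gain of splitting on credit_rating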
Data Collection:
Collect data for both the target class and the contrasting class by query processing. For class comparison, the
user in the data-mining query provides both the target class and the contrasting class. For class
characterization, the target class is the class to be characterized, whereas the contrasting class is the set of
comparable data that are not in the target class.
This step identifies a set of dimensions and attributes on which the selected relevance measure is to be applied. Since different levels of a dimension may have dramatically different relevance with respect to a given class, each attribute defining the conceptual levels of the dimension should, in principle, be included in the relevance analysis. Attribute-oriented induction (AOI) can be used to perform some preliminary relevance analysis on the data by removing or generalizing attributes having a very large number of distinct values (such as name and phone#). Such attributes are unlikely to be found useful for concept description. To be conservative, the AOI performed here should employ attribute generalization thresholds that are set reasonably large so as to allow more (but not all) attributes to be considered in further relevance analysis by the selected measure (Step 3 below). The relation obtained by such an application of AOI is called the candidate relation of the mining task.
Remove irrelevant and weakly relevant attributes using the selected relevance analysis measure:
Evaluate each attribute in the candidate relation using the selected relevance analysis measure. The relevance measure used in this step may be built into the data mining system or provided by the user. For example, the information gain measure described above may be used. The attributes are then sorted (i.e., ranked) according to their computed relevance to the data mining task. Attributes that are not relevant or are weakly relevant to the task are then removed. A threshold may be set to define “weakly relevant.” This step results in an initial target class working relation and an initial contrasting class working relation.
Generate the concept description using AOI:
Perform AOI using a less conservative set of attribute generalization thresholds. If the descriptive mining task is class characterization, only the initial target class working relation is included here. If the descriptive mining task is class comparison, both the initial target class working relation and the initial contrasting class working relation are included. The complexity of this procedure is that the induction process is performed twice, that is, in the preliminary relevance analysis (Step 2) and on the initial working relation (Step 4). The statistics used in attribute relevance analysis with the selected measure (Step 3) may be collected during the scanning of the database in Step 2.
Introduction: In many applications, users may not be interested in having a single class (or
concept) described or characterized, but rather would prefer to mine a description that compares or
distinguishes one class (or concept) from other comparable classes (or concepts).Class
discrimination or comparison (hereafter referred to as class comparison) mines descriptions that
distinguish a target class from its contrasting classes. Notice that the target and contrasting classes
must be comparable in the sense that they share similar dimensions and attributes. For example, the
three classes, person, address, and item, are not comparable.
However, the sales in the last three years are comparable classes, and so are computer science
students versus physics students. Our discussions on class characterization in the previous sections
handle multilevel data summarization and characterization in a single class. The techniques
developed can be extended to handle class comparison across several comparable classes. For
example, the attribute generalization process described for class characterization can be modified
so that the generalization is performed synchronously among all the classes compared. This allows
the attributes in all of the classes to be generalized to the same levels of abstraction. Suppose, for
instance, that we are given the All Electronics data for sales in 2003 and sales in 2004 and would
like to compare these two classes. Consider the dimension location with abstractions at the city,
province or state, and country levels. Each class of data should be generalized to the same location
level. That is, they are synchronously all generalized to either the city level, or the province or state
level, or the country level. Ideally, this is more useful than comparing, say, the sales in Vancouver
in 2003 with the sales in the United States in 2004 (i.e., where each set of sales data is generalized
to a different level). The users, however, should have the option to override such an automated, synchronous comparison.
1. Data collection: The set of relevant data in the database is collected by query processing and is
partitioned respectively into a target class and one or a set of contrasting class(es).
2. Dimension relevance analysis: If there are many dimensions, then dimension relevance analysis
should be performed on these classes to select only the highly relevant dimensions for further
analysis. Correlation or entropy-based measures can be used for this step (Chapter 2).
3. Synchronous generalization: The target class and the contrasting class(es) are generalized synchronously, so that all the classes are compared at the same levels of abstraction.
4. Presentation of the derived comparison: The resulting class comparison description can be
visualized in the form of tables, graphs, and rules. This presentation usually includes a
“contrasting” measure such as count% (percentage count) that reflects the comparison between the
target and contrasting classes. The user can adjust the comparison description by applying drill-
down, roll-up, and other OLAP operations to the target and contrasting classes, as desired.
The above discussion outlines a general algorithm for mining comparisons in databases. In comparison with characterization, the above algorithm involves synchronous generalization of the target class with the contrasting classes, so that the classes are simultaneously compared at the same levels of abstraction.
Example
Task - Compare graduate and undergraduate students using the discriminant rule.
use University_Database
mine comparison as “graduate_students vs_undergraduate_students”
in relevance to name, gender, program, birth_place, birth_date, residence, phone_no, GPA
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
2. Attribute relevance analysis - It is used to remove attributes name, gender, program, phone_no.
4. Drill down, roll up and other OLAP operations can be applied to the target and contrasting classes to adjust the levels of abstraction of the resulting description.
Unit 3
Association Rule Mining
Association rule mining is a procedure which aims to discover frequently occurring patterns, correlations, or associations from datasets found in various kinds of databases such as relational databases, transactional databases, and other forms of repositories.
An association rule has 2 parts:
• an antecedent (if) and
• a consequent (then)
An antecedent is something that’s found in data, and a consequent is an item that is found in
combination with the antecedent. Have a look at this rule for instance:
“If a customer buys bread, he is 70% likely to also buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it can
be understood as a retail store’s association rule to target their customers better. If the above rule is
a result of a thorough analysis of some data sets, it can be used to not only improve customer
service but also improve the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then
patterns. Then, depending on the following two parameters, the important relationships are
observed:
1. Support: Support indicates how frequently the if/then relationship appears in the database.
2. Confidence: Confidence indicates how often these relationships have been found to be true, i.e., the proportion of transactions containing the antecedent that also contain the consequent.
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find the
rules that govern how or why such products/items are often bought together. For example, peanut
butter and jelly are frequently purchased together because a lot of people like to make PB&J
sandwiches.
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was the first
application area of association mining. The aim is to discover associations of items occurring
together more often than you’d expect from randomly sampling all the possibilities. The classic
anecdote of Beer and Diaper will help in understanding this better.
Association rule mining finds interesting associations and relationships among large sets of data
items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on
the occurrences of other items in the transaction.
Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – Frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper})=2
Frequent Itemset – An itemset whose support is greater than or equal to minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any 2
itemsets.
Example: {Milk, Diaper}->{Beer}
Rule Evaluation Metrics –
• Support(s) –
The number of transactions that include items in the {X} and {Y} parts of the rule as a
percentage of the total number of transaction.It is a measure of how frequently the collection
of items occur together as a percentage of all transactions.
• Support = σ (X+Y) ÷ total –
It is interpreted as fraction of transactions that contain both X and Y.
• Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
• Conf(X=>Y) = Supp(X∪Y) ÷ Supp(X) –
It measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –
The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support (frequency) of {Y}.
• Lift(X=>Y) = Conf(X=>Y) ÷ Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
Example – From the above table, {Milk, Diaper}=>{Beer}
s= σ({Milk, Diaper, Beer}) / |T|
= 2/5
= 0.4
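The sketch below reproduces this calculation and adds confidence and lift for the rule {Milk, Diaper} => {Beer}. The five transactions are illustrative, chosen so that σ({Milk, Diaper, Beer}) = 2 and |T| = 5 as in the example above.

# A minimal sketch of the rule evaluation metrics for {Milk, Diaper} => {Beer}.
# The five transactions are illustrative, chosen so that
# sigma({Milk, Diaper, Beer}) = 2 and |T| = 5, matching the example above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y)                 # 2/5 = 0.4
c = support(X | Y) / support(X)    # (2/5) / (3/5) ≈ 0.67
lift = c / support(Y)              # 0.67 / 0.6 ≈ 1.11
print(round(s, 2), round(c, 2), round(lift, 2))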
Examples:
• Given: (1) database of transaction, (2) each transaction is a list of items (purchased by
a customer in visit)
• Find: all rules that correlate the presence of one set of items with that of another set
of items.
⎯ E.g., 98% of people who purchase tires and auto accessories also get automotive
services done.
⎯ E.g., Market Basket Analysis
This process analyzes customer buying habits by finding associations between the
different items that customers place in their “Shopping Baskets”. The discovery of such
associations can help retailers develop marketing strategies by gaining insight into which
items are frequently purchased together by customer.
• Applications
⎯ * ⇒ maintenance agreement (what should the store do to boost maintenance agreement sales?)
⎯ Home electronics ⇒ * (what other products should the store stock up on?)
⎯ Attached mailing in direct marketing
⎯ Detecting “ping-pong”ing of patients, faulty “collisions”
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
Let minimum support be 50% and minimum confidence 50%; then we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets:
By definition, these rules must satisfy minimum support and minimum confidence.
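The sketch below illustrates these two steps on a tiny, made-up transaction set. For brevity it enumerates candidate itemsets by brute force rather than using Apriori's level-wise candidate generation and pruning; the transactions and thresholds are illustrative.

# A minimal sketch of the two steps above: (1) find all frequent itemsets
# that meet min_sup, (2) generate strong rules that also meet min_conf.
# Transactions and thresholds are illustrative.
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup, min_conf = 0.5, 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: frequent itemsets (brute force over all candidate itemsets;
# the real Apriori algorithm prunes candidates level by level).
items = sorted(set().union(*transactions))
frequent = []
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        if support(set(cand)) >= min_sup:
            frequent.append(frozenset(cand))

# Step 2: strong rules X -> Y from each frequent itemset of size >= 2.
for itemset in (f for f in frequent if len(f) >= 2):
    for k in range(1, len(itemset)):
        for lhs in combinations(itemset, k):
            X, Y = frozenset(lhs), itemset - frozenset(lhs)
            conf = support(itemset) / support(X)
            if conf >= min_conf:
                print(set(X), "->", set(Y),
                      f"support={support(itemset):.2f} confidence={conf:.2f}")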
• Classification
• Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the expenditures
in dollars of potential customers on computer equipment given their income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to determine whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are interested in predicting a numeric value. Therefore the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
With the help of the bank loan application that we have discussed above, let us understand the working of classification. The data classification process includes two steps: a learning step, in which a classification model is built from training data, and a classification step, in which the model is used to predict class labels for new data.
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
• Data Transformation and reduction − The data can be transformed by any of the
following methods.
o Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute in order to make them fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
o Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
Here are the criteria for comparing the methods of Classification and Prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the classifier or
predictor.
• Robustness − It refers to the ability of classifier or predictor to make correct
predictions from given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor can be understood.
Data Mining - Decision Tree Induction
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer
at a company is likely to buy a computer or not. Each internal node represents a test on an
attribute. Each leaf node represents a class.
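As a small illustration, the sketch below induces a decision tree for a toy buy_computer task. It assumes the scikit-learn library is available (it is not part of these notes); the tiny dataset, the integer encoding of age and income, and the depth limit are illustrative.

# A minimal sketch of decision tree induction for a toy buy_computer task,
# assuming scikit-learn is installed. The tiny dataset is illustrative:
# age and income are encoded as integers (0 = youth/low, 1 = middle/medium,
# 2 = senior/high), and the label is 1 for "buys computer".
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 2], [0, 1], [1, 2], [2, 1], [2, 0], [1, 0], [0, 0], [2, 2]]
y = [0, 0, 1, 1, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth limit acts as pre-pruning
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[1, 1]]))   # prediction for a middle-aged, medium-income customer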
The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or
outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity of a tree is measured by two parameters: the number of leaves in the tree and the error rate of the tree.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
• Posterior Probability [P(H|X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
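A minimal worked example of the formula, with made-up probabilities: H is the hypothesis that a customer buys a computer and X is the evidence that the customer is a student.

# A minimal worked example of Bayes' theorem P(H|X) = P(X|H) * P(H) / P(X).
# The numbers are illustrative: H = "customer buys a computer",
# X = "customer is a student".
p_h = 0.3          # prior P(H)
p_x_given_h = 0.6  # likelihood P(X|H)
p_x = 0.4          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print(p_h_given_x)                       # 0.45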
Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
• A Belief Network allows class conditional independencies to be defined between
subsets of variables.
• It provides a graphical model of causal relationship on which learning can be
performed.
• We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
The following diagram shows a directed acyclic graph for six Boolean variables.
The arc in the diagram allows representation of causal knowledge. For example, lung cancer
is influenced by a person's family history of lung cancer, as well as whether or not the person
is a smoker. It is worth noting that the variable PositiveXray is independent of whether the
patient has a family history of lung cancer or whether the patient is a smoker, given that we know
the patient has lung cancer.
The conditional probability table for the values of the variable LungCancer (LC) showing
each possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker
(S) is as follows −
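The sketch below shows one way such a conditional probability table could be represented and queried. The probability values are placeholders for illustration only; they are not the figures from the textbook table, which is not reproduced here.

# A minimal sketch of a conditional probability table for LungCancer (LC),
# keyed by its parents FamilyHistory (FH) and Smoker (S). The probability
# values are placeholders, not the textbook's figures.
cpt_lc = {
    # (FH, S): P(LC = yes | FH, S)
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

def p_lung_cancer(family_history, smoker, lc=True):
    p_yes = cpt_lc[(family_history, smoker)]
    return p_yes if lc else 1.0 - p_yes

print(p_lung_cancer(True, False))             # P(LC = yes | FH = yes, S = no)
print(p_lung_cancer(False, False, lc=False))  # P(LC = no | FH = no, S = no)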
Unit -4
Web Mining
Web Mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.
Applications of Web Mining:
1. Web mining helps to improve the power of web search engines by classifying web documents and identifying web pages.
2. It is used for web searching, e.g., Google, Yahoo, etc., and vertical searching, e.g., FatLens, Become, etc.
3. Web mining is used to predict user behavior.
4. Web mining is very useful for evaluating a particular website and e-service, e.g., landing page optimization.
Web mining can be broadly divided into three different types of techniques of mining: Web
Content Mining, Web Structure Mining, and Web Usage Mining. These are explained as
following below.
1. Web Content Mining:
Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data – text, image, audio, video, etc. Content data is the collection of facts that a web page was designed to convey, and it can provide effective and interesting patterns about user needs. Mining text documents is related to text mining, machine learning and natural language processing, and this kind of mining is also known as text mining. This type of mining performs scanning and mining of the text, images and groups of web pages according to the content of the input.
2. Web Structure Mining:
Web structure mining is the application of discovering structure information from the web. The
structure of the web graph consists of web pages as nodes, and hyperlinks as edges connecting
related pages. Structure mining basically shows the structured summary of a particular website.
It identifies the relationships between web pages linked by information or direct link connection. To
determine the connection between two commercial websites, Web structure mining can be very
useful.
Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.
• It is a database system
• It offers spatial data types (SDTs) in its data model and query language.
• It supports spatial data types in its implementation, providing at least spatial
indexing and efficient algorithms for spatial join.
Example
A road map is a visualization of geographic information. A road map is a 2-dimensional
object which contains points, lines, and polygons that can represent cities, roads, and
political boundaries such as states or provinces.
In general, spatial data can be of two types −
• Vector data: This data is represented as discrete points, lines and polygons.
• Raster data: This data is represented as a matrix of square cells.
Spatial data in the form of points, lines, polygons, etc. is used by many different databases.
ABSTRACT
Temporal Data Mining is a rapidly evolving area of research that is at the intersection of
several disciplines, including statistics, temporal pattern recognition, temporal databases,
optimisation, visualisation, high-performance computing, and parallel computing. This paper
is first intended to serve as an overview of the temporal data mining in research and
applications.
INTRODUCTION
Temporal Data Mining is a rapidly evolving area of research that is at the intersection of
several disciplines, including statistics (e.g., time series analysis), temporal pattern
recognition, temporal databases, optimisation, visualisation, high-performance computing,
and parallel computing. This paper is intended to serve as an overview of the temporal data
mining in research and applications. In addition to providing a general overview, we motivate
the importance of temporal data mining problems within Knowledge Discovery in Temporal
Databases (KDTD) which include formulations of the basic categories of temporal data
mining methods, models, techniques and some other related areas. The paper is structured as
follows. Section 2 discusses the definitions and tasks of temporal data mining. Section 3
discusses the issues on temporal data mining techniques. Section 4 discusses two major
problems of temporal data mining, those of similarity and periodicity. Section 5 provides an
overview of time series temporal data mining. Section 6 moves on to a discussion of several important challenges in temporal data mining and outlines our general distribution theory for answering some of those challenges. The last section concludes the paper with a brief summary.
A database is used to model the state of some aspect of the real world in the form of relations. In general, database systems store only one state, the current state of the real world, and do not store data about previous states, except perhaps as audit trails. If the current state of the real world changes, the database gets modified and updated, and information about the past state is lost.
However, in many real-life applications, it is necessary to store and retrieve information about old states. For example, a student database must contain information about the previous performance history of a student for preparing the final result. An autonomous robotic system must store present and previous sensor data from the environment for effective action.
Example:
ID name dept name salary from to
10101 Srinivasan Comp. Sci. 61000 2007/1/1 2007/12/31
10101 Srinivasan Comp. Sci. 65000 2008/1/1 2008/12/31
12121 Wu Finance 82000 2005/1/1 2006/12/31
12121 Wu Finance 87000 2007/1/1 2007/12/31
12121 Wu Finance 90000 2008/1/1 2008/12/31
98345 Kim Elec. Eng. 80000 2005/1/1 2008/12/31
In the above example, to simplify the representation, every row has only one time interval
associated with it; thus, a row is represented once for every disjoint time interval in which it is
true. Intervals that are given here are a combination of attributes from and to; an actual
implementation would have a structured type, which is known as Interval, that contains both
fields.
There are a few important terms used in connection with time in databases:
1. Temporal database :
Databases that store information about states of the real world across time are known as
temporal databases.
2. Valid time :
Valid time denotes the time period during which a fact is true with respect to the real world.
3. Transaction time :
Transaction time is the time period during which a fact is stored in the databases.
4. Temporal relation :
Temporal relation is one where each tuple has an associated time when it is true; the time may
be either valid time or transaction time.
5. Bi-temporal relation :
Both valid time and transaction time can be stored, in which case the relation is said to be a Bi-
temporal relation.
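The sketch below represents the salary relation from the example table above as tuples carrying a valid-time interval, and answers a simple valid-time query (the salary of an employee on a given date). The helper function and its name are illustrative.

# A minimal sketch of a temporal (valid-time) relation: each tuple carries a
# [valid_from, valid_to] interval, and a query asks for the salary of an
# employee that was valid on a given date. Rows follow the example table above.
from datetime import date

salary_history = [
    # (id, name, dept, salary, valid_from, valid_to)
    (10101, "Srinivasan", "Comp. Sci.", 61000, date(2007, 1, 1), date(2007, 12, 31)),
    (10101, "Srinivasan", "Comp. Sci.", 65000, date(2008, 1, 1), date(2008, 12, 31)),
    (12121, "Wu", "Finance", 82000, date(2005, 1, 1), date(2006, 12, 31)),
    (12121, "Wu", "Finance", 87000, date(2007, 1, 1), date(2007, 12, 31)),
]

def salary_as_of(emp_id, when):
    for (i, _name, _dept, salary, start, end) in salary_history:
        if i == emp_id and start <= when <= end:
            return salary
    return None   # no fact valid at that time

print(salary_as_of(12121, date(2007, 6, 15)))   # 87000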
There are several descriptive statistical measures that can be mined from large databases, i.e., used for knowledge discovery in large databases.
Mean:
• It is the arithmetic average: the sum of all values divided by the number of values.
Median:
• It is the middle value of the data when the values are sorted.
Mode:
• It is the value that occurs most frequently in the data.
• If there is only one mode in the data, the data is unimodal.
• If there are two modes in the data, the data is bimodal.
• If there are three modes in the data, the data is trimodal.
• The empirical formula relating the measures is mean − mode = 3 × (mean − median), i.e., mode ≈ 3 × median − 2 × mean.
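The sketch below computes the three measures with Python's standard statistics module on a small, illustrative data set, and notes the empirical relationship stated above (which holds only approximately, for moderately skewed data).

# A minimal sketch of the central-tendency measures using Python's
# statistics module. The data values are illustrative.
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 7, 9]

mean = statistics.mean(data)       # arithmetic average
median = statistics.median(data)   # middle value of the sorted data
mode = statistics.mode(data)       # most frequent value (unimodal here)

print(mean, median, mode)          # 4.777..., 5, 5
# Empirical relationship (approximate, for moderately skewed data):
# mean - mode ≈ 3 * (mean - median)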
Boxplot Analysis
• A boxplot summarizes a distribution by its five-number summary: minimum, first quartile, median, third quartile and maximum.
• A related tool, the quantile-quantile (q-q) plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• It allows the user to view whether there is a shift in going from one distribution to another.
Scatter Plot
• It provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane.
UNIT V
Data Warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data analysis and reporting.
It is a blend of technologies and components which aids the strategic use of data. It is
electronic storage of a large amount of information by a business which is designed for query
and analysis instead of transaction processing. It is a process of transforming data into
information and making it available to users in a timely manner to make a difference.
A data warehouse works best when users have a shared way of describing the trends that are tracked for each specific subject. Below are the major characteristics of a data warehouse:
1. Subject-oriented –
A data warehouse is always subject oriented, as it delivers information about a theme instead of the organization's current operations. It is organized around specific themes, meaning that the data warehousing process is intended to handle a well-defined subject. These themes can be sales, distribution, marketing, etc.
A data warehouse never puts emphasis only on current operations. Instead, it focuses on the demonstration and analysis of data to support decision making. It also delivers an easy and precise view of the particular theme by eliminating data which is not required to make the decisions.
2. Integrated –
Integration is closely related to subject orientation: the data must be kept in a reliable, consistent format. Integration means establishing a shared definition so that all similar data from the different databases can be measured in the same way. The data must also reside in the data warehouse in a shared and generally accepted manner.
A data warehouse is built by integrating data from various sources, such as a mainframe and a relational database. In addition, it must have reliable naming conventions, formats and codes. Integration of the data warehouse benefits effective analysis of data. Consistency in naming conventions, attribute measures, encoding structure, etc. should be confirmed. Integration of the data warehouse handles the various subject-related warehouses.
3. Time-Variant –
Here data is maintained at different intervals of time, such as weekly, monthly, or annually. Unlike online transaction processing (OLTP) systems, which hold only current data, the time horizon of the data warehouse is much wider than that of operational systems. The data residing in the data warehouse is associated with a specific interval of time and delivers information from a historical perspective; it comprises an element of time, explicitly or implicitly. Another feature of time-variance is that once data is stored in the data warehouse, it cannot be modified, altered, or updated.
4. Non-Volatile –
As the name implies, the data residing in the data warehouse is permanent: it is not erased or deleted when new data is inserted. The warehouse therefore accumulates a mammoth quantity of historical data, which provides a stable basis for analysis.
A Data Warehouse works as a central repository where information arrives from one or more
data sources. Data flows into a data warehouse from the transactional system and other relational databases, and it may be of three kinds:
1. Structured
2. Semi-structured
3. Unstructured data
The data is processed, transformed, and ingested so that users can access the processed data
in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A
data warehouse merges information coming from different sources into one comprehensive
database.
By merging all of this information in one place, an organization can analyze its customers
more holistically. This helps to ensure that it has considered all the information available.
Data warehousing makes data mining possible. Data mining is looking for patterns in the data
that may lead to higher sales and profits.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from the sources.
Earlier, organizations started with relatively simple use of data warehousing. However, over time, more sophisticated use of data warehousing began.
The following are general stages of use of the data warehouse (DWH):
In this stage, data is just copied from an operational system to another server. In this way,
loading, processing, and reporting of the copied data do not impact the operational system's
performance.
Data in the Datawarehouse is regularly updated from the Operational Database. The data in
Datawarehouse is mapped and transformed to meet the Datawarehouse objectives.
In this stage, Data warehouses are updated whenever any transaction takes place in
operational database. For example, Airline or railway booking system.
In this stage, Data Warehouses are updated continuously when the operational system
performs a transaction. The Datawarehouse then generates transactions which are passed
back to the operational system.
Load manager: The load manager is also called the front component. It performs all the operations associated with the extraction and loading of data into the warehouse. These operations include transformations to prepare the data for entering into the data warehouse.
Warehouse Manager: The warehouse manager performs operations associated with the management of the data in the warehouse. It performs operations like analysis of data to ensure consistency, creation of indexes and views, generation of denormalization and aggregations, transformation and merging of source data, and archiving and backing up data.
Query Manager: The query manager is also known as the backend component. It performs all the operations related to the management of user queries. This component directs queries to the appropriate tables and schedules the execution of queries.
• Data warehouse allows business users to quickly access critical data from many sources all in one place.
• Data warehouse provides consistent information on various cross-functional activities. It also supports ad-hoc reporting and queries.
• Data Warehouse helps to integrate many sources of data to reduce stress on the
production system.
• Data warehouse helps to reduce total turnaround time for analysis and reporting.
• Restructuring and Integration make it easier for the user to use for reporting and
analysis.
• Data warehouse allows users to access critical data from a number of sources in a single place. Therefore, it saves the user's time in retrieving data from multiple sources.
• Data warehouse stores a large amount of historical data. This helps users to analyze
different time periods and trends to make future predictions.
Data Warehouse applications are designed to support the user ad-hoc data requirements, an
activity recently dubbed online analytical processing (OLAP). These include applications
such as forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added to reflect the needs of users for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
An operational system is a method used in data warehousing to refer to a system that is used
to process the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Meta Data summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.
The principal purpose of a data warehouse is to provide information to the business managers
for strategic decision-making. These customers interact with the warehouse using end-client
access tools.
We must clean and process the operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
Data Warehouse Staging Area is a temporary location where a record from source systems is
copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyze historical data for purchases and sales or mine
historical information to make predictions about customer behavior.
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to
detailed data. In terms of data warehouse, we can define metadata as follows.
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to locate
the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given
data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and for recording the source of the extracted data.
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse
is different from the warehouse data, yet it plays an important role. The various roles of
metadata are explained below.
• Metadata acts as a directory.
• This directory helps the decision support system to locate the contents of the data
warehouse.
• Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
• Metadata helps in summarization between current detailed data and highly summarized
data.
• Metadata also helps in summarization between lightly detailed data and highly
summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.
The following diagram shows the roles of metadata.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following
metadata −
• Definition of data warehouse − It includes the description of structure of data
warehouse. The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
• Business metadata − It contains the data ownership information, business
definitions, and changing policies.
• Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
• Data for mapping from the operational environment to the data warehouse − It includes
the source databases and their contents, data extraction, data partitioning and cleaning,
transformation rules, and data refresh and purging rules.
• Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.
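To make the repository contents above concrete, the following sketch models one hypothetical repository entry in Python; the fields and values are illustrative, not a standard schema.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class RepositoryEntry:
    """One illustrative metadata-repository record for a warehouse object."""
    object_name: str                # e.g. a table or view in the warehouse
    schema_definition: str          # structure: columns, hierarchies, etc.
    business_owner: str             # business metadata: ownership
    business_definition: str        # business metadata: meaning of the object
    currency: str = "active"        # operational metadata: active/archived/purged
    lineage: list = field(default_factory=list)  # operational metadata: history

entry = RepositoryEntry(
    object_name="dim_customer",
    schema_definition="customer_key INT, name TEXT, region TEXT",
    business_owner="CRM team",
    business_definition="One row per customer known to the business",
    lineage=[f"loaded from crm_customers on {date.today()}"],
)
print(entry)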
The importance of metadata cannot be overstated. Metadata helps drive the accuracy of
reports, validates data transformations, and ensures the accuracy of calculations. Metadata also
enforces consistent definitions of business terms for business end-users. With all these uses,
metadata also has its challenges. Some of the challenges are discussed below.
• Metadata in a big organization is scattered across the organization. This metadata is
spread in spreadsheets, databases, and applications.
• Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined.
• There are no industry-wide accepted standards. Data management solution vendors
have a narrow focus.
• There are no easy and accepted methods of passing metadata.
What is a Data Catalog and Why Do You Need One?
Simply put, a data catalog is an organized inventory of data assets in the organization. It uses
metadata to help organizations manage their data. It also helps data professionals collect,
organize, access, and enrich metadata to support data discovery and governance.
Data Catalog Definition and Analogy
We gave a short definition of a data catalog above, as something that uses metadata to help
organizations manage their data. But let’s expand upon that with the analogy of a library.
When you go to a library and you need to find a book, you use their catalog to discover
whether the book is there, which edition it is, where it’s located, a description—everything
you need so that you can decide whether you want it, and if you do, how to go and find it.
That’s what many object stores, databases, and data warehouses offer today.
But now, think back to the analogy of that library and the catalog. And now expand the
power of that catalog to cover every library in the country. Imagine that you have just one
interface and suddenly, you can find every single library in the country that has the copy of
the book you’re seeking, and you can find all the details you’d ever want on each one of
those books.
That’s what an enterprise data catalog does for all of your data. It gives you a single,
overarching view and deeper visibility into all of your data, not just each data store at a time.
Perhaps you might wonder—why would you need a view like that?
Challenges a Data Catalog Can Address
With more data than ever before, being able to find the right data has become harder than it
ever has been. At the same time, there are also more rules and regulations than ever before—
with GDPR being just one of them. So not only is data access becoming a challenge, but data
governance has become a challenge as well. It’s critical to understand the kind of data that
you have now, who is moving it, what it’s being used for, and how it needs to be protected.
But you also have to avoid putting too many layers and wrappers around your data—because
data is useless if it's too difficult to use. Unfortunately, there are many challenges with
finding and accessing the right data. Data engineers, for example, want to know how any
changes they make will affect the system as a whole.
Using a data catalog the right way means better data usage, all of which contributes to:
• Cost savings
• Operational efficiency
• Competitive advantages
• Better customer experience
• Reduced fraud and risk
• And so much more
Here are just a few of the use cases for a data catalog. But really, a data catalog can be used
in so many ways because fundamentally, it’s about having wider visibility and deeper access
to your data.
Self-service analytics. Many data users have trouble finding the right data. And not just
finding the right data, but understanding whether it's useful. You might discover a file called
customer_info.csv. And you might need a file about customers. But that doesn't mean it's the
right one, because it could be one of 50 similar files. The file may have many fields and
you may not understand what all of those data elements are. You’ll want an easier way to see
the business context around it, such as whether it’s a managed resource, from the right data
store, or what the relationship is with other data artifacts.
Discovery could also entail understanding the shape and characteristics of data, from
something as simple as value distribution, statistical information, or something as important
and complex as Personally Identifiable Information (PII) or Personal Health Information
(PHI).
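As a toy illustration of such profiling (not a real PII detector), the sketch below computes a few shape statistics and a naive email-pattern flag per column; the data and the detection rule are invented.

import pandas as pd

EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"   # naive stand-in for PII detection

def profile_column(series: pd.Series) -> dict:
    """Very simplified profile: basic shape plus a naive PII flag."""
    sample = series.dropna().astype(str).head(100)
    return {
        "non_null": int(series.notna().sum()),
        "distinct": int(series.nunique()),
        "looks_like_pii": bool(sample.str.contains(EMAIL_PATTERN, regex=True).any()),
    }

df = pd.DataFrame({"email": ["a@x.com", "b@y.org"], "amount": [10, 20]})
print({col: profile_column(df[col]) for col in df.columns})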
• Technical metadata: Schemas, tables, columns, file names, report names – anything that is
documented in the source system
• Business metadata: This is typically the business knowledge that users have about the
assets in the organization. This might include business descriptions, comments,
annotations, classifications, fitness-for-use, ratings, and more.
• Operational metadata: When was this object refreshed? Which ETL job created it? How
many times has a table been accessed, and by which users? (A small sketch combining these
types follows.)
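The sketch below shows how these three kinds of metadata might sit together in one hypothetical catalog entry, with a naive search over asset names and business descriptions; the field names and values are illustrative.

# Illustrative catalog entries combining technical, business, and operational metadata
catalog = [
    {
        "asset": "customer_info.csv",
        "technical": {"columns": ["customer_id", "email", "region"]},
        "business": {"description": "Managed customer master extract", "rating": 4.5},
        "operational": {"last_refreshed": "2024-01-15", "etl_job": "crm_daily"},
    },
]

def search(term: str) -> list:
    """Naive search across asset names and business descriptions."""
    term = term.lower()
    return [e["asset"] for e in catalog
            if term in e["asset"].lower()
            or term in e["business"]["description"].lower()]

print(search("customer"))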
In the past few years, we’ve seen a mini-revolution in how we can use this valuable
metadata. Once, metadata was used mostly for audit, lineage, and reporting. But
today, technological innovations like serverless processing, graph databases, and especially
new or more accessible AI and machine learning techniques are pushing the boundaries and
making things possible with metadata that simply weren’t possible at this scale before.
Today, metadata can be used to augment data management: everything from self-service data
preparation to role- and content-based access control, automated data onboarding, monitoring
and alerting on anomalies, and auto-provisioning and auto-scaling of resources can now be
augmented with the help of metadata.
And the data catalog uses metadata to help you achieve more than ever with your data
management.
Data Warehousing - Security
The objective of a data warehouse is to make large amounts of data easily accessible to the
users, hence allowing the users to extract information about the business as a whole. But we
know that there could be some security restrictions applied on the data that can be an obstacle
for accessing the information. If the analyst has a restricted view of data, then it is impossible
to capture a complete picture of the trends within the business.
The data from each analyst can be summarized and passed on to management, where the
different summaries can be aggregated. As the aggregation of summaries is not the same as an
aggregation over the data as a whole, it is possible to miss some information trends in the data
unless someone is analyzing the data as a whole.
Security Requirements
Adding security features affects the performance of the data warehouse, therefore it is
important to determine the security requirements as early as possible. It is difficult to add
security features after the data warehouse has gone live.
During the design phase of the data warehouse, we should keep in mind what data sources
may be added later and what would be the impact of adding those data sources. We should
consider the following possibilities during the design phase.
• Whether the new data sources will require new security and/or audit restrictions to be
implemented?
• Whether new users will be added who have restricted access to data that is already
generally available?
This situation arises when the future users and the data sources are not well known. In such a
situation, we need to use the knowledge of business and the objective of data warehouse to
know likely requirements.
The following activities get affected by security measures −
• User access
• Data load
• Data movement
• Query generation
User Access
We need to first classify the data and then classify the users on the basis of the data they can
access.
Data Classification
The following two approaches can be used to classify the data −
• Data can be classified according to its sensitivity. Highly sensitive data is classified as
highly restricted and less sensitive data is classified as less restricted.
• Data can also be classified according to the job function. This restriction allows only
specific users to view particular data. Here we restrict the users to view only that part
of the data in which they are interested and are responsible for.
There are some issues in the second approach. To understand, let's have an example. Suppose
you are building the data warehouse for a bank. Consider that the data being stored in the data
warehouse is the transaction data for all the accounts. The question here is, who is allowed to
see the transaction data. The solution lies in classifying the data according to the function.
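As a minimal sketch of both approaches, the example below records hypothetical sensitivity and job-function labels per column and filters what a user may see; the labels and column names are invented.

# Hypothetical sensitivity and job-function labels per warehouse column
CLASSIFICATION = {
    "account_balance":  {"sensitivity": "high", "functions": {"finance"}},
    "transaction_date": {"sensitivity": "low",  "functions": {"finance", "marketing"}},
    "customer_region":  {"sensitivity": "low",  "functions": {"finance", "marketing", "sales"}},
}

def allowed_columns(user_function: str, clearance: str) -> list:
    """Return the columns a user may see, by job function and clearance level."""
    order = {"low": 0, "high": 1}
    return [col for col, rule in CLASSIFICATION.items()
            if user_function in rule["functions"]
            and order[clearance] >= order[rule["sensitivity"]]]

print(allowed_columns("marketing", "low"))   # dates and regions, not balances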
User classification
The following approaches can be used to classify the users −
• Users can be classified as per the hierarchy of users in an organization, i.e., users can
be classified by departments, sections, groups, and so on.
• Users can also be classified according to their role, with people grouped across
departments based on their role.
Classification on the Basis of Department
Let's have an example of a data warehouse where the users are from the sales and marketing
departments. We can have security based on a top-down company view, with access centered
on the different departments. But there could be some restrictions on users at different levels.
This structure is shown in the following diagram.
But if each department accesses different data, then we should design the security access for
each department separately. This can be achieved by departmental data marts. Since these data
marts are separated from the data warehouse, we can enforce separate security restrictions on
each data mart. This approach is shown in the following figure.
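A tiny sketch of this idea: classified users are routed only to their own departmental data mart. The user directory and mart names below are hypothetical.

# Hypothetical routing of classified users to their departmental data mart
USER_DIRECTORY = {
    "alice": {"department": "sales",     "role": "analyst"},
    "bob":   {"department": "marketing", "role": "manager"},
}
DATA_MARTS = {"sales": "sales_mart.db", "marketing": "marketing_mart.db"}

def mart_for(user: str) -> str:
    """Each department only ever connects to its own data mart."""
    department = USER_DIRECTORY[user]["department"]
    return DATA_MARTS[department]

print(mart_for("alice"))   # -> sales_mart.db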
Audit Requirements
The following categories of activity need to be audited −
• Connections
• Disconnections
• Data access
• Data change
Note − For each of the above-mentioned categories, it is necessary to audit success, failure,
or both. From a security point of view, auditing failures is especially important, because
failures can highlight unauthorized or fraudulent access attempts.
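A minimal sketch of such auditing using Python's standard logging module; the event categories follow the list above, while the logger name and message format are illustrative choices.

import logging

audit = logging.getLogger("warehouse.audit")
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(message)s")

def audit_event(category: str, user: str, success: bool, detail: str = "") -> None:
    """Record one auditable event; failures are the most interesting ones."""
    outcome = "SUCCESS" if success else "FAILURE"
    audit.info("%s %s user=%s %s", category.upper(), outcome, user, detail)

audit_event("connection", "alice", True)
audit_event("data_access", "mallory", False, detail="table=fact_orders denied")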
Network Requirements
Network security is as important as the other aspects of security; we cannot ignore the network
security requirement. We need to consider the following issues −
• Is it necessary to encrypt data before transferring it to the data warehouse?
• Are there restrictions on which network routes the data can take?
These restrictions need to be considered carefully. Following are the points to remember −
• The process of encryption and decryption will increase overheads. It will require
more processing power and processing time.
• The cost of encryption can be high if the source system is already heavily loaded, because
the cost of encryption is borne by the source system, as the sketch below illustrates.
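If encryption is required, one possible approach in Python (illustrative, not prescribed by the text) uses the third-party cryptography package; the file names below are invented.

from cryptography.fernet import Fernet   # third-party: pip install cryptography

# Minimal sketch: encrypt an extract on the source system before it crosses
# the network; this extra work is the overhead borne by the source system.
with open("extract.csv", "w") as f:       # illustrative extract file
    f.write("account,balance\n42,99.50\n")

key = Fernet.generate_key()               # in practice, managed by a key store
cipher = Fernet(key)

with open("extract.csv", "rb") as src:
    token = cipher.encrypt(src.read())
with open("extract.csv.enc", "wb") as out:
    out.write(token)

# After transfer, the warehouse side decrypts with the same key
assert cipher.decrypt(token) == open("extract.csv", "rb").read()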
Data Movement
There exist potential security implications while moving the data. Suppose we need to transfer
some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse,
the following questions are raised −
• Where is the flat file stored?
• Who has access to that disk space?
Similar questions arise for the backups of these flat files, such as where they are stored and who has access to them.
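A small sketch of tightening access to such a flat file on disk; the use of a temporary file and owner-only permissions below is purely illustrative.

import os
import stat
import tempfile

# Minimal sketch: restrict who can read a staged flat file before the load
fd, flat_file = tempfile.mkstemp(suffix=".csv")   # stand-in for the landing path
os.close(fd)

# If group or other users could read the file, tighten it to owner-only access
mode = os.stat(flat_file).st_mode
if mode & (stat.S_IRGRP | stat.S_IROTH):
    os.chmod(flat_file, stat.S_IRUSR | stat.S_IWUSR)

# Remove the flat file as soon as the warehouse load has consumed it
os.remove(flat_file)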
Documentation
The audit and security requirements need to be properly documented. This will be treated as
a part of the justification. This document can contain all the information gathered from −
• Data classification
• User classification
• Network requirements
• Data movement and storage requirements
• All auditable actions
Security affects the application code and the development timescales. Security affects the
following areas −
• Application development
• Database design
• Testing
Application Development
Security affects the overall application development and it also affects the design of
important components of the data warehouse such as the load manager, warehouse manager, and
query manager. The load manager may require checking code to filter records and place them
in different locations. More transformation rules may also be required to hide certain data.
There may also be requirements for extra metadata to handle any extra objects.
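As a hypothetical illustration of such load-manager logic, the sketch below masks one sensitive field and routes records to different locations based on an invented rule.

# Hypothetical load-manager step: route records and mask a sensitive field
def load_record(record: dict) -> tuple:
    """Return (target_location, transformed_record) for one incoming record."""
    transformed = dict(record)
    # Transformation rule that hides certain data before it lands anywhere
    transformed["card_number"] = "****" + transformed["card_number"][-4:]
    # Checking code that places records in different locations
    target = "restricted_area" if record["amount"] > 10_000 else "general_area"
    return target, transformed

print(load_record({"card_number": "4111111111111111", "amount": 25_000}))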
To create and maintain extra views, the warehouse manager may require extra code to enforce
security. Extra checks may have to be coded into the data warehouse to prevent it from being
fooled into moving data into a location where it should not be available. The query manager
requires changes to handle any access restrictions. The query manager will need to be
aware of all extra views and aggregations.
Database design
The database layout is also affected because when security measures are implemented, there
is an increase in the number of views and tables. Adding security increases the size of the
database and hence increases the complexity of the database design and management. It will
also add complexity to the backup management and recovery plan.
Testing
Testing the data warehouse is a complex and lengthy process. Adding security to the data
warehouse also increases the testing time and complexity. It affects the testing in the following
two ways −
• It will increase the time required for integration and system testing.
• There is added functionality to be tested which will increase the size of the testing suite.
Staging area
• A data staging area (DSA) is a temporary storage area between the data sources and a data
warehouse.
• The staging area is mainly used to quickly extract data from its data sources, minimizing the
impact of the sources.
• After data has been loaded into the staging area, the staging area is used to combine data from
multiple data sources and to perform transformations, validations, and data cleansing. Data is
often transformed into a star schema prior to loading the data warehouse.
• The data warehouse staging area is a temporary location where data from the source
systems is copied.
• A staging area is mainly required in a Data Warehousing Architecture for timing
reasons. In short, all required data must be available before data can be integrated
into the Data Warehouse.
• Due to varying business cycles, data processing cycles, hardware and network
resource limitations and geographical factors, it is not feasible to extract all the data
from all Operational databases at exactly the same time.