Data Mining Notes
There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information, so it is necessary to analyze this huge amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; knowledge discovery also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, the resulting information can be used in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.
• Data Characterization:
This refers to the summary of the general characteristics or features of the class under study. For example, to study the characteristics of software products whose sales increased by 15% two years ago, one can collect data related to such products by running SQL queries.
• Data Discrimination:
It compares the features of the class under study (the target class) with those of one or more contrasting classes. The output of this process can be represented in many forms, e.g., bar charts, curves and pie charts.
2. Mining Frequent Patterns, Associations, and Correlations:
Frequent patterns are nothing but things that are found to be most common in the data.
There are different kinds of frequency that can be observed in the dataset.
• Frequent item set:
This refers to a set of items that are frequently seen together, e.g., milk and sugar.
• Frequent Subsequence:
This refers to a sequence of events that frequently occurs, such as purchasing a phone followed by a back cover.
• Frequent Substructure:
It refers to the different kinds of data structures such as trees and graphs that may
be combined with the itemset or subsequence.
Association Analysis:
The process involves uncovering the relationship between data items and deriving association rules. It is a way of discovering relationships between various items; for example, it can be used to determine the sales of items that are frequently purchased together.
Correlation Analysis:
Correlation is a mathematical technique that can show whether and how strongly pairs of attributes are related to each other. For example, taller people tend to weigh more.
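As a small illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient for a made-up sample of heights and weights; the values and function name are purely illustrative and are not data from these notes.

# A minimal sketch: Pearson correlation between height (cm) and weight (kg).
# The sample values below are illustrative, not real data.
import math

heights = [150, 160, 165, 170, 180, 185]
weights = [52, 58, 63, 68, 77, 82]

def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

print(round(pearson(heights, weights), 3))   # close to 1.0 => strong positive correlation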
Classification of data mining
The information or knowledge extracted in this way can be used in any of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of the day or week, etc. It also analyzes patterns that deviate from expected norms.
Performance Issues
There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update the mining results when the database changes, without mining the data again from scratch.
1. Binning Method:
This method works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or the boundary values of the segment can be used, to complete the smoothing; a small sketch is given below.
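The following is a minimal sketch of the binning method just described: the data is sorted, split into equal-size bins, and each bin is smoothed either by its mean or by its boundaries. The data values and bin size are illustrative.

# A minimal sketch of the binning method: sort the data, split it into
# equal-size bins, then smooth each bin by its mean or by its boundaries.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative values
bin_size = 3

sorted_data = sorted(data)
bins = [sorted_data[i:i + bin_size] for i in range(0, len(sorted_data), bin_size)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]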
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having several independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch appears after this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
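The sketch below illustrates the min-max normalization mentioned in point 1 above, rescaling an attribute into the range 0.0 to 1.0. The salary values are illustrative.

# A minimal sketch of min-max normalization: rescale each value of an
# attribute into the range [0.0, 1.0]. The salary values are illustrative.
salaries = [30000, 45000, 60000, 90000]

lo, hi = min(salaries), max(salaries)
normalized = [(v - lo) / (hi - lo) for v in salaries]
print(normalized)   # [0.0, 0.25, 0.5, 1.0]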
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such large volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: attributes whose p-value is greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
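As a small illustration of dimensionality reduction, the sketch below applies PCA to a tiny made-up matrix. It assumes the NumPy and scikit-learn libraries are available (they are not part of these notes); the data values and the choice of two components are illustrative.

# A minimal sketch of dimensionality reduction with PCA, assuming the
# scikit-learn and NumPy libraries are installed. The 4-dimensional data
# below is illustrative.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5, 1.1],
              [0.5, 0.7, 2.2, 2.9],
              [2.2, 2.9, 0.8, 1.0],
              [1.9, 2.2, 1.1, 1.5],
              [3.1, 3.0, 0.2, 0.7]])

pca = PCA(n_components=2)          # keep the 2 most informative components
X_reduced = pca.fit_transform(X)   # 5 x 4 matrix compressed to 5 x 2
print(X_reduced.shape)             # (5, 2)
print(pca.explained_variance_ratio_)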
Unit II
Data Mining Task Primitives:
Each user will have a data mining task in mind, that is, some form of data
analysis that he or she would like to have performed. A data mining task can be specified in the form of a
data mining query, which is input to the data mining system. A data mining query is defined in terms of
data mining task primitives. These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process, or examine the findings from different
angles or depths. The data mining primitives specify the following, as illustrated in Figure 1.13.
The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in
which the user is interested. This includes the database attributes or data warehouse dimensions of interest
(referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain to be
mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept
hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels
of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is shown in Figure
1.14. User beliefs regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining
process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have
different interestingness measures. For example, interestingness measures for association rules include
support and confidence. Rules whose support and confidence values are below user-specified thresholds
are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which
discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and
cubes
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner
data mining system. The Data Mining Query Language is actually based on the Structured Query
Language (SQL). Data Mining Query Languages can be designed to support ad hoc and interactive
data mining. This DMQL provides commands for specifying primitives. The DMQL can work with
databases and data warehouses as well. DMQL can be used to define data mining tasks.
Particularly we examine how to define data warehouses and data marts in DMQL.
Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, to mine a description describing customer purchasing habits −
mine characteristics as customerPurchasing
analyze count%
Discrimination
The syntax for Discrimination is −
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
For example, a user may define big spenders as customers who purchase items that cost $100 or
more on an average; and budget spenders as customers who purchase items at less than $100 on
an average. The mining of discriminant descriptions for customers from each of these categories
can be specified in the DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
Association
The syntax for Association is−
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For Example −
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is key of customer relation; P and Q are predicate variables; and W, Y, and Z are object
variables.
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating, the task can be specified as −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
The syntax for prediction is −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
-operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
Predictive mining: Based on data and analysis, constructs models for the database, and predicts the
trend and properties of unknown data
• Concept description: Characterization: provides a concise and succinct summarization of the
given collection of data
Concept description:
– Can handle complex data types of attributes and their aggregations.
OLAP:
– User-controlled process.
Data generalization – A process which abstracts a large set of task-relevant data in a database from
a low conceptual level to higher ones.
• Strength:
– Generalization and specialization can be performed on a data cube by roll-up and drill-down.
• Limitations:
– It handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values.
– It lacks intelligent analysis: it cannot tell which dimensions should be used or to what levels the generalization should reach.
Attribute-Oriented Induction
• How is it done?
– Collect the task-relevant data (the initial relation) using a relational database query.
– Perform generalization by attribute removal or by attribute generalization (climbing a concept hierarchy).
– Apply aggregation by merging identical, generalized tuples and accumulating their respective counts (a small sketch follows this list).
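The sketch below is a minimal illustration of these steps: generalize a low-level attribute (age) to a higher-level concept (an age range), then merge identical generalized tuples while accumulating their counts. The relation, attribute names and concept hierarchy are illustrative.

# A minimal sketch of attribute-oriented induction: generalize the age
# attribute to an age range, then merge identical generalized tuples and
# accumulate their counts. The initial relation below is illustrative.
from collections import Counter

initial_relation = [          # (major, age)
    ("CS", 21), ("CS", 23), ("CS", 34),
    ("Physics", 22), ("Physics", 36),
]

def generalize_age(age):      # a tiny concept hierarchy for age
    return "20-29" if age < 30 else "30-39"

generalized = Counter((major, generalize_age(age)) for major, age in initial_relation)
for tup, count in generalized.items():
    print(tup, "count =", count)
# ('CS', '20-29') count = 2, ('CS', '30-39') count = 1, ...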
“What if I am not sure which attributes to include for class characterization and class comparison? I may end up specifying too many attributes, which could slow down the system considerably.” Measures of attribute relevance analysis can be used to help identify irrelevant or weakly relevant attributes that can be excluded from the concept description process. The incorporation of this preprocessing step into class characterization or comparison is referred to as analytical characterization or analytical comparison, respectively. This section describes a general method of attribute relevance analysis and its integration with attribute-oriented induction.
The first limitation of class characterization for multidimensional data analysis in data warehouses and OLAP tools is the handling of complex objects. The second limitation is the lack of an automated generalization process: the user must explicitly tell the system which dimensions should be included in the class characterization and to how high a level each dimension should be generalized. Actually, the user must specify each step of generalization or specialization on any dimension.
Usually, it is not difficult for a user to instruct a data mining system regarding how high a level each dimension should be generalized. For example, users can set attribute generalization thresholds for this, or specify which level a given dimension should reach, such as with the command “generalize dimension location to the country level”. Even without explicit user instruction, a default value such as 2 to 8 can be set by the data mining system, which would allow each dimension to be generalized to a level that contains only 2 to 8 distinct values. If the user is not satisfied with the current level of generalization, she can specify dimensions on which drill-down or roll-up operations should be applied.
It is nontrivial, however, for users to determine which dimensions should be included in the analysis of class characteristics. Data relations often contain 50 to 100 attributes, and a user may have little knowledge regarding which attributes or dimensions should be selected for effective data mining. A user may include too few attributes in the analysis, causing the resulting mined descriptions to be incomplete. On the other hand, a user may introduce too many attributes for analysis (e.g., by indicating “in relevance to *”, which includes all the attributes in the specified relations).
Methods should be introduced to perform attribute (or dimension) relevance analysis in order to filter out statistically irrelevant or weakly relevant attributes, and retain or even rank the most relevant attributes for the descriptive mining task at hand. Class characterization that includes the analysis of attribute/dimension relevance is called analytical characterization. Class comparison that includes such analysis is called analytical comparison.
Intuitively, an attribute or dimension is considered highly relevant with respect to a given class if it is likely that the values of the attribute or dimension can be used to distinguish the class from others. For example, it is unlikely that the color of an automobile can be used to distinguish expensive from cheap cars, but the model, make, style, and number of cylinders are likely to be more relevant attributes. Moreover, even within the same dimension, different levels of concepts may have dramatically different powers for distinguishing a class from others.
For example, in the birth_date dimension, birth_day and birth_month are unlikely to be relevant to the salary of employees. However, the birth_decade (i.e., age interval) may be highly relevant to the salary of employees. This implies that the analysis of dimension relevance should be performed at multiple levels of abstraction, and only the most relevant levels of a dimension should be included in the analysis. Above we said that attribute/dimension relevance is evaluated based on the ability of the attribute/dimension to distinguish objects of a class from others. When mining a class comparison (or discrimination), the target class and the contrasting classes are explicitly given in the mining query. The relevance analysis should be performed by comparison of these classes, as we shall see below. However, when mining class characteristics, there is only one class to be characterized; that is, no contrasting class is specified. It is therefore not obvious what the contrasting class should be. In this case, the contrasting class is typically taken to be the set of comparable data in the database that excludes the set of data to be characterized. For example, to characterize graduate students, the contrasting class can be composed of the set of undergraduate students.
There have been many studies in machine learning, statistics, fuzzy and rough set theories, and so on, on attribute relevance analysis. The general idea behind attribute relevance analysis is to compute some measure that quantifies the relevance of an attribute with respect to a given class or concept. Such measures include information gain, the Gini index, uncertainty, and correlation coefficients. Here we introduce a method that integrates an information gain analysis technique with a dimension-based data analysis method. The resulting method removes the less informative attributes, collecting the more informative ones for use in concept description analysis.
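The sketch below illustrates the information gain measure mentioned above: the entropy of the class distribution minus the expected entropy after partitioning the data on one attribute. The tiny dataset and attribute names are illustrative.

# A minimal sketch of information gain: expected reduction in entropy of the
# class label when the data is partitioned by one attribute. Data is illustrative.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, label_index=-1):
    labels = [r[label_index] for r in rows]
    parts = {}
    for r in rows:
        parts.setdefault(r[attr_index], []).append(r[label_index])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# (credit_rating, buys_computer)
rows = [("fair", "yes"), ("fair", "yes"), ("excellent", "no"),
        ("excellent", "yes"), ("fair", "no"), ("excellent", "no")]
print(round(info_gain(rows, 0), 3))   # gain of splitting on credit_rating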
Data Collection:
Collect data for both the target class and the contrasting class by query processing. For class comparison, the
user in the data-mining query provides both the target class and the contrasting class. For class
characterization, the target class is the class to be characterized, whereas the contrasting class is the set of
comparable data that are not in the target class.
This step identifies a set of dimensions and attributes on which the selected relevance measure is to be applied. Since different levels of a dimension may have dramatically different relevance with respect to a given class, each attribute defining the conceptual levels of the dimension should, in principle, be included in the relevance analysis. Attribute-oriented induction (AOI) can be used to perform some preliminary relevance analysis on the data by removing or generalizing attributes having a very large number of distinct values (such as name and phone#). Such attributes are unlikely to be found useful for concept description. To be conservative, the AOI performed here should employ attribute generalization thresholds that are set reasonably large so as to allow more (but not all) attributes to be considered in further relevance analysis by the selected measure (Step 3 below). The relation obtained by such an application of AOI is called the candidate relation of the mining task.
Remove irrelevant and weakly relevant attributes using the selected relevance analysis measure:
Evaluate each attribute in the candidate relation using the selected relevance analysis measure. The relevance measure used in this step may be built into the data mining system or provided by the user. For example, the information gain measure described above may be used. The attributes are then sorted (i.e., ranked) according to their computed relevance to the data mining task. Attributes that are not relevant or are weakly relevant to the task are then removed. A threshold may be set to define “weakly relevant.” This step results in an initial target class working relation and an initial contrasting class working relation.
Generate the concept description using AOI:
Perform AOI using a less conservative set of attribute generalization thresholds. If the descriptive mining task is class characterization, only the initial target class working relation is included here. If the descriptive mining task is class comparison, both the initial target class working relation and the initial contrasting class working relation are included. The complexity of this procedure is that the induction process is performed twice, that is, in the preliminary relevance analysis (Step 2) and on the initial working relation (Step 4). The statistics used in attribute relevance analysis with the selected measure (Step 3) may be collected during the scanning of the database in Step 2.
Introduction: In many applications, users may not be interested in having a single class (or
concept) described or characterized, but rather would prefer to mine a description that compares or
distinguishes one class (or concept) from other comparable classes (or concepts).Class
discrimination or comparison (hereafter referred to as class comparison) mines descriptions that
distinguish a target class from its contrasting classes. Notice that the target and contrasting classes
must be comparable in the sense that they share similar dimensions and attributes. For example, the
three classes, person, address, and item, are not comparable.
However, the sales in the last three years are comparable classes, and so are computer science
students versus physics students. Our discussions on class characterization in the previous sections
handle multilevel data summarization and characterization in a single class. The techniques
developed can be extended to handle class comparison across several comparable classes. For
example, the attribute generalization process described for class characterization can be modified
so that the generalization is performed synchronously among all the classes compared. This allows
the attributes in all of the classes to be generalized to the same levels of abstraction. Suppose, for
instance, that we are given the All Electronics data for sales in 2003 and sales in 2004 and would
like to compare these two classes. Consider the dimension location with abstractions at the city,
province or state, and country levels. Each class of data should be generalized to the same location
level. That is, they are synchronously all generalized to either the city level, or the province or state
level, or the country level. Ideally, this is more useful than comparing, say, the sales in Vancouver
in 2003 with the sales in the United States in 2004 (i.e., where each set of sales data is generalized
to a different level). The users, however, should have the option to override such an automated, synchronous comparison.
1. Data collection: The set of relevant data in the database is collected by query processing and is
partitioned respectively into a target class and one or a set of contrasting class(es).
2. Dimension relevance analysis: If there are many dimensions, then dimension relevance analysis
should be performed on these classes to select only the highly relevant dimensions for further
analysis. Correlation or entropy-based measures can be used for this step (Chapter 2).
3. Synchronous generalization: The target class and the contrasting class(es) are generalized synchronously, so that all the classes are compared at the same levels of abstraction.
4. Presentation of the derived comparison: The resulting class comparison description can be
visualized in the form of tables, graphs, and rules. This presentation usually includes a
“contrasting” measure such as count% (percentage count) that reflects the comparison between the
target and contrasting classes. The user can adjust the comparison description by applying drill-
down, roll-up, and other OLAP operations to the target and contrasting classes, as desired.
The above discussion outlines a general algorithm for mining comparisons in databases. In comparison with characterization, the above algorithm involves synchronous generalization of the target class with the contrasting classes, so that the classes are simultaneously compared at the same levels of abstraction.
Example
Task - Compare graduate and undergraduate students using the discriminant rule.
use University_Database
mine comparison as “graduate_students vs_undergraduate_students”
in relevance to name, gender, program, birth_place, birth_date, residence, phone_no, GPA
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
2. Attribute relevance analysis - It is used to remove attributes name, gender, program, phone_no.
4. Drill down, roll up and other OLAP operations can be applied to the target and contrasting classes to adjust the levels of abstraction of the resulting description.
Unit 3
Association Rule Mining
Association rule mining is a procedure which aims to discover frequently occurring patterns, correlations, or associations from datasets found in various kinds of databases such as relational databases, transactional databases, and other forms of repositories.
An association rule has 2 parts:
• an antecedent (if) and
• a consequent (then)
An antecedent is something that’s found in data, and a consequent is an item that is found in
combination with the antecedent. Have a look at this rule for instance:
“If a customer buys bread, he is 70% likely to also buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it can
be understood as a retail store’s association rule to target their customers better. If the above rule is
a result of a thorough analysis of some data sets, it can be used to not only improve customer
service but also improve the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then
patterns. Then, depending on the following two parameters, the important relationships are
observed:
1. Support: Support indicates how frequently the if/then relationship appears in the database.
2. Confidence: Confidence indicates how often these relationships have been found to be true, i.e., the proportion of transactions containing the antecedent that also contain the consequent.
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find the
rules that govern how or why such products/items are often bought together. For example, peanut
butter and jelly are frequently purchased together because a lot of people like to make PB&J
sandwiches.
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was the first
application area of association mining. The aim is to discover associations of items occurring
together more often than you’d expect from randomly sampling all the possibilities. The classic
anecdote of Beer and Diaper will help in understanding this better.
Association rule mining finds interesting associations and relationships among large sets of data
items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on
the occurrences of other items in the transaction.
Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – Frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper})=2
Frequent Itemset – An itemset whose support is greater than or equal to minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any 2
itemsets.
Example: {Milk, Diaper}->{Beer}
Rule Evaluation Metrics –
• Support(s) –
The number of transactions that include items in the {X} and {Y} parts of the rule as a
percentage of the total number of transaction.It is a measure of how frequently the collection
of items occur together as a percentage of all transactions.
• Support = σ (X+Y) ÷ total –
It is interpreted as fraction of transactions that contain both X and Y.
• Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
• Conf(X=>Y) = Supp(X∪Y) ÷ Supp(X) –
It measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –
The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support (frequency) of {Y}.
• Lift(X=>Y) = Conf(X=>Y) ÷ Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
Example – From the above table, {Milk, Diaper}=>{Beer}
s= σ({Milk, Diaper, Beer}) / |T|
= 2/5
= 0.4
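The sketch below reproduces this calculation and adds confidence and lift for the rule {Milk, Diaper} => {Beer}. The five transactions are illustrative, chosen so that σ({Milk, Diaper, Beer}) = 2 and |T| = 5 as in the example above.

# A minimal sketch of the rule evaluation metrics for {Milk, Diaper} => {Beer}.
# The five transactions are illustrative, chosen so that
# sigma({Milk, Diaper, Beer}) = 2 and |T| = 5, matching the example above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y)                 # 2/5 = 0.4
c = support(X | Y) / support(X)    # (2/5) / (3/5) ≈ 0.67
lift = c / support(Y)              # 0.67 / 0.6 ≈ 1.11
print(round(s, 2), round(c, 2), round(lift, 2))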
Examples:
• Given: (1) database of transaction, (2) each transaction is a list of items (purchased by
a customer in visit)
• Find: all rules that correlate the presence of one set of items with that of another set
of items.
⎯ E.g., 98% of people who purchase tires and auto accessories also get automotive
services done.
⎯ E.g., Market Basket Analysis
This process analyzes customer buying habits by finding associations between the
different items that customers place in their “Shopping Baskets”. The discovery of such
associations can help retailers develop marketing strategies by gaining insight into which
items are frequently purchased together by customer.
• Applications
⎯ * ⇒ maintenance agreement (what should the store do to boost maintenance agreement sales?)
⎯ Home electronics ⇒ * (what other products should the store stock up on?)
⎯ Attached mailing in direct marketing
⎯ Detecting “ping-pong”ing of patients, faulty “collisions”
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
Let minimum support be 50% and minimum confidence 50%; then we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets:
By definition, these rules must satisfy minimum support and minimum confidence.
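The sketch below illustrates these two steps on a tiny, made-up transaction set. For brevity it enumerates candidate itemsets by brute force rather than using Apriori's level-wise candidate generation and pruning; the transactions and thresholds are illustrative.

# A minimal sketch of the two steps above: (1) find all frequent itemsets
# that meet min_sup, (2) generate strong rules that also meet min_conf.
# Transactions and thresholds are illustrative.
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup, min_conf = 0.5, 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: frequent itemsets (brute force over all candidate itemsets;
# the real Apriori algorithm prunes candidates level by level).
items = sorted(set().union(*transactions))
frequent = []
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        if support(set(cand)) >= min_sup:
            frequent.append(frozenset(cand))

# Step 2: strong rules X -> Y from each frequent itemset of size >= 2.
for itemset in (f for f in frequent if len(f) >= 2):
    for k in range(1, len(itemset)):
        for lhs in combinations(itemset, k):
            X, Y = frozenset(lhs), itemset - frozenset(lhs)
            conf = support(itemset) / support(X)
            if conf >= min_conf:
                print(set(X), "->", set(Y),
                      f"support={support(itemset):.2f} confidence={conf:.2f}")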
• Classification
• Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the expenditures
in dollars of potential customers on computer equipment given their income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to determine whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are interested in predicting a numeric value. Therefore the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
With the help of the bank loan application that we have discussed above, let us understand the working of classification. The data classification process includes two steps: a learning step, in which a classification model is built from training data, and a classification step, in which the model is used to predict class labels for new data.
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
• Data Transformation and reduction − The data can be transformed by any of the
following methods.
o Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute in order to make them fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
o Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
Here are the criteria for comparing the methods of Classification and Prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the classifier or
predictor.
• Robustness − It refers to the ability of classifier or predictor to make correct
predictions from given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor can be understood.
Data Mining - Decision Tree Induction
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer
at a company is likely to buy a computer or not. Each internal node represents a test on an
attribute. Each leaf node represents a class.
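As a small illustration, the sketch below induces a decision tree for a toy buy_computer task. It assumes the scikit-learn library is available (it is not part of these notes); the tiny dataset, the integer encoding of age and income, and the depth limit are illustrative.

# A minimal sketch of decision tree induction for a toy buy_computer task,
# assuming scikit-learn is installed. The tiny dataset is illustrative:
# age and income are encoded as integers (0 = youth/low, 1 = middle/medium,
# 2 = senior/high), and the label is 1 for "buys computer".
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 2], [0, 1], [1, 2], [2, 1], [2, 0], [1, 0], [0, 0], [2, 2]]
y = [0, 0, 1, 1, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth limit acts as pre-pruning
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[1, 1]]))   # prediction for a middle-aged, medium-income customer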
The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or
outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity of a tree is measured by two parameters: the number of leaves in the tree and the error rate of the tree.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
• Posterior Probability [P(H|X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
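A minimal worked example of the formula, with made-up probabilities: H is the hypothesis that a customer buys a computer and X is the evidence that the customer is a student.

# A minimal worked example of Bayes' theorem P(H|X) = P(X|H) * P(H) / P(X).
# The numbers are illustrative: H = "customer buys a computer",
# X = "customer is a student".
p_h = 0.3          # prior P(H)
p_x_given_h = 0.6  # likelihood P(X|H)
p_x = 0.4          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print(p_h_given_x)                       # 0.45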
Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
• A Belief Network allows class conditional independencies to be defined between
subsets of variables.
• It provides a graphical model of causal relationship on which learning can be
performed.
• We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
The following diagram shows a directed acyclic graph for six Boolean variables.
The arc in the diagram allows representation of causal knowledge. For example, lung cancer
is influenced by a person's family history of lung cancer, as well as whether or not the person
is a smoker. It is worth noting that the variable PositiveXray is independent of whether the
patient has a family history of lung cancer or whether the patient is a smoker, given that we know
the patient has lung cancer.
The conditional probability table for the values of the variable LungCancer (LC) showing
each possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker
(S) is as follows −
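The sketch below shows one way such a conditional probability table could be represented and queried. The probability values are placeholders for illustration only; they are not the figures from the textbook table, which is not reproduced here.

# A minimal sketch of a conditional probability table for LungCancer (LC),
# keyed by its parents FamilyHistory (FH) and Smoker (S). The probability
# values are placeholders, not the textbook's figures.
cpt_lc = {
    # (FH, S): P(LC = yes | FH, S)
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

def p_lung_cancer(family_history, smoker, lc=True):
    p_yes = cpt_lc[(family_history, smoker)]
    return p_yes if lc else 1.0 - p_yes

print(p_lung_cancer(True, False))             # P(LC = yes | FH = yes, S = no)
print(p_lung_cancer(False, False, lc=False))  # P(LC = no | FH = no, S = no)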
Unit -4
Web Mining
Web Mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.
Applications of Web Mining:
1. Web mining helps to improve the power of web search engines by classifying web documents and identifying web pages.
2. It is used for web searching, e.g., Google, Yahoo, etc., and vertical searching, e.g., FatLens, Become, etc.
3. Web mining is used to predict user behavior.
4. Web mining is very useful for evaluating a particular website and e-service, e.g., landing page optimization.
Web mining can be broadly divided into three different types of techniques of mining: Web
Content Mining, Web Structure Mining, and Web Usage Mining. These are explained as
following below.
1. Web Content Mining:
Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data – text, image, audio, video, etc. Content data is the collection of facts that a web page was designed to convey, and it can provide effective and interesting patterns about user needs. Mining text documents is related to text mining, machine learning and natural language processing, and this kind of mining is also known as text mining. This type of mining performs scanning and mining of the text, images and groups of web pages according to the content of the input.
2. Web Structure Mining:
Web structure mining is the application of discovering structure information from the web. The
structure of the web graph consists of web pages as nodes, and hyperlinks as edges connecting
related pages. Structure mining basically shows the structured summary of a particular website.
It identifies the relationships between web pages linked by information or direct link connection. To
determine the connection between two commercial websites, Web structure mining can be very
useful.
Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.
• It is a database system
• It offers spatial data types (SDTs) in its data model and query language.
• It supports spatial data types in its implementation, providing at least spatial
indexing and efficient algorithms for spatial join.
Example
A road map is a visualization of geographic information. A road map is a 2-dimensional
object which contains points, lines, and polygons that can represent cities, roads, and
political boundaries such as states or provinces.
In general, spatial data can be of two types −
• Vector data: This data is represented as discrete points, lines and polygons.
• Raster data: This data is represented as a matrix of square cells.
Spatial data in the form of points, lines, polygons, etc. is used by many different databases.
ABSTRACT
Temporal Data Mining is a rapidly evolving area of research that is at the intersection of
several disciplines, including statistics, temporal pattern recognition, temporal databases,
optimisation, visualisation, high-performance computing, and parallel computing. This paper
is first intended to serve as an overview of the temporal data mining in research and
applications.
INTRODUCTION
Temporal Data Mining is a rapidly evolving area of research that is at the intersection of
several disciplines, including statistics (e.g., time series analysis), temporal pattern
recognition, temporal databases, optimisation, visualisation, high-performance computing,
and parallel computing. This paper is intended to serve as an overview of the temporal data
mining in research and applications. In addition to providing a general overview, we motivate
the importance of temporal data mining problems within Knowledge Discovery in Temporal
Databases (KDTD) which include formulations of the basic categories of temporal data
mining methods, models, techniques and some other related areas. The paper is structured as
follows. Section 2 discusses the definitions and tasks of temporal data mining. Section 3
discusses the issues on temporal data mining techniques. Section 4 discusses two major
problems of temporal data mining, those of similarity and periodicity. Section 5 provides an
overview of time series temporal data mining. Section 6 moves on to a discussion of several important challenges in temporal data mining and outlines our general distribution theory for answering some of those challenges. The last section concludes the paper with a brief summary.
A database is used to model the state of some aspect of the real world in the form of relations. In general, database systems store only one state, the current state of the real world, and do not store data about previous states, except perhaps as audit trails. If the current state of the real world changes, the database gets modified and updated, and information about the past state is lost.
However, in many real-life applications, it is necessary to store and retrieve information about old states. For example, a student database must contain information about the previous performance history of a student for preparing the final result. An autonomous robotic system must store present and previous sensor data from the environment for effective action.
Example:
ID name dept name salary from to
10101 Srinivasan Comp. Sci. 61000 2007/1/1 2007/12/31
10101 Srinivasan Comp. Sci. 65000 2008/1/1 2008/12/31
12121 Wu Finance 82000 2005/1/1 2006/12/31
12121 Wu Finance 87000 2007/1/1 2007/12/31
12121 Wu Finance 90000 2008/1/1 2008/12/31
98345 Kim Elec. Eng. 80000 2005/1/1 2008/12/31
In the above example, to simplify the representation, every row has only one time interval
associated with it; thus, a row is represented once for every disjoint time interval in which it is
true. Intervals that are given here are a combination of attributes from and to; an actual
implementation would have a structured type, which is known as Interval, that contains both
fields.
There are a few important terms used in connection with time in databases:
1. Temporal database :
Databases that store information about states of the real world across time are known as
temporal databases.
2. Valid time :
Valid time denotes the time period during which a fact is true with respect to the real world.
3. Transaction time :
Transaction time is the time period during which a fact is stored in the databases.
4. Temporal relation :
Temporal relation is one where each tuple has an associated time when it is true; the time may
be either valid time or transaction time.
5. Bi-temporal relation :
Both valid time and transaction time can be stored, in which case the relation is said to be a Bi-
temporal relation.
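The sketch below represents the salary relation from the example table above as tuples carrying a valid-time interval, and answers a simple valid-time query (the salary of an employee on a given date). The helper function and its name are illustrative.

# A minimal sketch of a temporal (valid-time) relation: each tuple carries a
# [valid_from, valid_to] interval, and a query asks for the salary of an
# employee that was valid on a given date. Rows follow the example table above.
from datetime import date

salary_history = [
    # (id, name, dept, salary, valid_from, valid_to)
    (10101, "Srinivasan", "Comp. Sci.", 61000, date(2007, 1, 1), date(2007, 12, 31)),
    (10101, "Srinivasan", "Comp. Sci.", 65000, date(2008, 1, 1), date(2008, 12, 31)),
    (12121, "Wu", "Finance", 82000, date(2005, 1, 1), date(2006, 12, 31)),
    (12121, "Wu", "Finance", 87000, date(2007, 1, 1), date(2007, 12, 31)),
]

def salary_as_of(emp_id, when):
    for (i, _name, _dept, salary, start, end) in salary_history:
        if i == emp_id and start <= when <= end:
            return salary
    return None   # no fact valid at that time

print(salary_as_of(12121, date(2007, 6, 15)))   # 87000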
There are several descriptive statistical measures that can be mined from large databases, i.e., used for knowledge discovery in large databases.
Mean:
• It is the arithmetic average: the sum of all values divided by the number of values.
Median:
• It is the middle value of the data when the values are sorted.
Mode:
• It is the value that occurs most frequently in the data.
• If there is only one mode in the data, the data is unimodal.
• If there are two modes in the data, the data is bimodal.
• If there are three modes in the data, the data is trimodal.
• The empirical formula relating the measures is mean − mode = 3 × (mean − median), i.e., mode ≈ 3 × median − 2 × mean.
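The sketch below computes the three measures with Python's standard statistics module on a small, illustrative data set, and notes the empirical relationship stated above (which holds only approximately, for moderately skewed data).

# A minimal sketch of the central-tendency measures using Python's
# statistics module. The data values are illustrative.
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 7, 9]

mean = statistics.mean(data)       # arithmetic average
median = statistics.median(data)   # middle value of the sorted data
mode = statistics.mode(data)       # most frequent value (unimodal here)

print(mean, median, mode)          # 4.777..., 5, 5
# Empirical relationship (approximate, for moderately skewed data):
# mean - mode ≈ 3 * (mean - median)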
Boxplot Analysis
• A boxplot summarizes a distribution by its five-number summary: minimum, first quartile, median, third quartile and maximum.
• A related tool, the quantile-quantile (q-q) plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• It allows the user to view whether there is a shift in going from one distribution to another.
Scatter Plot
• It provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane.
UNIT V
Data Warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data analysis and reporting.
It is a blend of technologies and components which aids the strategic use of data. It is
electronic storage of a large amount of information by a business which is designed for query
and analysis instead of transaction processing. It is a process of transforming data into
information and making it available to users in a timely manner to make a difference.
A data warehouse works best when users have a shared way of describing the trends that are tracked for each specific subject. Below are the major characteristics of a data warehouse:
1. Subject-oriented –
A data warehouse is always subject oriented, as it delivers information about a theme instead of the organization's current operations. It is organized around specific themes, meaning that the data warehousing process is intended to handle a well-defined subject. These themes can be sales, distribution, marketing, etc.
A data warehouse never puts emphasis only on current operations. Instead, it focuses on the demonstration and analysis of data to support decision making. It also delivers an easy and precise view of the particular theme by eliminating data which is not required to make the decisions.
2. Integrated –
Integration is closely related to subject orientation: the data must be kept in a reliable, consistent format. Integration means establishing a shared definition so that all similar data from the different databases can be measured in the same way. The data must also reside in the data warehouse in a shared and generally accepted manner.
A data warehouse is built by integrating data from various sources, such as a mainframe and a relational database. In addition, it must have reliable naming conventions, formats and codes. Integration of the data warehouse benefits effective analysis of data. Consistency in naming conventions, attribute measures, encoding structure, etc. should be confirmed. Integration of the data warehouse handles the various subject-related warehouses.
3. Time-Variant –
Here data is maintained at different intervals of time, such as weekly, monthly, or annually. Unlike online transaction processing (OLTP) systems, which hold only current data, the time horizon of the data warehouse is much wider than that of operational systems. The data residing in the data warehouse is associated with a specific interval of time and delivers information from a historical perspective; it comprises an element of time, explicitly or implicitly. Another feature of time-variance is that once data is stored in the data warehouse, it cannot be modified, altered, or updated.
4. Non-Volatile –
As the name implies, the data residing in the data warehouse is permanent: it is not erased or deleted when new data is inserted. The warehouse therefore accumulates a mammoth quantity of historical data, which provides a stable basis for analysis.
A Data Warehouse works as a central repository where information arrives from one or more
data sources. Data flows into a data warehouse from the transactional system and other relational databases, and it may be of three kinds:
1. Structured
2. Semi-structured
3. Unstructured data
The data is processed, transformed, and ingested so that users can access the processed data
in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A
data warehouse merges information coming from different sources into one comprehensive
database.
By merging all of this information in one place, an organization can analyze its customers
more holistically. This helps to ensure that it has considered all the information available.
Data warehousing makes data mining possible. Data mining is looking for patterns in the data
that may lead to higher sales and profits.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from the sources.
Earlier, organizations started with relatively simple use of data warehousing. However, over time, more sophisticated use of data warehousing began.
The following are general stages of use of the data warehouse (DWH):
In this stage, data is just copied from an operational system to another server. In this way,
loading, processing, and reporting of the copied data do not impact the operational system's
performance.
Data in the Datawarehouse is regularly updated from the Operational Database. The data in
Datawarehouse is mapped and transformed to meet the Datawarehouse objectives.
In this stage, Data warehouses are updated whenever any transaction takes place in
operational database. For example, Airline or railway booking system.
In this stage, Data Warehouses are updated continuously when the operational system
performs a transaction. The Datawarehouse then generates transactions which are passed
back to the operational system.
Load manager: The load manager is also called the front component. It performs all the operations associated with the extraction and loading of data into the warehouse. These operations include transformations to prepare the data for entering into the data warehouse.
Warehouse Manager: The warehouse manager performs operations associated with the management of the data in the warehouse. It performs operations like analysis of data to ensure consistency, creation of indexes and views, generation of denormalization and aggregations, transformation and merging of source data, and archiving and backing up data.
Query Manager: The query manager is also known as the backend component. It performs all the operations related to the management of user queries. This component directs queries to the appropriate tables and schedules the execution of queries.
• Data warehouse allows business users to quickly access critical data from many sources all in one place.
• Data warehouse provides consistent information on various cross-functional activities. It also supports ad-hoc reporting and queries.
• Data Warehouse helps to integrate many sources of data to reduce stress on the
production system.
• Data warehouse helps to reduce total turnaround time for analysis and reporting.
• Restructuring and Integration make it easier for the user to use for reporting and
analysis.
• Data warehouse allows users to access critical data from a number of sources in a single place. Therefore, it saves the user's time in retrieving data from multiple sources.
• Data warehouse stores a large amount of historical data. This helps users to analyze
different time periods and trends to make future predictions.
Data Warehouse applications are designed to support the user ad-hoc data requirements, an
activity recently dubbed online analytical processing (OLAP). These include applications
such as forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added to reflect the needs of users for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
An operational system is a method used in data warehousing to refer to a system that is used
to process the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Meta Data summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.
The principal purpose of a data warehouse is to provide information to the business managers
for strategic decision-making. These customers interact with the warehouse using end-client
access tools.
We must clean and process the operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
Data Warehouse Staging Area is a temporary location where a record from source systems is
copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyze historical data for purchases and sales or mine
historical information to make predictions about customer behavior.
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to
detailed data. In terms of data warehouse, we can define metadata as follows.
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to locate
the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given
data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and for recording the source of the extracted data.
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse
is different from the warehouse data, yet it plays an important role. The various roles of
metadata are explained below.
• Metadata acts as a directory.
• This directory helps the decision support system to locate the contents of the data
warehouse.
• Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
• Metadata helps in summarization between current detailed data and highly summarized
data.
• Metadata also helps in summarization between lightly detailed data and highly
summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.
The following diagram shows the roles of metadata.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following
metadata −
• Definition of data warehouse − It includes the description of structure of data
warehouse. The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
• Business metadata − It contains the data ownership information, business
definitions, and changing policies.
• Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
• Data for mapping from the operational environment to the data warehouse − It includes
the source databases and their contents, data extraction, data partitioning and cleaning,
transformation rules, and data refresh and purging rules.
• Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.
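To make the repository contents above concrete, the following sketch models one hypothetical repository entry in Python; the fields and values are illustrative, not a standard schema.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class RepositoryEntry:
    """One illustrative metadata-repository record for a warehouse object."""
    object_name: str                # e.g. a table or view in the warehouse
    schema_definition: str          # structure: columns, hierarchies, etc.
    business_owner: str             # business metadata: ownership
    business_definition: str        # business metadata: meaning of the object
    currency: str = "active"        # operational metadata: active/archived/purged
    lineage: list = field(default_factory=list)  # operational metadata: history

entry = RepositoryEntry(
    object_name="dim_customer",
    schema_definition="customer_key INT, name TEXT, region TEXT",
    business_owner="CRM team",
    business_definition="One row per customer known to the business",
    lineage=[f"loaded from crm_customers on {date.today()}"],
)
print(entry)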
The importance of metadata cannot be overstated. Metadata helps drive the accuracy of
reports, validates data transformations, and ensures the accuracy of calculations. Metadata also
enforces consistent definitions of business terms for business end-users. With all these uses,
metadata also has its challenges. Some of the challenges are discussed below.
• Metadata in a big organization is scattered across the organization. This metadata is
spread in spreadsheets, databases, and applications.
• Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined.
• There are no industry-wide accepted standards. Data management solution vendors
have a narrow focus.
• There are no easy and accepted methods of passing metadata.
What is a Data Catalog and Why Do You Need One?
Simply put, a data catalog is an organized inventory of data assets in the organization. It uses
metadata to help organizations manage their data. It also helps data professionals collect,
organize, access, and enrich metadata to support data discovery and governance.
Data Catalog Definition and Analogy
We gave a short definition of a data catalog above, as something that uses metadata to help
organizations manage their data. But let’s expand upon that with the analogy of a library.
When you go to a library and you need to find a book, you use their catalog to discover
whether the book is there, which edition it is, where it’s located, a description—everything
you need so that you can decide whether you want it, and if you do, how to go and find it.
That’s what many object stores, databases, and data warehouses offer today.
But now, think back to the analogy of that library and the catalog. And now expand the
power of that catalog to cover every library in the country. Imagine that you have just one
interface and suddenly, you can find every single library in the country that has the copy of
the book you’re seeking, and you can find all the details you’d ever want on each one of
those books.
That’s what an enterprise data catalog does for all of your data. It gives you a single,
overarching view and deeper visibility into all of your data, not just each data store at a time.
Perhaps you might wonder—why would you need a view like that?
Challenges a Data Catalog Can Address
With more data than ever before, being able to find the right data has become harder than it
ever has been. At the same time, there are also more rules and regulations than ever before—
with GDPR being just one of them. So not only is data access becoming a challenge, but data
governance has become a challenge as well. It’s critical to understand the kind of data that
you have now, who is moving it, what it’s being used for, and how it needs to be protected.
But you also have to avoid putting too many layers and wrappers around your data—because
data is useless if it's too difficult to use. Unfortunately, there are many challenges with
finding and accessing the right data. Data engineers, for example, want to know how any
changes they make will affect the system as a whole.
Using a data catalog the right way means better data usage, all of which contributes to:
• Cost savings
• Operational efficiency
• Competitive advantages
• Better customer experience
• Reduced fraud and risk
• And so much more
Here are just a few of the use cases for a data catalog. But really, a data catalog can be used
in so many ways because fundamentally, it’s about having wider visibility and deeper access
to your data.
Self-service analytics. Many data users have trouble finding the right data. And not just
finding the right data, but understanding whether it's useful. You might discover a file called
customer_info.csv. And you might need a file about customers. But that doesn't mean it's the
right one, because it could be one of 50 similar files. The file may have many fields and
you may not understand what all of those data elements are. You’ll want an easier way to see
the business context around it, such as whether it’s a managed resource, from the right data
store, or what the relationship is with other data artifacts.
Discovery could also entail understanding the shape and characteristics of data, from
something as simple as value distribution, statistical information, or something as important
and complex as Personally Identifiable Information (PII) or Personal Health Information
(PHI).
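As a toy illustration of such profiling (not a real PII detector), the sketch below computes a few shape statistics and a naive email-pattern flag per column; the data and the detection rule are invented.

import pandas as pd

EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"   # naive stand-in for PII detection

def profile_column(series: pd.Series) -> dict:
    """Very simplified profile: basic shape plus a naive PII flag."""
    sample = series.dropna().astype(str).head(100)
    return {
        "non_null": int(series.notna().sum()),
        "distinct": int(series.nunique()),
        "looks_like_pii": bool(sample.str.contains(EMAIL_PATTERN, regex=True).any()),
    }

df = pd.DataFrame({"email": ["a@x.com", "b@y.org"], "amount": [10, 20]})
print({col: profile_column(df[col]) for col in df.columns})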
• Technical metadata: Schemas, tables, columns, file names, report names – anything that is
documented in the source system
• Business metadata: This is typically the business knowledge that users have about the
assets in the organization. This might include business descriptions, comments,
annotations, classifications, fitness-for-use, ratings, and more.
• Operational metadata: When was this object refreshed? Which ETL job created it? How
many times has a table been accessed, and by which users? (A small sketch combining these
types follows.)
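The sketch below shows how these three kinds of metadata might sit together in one hypothetical catalog entry, with a naive search over asset names and business descriptions; the field names and values are illustrative.

# Illustrative catalog entries combining technical, business, and operational metadata
catalog = [
    {
        "asset": "customer_info.csv",
        "technical": {"columns": ["customer_id", "email", "region"]},
        "business": {"description": "Managed customer master extract", "rating": 4.5},
        "operational": {"last_refreshed": "2024-01-15", "etl_job": "crm_daily"},
    },
]

def search(term: str) -> list:
    """Naive search across asset names and business descriptions."""
    term = term.lower()
    return [e["asset"] for e in catalog
            if term in e["asset"].lower()
            or term in e["business"]["description"].lower()]

print(search("customer"))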
In the past few years, we’ve seen a mini-revolution in how we can use this valuable
metadata. Once, metadata was used mostly for audit, lineage, and reporting. But
today, technological innovations like serverless processing, graph databases, and especially
new or more accessible AI and machine learning techniques are pushing the boundaries and
making things possible with metadata that simply weren’t possible at this scale before.
Today, metadata can be used to augment data management: everything from self-service data
preparation to role- and content-based access control, automated data onboarding, monitoring
and alerting on anomalies, and auto-provisioning and auto-scaling of resources can now be
augmented with the help of metadata.
And the data catalog uses metadata to help you achieve more than ever with your data
management.
Data Warehousing - Security
The objective of a data warehouse is to make large amounts of data easily accessible to the
users, hence allowing the users to extract information about the business as a whole. But we
know that there could be some security restrictions applied on the data that can be an obstacle
for accessing the information. If the analyst has a restricted view of data, then it is impossible
to capture a complete picture of the trends within the business.
The data from each analyst can be summarized and passed on to management, where the
different summaries can be aggregated. As the aggregation of summaries is not the same as an
aggregation over the data as a whole, it is possible to miss some information trends in the data
unless someone is analyzing the data as a whole.
Security Requirements
Adding security features affects the performance of the data warehouse, therefore it is
important to determine the security requirements as early as possible. It is difficult to add
security features after the data warehouse has gone live.
During the design phase of the data warehouse, we should keep in mind what data sources
may be added later and what would be the impact of adding those data sources. We should
consider the following possibilities during the design phase.
• Whether the new data sources will require new security and/or audit restrictions to be
implemented?
• Whether new users will be added who have restricted access to data that is already
generally available?
This situation arises when the future users and the data sources are not well known. In such a
situation, we need to use the knowledge of business and the objective of data warehouse to
know likely requirements.
The following activities get affected by security measures −
• User access
• Data load
• Data movement
• Query generation
User Access
We need to first classify the data and then classify the users on the basis of the data they can
access.
Data Classification
The following two approaches can be used to classify the data −
• Data can be classified according to its sensitivity. Highly sensitive data is classified as
highly restricted and less sensitive data is classified as less restricted.
• Data can also be classified according to the job function. This restriction allows only
specific users to view particular data. Here we restrict the users to view only that part
of the data in which they are interested and are responsible for.
There are some issues in the second approach. To understand, let's have an example. Suppose
you are building the data warehouse for a bank. Consider that the data being stored in the data
warehouse is the transaction data for all the accounts. The question here is, who is allowed to
see the transaction data. The solution lies in classifying the data according to the function.
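As a minimal sketch of both approaches, the example below records hypothetical sensitivity and job-function labels per column and filters what a user may see; the labels and column names are invented.

# Hypothetical sensitivity and job-function labels per warehouse column
CLASSIFICATION = {
    "account_balance":  {"sensitivity": "high", "functions": {"finance"}},
    "transaction_date": {"sensitivity": "low",  "functions": {"finance", "marketing"}},
    "customer_region":  {"sensitivity": "low",  "functions": {"finance", "marketing", "sales"}},
}

def allowed_columns(user_function: str, clearance: str) -> list:
    """Return the columns a user may see, by job function and clearance level."""
    order = {"low": 0, "high": 1}
    return [col for col, rule in CLASSIFICATION.items()
            if user_function in rule["functions"]
            and order[clearance] >= order[rule["sensitivity"]]]

print(allowed_columns("marketing", "low"))   # dates and regions, not balances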
User classification
The following approaches can be used to classify the users −
• Users can be classified as per the hierarchy of users in an organization, i.e., users can
be classified by departments, sections, groups, and so on.
• Users can also be classified according to their role, with people grouped across
departments based on their role.
Classification on the Basis of Department
Let's have an example of a data warehouse where the users are from the sales and marketing
departments. We can have security based on a top-down company view, with access centered
on the different departments. But there could be some restrictions on users at different levels.
This structure is shown in the following diagram.
But if each department accesses different data, then we should design the security access for
each department separately. This can be achieved by departmental data marts. Since these data
marts are separated from the data warehouse, we can enforce separate security restrictions on
each data mart. This approach is shown in the following figure.
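A tiny sketch of this idea: classified users are routed only to their own departmental data mart. The user directory and mart names below are hypothetical.

# Hypothetical routing of classified users to their departmental data mart
USER_DIRECTORY = {
    "alice": {"department": "sales",     "role": "analyst"},
    "bob":   {"department": "marketing", "role": "manager"},
}
DATA_MARTS = {"sales": "sales_mart.db", "marketing": "marketing_mart.db"}

def mart_for(user: str) -> str:
    """Each department only ever connects to its own data mart."""
    department = USER_DIRECTORY[user]["department"]
    return DATA_MARTS[department]

print(mart_for("alice"))   # -> sales_mart.db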
Audit Requirements
The following categories of activity need to be audited −
• Connections
• Disconnections
• Data access
• Data change
Note − For each of the above-mentioned categories, it is necessary to audit success, failure,
or both. From a security point of view, auditing failures is especially important, because
failures can highlight unauthorized or fraudulent access attempts.
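A minimal sketch of such auditing using Python's standard logging module; the event categories follow the list above, while the logger name and message format are illustrative choices.

import logging

audit = logging.getLogger("warehouse.audit")
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(message)s")

def audit_event(category: str, user: str, success: bool, detail: str = "") -> None:
    """Record one auditable event; failures are the most interesting ones."""
    outcome = "SUCCESS" if success else "FAILURE"
    audit.info("%s %s user=%s %s", category.upper(), outcome, user, detail)

audit_event("connection", "alice", True)
audit_event("data_access", "mallory", False, detail="table=fact_orders denied")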
Network Requirements
Network security is as important as the other aspects of security; we cannot ignore the network
security requirement. We need to consider the following issues −
• Is it necessary to encrypt data before transferring it to the data warehouse?
• Are there restrictions on which network routes the data can take?
These restrictions need to be considered carefully. Following are the points to remember −
• The process of encryption and decryption will increase overheads. It will require
more processing power and processing time.
• The cost of encryption can be high if the source system is already heavily loaded, because
the cost of encryption is borne by the source system, as the sketch below illustrates.
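If encryption is required, one possible approach in Python (illustrative, not prescribed by the text) uses the third-party cryptography package; the file names below are invented.

from cryptography.fernet import Fernet   # third-party: pip install cryptography

# Minimal sketch: encrypt an extract on the source system before it crosses
# the network; this extra work is the overhead borne by the source system.
with open("extract.csv", "w") as f:       # illustrative extract file
    f.write("account,balance\n42,99.50\n")

key = Fernet.generate_key()               # in practice, managed by a key store
cipher = Fernet(key)

with open("extract.csv", "rb") as src:
    token = cipher.encrypt(src.read())
with open("extract.csv.enc", "wb") as out:
    out.write(token)

# After transfer, the warehouse side decrypts with the same key
assert cipher.decrypt(token) == open("extract.csv", "rb").read()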
Data Movement
There exist potential security implications while moving the data. Suppose we need to transfer
some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse,
the following questions are raised −
• Where is the flat file stored?
• Who has access to that disk space?
Similar questions arise for the backups of these flat files, such as where they are stored and who has access to them.
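A small sketch of tightening access to such a flat file on disk; the use of a temporary file and owner-only permissions below is purely illustrative.

import os
import stat
import tempfile

# Minimal sketch: restrict who can read a staged flat file before the load
fd, flat_file = tempfile.mkstemp(suffix=".csv")   # stand-in for the landing path
os.close(fd)

# If group or other users could read the file, tighten it to owner-only access
mode = os.stat(flat_file).st_mode
if mode & (stat.S_IRGRP | stat.S_IROTH):
    os.chmod(flat_file, stat.S_IRUSR | stat.S_IWUSR)

# Remove the flat file as soon as the warehouse load has consumed it
os.remove(flat_file)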
Documentation
The audit and security requirements need to be properly documented. This will be treated as
a part of the justification. This document can contain all the information gathered from −
• Data classification
• User classification
• Network requirements
• Data movement and storage requirements
• All auditable actions
Security affects the application code and the development timescales. Security affects the
following areas −
• Application development
• Database design
• Testing
Application Development
Security affects the overall application development and it also affects the design of
important components of the data warehouse such as the load manager, warehouse manager, and
query manager. The load manager may require checking code to filter records and place them
in different locations. More transformation rules may also be required to hide certain data.
There may also be requirements for extra metadata to handle any extra objects.
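As a hypothetical illustration of such load-manager logic, the sketch below masks one sensitive field and routes records to different locations based on an invented rule.

# Hypothetical load-manager step: route records and mask a sensitive field
def load_record(record: dict) -> tuple:
    """Return (target_location, transformed_record) for one incoming record."""
    transformed = dict(record)
    # Transformation rule that hides certain data before it lands anywhere
    transformed["card_number"] = "****" + transformed["card_number"][-4:]
    # Checking code that places records in different locations
    target = "restricted_area" if record["amount"] > 10_000 else "general_area"
    return target, transformed

print(load_record({"card_number": "4111111111111111", "amount": 25_000}))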
To create and maintain extra views, the warehouse manager may require extra code to enforce
security. Extra checks may have to be coded into the data warehouse to prevent it from being
fooled into moving data into a location where it should not be available. The query manager
requires changes to handle any access restrictions. The query manager will need to be
aware of all extra views and aggregations.
Database design
The database layout is also affected because when security measures are implemented, there
is an increase in the number of views and tables. Adding security increases the size of the
database and hence increases the complexity of the database design and management. It will
also add complexity to the backup management and recovery plan.
Testing
Testing the data warehouse is a complex and lengthy process. Adding security to the data
warehouse also increases the testing time and complexity. It affects the testing in the following
two ways −
• It will increase the time required for integration and system testing.
• There is added functionality to be tested which will increase the size of the testing suite.
Staging area
• A data staging area (DSA) is a temporary storage area between the data sources and a data
warehouse.
• The staging area is mainly used to quickly extract data from its data sources, minimizing the
impact of the sources.
• After data has been loaded into the staging area, the staging area is used to combine data from
multiple data sources and to perform transformations, validations, and data cleansing. Data is
often transformed into a star schema prior to loading the data warehouse.
• The data warehouse staging area is a temporary location where data from the source
systems is copied.
• A staging area is mainly required in a Data Warehousing Architecture for timing
reasons. In short, all required data must be available before data can be integrated
into the Data Warehouse.
• Due to varying business cycles, data processing cycles, hardware and network
resource limitations and geographical factors, it is not feasible to extract all the data
from all Operational databases at exactly the same time.