
UNIT I

Data Mining: Data – Types of Data – Data Mining Functionalities – Interestingness Patterns – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Major Issues in Data Mining – Data Preprocessing.

Data

• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object

– Examples: eye color of a person, temperature, etc.

– Attribute is also known as variable, field, characteristic, or feature

• A collection of attributes describes an object

– Object is also known as record, point, case, sample, entity, or instance

Attribute Values

• Attribute values are numbers or symbols assigned to an attribute

• Distinction between attributes and attribute values

– Same attribute can be mapped to different attribute values

• Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of values

• Example: Attribute values for ID and age are integers

• But properties of attribute values can be different

– ID has no limit but age has a maximum and minimum value



Types of Attributes

• There are different types of attributes

– Nominal

• Examples: ID numbers, eye color, zip codes

– Ordinal

• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in
{tall, medium, short}
– Interval

• Examples: calendar dates, temperatures in Celsius or Fahrenheit.

– Ratio

• Examples: temperature in Kelvin, length, time, counts

Types of Data
1. Data stored in the database
A database system, often simply called a database or DBMS, stores a collection of interrelated data together with a set of software programs used to manage the data and provide easy access to it. These programs serve many purposes, including defining the structure of the database, ensuring that the stored information remains secure and consistent, and managing different types of data access, such as shared, distributed, and
concurrent access.
A relational database consists of tables, each with a unique name, a set of attributes (columns), and rows (records) that can hold large numbers of data values. Every record stored in a table is identified by a unique key. An entity-relationship (ER) model is often created to represent a relational database in terms of entities and the
relationships that exist between them.
2. Data warehouse
A data warehouse is a single data storage location that collects data from multiple sources and stores it under a unified schema. Before data is stored in a data warehouse, it undergoes cleaning, integration, transformation, loading, and periodic refreshing. Data in a warehouse is organized around major subjects and is usually summarized; for example, information about sales from 6 or 12 months ago is typically available as a summary rather than as individual transactions.



3. Transactional data
A transactional database stores records that are captured as transactions, such as a flight booking, a customer purchase, or a click on a website. Every transaction record has a unique ID and lists the items that make up the transaction.
4. Other types of data
There are many other types of data, distinguished by their structure, semantic meaning, and versatility, that are used in a wide range of applications. A few examples are data streams, engineering design data, sequence data, graph data, spatial data, and multimedia data.

Tasks and Functionalities of Data Mining

Data mining tasks are designed to run semi-automatically or fully automatically on large data sets to uncover patterns such as groups or clusters, unusual records (anomaly detection), and dependencies (association and sequential pattern mining). Once patterns are uncovered, they can be treated as a summary of the input data, and further analysis may be carried out using Machine Learning and predictive analytics. For example, the data mining step might identify multiple groups in the data that a decision support system can then use. Note that data collection,
preparation, and reporting are not part of data mining itself.

There is often confusion between data mining and data analysis. Data analysis is used to test statistical models that fit a data set, for example when analysing a marketing campaign, whereas data mining uses Machine Learning and mathematical and statistical models to discover patterns hidden in the data. Data mining activities can be divided into two categories:

o Descriptive Data Mining: This produces knowledge that describes what is happening within the data, without any prior hypothesis. Common features of the data set are highlighted, for example counts and averages.
o Predictive Data Mining: This uses previously available or historical data to make predictions about critical business metrics. Examples include predicting the volume of business next quarter based on performance in previous quarters over several years, or judging from the findings of a patient's medical examinations whether the patient is suffering from a particular disease.

Functionalities of Data Mining

Data mining functionalities are used to represent the type of patterns that have to be discovered in
data mining tasks. Data mining tasks can be classified into two types: descriptive and predictive.



Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.

Data mining is used extensively in many areas and sectors, both to characterize data and to predict future values. The data mining functionalities listed below describe the kinds of patterns that organized, scientific mining methods can discover:

1. Class/Concept Descriptions

A class or concept is defined by a set of data or features. A class can be a category of items on a shop floor, while a concept is a more abstract idea by which data may be categorized, such as products to be put on clearance sale versus non-sale products. Descriptions of classes and concepts serve two purposes: one helps with grouping and the other helps with differentiating.

o Data Characterization: This refers to the summary of general characteristics or features


of the class, resulting in specific rules that define a target class. A data analysis technique
called Attribute-oriented Induction is employed on the data set for achieving
characterization.
o Data Discrimination: Discrimination is used to separate distinct data sets based on the disparity in their attribute values. It compares the features of a class with the features of one or more contrasting classes, and the results are often presented as bar charts, curves, or pie charts.

2. Mining Frequent Patterns

One of the functions of data mining is finding data patterns. Frequent patterns are things that are
discovered to be most common in data. Various types of frequency can be found in the dataset.

o Frequent item set: This term refers to a group of items that are commonly found together, such as milk and sugar.



o Frequent substructure: It refers to the various types of data structures that can be
combined with an item set or subsequences, such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a phone followed by a
cover.

3. Association Analysis

It analyses the set of items that generally occur together in a transactional dataset. It is also known
as Market Basket Analysis for its wide use in retail sales. Two parameters are used for determining
the association rules:

o Support identifies how frequently the item set appears in the database.


o Confidence is the conditional probability that an item occurs when another item occurs in
a transaction.

4. Classification

Classification is a data mining technique that categorizes items in a collection based on some predefined properties. It uses methods such as if-then rules, decision trees, or neural networks to predict the class of an item. A training set containing items whose classes are known is used to train the model, which can then predict the class of items in a new, unlabelled collection.
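For illustration, a small Python sketch (assuming the scikit-learn library is available; the training records, attribute values, and class labels are made up) of training a decision tree on items with known classes and predicting the class of a new item:

# Illustrative only: decision-tree classification on a tiny, made-up training set.
from sklearn.tree import DecisionTreeClassifier

# Each record is [age, income]; the labels are hypothetical credit-risk classes.
X_train = [[25, 30000], [40, 60000], [35, 25000], [50, 80000], [23, 20000]]
y_train = ["high", "low", "high", "low", "high"]

clf = DecisionTreeClassifier(max_depth=2)   # learns simple if-then style splits
clf.fit(X_train, y_train)                   # train on items whose class is known

print(clf.predict([[30, 45000]]))           # predict the class of an unseen item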

5. Prediction

Prediction is used to estimate missing or unavailable data values and future trends. A value can be predicted for an object based on the attribute values of that object and of similar objects. Predictions may concern missing numerical values or increasing/decreasing trends in time-related data. There are primarily two types of predictions in data mining: numeric and class predictions.

o Numeric predictions are made by creating a linear regression model that is based on
historical data. Prediction of numeric values helps businesses ramp up for a future event
that might impact the business positively or negatively.
o Class predictions are used to fill in missing class information for products using a training
data set where the class for products is known.

6. Cluster Analysis

Clustering is a popular data mining functionality in image processing, pattern recognition, and bioinformatics. It is similar to classification, except that the classes are not predefined and no class labels are known in advance. Clustering algorithms group data objects based on the similarity of their attribute values, so that similar objects end up in the same cluster and dissimilar objects in different clusters.
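As a sketch (assuming scikit-learn; the 2-D points are made up), k-means groups unlabeled points into clusters with no predefined classes:

# Illustrative only: grouping unlabeled points by similarity with k-means.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)            # cluster id discovered for each point
print(km.cluster_centers_)   # centroid of each discovered group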

7. Outlier Analysis

Outlier analysis is important for understanding the quality of data: if there are too many outliers, the data cannot be trusted and reliable patterns cannot be drawn from it. Outlier analysis determines whether there is something unusual in the data and whether it indicates a situation that a business needs to consider and take measures to mitigate. Data objects that cannot be grouped into any class or cluster by the algorithms are flagged as outliers.

8. Evolution and Deviation Analysis

Evolution analysis pertains to the study of data sets that change over time. Evolution analysis models are designed to capture evolutionary trends in data, helping to characterize, classify, cluster, or discriminate time-related data.

9. Correlation Analysis

Correlation is a mathematical technique for determining whether, and how strongly, two attributes are related to one another. It measures how closely two numerically measured continuous variables are linked. Researchers can use this type of analysis to check for possible correlations between the variables in their study.
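A short sketch (using NumPy; the two attribute series are made-up values) of measuring how strongly two numeric attributes are linked:

# Illustrative only: Pearson correlation between two numeric attributes.
import numpy as np

ad_spend = np.array([10, 20, 30, 40, 50])
sales    = np.array([12, 25, 31, 45, 52])

r = np.corrcoef(ad_spend, sales)[0, 1]    # coefficient lies in [-1, 1]
print(f"Pearson correlation: {r:.3f}")    # near +1 means a strong positive link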

Interestingness Patterns

A data mining system has the potential to generate thousands or even millions of patterns, or rules.
This raises some serious questions for data mining:
A pattern is interesting if
(1) it is easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An
interesting pattern represents knowledge. Several objective measures of pattern interestingness
exist. These are based on the structure of discovered patterns and the statistics underlying them.



An objective measure for association rules of the form X ⇒ Y is rule support, representing the percentage of data samples that the given rule satisfies. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. It is defined as the conditional probability that a pattern Y is true given that X is true. More formally, support and confidence are defined as

support(X ⇒ Y) = P(X ∪ Y)

confidence(X ⇒ Y) = P(Y | X)
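A minimal sketch in plain Python (the transactions and the rule {milk} ⇒ {sugar} are made up) of computing these two measures over a toy transaction list:

# Illustrative only: support and confidence of the rule X => Y.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "bread"},
]
X, Y = {"milk"}, {"sugar"}

both = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing X and Y
only_x = sum(1 for t in transactions if X <= t)       # transactions containing X

support = both / len(transactions)                    # P(X and Y together)
confidence = both / only_x                            # P(Y | X)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")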

A Classification of Data Mining Systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation,
inductive logic programming, or high-performance computing. Depending on the kinds of data to
be mined or on the given data mining application, the data mining system may also integrate
techniques from spatial data analysis, information retrieval, pattern recognition, image analysis,
signal processing, computer graphics, Web technology, economics, or psychology. Because of the
diversity of disciplines contributing to data mining, data mining research is expected to generate a
large variety of data mining systems. Therefore, it is necessary to provide a clear classification of
data mining systems. Such a classification may help potential users distinguish data mining
systems and identify those that best match their needs. Data mining systems can be categorized
according to various criteria, as follows. Classification according to the kinds of databases mined.
A data mining system can be classified according to the kinds of databases mined. Database
systems themselves can be classified according to different criteria (such as data models, or the
types of data or applications involved), each of which may require its own data mining technique.
Data mining systems can therefore be classified accordingly. For instance, if classifying according
to data models, we may have a relational, transactional, object-oriented, object-relational, or data
warehouse mining system. If classifying according to the special types of data handled, we may
have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining
system. Other system types include heterogeneous data mining systems, and legacy data mining
systems. Classification according to the kinds of knowledge mined. Data mining systems can be
categorized according to the kinds of knowledge they mine, i.e., based on data mining
functionalities, such as characterization, discrimination, association, classification, clustering,
trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data
mining system usually provides multiple and/or integrated data mining functionalities. Moreover,
data mining systems can also be distinguished based on the granularity or levels of abstraction of
the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-
level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels
of abstraction). An advanced data mining system should facilitate the discovery of knowledge at
multiple levels of abstraction.

Classification according to the kinds of techniques utilized. Data mining systems can
also be categorized according to the underlying data mining techniques employed. These
techniques can be described according to the degree of user interaction involved (e.g., autonomous
systems, interactive exploratory systems, query-driven systems), or the methods of data analysis
employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning,
statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data
mining system will often adopt multiple data mining techniques or work out an effective,
integrated technique which combines the merits of a few individual approaches.

Classification of Data Mining Systems

Data mining refers to the process of extracting useful knowledge from raw data. It analyses the data patterns in huge data sets with the help of software tools, and ever since its development it has been widely adopted by researchers and practitioners.

With data mining, businesses can gain more profit. It has helped not only in understanding customer demand but also in developing effective strategies to improve overall business turnover, and it supports clear, well-founded business decisions.

Data collection, data warehousing, and computer processing are some of the strongest pillars of data mining. Data mining uses mathematical algorithms to segment the data and assess the likelihood of future events.

To understand the system and meet the desired requirements, data mining can be classified into
the following systems:

• Classification based on the mined Databases


• Classification based on the type of mined knowledge
• Classification based on statistics
• Classification based on Machine Learning
• Classification based on visualization
• Classification based on Information Science
• Classification based on utilized techniques
• Classification based on adapted applications



Classification Based on the mined Databases
A data mining system can be classified based on the types of databases that have been mined. A
database system can be further segmented based on distinct principles, such as data models, types
of data, etc., which further assist in classifying a data mining system.

For example, if we want to classify a database based on the data model, we need to select either
relational, transactional, object-relational or data warehouse mining systems.

Classification Based on the type of Knowledge Mined


A data mining system categorized based on the kind of knowledge mined may have the following functionalities:

• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
Classification Based on the Techniques Utilized
A data mining system can also be classified based on the techniques it employs. These techniques can be described according to the degree of user interaction involved or the methods of data analysis employed.

Classification Based on the Applications Adapted


Data mining systems classified based on the applications they are adapted to include the following:

• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
Examples of Classification Task
Following are some of the main examples of classification tasks:

• Classification helps in determining tumor cells as benign or malignant.


• Classification of credit card transactions as fraudulent or legitimate.
• Classification of secondary structures of protein as alpha-helix, beta-sheet, or random coil.
• Classification of news stories into distinct categories such as finance, weather,
entertainment, sports, etc.
Data mining Task primitives
A data mining query is defined in terms of the following primitives

1.Task-relevant data: This is the portion of the database to be investigated. For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify that only the data relevant to this task be retrieved; the attributes involved are referred to as relevant attributes.

2.Type of knowledge to be mined: This specifies the data mining functions to be performed, such
as characterization, discrimination, association, classification, clustering, or evolution analysis.
For instance, if studying the buying habits of customers in Canada, you may choose to mine
associations between customer profiles and the items that these customers like to buy

3.Background knowledge: Users can specify background knowledge, or knowledge about the
domain to be mined. This knowledge is useful for guiding the knowledge discovery process, and
for evaluating the patterns found. There are several kinds of background knowledge.

4.Interestingness measures of patterns: These functions are used to separate uninteresting patterns from knowledge. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures.
5. Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for knowledge
presentation, such as rules, tables, charts, graphs, decision trees, and cubes.

Integration Schemes of a Data Mining System with a Database or Data Warehouse System

No Coupling

In the no-coupling scheme, the data mining system does not use any functions of a database or data warehouse system.

Drawbacks:
First, a Database/Data Warehouse system provides a great deal of flexibility and efficiency at
storing, organizing, accessing, and processing data.

Without using a Database/Data Warehouse system, a Data Mining system may spend a substantial
amount of time finding, collecting, cleaning, and transforming data.

Second, there are many tested, scalable algorithms and data structures implemented in Database
and Data Warehouse systems.
Loose Coupling

In loose coupling, data mining utilizes some of the database or data warehouse system
functionalities. It mainly fetches the data from the data repository managed by these systems and
then performs data mining. The results are kept either in the file or any designated place in the
database or data warehouse.

Drawbacks
It's difficult for loose coupling to achieve high scalability and good performance with large data
sets.

Semi-Tight Coupling (Enhanced Data Mining Performance)

The semi-tight coupling means that besides linking a Data Mining system to a Database/Data
Warehouse system, efficient implementations of a few essential data mining primitives (identified
by the analysis of frequently encountered data mining functions) can be provided in the
Database/Data Warehouse system.

These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way join,
and pre-computation of some essential statistical measures, such as sum, count, max, min, standard
deviation.

This design will enhance the performance of Data Mining systems.

Tight Coupling (A Uniform Information Processing Environment)

Tight coupling means that a Data Mining system is smoothly integrated into the Database/Data
Warehouse system.

The data mining subsystem is treated as one functional component of the information system.

Data mining queries and functions are optimized based on mining query analysis, data structures,
indexing schemes, and query processing methods of a Database or Data Warehouse system.

Major issues in data mining

This section addresses major issues in data mining regarding mining methodology, user interaction, performance, and diverse data types. These issues are introduced below:



1.Mining methodology and user-interaction issues. These reflect the kinds of knowledge mined,
the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad-hoc
mining, and knowledge visualization.
Mining different kinds of knowledge in databases.

Since different users can be interested in different kinds of knowledge, data mining should cover
a wide spectrum of data analysis and knowledge discovery tasks, including data characterization,
discrimination, association, classification, clustering, trend and deviation analysis, and similarity
analysis. These tasks may use the same database in different ways and require the development of
numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction.
Since it is difficult to know exactly what can be discovered within a database, the data mining
process should be interactive. For databases containing a huge amount of data, appropriate
sampling technique can first be applied to facilitate interactive data exploration. Interactive mining
allows users to focus the search for patterns, providing and refining data mining requests based on
returned results. Specifically, knowledge should be mined by drilling-down, rolling-up, and
pivoting through the data space and knowledge space interactively, similar to what OLAP can do
on data cubes. In this way, the user can interact with the data mining system to view data and
discovered patterns at multiple granularities and from different angles.

Incorporation of background knowledge.


Background knowledge, or information regarding the domain under study, may be used to guide
the discovery process and allow discovered patterns to be expressed in concise terms and at
different levels of abstraction. Domain knowledge related to databases, such as integrity
constraints and deduction rules, can help focus and speed up a data mining process, or judge the
interestingness of discovered patterns.

Data mining query languages and ad-hoc data mining.


Relational query languages (such as SQL) allow users to pose ad-hoc queries for data retrieval. In
a similar vein, high-level data mining query languages need to be developed to allow users to
describe ad-hoc data mining tasks by facilitating the specification of the relevant sets of data for



analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and
interestingness constraints to be enforced on the discovered patterns. Such a language should be
integrated with a database or data warehouse query language, and optimized for efficient and
flexible data mining.

Presentation and visualization of data mining results.


Discovered knowledge should be expressed in high-level languages, visual representations, or
other expressive forms so that the knowledge can be easily understood and directly usable by
humans. This is especially crucial if the data mining system is to be interactive. This requires the
system to adopt expressive knowledge representation techniques, such as trees, tables, rules,
graphs, charts, crosstabs, matrices, or curves.

Handling outlier or incomplete data.


The data stored in a database may reflect noise, outliers, exceptional cases, or incomplete data objects. These objects may confuse the analysis process, causing overfitting of the data to the knowledge model constructed. As a result, the accuracy of the discovered patterns can be poor.
Data cleaning methods and data analysis methods which can handle outliers are required. While
most methods discard outlier data, such data may be of interest in itself such as in fraud detection
for finding unusual usage of tele-communication services or credit cards. This form of data
analysis is known as outlier mining.

Pattern evaluation: the interestingness problem.


A data mining system can uncover thousands of patterns. Many of the patterns discovered may be
uninteresting to the given user, representing common knowledge or lacking novelty. Several
challenges remain regarding the development of techniques to assess the interestingness of
discovered patterns, particularly with regard to subjective measures which estimate the value of
patterns with respect to a given user class, based on user beliefs or expectations. The use of
interestingness measures to guide the discovery process and reduce the search space is another
active area of research.



2.Performance issues. These include efficiency, scalability, and parallelization of data mining
algorithms.

Efficiency and scalability of data mining algorithms.


To effectively extract information from a huge amount of data in databases, data mining algorithms
must be efficient and scalable. That is, the running time of a data mining algorithm must be
predictable and acceptable in large databases. Algorithms with exponential or even medium-order
polynomial complexity will not be of practical use. From a database perspective on knowledge
discovery, efficiency and scalability are key issues in the implementation of data mining systems.
Many of the issues discussed above under mining methodology and user-interaction must also
consider efficiency and scalability.

Parallel, distributed, and incremental updating algorithms.


The huge size of many databases, the wide distribution of data, and the computational complexity
of some data mining methods are factors motivating the development of parallel and distributed
data mining algorithms. Such algorithms divide the data into partitions, which are processed in
parallel. The results from the partitions are then merged. Moreover, the high cost of some data
mining processes promotes the need for incremental data mining algorithms which incorporate
database updates without having to mine the entire data again from scratch. Such algorithms
perform knowledge modification incrementally to amend and strengthen what was previously
discovered.
3.Issues relating to the diversity of database types.
Handling of relational and complex types of data.
There are many kinds of data stored in databases and data warehouses. Since relational databases
and data warehouses are widely used, the development of efficient and effective data mining
systems for such data is important. However, other databases may contain complex data objects,
hypertext and multimedia data, spatial data, temporal data, or transaction data. It is unrealistic to
expect one system to mine all kinds of data due to the diversity of data types and different goals
of data mining. Specific data mining systems should be constructed for mining specific kinds of
data. Therefore, one may expect to have different data mining systems for different kinds of data.



Mining information from heterogeneous databases and global information systems. Local and
wide-area computer networks (such as the Internet) connect many sources of data, forming huge,
distributed, and heterogeneous databases. The discovery of knowledge from different sources of
structured, semi-structured, or unstructured data with diverse data semantics poses great
challenges to data mining. Data mining may help disclose high-level data regularities in multiple
heterogeneous databases that are unlikely to be discovered by simple query systems and may
improve information exchange and interoperability in heterogeneous databases.

Major Issues in Data Warehousing and Mining


• Mining methodology and user interaction

– Mining different kinds of knowledge in databases

– Interactive mining of knowledge at multiple levels of abstraction

– Incorporation of background knowledge

– Data mining query languages and ad-hoc data mining

– Expression and visualization of data mining results

– Handling noise and incomplete data

– Pattern evaluation: the interestingness problem

• Performance and scalability

– Efficiency and scalability of data mining algorithms

– Parallel, distributed and incremental mining methods

• Issues relating to the diversity of data types

– Handling relational and complex types of data

– Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts

– Application of discovered knowledge

• Domain-specific data mining tools



Data Preprocessing

Data cleaning.
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.

(i). Missing values

1.Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.

2.Fill in the missing value manually: In general, this approach is time-consuming and may not
be feasible given a large data set with many missing values.

3.Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like “Unknown". If missing values are replaced by, say,
“Unknown", then the mining program may mistakenly think that they form an interesting concept,
since they all have a value in common - that of “Unknown". Hence, although this method is
simple, it is not recommended.



4.Use the attribute mean to fill in the missing value: For example, suppose that the average
income of All Electronics customers is $28,000. Use this value to replace the missing value for
income.

5.Use the attribute mean for all samples belonging to the same class as the given tuple: For
example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple.

6.Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction. For example, using
the other customer attributes in your data set, you may construct a decision tree to predict the
missing values for income.
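A brief sketch of strategies 4 and 5 above (assuming the pandas library is available; the income figures and credit-risk labels are made up):

# Illustrative only: filling missing 'income' values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [28000, np.nan, 35000, np.nan, 40000],
    "credit_risk": ["low", "low", "high", "high", "high"],
})

# Strategy 4: replace missing values with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 5: replace missing values with the mean of the same class (credit risk).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

print(df)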

(ii). Noisy data

Noise is a random error or variance in a measured variable.

1.Binning methods:
Binning methods smooth a sorted data value by consulting the ”neighborhood", or values around
it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing. Figure illustrates some binning
techniques.

In this example, the data for price are first sorted and partitioned into equi-depth bins (of depth
3). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For
example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this
bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which
each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.



(i) Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii) Partition into (equi-depth) bins:
✓ Bin 1: 4, 8, 15
✓ Bin 2: 21, 21, 24
✓ Bin 3: 25, 28, 34
(iii) Smoothing by bin means:
✓ Bin 1: 9, 9, 9
✓ Bin 2: 22, 22, 22
✓ Bin 3: 29, 29, 29
(iv) Smoothing by bin boundaries:
✓ Bin 1: 4, 4, 15
✓ Bin 2: 21, 21, 24
✓ Bin 3: 25, 25, 34
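The same smoothing-by-bin-means step can be written as a short plain-Python sketch (no libraries assumed) over the sorted price data above:

# Illustrative only: equi-depth binning (depth 3) and smoothing by bin means.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
depth = 3

smoothed = []
for i in range(0, len(prices), depth):
    bin_values = prices[i:i + depth]
    mean = sum(bin_values) / len(bin_values)    # every value becomes the bin mean
    smoothed.extend([round(mean)] * len(bin_values))

print(smoothed)   # [9, 9, 9, 22, 22, 22, 29, 29, 29]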

2. Clustering:

Outliers may be detected by clustering, where similar values are organized into groups or
“clusters”. Intuitively, values which fall outside of the set of clusters may be considered outliers.

Figure: Outliers may be detected by clustering analysis.

3. Combined computer and human inspection: Outliers may be identified through a


combination of computer and human inspection. In one application, for example, an information-
theoretic measure was used to help identify outlier patterns in a handwritten character database
for classification. The measure's value reflected the “surprise" content of the predicted character
label with respect to the known label. Outlier patterns may be informative or “garbage". Patterns
whose surprise content is above a threshold are output to a list. A human can then sort through the
patterns in the list to identify the actual garbage ones.

4.Regression: Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the “best" line to fit two variables, so that one variable can be
used to predict the other. Multiple linear regression is an extension of linear regression, where
more than two variables are involved and the data are fit to a multidimensional surface.

(iii). Inconsistent data

There may be inconsistencies in the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external references. For example, errors made
at data entry may be corrected by performing a paper trace. This may be coupled with routines
designed to help correct the inconsistent use of codes. Knowledge engineering tools may also be
used to detect the violation of known data constraints. For example, known functional
dependencies between attributes can be used to find values contradicting the functional
constraints.

Data transformation.
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following: Normalization, Smoothing, Aggregation and
Generalization

1.Normalization, where the attribute data are scaled so as to fall within a small specified range,
such as -1.0 to 1.0, or 0 to 1.0.

There are three main methods for data normalization: min-max normalization, z-score normalization, and normalization by decimal scaling.



(i) Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing

v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

(ii) In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v - meanA) / std_devA

where meanA and std_devA are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers which dominate the min-max normalization.

(iii) Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
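A compact sketch of the three methods (plain Python; the income values, the new range [0, 1], and the choice of v are made up for illustration):

# Illustrative only: min-max, z-score, and decimal-scaling normalization.
import math

values = [12000, 28000, 73600, 98000]
v = 73600

# (i) Min-max normalization to the new range [0.0, 1.0].
min_a, max_a = min(values), max(values)
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0

# (ii) z-score normalization.
mean_a = sum(values) / len(values)
std_a = math.sqrt(sum((x - mean_a) ** 2 for x in values) / len(values))
v_zscore = (v - mean_a) / std_a

# (iii) Decimal scaling: divide by 10^j so every |value| falls below 1.
j = len(str(int(max(abs(x) for x in values))))   # digits in the largest absolute value
v_decimal = v / 10 ** j

print(v_minmax, v_zscore, v_decimal)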

2.Smoothing, which works to remove noise from the data. Such techniques include binning, clustering, and regression.

3.Aggregation, where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total amounts.

4.Generalization of the data, where low level or 'primitive' (raw) data are replaced by higher
level concepts through the use of concept hierarchies. For example, categorical attributes, like
street, can be generalized to higher level concepts, like city or county.
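As a sketch of points 3 and 4 (assuming the pandas library; the dates, streets, and the street-to-city concept hierarchy are made up):

# Illustrative only: aggregation to monthly totals and generalization of 'street' to 'city'.
import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-03", "2024-01-17", "2024-02-05"]),
    "street": ["Main St", "Oak Ave", "Main St"],
    "amount": [120.0, 80.0, 150.0],
})

# Aggregation: daily amounts rolled up to monthly totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

# Generalization: replace the low-level 'street' attribute using a concept hierarchy.
street_to_city = {"Main St": "Springfield", "Oak Ave": "Shelbyville"}
sales["city"] = sales["street"].map(street_to_city)

print(monthly)
print(sales[["date", "city", "amount"]])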

Data reduction.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.

Strategies for data reduction include the following.


(i) Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
(ii) Dimension reduction, where irrelevant, weakly relevant or redundant attributes or
dimensions may be detected and removed.
(iii)Data compression, where encoding mechanisms are used to reduce the data set size.
(iv) Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model
parameters instead of the actual data), or nonparametric methods such as clustering,
sampling, and the use of histograms.
(v) Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining
of data at multiple levels of abstraction, and are a powerful tool for data mining.

Data Cube Aggregation


• The lowest level of a data cube
- the aggregated data for an individual entity of interest
- e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
- Further reduce the size of data to deal with
• Reference appropriate levels
- Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using data cube, when
possible
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):



– Select a minimum set of features such that the probability distribution of different classes
given the values for those features is as close as possible to the original distribution given
the values of all features
– reduce the number of attributes appearing in the patterns, making the patterns easier to understand
Heuristic methods:
1.Step-wise forward selection: The procedure starts with an empty set of attributes. The best of
the original attributes is determined and added to the set. At each subsequent iteration or step, the
best of the remaining original attributes is added to the set.

2.Step-wise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.

3.Combination of forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

4.Decision tree induction: Decision tree algorithms, such as ID3 and C4.5, were originally
intended for classification. Decision tree induction constructs a flow-chart-like structure where
each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome
of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm
chooses the “best" attribute to partition the data into individual classes.
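A rough sketch of the step-wise forward selection idea from point 1 above (plain Python with NumPy; the toy attribute matrix, class labels, and the simple correlation-based relevance score are all assumptions, not the textbook's procedure):

# Illustrative only: greedy forward selection using a placeholder relevance score.
import numpy as np

X = np.array([[1, 5, 0.2], [2, 3, 0.1], [3, 6, 0.4], [4, 2, 0.3], [5, 7, 0.9]])
y = np.array([0, 0, 1, 1, 1])
names = ["A1", "A2", "A3"]

def score(subset):
    # Placeholder measure: sum of |correlation with the class| over the subset.
    return sum(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    best = max(remaining, key=lambda j: score(selected + [j]))
    if score(selected + [best]) <= score(selected):
        break                      # stop when no remaining attribute improves the score
    selected.append(best)
    remaining.remove(best)

print([names[j] for j in selected])   # attributes chosen, best first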

Data compression
In data compression, data encoding or transformations are applied so as to obtain a reduced or
”compressed" representation of the original data. If the original data can be reconstructed from
the compressed data without any loss of information, the data compression technique used is
called lossless. If, instead, we can reconstruct only an approximation of the original data, then the
data compression technique is called lossy. Two popular and effective methods of lossy data compression are wavelet transforms and principal components analysis.

Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector, D', of wavelet coefficients. The two vectors are of the same length.

The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique
involving sines and cosines. In general, however, the DWT achieves better lossy compression.

The general algorithm for a discrete wavelet transform is as follows.


1. The length, L, of the input data vector must be an integer power of two. This condition
can be met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions. The first applies some data smoothing,
such as a sum or weighted average. The second performs a weighted difference.



3. The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In general, these respectively represent a smoothed version of the input data and its high-frequency content.
4. The two functions are recursively applied to the sets of data obtained in the previous loop,
until the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are designated the
wavelet coefficients of the transformed data.
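A simplified sketch of this scheme using unnormalized Haar averaging and differencing (plain Python; the input vector is a made-up example whose length is a power of two):

# Illustrative only: a basic Haar-style discrete wavelet transform.
def haar_dwt(data):
    coeffs = []
    while len(data) > 1:
        smooth = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        detail = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs = detail + coeffs    # keep the high-frequency (difference) part
        data = smooth               # recurse on the smoothed, half-length vector
    return data + coeffs            # overall average followed by the detail coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))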

Principal components analysis


Principal components analysis (PCA) searches for c orthogonal vectors that can best be used to represent the data, where c is much smaller than N, the number of original attributes. The original data are thus projected onto a much smaller space, resulting in data compression. PCA can therefore be used as a form of dimensionality reduction: the initial data are projected onto this smaller set of axes.

The basic procedure is as follows.

1. The input data are normalized, so that each attribute falls within the same range. This step helps
ensure that attributes with large domains will not dominate attributes with smaller domains.
2. PCA computes N orthonormal vectors which provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These vectors
are referred to as the principal components. The input data are a linear combination of the
principal components.
3. The principal components are sorted in order of decreasing “significance" or strength. The
principal components essentially serve as a new set of axes for the data, providing important
information about variance.
4. Since the components are sorted according to decreasing order of “significance", the size of the data can be reduced by eliminating the weaker components, i.e., those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
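A rough sketch of these steps with NumPy (the 2-D data matrix is made up, and keeping only one component is an arbitrary illustrative choice):

# Illustrative only: PCA via the covariance matrix of normalized data.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

Xn = (X - X.mean(axis=0)) / X.std(axis=0)     # step 1: normalize each attribute
cov = np.cov(Xn, rowvar=False)                # covariance of the normalized data
eigvals, eigvecs = np.linalg.eigh(cov)        # step 2: orthonormal basis vectors

order = np.argsort(eigvals)[::-1]             # step 3: sort by decreasing strength
components = eigvecs[:, order[:1]]            # step 4: keep the strongest component

reduced = Xn @ components                     # project the data onto the smaller space
print(reduced.shape)                          # (5, 1): two attributes reduced to one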



Numerosity reduction

Regression and log-linear models
Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation

Y = α + β X

where the variance of Y is assumed to be constant. The coefficients α and β can be solved for by the method of least squares, which minimizes the error between the actual line separating the data and the estimate of the line.
Multiple regression is an extension of linear regression allowing a response variable Y to be
modeled as a linear function of a multidimensional feature vector.
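A short NumPy sketch (the (X, Y) pairs are made-up values) of fitting Y = α + βX by least squares, so that only the two coefficients need to be stored instead of the raw points:

# Illustrative only: least-squares fit of a straight line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.0, 8.2, 9.9])

A = np.column_stack([np.ones_like(x), x])            # design matrix [1, X]
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"Y = {alpha:.2f} + {beta:.2f} * X")           # store only these two numbers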

Log-linear models approximate discrete multidimensional probability distributions. The method


can be used to estimate the probability of each cell in a base cuboid for a set of discretized
attributes, based on the smaller cuboids making up the data cube lattice

Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets.
The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically
reflects the average frequency of the values represented by the bucket.
1. Equi-width: In an equi-width histogram, the width of each bucket range is constant (for example, a width of $10 for each bucket).
2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that,
roughly, the frequency of each bucket is constant (that is, each bucket contains roughly
the same number of contiguous data samples).
3. V-Optimal: If we consider all of the possible histograms for a given number of buckets,
the V-optimal histogram is the one with the least variance. Histogram variance is a
weighted sum of the original values that each bucket represents, where bucket weight is
equal to the number of values in the bucket.
4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair of adjacent values whose difference is among the β-1 largest, where β, the number of buckets, is user-specified.
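A small NumPy sketch contrasting equi-width and equi-depth bucket boundaries on the price data used earlier (three buckets is an arbitrary choice):

# Illustrative only: equi-width vs. equi-depth bucket boundaries.
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

counts, edges = np.histogram(prices, bins=3)          # equi-width: equal bucket ranges
print("equi-width edges:", edges, "counts:", counts)

depth_edges = np.quantile(prices, [0, 1/3, 2/3, 1])   # equi-depth: ~equal counts per bucket
print("equi-depth edges:", depth_edges)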



Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups or
clusters, so that objects within a cluster are “similar" to one another and “dissimilar" to objects in
other clusters. Similarity is commonly defined in terms of how “close" the objects are in space,
based on a distance function. The “quality" of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster. Centroid distance is an alternative
measure of cluster quality, and is defined as the average distance of each cluster object from the
cluster centroid.

Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be
represented by a much smaller random sample (or subset) of the data. Suppose that a large data
set, D, contains N tuples. Let's have a look at some possible samples for D.

1.Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely to be drawn.
2.Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a
tuple is drawn, it is placed back in D so that it may be drawn again.
3.Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters", then an SRS of m clusters can be obtained, where m < M. A reduced data representation can be obtained by applying, say, SRSWOR to the clusters, resulting in a cluster sample of the tuples.
4.Stratified sample: If D is divided into mutually disjoint parts called “strata", a stratified sample
of D is generated by obtaining a SRS at each stratum. This helps to ensure a representative sample,
especially when the data are skewed. For example, a stratified sample may be obtained from
customer data, where a stratum is created for each customer age group.
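A brief sketch of SRSWOR, SRSWR, and a stratified sample (standard-library Python only; the tuples and the age-group strata are made up):

# Illustrative only: three sampling schemes over a toy tuple list D.
import random

D = [("t%d" % i, "young" if i < 6 else "senior") for i in range(10)]
n = 4

srswor = random.sample(D, n)                     # without replacement: no tuple repeats
srswr = [random.choice(D) for _ in range(n)]     # with replacement: a tuple may repeat

# Stratified sample: a simple random sample drawn inside each age-group stratum.
strata = {}
for t in D:
    strata.setdefault(t[1], []).append(t)
stratified = [t for group in strata.values() for t in random.sample(group, 2)]

print(srswor, srswr, stratified, sep="\n")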

