Data Mining Unit-I
UNIT-I
Introduction:
What is Data?
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner,
suitable for communication, interpretation, or processing by humans or electronic machines.
In other words, data is a collection of objects defined by attributes.
A data object represents an entity.
- Also called a record, sample, example, instance, data point, object, or tuple.
Examples:
- In a sales database, the objects may be customers, store items, and sales;
- In a medical database, the objects may be patients;
- In a university database, the objects may be students, professors, and courses.
Data objects are described by attributes; in other words, a collection of attributes describes an
object.
Examples: weight of a person, height, temperature, customer_ID, name, address, etc.
Attribute Types:
Attribute values are numbers or symbols assigned to an attribute. The type of an attribute is
determined by the set of possible values it can have: nominal, binary, ordinal, or numeric.
Nominal attribute values do not have any meaningful order and are not quantitative. So
– It makes no sense to find the mean (average) value or median (middle) value for such an
attribute.
– However, we can find the attribute’s most commonly occurring value (mode)
Binary Attributes:
A binary attribute is a special nominal attribute with only two states: 0 or 1, where 0 typically
means that the attribute is absent and 1 means that it is present.
➢ Symmetric Binary Attribute:
A binary attribute is symmetric if both of its states are equally valuable and carry the
same weight.
Example: the attribute gender having the states male and female.
➢ Asymmetric Binary Attribute:
A binary attribute is asymmetric if its two states are not equally important. By convention,
we code the more important outcome, which is usually the rarer one, by 1 (e.g., COVID positive)
and the other by 0 (e.g., COVID negative).
Ordinal Attributes:
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking
among them, but the magnitude between successive values is not known.
• Nominal, binary, and ordinal attributes are collectively referred to as qualitative (categorical) attributes.
Example: An ordinal attribute drink_size corresponds to the size of drinks available at a fast-food
restaurant.
– This attribute has three possible values: small, medium, and large.
– The values have a meaningful sequence (which corresponds to increasing drink size);
– However, we cannot tell from the values how much bigger, say, a large is than a medium.
Ordinal attributes are useful in surveys. In one survey, participants were asked to rate how satisfied
they were as customers.
Customer satisfaction had the following ordinal categories:
0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied.
The central tendency of an ordinal attribute can be represented by its mode and its median (middle
value in an ordered sequence), but the mean cannot be defined.
Interval-Scaled Attributes:
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled
attributes have order and can be positive, zero, or negative. We can compare and quantify the
difference between values of interval attributes.
Examples:
A temperature attribute is an interval attribute.
- We can quantify the difference between values. For example, a temperature of 20°C is five
degrees higher than a temperature of 15°C.
Calendar dates are another example of an interval attribute.
- Temperatures in Celsius do not have a true zero point; that is, 0°C does not indicate “no
temperature.”
- Calendar dates do not have a true zero point; that is, the year 0 is not the beginning of time.
Although we can compute the difference between temperature values, we cannot talk of one
temperature value as being a multiple of another.
Without a true zero, we cannot say, for instance, that 10°C is twice as warm as 5°C. That is, we
cannot speak of the values in terms of ratios.
The central tendency of an interval attribute can be represented by its mode, its median (middle value
in an ordered sequence), and its mean.
Ratio Attribute:
A ratio attribute is a numeric attribute with an inherent zero point.
Examples:
➢ number_of_words in a document object.
➢ A count attribute such as years of experience for an employee object.
➢ Attributes to measure weight, height, latitude, and longitude coordinates.
➢ With an amount attribute we can say “you are 100 times richer with $100 than with $1”.
- If a measurement is ratio scaled, we can speak of a value as being a multiple (or ratio) of another
value.
The central tendency of a ratio attribute can be represented by its mode, its median (middle value
in an ordered sequence), and its mean.
Mean:
The most common and effective numeric measure of the “center” of a set of data is the (arithmetic)
mean. Let x1, x2, …, xN be a set of N values or observations of some numeric attribute X, such as
salary.
The mean of this set of values is
x̄ = (x1 + x2 + … + xN) / N
Example: Mean. Suppose we have the following values for salary (in thousands of dollars), shown
in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Using the above equation,
we have
x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 696 / 12 = 58
Thus, the mean salary is $58,000.
Median:
The median is the middle value of an ordered set of data; it separates the higher half of the values
from the lower half.
Example: Median. Suppose we have the following values for salary (in thousands of dollars), shown
in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
There is an even number of observations (i.e., 12); therefore, the median is not unique. It can be any
value within the two middlemost values of 52 and 56 (that is, the sixth and seventh values in
the list). By convention, we assign the average of the two middlemost values as the median;
that is
median = (52 + 56) / 2 = 54
Thus, the median salary is $54,000.
Mode:
Another measure of central tendency is the mode. The mode for a set of data is the value that occurs
most frequently in the set.
It is possible for the greatest frequency to correspond to several different values, which results in
more than one mode.
- Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively.
- At the other extreme, if each data value occurs only once, then there is no mode.
Example: Mode. Suppose we have the following values for salary (in thousands of dollars), shown
in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The above data set is bimodal; the two modes are 52 and 70.
Midrange:
The midrange can also be used to assess the central tendency of a numeric data set. It is the
average of the largest and smallest values in the set.
Example: Midrange. Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The midrange of the data is (30 + 110) / 2 = 70, that is, $70,000.
Example:
What are central tendency measures (mean, median, mode) for the following attributes?
Solution:
attr1 = {2,4,4,6,8,24}
mean = (2+4+4+6+8+24)/6 = 8 average of all values
median = (4+6)/2 = 5 avg. of two middle values
mode = 4 most frequent item
attr2 = {2,4,7,10,12}
mean = (2+4+7+10+12)/5 = 7 average of all values
median = 7 middle value
mode = none (no mode) all values have the same frequency
attr3 = {xs, s, s, s, m, m, l}
mean is meaningless for categorical attributes.
median = s middle value
mode = s most frequent item
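The central tendency measures above can be computed directly with Python's standard statistics module; a minimal sketch using the salary data from the earlier examples:

```python
from statistics import mean, median, multimode

# Salary values (in thousands of dollars), in increasing order
salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

avg = mean(salary)            # 696 / 12 = 58
mid = median(salary)          # average of the 6th and 7th values: (52 + 56) / 2 = 54
modes = multimode(salary)     # most frequent values: [52, 70] (bimodal)
midrange = (min(salary) + max(salary)) / 2  # (30 + 110) / 2 = 70
```

These agree with the worked examples above: mean 58, median 54, modes 52 and 70, midrange 70 (all in thousands of dollars). Note that `multimode` requires Python 3.8 or newer.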
What is Data Mining?
Data mining is the process of discovering interesting patterns and knowledge from large amounts
of data. The data sources can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
(or)
Data Mining is a process of finding potentially useful patterns and valuable information in huge
amounts of data.
(or)
Data Mining is all about discovering hidden, unsuspected, and previously unknown yet valid
relationships amongst the data.
(or)
Data Mining: Transforming tremendous amounts of data into organized knowledge.
Data Mining is also called knowledge discovery, knowledge extraction, data/pattern analysis,
information extraction, data dredging, etc.
Applications of Data Mining:
Insurance: Data mining helps insurance companies to price their products profitably and to decide
whether to approve policy applications, including risk modelling and management for prospective
customers.
Education: Data mining benefits educators to access student data, predict achievement levels and
find students or groups of students who need extra attention, for example, students who are weak
in the maths subject.
Banking: Data mining helps the finance sector to get a view of market risks and manage regulatory
compliance. It helps banks to identify probable defaulters to decide whether to issue credit cards,
loans, etc. Banks and credit card companies use data mining tools to build financial risk models,
detect fraudulent transactions and examine loan and credit applications.
Retail: Data mining techniques help retail malls and grocery stores identify and arrange the most
sellable items in the most attentive positions. It helps store owners to come up with offers which
encourage customers to increase their spending. Online retailers mine customer data and internet
clickstream records to help them target marketing campaigns, ads and promotional offers to
individual shoppers.
Service Providers: Service providers like mobile phone and utility industries use data mining to
predict the reasons why a customer leaves their company. They analyse billing details, customer
service interactions and complaints made to the company to assign each customer a probability
score and offer incentives.
E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells through their
websites. One of the most famous names is Amazon, who use data mining techniques to get more
customers into their eCommerce store.
Supermarkets: Data mining allows supermarkets to develop rules to predict if their shoppers are
likely to be expecting. By evaluating their buying patterns, they can find women customers who are
most likely pregnant and can start targeting products like baby powder, baby soap, diapers and so on.
Entertainment: Streaming services do data mining to analyse what users are watching or listening
to and to make personalized recommendations based on people's viewing and listening habits.
Healthcare: Data mining helps doctors diagnose medical conditions, treat patients and analyse
X-rays and other medical imaging results. Medical research also depends heavily on data mining,
machine learning and other forms of analytics.
Knowledge Discovery from Data (KDD):
Data mining is needed to extract useful information from large datasets and to use it to make
predictions or better decisions. Nowadays, data mining is used in almost all places where a
large amount of data is stored and processed.
Examples: the banking sector, market basket analysis, network intrusion detection.
Data Cleaning:
Data cleaning is defined as removal of noisy and irrelevant/ inconsistent data from data
collection.
• Cleaning in case of Missing values.
• Cleaning noisy data, where noise is a random or variance error.
In this step, the noise and inconsistent data is removed.
Data Integration:
Data integration is defined as heterogeneous data from multiple data sources combined in a
common source (Data Warehouse).
i.e., In this step, multiple data sources may be combined as single data source.
A popular trend in the information industry is to perform data cleaning and data integration as a
data preprocessing step, where the resulting data are stored in a data warehouse.
Data Selection:
Data selection is defined as the process where data relevant to the analysis is decided and retrieved
from the data collection. This step in the KDD process is identifying and selecting the relevant data
for analysis.
Data Transformation:
Data Transformation is defined as the process of transforming data into appropriate form required
by mining procedure. This step involves reducing the data dimensionality, aggregating the data,
normalizing it, and discretizing it to prepare it for further analysis.
Data Mining:
This is the heart of the KDD process and involves applying various data mining techniques to the
transformed data to discover hidden patterns, trends, relationships, and insights. A few of the most
common data mining techniques include clustering, classification, association rule mining, and
anomaly detection.
Pattern Evaluation:
After the data mining, the next step is to evaluate the discovered patterns to determine their
usefulness and relevance. This involves assessing the quality of the patterns, evaluating their
significance, and selecting the most promising patterns for further analysis.
Knowledge Representation:
This step involves representing the knowledge extracted from the data in a way humans can easily
understand and use. This can be done through visualizations, reports, or other forms of
communication that provide meaningful insights into the data.
The data needs to be cleaned, integrated, and selected before passing it to the database or data
warehouse server. As the data is from different sources and in different formats, it cannot be used
directly for the data mining process because the data might not be complete and reliable. So, first
data needs to be cleaned and integrated.
The data mining engine might get inputs from the knowledge base to make the result more accurate
and reliable. The pattern evaluation module interacts with the knowledge base on a regular basis to
get inputs and also to update it.
Basic forms of data for mining:
➢ Database data (relational database)
➢ Data warehouse data
➢ Transactional data
➢ Text data (flat files)
➢ Time-series databases
Other forms of data for mining:
➢ Multimedia databases
➢ Spatial databases
➢ World Wide Web
➢ Transactional data:
A transactional database is a collection of data organized by time stamps, dates, etc. to represent
transactions. In general, each record in a transactional database captures a transaction,
such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
A transaction typically includes a unique transaction identity number (trans ID) and a list of the items
making up the transaction, such as the items purchased in the transaction.
This type of database has the capability to roll back or undo an operation when a transaction is not
completed or committed, and it follows the ACID properties of a DBMS.
Example:
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Popcorn, Coke, Egg, Milk
T4 Popcorn, Bread, Egg, Milk
T5 Coke, Egg, Milk
➢ Spatial database:
A spatial database is a database that is enhanced to store and access spatial data or data that defines
a geometric space. These data are often associated with geographic locations and features, or
constructed features like cities. Data on spatial databases are stored as coordinates, points, lines,
polygons and topology.
Descriptive data mining: Similarities and patterns in data may be discovered using descriptive
data mining.
This kind of mining focuses on transforming raw data into information that can be used in reports
and analyses. It provides certain knowledge about the data, for instance, count, average.
It gives information about what is happening inside the data without any previous idea. It exhibits
the common features in the data. In simple words, you get to know the general properties of the data
present in the database.
Predictive data mining: This kind of mining task performs inference on the current data in
order to make predictions.
This helps the developers in understanding the characteristics that are not explicitly available. For
instance, the prediction of business analysis in the next quarter with the performance of the previous
quarters. In general, the predictive analysis predicts or infers the characteristics with the previously
available data.
The following are data mining functionalities:
• Class/Concept Description
(Characterization and Discrimination)
• Classification
• Prediction
• Association Analysis
• Cluster Analysis
• Outlier Analysis
Class/Concept Description: Characterization and Discrimination:
Data is associated with classes or concepts.
Class: A collection of things sharing a common attribute
Example: Classes of items – computers and printers
Concept: An abstract or general idea derived from specific instances.
Example: Concepts of customers – bigSpenders and budgetSpenders.
It can be useful to describe individual classes and concepts in summarized, concise, and yet precise
terms. Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived using data characterization and data discrimination, or both.
Data characterization:
Data characterization is a summarization of the general characteristics or features of a target class of
data.
Data summarization can be done based on statistical measures and plots.
The output of data characterization can be presented in various forms it includes pie charts, bar charts,
curves, and multidimensional data cubes.
Example:
A customer relationship manager at AllElectronics may order the following data mining task:
Summarize the characteristics of customers who spend more than $5000 a year at AllElectronics.
The result is a general profile of these customers, such as that they are 40 to 50 years old, employed,
and have excellent credit ratings.
Data discrimination:
Data discrimination is one of the functionalities of data mining. It compares the data between the two
classes. Generally, it maps the target class with a predefined group or class. It compares and contrasts
the characteristics of the class with the predefined class using a set of rules called discriminate rules.
Example:
A customer relationship manager at AllElectronics may want to compare two groups of customers
those who shop for computer products regularly (e.g., more than twice a month) and those who rarely
shop for such products (e.g., less than three times a year).
The resulting description provides a general comparative profile of these customers, such as that
80% of the customers who frequently purchase computer products are between 20 and 40 years old
and have a university education, whereas 60% of the customers who infrequently buy such products
are either seniors or youths, and have no university degree.
Classification:
Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like IF-THEN, Decision trees or Neural networks to predict
a class or essentially classify a collection of items.
Classification is a supervised learning technique used to categorize data into predefined classes or
labels.
Example:
[Figure: Prediction]
Association Analysis:
Association Analysis is a functionality of data mining. It relates two or more attributes of the data. It
discovers the relationship between the data and the rules that are binding them. It is also known
as Market Basket Analysis for its wide use in retail sales.
The suggestion that Amazon shows at the bottom of a product page, “Customers who bought this
also bought,” is a real-time example of association analysis.
It relates two transactions of similar items and finds out the probability of the same happening again.
This helps the companies improve their sales of various items.
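As an illustration (not a full Apriori implementation), pairwise co-occurrence counts and supports can be sketched in Python using the transaction table from the transactional-data example:

```python
from collections import Counter
from itertools import combinations

# The five transactions from the transactional-data example (TIDs omitted)
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Popcorn", "Bread"},
    {"Popcorn", "Coke", "Egg", "Milk"},
    {"Popcorn", "Bread", "Egg", "Milk"},
    {"Coke", "Egg", "Milk"},
]

# Count how often each unordered pair of items occurs together
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of transactions containing both items
support = {pair: n / len(transactions) for pair, n in pair_counts.items()}
```

For instance, Egg and Milk co-occur in 3 of the 5 transactions, so the support of {Egg, Milk} is 0.6; a real association miner (e.g., Apriori) would keep itemsets above a minimum-support threshold and derive rules from them.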
Cluster Analysis:
Clustering analyses data objects without consulting class labels; objects are grouped so that objects
within a cluster are highly similar to one another but dissimilar to objects in other clusters.
Example:
[Figure: Clustering]
Outlier Analysis:
When data that cannot be grouped into any of the classes appears, we use outlier analysis. There will
be occurrences of data that have attributes/features different from any of the other classes or
clusters. These outstanding data are called outliers. They are usually considered noise or
exceptions, and the analysis of these outliers is called outlier mining.
Outlier analysis is important to understand the quality of data. If there are too many outliers, you
cannot trust the data or draw patterns out of it.
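A simple numeric way to flag outliers is a z-score rule; the sketch below (one of many possible outlier tests, with an assumed cutoff of two sample standard deviations) applies it to the salary data used earlier:

```python
from statistics import mean, stdev

# Salary data (in thousands of dollars) from the earlier examples
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mu, sigma = mean(values), stdev(values)

# Flag values more than 2 sample standard deviations from the mean
outliers = [v for v in values if abs(v - mu) / sigma > 2]
```

Here only 110 is flagged: with mean 58 and a sample standard deviation of roughly 20.3, it lies more than two standard deviations above the mean.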
Example:
[Figure: Outlier analysis]
Are all of the patterns interesting?
A data mining system has the potential to generate a great many patterns, but typically only a small
fraction of them are interesting. This raises some serious questions for data mining. You may wonder,
1. What makes a pattern interesting?
2. Can a data mining system generate all the interesting patterns?
3. Can a data mining system generate only interesting patterns?
The first question, “What makes a pattern interesting?”, can be answered as follows: a pattern is
interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree
of certainty, (3) potentially useful, and (4) novel.
The second question, “Can a data mining system generate all the interesting patterns?”, refers to
the completeness of a data mining algorithm. It is often unrealistic and inefficient for data mining
systems to generate all the possible patterns. Instead, user-provided constraints and interestingness
measures should be used to focus the search. A data mining algorithm is complete if it mines all
interesting patterns.
Finally, the third question -- “Can a data mining system generate only interesting patterns?”—
is an optimization problem in data mining. It is highly desirable for data mining systems to generate
only interesting patterns. An interesting pattern represents knowledge.
Data mining discovers patterns and extracts useful information from large datasets. Organizations
need to analyze and interpret data using data mining systems as data grows rapidly. With an
exponential increase in data, active data analysis is necessary to make sense of it all.
For example, suppose that you are a manager of AllElectronics in charge of sales in the United
States and Canada. You would like to study the buying trends of customers in Canada. Rather than
mining on the entire database, you can specify that only the data relevant to this task are to be
mined: the attributes and tuples of interest. These are referred to as relevant attributes.
Example:
An example of a concept hierarchy for the attribute (or dimension) age is shown in the following
figure.
[Figure: concept hierarchy for age]
In the figure, the root node represents the most general abstraction level, denoted as all.
Integration of a Data Mining System with a Database or Data Warehouse System:
Possible schemes for coupling a data mining system with a database/data warehouse system are no
coupling, loose coupling, semi-tight coupling, and tight coupling.
No Coupling:
No coupling means that a Data Mining system will not utilize any function of a Database or Data
Warehouse system.
It may fetch data from a particular source (such as a file system), process data using some data mining
algorithms, and then store the mining results in another file.
Drawbacks:
o First, without using a Database/Data Warehouse system, a Data Mining system may spend a
substantial amount of time finding, collecting, cleaning, and transforming data.
o Second, there are many tested, scalable algorithms and data structures implemented in
Database and Data Warehouse systems, and without coupling the Data Mining system cannot
take advantage of them.
Loose Coupling:
In loose coupling, the data mining system uses some facilities/services of a database or data
warehouse system. The data is fetched from a data repository managed by these (DB/DW) systems.
Data mining approaches are used to process the data and then the processed data is saved either in a
file or in a designated area in a database or data warehouse.
Loose coupling is better than no coupling because it can fetch any portion of data stored in Databases
or Data Warehouses by using query processing, indexing, and other system facilities.
Drawbacks:
o It is difficult for loose coupling to achieve high scalability and good performance with large
data sets.
Semi-Tight Coupling:
Semitight coupling means that besides linking a Data Mining system to a Data Base/Data Warehouse
system, efficient implementations of a few essential data mining primitives can be provided in the
DB/DW system. These primitives can include sorting, indexing, aggregation, histogram analysis,
multi way join, and precomputation of some essential statistical measures, such as sum, count, max,
min, standard deviation.
Advantage:
o This Coupling will enhance the performance of Data Mining systems
Tight Coupling:
Tight coupling means that a Data Mining system is smoothly integrated into the Data Base/Data
Warehouse system. The data mining subsystem is treated as one functional component of
information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.
Major issues in Data Mining:
Data mining, the process of extracting knowledge from data, has become increasingly important as
the amount of data generated by individuals, organizations, and machines has grown exponentially.
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available at one place. It needs to be integrated from various heterogeneous data sources.
The above factors may lead to some issues in data mining. These issues are mainly divided into three
categories, which are given below:
1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues
Diverse Data Types Issues:
• Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The
data is available at different data sources on LAN or WAN. These data source may be
structured, semi structured or unstructured. Therefore, mining the knowledge from them adds
challenges to data mining.
Data Preprocessing in Data Mining:
Data preprocessing is a crucial step in data mining. It involves transforming raw data into a clean,
structured, and suitable format for mining. Proper data preprocessing helps improve the quality of
the data, enhances the performance of algorithms, and ensures more accurate and reliable results.
In the real world, many databases and data warehouses have noisy, missing, and inconsistent data
due to their huge size. Low quality data leads to low quality data mining.
Missing: lacking certain attribute values or containing only aggregate data. E.g., Occupation = “ ”
Noisy: containing errors or outlier values that deviate from the expected. E.g., Salary = “−10”
Inconsistent: Data inconsistency means that different versions of the same data appear in different
places. For example, a ZIP code may be saved in one table in the format 1234-567, while in another
table it is represented as 1234567.
Data preprocessing is used to improve the quality of data and mining results. And The goal of data
preprocessing is to enhance the accuracy, efficiency, and reliability of data mining algorithms.
Major Tasks in Data Preprocessing:
Data preprocessing is an essential step in the knowledge discovery process, because quality decisions
must be based on quality data. And Data Preprocessing involves Data Cleaning, Data Integration,
Data Reduction and Data Transformation.
1. Data Cleaning:
Data cleaning is a process that "cleans" the data by filling in missing values, smoothing noisy
data, identifying and removing outliers, and resolving inconsistencies in the data.
If users believe the data are dirty, they are unlikely to trust the results of any data mining that has
been applied.
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing)
routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Missing Values:
Imagine that you need to analyze All Electronics sales and customer data. You note that many tuples
have no recorded value for several attributes such as customer income. How can you go about filling
in the missing values for this attribute? There are several methods to fill the missing values.
Those are,
a. Ignore the tuple: This is usually done when the class label is missing (in classification tasks).
This method is not very effective, unless the tuple contains several attributes with missing values.
b. Fill in the missing value manually: In general, this approach is time consuming and may not
be feasible given a large data set with many missing values.
c. Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like “Unknown” or −∞.
d. Use the attribute mean or median to fill in the missing value: Replace all missing values in
the attribute by the mean or median of that attribute values.
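Method (d) above can be sketched in plain Python; the income figures used here are hypothetical:

```python
from statistics import mean

# Hypothetical customer incomes; missing values recorded as None
incomes = [42000, None, 58000, 61000, None, 39000]

# Method (d): replace each missing value with the mean of the observed values
observed = [v for v in incomes if v is not None]
fill = mean(observed)  # (42000 + 58000 + 61000 + 39000) / 4 = 50000
filled = [fill if v is None else v for v in incomes]
```

Using the median instead of the mean (just swap in `statistics.median`) is preferable when the attribute is skewed, as with the salary data discussed earlier.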
Noisy Data:
Noise is a random error or variance in a measured variable. Data smoothing techniques are used to
eliminate noise and extract the useful patterns.
a. Binning: Binning methods smooth a sorted data value by consulting its “neighbourhood,”
that is, the values around it. The sorted values are distributed into several “buckets,” or bins.
Because binning methods consult the neighbourhood of values, they perform local
smoothing.
There are three kinds of binning. They are:
• Smoothing by Bin Means: In this method, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.
Therefore, each original value in this bin is replaced by the value 9.
• Smoothing by Bin Medians: In this method, each value in a bin is replaced by the
median value of the bin. For example, the median of the values 4, 8, and 15 in Bin 1
is 8. Therefore, each original value in this bin is replaced by the value 8.
• Smoothing by Bin Boundaries: In this method, the minimum and maximum values in
each bin are identified as the bin boundaries. Each bin value is then replaced by the
closest boundary value. For example, the middle value 8 in Bin 1 (4, 8, 15) is closer
to the boundary 4 than to 15, so it is replaced by 4.
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
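All three binning variants can be reproduced in a short Python sketch over the same price data:

```python
from statistics import mean, median

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted data for price (in dollars)

# Partition into equal-frequency bins of 3 values each
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means and by bin medians
by_means = [[round(mean(b))] * len(b) for b in bins]
by_medians = [[median(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer boundary
def smooth_by_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [smooth_by_boundaries(b) for b in bins]
```

One assumption made here: a value exactly equidistant from both boundaries is sent to the lower one, since the textbook example does not specify a tie-breaking rule.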
b. Regression: Data smoothing can also be done by regression, a technique used to predict
numeric values in a given data set. It analyses the relationship between a target (dependent)
variable and its predictor (independent) variables.
• Regression is a form of supervised machine learning that tries to predict a
continuous-valued attribute.
• Regression can be done in two ways. Linear regression involves finding the “best” line to fit
two attributes (or variables) so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
c. Clustering: It helps in identifying outliers. Similar values are organized into clusters,
and values which fall outside the clusters are known as outliers.
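Linear regression smoothing as described in (b) can be sketched with an ordinary least-squares fit of y = a + b·x; the x/y pairs below are hypothetical noisy measurements:

```python
# Ordinary least-squares fit of y = a + b*x, no external libraries;
# the x/y pairs are hypothetical noisy measurements
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b and intercept a from the usual least-squares formulas
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Smoothed (predicted) values lie on the fitted line
smoothed = [a + b * x for x in xs]
```

Replacing each observed y with its fitted value removes the random fluctuation around the line, which is exactly the "smoothing" the text refers to.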
2. Data Integration:
Data integration is the process of combining data from multiple sources into a single, unified view.
This process involves identifying and accessing the different data sources, mapping the data to a
common format. Different data sources may include multiple data cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze data that is spread across
multiple systems or platforms, in order to gain a more complete and accurate understanding of the
data.
Data integration strategy is typically described using a triple (G, S, M) approach, where G denotes
the global schema, S denotes the schema of the heterogeneous data sources, and M represents the
mapping between the queries of the source and global schema.
Example: To understand the (G, S, M) approach, let us consider a data integration scenario that aims
to combine employee data from two different HR databases, database A and database B. The global
schema (G) would define the unified view of employee data, including attributes like EmployeeID,
Name, Department, and Salary.
In the schema of heterogeneous sources, database A (S1) might have attributes like EmpID,
FullName, Dept, and Pay, while database B's schema (S2) might have attributes like ID,
EmployeeName, DepartmentName, and Wage. The mappings (M) would then define how the
attributes in S1 and S2 map to the attributes in G, allowing for the integration of employee data from
both systems into the global schema.
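The (G, S, M) idea from the employee example can be sketched as simple attribute-renaming maps; all records and attribute names here are hypothetical, and real mediator systems also handle type conversions, value conflicts, and query rewriting:

```python
# Toy sketch of the (G, S, M) triple: per-source mappings (M) translate each
# source schema (S) into the global schema (G); all names are hypothetical
G = {"EmployeeID", "Name", "Department", "Salary"}

# M for database A's schema (EmpID, FullName, Dept, Pay)
M_A = {"EmpID": "EmployeeID", "FullName": "Name", "Dept": "Department", "Pay": "Salary"}
# M for database B's schema (ID, EmployeeName, DepartmentName, Wage)
M_B = {"ID": "EmployeeID", "EmployeeName": "Name", "DepartmentName": "Department", "Wage": "Salary"}

def to_global(record, mapping):
    """Rename a source record's attributes into the global schema."""
    return {mapping[k]: v for k, v in record.items()}

a_row = {"EmpID": 1, "FullName": "Asha", "Dept": "HR", "Pay": 52000}
b_row = {"ID": 2, "EmployeeName": "Ravi", "DepartmentName": "IT", "Wage": 61000}
unified = [to_global(a_row, M_A), to_global(b_row, M_B)]
```

After mapping, rows from both sources share the global attribute names and can be queried as one table.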
Issues in Data Integration:
There are several issues that can arise when integrating data from multiple sources, including:
a. Data Quality: Data from different sources may have varying levels of accuracy,
completeness, and consistency, which can lead to data quality issues in the integrated data.
b. Data Semantics: Integrating data from different sources can be challenging because the same
data element may have different meanings across sources.
c. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
3. Data Reduction:
Imagine that you have selected data from the AllElectronics data warehouse for analysis. The data
set will likely be huge! Complex data analysis and mining on huge amounts of data can take a long
time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on
the reduced data set should be more efficient yet produce the same (or almost the same) analytical
results.
In simple words, Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.
a. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
b. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
c. Data compression: This is the process of altering, encoding, or transforming the structure
of data in order to save space. By reducing duplication and encoding data in binary form, data
compression creates a compact representation of information. And it involves the techniques
such as lossy or lossless compression to reduce the size of a dataset.
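Technique (a), simple random sampling without replacement, is essentially a one-liner with Python's standard library; the data set and sample size here are stand-ins:

```python
import random

random.seed(42)                    # fixed seed so the sketch is reproducible
data = list(range(1000))           # stand-in for a large data set
sample = random.sample(data, 100)  # keep 10% of the tuples, without replacement
```

Mining on `sample` instead of `data` trades a small loss of fidelity for a tenfold reduction in processing cost, which is the point of data reduction.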
4. Data Transformation:
Data transformation in data mining refers to the process of converting raw data into a format
that is suitable for analysis and modelling. The goal of data transformation is to prepare the data
for data mining so that it can be used to extract useful insights and knowledge.
a. Smoothing: It is a process that is used to remove noise from the dataset using techniques
include binning, regression, and clustering.
b. Attribute construction (or feature construction): In this, new attributes are constructed
and added from the given set of attributes to help the mining process.
c. Aggregation: In this, summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated to compute monthly and annual total
amounts.
d. Data normalization: This process involves converting all data variables into a small
range, such as −1.0 to 1.0 or 0.0 to 1.0.
e. Generalization: It converts low-level data attributes to high-level data attributes using a
concept hierarchy. For example, age values in numerical form (e.g., 22) are converted into
categorical values (young, old).
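Min-max normalization, one common form of the data normalization mentioned in (d), can be sketched as follows; the age values are illustrative:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map values linearly onto [new_min, new_max] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

ages = [22, 35, 47, 60]
normalized = min_max_normalize(ages)  # 22 maps to 0.0 and 60 maps to 1.0
```

Scaling all attributes into a common range like this prevents attributes with large units (e.g., salary) from dominating attributes with small units (e.g., age) in distance-based mining methods.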