4.1 - Data Preprocessing


Data Preprocessing

CC19 – Data Mining


Agenda
• Definition of Data Preprocessing
• Types of Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Normalization
• Data Reduction
• Steps of Data Preprocessing
Defining Data Preprocessing
• Data preprocessing is a key step in data mining that involves
modifying data to prepare it for analysis.
• This allows the data to better fit different data mining analysis
techniques and tools.
• Different techniques can be utilized depending on the type of data
being analyzed.
Defining Data Preprocessing
• Preparing data is important to ensure that large datasets can be
processed more easily.
• While more data is available for analysis compared to before, a lot of
that data is “dirty”.
• The data collected via data collection techniques can also be
inconsistent in format and quality.
Defining Data Preprocessing
• Many data mining techniques rely on data that is complete
and noise-free.
• Unfortunately, real-world data is rarely clean or complete.
• These are further reasons why we need to preprocess data to make it
usable for data mining tools.
Types of Data Preprocessing
• Listed below are common techniques for data preprocessing:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Normalization
• Data Reduction
Types of Data Preprocessing – Data Cleaning

• Data cleaning involves correcting bad data, filtering out incorrect
data, or reducing unnecessary detail in the data.
• It is a general technique that is commonly used with other techniques.
• Treatment of missing and noisy data is also included here.
Types of Data Preprocessing – Data Cleaning

• Data cleaning involves identifying and correcting errors
and inconsistencies in the data.
• These errors can involve missing values, outliers, and duplicates.
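
As a rough illustration of the three error types named above, here is a minimal sketch using only the Python standard library. The records, the column name `income`, the median fill, and the median-absolute-deviation threshold `k` are all assumptions chosen for the example; the slides do not prescribe a particular method.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
raw = [
    {"id": 1, "income": 32000.0},
    {"id": 2, "income": None},      # missing value
    {"id": 2, "income": None},      # exact duplicate of the row above
    {"id": 3, "income": 35000.0},
    {"id": 4, "income": 31000.0},
    {"id": 5, "income": 900000.0},  # likely outlier
]

def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def fill_missing(rows, column):
    """Replace missing values with the median of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    fill = median(known)
    return [
        dict(row, **{column: fill if row[column] is None else row[column]})
        for row in rows
    ]

def flag_outliers(rows, column, k=5.0):
    """Return rows further than k median absolute deviations from the median."""
    values = [row[column] for row in rows]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread, so nothing can be flagged by this rule
    return [row for row in rows if abs(row[column] - center) > k * mad]

deduped = drop_duplicates(raw)
filled = fill_missing(deduped, "income")
outliers = flag_outliers(filled, "income")
```

The sketch only flags the suspicious row rather than deleting it, since whether an outlier is an error or a genuine value usually needs a human decision.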
Types of Data Preprocessing – Data Integration

• Data integration involves merging data from multiple data sources.
• This should include steps to reduce redundancies and inconsistencies
in your data set.
• Techniques involved here include identification and unification of
variables and domains.
Types of Data Preprocessing – Data Integration

• Data integration can be challenging as it requires
combining data from different sources with different formats,
structures, and semantics.
• Techniques used here can include record linkage and data fusion.
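
The idea of unifying variables before linking records can be sketched as follows. The two sources (`crm`, `billing`), their field names, and the rule that later sources overwrite earlier ones on conflict are hypothetical choices made for this example.

```python
# Two hypothetical sources describing the same customers with different schemas.
crm = [
    {"customer_id": 101, "name": "Ana Cruz", "email": "ANA@EXAMPLE.COM"},
    {"customer_id": 102, "name": "Ben Uy", "email": "ben@example.com"},
]
billing = [
    {"cust": "101", "mail": "ana@example.com", "balance": "1500.50"},
    {"cust": "103", "mail": "cara@example.com", "balance": "20.00"},
]

def unify_crm(row):
    """Map a CRM row onto the shared variable names and types."""
    return {
        "id": int(row["customer_id"]),
        "name": row["name"],
        "email": row["email"].lower(),
    }

def unify_billing(row):
    """Map a billing row onto the shared variable names and types."""
    return {
        "id": int(row["cust"]),
        "email": row["mail"].lower(),
        "balance": float(row["balance"]),
    }

def integrate(*sources):
    """Link records that share an id and merge their fields.

    On conflicting fields, the value from the later source wins.
    """
    merged = {}
    for source in sources:
        for row in source:
            merged.setdefault(row["id"], {}).update(row)
    return [merged[key] for key in sorted(merged)]

integrated = integrate(
    [unify_crm(row) for row in crm],
    [unify_billing(row) for row in billing],
)
```

Customer 101 appears in both sources and ends up as a single record, while 102 and 103 keep only the fields their source provided. Real record linkage often has to match on fuzzy keys such as names, which this sketch does not attempt.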
Types of Data Preprocessing – Data Transformation

• Data transformation involves converting data so that the mining
process can be more efficient.
• These are typically composed of different tasks that are dependent on
the type of data being transformed.
• Some data transformation techniques might not work if the data used
is incompatible.
Types of Data Preprocessing – Data Transformation

• Data transformation techniques include smoothing, feature
construction, aggregation, and summarization.
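
Three of the techniques listed above can be sketched in a few lines. The trailing moving average is just one possible smoothing method, and the sales figures and field names are invented for the example.

```python
from collections import defaultdict

def moving_average(values, window=3):
    """Smooth a series with a trailing moving average."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def aggregate_sum(rows, key, value):
    """Aggregate: total of `value` for each distinct `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

# Smoothing: the spike of 50 is dampened by its neighbours.
daily_sales = [10, 12, 50, 11, 9]
smoothed = moving_average(daily_sales)

# Feature construction: derive revenue from price and quantity.
sales = [
    {"month": "Jan", "price": 2.0, "qty": 3},
    {"month": "Jan", "price": 4.0, "qty": 1},
    {"month": "Feb", "price": 5.0, "qty": 2},
]
with_revenue = [dict(row, revenue=row["price"] * row["qty"]) for row in sales]

# Aggregation: roll the constructed feature up to one value per month.
monthly = aggregate_sum(with_revenue, "month", "revenue")
```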
Types of Data Preprocessing – Data Normalization

• Data normalization involves scaling data to a common range.
• Normalizing the data attempts to give all attributes equal weight to
make them easier to analyze.
• This is done because the measurement units used for each attribute
can affect the data analysis.
Types of Data Preprocessing – Data Normalization

• All attributes in the data mining process should be expressed in
the same measurement units and should use a common scale or range.
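
Two widely used ways to bring attributes onto a common scale are min-max scaling and z-score standardization. The slides do not name a specific method, so the sketch below is one possible reading, with made-up height and income values.

```python
from statistics import mean, pstdev

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        return [new_min for _ in values]  # constant attribute
    span = high - low
    return [new_min + (v - low) * (new_max - new_min) / span for v in values]

def z_score_scale(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant attribute
    return [(v - mu) / sigma for v in values]

# Attributes measured in very different units.
heights_cm = [150, 160, 170, 180, 190]
incomes = [20000, 30000, 40000, 50000, 60000]

scaled_heights = min_max_scale(heights_cm)
scaled_incomes = min_max_scale(incomes)
standardized_heights = z_score_scale(heights_cm)
```

After min-max scaling, both attributes lie in [0, 1], so the much larger raw income values no longer dominate distance-based comparisons.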
Types of Data Preprocessing – Data Reduction

• Data reduction comprises techniques which obtain a reduced
representation of the original data.
• The reduced data maintains the essential structure and integrity of
the original data but is smaller.
• This is done because many data mining algorithms become much slower
as the amount of data they process grows.
Types of Data Preprocessing – Data Reduction

• There are three common types of data reduction methods:
• Feature selection
• Instance selection
• Discretization
Types of Data Preprocessing – Data Reduction

Feature Selection
• This achieves the reduction of
data by removing irrelevant or
redundant features.
• This aims to find a minimum set
of attributes.
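
A simple filter-style sketch of this idea is shown below: it drops constant features as irrelevant and drops features that are almost perfectly correlated with one already kept as redundant. The thresholds and the sample columns are assumptions, and this greedy heuristic does not guarantee a truly minimal attribute set.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(columns, min_variance=0.0, max_corr=0.95):
    """Return the names of features that are neither constant nor redundant."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # carries no information: treated as irrelevant
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # nearly duplicates a kept feature: redundant
        kept.append(name)
    return kept

# Hypothetical feature columns.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # same information, other units
    "country_code": [1.0, 1.0, 1.0, 1.0],   # constant
    "age": [25.0, 41.0, 33.0, 29.0],
}
selected = select_features(columns)
```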
Types of Data Preprocessing – Data Reduction

Instance Selection
• This looks at choosing a subset
of the total available data to
achieve the original purpose of
data mining.
• It works in a similar manner to
statistical sampling methods.
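
Since instance selection resembles statistical sampling, one way to sketch it is stratified random sampling, which keeps the class proportions of the original data. The function name, the `label` field, and the 10% fraction are assumptions for illustration only.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=0):
    """Draw the same fraction of rows from each class, at least one per class."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical dataset: 20 "spam" rows and 80 "ham" rows.
rows = [{"id": i, "label": "spam" if i % 5 == 0 else "ham"} for i in range(100)]
subset = stratified_sample(rows, "label", fraction=0.1)
```

The subset is a tenth of the original size but still contains both classes in the original 1:4 ratio.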
Types of Data Preprocessing – Data Reduction

Discretization
• This transforms quantitative (numerical) data into qualitative
(nominal) data by dividing the value range into intervals.
• Each interval is then associated with a discrete numerical value.
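
Equal-width binning is one simple way to do this; the slides do not specify a binning strategy, so the choice of three bins and the labels below are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Map each value to the index of its equal-width interval (0 .. n_bins-1)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

ages = [3, 17, 25, 40, 58, 63, 90]
codes = equal_width_bins(ages, 3)

# Optional: attach a nominal label to each discrete code.
labels = ["young", "middle", "senior"]
categories = [labels[code] for code in codes]
```

Here the range 3 to 90 is split into three intervals of width 29, so `codes` is `[0, 0, 0, 1, 1, 2, 2]`.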
Types of Data Preprocessing
• To summarize how these data preprocessing tools work:
• Data Cleaning – How do I clean up the data?
• Data Integration – How do I incorporate and adjust data?
• Data Transformation – How do I provide accurate data?
• Data Normalization – How do I unify and scale data?
• Data Reduction – How do I select the best features of my data?
Steps in Data Preprocessing
• These are the general steps to consider when doing data preprocessing:
• Assess your Data Quality
• Clean your Data
• Transform your Data
• Reduce your Data
• Further Process your Data
Steps in Data Preprocessing
Assess your Data Quality
• Start by looking at your data to get an idea of its overall quality.
• This is where you look at your data collection results and determine
what issues your data may have.
• Once you have identified issues, you then need to determine which
data preprocessing techniques to use.
Steps in Data Preprocessing
Assess your Data Quality
• These are common issues you might need to look at in your data:
• Mismatched Data Types
• Mixed Data Values
• Outliers
• Missing Data
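
A quick quality report covering the four issues above might look like the following sketch. The records, the valid age range, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "city": "Cebu"},
    {"age": "41", "city": "cebu"},   # age stored as text, city in another case
    {"age": None, "city": "Davao"},  # missing age
    {"age": 290, "city": "Davao"},   # implausible age
]

def assess_column(rows, column, valid_range=None):
    """Report missing values, the mix of data types, and out-of-range numbers."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    report = {
        "missing": len(values) - len(present),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }
    if valid_range is not None:
        low, high = valid_range
        report["out_of_range"] = [
            v for v in present
            if isinstance(v, (int, float)) and not low <= v <= high
        ]
    return report

def mixed_spellings(rows, column):
    """Find text values that differ only by case or surrounding whitespace."""
    groups = {}
    for row in rows:
        value = row.get(column)
        if isinstance(value, str):
            groups.setdefault(value.strip().lower(), set()).add(value)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

age_report = assess_column(records, "age", valid_range=(0, 120))
city_variants = mixed_spellings(records, "city")
```

The report shows one missing age, a mix of `int` and `str` values, one out-of-range age, and two spellings of the same city, which points to which preprocessing techniques are needed next.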
Steps in Data Preprocessing
Clean your Data
• In general, clean your data before applying any other
preprocessing method.
• This is because it removes useless, unrelated, corrupted, or incorrect
data which can interfere with other steps.
• This can be done manually by deleting files or automated with code or
tools.
Steps in Data Preprocessing
Transform your Data
• This is where your data is transformed into a format suitable for
your data analysis tools.
• How you transform your data will depend on what tool you are using
and what analysis you will perform.
• This involves steps such as normalization to further enhance the data.
Steps in Data Preprocessing
Reduce your Data
• You will then want to reduce the size of your overall dataset as
needed to make analysis easier.
• This may not be needed for small datasets but becomes important for
larger datasets.
• This helps keep your data analysis process from becoming slow or
infeasible.
Steps in Data Preprocessing
Further Process your Data
• You will need to determine if your current data preprocessing
steps are sufficient.
• This is typically done after data analysis to check if the data
preprocessing enhanced the results.
• You can add or remove preprocessing methods if you find that they are
not effective for your dataset.
