
Chapter 2 Introduction to Data Warehousing.

Content:
2.1 Architecture of DW
2.2 OLAP and Data Cubes
2.3 Dimensional Data Modeling-star, snowflake schemas
2.4 Data Preprocessing – Need, Data Cleaning, Data Integration & Transformation, Data
Reduction
2.5 Machine Learning, Pattern Matching

2.1 Architecture of DW:

A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data warehouse applications are designed to support users' ad hoc data requirements, an activity known as online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the users' needs for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the elements of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

In data warehousing, an operational system refers to a system that processes the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.
End-User access Tools

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-client access tools.

Examples of end-user access tools include:

o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area

We must clean and process operational information before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated. The data warehouse staging area is a temporary location where records from source systems are copied.

Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our
organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation of the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and sales
or mine historical information to make predictions about customer behavior.
Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume that has to be managed and processed, and the number of user requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.

4. Security: Monitoring of accesses is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data warehouse management should not be complicated.

Types of Data Warehouse Architectures


Single-Tier Architecture

Single-tier architecture is not often used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

The figure shows that the only layer physically available is the source layer. In this approach, the data warehouse is virtual: it is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are directed to operational data after the middleware interprets them. In this way, queries affect transactional workloads.

Two-Tier Architecture

The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, as shown in fig:
Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
3. Data warehouse layer: Information is saved to one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and user-friendly GUIs.

Three-Tier Architecture

The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so that they benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra storage space used by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.

Three-Tier Data Warehouse Architecture

Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front-end Tools)

Bottom Tier (Data Warehouse Server):

A bottom-tier that consists of the Data Warehouse server, which is almost always an
RDBMS. It may include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile data provided by external consultants) are extracted using application program interfaces known as gateways. A gateway is provided by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

Middle Tier (OLAP Server):

A middle-tier which consists of an OLAP server for fast querying of the data
warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.

Top Tier (Front end Tools).

A top-tier that contains front-end tools for displaying results provided by OLAP, as
well as additional tools for data mining of the OLAP-generated data.

The overall Data Warehouse Architecture is shown in fig:


The metadata repository stores information that defines DW objects. It includes the
following parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart locations and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data (i.e., active, archived, or purged) and warehouse monitoring information (i.e., usage statistics, error reports, audits, etc.).
3. System performance data, which includes indices used to improve data access and retrieval performance.
4. Information about the mapping from operational databases, which includes source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business metadata, which include business terms and definitions, ownership information, etc.

Principles of Data Warehousing

Load Performance

Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business.
Load Processing

Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing, and metadata updates.

Data Quality Management

Fact-based management demands the highest data quality. The warehouse ensures local
consistency, global consistency, and referential integrity despite "dirty" sources and
massive database size.

Query Performance

Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must complete in seconds, not days.

Terabyte Scalability

Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes, and terabyte-sized data warehouses are increasingly common.

2.2 OLAP and Data Cubes:

What is Data Cube?

When data is grouped or combined in multidimensional matrices, the result is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are frequently queried.

For example, a relation with the schema sales (part, supplier, customer, and sale-price)
can be materialized into a set of eight views as shown in fig, where psc indicates a view
consisting of aggregate function value (such as total-sales) computed by grouping three
attributes part, supplier, and customer, p indicates a view composed of the
corresponding aggregate function values calculated by grouping part alone, etc.
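
The sketch below is a minimal illustration of this idea in Python with pandas; the toy sales table and its values are assumptions added purely for the example. It materializes all 2^3 = 8 group-by views (psc, ps, pc, sc, p, s, c, and the empty grouping) of a sales relation over part, supplier, and customer.

# Materializing every group-by view of sales(part, supplier, customer, sale_price).
# The table contents are made up for illustration.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s1"],
    "customer":   ["c1", "c1", "c2", "c3"],
    "sale_price": [100, 150, 200, 250],
})

dimensions = ["part", "supplier", "customer"]
views = {}
for k in range(len(dimensions), -1, -1):
    for group in combinations(dimensions, k):
        if group:                      # psc, ps, pc, sc, p, s, c
            views[group] = sales.groupby(list(group))["sale_price"].sum()
        else:                          # the empty grouping: total sales over everything
            views[group] = sales["sale_price"].sum()

print(views[("part",)])   # total sales grouped by part alone (the "p" view)
print(views[()])          # the grand total
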
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.

For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items, and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.

The data cube method is an interesting technique with many applications. Data cubes can be sparse in many cases because not every cell in each dimension may have corresponding data in the database.

Techniques should be developed to handle sparse cubes efficiently.

If a query contains constants at even lower levels than those provided in a data cube, it
is not clear how to make the best use of the precomputed results stored in the data
cube.

The multidimensional data model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.

A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, such as sales or transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains measures (such as Rs_sold) and keys to each of the related dimension tables.

Dimensions are the perspectives with respect to which a data cube is defined. Facts are generally quantities, which are used for analyzing the relationships between dimensions.
Example: In the 2-D representation, we look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).

3-Dimensional Cuboids

Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown
in fig:
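
As a minimal sketch (assuming pandas; the table contents and figures below are invented purely for illustration), the same fact data can be viewed as a 2-D cuboid for one city and then extended to a 3-D cuboid by adding location, which appears as a series of 2-D tables:

import pandas as pd

# Toy sales facts; all numbers are made up for the example.
sales = pd.DataFrame({
    "quarter":      ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":         ["home entertainment", "computer", "computer", "phone",
                     "computer", "phone"],
    "location":     ["Vancouver", "Vancouver", "Vancouver", "Toronto",
                     "Toronto", "Toronto"],
    "dollars_sold": [605, 825, 891, 38, 968, 75],
})

# 2-D view: items sold per quarter, for Vancouver only.
view_2d = sales[sales.location == "Vancouver"].pivot_table(
    index="item", columns="quarter", values="dollars_sold", aggfunc="sum")

# 3-D view: add location as a third dimension (a series of 2-D tables).
view_3d = sales.pivot_table(
    index=["location", "item"], columns="quarter",
    values="dollars_sold", aggfunc="sum")

print(view_2d)
print(view_3d)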

Let us suppose that we would like to view our sales data with an additional fourth
dimension, such as a supplier.

In data warehousing, the data cubes are n-dimensional. The cuboid which holds the
lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item,
location, and supplier dimensions.

The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).

The topmost 0-D cuboid, which holds the highest level of summarization, is known as
the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over
all four dimensions.

The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
2.3 Dimensional Data Modeling-star, snowflake schemas:
Star Schema:

A star schema is the elementary form of a dimensional model, in which data are organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension includes reference data about the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.

Fact Tables

A fact table is a table in a star schema that contains facts and is connected to the dimension tables. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of the fact table is generally a composite key made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). A fact table generally contains facts at the same level of aggregation.
Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional values. They are generally descriptive, textual values. Dimension tables are usually smaller in size than fact tables.

Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema


The star schema is well suited for data warehouse database design because of the following features:

o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows.
o It parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema, users can instantly analyze large, multidimensional data sets.

The main advantages of star schemas in a decision-support environment are:


Query Performance

Because a star schema database has a limited number of tables and clear join paths, queries run faster than they do against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous. Large join queries that contain multiple tables take only seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the central fact table. When two dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables. This design feature enforces accurate and consistent query results.

Load performance and administration


Structural simplicity also decreases the time required to load large batches of records into a star schema database. By describing facts and dimensions and separating them into different tables, the impact of a load is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts regularly and selectively by appending records to the fact table.

Built-in referential integrity

A star schema has referential integrity built in when information is loaded. Referential integrity is enforced because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a dimension cannot be given the correct key value to be retrieved.

Easily Understood

A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins are meaningful to the end-user because they represent the fundamental relationships between parts of the underlying business. Users can also browse dimension table attributes before constructing a query.

Disadvantage of Star Schema

There are some conditions that cannot be met by star schemas; for example, the relationship between users and bank accounts cannot be described as a star schema, because the relationship between them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.

The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for BRANCH data, and four
columns for LOCATION data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
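
The sketch below builds a star schema of this shape using Python's built-in sqlite3 module and runs a typical query that joins two dimensions through the central fact table. The measure columns (dollars_sold, units_sold) are assumptions added for illustration; the dimension layout follows the SALES, TIME, ITEM, BRANCH, and LOCATION example above.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

-- The central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")

# A typical star-join query: total dollars sold per brand and city.
# The two dimensions meet only through the fact table.
query = """
SELECT i.brand, l.city, SUM(s.dollars_sold) AS total_sales
FROM sales s
JOIN item i     ON s.item_key = i.item_key
JOIN location l ON s.location_key = l.location_key
GROUP BY i.brand, l.city;
"""
for row in con.execute(query):
    print(row)
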

We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.

Snowflake Schema:
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."

The snowflake schema is an expansion of the star schema in which each point of the star explodes into more points. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflaking is a method of normalizing the dimension tables of a star schema. When we normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out into a snowflake pattern.

The snowflake schema consists of one fact table linked to many dimension tables, which can be linked to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake schema can have any number of dimensions, and each dimension can have any number of levels.

Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time, Product, Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the primary dimension table and Location as the outrigger dimension table. The Product dimension has three dimension tables, with Product as the primary dimension table and the Line and Family tables as the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This requires more disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with low cardinality into separate dimension tables that relate to the core dimension table through foreign keys. Snowflaking for the sole purpose of minimizing disk space is not recommended, because it can adversely impact query performance.

In a snowflake schema, tables are normalized to remove redundancy; dimension tables are decomposed into multiple dimension tables.

The figure shows a simple STAR schema for sales in a manufacturing company. The sales fact table includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.

The STAR schema for sales, as shown above, contains only five tables, whereas the normalized version extends to eleven tables. Notice that in the snowflake schema, the attributes with low cardinality in each original dimension table are moved out to form separate tables. These new tables are connected back to the original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is suitable for many-to-many and one-to-many relationships between dimension levels.
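
As a minimal sketch (assuming pandas and a hypothetical Product dimension with Line and Family attributes, echoing the example above), snowflaking splits the low-cardinality attributes out of the denormalized dimension into outrigger tables linked by surrogate keys:

import pandas as pd

# A denormalized star-schema Product dimension (illustrative data only).
product_star = pd.DataFrame({
    "product_key":  [1, 2, 3],
    "product_name": ["Widget A", "Widget B", "Gadget C"],
    "line":         ["Widgets", "Widgets", "Gadgets"],
    "family":       ["Hardware", "Hardware", "Hardware"],
})

# Outrigger tables for the low-cardinality attributes.
line = product_star[["line", "family"]].drop_duplicates().reset_index(drop=True)
line["line_key"] = line.index + 1

family = line[["family"]].drop_duplicates().reset_index(drop=True)
family["family_key"] = family.index + 1

line = line.merge(family, on="family")[["line_key", "line", "family_key"]]

# The core dimension now carries a foreign key instead of the repeated attributes.
product = (product_star
           .merge(line[["line_key", "line"]], on="line")
           [["product_key", "product_name", "line_key"]])

print(product)   # Product references Line by foreign key
print(line)      # Line references Family by foreign key
print(family)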

Advantage of Snowflake Schema

1. The primary advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. There is no redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increased number of lookup tables. It is also known as a multi-fact star schema.
2. Queries are more complex and therefore more difficult to understand.
3. More tables mean more joins, and thus longer query execution times.
Difference between Star and Snowflake Schemas
Star Schema

o In a star schema, the fact table is at the center and is connected to the dimension tables.
o The tables are completely denormalized in structure.
o SQL query performance is good, as fewer joins are involved.
o Data redundancy is high, and the schema occupies more disk space.

Snowflake Schema

o A snowflake schema is an extension of the star schema in which the dimension tables are connected to one or more further dimension tables.
o The tables are partially denormalized in structure.
o The performance of SQL queries is somewhat lower than with a star schema, as more joins are involved.
o Data redundancy is low, and the schema occupies less disk space than a star schema.
Let's see the differences between the Star and Snowflake schemas.

Basis for Comparison | Star Schema | Snowflake Schema
Ease of maintenance/change | Has redundant data and is therefore less easy to maintain and change | No redundancy, so it is easier to maintain and change
Ease of use | Less complex queries that are simple to understand | More complex queries that are less easy to understand
Parent tables | A dimension table does not have any parent table | A dimension table has one or more parent tables
Query performance | Fewer foreign keys, hence shorter query execution time | More foreign keys, hence longer query execution time
Normalization | Has de-normalized tables | Has normalized tables
Type of data warehouse | Good for data marts with simple relationships (one-to-one or one-to-many) | Good for a data warehouse core, to simplify complex (many-to-many) relationships
Joins | Fewer joins | Higher number of joins
Dimension tables | Contains a single dimension table for each dimension | May have more than one dimension table for each dimension
Hierarchies | Hierarchies for a dimension are stored in the dimension table itself | Hierarchies are broken into separate tables, which help to drill down from the topmost to the lowermost level
When to use | When the dimension tables contain a smaller number of rows | When the dimension tables store a huge number of rows with redundant information and space is an issue
Data warehouse system | Works well for any data warehouse or data mart | Better suited to small data warehouses and data marts

2.4 Data Preprocessing – Need, Data Cleaning, Data Integration &
Transformation, Data Reduction
What Is Data Preprocessing?
Data preprocessing is a step in the data mining and data analysis process that takes raw
data and transforms it into a format that can be understood and analyzed by computers
and machine learning.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it
contain errors and inconsistencies, but it is often incomplete, and doesn’t have a regular,
uniform design.
Machines like to process nice and tidy information; they read data as 1s and 0s. So calculating structured data, like whole numbers and percentages, is easy. However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.
Data preprocessing is the process of transforming raw data into an understandable
format. It is also an important step in data mining as we cannot work with raw data. The
quality of the data should be checked before applying machine learning or data mining
algorithms.
Why is data preprocessing important?
Preprocessing of data is mainly done to check data quality. Quality can be checked by the following criteria:
• Accuracy: To check whether the data entered is correct.
• Completeness: To check whether all required data is available and recorded.
• Consistency: To check whether the same data kept in different places matches.
• Timeliness: The data should be updated correctly.
• Believability: The data should be trustworthy.
• Interpretability: The understandability of the data.

Major Tasks in Data Preprocessing:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets; it also replaces missing values. Some techniques used in data cleaning are:
Handling missing values:
• Standard values like “Not Available” or “NA” can be used to replace the missing values.
• Missing values can also be filled manually, but this is not recommended when the dataset is big.
• The attribute's mean value can be used to replace a missing value when the data is normally distributed; in the case of a non-normal distribution, the median value of the attribute can be used instead.
• While using regression or decision tree algorithms, the missing value can be replaced by the most probable value.
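
A minimal sketch of these strategies, assuming pandas and an illustrative dataset: a constant placeholder for categorical gaps, the mean for a roughly symmetric attribute, and the median for a skewed one.

import pandas as pd

df = pd.DataFrame({
    "city":   ["Pune", None, "Mumbai", "Pune"],
    "age":    [23, 25, None, 31],               # roughly symmetric attribute
    "income": [30000, None, 42000, 900000],     # skewed attribute
})

df["city"] = df["city"].fillna("Not Available")             # constant placeholder
df["age"] = df["age"].fillna(df["age"].mean())              # mean for normal-ish data
df["income"] = df["income"].fillna(df["income"].median())   # median for skewed data

print(df)
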
Noisy:

Noisy data generally means data containing random errors or unnecessary data points.

Here are some of the methods to handle noisy data.


• Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then the sorted values are separated and stored in bins. There are three methods for smoothing the data in a bin. Smoothing by bin means: the values in the bin are replaced by the mean value of the bin. Smoothing by bin medians: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundaries: the minimum and maximum values of the bin are taken, and each value is replaced by the closest boundary value.
• Regression: This is used to smooth the data and helps handle data when unnecessary data is present. For analysis purposes, regression helps to decide which variables are suitable for our analysis.
• Clustering: This is used for finding outliers and also for grouping data. Clustering is generally used in unsupervised learning.
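
As a minimal sketch (assuming numpy and pandas, with made-up values), the following performs equal-frequency binning followed by smoothing by bin means:

import numpy as np
import pandas as pd

prices = pd.Series(sorted([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Partition the sorted values into 3 equal-frequency bins.
bins = np.array_split(prices, 3)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = pd.concat([pd.Series([b.mean()] * len(b), index=b.index) for b in bins])
print(smoothed.tolist())   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]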

Data integration:
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. Some problems to be considered during data integration are:
• Schema integration: Integrating metadata (a set of data that describes other data) from different sources.
• Entity identification problem: Identifying entities across multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
• Detecting and resolving data value conflicts: Data taken from different databases may differ when merged. For example, attribute values from one database may differ from those in another, such as date formats like “MM/DD/YYYY” versus “DD/MM/YYYY”.
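
A minimal sketch of such an integration, assuming pandas and two hypothetical student tables whose column names and date formats differ:

import pandas as pd

registrar = pd.DataFrame({
    "student_id": [101, 102],
    "enrolled":   ["01/15/2023", "02/20/2023"],   # MM/DD/YYYY
})
library = pd.DataFrame({
    "id":     [101, 102],
    "joined": ["15/01/2023", "20/02/2023"],       # DD/MM/YYYY
    "name":   ["Asha", "Ravi"],
})

# Schema integration: map differing attribute names onto one standard schema.
library = library.rename(columns={"id": "student_id"})

# Resolving data value conflicts: normalize both date formats to real dates.
registrar["enrolled"] = pd.to_datetime(registrar["enrolled"], format="%m/%d/%Y")
library["joined"] = pd.to_datetime(library["joined"], format="%d/%m/%Y")

# Entity identification: the same student_id refers to the same real-world entity.
merged = registrar.merge(library, on="student_id")
print(merged)
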
Data reduction:
This process helps reduce the volume of the data, which makes analysis easier yet produces the same or almost the same result. The reduction also helps to reduce storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.
• Dimensionality reduction: This process is necessary for real-world applications because the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set decreases; attributes are combined and merged without losing the data's essential characteristics. This also reduces storage space and computation time. When the data is highly dimensional, the problem called the "curse of dimensionality" occurs.
• Numerosity reduction: In this method, the representation of the data is made smaller by reducing the volume. There is no loss of data in this reduction.
• Data compression: Representing the data in a compressed form is called data compression. Compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression removes some information, but only information that is unnecessary.
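
A minimal sketch of dimensionality reduction with principal component analysis (PCA), assuming scikit-learn and numpy and using randomly generated, correlated data purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                              # 2 hidden factors
data = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)     # 10 attributes projected onto 2 components

print(reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # most of the variance is retained
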
Data Transformation:
A change made to the format or the structure of the data is called data transformation. This step can be simple or complex based on the requirements. Some methods used in data transformation are:
• Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing, we can detect even a small change that helps in prediction.
• Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated and summarized for data analysis. This is an important step, since the accuracy of the results depends on the quantity and quality of the data: when the quality and quantity of the data are good, the results are more relevant.
• Discretization: Continuous data is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can use an interval such as (3 pm-5 pm, 6 pm-8 pm).
• Normalization: This is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
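
A minimal sketch of two of these transformations, assuming pandas and using made-up exam scores: min-max normalization into the range [-1.0, 1.0] and discretization of a continuous attribute into labeled intervals.

import pandas as pd

scores = pd.Series([35, 48, 62, 71, 90])

# Min-max normalization into the range [-1.0, 1.0].
lo, hi = scores.min(), scores.max()
normalized = -1.0 + 2.0 * (scores - lo) / (hi - lo)

# Discretization: split the continuous values into labeled intervals.
grades = pd.cut(scores, bins=[0, 50, 75, 100], labels=["low", "medium", "high"])

print(normalized.tolist())
print(grades.tolist())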

2.5 Machine Learning, Pattern Matching:


What is Machine learning?
Machine learning is concerned with developing and designing machines that can learn by themselves from a specified set of data to obtain a desirable result without being explicitly coded. Hence machine learning implies 'a machine which learns on its own'. Arthur Samuel, an American pioneer in the area of computer gaming and artificial intelligence, coined the term machine learning in 1959. He said that it "gives computers the ability to learn without being explicitly programmed."
Machine learning is a technique that creates complex algorithms for large-scale data processing and provides outcomes to its users. It utilizes complex programs that can learn through experience and make predictions.
The algorithms improve by themselves through frequent input of training data. The aim of machine learning is to understand information and build models from data that can be understood and used by humans.
Machine learning algorithms are divided into two types:
o Unsupervised Learning
o Supervised Learning
1. Unsupervised Machine Learning:
Unsupervised learning does not depend on trained (labeled) data sets to predict results; instead, it uses techniques such as clustering and association. Trained data sets are defined as input for which the output is known.
2. Supervised Machine Learning:
As the name implies, supervised learning refers to the presence of a supervisor acting as a teacher. Supervised learning is a learning process in which we teach or train the machine using data that is well labeled, meaning that the data is already marked with the correct responses. After that, the machine is provided with new sets of data, so that the supervised learning algorithm analyzes the training data and produces accurate results from the labeled data.
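
A minimal sketch contrasting the two types, assuming scikit-learn and a tiny made-up dataset: a supervised classifier is trained on labeled examples, while unsupervised clustering groups the same points without any labels.

from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]      # feature vectors
y = ["small", "small", "large", "large"]  # labels, used only by the supervised model

# Supervised learning: learn from labeled examples, then predict new data.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[2, 1], [8, 9]]))      # ['small' 'large']

# Unsupervised learning: discover structure (clusters) without any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                         # two groups, e.g. [1 1 0 0]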

Difference between Data mining and Machine learning

Factors | Data Mining | Machine Learning
Origin | Traditional databases with unstructured data | Existing algorithms and data
Meaning | Extracting information from a huge amount of data | Deriving new information from data as well as from previous experience
History | Also known as knowledge discovery in databases (KDD) | The first learning program, Samuel's checkers-playing program, was developed in the 1950s
Responsibility | Data mining is used to obtain rules from existing data | Machine learning teaches the computer how to learn and apply such rules
Abstraction | Data mining abstracts from the data warehouse | Machine learning builds its abstraction from training data
Applications | Compared to machine learning, data mining can produce outcomes on a smaller volume of data; it is also used in cluster analysis | Needs a large amount of data to obtain accurate results; applications include web search, spam filtering, credit scoring, computer design, etc.
Nature | Involves more human interference and manual work | Automated; once designed and implemented, there is no need for human effort
Techniques involved | Data mining is more of a research activity that uses techniques such as machine learning | A self-learning, trained system that performs the task precisely
Scope | Applied in limited fields | Can be used in a vast area
