CH 2 Introduction To Data Warehousing
Content:
2.1 Architecture of DW
2.2 OLAP and Data Cubes
2.3 Dimensional Data Modeling-star, snowflake schemas
2.4 Data Preprocessing – Need, Data Cleaning, Data Integration & Transformation, Data
Reduction
2.5 Machine Learning, Pattern Matching
Data warehouse applications are designed to support users' ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
2.1 Architecture of DW
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Operational System
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
Lightly and Highly Summarized Data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized data is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.
End-User Access Tools
We must clean and process our operational data before putting it into the warehouse.
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
The data warehouse staging area is a temporary location where data from the source systems is copied.
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and sales
or mine historical information to make predictions about customer behavior.
Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data warehouse system:
4. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.
Single-Tier Architecture
The figure shows that the only layer physically available is the source layer. In this method, data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are directed to operational data after the middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data
model for a whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population. In some cases,
the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space required by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.
A bottom-tier that consists of the Data Warehouse server, which is almost always an
RDBMS. It may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data
provided by external consultants) are extracted using application program interfaces
called a gateway. A gateway is provided by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
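As a minimal sketch of the pattern a gateway exposes, the following Python snippet uses the standard DB-API with the built-in sqlite3 module as a stand-in for an ODBC/JDBC connection; the table and column names are hypothetical and only illustrate how a client program generates SQL that is executed at the server.

```python
import sqlite3

# Stand-in for a gateway connection (an ODBC/JDBC gateway would instead be
# opened with a driver-specific connection string or DSN).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The client program generates SQL that is executed at the server.
cur.execute("CREATE TABLE sales (item TEXT, dollars_sold REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("phone", 1200.0), ("laptop", 2300.0), ("phone", 800.0)])

cur.execute("SELECT item, SUM(dollars_sold) FROM sales GROUP BY item")
print(cur.fetchall())   # e.g. [('laptop', 2300.0), ('phone', 2000.0)]

conn.close()
```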
A middle-tier which consists of an OLAP server for fast querying of the data
warehouse.
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.
A top-tier that contains front-end tools for displaying results provided by OLAP, as
well as additional tools for data mining of the OLAP-generated data.
Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing, and metadata update.
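The sketch below illustrates, under assumed file, table, and column names, what a simplified load step of this kind might look like in Python with pandas and sqlite3: records are converted, filtered, reformatted, and then indexed in the warehouse table.

```python
import sqlite3
import pandas as pd

# Hypothetical daily extract from an operational system.
raw = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "sale_date": ["03/15/2024", "03/15/2024", "bad-date"],
    "amount": ["120.50", "89.99", "15.00"],
})

# Data conversion and reformatting.
raw["sale_date"] = pd.to_datetime(raw["sale_date"], format="%m/%d/%Y", errors="coerce")
raw["amount"] = raw["amount"].astype(float)

# Filtering: drop rows whose date could not be parsed.
clean = raw.dropna(subset=["sale_date"])

# Load into the warehouse table and build an index on the date column.
conn = sqlite3.connect("warehouse.db")
clean.to_sql("daily_sales", conn, if_exists="append", index=False)
conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON daily_sales(sale_date)")
conn.commit()
conn.close()
```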
Data Quality Management
Fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today these range from a few to hundreds of gigabytes, with terabyte-sized data warehouses becoming common.
2.2 OLAP and Data Cubes
Data that is grouped or combined into multidimensional matrices is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently queried.
For example, a relation with the schema sales(part, supplier, customer, sale-price) can be materialized into a set of eight views as shown in fig, where psc indicates a view consisting of aggregate function values (such as total-sales) computed by grouping the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping part alone; and so on.
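As a minimal sketch of this idea, the following Python/pandas snippet materializes all eight group-by views of a hypothetical sales table; the column names and figures are invented for illustration.

```python
from itertools import combinations
import pandas as pd

# Hypothetical base relation sales(part, supplier, customer, sale_price).
sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [100, 150, 200, 250],
})

dims = ["part", "supplier", "customer"]
views = {}

# One view per subset of the grouping attributes: psc, ps, pc, sc, p, s, c, none.
for k in range(len(dims), -1, -1):
    for subset in combinations(dims, k):
        if subset:
            views[subset] = sales.groupby(list(subset))["sale_price"].sum()
        else:
            # The empty grouping is the grand total (the apex view).
            views[subset] = sales["sale_price"].sum()

print(f"{len(views)} materialized views")   # 8
print(views[("part",)])                     # total sales grouped by part alone
```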
A data cube is created from a subset of attributes in the database. Specific attributes are
chosen to be measure attributes, i.e., the attributes whose values are of interest.
Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales
for the dimensions time, item, branch, and location. These dimensions enable the store
to keep track of things like monthly sales of items, and the branches and locations at
which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
The data cube method is an interesting technique with many applications. Data cubes can be sparse in many cases because not every cell in each dimension may have corresponding data in the database.
If a query contains constants at even lower levels than those provided in a data cube, it
is not clear how to make the best use of the precomputed results stored in the data
cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on this model, and data cubes usually model n-dimensional data.
A data cube is defined by dimensions and facts. Facts are generally quantities, which are used for analyzing the relationships between dimensions.
Example: In the 2-D representation, we will look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
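A small Python/pandas sketch of this 2-D view is shown below; the records and numbers are made up purely for illustration and do not come from the text.

```python
import pandas as pd

# Hypothetical sales records: quarter, item, location, dollars_sold (in thousands).
sales = pd.DataFrame({
    "quarter":      ["Q1", "Q1", "Q2", "Q2", "Q1"],
    "item":         ["home entertainment", "computer", "home entertainment", "computer", "phone"],
    "location":     ["Vancouver", "Vancouver", "Vancouver", "Vancouver", "Toronto"],
    "dollars_sold": [605, 825, 680, 952, 38],
})

# Fix the location dimension to Vancouver and cross-tabulate item against quarter.
vancouver = sales[sales["location"] == "Vancouver"]
view_2d = vancouver.pivot_table(index="item", columns="quarter",
                                values="dollars_sold", aggfunc="sum")
print(view_2d)
```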
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown
in fig:
Let us suppose that we would like to view our sales data with an additional fourth
dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the
lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item,
location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as
the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over
all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids that make up the 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
2.3 Dimensional Data Modeling-star, snowflake schemas:
Star Schema:
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or log in. A dimension includes reference data about the fact, such as date,
item, or customer.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to the dimension tables. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of the fact table is generally a composite key made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table generally contains facts at the same level of aggregation.
Dimension Tables
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.
Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema, users can quickly analyze large, multidimensional data sets.
Because a star schema database has a limited number of tables and clear join paths, queries run faster than they do against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous. Large join queries that involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimension tables are connected only through the central fact table. When two dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables. This design feature enforces accurate and consistent query results.
A star schema has referential integrity built in when data is loaded. Referential integrity is enforced because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a dimension cannot be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins are more significant to the end-user because they represent the fundamental relationships between parts of the underlying business. Users can also browse dimension table attributes before constructing a query.
There are some conditions that cannot be met by star schemas. For example, the relationship between a user and a bank account cannot be described as a star schema, because the relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for BRANCH data, and four
columns for LOCATION data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
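The following Python/sqlite3 sketch shows, under simplified and partly hypothetical column lists, how such a star schema might be created and queried; the fact table holds only foreign keys and measures, while descriptive attributes live in the dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables (simplified column lists).
cur.execute("CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER)")
cur.execute("CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT)")
cur.execute("CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT)")
cur.execute("CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT)")

# Fact table: only foreign keys to the dimensions plus the measures.
cur.execute("""
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        branch_key INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold INTEGER
    )
""")

# A typical star-join query: total dollars sold per city and quarter.
query = """
    SELECT l.city, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.city, t.quarter
"""
print(cur.execute(query).fetchall())   # empty until rows are loaded
conn.close()
```

Snowflaking this design would, for example, move city, state, and country out of location_dim into their own tables, which would add further joins to the same query.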
We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.
Snowflake Schema:
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in the middle.
The snowflake schema consists of one fact table that is linked to many dimension tables, which can in turn be linked to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake schema can have any number of dimensions, and each dimension can have any number of levels.
Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location,
Time, Product, Line, and Family dimension tables. The Market dimension has two
dimension tables with Store as the primary dimension table, and Location as the
outrigger dimension table. The Product dimension has three dimension tables, with Product as the primary dimension table and the Line and Family tables as the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This requires more disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with low cardinality into separate dimension tables that relate to the core dimension table through foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended, because
it can adversely impact query performance.
Figure shows a simple STAR schema for sales in a manufacturing company. The sales
fact table includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized (snowflake) version extends to eleven tables. We will notice that in the snowflake schema, the attributes with low cardinality in each original dimension table are moved out to form separate tables. These new tables are connected back to the original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is suitable for many-to-many and one-to-many relationships between dimension levels.
o In a star schema, the fact table is at the center and is connected to the dimension tables.
o The tables are in a completely denormalized structure.
o SQL query performance is good, as fewer joins are involved.
o Data redundancy is high and occupies more disk space.
Star Schema vs. Snowflake Schema
o Ease of Maintenance/Change: The star schema has redundant data and is hence less easy to maintain/change; the snowflake schema has no redundancy and is therefore easier to maintain and change.
o Ease of Use: The star schema has less complex queries that are simple to understand; the snowflake schema has more complex queries that are therefore less easy to understand.
o Query Performance: The star schema has fewer foreign keys and hence shorter query execution time; the snowflake schema has more foreign keys and thus longer query execution time.
o Type of Data Warehouse: The star schema is good for data marts with simple relationships (one-to-one or one-to-many); the snowflake schema is good for the data warehouse core, to simplify complex relationships (many-to-many).
o Data Warehouse System: The star schema works best in any data warehouse/data mart; the snowflake schema is better for a small data warehouse/data mart.
2.4 Data Preprocessing – Need, Data Cleaning, Data Integration &
Transformation, Data Reduction
What Is Data Preprocessing?
Data preprocessing is a step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning algorithms.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it
contain errors and inconsistencies, but it is often incomplete, and doesn’t have a regular,
uniform design.
Machines like to process nice and tidy information – they read data as 1s and 0s. So calculating with structured data, like whole numbers and percentages, is easy. However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.
Data preprocessing is the process of transforming raw data into an understandable
format. It is also an important step in data mining as we cannot work with raw data. The
quality of the data should be checked before applying machine learning or data mining
algorithms.
Why is Data preprocessing important?
Preprocessing of data is done mainly to check the data quality. The quality can be checked against the following criteria:
• Accuracy: To check whether the data entered is correct or not.
• Completeness: To check whether all required data is available and recorded.
• Consistency: To check whether the same data is kept consistently in all the places where it is stored.
• Timeliness: The data should be updated correctly and kept current.
• Believability: The data should be trustworthy.
• Interpretability: The understandability of the data.
Data integration:
Data integration is the process of combining data from multiple sources into a single dataset. The data integration process is one of the main components of data management. There are some problems to be considered during data integration.
• Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
• Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
• Detecting and resolving data value conflicts: The data taken from different databases may differ when merged. For example, the attribute values from one database may differ from another database, such as the date format differing between "MM/DD/YYYY" and "DD/MM/YYYY" (see the sketch after this list).
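As a minimal, hypothetical sketch of these issues in Python/pandas, the snippet below reconciles differing key names and date formats from two made-up sources before merging them.

```python
import pandas as pd

# Source A uses student_id and MM/DD/YYYY dates.
source_a = pd.DataFrame({
    "student_id": [101, 102],
    "enrolled":   ["03/15/2024", "04/01/2024"],
})

# Source B uses stud_id and DD/MM/YYYY dates for the same entities.
source_b = pd.DataFrame({
    "stud_id":     [101, 102],
    "fee_paid_on": ["20/03/2024", "05/04/2024"],
})

# Entity identification: map both key columns onto a common name.
source_b = source_b.rename(columns={"stud_id": "student_id"})

# Resolve the data value conflict: parse each date column with its own format.
source_a["enrolled"] = pd.to_datetime(source_a["enrolled"], format="%m/%d/%Y")
source_b["fee_paid_on"] = pd.to_datetime(source_b["fee_paid_on"], format="%d/%m/%Y")

# Integrate into a single dataset.
integrated = source_a.merge(source_b, on="student_id")
print(integrated)
```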
Data reduction:
This process helps to reduce the volume of the data, which makes the analysis easier yet produces the same or almost the same results. The reduction also helps to reduce storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.
• Dimensionality reduction: This process is necessary for real-world applications because the data size is large. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set shrinks, combining and merging attributes without losing their original characteristics. This also reduces storage space and computation time. When the data is highly dimensional, a problem called the "curse of dimensionality" occurs (see the sketch after this list).
• Numerosity reduction: In this method, the representation of the data is made smaller by reducing its volume, without any loss of data.
• Data compression: The data is stored in a compressed form. Compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression discards some information, removing only unnecessary detail.
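The following Python sketch, with randomly generated data standing in for a real dataset, illustrates dimensionality reduction with PCA (scikit-learn) and numerosity reduction by sampling; the specific component and sample counts are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random data standing in for a table with 1,000 rows and 20 attributes.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 20)),
                    columns=[f"attr_{i}" for i in range(20)])

# Dimensionality reduction: project the 20 attributes down to 3 components.
pca = PCA(n_components=3)
reduced = pca.fit_transform(data)
print(reduced.shape)                      # (1000, 3)
print(pca.explained_variance_ratio_)      # variance kept by each component

# Numerosity reduction: keep a 10% random sample of the rows.
sample = data.sample(frac=0.1, random_state=0)
print(len(sample))                        # 100
```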
Data Transformation:
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements. There are several methods of data transformation.
• Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing, we can detect even small changes that help in prediction.
• Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated and summarized for data analysis. This is an important step, since the accuracy of the analysis depends on the quantity and quality of the data: when the quality and quantity of the data are good, the results are more relevant.
• Discretization: Continuous data is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can set an interval like (3 pm-5 pm, 6 pm-8 pm).
• Normalization: This is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0 (see the sketch after this list).
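A minimal Python/pandas sketch of discretization and min-max normalization to the range [-1, 1] is shown below; the column values are invented for illustration.

```python
import pandas as pd

times = pd.DataFrame({"class_hour": [15, 16, 18, 19, 20]})   # 24-hour clock

# Discretization: split the continuous hours into labelled intervals.
times["slot"] = pd.cut(times["class_hour"],
                       bins=[14, 17, 20],
                       labels=["3 pm-5 pm", "6 pm-8 pm"])

# Normalization: min-max scale the hours into the range [-1.0, 1.0].
col = times["class_hour"]
times["hour_scaled"] = 2 * (col - col.min()) / (col.max() - col.min()) - 1

print(times)
```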
2.5 Machine Learning, Pattern Matching
The following points compare data mining and machine learning:
• History: In 1930, data mining was known as knowledge discovery in databases (KDD); the first machine learning program, Samuel's checker-playing program, was established in 1950.
• Responsibility: Data mining is used to obtain the rules from the existing data, whereas machine learning teaches the computer how to learn and comprehend the rules.
• Abstraction: Data mining abstracts patterns from the data warehouse, whereas machine learning reads data from the machine.