Introduction To Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transaction data from single or multiple sources.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the global
organization's ongoing operations. This is done by excluding data that are not useful
concerning the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files,
and online transaction records. It requires performing data cleaning and integration
during data warehousing to ensure consistency in naming conventions, attribute
types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older from a data warehouse. This
contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS. The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed.
Operational systems and data warehouses differ as follows:
o Data within operational systems are mainly updated regularly according to need, whereas data in a warehouse is non-volatile: new data may be added regularly, but once added it is rarely changed.
o Operational systems are widely process-oriented, whereas data warehousing systems are widely subject-oriented.
o Relational databases are created for On-Line Transaction Processing (OLTP), whereas data warehouses are designed for On-Line Analytical Processing (OLAP).
Data Warehouse Architecture
Data warehouses and their architectures vary depending upon the elements of an
organization's situation.
Operational System
An operational system is a system that processes the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every
file in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Lightly and Highly Summarized Data
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The
summarized data is updated continuously as new information is loaded into the
warehouse.
We must clean and process the operational data before putting it into the
warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place
where data is processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location where records from the source
systems are copied.
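To make the idea concrete, here is a minimal sketch of a staging step in Python, assuming a hypothetical CSV export and a SQLite staging table (none of these names come from the text): records are copied from the source as-is, before any cleansing or consolidation.

```python
import csv
import io
import sqlite3

# Minimal sketch of a staging area: raw records from a source system are
# copied unchanged into a temporary staging table before any cleansing.
# The CSV content and column names below are hypothetical sample data.
source_csv = io.StringIO("customer_name,amount\n alice ,10.5\nBob,7\n")

staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_orders (source TEXT, customer_name TEXT, amount TEXT)")
for row in csv.DictReader(source_csv):
    staging.execute(
        "INSERT INTO stg_orders VALUES (?, ?, ?)",
        ("erp_export", row["customer_name"], row["amount"]),
    )
staging.commit()
print(staging.execute("SELECT * FROM stg_orders").fetchall())
```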
We can also customize the warehouse's architecture by adding data marts. A data mart is a segment of a data warehouse
that can provide information for reporting and analysis on a section, unit, department,
or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and sales
or mine historical information to make predictions about customer behavior.
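As a rough illustration, a departmental data mart can be exposed as a view over the warehouse tables. The sketch below uses SQLite with a hypothetical warehouse_sales table and sales_mart view; the names and figures are made up for the example.

```python
import sqlite3

# Minimal sketch: a "sales" data mart defined as a view over a warehouse table.
# Table names, column names, and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE warehouse_sales (
        sale_id INTEGER PRIMARY KEY,
        department TEXT,   -- e.g. 'sales', 'payroll', 'production'
        region TEXT,
        amount REAL
    )
""")
conn.executemany(
    "INSERT INTO warehouse_sales (department, region, amount) VALUES (?, ?, ?)",
    [("sales", "East", 120.0), ("sales", "West", 80.0), ("payroll", "East", 50.0)],
)

# The data mart is just the slice of warehouse data one unit needs for reporting.
conn.execute("""
    CREATE VIEW sales_mart AS
    SELECT region, SUM(amount) AS total_sales
    FROM warehouse_sales
    WHERE department = 'sales'
    GROUP BY region
""")
for row in conn.execute("SELECT * FROM sales_mart"):
    print(row)
```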
• Bottom Tier - The bottom tier of the architecture is the data warehouse database server, which is almost always a relational database system.
• Middle Tier - In the middle tier, we have the OLAP (Online Analytical Processing) server, which can be implemented in any of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.
o By a Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations through multidimensional storage engines. These map multidimensional views directly to data cube array structures. The data cube allows fast indexing to precomputed summarized data. Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets.
o By a Hybrid OLAP (HOLAP) model, which combines both ROLAP and MOLAP. A HOLAP server may allow large volumes of detail data to be stored in a relational database. Microsoft SQL Server supports a hybrid HOLAP server.
• Top Tier - This tier is the front-end client layer. This layer holds the query tools,
reporting tools, analysis tools, and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse:
Source data coming into the data warehouses may be grouped into four broad
categories:
Production Data: This type of data comes from the different operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose
segments of the data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets,
reports, customer profiles, and sometimes even departmental databases. This is the
internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business.
In every operational system, we periodically take the old data and store it in archived
files.
External Data: Most executives depend on information from external sources for a
large percentage of the information they use. They use statistics relating to their
industry produced by external agencies.
After we have extracted data from various operational systems and external
sources, we have to prepare the data for storing in the data warehouse. The extracted
data coming from several different sources needs to be changed, converted, and made
ready in a format that is suitable to be saved for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
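A sketch of what source-specific extraction might look like, assuming one hypothetical flat-file source and one hypothetical relational source held in SQLite; each source gets its own extraction routine.

```python
import csv
import io
import sqlite3

# Each data source gets its own extraction routine; names and layouts are hypothetical.

def extract_from_flat_file(csv_file):
    # Flat-file source: every file has its own layout, so it is read explicitly.
    return list(csv.DictReader(csv_file))

def extract_from_operational_db(conn):
    # Relational source: in a real system we would pull only rows changed since
    # the last extract; here we simply read a small 'orders' table.
    conn.row_factory = sqlite3.Row
    return [dict(r) for r in conn.execute("SELECT order_id, customer, amount FROM orders")]

# Stand-ins for the real sources.
flat_file = io.StringIO("order_id,customer,amount\n1,alice,10.5\n")
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
db.execute("INSERT INTO orders VALUES (2, 'Bob', 7.0)")

extracted = extract_from_flat_file(flat_file) + extract_from_operational_db(db)
print(extracted)
```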
2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even more significant challenges. We perform several individual
tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data components forms a large part of data transformation. Data
transformation includes many forms of combining pieces of data from different
sources. We combine data from a single source record or related data parts from many
source records.
On the other hand, data transformation also includes purging source data that is not
useful and separating source records into new combinations. Sorting and merging
of data take place on a large scale in the data staging area. When the data
transformation function ends, we have a collection of integrated data that is cleaned,
standardized, and summarized.
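The cleaning, standardization, and de-duplication tasks described above might be sketched as follows; the field names (customer, country, amount, order_id) are illustrative assumptions, not taken from the text.

```python
def transform(records):
    # Clean, standardize, and deduplicate extracted records.
    cleaned, seen = [], set()
    for rec in records:
        customer = rec.get("customer", "").strip().title()   # standardize names
        country = rec.get("country") or "UNKNOWN"             # default for missing data
        amount = float(rec.get("amount", 0) or 0)             # convert to one datatype
        key = (customer, rec.get("order_id"))
        if key in seen:          # drop duplicates brought in from several sources
            continue
        seen.add(key)
        cleaned.append({"customer": customer, "country": country, "amount": amount})
    return cleaned

print(transform([
    {"customer": " alice ", "country": "", "amount": "10.5", "order_id": 1},
    {"customer": "Alice", "country": "US", "amount": "10.5", "order_id": 1},  # duplicate
]))
```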
3) Data Loading: Two distinct categories of tasks form the data loading function. When
we complete the structure and construction of the data warehouse and go live for the
first time, we do the initial loading of the data into the data warehouse storage.
The initial load moves high volumes of data and consumes a substantial amount of time.
After that, as the warehouse starts operating, we continue to extract the changes to the
source data and apply incremental loads on an ongoing basis.
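A minimal sketch of such an initial bulk load, assuming a hypothetical SQLite warehouse table; executemany batches the high-volume insert in one call.

```python
import sqlite3

# Initial load: move the full, transformed history into the warehouse in bulk.
# The table, its columns, and the rows are hypothetical.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (customer TEXT, country TEXT, amount REAL)"
)
transformed_rows = [("Alice", "US", 10.5), ("Bob", "UK", 7.0)]   # from the transform step
warehouse.executemany(
    "INSERT INTO sales_fact (customer, country, amount) VALUES (?, ?, ?)",
    transformed_rows,
)
warehouse.commit()
```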
Data storage for the data warehouse is a separate repository. The data repositories for
the operational systems generally include only the current data. Also, these data
repositories hold the data structured in a highly normalized form for fast and efficient
processing.
The information delivery element is used to enable the process of subscribing to data
warehouse data and having it transferred to one or more destinations according to
some customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information
about the indexes, and so on.
Management and Control Component
The management and control elements coordinate the services and functions within
the data warehouse.
These components control the data transformation and the data transfer into the
data warehouse storage.
It works with the database management systems and enables data to be correctly
stored in the repositories.
It monitors the movement of data into the staging area and from there
into the data warehouse storage itself.
Data warehouse queries are complex because they involve the computation of large
groups of data at summarized levels.
They may require the use of distinctive data organization, access, and implementation
methods based on multidimensional views.
Because the two systems provide different functionalities and require different kinds
of data, it is necessary to maintain separate databases.
What is ETL?
The mechanism of extracting information from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction,
Transformation and Loading.
The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs
to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of
a data warehouse system and needs to be agile, automated, and well documented.
Cleansing
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and
up-to-date list of contact addresses, email addresses and telephone numbers must be
available.
If a client or supplier calls, the responding staff should be able to quickly find the
person in the enterprise database, but this requires that the caller's name or his/her
company name is listed in the database.
If a user appears in the databases with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
Loading
The load is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
1. Refresh: Data warehouse data is completely rewritten. This means that older
data is replaced. Refresh is usually used in combination with static extraction to
populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the
Data Warehouse. An update is typically carried out without deleting or
modifying preexisting data. This method is used in combination with
incremental extraction to update data warehouses regularly.
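The two load modes can be contrasted in a small sketch, again assuming a hypothetical SQLite fact table: refresh rewrites the table completely, while update appends only the changes found by incremental extraction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

def refresh(rows):
    # Refresh: the warehouse data is completely rewritten from a static extract.
    conn.execute("DELETE FROM sales_fact")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    conn.commit()

def update(changed_rows):
    # Update: only the changes found by incremental extraction are added;
    # preexisting rows are left untouched.
    conn.executemany("INSERT OR IGNORE INTO sales_fact VALUES (?, ?)", changed_rows)
    conn.commit()

refresh([(1, 10.0), (2, 20.0)])      # e.g. the initial population
update([(3, 5.0)])                   # e.g. a nightly incremental load
print(conn.execute("SELECT * FROM sales_fact").fetchall())
```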
Data warehouse modeling is the process of designing the schemas of the detailed and
summarized information of the data warehouse. The goal of data warehouse modeling
is to develop a schema describing the reality, or at least a part of it, which the
data warehouse is needed to support.
Types of Data Warehouse Models
Enterprise Warehouse
An Enterprise warehouse collects all of the records about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more
operational systems or external data providers, and it's cross-functional in scope. It
generally contains detailed information as well as summarized information and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
Data Mart
Independent Data Mart: An independent data mart is sourced from data captured from
one or more operational systems or external data providers, or from data generated
locally within a particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise
data warehouses.
Virtual Warehouses
A virtual data warehouse is a set of views over the operational database. For
efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is simple to build but requires excess capacity on
operational database servers.
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or a login. A dimension contains reference data about the fact, such as
date, item, or customer.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions. A fact table
has two types of columns: those that contain facts and those that are foreign keys to
the dimension tables. The primary key of a fact table is generally a composite key
that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables). A
fact table generally contains facts at the same level of aggregation.
Dimension Tables
Fact tables store data about sales, while dimension tables store data about the geographic
region (markets, cities), clients, products, times, and channels.
The star schema is highly suitable for data warehouse database design because of
the following features:
Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries
run faster than they do against OLTP systems. Small single-table queries, frequently of
a dimension table, are almost instantaneous. Large join queries that contain multiple
tables take only seconds or minutes to run.
Structural simplicity also decreases the time required to load large batches of records
into a star schema database. By describing facts and dimensions and separating them
into different tables, the impact of a load operation is reduced. Dimension tables can
be populated once and occasionally refreshed. We can add new facts regularly and
selectively by appending records to a fact table.
Built-in referential integrity
A star schema has referential integrity built in when information is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary
key, and all keys in the fact table are legitimate foreign keys drawn from the dimension
tables. A record in the fact table that is not related correctly to a dimension cannot
be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end user because they
represent the fundamental relationships between parts of the underlying business.
Users can also browse dimension table attributes before constructing a query.
There are some conditions that cannot be met by star schemas; for example, the relationship
between a user and a bank account cannot be described as a star schema because the
relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table
has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH
table has columns for branch_key, branch_name, and branch_type. The LOCATION
table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for BRANCH data, and four
columns for LOCATION data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
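The SALES star described above can be sketched directly as tables. The SQLite schema below is only an illustration of that example; the measure columns units_sold and dollars_sold, and the sample rows, are assumptions, since the text lists only the dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (columns taken from the example above)
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

    -- Fact table: foreign keys to every dimension plus the measures
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        branch_key INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        units_sold INTEGER,
        dollars_sold REAL,
        PRIMARY KEY (time_key, item_key, branch_key, location_key)
    );
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO time_dim VALUES (1, '2024-01-15', 'Jan', 'Q1', 2024)")
conn.execute("INSERT INTO item_dim VALUES (1, 'Phone', 'BrandX', 'Electronics', 'Wholesale')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 1, 3, 900.0)")

# A typical star-join query: total sales per brand and quarter.
query = """
    SELECT i.brand, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item_dim i ON f.item_key = i.item_key
    JOIN time_dim t ON f.time_key = t.time_key
    GROUP BY i.brand, t.quarter
"""
print(conn.execute(query).fetchall())
```

Note how the fact table carries only keys and measures, while descriptive attributes live in the dimension tables, which is what keeps the fact table small.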
We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.
The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called snowflake schema because the diagram of
snowflake schema resembles a snowflake. Snowflaking is a method of normalizing
the dimension tables in a star schema. When we normalize all the dimension tables
entirely, the resultant structure resembles a snowflake with the fact table in the middle.
The following diagram shows a snowflake schema with two dimensions, each having
three levels. A snowflake schema can have any number of dimensions, and each
dimension can have any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table, with Store,
Location, Time, Product, Line, and Family dimension tables. The Market dimension has
two dimension tables, with Store as the primary dimension table and Location as the
outrigger dimension table. The Product dimension has three dimension tables, with
Product as the primary dimension table, and the Line and Family tables as the outrigger
dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This
requires more disk space than a more normalized snowflake schema. Snowflaking
normalizes the dimension by moving attributes with low cardinality into separate
dimension tables that relate to the core dimension table by using foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended,
because it can adversely impact query performance.
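Snowflaking the product dimension from the earlier star sketch might look like this; the Line and Family outrigger tables and all column names are hypothetical, normalized out of the product dimension and linked back with foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Outrigger tables: low-cardinality attributes moved out of the product dimension
    CREATE TABLE family_dim (family_key INTEGER PRIMARY KEY, family_name TEXT);
    CREATE TABLE line_dim   (line_key INTEGER PRIMARY KEY, line_name TEXT,
                             family_key INTEGER REFERENCES family_dim(family_key));

    -- Core dimension now references the outriggers instead of repeating their attributes
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT,
                              line_key INTEGER REFERENCES line_dim(line_key));

    CREATE TABLE sales_fact (product_key INTEGER REFERENCES product_dim(product_key),
                             dollars_sold REAL);
""")

conn.execute("INSERT INTO family_dim VALUES (1, 'Electronics')")
conn.execute("INSERT INTO line_dim VALUES (1, 'Phones', 1)")
conn.execute("INSERT INTO product_dim VALUES (1, 'Phone X', 1)")
conn.execute("INSERT INTO sales_fact VALUES (1, 450.0)")

# Queries now need an extra join per level of the snowflake.
query = """
    SELECT fa.family_name, SUM(s.dollars_sold)
    FROM sales_fact s
    JOIN product_dim p ON s.product_key = p.product_key
    JOIN line_dim l    ON p.line_key = l.line_key
    JOIN family_dim fa ON l.family_key = fa.family_key
    GROUP BY fa.family_name
"""
print(conn.execute(query).fetchall())
```

The extra joins are the performance cost mentioned above: disk space is saved, but every query over the snowflaked hierarchy must traverse more tables.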
A Fact constellation means two or more fact tables sharing one or more dimensions.
It is also called Galaxy schema.
Fact table
It contains the facts of a particular business process, such as sales revenue by month.
A fact table stores quantitative information for analysis and is often denormalized. There
are three types of facts:
1. Additive
2. Semi-additive
3. Non-additive
1. Additive
The numeric values in a fact table that are the most flexible are additive measures: you
can sum them up across every dimension. If you want to know the total sales of your
company, you can simply sum up all the sales.
Additive facts are facts that can be summed up through all of the dimensions
in the fact table. The addition can be performed along different dimensions.
2. Semi-additive
With these measures, you have to pay attention. Semi-additive facts are facts
that can be summed up for some of the dimensions in the fact table, but not for the
others.
3. Non-additive
Non-additive facts are facts that cannot be summed up for any of the dimensions
in the fact table; ratios and percentages are typical examples.
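A small worked example with made-up numbers: sales amounts are additive across every dimension, an account balance is semi-additive (it can be summed across accounts for one day but not across days), and a ratio such as a margin percentage would be non-additive.

```python
# Hypothetical daily records for two accounts.
records = [
    {"day": "Mon", "account": "A", "sales": 100, "balance": 500},
    {"day": "Mon", "account": "B", "sales": 40,  "balance": 200},
    {"day": "Tue", "account": "A", "sales": 60,  "balance": 520},
    {"day": "Tue", "account": "B", "sales": 10,  "balance": 180},
]

# Additive: sales can be summed across both day and account.
total_sales = sum(r["sales"] for r in records)                              # 210

# Semi-additive: balances can be summed across accounts for one day ...
monday_balance = sum(r["balance"] for r in records if r["day"] == "Mon")    # 700
# ... but summing balances across days (500 + 520 for account A) is meaningless;
# we take the latest balance per account instead.
latest_balance_a = [r["balance"] for r in records if r["account"] == "A"][-1]  # 520

print(total_sales, monday_balance, latest_balance_a)
```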
These tables are created to reduce the actual fact table size.
There are two main categories of factless fact tables:
1. Event
2. Coverage
Event
An event-tracking factless fact table records the occurrence of an event, such as a student attending a class, without storing any measure.
Coverage
A coverage factless fact table records combinations that did not result in an event, such as products that were on promotion but did not sell.
Typical OLAP application areas include:
Finance
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Production
o Production planning
o Defect analysis
Drill-Down
The drill-down operation (also called roll-down) is the reverse of the roll-up operation.
Drill-down is like zooming in on the data cube. It navigates from less detailed data to
more detailed data. Drill-down can be performed by either stepping down a concept
hierarchy for a dimension or adding additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy defined as day, month, quarter, and year. Drill-down occurs by
descending the time hierarchy from the level of quarter to the more detailed level of
month.
Roll-Up
The roll-up operation (also called drill-up) performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by reducing the number of dimensions.
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of
a dimension. For example, a slice operation is performed when the user wants a
selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice.
So, the slice operation performs a selection on one dimension of the given cube, thus
resulting in a subcube.
Dice
The dice operation defines a subcube by performing a selection on two or more
dimensions.
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates
the data axes in view to provide an alternative presentation of the data. It may involve
swapping the rows and columns or moving one of the row dimensions into the column
dimensions.
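These operations can be imitated on a tiny in-memory "cube". The sketch below uses pandas (an assumption, not something the text prescribes) with hypothetical quarter/month/city/item dimensions.

```python
import pandas as pd

# A tiny sales "cube" with dimensions quarter, month, city, and item.
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May", "Mar", "Jun"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi", "Delhi"],
    "item":    ["Phone", "Laptop", "Phone", "Phone", "Laptop", "Phone"],
    "sales":   [100, 250, 80, 120, 300, 90],
})

# Roll-up: climb the time hierarchy from month to quarter.
rollup = cube.groupby(["quarter", "city"])["sales"].sum()

# Drill-down: descend from quarter back to month for more detail.
drilldown = cube.groupby(["quarter", "month", "city"])["sales"].sum()

# Slice: fix a single value of one dimension (quarter = Q1).
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions to form a subcube.
dice = cube[(cube["quarter"] == "Q1") & (cube["city"] == "Delhi")]

# Pivot: rotate the axes, e.g. cities as rows and items as columns.
pivot = cube.pivot_table(index="city", columns="item", values="sales", aggfunc="sum")

print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")
```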
OLAP servers can be implemented in three main ways:
1. ROLAP
2. MOLAP
3. HOLAP
ROLAP (Relational OLAP)
ROLAP works directly with relational databases.
ROLAP servers are intermediate servers that stand in between a relational back-end server
and the user front-end tools.
They use a relational or extended-relational DBMS to store and manage warehouse data.
Advantages
It is easy to use.
Disadvantages
Difference between ROLAP, MOLAP, and HOLAP
ROLAP stands for Relational Online Analytical Processing, MOLAP stands for Multidimensional Online Analytical Processing, and HOLAP stands for Hybrid Online Analytical Processing. The three storage modes compare as follows.
ROLAP
o The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
o ROLAP does not cause a copy of the source information to be stored in the Analysis Services data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries.
o Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP.
MOLAP
o The MOLAP storage mode causes the aggregations of the partition and a copy of its source information to be stored in a multidimensional structure in Analysis Services when the partition is processed.
o This MOLAP structure is highly optimized to maximize query performance. The storage area can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source information resides in the multidimensional structure, queries can be resolved without accessing the partition's source data.
o Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition.
HOLAP
o The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis Services instance.
o HOLAP does not cause a copy of the source information to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.
o Queries that access source data (for example, a drill down to an atomic cube cell for which there is no aggregation information) must retrieve data from the relational database and will not be as fast as they would be if the source information were stored in the MOLAP structure.