Introduction To Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transaction data from single or multiple sources.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the global
organization's ongoing operations. This is done by excluding data that are not useful
concerning the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files,
and online transaction records. It requires performing data cleaning and integration
during data warehousing to ensure consistency in naming conventions, attribute
types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older from a data warehouse. This
contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS. The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed.
Operational systems and data warehouses differ as follows:
o Data within operational systems are mainly updated regularly according to need, whereas data in a warehouse is non-volatile: new data may be added regularly, but once added it is rarely changed.
o Operational systems are widely process-oriented, whereas data warehousing systems are widely subject-oriented.
o Relational databases are created for On-Line Transaction Processing (OLTP), whereas data warehouses are designed for On-Line Analytical Processing (OLAP).
Data Warehouse Architecture
Data warehouses and their architectures vary depending upon the elements of an
organization's situation.
Operational System
An operational system is a system that processes the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every
file in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Lightly and Highly Summarized Data
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The
summarized data is updated continuously as new information is loaded into the
warehouse.
We must clean and process the operational data before putting it into the
warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place
where data is processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location where records from the source
systems are copied.
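To make the idea concrete, here is a minimal sketch of a staging step in Python, assuming a hypothetical CSV export and a SQLite staging table (none of these names come from the text): records are copied from the source as-is, before any cleansing or consolidation.

```python
import csv
import io
import sqlite3

# Minimal sketch of a staging area: raw records from a source system are
# copied unchanged into a temporary staging table before any cleansing.
# The CSV content and column names below are hypothetical sample data.
source_csv = io.StringIO("customer_name,amount\n alice ,10.5\nBob,7\n")

staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_orders (source TEXT, customer_name TEXT, amount TEXT)")
for row in csv.DictReader(source_csv):
    staging.execute(
        "INSERT INTO stg_orders VALUES (?, ?, ?)",
        ("erp_export", row["customer_name"], row["amount"]),
    )
staging.commit()
print(staging.execute("SELECT * FROM stg_orders").fetchall())
```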
We can also customize the warehouse's architecture by adding data marts. A data mart is a segment of a data warehouse
that can provide information for reporting and analysis on a section, unit, department,
or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and sales
or mine historical information to make predictions about customer behavior.
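As a rough illustration, a departmental data mart can be exposed as a view over the warehouse tables. The sketch below uses SQLite with a hypothetical warehouse_sales table and sales_mart view; the names and figures are made up for the example.

```python
import sqlite3

# Minimal sketch: a "sales" data mart defined as a view over a warehouse table.
# Table names, column names, and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE warehouse_sales (
        sale_id INTEGER PRIMARY KEY,
        department TEXT,   -- e.g. 'sales', 'payroll', 'production'
        region TEXT,
        amount REAL
    )
""")
conn.executemany(
    "INSERT INTO warehouse_sales (department, region, amount) VALUES (?, ?, ?)",
    [("sales", "East", 120.0), ("sales", "West", 80.0), ("payroll", "East", 50.0)],
)

# The data mart is just the slice of warehouse data one unit needs for reporting.
conn.execute("""
    CREATE VIEW sales_mart AS
    SELECT region, SUM(amount) AS total_sales
    FROM warehouse_sales
    WHERE department = 'sales'
    GROUP BY region
""")
for row in conn.execute("SELECT * FROM sales_mart"):
    print(row)
```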
• Bottom Tier - The bottom tier of the architecture is the data warehouse database server, which is almost always a relational database system.
• Middle Tier - In the middle tier, we have the OLAP (Online Analytical Processing) server, which can be implemented in any of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.
o By a Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations through multidimensional storage engines. These map multidimensional views directly to data cube array structures. The data cube allows fast indexing to precomputed summarized data. Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets.
o By a Hybrid OLAP (HOLAP) model, which combines both ROLAP and MOLAP. A HOLAP server may allow large volumes of detail data to be stored in a relational database. Microsoft SQL Server supports a hybrid HOLAP server.
• Top Tier - This tier is the front-end client layer. This layer holds the query tools,
reporting tools, analysis tools, and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse:
Source data coming into the data warehouses may be grouped into four broad
categories:
Production Data: This type of data comes from the different operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose
segments of the data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets,
reports, customer profiles, and sometimes even departmental databases. This is the
internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business.
In every operational system, we periodically take the old data and store it in archived
files.
External Data: Most executives depend on information from external sources for a
large percentage of the information they use. They use statistics relating to their
industry produced by external agencies.
After we have extracted data from various operational systems and external
sources, we have to prepare the data for storing in the data warehouse. The extracted
data coming from several different sources needs to be changed, converted, and made
ready in a format that is suitable to be saved for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
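A sketch of what source-specific extraction might look like, assuming one hypothetical flat-file source and one hypothetical relational source held in SQLite; each source gets its own extraction routine.

```python
import csv
import io
import sqlite3

# Each data source gets its own extraction routine; names and layouts are hypothetical.

def extract_from_flat_file(csv_file):
    # Flat-file source: every file has its own layout, so it is read explicitly.
    return list(csv.DictReader(csv_file))

def extract_from_operational_db(conn):
    # Relational source: in a real system we would pull only rows changed since
    # the last extract; here we simply read a small 'orders' table.
    conn.row_factory = sqlite3.Row
    return [dict(r) for r in conn.execute("SELECT order_id, customer, amount FROM orders")]

# Stand-ins for the real sources.
flat_file = io.StringIO("order_id,customer,amount\n1,alice,10.5\n")
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
db.execute("INSERT INTO orders VALUES (2, 'Bob', 7.0)")

extracted = extract_from_flat_file(flat_file) + extract_from_operational_db(db)
print(extracted)
```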
2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even more significant challenges. We perform several individual
tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data components forms a large part of data transformation. Data
transformation includes many forms of combining pieces of data from different
sources. We combine data from a single source record or related data parts from many
source records.
On the other hand, data transformation also includes purging source data that is not
useful and separating source records into new combinations. Sorting and merging
of data take place on a large scale in the data staging area. When the data
transformation function ends, we have a collection of integrated data that is cleaned,
standardized, and summarized.
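The cleaning, standardization, and de-duplication tasks described above might be sketched as follows; the field names (customer, country, amount, order_id) are illustrative assumptions, not taken from the text.

```python
def transform(records):
    # Clean, standardize, and deduplicate extracted records.
    cleaned, seen = [], set()
    for rec in records:
        customer = rec.get("customer", "").strip().title()   # standardize names
        country = rec.get("country") or "UNKNOWN"             # default for missing data
        amount = float(rec.get("amount", 0) or 0)             # convert to one datatype
        key = (customer, rec.get("order_id"))
        if key in seen:          # drop duplicates brought in from several sources
            continue
        seen.add(key)
        cleaned.append({"customer": customer, "country": country, "amount": amount})
    return cleaned

print(transform([
    {"customer": " alice ", "country": "", "amount": "10.5", "order_id": 1},
    {"customer": "Alice", "country": "US", "amount": "10.5", "order_id": 1},  # duplicate
]))
```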
3) Data Loading: Two distinct categories of tasks form the data loading function. When
we complete the structure and construction of the data warehouse and go live for the
first time, we do the initial loading of the data into the data warehouse storage.
The initial load moves high volumes of data and consumes a substantial amount of time.
After that, as the warehouse starts operating, we continue to extract the changes to the
source data and apply incremental loads on an ongoing basis.
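A minimal sketch of such an initial bulk load, assuming a hypothetical SQLite warehouse table; executemany batches the high-volume insert in one call.

```python
import sqlite3

# Initial load: move the full, transformed history into the warehouse in bulk.
# The table, its columns, and the rows are hypothetical.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (customer TEXT, country TEXT, amount REAL)"
)
transformed_rows = [("Alice", "US", 10.5), ("Bob", "UK", 7.0)]   # from the transform step
warehouse.executemany(
    "INSERT INTO sales_fact (customer, country, amount) VALUES (?, ?, ?)",
    transformed_rows,
)
warehouse.commit()
```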
Data storage for the data warehouse is a separate repository. The data repositories for
the operational systems generally include only the current data. Also, these data
repositories hold the data structured in a highly normalized form for fast and efficient
processing.
The information delivery element is used to enable the process of subscribing to data
warehouse data and having it transferred to one or more destinations according to
some customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information
about the indexes, and so on.
Management and Control Component
The management and control elements coordinate the services and functions within
the data warehouse.
These components control the data transformation and the data transfer into the
data warehouse storage.
It works with the database management systems and enables data to be correctly
stored in the repositories.
It monitors the movement of data into the staging area and from there
into the data warehouse storage itself.
Data warehouse queries are complex because they involve the computation of large
groups of data at summarized levels.
They may require the use of distinctive data organization, access, and implementation
methods based on multidimensional views.
Because the two systems provide different functionalities and require different kinds
of data, it is necessary to maintain separate databases.
What is ETL?
The mechanism of extracting information from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction,
Transformation and Loading.
The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs
to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of
a data warehouse system and needs to be agile, automated, and well documented.
Cleansing
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and
up-to-date list of contact addresses, email addresses and telephone numbers must be
available.
If a client or supplier calls, the responding staff should be able to quickly find the
person in the enterprise database, but this requires that the caller's name or his/her
company name is listed in the database.
If a user appears in the databases with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
Loading
The load is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
1. Refresh: Data warehouse data is completely rewritten. This means that older
data is replaced. Refresh is usually used in combination with static extraction to
populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the
Data Warehouse. An update is typically carried out without deleting or
modifying preexisting data. This method is used in combination with
incremental extraction to update data warehouses regularly.
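The two load modes can be contrasted in a small sketch, again assuming a hypothetical SQLite fact table: refresh rewrites the table completely, while update appends only the changes found by incremental extraction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

def refresh(rows):
    # Refresh: the warehouse data is completely rewritten from a static extract.
    conn.execute("DELETE FROM sales_fact")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    conn.commit()

def update(changed_rows):
    # Update: only the changes found by incremental extraction are added;
    # preexisting rows are left untouched.
    conn.executemany("INSERT OR IGNORE INTO sales_fact VALUES (?, ?)", changed_rows)
    conn.commit()

refresh([(1, 10.0), (2, 20.0)])      # e.g. the initial population
update([(3, 5.0)])                   # e.g. a nightly incremental load
print(conn.execute("SELECT * FROM sales_fact").fetchall())
```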
Data warehouse modeling is the process of designing the schemas of the detailed and
summarized information of the data warehouse. The goal of data warehouse modeling
is to develop a schema describing the reality, or at least a part of it, which the
data warehouse is needed to support.
Types of Data Warehouse Models
Enterprise Warehouse
An Enterprise warehouse collects all of the records about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more
operational systems or external data providers, and it's cross-functional in scope. It
generally contains detailed information as well as summarized information and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
Data Mart
Independent Data Mart: An independent data mart is sourced from data captured from
one or more operational systems or external data providers, or from data generated
locally within a particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise
data warehouses.
Virtual Warehouses
A virtual data warehouse is a set of views over the operational database. For
efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is simple to build but requires excess capacity on
operational database servers.
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or a login. A dimension contains reference data about the fact, such as
date, item, or customer.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions. A fact table
has two types of columns: those that contain facts and those that are foreign keys to
the dimension tables. The primary key of a fact table is generally a composite key
that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables). A
fact table generally contains facts at the same level of aggregation.
Dimension Tables
Fact tables store data about sales, while dimension tables store data about the geographic
region (markets, cities), clients, products, times, and channels.
The star schema is highly suitable for data warehouse database design because of
the following features:
Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries
run faster than they do against OLTP systems. Small single-table queries, frequently of
a dimension table, are almost instantaneous. Large join queries that contain multiple
tables take only seconds or minutes to run.
Structural simplicity also decreases the time required to load large batches of records
into a star schema database. By describing facts and dimensions and separating them
into different tables, the impact of a load operation is reduced. Dimension tables can
be populated once and occasionally refreshed. We can add new facts regularly and
selectively by appending records to a fact table.
Built-in referential integrity
A star schema has referential integrity built in when information is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary
key, and all keys in the fact table are legitimate foreign keys drawn from the dimension
tables. A record in the fact table that is not related correctly to a dimension cannot
be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end user because they
represent the fundamental relationships between parts of the underlying business.
Users can also browse dimension table attributes before constructing a query.
There are some conditions that cannot be met by star schemas; for example, the relationship
between a user and a bank account cannot be described as a star schema because the
relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table
has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH
table has columns for branch_key, branch_name, and branch_type. The LOCATION
table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for BRANCH data, and four
columns for LOCATION data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
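The SALES star described above can be sketched directly as tables. The SQLite schema below is only an illustration of that example; the measure columns units_sold and dollars_sold, and the sample rows, are assumptions, since the text lists only the dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (columns taken from the example above)
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

    -- Fact table: foreign keys to every dimension plus the measures
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        branch_key INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        units_sold INTEGER,
        dollars_sold REAL,
        PRIMARY KEY (time_key, item_key, branch_key, location_key)
    );
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO time_dim VALUES (1, '2024-01-15', 'Jan', 'Q1', 2024)")
conn.execute("INSERT INTO item_dim VALUES (1, 'Phone', 'BrandX', 'Electronics', 'Wholesale')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 1, 3, 900.0)")

# A typical star-join query: total sales per brand and quarter.
query = """
    SELECT i.brand, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item_dim i ON f.item_key = i.item_key
    JOIN time_dim t ON f.time_key = t.time_key
    GROUP BY i.brand, t.quarter
"""
print(conn.execute(query).fetchall())
```

Note how the fact table carries only keys and measures, while descriptive attributes live in the dimension tables, which is what keeps the fact table small.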
We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.
The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called snowflake schema because the diagram of
snowflake schema resembles a snowflake. Snowflaking is a method of normalizing
the dimension tables in a star schema. When we normalize all the dimension tables
entirely, the resultant structure resembles a snowflake with the fact table in the middle.
The following diagram shows a snowflake schema with two dimensions, each having
three levels. A snowflake schema can have any number of dimensions, and each
dimension can have any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table, with Store,
Location, Time, Product, Line, and Family dimension tables. The Market dimension has
two dimension tables, with Store as the primary dimension table and Location as the
outrigger dimension table. The Product dimension has three dimension tables, with
Product as the primary dimension table, and the Line and Family tables as the outrigger
dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This
requires more disk space than a more normalized snowflake schema. Snowflaking
normalizes the dimension by moving attributes with low cardinality into separate
dimension tables that relate to the core dimension table by using foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended,
because it can adversely impact query performance.
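Snowflaking the product dimension from the earlier star sketch might look like this; the Line and Family outrigger tables and all column names are hypothetical, normalized out of the product dimension and linked back with foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Outrigger tables: low-cardinality attributes moved out of the product dimension
    CREATE TABLE family_dim (family_key INTEGER PRIMARY KEY, family_name TEXT);
    CREATE TABLE line_dim   (line_key INTEGER PRIMARY KEY, line_name TEXT,
                             family_key INTEGER REFERENCES family_dim(family_key));

    -- Core dimension now references the outriggers instead of repeating their attributes
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT,
                              line_key INTEGER REFERENCES line_dim(line_key));

    CREATE TABLE sales_fact (product_key INTEGER REFERENCES product_dim(product_key),
                             dollars_sold REAL);
""")

conn.execute("INSERT INTO family_dim VALUES (1, 'Electronics')")
conn.execute("INSERT INTO line_dim VALUES (1, 'Phones', 1)")
conn.execute("INSERT INTO product_dim VALUES (1, 'Phone X', 1)")
conn.execute("INSERT INTO sales_fact VALUES (1, 450.0)")

# Queries now need an extra join per level of the snowflake.
query = """
    SELECT fa.family_name, SUM(s.dollars_sold)
    FROM sales_fact s
    JOIN product_dim p ON s.product_key = p.product_key
    JOIN line_dim l    ON p.line_key = l.line_key
    JOIN family_dim fa ON l.family_key = fa.family_key
    GROUP BY fa.family_name
"""
print(conn.execute(query).fetchall())
```

The extra joins are the performance cost mentioned above: disk space is saved, but every query over the snowflaked hierarchy must traverse more tables.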
A Fact constellation means two or more fact tables sharing one or more dimensions.
It is also called Galaxy schema.
Fact table
It contains the facts of a particular business process, such as sales revenue by month.
A fact table stores quantitative information for analysis and is often denormalized. There
are three types of facts:
1. Additive
2. Semi-additive
3. Non-additive
1. Additive
The numeric values in a fact table that are the most flexible are additive measures: you
can sum them up across every dimension. If you want to know the total sales of your
company, you can simply sum up all the sales.
Additive facts are facts that can be summed up through all of the dimensions
in the fact table. The addition can be performed along different dimensions.
2. Semi-additive
With these measures, you have to pay attention. Semi-additive facts are facts
that can be summed up for some of the dimensions in the fact table, but not for the
others.
3. Non-additive
Non-additive facts are facts that cannot be summed up for any of the dimensions
in the fact table; ratios and percentages are typical examples.
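A small worked example with made-up numbers: sales amounts are additive across every dimension, an account balance is semi-additive (it can be summed across accounts for one day but not across days), and a ratio such as a margin percentage would be non-additive.

```python
# Hypothetical daily records for two accounts.
records = [
    {"day": "Mon", "account": "A", "sales": 100, "balance": 500},
    {"day": "Mon", "account": "B", "sales": 40,  "balance": 200},
    {"day": "Tue", "account": "A", "sales": 60,  "balance": 520},
    {"day": "Tue", "account": "B", "sales": 10,  "balance": 180},
]

# Additive: sales can be summed across both day and account.
total_sales = sum(r["sales"] for r in records)                              # 210

# Semi-additive: balances can be summed across accounts for one day ...
monday_balance = sum(r["balance"] for r in records if r["day"] == "Mon")    # 700
# ... but summing balances across days (500 + 520 for account A) is meaningless;
# we take the latest balance per account instead.
latest_balance_a = [r["balance"] for r in records if r["account"] == "A"][-1]  # 520

print(total_sales, monday_balance, latest_balance_a)
```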
These tables are created to reduce the actual fact table size.
There are two main categories of factless fact tables:
1. Event
2. Coverage
Event
An event-tracking factless fact table records the occurrence of an event, such as a student attending a class, without storing any measure.
Coverage
A coverage factless fact table records combinations that did not result in an event, such as products that were on promotion but did not sell.
Typical OLAP application areas include:
Finance
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Production
o Production planning
o Defect analysis
Drill-Down
The drill-down operation (also called roll-down) is the reverse of the roll-up operation.
Drill-down is like zooming in on the data cube. It navigates from less detailed data to
more detailed data. Drill-down can be performed by either stepping down a concept
hierarchy for a dimension or adding additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy defined as day, month, quarter, and year. Drill-down occurs by
descending the time hierarchy from the level of quarter to the more detailed level of
month.
Roll-Up
The roll-up operation (also called drill-up) performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by reducing the number of dimensions.
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of
a dimension. For example, a slice operation is performed when the user wants a
selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice.
So, the slice operation performs a selection on one dimension of the given cube, thus
resulting in a subcube.
Dice
The dice operation defines a subcube by performing a selection on two or more
dimensions.
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates
the data axes in view to provide an alternative presentation of the data. It may involve
swapping the rows and columns or moving one of the row dimensions into the column
dimensions.
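These operations can be imitated on a tiny in-memory "cube". The sketch below uses pandas (an assumption, not something the text prescribes) with hypothetical quarter/month/city/item dimensions.

```python
import pandas as pd

# A tiny sales "cube" with dimensions quarter, month, city, and item.
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May", "Mar", "Jun"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi", "Delhi"],
    "item":    ["Phone", "Laptop", "Phone", "Phone", "Laptop", "Phone"],
    "sales":   [100, 250, 80, 120, 300, 90],
})

# Roll-up: climb the time hierarchy from month to quarter.
rollup = cube.groupby(["quarter", "city"])["sales"].sum()

# Drill-down: descend from quarter back to month for more detail.
drilldown = cube.groupby(["quarter", "month", "city"])["sales"].sum()

# Slice: fix a single value of one dimension (quarter = Q1).
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions to form a subcube.
dice = cube[(cube["quarter"] == "Q1") & (cube["city"] == "Delhi")]

# Pivot: rotate the axes, e.g. cities as rows and items as columns.
pivot = cube.pivot_table(index="city", columns="item", values="sales", aggfunc="sum")

print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")
```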
OLAP servers can be implemented in three main ways:
1. ROLAP
2. MOLAP
3. HOLAP
ROLAP (Relational OLAP)
ROLAP works directly with relational databases.
ROLAP servers are intermediate servers that stand in between a relational back-end server
and the user front-end tools.
They use a relational or extended-relational DBMS to store and manage warehouse data.
Advantages
It is easy to use.
Disadvantages
Difference between ROLAP, MOLAP, and HOLAP
ROLAP stands for Relational Online Analytical Processing, MOLAP stands for Multidimensional Online Analytical Processing, and HOLAP stands for Hybrid Online Analytical Processing. The three storage modes compare as follows.
ROLAP
o The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
o ROLAP does not cause a copy of the source information to be stored in the Analysis Services data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries.
o Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP.
MOLAP
o The MOLAP storage mode causes the aggregations of the partition and a copy of its source information to be stored in a multidimensional structure in Analysis Services when the partition is processed.
o This MOLAP structure is highly optimized to maximize query performance. The storage area can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source information resides in the multidimensional structure, queries can be resolved without accessing the partition's source data.
o Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition.
HOLAP
o The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis Services instance.
o HOLAP does not cause a copy of the source information to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.
o Queries that access source data (for example, a drill down to an atomic cube cell for which there is no aggregation information) must retrieve data from the relational database and will not be as fast as they would be if the source information were stored in the MOLAP structure.