Introduction To Data Warehouse

What is a Data Warehouse?

A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transaction data from single or multiple sources.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of


information in support of management's decisions."

Characteristics of Data Warehouse

Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of
a particular subject, such as customer, product, or sales, rather than of the
organization's ongoing global operations. This is achieved by excluding data that is
not useful for the subject and including all data the users need to understand the
subject.

Integrated

A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files,
and online transaction records. Data cleaning and integration must be performed
during warehousing to ensure consistency in naming conventions, attribute types,
and so on among the different data sources.

Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older from a data warehouse. This
contrasts with a transaction system, where often only the most current data is kept.

Non-Volatile

The data warehouse is a physically separate data store, populated by transforming data
from the source operational RDBMS. Operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed there.

Goals of Data Warehousing

o To support reporting as well as analysis
o To maintain the organization's historical information
o To be the foundation for decision making

Need for Data Warehouse

1. Business users: Business users require a data warehouse to view summarized
data from the past. Since these people are non-technical, the data may be
presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant
data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in
the data warehouse. So, a data warehouse contributes to making strategic
decisions.
4. Data consistency and quality: By bringing data from different sources to
a common place, the user can effectively achieve uniformity and
consistency in the data.
5. Fast response time: A data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree of
flexibility and quick response time.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses makes it easier for end-users to navigate,
understand, and query the data.
4. Queries that would be complex against many normalized databases can be easier
to build and maintain against data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of
information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of
historical data.
Difference between Operational Database and Data Warehouse

| Operational Database | Data Warehouse |
| --- | --- |
| Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP). |
| Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data. |
| Data within operational systems is mainly updated regularly according to need. | Non-volatile: new data may be added regularly, but once added it is rarely changed. |
| It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes. |
| It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. |
| It is optimized for validation of incoming information during transactions and uses validation data tables. | It is loaded with consistent, valid information and requires no real-time validation. |
| It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP. |
| Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented. |
| Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data. |
| Data in. | Data out. |
| A smaller amount of data is accessed. | A large amount of data is accessed. |
| Relational databases are created for online transaction processing (OLTP). | A data warehouse is designed for online analytical processing (OLAP). |
Data Warehouse Architecture

Data warehouses and their architectures vary depending upon the elements of an
organization's situation.

Three common architectures are:

Data Warehouse Architecture: Basic

Data Warehouse Architecture: With Staging Area

Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

In data warehousing, an operational system refers to a system that processes the
day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every
file in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.

The goal of summarized information is to speed up query performance. The
summarized records are updated continuously as new information is loaded into the
warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to business
managers for strategic decision-making. These users interact with the warehouse
using end-client access tools.

The examples of some of the end-user access tools can be:

o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area

We must clean and process operational information before putting it into the
warehouse.

We can do this programmatically, although most data warehouses use a staging area
instead (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.

A data warehouse staging area is a temporary location where records from source
systems are copied.

Data Warehouse Architecture: With Staging Area and Data Marts


We may want to customize our warehouse's architecture for multiple groups within
our organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse
that provides information for reporting and analysis on a section, unit, department,
or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and sales
or mine historical information to make predictions about customer behavior.

Three-Tier Data Warehouse Architecture


Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of
the data warehouse architecture.
• Bottom Tier - The bottom tier of the architecture is the data warehouse database
server. It is the relational database system. We use back-end tools and utilities to
feed data into the bottom tier. These back-end tools and utilities perform the Extract,
Clean, Load, and Refresh functions.

• Middle Tier - In the middle tier, we have the OLAP (Online Analytical Processing) server,
which can be implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. ROLAP maps the operations on multidimensional data
to standard relational operations.
o By the Multidimensional OLAP (MOLAP) model, which directly implements
multidimensional data and operations. MOLAP uses multidimensional storage
engines that map multidimensional views directly to data cube array structures.
The data cube allows fast indexing to precomputed summarized data. Many MOLAP
servers adopt a two-level storage representation to handle dense and sparse
data sets.
o By the Hybrid OLAP (HOLAP) model, which combines both ROLAP and MOLAP. A HOLAP
server may allow large volumes of detail data to be stored in a relational database.
Microsoft SQL Server supports a hybrid HOLAP server.
• Top Tier - This tier is the front-end client layer. This layer holds the query tools,
reporting tools, analysis tools, and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse:

Components of a Data Warehouse

Source Data Component

Source data coming into the data warehouse may be grouped into four broad
categories:

Production Data: This type of data comes from the different operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose
segments of the data from the various operational systems.

Internal Data: In each organization, users keep their "private" spreadsheets,
reports, customer profiles, and sometimes even departmental databases. This is the
internal data, part of which could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business.
In every operational system, we periodically take the old data and store it in archived
files.
External Data: Most executives depend on information from external sources for a
large percentage of the information they use. They use statistics relating to their
industry produced by external agencies.

Data Staging Component

After we have extracted data from various operational systems and external
sources, we have to prepare the data for storage in the data warehouse. The extracted
data coming from several different sources needs to be changed, converted, and made
ready in a format that is suitable to be saved for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even more significant challenges. We perform several individual
tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings, the provision of default values for missing data elements, or the
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data components forms a large part of data transformation. Data
transformation also involves many forms of combining pieces of data from different
sources. We combine data from a single source record or related data parts from many
source records.

On the other hand, data transformation also includes purging source data that is not
useful and separating out source records into new combinations. Sorting and merging
of data take place on a large scale in the data staging area. When the data
transformation function ends, we have a collection of integrated data that is cleaned,
standardized, and summarized.
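To make the cleaning, default-value, standardization, and de-duplication tasks above concrete, here is a minimal Python sketch; the record fields and the state-code mapping are hypothetical, not from the text.

```python
# Illustrative sketch (not from the text): a few of the transformation tasks
# described above -- cleaning, default values, standardization, and
# de-duplication -- applied to records arriving from two source systems.

def transform(records):
    state_codes = {"karnataka": "KA", "maharashtra": "MH"}  # hypothetical mapping
    cleaned, seen = [], set()
    for rec in records:
        rec = dict(rec)
        # Cleaning: correct a known misspelling and fill a missing element.
        if rec.get("city") == "Bengalore":
            rec["city"] = "Bengaluru"
        rec.setdefault("country", "India")  # default value for a missing element
        # Standardization: one representation for the state attribute.
        rec["state"] = state_codes.get(rec["state"].lower(), rec["state"])
        # De-duplication: the same customer arriving from two source systems.
        if rec["customer_id"] not in seen:
            seen.add(rec["customer_id"])
            cleaned.append(rec)
    return cleaned

sources = [
    {"customer_id": 1, "city": "Bengalore", "state": "Karnataka"},
    {"customer_id": 1, "city": "Bengaluru", "state": "KA", "country": "India"},
]
print(transform(sources))  # one cleaned, standardized record survives
```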

3) Data Loading: Two distinct categories of tasks form the data loading function. When
we complete the structure and construction of the data warehouse and go live for the
first time, we do the initial loading of the information into the data warehouse storage.
The initial load moves high volumes of data and uses up a substantial amount of time;
after that, incremental loads apply the ongoing changes.

Data Storage Component

Data storage for the data warehouse is a separate repository. The data repositories for
the operational systems generally include only current data. Also, these data
repositories hold the data structured in highly normalized form for fast and efficient
processing.

Information Delivery Component

The information delivery element is used to enable the process of subscribing to data
warehouse files and having them transferred to one or more destinations according to
some customer-specified scheduling algorithm.
Metadata Component

Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep data about the
logical data structures, the data about the records and addresses, the information
about the indexes, and so on.

Data Marts

A data mart includes a subset of corporate-wide data that is of value to a specific group
of users. Its scope is confined to particular selected subjects. Data in a data warehouse
should be fairly current, but not necessarily up to the minute, although developments in
the data warehouse industry have made standard and incremental data dumps more
achievable. Data marts are smaller than data warehouses and usually serve a single
organizational unit. The current trend in data warehousing is to develop a data
warehouse with several smaller related data marts for particular kinds of queries and
reports.

Management and Control Component

The management and control elements coordinate the services and functions within
the data warehouse.

These components control the data transformation and the data transfer into the
data warehouse storage.

On the other hand, they moderate the data delivery to the clients.

They work with the database management systems and ensure that data is correctly
stored in the repositories.

They monitor the movement of information into the staging area and from there
into the data warehouse storage itself.

Why do we need a separate Data Warehouse?

Data warehouse queries are complex because they involve the computation of large
groups of data at summarized levels.

They may require the use of distinctive data organization, access, and implementation
methods based on multidimensional views.

Performing OLAP queries in an operational database degrades the performance of
operational tasks.

A data warehouse is used for analysis and decision making, which requires an
extensive database, including historical data, which an operational database does not
typically maintain.

The separation of an operational database from a data warehouse is based on the
different structures and uses of data in these systems.

Because the two systems provide different functionalities and require different kinds
of data, it is necessary to maintain separate databases.

What is ETL?

The mechanism of extracting information from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction,
Transformation and Loading.

The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.

To maintain its value as a tool for decision-makers, a data warehouse needs
to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of
a data warehouse system and needs to be agile, automated, and well documented.

How ETL Works?

ETL consists of three separate phases:


Extraction
o Extraction is the operation of extracting information from a source system for
further use in a data warehouse environment. This is the first stage of the ETL
process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all
changed data to the warehouse and keep it up-to-date.

Cleansing

The cleansing stage is crucial in a data warehouse system because it is supposed to
improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing
mistakes and to recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between values.

The following examples show the importance of data cleaning:

If an enterprise wishes to contact its users or its suppliers, a complete, accurate and
up-to-date list of contact addresses, email addresses and telephone numbers must be
available.

If a client or supplier calls, the staff responding should be able to quickly find the
person in the enterprise database, but this requires that the caller's name or his/her
company name is listed in the database.
If a user appears in the databases with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
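A small, hypothetical Python sketch of the rectification and homogenization features described above; the dictionaries and company names are invented for illustration, and a real ETL tool would apply far richer rules.

```python
# Hypothetical sketch of the two cleansing features named above:
# rectification of typing mistakes via a reference dictionary, and
# homogenization of synonyms, so the same company is not listed twice
# under slightly different names.

import difflib

KNOWN_NAMES = ["Acme Corporation", "Globex Ltd"]  # assumed reference dictionary
SYNONYMS = {"Acme Corp.": "Acme Corporation", "ACME": "Acme Corporation"}

def rectify(name):
    # Homogenization: map known synonyms/abbreviations to one canonical form.
    if name in SYNONYMS:
        return SYNONYMS[name]
    # Rectification: repair typing mistakes by closest dictionary match.
    match = difflib.get_close_matches(name, KNOWN_NAMES, n=1, cutoff=0.8)
    return match[0] if match else name

for raw in ["Acme Corp.", "Acme Corporatoin", "Globex Ltd"]:
    print(raw, "->", rectify(raw))
```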

Transformation

Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.

Loading

The load is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.

Loading can be carried out in two ways:

1. Refresh: Data warehouse data is completely rewritten. This means that older
data is replaced. Refresh is usually used in combination with static extraction to
populate a data warehouse initially.
2. Update: Only the changes applied to the source information are added to the
data warehouse. An update is typically carried out without deleting or
modifying preexisting data. This method is used in combination with
incremental extraction to update data warehouses regularly.
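The two load modes can be sketched in a few lines of Python with SQLite; the table daily_sales and its rows are assumptions made for illustration.

```python
# Minimal sketch (assumed table daily_sales) contrasting the two load
# modes: refresh rewrites the warehouse table; update appends only the
# changes, without modifying preexisting rows.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")

def refresh(rows):
    con.execute("DELETE FROM daily_sales")  # older data is replaced
    con.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)

def update(rows):
    con.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)  # append only

refresh([("2024-01-01", 100.0), ("2024-01-02", 250.0)])  # initial population
update([("2024-01-03", 75.0)])                           # incremental load
print(con.execute("SELECT * FROM daily_sales").fetchall())
```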

Data Warehouse Modeling

Data warehouse modeling is the process of designing the schemas of the detailed and
summarized information of the data warehouse. The goal of data warehouse modeling
is to develop a schema describing the reality, or at least a part of it, that the
data warehouse is needed to support.
Types of Data Warehouse Models

Enterprise Warehouse

An enterprise warehouse collects all of the records about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more
operational systems or external data providers, and it is cross-functional in scope. It
generally contains detailed as well as summarized information and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX
superservers, or parallel architecture platforms. It requires extensive business
modeling and may take years to develop and build.

Data Mart

A data mart includes a subset of corporate-wide data that is of value to a specific
collection of users. The scope is confined to particular selected subjects. For example,
a marketing data mart may restrict its subjects to the customer, items, and sales. The
data contained in data marts tend to be summarized.

Data marts are divided into two types:

Independent Data Mart: An independent data mart is sourced from data captured from
one or more operational systems or external data providers, or from data generated
locally within a particular department or geographic area.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise
data warehouses.
Virtual Warehouses

A virtual data warehouse is a set of views over the operational database. For
effective query processing, only some of the possible summary views may be
materialized. A virtual warehouse is simple to build but requires excess capacity on
operational database servers.

What is Star Schema?

A star schema is the elementary form of a dimensional model, in which data is
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or a login. A dimension includes reference data about the fact, such as
date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional
data model. The star schema is the simplest data warehouse schema. It is known as a
star schema because the entity-relationship diagram of this schema resembles a star,
with points diverging from a central table. The center of the schema consists of a
large fact table, and the points of the star are the dimension tables.

Fact Tables

A fact table in a star schema contains facts and is connected to dimensions. A fact table
has two types of columns: those that contain facts and those that are foreign keys to
the dimension tables. The primary key of a fact table is generally a composite key
made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables). A
fact table generally contains facts with the same level of aggregation.
Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that
categorize data. If a dimension does not have hierarchies and levels, it is called a flat
dimension or list. The primary key of each dimension table is part of the
composite primary key of the fact table. Dimensional attributes help to define the
dimensional values. They are generally descriptive, textual values. Dimension tables
are usually smaller in size than fact tables.

Fact tables store data about sales, while dimension tables store data about the geographic
region (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema

The star schema is well suited for data warehouse database design because of
the following features:

o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout
the development cycle, and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

Query Performance

Because a star schema database has a small number of tables and clear join paths,
queries run faster than they do against OLTP systems. Small single-table queries,
frequently against a dimension table, are almost instantaneous. Large join queries that
contain multiple tables take only seconds or minutes to run.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of records
into a star schema database. By defining facts and dimensions and separating them
into different tables, the impact of a load is reduced. Dimension tables can
be populated once and occasionally refreshed. We can add new facts regularly and
selectively by appending records to a fact table.
Built-in referential integrity

A star schema has referential integrity built in when information is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary
key, and all keys in the fact table are legitimate foreign keys drawn from the dimension
tables. A record in the fact table that is not related correctly to a dimension cannot
be given the correct key value to be retrieved.

Easily Understood

A star schema is simple to understand and navigate, with dimensions joined only
through the fact table. These joins are more meaningful to the end-user because they
represent the fundamental relationships between parts of the underlying business.
Users can also browse dimension table attributes before constructing a query.

Disadvantage of Star Schema

Some conditions cannot be met by star schemas. For example, the relationship
between users and bank accounts cannot be described as a star schema, because the
relationship between them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.

The TIME table has columns for day, month, quarter, and year. The ITEM table
has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH
table has columns for branch_key, branch_name, and branch_type. The LOCATION
table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the
dimension tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for
time data, four columns for ITEM data, three columns for BRANCH data, and four
columns for LOCATION data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
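As a sketch, the schema just described can be created with SQLite; the column lists follow the text, while the measure units_sold and the sample query are illustrative assumptions.

```python
# A sketch of the star schema described above, using SQLite. Column lists
# follow the text where given; the measure column units_sold is assumed.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                       quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                       branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state TEXT, country TEXT);

-- Fact table: a foreign key to every dimension plus a measure; the
-- composite of the foreign keys serves as the primary key.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
""")

# A typical star query: total units sold per country for Q1.
query = """
SELECT l.country, SUM(s.units_sold)
FROM sales s
JOIN time t     ON s.time_key = t.time_key
JOIN location l ON s.location_key = l.location_key
WHERE t.quarter = 'Q1'
GROUP BY l.country
"""
print(con.execute(query).fetchall())
```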

We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.

What is Snowflake Schema?

A snowflake schema is a variant of the star schema. "A schema is known as a
snowflake if one or more dimension tables do not connect directly to the fact table
but must join through other dimension tables."

The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called a snowflake schema because its diagram
resembles a snowflake. Snowflaking is a method of normalizing the dimension tables
in a star schema. When we normalize all the dimension tables entirely, the resultant
structure resembles a snowflake with the fact table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is
diagrammed with each fact surrounded by its associated dimensions, and those
dimensions are related to other dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension
tables, which can be linked to other dimension tables through a many-to-one
relationship. Tables in a snowflake schema are generally normalized to third normal
form. Each dimension table represents exactly one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having
three levels. A snowflake schema can have any number of dimensions, and each
dimension can have any number of levels.

Example: Figure shows a snowflake schema with a Sales fact table, with Store,
Location, Time, Product, Line, and Family dimension tables. The Market dimension has
two dimension tables with Store as the primary dimension table, and Location as the
outrigger dimension table. The product dimension has three dimension tables with
Product as the primary dimension table, and the Line and Family table are the outrigger
dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This
needs more disk space than a more normalized snowflake schema. Snowflaking
normalizes the dimension by moving attributes with low cardinality into separate
dimension tables that relate to the core dimension table through foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended,
because it can adversely impact query performance.
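A minimal sketch, assuming SQLite and invented column names, of how snowflaking the product dimension from the example might look; note the extra joins a query must pay.

```python
# Sketch of snowflaking the product dimension from the example: the
# low-cardinality attributes move into Line and Family outrigger tables
# that the core Product table references through foreign keys.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE family  (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line    (line_key INTEGER PRIMARY KEY, line_name TEXT,
                      family_key INTEGER REFERENCES family(family_key));
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                      line_key INTEGER REFERENCES line(line_key));
""")

# Reaching family information from a product now costs two extra joins,
# which is the query-time price of the saved disk space.
query = """
SELECT p.product_name, f.family_name
FROM product p
JOIN line   l ON p.line_key   = l.line_key
JOIN family f ON l.family_key = f.family_key
"""
print(con.execute(query).fetchall())
```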

In a snowflake schema, tables are normalized to remove redundancy; dimension
tables are decomposed into multiple dimension tables.

Advantages of Snowflake Schema

1. The primary advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joining smaller
lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels
and components.
3. No redundancy, so it is easier to maintain.
Disadvantages of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional
maintenance effort required due to the increasing number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.

What is Fact Constellation Schema?

A fact constellation means two or more fact tables sharing one or more dimensions.
It is also called a galaxy schema or a multi-fact star schema.

A fact constellation schema describes a logical structure of a data warehouse or data
mart. It can be designed with a collection of denormalized fact, shared, and conformed
dimension tables.

A fact constellation schema is a sophisticated database design in which it is difficult to
summarize information. It can be implemented between aggregate fact tables or by
decomposing a complex fact table into independent simple fact tables.

Example: A fact constellation schema is shown in the figure below.


This schema defines two fact tables, sales and shipping. Sales are considered along four
dimensions, namely time, item, branch, and location. The schema contains a fact table
for sales that includes keys to each of the four dimensions, along with two measures:
Rupee_sold and units_sold. The shipping table has five dimensions, or keys: item_key,
time_key, shipper_key, from_location, and to_location, and two measures: Rupee_cost
and units_shipped.
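A sketch of this constellation in SQLite; the dimension tables are reduced to bare keys for brevity, and the shipper dimension is left as a plain key column.

```python
# Sketch of the constellation above: two fact tables, sales and shipping,
# share the time and item dimension tables (shipping also reuses location
# twice, as from_location and to_location).

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY);
CREATE TABLE location (location_key INTEGER PRIMARY KEY);

CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    rupee_sold   REAL,
    units_sold   INTEGER
);
CREATE TABLE shipping (
    item_key      INTEGER REFERENCES item(item_key),
    time_key      INTEGER REFERENCES time(time_key),
    shipper_key   INTEGER,
    from_location INTEGER REFERENCES location(location_key),
    to_location   INTEGER REFERENCES location(location_key),
    rupee_cost    REAL,
    units_shipped INTEGER
);
""")
```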

The primary disadvantage of the fact constellation schema is that it is a more
challenging design, because many variants for specific kinds of aggregation must be
considered and selected.

Fact table

The fact table is the central table in the data schema.

It is found in the center of a star schema or snowflake schema and is surrounded by
dimension tables.

It contains the facts of a particular business process, such as sales revenue by month.

Facts are also known as measurements or metrics.

A fact table stores quantitative information for analysis and is often denormalized.

The fact table is the primary table in the dimensional model.

Types of Fact/ Measures

There are three types of facts:

1. Additive
2. Semi-additive
3. Non-additive

1. Additive

An additive measure is the most flexible kind of numeric value in a fact table: you
can sum it up along every dimension. If you want to know the total sales of your
company, you can simply sum up all the sales.

Additive facts are facts that can be summed up through all of the dimensions
in the fact table. The addition can be performed along different dimensions.

They can be summarized across all dimensions.

2. Semi-additive

With these measures, you have to pay attention. Semi-additive facts are facts
that can be summed up for some of the dimensions in the fact table, but not the
others. For example, an account balance can be summed across accounts, but not
across time.

They can be summarized across only some of the dimensions.


3. Non-additive

Non-additive facts cannot be summarized across any dimension.

With these facts, you can never make a sum. Ratios and percentages are typical examples.
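A short pandas sketch, with invented figures, of the difference between additive and semi-additive behavior:

```python
# Illustrative sketch: a sales count is additive, while an account balance
# is semi-additive -- it sums across accounts but not across time.

import pandas as pd

balances = pd.DataFrame({
    "day":     ["Mon", "Mon", "Tue", "Tue"],
    "account": ["A", "B", "A", "B"],
    "balance": [100, 50, 120, 60],
})

# Valid: total balance held per day (summing across the account dimension).
print(balances.groupby("day")["balance"].sum())

# Not meaningful: summing a balance across days double-counts the money;
# an average (or last value) is used along the time dimension instead.
print(balances.groupby("account")["balance"].mean())
```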

Factless Fact Tables

Factless fact tables contain dimension keys but no measures.

These tables are created to reduce the actual fact table size.

There are two main categories of factless fact tables:

1. Event

2. Coverage

Event

An event table represents a business process.

In event tables there are no associated metrics, only dimension keys.

Example – a leave tracking table or a student enrollment tracking table.

Coverage

Coverage tables are used for negative scenarios, i.e., to record what did not happen.
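A minimal SQLite sketch of an event-type factless fact table, using the student enrollment example above; the key names are assumptions.

```python
# Sketch of an event factless fact table: student enrollment tracking
# records only dimension keys, no measures. Counting rows answers
# questions such as "how many enrollments per course?".

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE enrollment_fact (
    student_key INTEGER,
    course_key  INTEGER,
    term_key    INTEGER,
    PRIMARY KEY (student_key, course_key, term_key)
)""")
con.execute("INSERT INTO enrollment_fact VALUES (1, 10, 1), (2, 10, 1)")
print(con.execute(
    "SELECT course_key, COUNT(*) FROM enrollment_fact GROUP BY course_key"
).fetchall())
```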

What is OLAP (Online Analytical Processing)?

OLAP stands for Online Analytical Processing. OLAP is a category of software
technology that authorizes analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access to a wide variety of possible
views of data that has been transformed from raw information to reflect the real
dimensionality of the enterprise as understood by the clients.

Who uses OLAP and Why?

OLAP applications are used by a variety of functions within an organization.

Finance and Accounting

o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling

Sales and Marketing

o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production

o Production planning
o Defect analysis

OLAP Operations in the Multidimensional Data Model

Drill-Down

The drill-down operation (also called roll-down) is the reverse of roll-up.
Drill-down is like zooming in on the data cube. It navigates from less detailed data to
more detailed data. Drill-down can be performed either by stepping down a concept
hierarchy for a dimension or by adding additional dimensions.

The figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy defined as day, month, quarter, and year. Drill-down occurs by
descending the time hierarchy from the level of the quarter to the more detailed level of the
month.

Roll-Up

The roll-up operation (also known as drill-up or the aggregation operation) performs
aggregation on a data cube by climbing up a concept hierarchy, i.e., by dimension reduction.
Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation
performed on the dimension location. The hierarchy for location is defined as street <
city < province or state < country. The roll-up operation aggregates the data by
ascending the location hierarchy from the level of the city to the level of the country.

When roll-up is performed by dimension reduction, one or more dimensions are removed
from the cube. For example, consider a sales data cube having two dimensions, location and
time. Roll-up may be performed by removing the time dimension, resulting in an
aggregation of the total sales by location, rather than by location and by time.
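A compact pandas sketch of roll-up along the location hierarchy, with invented sales figures; reading the two aggregation levels in the opposite direction illustrates drill-down.

```python
# Sketch of roll-up on the location hierarchy using pandas: sales held at
# city level are aggregated up to country level; stepping back from
# country to city is a drill-down.

import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago"],
    "units":   [605, 400, 300],
})

by_city    = sales.groupby(["country", "city"])["units"].sum()  # detailed level
by_country = sales.groupby("country")["units"].sum()            # rolled up
print(by_city, by_country, sep="\n")
```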

Slice

A slice is a subset of the cube corresponding to a single value for one or more members of
a dimension. For example, a slice operation is executed when the user wants a
selection on one dimension of a three-dimensional cube, resulting in a two-dimensional
subcube. So, the slice operation performs a selection on one dimension of the given cube,
thus resulting in a subcube.

The following diagram illustrates how Slice works.


Here, slice is performed on the dimension "time" using the criterion time = "Q1".

It forms a new subcube by selecting one or more dimensions.

Dice

The dice operation defines a subcube by performing a selection on two or more
dimensions; a sketch follows the selection criteria below.

Consider the following diagram, which shows the dice operations.


The dice operation on the cube based on the following selection criteria involves three
dimensions.

o (location = "Toronto" or "Vancouver")
o (time = "Q1" or "Q2")
o (item = "Mobile" or "Modem")

Pivot

The pivot operation is also called rotation. Pivot is a visualization operation that rotates
the data axes in view to provide an alternative presentation of the data. It may involve
swapping the rows and columns or moving one of the row dimensions into the column
dimensions.

Consider the following diagram, which shows the pivot operation.
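In code, pivot can be sketched with pandas (invented data): the item values rotate from rows into columns.

```python
# Sketch of pivot (rotation) with pandas: the item rows are rotated into
# columns, giving an alternative presentation of the same data.

import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "units":    [605, 825, 14, 400],
})

pivoted = cube.pivot(index="location", columns="item", values="units")
print(pivoted)
```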


Types of OLAP server

There are three main types of OLAP servers:

1. ROLAP

2. MOLAP

3. HOLAP

ROLAP (Relational OLAP)

It works directly with a relational database.

Facts and dimensions are stored as relations.

New relations are created to store aggregated information.

These are intermediate servers which stand between a relational back-end server
and user front-end tools.

They use a relational or extended-relational DBMS to store and manage warehouse data.

Advantages

It can handle a large amount of data.

ROLAP tools can store and analyze highly volatile data.

Disadvantages

Poor query performance.

Only expert users can deal with ROLAP.


MOLAP (Multidimensional OLAP)

These servers support multidimensional views of data through array-based
multidimensional storage engines.

Advantages

It is easy to use.

Information retrieval is very fast because data is stored in dimensional form.

It can perform complex computations on data.

Disadvantages

MOLAP is not capable of containing detailed information.

The DBMS facility is weak.

HOLAP (Hybrid OLAP)

The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP
server may allow large volumes of detailed data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. Microsoft SQL Server 2000 supports a
hybrid OLAP server.

Difference between ROLAP, MOLAP, and HOLAP

| ROLAP | MOLAP | HOLAP |
| --- | --- | --- |
| ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing. | HOLAP stands for Hybrid Online Analytical Processing. |
| The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source. | The MOLAP storage mode causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed. | The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis Services instance. |
| ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when results cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. | The MOLAP structure is highly optimized to maximize query performance. The storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. | HOLAP does not cause a copy of the source data to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP. |
| Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP. | Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition. | Queries that access source data (for example, drilling down to an atomic cube cell for which there is no aggregation data) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure. |
