Unit-II Data Warehousing & OLAP

UNIT- II

Data Warehouse and OLAP: Data Warehouse basic concepts, Differences between Operational
Database Systems and Data Warehouses, multidimensional Data model, data warehouse
architecture.

Data Warehouse

Data warehouses serve as a central repository for storing and analyzing information to
make better informed decisions. An organization's data warehouse receives data from a variety
of sources, typically on a regular basis, including transactional systems, relational databases, and
other sources.

A data warehouse is a type of data management system that facilitates and supports business
intelligence (BI) activities, specifically analysis. Data warehouses are primarily designed to
facilitate searches and analyses and usually contain large amounts of historical data.

ETL stands for extract, transform, and load. It is the traditionally accepted way for
organizations to combine data from multiple systems into a single database, data store, data
warehouse, or data lake.
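
As a rough illustration of the three ETL steps, here is a minimal Python sketch. The two source systems, the sample rows, the table name `sales`, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: two systems that record sales in different formats.
CRM_CSV = "customer,amount,date\nAsha,120.50,2018-01-15\nRavi,80.00,2018-02-03\n"
POS_ROWS = [("ASHA ", "45", "15/03/2018"), ("meena", "200", "20/03/2018")]


def extract():
    """Extract: read the raw rows from each source as they are."""
    crm = list(csv.DictReader(io.StringIO(CRM_CSV)))
    pos = [{"customer": c, "amount": a, "date": d} for c, a, d in POS_ROWS]
    return crm, pos


def transform(crm, pos):
    """Transform: unify customer names, numeric types, and the date format."""
    rows = []
    for r in crm:
        rows.append((r["customer"].strip().title(), float(r["amount"]), r["date"]))
    for r in pos:
        day, month, year = r["date"].split("/")  # DD/MM/YYYY -> YYYY-MM-DD
        rows.append((r["customer"].strip().title(), float(r["amount"]),
                     f"{year}-{month}-{day}"))
    return rows


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))

# After loading, data from both sources can be analyzed together.
totals = dict(conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer"))
```

After the load, `totals` combines the rows for "Asha" from both sources, which is only possible because the transform step gave the names and dates one common format.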

Key Characteristics of Data Warehouse

A data warehouse has the following characteristics.


Subject-oriented

A data warehouse focuses on a specific topic like sales, marketing, or distribution. It is
designed to provide information about a particular theme rather than the day-to-day operations of
an organization.

Integrated

A data warehouse combines data from different sources, such as mainframes and
relational databases, into a single, reliable format. The data must be organized and structured in a
way that allows for effective analysis.

Time-variant

Data in a data warehouse is maintained over time, at weekly, monthly, or annual intervals. This
makes it possible to do historical analysis and to track changes over time.

Non-volatile

Data in a data warehouse is stable. Once stored, it is not normally modified or deleted; new data
is added through periodic loads. This supports historical analysis and keeps the data available in
its original state.

By understanding these characteristics, organizations can use data warehouses to make better
decisions by analyzing large amounts of data from different sources in a consistent and reliable
way.

Data warehousing has some advantages and disadvantages.

Advantages
• Makes data easier to understand
• Continuous updating
• Accessibility

Disadvantages
• Accumulation of irrelevant data
• Data loss and erasure
• Data cleansing and transformation

Functions of a Data Warehouse

A data warehouse is a collection of data that is organized to provide various functions for
managing and analyzing data. Some of the important functions of a data warehouse are:

• Data Consolidation
• Data Cleaning
• Data Integration
• Data Storage
• Data Transformation
• Data Analysis
• Data Reporting
• Data Mining
• Performance Optimization

These functions enable organizations to manage and analyze large amounts of data from
different sources, and make informed decisions based on reliable and accurate information.

---------------------------------------------------------------------------------------------------------------------

Data Warehouse Design Process:

A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.

• The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that
must be solved are clear and well understood.
• The bottom-up approach starts with experiments and prototypes. This is useful in the
early stage of business modeling and technology development. It allows an organization
to move forward at considerably less expense and to evaluate the benefits of the
technology before making significant commitments.
• In the combined approach, an organization can exploit the planned and strategic nature
of the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.

The warehouse design process consists of the following steps:

• Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is
organizational and involves multiple complex object collections, a data warehouse model
should be followed. However, if the process is departmental and focuses on the analysis
of one kind of business process, a data mart model should be chosen.
• Choose the grain of the business process. The grain is the fundamental, atomic level of
data to be represented in the fact table for this process, for example, individual
transactions, individual daily snapshots, and so on.
• Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
• Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.
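
The four design choices above (business process, grain, dimensions, measures) can be sketched as a small star schema. This is only an illustration under assumed names: the tables `fact_sales`, `dim_time`, `dim_item`, and `dim_customer`, their columns, and the sample rows are hypothetical, and SQLite is used purely as a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per member, holding descriptive attributes.
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);

-- Fact table for the "sales" business process.
-- Grain: one row per individual sales transaction.
CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time(time_id),
    item_id      INTEGER REFERENCES dim_item(item_id),
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    dollars_sold REAL,     -- additive measure
    units_sold   INTEGER   -- additive measure
);
""")

conn.execute("INSERT INTO dim_time VALUES (1, '2018-01-15', 'Q1', 2018)")
conn.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                 [(1, "Car", "Vehicle"), (2, "Bus", "Vehicle")])
conn.execute("INSERT INTO dim_customer VALUES (1, 'Asha', 'Delhi')")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 9000.0, 1), (1, 2, 1, 30000.0, 2), (1, 1, 1, 9500.0, 1)])

# Because the measures are additive, they can be summed along any dimension.
by_item = conn.execute("""
    SELECT i.item_name, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM fact_sales AS f JOIN dim_item AS i ON i.item_id = f.item_id
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()
```

Choosing a coarser grain (for example, one row per day per item) would make the fact table smaller but would rule out any later analysis at the level of single transactions.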

A Three Tier Data Warehouse Architecture:


Tier-1:

The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g.,
to merge similar data from different sources into a unified format), as well as load and refresh
functions to update the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS and allows
client programs to generate SQL code to be executed at a server. Examples of gateways include
ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database)
by Microsoft and JDBC (Java Database Connectivity). This tier also contains a metadata
repository, which stores information about the data warehouse and its contents.

Tier-2:

The middle tier is an OLAP server that is typically implemented using either a relational OLAP
(ROLAP) model or a multidimensional OLAP (MOLAP) model. A ROLAP model is an extended
relational DBMS that maps operations on multidimensional data to standard relational
operations. A MOLAP model is a special-purpose server that directly implements
multidimensional data and operations.

Tier-3:

The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).

Online Analytical Processing Server (OLAP)

An Online Analytical Processing (OLAP) server is software that lets users analyze information
from many different databases at once. It uses a multidimensional data model in which users can
ask questions along multiple dimensions at the same time. For example, a user could ask for sales
data from Delhi in the year 2018. OLAP databases are organized into cubes, which are also called
hypercubes.

OLAP operations

These are used to analyze data in an OLAP cube. There are five basic operations:

Roll up:

This makes the data less detailed by climbing up the concept hierarchy or reducing
dimensions. For example, in a cube showing sales data by City, rolling up would show sales
data by Country.

Drill down:

This makes the data more detailed by moving down the concept hierarchy or adding a new
dimension. For example, in a cube showing sales data by Quarter, drilling down would show
sales data by Month.
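
Roll up and drill down can be sketched in a few lines of Python. The concept hierarchies (city to country, month to quarter), the sales figures, and the helper `aggregate` are hypothetical and only serve to show the direction of each operation.

```python
from collections import defaultdict

# Hypothetical concept hierarchies: city -> country and month -> quarter.
CITY_TO_COUNTRY = {"Delhi": "India", "Kolkata": "India", "Toronto": "Canada"}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Most detailed level held here: sales by (city, month).
sales = {
    ("Delhi", "Jan"): 100,
    ("Delhi", "Feb"): 120,
    ("Kolkata", "Jan"): 80,
    ("Toronto", "Apr"): 60,
}


def aggregate(cells, key_fn):
    """Sum the measure over all cells that map to the same coarser key."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key_fn(key)] += value
    return dict(out)


# Starting view: sales by (city, quarter).
by_city_quarter = aggregate(sales, lambda k: (k[0], MONTH_TO_QUARTER[k[1]]))

# Roll up: climb the location hierarchy from city to country.
by_country_quarter = aggregate(
    by_city_quarter, lambda k: (CITY_TO_COUNTRY[k[0]], k[1])
)

# Drill down: go from quarter back to month for one cell. The detail cannot
# be derived from the quarterly totals, so it is read from the base data.
delhi_q1_by_month = {
    k: v for k, v in sales.items()
    if k[0] == "Delhi" and MONTH_TO_QUARTER[k[1]] == "Q1"
}
```

Note the asymmetry: a roll up can be computed from the current view, while a drill down needs the more detailed data to be stored somewhere.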

Slice:

This selects a single value on one dimension and creates a new sub-cube from the remaining
dimensions. For example, in a cube showing sales data by Location, Time, and Item, slicing on
Time = Q1 would create a new sub-cube showing sales data by Location and Item for Q1 only.

[Figure: how the slice operation is performed]

Dice:

This selects a sub-cube by applying selection criteria on two or more dimensions. For example,
in a cube showing sales data by Location, Time, and Item, dicing could select sales data for Delhi
or Kolkata, in Q1 or Q2, for Cars or Buses.
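
A minimal sketch of slice and dice, assuming the cube is held as a dictionary keyed by (location, time, item). The cell values and the helper names `slice_cube` and `dice_cube` are hypothetical.

```python
# Hypothetical cube cells keyed by (location, time, item).
cube = {
    ("Delhi", "Q1", "Car"): 100,
    ("Delhi", "Q2", "Bus"): 150,
    ("Kolkata", "Q1", "Bus"): 80,
    ("Mumbai", "Q1", "Car"): 60,
    ("Kolkata", "Q3", "Car"): 40,
}


def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop it from the key."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cells.items()
        if key[axis] == value
    }


def dice_cube(cells, allowed):
    """Keep cells whose value on every listed axis is in the allowed set."""
    return {
        key: measure
        for key, measure in cells.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }


# Slice: Time = Q1 leaves a two-dimensional (location, item) sub-cube.
q1 = slice_cube(cube, axis=1, value="Q1")

# Dice: Delhi or Kolkata, Q1 or Q2, Car or Bus keeps all three dimensions.
diced = dice_cube(cube, {0: {"Delhi", "Kolkata"}, 1: {"Q1", "Q2"}, 2: {"Car", "Bus"}})
```

The slice result has one dimension fewer than the cube, whereas the dice result keeps every dimension and only narrows the range of values on each.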

Pivot:

This rotates the axes of the current view to give a new presentation of the same data. For
example, after slicing by Time, pivoting could swap the two remaining axes so that Location
appears in rows and Item in columns.
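
Pivot changes only the layout, not the values. A small sketch, assuming a two-dimensional view stored as a nested dictionary (the figures and the function name `pivot` are hypothetical):

```python
# Hypothetical 2-D view left after slicing by Time:
# rows are items, columns are locations.
view = {
    "Car": {"Delhi": 100, "Kolkata": 0},
    "Bus": {"Delhi": 150, "Kolkata": 80},
}


def pivot(table):
    """Swap the row and column axes; the cell values are unchanged."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


# Rows are now locations and columns are items.
by_location = pivot(view)
```

Pivoting twice returns the original view, which is one way to check that no data was changed.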

Types of OLAP:

1. Relational OLAP (ROLAP):

• ROLAP works directly with relational databases. The base data and the dimension tables
are stored as relational tables and new tables are created to hold the aggregated
information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement.
• ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
• ROLAP tools can answer any question because the methodology is not limited to the
contents of a cube. ROLAP can also drill down to the lowest level of detail in the
database.
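
The point that slicing and dicing amount to adding a "WHERE" clause can be shown with a short sketch. The table `sales`, its columns, and the rows are hypothetical, and SQLite stands in for the relational database behind a ROLAP tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (location TEXT, quarter TEXT, item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Delhi", "Q1", "Car", 100.0),
    ("Delhi", "Q2", "Bus", 150.0),
    ("Kolkata", "Q1", "Car", 80.0),
    ("Mumbai", "Q1", "Bus", 60.0),
    ("Kolkata", "Q3", "Bus", 40.0),
])

# Slice: fix one dimension (quarter = 'Q1') with a single WHERE condition.
slice_q1 = conn.execute(
    "SELECT location, item, SUM(amount) FROM sales "
    "WHERE quarter = ? "
    "GROUP BY location, item ORDER BY location, item",
    ("Q1",),
).fetchall()

# Dice: restrict several dimensions at once with a compound WHERE clause.
dice = conn.execute(
    "SELECT location, quarter, item, amount FROM sales "
    "WHERE location IN ('Delhi', 'Kolkata') "
    "AND quarter IN ('Q1', 'Q2') "
    "AND item IN ('Car', 'Bus') "
    "ORDER BY location, quarter"
).fetchall()
```

Nothing is pre-computed here: each query is answered from the base table at the time it is asked, which is why ROLAP can reach any level of detail but may be slower on large aggregations.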

2. Multidimensional OLAP (MOLAP):

• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores its data in optimized multidimensional array storage, rather than in a
relational database. It therefore requires the pre-computation and storage of information
in the cube, an operation known as processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The
data cube contains all the possible answers to a given range of questions.
• MOLAP tools have a very fast response time and the ability to quickly write back data
into the data set.
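
The idea of pre-computing "all the possible answers" can be sketched as follows. This is a simplified model, not how any particular MOLAP product stores its arrays: the fact rows and the function `build_cube` are hypothetical, and the cube is just a dictionary holding one table of sums for every subset of the dimensions.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("location", "quarter", "item")

# Hypothetical fact rows: (location, quarter, item, amount).
FACTS = [
    ("Delhi", "Q1", "Car", 100.0),
    ("Delhi", "Q2", "Bus", 150.0),
    ("Kolkata", "Q1", "Car", 80.0),
]


def build_cube(facts):
    """Pre-compute SUM(amount) for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            sums = defaultdict(float)
            for row in facts:
                key = tuple(row[d] for d in dims)
                sums[key] += row[-1]
            cube[tuple(DIMENSIONS[d] for d in dims)] = dict(sums)
    return cube


# The "processing" step: done once, before any question is asked.
cube = build_cube(FACTS)

# Queries are now dictionary lookups instead of scans over the fact rows.
total = cube[()][()]
delhi = cube[("location",)][("Delhi",)]
q1_cars = cube[("quarter", "item")][("Q1", "Car")]
```

With three dimensions there are 2^3 = 8 pre-computed tables. The number doubles with every added dimension, which is the storage cost paid for the fast response time.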

3. Hybrid OLAP (HOLAP):

• There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
except that a database will divide data between relational and specialized storage. For
example, for some vendors, a HOLAP database will use relational tables to hold the
larger quantities of detailed data, and use specialized storage for at least some aspects of
the smaller quantities of more-aggregate or less-detailed data. HOLAP addresses the
shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
HOLAP tools can utilize both pre-calculated cubes and relational data sources.

---------------------------------------------------------------------------------------------------------------------

Differences between Operational Database Systems and Data Warehouses:

A data warehouse is a repository for structured, filtered data that has already been
processed for a specific purpose. It collects data from multiple sources, transforms it
using an ETL process, and then loads it into the data warehouse for business use.

An operational database, on the other hand, is a database whose data changes
frequently. Operational databases are designed mainly for a high volume of data transactions,
and they are the source databases for the data warehouse. They are used for recording online
transactions and maintaining integrity in multi-access environments.

What is a Data Warehouse?

A data warehouse is a system that is used by users or knowledge managers for data
analysis and decision-making. It can construct and present the data in a certain structure to fulfill
the diverse requirements of several users. Data warehouses are often referred to as Online
Analytical Processing (OLAP) systems.

In a data warehouse or OLAP system, the data is saved in a format that supports effective
analysis and data mining. The data in a data warehouse typically has a denormalized
schema. Performance-wise, data warehouses are quite fast at answering analytical queries.
Data warehouse systems integrate data from several application systems and provide a solid
platform of consolidated historical data for analysis.

What is an Operational Database?


The type of database system that stores information related to operations of an enterprise is
referred to as an operational database. Operational databases are required for functional lines like
marketing, employee relations, customer service etc. Operational databases are basically the
sources of data for the data warehouses because they contain detailed data required for the
normal operations of the business.
Operational Database | Data Warehouse
-------------------- | --------------
Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
Data within operational systems are updated regularly according to need. | Data are non-volatile; new data may be added regularly, but once added they are rarely changed.
It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
It is optimized for validation of incoming information during transactions and uses validation data tables. | It is loaded with consistent, valid information and requires no real-time validation.
It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP.
Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.
A small number of records is accessed. | A large number of records is accessed.
Relational databases are created for online transaction processing (OLTP). | Data warehouses are designed for online analytical processing (OLAP).
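
The difference in workload can be made concrete with a short sketch. The table `orders` and its rows are hypothetical, and a single SQLite database is used for both styles purely for illustration; in practice the two workloads usually run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, year INTEGER, amount REAL)"
)

# OLTP-style work: many short transactions, each adding or fetching one row.
new_orders = [(1, "Delhi", 2017, 50.0), (2, "Delhi", 2018, 70.0), (3, "Kolkata", 2018, 30.0)]
for order in new_orders:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", order)

one_order = conn.execute(
    "SELECT city, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style work: one read-only query that scans and aggregates many rows.
yearly = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
```

The lookup by `order_id` touches a single row, while the yearly summary has to read every row in the table, which is why the two kinds of system are optimized so differently.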

Multidimensional Data Model:

The multidimensional data model is a method for organizing the data in a database so that its
contents are well arranged and assembled.

The multidimensional data model lets users ask analytical questions about market or business
trends, unlike relational databases, which let users access data in the form of queries. It allows
users to receive answers to their requests rapidly, because the data can be assembled and
examined comparatively fast.

OLAP (online analytical processing) and data warehousing use multidimensional databases,
which present multiple dimensions of the data to users.

The model represents data in the form of data cubes. A data cube allows data to be modeled and
viewed from many dimensions and perspectives. It is defined by dimensions and facts and is
represented by a fact table. Facts are numerical measures; the fact table contains the names of
the facts (the measures) and keys to each of the related dimension tables.

Working on a Multidimensional Data Model

A multidimensional data model is built by following a set of predefined stages.
Every project should go through the following stages when building a multidimensional data
model:
Stage 1: Assembling data from the client: In the first stage, the correct data is collected from
the client. Typically, software professionals explain to the client the range of data that can be
obtained with the selected technology and then collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all the data is
identified and classified into the sections it belongs to, which makes the model easier to apply
step by step.
Stage 3: Identifying the different dimensions: The third stage is the basis on which the
design of the system rests. In this stage, the main factors are identified from the
user’s point of view. These factors are known as “dimensions”.
Stage 4: Preparing the factors and their respective qualities: In the fourth stage,
the factors identified in the previous stage are used to identify their related
qualities. These qualities are known as “attributes” in the database.
Stage 5: Finding the facts for the factors listed previously and their qualities: In
the fifth stage, the facts are separated from the factors collected so far. These facts play a
significant role in the arrangement of a multidimensional data model.
Stage 6: Building the schema to place the data, with respect to the information collected
in the stages above: In the sixth stage, a schema is built on the basis of the data collected
previously.
