Data Warehouse Concepts
accessed.
Time-Variant: Containing a history of the subject, as well as current information.
Historical information is an important component of a data warehouse.
Accessible: The primary purpose of a data warehouse is to provide readily
accessible information to end-users.
Process-Oriented: It is important to view data warehousing as a process for delivery
of information. The maintenance of a data warehouse is ongoing and iterative in
nature.
Note: A data warehouse does not require transaction processing, recovery, or concurrency control, because it is physically stored separately from the operational database.
The data warehouse is a database that is kept separate from the organization's operational database.
There is no frequent updating done in a data warehouse.
A data warehouse possesses consolidated historical data, which helps the organization analyse its business.
A data warehouse helps executives organize, understand, and use their data to take strategic decisions.
Data warehouse systems help in the integration of a diversity of application systems.
A data warehouse system allows analysis of consolidated historical data.
Data warehousing is applied in sectors such as:
Financial services
Banking Services
Consumer goods
Retail sectors.
Controlled manufacturing
Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. The mining results can be presented using visualization tools.
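As a toy illustration of the classification and prediction side of data mining, the sketch below (plain Python, entirely hypothetical data and labels) assigns a label to a new record by finding its nearest neighbour among known records:

```python
# Toy 1-nearest-neighbour classifier: a minimal illustration of the
# "classification and prediction" side of data mining.
# The customer records and labels below are hypothetical.

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(sample, training):
    # training: list of (features, label) pairs.
    nearest = min(training, key=lambda row: euclidean(sample, row[0]))
    return nearest[1]

training = [
    ((25, 30000), "low-spender"),
    ((45, 90000), "high-spender"),
    ((50, 85000), "high-spender"),
]

print(classify((48, 88000), training))
```

Real data mining tools use far richer models, but the pattern-matching idea is the same.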
Operational System        Informational System
Transaction driven        Analysis driven
Repetitive processing     Heuristic processing
Application oriented      Subject oriented
METADATA
It is data about data, used to describe things such as column sizes and performance metrics.

The characteristics of operational data and data warehouse data contrast as follows:

                         Operational                                Data Warehouse
Data content             current values                             archived, derived, summarized
Data organization        application by application                 subject areas across the enterprise
Nature of data           dynamic                                    static until refreshed
Data structure, format   complex; suitable for operational          simple; suitable for business analysis
                         computation
Access probability       high                                       moderate to low
Data update              updated on a field-by-field basis          accessed and derived; no direct update
Usage                    highly structured repetitive processing    highly unstructured analytical processing
Response time            sub-second to seconds                      seconds to minutes
BOTTOM-UP DESIGN:
In the bottom-up design approach, the data marts are created first. This model contains consistent data marts, and these data marts can be delivered quickly.
As the data marts are created first, reports can be generated quickly.
The data warehouse can be extended easily to accommodate new business units; this just means creating new data marts and then integrating them with the other data marts.
TOP-DOWN DESIGN:
In the top-down design approach, the data warehouse is built first. The data marts are then created from the data warehouse.
This approach is robust against business changes, and creating a new data mart from the data warehouse is very easy.
Bottom Tier - The bottom tier of the architecture is the data warehouse database server, which is a relational database system. Back-end tools and utilities are used to feed data into the bottom tier; they perform the extract, clean, load, and refresh functions.
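The extract, clean, and load steps these back-end tools perform can be sketched in plain Python. The record layouts and the list standing in for the warehouse below are hypothetical:

```python
# Minimal ETL sketch: extract rows from a "source", clean them, and
# load them into a list standing in for the warehouse's bottom tier.
# All record layouts and values here are hypothetical.

source_rows = [
    {"id": "1", "amount": " 100 ", "region": "north"},
    {"id": "2", "amount": "bad",   "region": "south"},   # dirty row
    {"id": "3", "amount": "250",   "region": "north"},
]

def extract(rows):
    # In practice this would read from an operational system.
    return list(rows)

def clean(rows):
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"id": int(row["id"]),
                            "amount": float(row["amount"].strip()),
                            "region": row["region"]})
        except ValueError:
            continue  # drop rows that fail type checks
    return cleaned

warehouse = []

def load(rows):
    warehouse.extend(rows)

load(clean(extract(source_rows)))
print(len(warehouse))
```

The dirty row is dropped during cleaning, so only well-typed rows reach the warehouse.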
Middle Tier - In the middle tier we have OLAP Server. The OLAP Server can
be implemented in either of the following ways.
Top-Tier - This tier is the front-end client layer. This layer holds the query
tools and reporting tool, analysis tools and data mining tools.
The following diagram explains the three-tier architecture of a data warehouse:
Virtual Warehouse
Data mart
Enterprise Warehouse
VIRTUAL WAREHOUSE
DATA MART
The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.
The enterprise warehouse collects all of the information about subjects spanning the entire organization.
The size and complexity of load manager varies between specific solutions
from data warehouse to data warehouse.
Perform simple transformations into a structure similar to the one in the data warehouse.
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
It is more effective to load the data into a relational database prior to applying transformations and checks.
Gateway technology is often not suitable, since gateways tend not to perform well when large data volumes are involved.
SIMPLE TRANSFORMATIONS
While loading, it may be required to perform simple transformations. After these have been completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
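Stripping out unneeded columns is the simplest of these transformations. A minimal sketch, with hypothetical column names:

```python
# Strip out columns that are not required within the warehouse.
# The column names used here are hypothetical.

REQUIRED = {"transaction_id", "store", "amount"}

def strip_columns(row):
    # Keep only the columns the warehouse actually needs.
    return {key: value for key, value in row.items() if key in REQUIRED}

row = {"transaction_id": 42, "store": "S1", "amount": 9.99,
       "till_operator": "op7", "receipt_footer": "thank you"}
print(strip_columns(row))
```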
WAREHOUSE MANAGER
The warehouse manager is responsible for the warehouse management process; its size and complexity vary between specific solutions. Its components include a backup/recovery tool and SQL scripts. The warehouse manager:
Performs consistency and referential integrity checks.
Creates the indexes, business views, and partition views against the base data.
Generates new aggregations and updates the existing aggregations.
Generates the normalizations.
The warehouse manager transforms and merges the source data from the temporary store into the published data warehouse.
The warehouse manager archives the data that has reached the end of its captured life.
Note: The warehouse manager also analyses query profiles to determine whether indexes and aggregations are appropriate.
QUERY MANAGER
The query manager is responsible for directing queries to the suitable tables.
By directing queries to the appropriate tables, the query request and response process is sped up.
Stored procedures.
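The routing idea can be sketched as follows. The table names are hypothetical: queries that only need pre-summarized measures go to an aggregate table, everything else falls back to the detail table:

```python
# Minimal query-routing sketch: direct a query to a pre-aggregated
# table when it only needs summarized columns, else to the detail
# table. Table and column names are hypothetical.

def route(query_columns, aggregate_columns):
    if set(query_columns) <= set(aggregate_columns):
        return "sales_monthly_agg"   # cheaper, pre-aggregated table
    return "sales_detail"            # fall back to full detail

agg_cols = {"month", "region", "total_amount"}
print(route({"month", "total_amount"}, agg_cols))
print(route({"transaction_id"}, agg_cols))
```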
The detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed-information part of the data warehouse keeps the detailed information in the starflake schema. The detailed information is loaded into the data warehouse to supplement the aggregated data.
Note: If the detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.
In general, all data warehouse systems have the following layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer
The picture below shows the relationships among the different components of the
data warehouse architecture:
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely
some minimal data cleansing, but there is unlikely any major data transformation.
Staging Area
This is where data sits prior to being scrubbed and transformed into a data
warehouse / data mart. Having one common area makes it easier for subsequent
data processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data
from a transactional nature to an analytical nature. This layer is also where data
cleansing happens. The ETL design phase is often the most time-consuming phase in
a data warehousing project, and an ETL tool is often used in this layer.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and
functionality, 3 types of entities can be found here: data warehouse, data mart, and
operational data store (ODS). In any given system, you may have just one of the
three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the
underlying data transformation rules, but do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. This can be in a form of a
tabular / graphical report in a browser, an emailed report that gets automatically
generated and sent every day, or an alert that warns users of exceptions, among
others. Usually an OLAP tool and/or a reporting tool is used in this layer.
Metadata Layer
This is where information about the data stored in the data warehouse system is
stored. A logical data model would be an example of something that's in the
metadata layer. A metadata tool is often used to manage metadata.
System Operations Layer
This layer includes information on how the data warehouse system operates, such as
ETL job status, system performance, and user access history.
OTHER DEFINITIONS
Data Warehouse: A data structure that is optimized for distribution. It collects and
stores integrated sets of historical data from multiple operational systems and feeds
them to one or more data marts. It may also provide end-user access to support
enterprise views of data.
Data Mart: A data structure that is optimized for access. It is designed to facilitate
end-user analysis of data. It typically supports a single, analytic application used by
a distinct set of workers.
Staging Area: Any data store that is designed primarily to receive data into a
warehousing environment.
Operational Data Store: A collection of data that addresses operational needs of
various operational units. It is not a component of a data warehousing architecture,
but a solution to operational needs.
OLAP (On-Line Analytical Processing): A method by which multidimensional
analysis occurs.
Multidimensional Analysis: The ability to manipulate information by a variety of
relevant categories or dimensions to facilitate analysis and understanding of the
underlying data. It is also sometimes referred to as drilling-down, drilling-across
and slicing and dicing.
Star Schema: A means of aggregating data based on a set of known dimensions. It
stores data multi-dimensionally in a two dimensional Relational Database
Management System (RDBMS), such as Oracle.
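The star schema idea can be sketched with one fact table referencing two dimension tables by key. The schema and figures below are hypothetical:

```python
# Minimal star schema sketch: a central fact table whose rows
# reference dimension tables by key. Schema and data are hypothetical.

dim_product = {1: {"name": "Widget", "category": "Tools"},
               2: {"name": "Gadget", "category": "Toys"}}
dim_date = {20240101: {"quarter": "Q1"}, 20240701: {"quarter": "Q3"}}

fact_sales = [
    {"product_id": 1, "date_id": 20240101, "amount": 100.0},
    {"product_id": 2, "date_id": 20240701, "amount": 40.0},
]

def sales_by_category(quarter):
    # Join fact rows to their dimensions, then aggregate the measure.
    totals = {}
    for row in fact_sales:
        if dim_date[row["date_id"]]["quarter"] == quarter:
            cat = dim_product[row["product_id"]]["category"]
            totals[cat] = totals.get(cat, 0.0) + row["amount"]
    return totals

print(sales_by_category("Q1"))
```

In an RDBMS the same query would be a join between the fact table and its dimension tables.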
Snowflake Schema: An extension of the star schema by means of applying
additional dimensions to the dimensions of a star schema in a relational
environment.
Multidimensional Database: Also known as MDDB or MDDBS. A class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner.
METADATA REPOSITORY
The metadata repository is an integral part of a data warehouse system. It contains the following metadata:
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Operational Metadata - This metadata includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data as it migrated and the transformations applied to it.
Data for mapping from the operational environment to the data warehouse - This metadata includes source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, data refresh, and purging rules.
The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
DATA MART
A data mart is a subject-oriented archive that stores data and uses the retrieved set of information to assist and support the requirements of a particular business function or department. Data marts exist within a single organizational data warehouse repository.
A data mart is a repository of data gathered from operational data and other sources
that is designed to serve a particular community of knowledge workers.
Data marts improve end-user response time by allowing users to have access to the
specific type of data they need to view most often by providing the data in a way
that supports the collective view of a group of users.
Metadata is simply defined as data about data. Data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows:
The metadata acts as a directory. This directory helps the decision support system locate the contents of the data warehouse.
Note: In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
Categories of Metadata
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Technical Metadata - Technical metadata includes database system names, table
and column names and sizes, data types and allowed values. Technical metadata
also includes structural information such as primary and foreign key attributes
and indices.
Operational Metadata - This metadata includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data as it migrated and the transformations applied to it.
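As an illustration, a metadata record for a single warehouse column might combine all three categories. Every field name and value below is hypothetical:

```python
# Hypothetical metadata record for one warehouse column, combining
# business, technical, and operational metadata in one structure.
column_metadata = {
    "business": {"owner": "Finance dept",
                 "definition": "Net revenue after discounts"},
    "technical": {"table": "fact_sales", "column": "net_amount",
                  "type": "DECIMAL(12,2)", "nullable": False},
    "operational": {"currency": "active",
                    "lineage": ["erp.sales_orders", "etl.clean_amounts"]},
}
print(sorted(column_metadata))
```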
ROLE OF METADATA
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is different from that of the warehouse data, yet it is equally important. The various roles of metadata are explained below.
This directory helps the decision support system to locate the contents of data
warehouse.
Metadata helps in decision support system for mapping of data when data are
transformed from operational environment to data warehouse environment.
Note: Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.
Steps to determine whether a data mart appears to fit the bill
Following steps need to be followed to make cost effective data marting:
DATA WAREHOUSE:
Does not necessarily use a dimensional model but feeds dimensional models.
DATA MART:
Often holds only one subject area- for example, Finance, or Sales
May hold more summarized data (although many hold full detail)
Ease of creation
Potential users are more clearly defined than in a full data warehouse
A DSS uses summary information, exceptions, patterns, and trends from analytical models. A decision support system helps in decision making but does not always give a decision itself. Decision makers compile useful information from raw data, documents, personal knowledge, and/or business models to identify and solve problems and make decisions.
Programmed and Non-programmed Decisions
There are two types of decisions: programmed and non-programmed decisions.
Programmed decisions are basically automated processes and general routine work.
Non-programmed decisions occur in unusual and unaddressed situations. These decisions are based on the manager's discretion, instinct, perception, and judgment. For example, investing in a new technology is a non-programmed decision.
Decision support systems generally involve non-programmed decisions. Therefore,
there will be no exact report, content or format for these systems. Reports are
generated on the fly.
ATTRIBUTES OF A DSS
Ease of use
Ease of development
Extendibility
Support for individuals and groups: less structured problems often require the involvement of several individuals from different departments and organizational levels.
BENEFITS OF DSS
COMPONENTS OF A DSS
Following are the components of the Decision Support System:
Model Management System: It stores and accesses models that managers use to make decisions. Such models are used for designing a manufacturing facility, analyzing the financial health of an organization, forecasting demand for a product or service, etc.
Support Tools: Support tools such as online help, pull-down menus, user interfaces, graphical analysis, and error-correction mechanisms facilitate the user's interactions with the system.
Classification of DSS
There are several ways to classify DSS. Holsapple and Whinston classify DSS as follows:
TYPES OF DSS
Following are some typical DSSs:
Market intelligence
Investment intelligence
Technology intelligence
Examples of Intelligent Information:
External databases
Market reports
Government policies
ADVANTAGES OF ESS:
Better understanding
Time management
DISADVANTAGE OF ESS
personalized information
Intranet
Data warehouses and knowledge repositories
Decision support tools
Groupware for supporting collaboration
Networks of knowledge workers
Internal expertise
DEFINITION OF KMS
Improved performance
Competitive advantage
Innovation
Sharing of knowledge
Integration
Start with the business problem and the business value to be delivered first.
Identify what kind of strategy to pursue to deliver this value and address the
KM problem
Think about the system required from a people and process point of view.
Data load takes the extracted data and loads it into the data warehouse.
Note: Before loading the data into the data warehouse, the information extracted from external sources must be reconstructed.
Points to remember during the extract-and-load process:
Make sure the data is consistent with other data within the same data source.
AGGREGATION
Aggregation is required to speed up common queries. It relies on the fact that the most common queries will analyse a subset or an aggregation of the detailed data.
This process should also ensure that all system sources are used in the most effective way.
This process does not generally operate during the regular load of information into the data warehouse.
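The fact aggregation relies on, that common queries hit a precomputed summary rather than the detail, can be sketched as building the aggregate once and answering from it. The figures are hypothetical:

```python
# Precompute an aggregate so common queries avoid scanning detail rows.
# Detail rows and amounts are hypothetical.

detail = [
    {"region": "north", "amount": 10.0},
    {"region": "north", "amount": 5.0},
    {"region": "south", "amount": 7.0},
]

# Built once, e.g. during the load/refresh window:
aggregate = {}
for row in detail:
    aggregate[row["region"]] = aggregate.get(row["region"], 0.0) + row["amount"]

# A common query now reads one entry instead of scanning all detail rows:
print(aggregate["north"])
```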
INTRODUCTION
Feature        OLTP                                 OLAP
Purpose        day-to-day transaction processing    analysis and decision support
Structure      RDBMS                                RDBMS
Data Model     normalized                           multidimensional
Access         SQL                                  SQL with analytical extensions
Type of Data   current, operational                 historical, descriptive
Relational OLAP (ROLAP)
OLAP Operations
Since the OLAP server is based on the multidimensional view of data, we will discuss the OLAP operations on multidimensional data.
Roll-up
Drill-down
Pivot (rotate)
ROLL-UP
This operation performs aggregation on a data cube in either of the following ways:
By climbing up a concept hierarchy for a dimension.
By dimension reduction.
Consider the following diagram showing the roll-up operation.
Initially the concept hierarchy was "street < city < province < country".
When the roll-up operation is performed, one or more dimensions from the data cube are removed.
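A roll-up along a location hierarchy such as city < country can be sketched as re-aggregating city-level totals to country level. The cities and figures are hypothetical:

```python
# Roll-up sketch: climb the concept hierarchy city -> country by
# summing city-level measures. Cities and figures are hypothetical.

city_to_country = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Chicago": "USA"}
sales_by_city = {"Toronto": 120, "Vancouver": 80, "Chicago": 200}

sales_by_country = {}
for city, amount in sales_by_city.items():
    country = city_to_country[city]
    sales_by_country[country] = sales_by_country.get(country, 0) + amount

print(sales_by_country)
```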
DRILL-DOWN
The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:
By stepping down a concept hierarchy for a dimension.
By introducing a new dimension.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension descends from the level of quarter to the level of month.
It navigates the data from less detailed data to highly detailed data.
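Descending from quarter level to month level means reading the finer-grained data behind one quarter-level cell. A minimal sketch with hypothetical figures:

```python
# Drill-down sketch: descend from quarter level to month level by
# returning the finer-grained data behind a quarter-level cell.
# Figures are hypothetical.

sales_by_month = {"Jan": 30, "Feb": 25, "Mar": 45, "Apr": 50}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def drill_down(quarter):
    # Month-level detail behind one quarter-level value.
    return {m: v for m, v in sales_by_month.items()
            if month_to_quarter[m] == quarter}

print(drill_down("Q1"))
```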
SLICE
The slice operation performs a selection on one dimension of a given cube and gives us a new subcube. Consider the following diagram showing the slice operation.
The slice operation is performed for the dimension time using the criterion time = "Q1".
It forms a new subcube by selecting one or more dimensions.
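Representing the cube as rows of dimension coordinates plus a measure (hypothetical data), slicing on time = "Q1" looks like:

```python
# Slice sketch: fix one dimension (time = "Q1") to obtain a subcube.
# The cube rows and values are hypothetical.

cube = [
    {"time": "Q1", "item": "Widget", "location": "north", "amount": 10},
    {"time": "Q1", "item": "Gadget", "location": "south", "amount": 7},
    {"time": "Q2", "item": "Widget", "location": "north", "amount": 4},
]

def slice_cube(dimension, value):
    # Keep only the rows where the chosen dimension equals the value.
    return [row for row in cube if row[dimension] == value]

print(len(slice_cube("time", "Q1")))
```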
DICE
The dice operation performs a selection on two or more dimensions of a given cube and gives us a new subcube. Consider the following diagram showing the dice operation:
The dice operation on the cube is based on selection criteria that involve three dimensions.
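Dicing differs from slicing in that it constrains several dimensions at once. A minimal sketch over the same kind of hypothetical cube rows:

```python
# Dice sketch: select on two or more dimensions at once.
# The cube rows and criteria are hypothetical.

cube = [
    {"time": "Q1", "item": "Widget", "location": "north", "amount": 10},
    {"time": "Q1", "item": "Gadget", "location": "south", "amount": 7},
    {"time": "Q2", "item": "Widget", "location": "north", "amount": 4},
]

def dice(criteria):
    # criteria: dict mapping dimension -> set of allowed values.
    return [row for row in cube
            if all(row[d] in allowed for d, allowed in criteria.items())]

subcube = dice({"time": {"Q1", "Q2"}, "item": {"Widget"}})
print(len(subcube))
```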
PIVOT
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.
Here the item and location axes in the 2-D slice are rotated.
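Rotating the item and location axes of a 2-D slice amounts to swapping the coordinates of each cell. A minimal sketch with hypothetical values:

```python
# Pivot sketch: rotate a 2-D slice so the item and location axes swap.
# Cell values are hypothetical.

slice_2d = {("Widget", "north"): 10, ("Widget", "south"): 3,
            ("Gadget", "north"): 7}

def pivot(table):
    # Swap the (item, location) axes to (location, item).
    return {(loc, item): v for (item, loc), v in table.items()}

print(pivot(slice_2d)[("north", "Widget")])
```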
As we have seen, the size of a typical database has grown by orders of magnitude in the last few years. This change in magnitude is of great significance.
As the size of databases grows, the estimates of what constitutes a very large database continue to grow.
The hardware and software available today make it difficult to keep a large amount of data online. For example, a telco call record requires 10 TB of data to be kept online, and that is just the size of one month's records. If records of sales, marketing, customers, employees, etc. are also kept, the size will be more than 100 TB.
A record contains not only textual information but also some multimedia data. Multimedia data cannot be manipulated as easily as text data. Searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.
Apart from size planning, building and running ever-larger data warehouse systems is very complex. As the number of users increases, the size of the data warehouse also increases, and these users will also require access to the system.
Q: What functions do data warehouse tools and utilities perform?
A: The functions performed by data warehouse tools and utilities are data extraction, data cleaning, data transformation, data loading, and refreshing.
Q: What do you mean by Data Extraction?
A: Data Extraction means gathering the data from multiple heterogeneous sources.
Q: Define Metadata?
A: Metadata is simply defined as data about data. In other words we can say that
metadata is the summarized data that lead us to the detailed data.
Q: What does the Metadata Repository contain?
A: The metadata repository contains the definition of the data warehouse, business metadata, operational metadata, data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.
Q: How does a Data Cube help?
A: A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts.
Q: Define Dimension?
A: The dimensions are the entities with respect to which an enterprise keeps the
records.
Q: Explain Data mart?
A: A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific group in an organization. In other words, a data mart contains only data that is specific to a particular group.
Q: What is Virtual Warehouse?
A: The view over an operational data warehouse is known as virtual warehouse.
Q: List the phases involved in Data warehouse delivery Process?
A: The stages are IT strategy, Education, Business Case Analysis, technical Blueprint,
Build the version, History Load, Ad hoc query, Requirement Evolution, Automation,
Extending Scope.
Q: Explain Load Manager?
A: This component performs the operations required for the extract and load process. The size and complexity of the load manager varies between specific solutions from data warehouse to data warehouse.
Q: Define the function of Load Manager?
A: Extract the data from source system. Fast Load the extracted data into temporary
data store. Perform simple transformations into structure similar to the one in the
data warehouse.
Q: Explain Warehouse Manager?
A: Warehouse manager is responsible for the warehouse management process. The
warehouse manager consists of third party system software, C programs and shell
scripts. The size and complexity of warehouse manager varies between specific
solutions.
Q: Define functions of Warehouse Manager?
A: The warehouse manager performs consistency and referential integrity checks; creates indexes, business views, and partition views against the base data; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives data that has reached the end of its captured life.
Q: What is Summary Information?
A: Summary Information is the area in data warehouse where the predefined
aggregations are kept.
Q: What does the Query Manager responsible for?
A: Query Manager is responsible for directing the queries to the suitable tables.
Q: List the types of OLAP server?
A: There are four types of OLAP Server namely Relational OLAP, Multidimensional
OLAP, Hybrid OLAP, and Specialized SQL Servers
Q: Which one is faster, Multidimensional OLAP or Relational OLAP?
A: Multidimensional OLAP is faster than Relational OLAP.
FACTOR ANALYSIS

Variables                            Factor 1   Factor 2
Income                               0.65       0.11
Education                            0.59       0.25
Occupation                           0.48       0.19
House value                          0.38       0.60
Number of public parks               —          0.57
Number of violent crimes per year    —          0.55
The variable with the strongest association to the underlying latent variable, Factor 1, is income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one
could also say that the variable income has a correlation of 0.65 with Factor 1. This
would be considered a strong association for a factor analysis in most research fields.
Two other variables, education and occupation, are also associated with Factor 1.
Based on the variables loading highly onto Factor 1, we could call it Individual
socioeconomic status.
House value, number of public parks, and number of violent crimes per year,
however, have high factor loadings on the other factor, Factor 2. They seem to
indicate the overall wealth within the neighbourhood, so we may want to call Factor
2 Neighbourhood socioeconomic status.
Notice that the variable house value is also marginally important in Factor 1 (loading = 0.38). This makes sense, since the value of a person's house should be associated with his or her income.
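Reading the strongest loading per factor from the table above can be sketched in a few lines. Only the loadings actually listed in the table are used; the missing Factor 1 cells for the last two variables are simply omitted:

```python
# Find the variable with the strongest loading on each factor,
# using the loadings from the table above (missing cells omitted).

loadings = {
    "Income":      {"Factor 1": 0.65, "Factor 2": 0.11},
    "Education":   {"Factor 1": 0.59, "Factor 2": 0.25},
    "Occupation":  {"Factor 1": 0.48, "Factor 2": 0.19},
    "House value": {"Factor 1": 0.38, "Factor 2": 0.60},
}

def strongest(factor):
    # Variable whose loading on the given factor is largest.
    return max(loadings, key=lambda var: loadings[var].get(factor, 0.0))

print(strongest("Factor 1"), strongest("Factor 2"))
```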