Data Warehousing Concepts
Data Warehousing > Data Warehouse Definition > Data Warehouse Architecture
Different data warehousing systems have different
structures. Some may have an ODS (operational data store),
while some may have multiple data marts. Some may have
a small number of data sources, while some may have
dozens of data sources. In view of this, it is far more
reasonable to present the different layers of a data
warehouse architecture rather than discussing the specifics
of any one system.
In general, all data warehouse systems have the following layers:
Data Source Layer
This represents the different data sources that feed data into the data warehouse. All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data
warehouse system. There is likely some minimal data
cleansing, but there is unlikely any major data
transformation.
Staging Area
This is where data sits prior to being scrubbed and
transformed into a data warehouse / data mart. Having one
common area makes it easier for subsequent data
processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied
to transform the data from a transactional nature to an
analytical nature. This layer is also where data cleansing
happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this layer.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: the data warehouse, the data mart, and the operational data store (ODS).
Slowly Changing Dimensions

The "slowly changing dimension" problem arises when an attribute of a dimension record changes over time. In the examples below, customer Christina (Customer Key 1001) moves from Illinois to California.

Original record:

Customer Key | Name      | State
1001         | Christina | Illinois

Type 1 Slowly Changing Dimension

In Type 1, the new information simply overwrites the original record. After the change:

Customer Key | Name      | State
1001         | Christina | California

Advantages:
- This is the easiest way to handle the slowly changing dimension problem, since there is no need to keep track of the old information.

Type 2 Slowly Changing Dimension

In Type 2, a new record is added to the table. Before the change:

Customer Key | Name      | State
1001         | Christina | Illinois

After the change:

Customer Key | Name      | State
1001         | Christina | Illinois
1005         | Christina | California

Advantages:
- This allows us to keep all history.

Type 3 Slowly Changing Dimension

In Type 3, the original record is modified to track both the original and the current value of the changed attribute, along with the date the change took effect. Before the change:

Customer Key | Name      | State
1001         | Christina | Illinois

After the change:

Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003

Advantages:
- This does not increase the size of the table, since the new information is simply updated in place.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an
attribute is changed more than once. For example, if
Christina later moves to Texas on December 15, 2003, the
California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
A Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
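The Type 2 behavior described above can be sketched in a few lines of code. This is a minimal sketch: the sqlite3 schema and the "max key plus one" surrogate-key scheme are illustrative assumptions, not part of the original example (which happened to use 1005 as the new key).

```python
import sqlite3

# In-memory customer dimension using the Christina example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)")
conn.execute("INSERT INTO customer VALUES (1001, 'Christina', 'Illinois')")

def scd_type2_update(conn, name, new_state):
    """Type 2: keep the old record and add a new one under a new surrogate key."""
    (max_key,) = conn.execute("SELECT MAX(customer_key) FROM customer").fetchone()
    conn.execute("INSERT INTO customer VALUES (?, ?, ?)", (max_key + 1, name, new_state))

scd_type2_update(conn, "Christina", "California")

rows = conn.execute(
    "SELECT customer_key, name, state FROM customer ORDER BY customer_key"
).fetchall()
# Both the Illinois record and the California record are kept.
print(rows)
```

A Type 1 update, by contrast, would simply be an UPDATE statement against the existing row, discarding the old state.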
Next Page: Conceptual Data Model
Data Warehousing > Concepts > Conceptual Data Model
(Figure: an example conceptual data model, showing entities and the relationships between them.)
From the figure above, we can see that the only information
shown via the conceptual data model is the entities that
describe the data and the relationships between those
entities. No other information is shown through the
conceptual data model.
Next Page: Logical Data Model
Data Warehousing > Concepts > Logical Data Model
The table below compares the features that appear in the conceptual, logical, and physical data models:

Feature              | Conceptual | Logical | Physical
Entity Names         |    Yes     |   Yes   |
Entity Relationships |    Yes     |   Yes   |
Attributes           |            |   Yes   |
Primary Keys         |            |   Yes   |
Foreign Keys         |            |   Yes   |
Table Names          |            |         |   Yes
Column Names         |            |         |   Yes
Column Data Types    |            |         |   Yes
ETL process
For each step of the ETL process, data integrity checks
should be put in place to ensure that source data is the
same as the data in the destination. Most common checks
include record counts or record sums.
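A record-count and record-sum check between a source and a destination table might look like the following sketch; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (amount REAL);
    CREATE TABLE dw_sales  (amount REAL);
    INSERT INTO src_sales VALUES (100.0), (250.5), (75.25);
    INSERT INTO dw_sales  VALUES (100.0), (250.5), (75.25);
""")

def integrity_check(conn, source, dest, column):
    """Compare record counts and column sums between two tables."""
    src_count, src_sum = conn.execute(f"SELECT COUNT(*), SUM({column}) FROM {source}").fetchone()
    dst_count, dst_sum = conn.execute(f"SELECT COUNT(*), SUM({column}) FROM {dest}").fetchone()
    return src_count == dst_count and src_sum == dst_sum

ok = integrity_check(conn, "src_sales", "dw_sales", "amount")
print("integrity check passed:", ok)
```

In practice such a check would run after each ETL step, with mismatches logged and the load halted or flagged for investigation.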
Access level
We need to ensure that data is not altered by any
unauthorized means either during the ETL process or in the
data warehouse. To do this, there needs to be safeguards
against unauthorized access to data (including physical
access to the servers), as well as logging of all data access
history. Data integrity can only be ensured if there is no unauthorized access to the data.
Next Page: What Is OLAP
Data Warehousing > Concepts > What Is OLAP
OLAP stands for On-Line Analytical Processing. The first
attempt to provide a definition to OLAP was by Dr. Codd,
who proposed 12 rules for OLAP. Later, it was discovered
that this particular white paper was sponsored by one of the
OLAP tool vendors, thus causing it to lose objectivity. The
OLAP Report has proposed the FASMI test, Fast Analysis
of Shared Multidimensional Information. For a more detailed
description of both Dr. Codd's rules and the FASMI test,
please visit The OLAP Report.
For people on the business side, the key feature out of the
above list is "Multidimensional." In other words, the ability to
analyze metrics in different dimensions such as time,
geography, gender, product, etc. For example, sales for the
company are up. What region is most responsible for this
increase? Which store in this region is most responsible for
the increase? What particular product category or categories
contributed the most to the increase? Answering these types
of questions in order means that you are performing an
OLAP analysis.
Depending on the underlying technology used, OLAP can be
broadly divided into two different camps: MOLAP and ROLAP.
MOLAP
In MOLAP, data is stored in a multidimensional cube.
Advantages:
- Excellent query performance, since cubes are built for fast data retrieval.
- Complex calculations can be pre-generated when the cube is created.
Disadvantages:
- Limited in the amount of data it can handle, because all calculations are performed when the cube is built.
- Often requires additional investment in cube technology.
ROLAP
In ROLAP, the analysis is performed by manipulating the data stored in the relational database, essentially by generating SQL queries.
Advantages:
- Can handle large amounts of data, limited only by the underlying relational database.
- Can leverage functionality inherent in the relational database.
Disadvantages:
- Performance can be slow, since each ROLAP report is essentially one or more SQL queries against the relational database.
- Limited by SQL functionality when it comes to complex calculations.
HOLAP
HOLAP technologies attempt to combine the advantages of
MOLAP and ROLAP. For summary-type information, HOLAP
leverages cube technology for faster performance. When
detail information is needed, HOLAP can "drill through" from
the cube into the underlying relational data.
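The region / store drill-down questions raised earlier translate directly into grouped SQL, which is essentially what a ROLAP engine generates behind the scenes. A minimal sketch, using an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, store TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('East', 'Store 1', 'Shoes', 100),
        ('East', 'Store 2', 'Hats',  300),
        ('West', 'Store 3', 'Shoes',  50);
""")

# "What region is most responsible?" -- aggregate along the region dimension.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY SUM(amount) DESC"
).fetchall()

# "Which store in this region?" -- drill down one level within the top region.
top_region = by_region[0][0]
by_store = conn.execute(
    "SELECT store, SUM(amount) FROM sales WHERE region = ? "
    "GROUP BY store ORDER BY SUM(amount) DESC",
    (top_region,),
).fetchall()

print(by_region)
print(by_store)
```

A MOLAP engine would answer the same questions from a pre-built cube instead of issuing SQL at query time.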
Next Page: Bill Inmon vs. Ralph Kimball
Requirement Gathering
Physical Environment Setup
Data Modeling
ETL
OLAP Cube Design
Front End Development
Report Development
Performance Tuning
Query Optimization
Quality Assurance
Rolling out to Production
Production Maintenance
Incremental Enhancements
Task Description
The first thing that the project team should engage in is
gathering requirements from end users. Because end users
are typically not familiar with the data warehousing process
or concept, the help of the business sponsor is essential.
Requirement gathering can happen as one-to-one meetings
or as Joint Application Development (JAD) sessions, where
multiple people are talking about the project scope in the
same meeting.
The primary goal of this phase is to identify what constitutes a success for this particular phase of the data warehouse
project. In particular, end user reporting / analysis
requirements are identified, and the project team will spend
the remaining period of time trying to satisfy these
requirements.
Associated with the identification of user requirements is a
more concrete definition of other details such as hardware
sizing information, training requirements, data source
identification, and most importantly, a concrete project plan
indicating the finishing date of the data warehousing project.
Based on the information gathered above, a disaster
recovery plan needs to be developed so that the data
warehousing system can recover from accidents that disable
the system. Without an effective backup and restore
strategy, the system will only last until the first major
disaster, and, as many data warehousing DBA's will attest,
this can happen very quickly after the project goes live.
Time Requirement
2 - 8 weeks.
Deliverables
Possible Pitfalls
This phase often turns out to be the trickiest phase of the
data warehousing implementation. The reason is that
because data warehousing by definition includes data from
multiple sources spanning many different departments
within the enterprise, there are often political battles that
center on the willingness of information sharing. Even
though a successful data warehouse benefits the enterprise,
there are occasions where departments may not feel the
same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground or cannot proceed in the direction originally defined.
When this happens, it would be ideal to have a strong
business sponsor. If the sponsor is at the CXO level, she can
often exert enough influence to make sure everyone
cooperates.
Next Section: Physical Environment Setup
Task Description
Once the requirements are somewhat clear, it is necessary
to set up the physical servers and databases. At a minimum,
it is necessary to set up a development environment and a
production environment. There are also many data
warehousing projects where there are three environments:
Development, Testing, and Production.
Time Requirement
Getting the servers and databases ready should take less
than 1 week.
Deliverables
Possible Pitfalls
To save on capital, often data warehousing teams will decide
to use only a single database and a single server for the
different environments. Environment separation is achieved
by either a directory structure or setting up distinct
instances of the database. This is problematic for the
following reasons:
Task Description
This is a very important step in the data warehousing
project. Indeed, it is fair to say that the foundation of the
data warehousing system is the data model. A good data
model will allow the data warehousing system to grow
easily, as well as allowing for good performance.
In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into
the physical data model. The detailed steps can be found in
the Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification
of data sources. Sometimes this step is deferred until the
ETL step. However, my feeling is that it is better to find out
where the data exists, or, better yet, whether they even
exist anywhere in the enterprise at all. Should the data not
be available, this is a good time to raise the alarm. If this were delayed until the ETL phase, rectifying it would become a much tougher and more complex process.
Time Requirement
2 - 6 weeks.
Deliverables
Possible Pitfalls
It is essential to have a subject-matter expert as part of the
data modeling team. This person can be an outside
consultant or can be someone in-house who has extensive
experience in the industry. Without this person, it becomes
difficult to get a definitive answer on many of the questions,
and the entire project gets dragged out.
Next Section: ETL
Data Warehousing > Data Warehouse Design > ETL
Task Description
The ETL (Extraction, Transformation, Loading) process
typically takes the longest to develop, and this can easily
take up to 50% of the data warehouse implementation cycle
or longer. The reason for this is that it takes time to get the
source data, understand the necessary columns, understand
the business rules, and understand the logical and physical
data models.
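The three steps can be illustrated with a deliberately minimal sketch; the source records and the name-normalization business rule below are invented for the example:

```python
import sqlite3

# Extract: source records, e.g. parsed from a flat file or a source system.
source_rows = [
    ("2003-01-15", "christina", "IL", "100.50"),
    ("2003-01-16", "bob", "CA", "200.00"),
]

# Transform: apply business rules -- normalize names, cast amounts to numbers.
transformed = [
    (date, name.title(), state, float(amount))
    for date, name, state, amount in source_rows
]

# Load: insert into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, customer TEXT, state TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", transformed)

(count,) = conn.execute("SELECT COUNT(*) FROM sales").fetchone()
print("loaded rows:", count)
```

Real ETL jobs differ mainly in scale: many sources, many business rules, and scheduling and error handling around each step.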
Time Requirement
1 - 6 weeks.
Deliverables
Possible Pitfalls
There is a tendency to give this particular phase too little
development time. This can prove suicidal to the project
because end users will usually tolerate less formatting,
longer time to run reports, less functionality (slicing and
dicing), or fewer delivered reports; one thing that they will
not tolerate is wrong information.
A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality.
Task Description
Usually the design of the OLAP cube can be derived from
the Requirement Gathering phase. More often than not,
however, users have some idea on what they want, but it is
difficult for them to specify the exact report / analysis they
want to see. When this is the case, it is usually a good idea
to include enough information so that they feel like they
have gained something through the data warehouse, but not
so much that it stretches the data warehouse scope by a
mile. Remember that data warehousing is an iterative
process - no one can ever meet all the requirements all at
once.
Time Requirement
1 - 2 weeks.
Deliverables
Possible Pitfalls
Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to make sure the cube refresh process is as efficient as possible.
Task Description
Regardless of the strength of the OLAP engine and the
integrity of the data, if the users cannot visualize the
reports, the data warehouse brings zero value to them.
Hence front end development is an important part of a data
warehousing initiative.
So what are the things to look out for in selecting a front-end deployment methodology? The most important is that the reports should be delivered over the web, so that the only thing the user needs is a standard browser. These days it is no longer desirable or feasible to have the IT department do program installations on end users' desktops just so that they can view reports. So, whatever strategy one pursues, treat the ability to deliver reports over the web as a must.
The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-end products such as Actuate. In addition, many OLAP vendors offer a front-end of their own. When choosing a vendor tool, make sure it can be easily customized to suit the enterprise, especially with respect to possible changes to the enterprise's reporting requirements.
Possible changes include not just the difference in report
layout and report content, but also include possible changes
in the back-end structure. For example, if the enterprise
decides to change from Solaris/Oracle to Windows 2000/SQL
Server, will the front-end tool be flexible enough to adjust to
the changes without much modification?
Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published at a regular interval? Are there very specific formatting requirements?
Time Requirement
1 - 4 weeks.
Deliverables
Front End Deployment Documentation
Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front-end infrastructure is. All they care about is receiving their information in a timely manner and in the way they specified.
Next Section: Report Development
Data Warehousing > Data Warehouse Design > Report Development
Task Description
Report specification typically comes directly from the
requirements phase. To the end user, the only direct
touchpoint he or she has with the data warehousing system
is the reports they see. So, report development, although
not as time consuming as some of the other steps such
as ETL and data modeling, nevertheless plays a very
important role in determining the success of the data
warehousing project.
One would think that report development is an easy task.
How hard can it be to just follow instructions to build the
report? Unfortunately, this is not true. There are several
points the data warehousing team needs to pay attention to
before releasing the report.
User customization: Do users need to be able to select their own metrics? And how do users need to be able to filter the information? The report development process needs to take these customization requirements into account.
Time Requirement
1 - 2 weeks.
Deliverables
Possible Pitfalls
Make sure the exact definitions of the report are
communicated to the users. Otherwise, user interpretation
of the report can be erroneous.
Next Section: Performance Tuning
Data Warehousing > Data Warehouse Design > Performance Tuning
Task Description
There are three major areas where a data warehousing
system can use a little performance tuning:
ETL - The data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job), and data warehousing-related batch jobs are typically of lower priority; this means that the window for data loading is not very long. A data warehousing system whose ETL process finishes just barely on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports, or reports that run directly against the RDBMS, often exceed this time limit, so it is ideal for the data warehousing team to invest some time in tuning the queries, especially the most popular ones. We present a number of query optimization ideas.
Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than the query itself, such as network traffic or the load on the report server.
Time Requirement
3 - 5 days.
Deliverables
Possible Pitfalls
Make sure the development environment mimics the
production environment as much as possible - Performance
enhancements seen on less powerful machines sometimes
do not materialize on the larger, production-level machines.
Next Section: Query Optimization
Data Warehousing > Data Warehouse Design > Query
Optimization
For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources, making the server and application run slowly, but may also lead to table locking and data-corruption issues. So, query optimization becomes an important task.
First, we offer some guiding principles for query
optimization:
1. Understand how your database is executing your
query
Nowadays all databases have their own query optimizer and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step in query optimization is understanding what the database is doing. Different databases have different commands for this: in MySQL, one can prefix the query with the EXPLAIN keyword to see the query plan, while in Oracle, EXPLAIN PLAN FOR serves a similar purpose.
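As one concrete illustration, SQLite exposes its plan through EXPLAIN QUERY PLAN; the table and index names in this sketch are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Ask SQLite how it would execute the query; other databases have
# analogous commands (EXPLAIN in MySQL, EXPLAIN PLAN FOR in Oracle).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()
for row in plan:
    print(row)
# The plan should show a search using idx_orders_customer rather than a full table scan.
```

Reading the plan before and after adding an index is often the quickest way to confirm that a tuning change actually took effect.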
Use Index
Using an index is the first strategy one should use to
speed up a query. In fact, this strategy is so important
that index optimization is also discussed.
Aggregate Table
Pre-populating tables at higher levels of aggregation so that less data needs to be read.
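The aggregate-table idea can be sketched with a hypothetical daily sales table rolled up to a monthly summary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (day TEXT, amount REAL);
    INSERT INTO daily_sales VALUES
        ('2003-01-01', 10), ('2003-01-02', 20), ('2003-02-01', 5);

    -- Pre-populate a monthly aggregate so monthly reports scan far fewer rows.
    CREATE TABLE monthly_sales AS
        SELECT substr(day, 1, 7) AS month, SUM(amount) AS amount
        FROM daily_sales
        GROUP BY substr(day, 1, 7);
""")

rows = conn.execute("SELECT month, amount FROM monthly_sales ORDER BY month").fetchall()
print(rows)
```

A monthly report now reads one row per month from monthly_sales instead of scanning every daily record; the trade-off is that the aggregate must be refreshed whenever the detail table is loaded.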
Vertical Partitioning
Partition the table by columns. This strategy decreases
the amount of data a SQL query needs to process.
Horizontal Partitioning
Partition the table by data value, most often time. This strategy lets a SQL query read only the relevant partitions.
Task Description
Once the development team declares that everything is
ready for further testing, the QA team takes over. The QA
team is always from the client. Usually the QA team
members will know little about data warehousing, and some
of them may even resent the need to have to learn another
tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it was necessary to go through the client QA process before the project could go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).
Time Requirement
1 - 4 weeks.
Deliverables
QA Test Plan
Possible Pitfalls
As mentioned above, usually the QA team members know
little about data warehousing, and some of them may even
resent the need to have to learn another tool or tools. Make
sure the QA team members get enough education so that
they can complete the testing themselves.
Next Section: Rollout to Production
Task Description
Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but that is usually not the case. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going into production sometimes as easy as sending out a URL via email.
Time Requirement
1 - 3 days.
Deliverables
Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating than spending several months developing and QA'ing the data warehousing system, only to see little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, some user training is always needed.
Task Description
Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backup and crisis management become important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse, for return-on-investment calculations and future enhancement considerations.
Time Requirement
Ongoing.
Deliverables
Consistent availability of the data warehousing system to
the end users.
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation be left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.
Task Description
Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about new data warehousing phases, but simply small changes that follow the business itself. For example, the geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10.
Deliverables
Possible Pitfalls
Because a lot of times the changes are simple to make, it is
very tempting to just go ahead and make the change in
production. This is a definite no-no. Many unexpected
problems will pop up if this is done. I would very strongly
recommend that the typical cycle of development --> QA -->
Production be followed, regardless of how simple the change
may seem.
Next Section: Data Warehousing Trends
A dimension table consists of the attributes about the facts. Dimensions store the textual descriptions of the business. Without the dimensions, we cannot measure the facts. The different types of dimension tables are explained in detail below.
Conformed Dimension:
Conformed dimensions mean the exact same thing with every possible fact table to which
they are joined.
Eg: The date dimension table connected to the sales facts is identical to the date
dimension connected to the inventory facts.
Junk Dimension:
A junk dimension is a collection of random transactional codes, flags, and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
Eg: Assume that we have a gender dimension and a marital status dimension. In the fact table we would need to maintain two keys referring to these dimensions. Instead, we can create a junk dimension that has all the combinations of gender and marital status (cross-join the gender and marital status tables to create the junk table). Now we need to maintain only one key in the fact table.
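The cross-join construction described above can be sketched as follows; the column values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gender (gender TEXT);
    CREATE TABLE marital_status (marital_status TEXT);
    INSERT INTO gender VALUES ('M'), ('F');
    INSERT INTO marital_status VALUES ('Single'), ('Married');

    -- The junk dimension: one row per combination, with one surrogate key
    -- that the fact table references instead of two separate keys.
    CREATE TABLE junk_dim (junk_key INTEGER PRIMARY KEY, gender TEXT, marital_status TEXT);
    INSERT INTO junk_dim (gender, marital_status)
        SELECT g.gender, m.marital_status
        FROM gender g CROSS JOIN marital_status m;
""")

rows = conn.execute(
    "SELECT junk_key, gender, marital_status FROM junk_dim ORDER BY junk_key"
).fetchall()
print(rows)
```

The fact table then carries a single junk_key column, and the two original attributes are recovered with one join.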
Degenerate Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't
have its own dimension table.
Eg: A transactional code in a fact table.
Role-playing dimension:
Dimensions which are often used for multiple purposes within the same database are
called role-playing dimensions. For example, a date dimension can be used for "date of sale", as well as "date of delivery" or "date of hire".
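Role-playing is commonly implemented by joining the same dimension table multiple times under different aliases; the schema below is a hypothetical illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT);
    INSERT INTO date_dim VALUES (1, '2003-01-10'), (2, '2003-01-15');

    CREATE TABLE sales_fact (sale_date_key INTEGER, delivery_date_key INTEGER, amount REAL);
    INSERT INTO sales_fact VALUES (1, 2, 99.99);
""")

# One physical date dimension plays two roles via table aliases.
row = conn.execute("""
    SELECT sold.full_date, delivered.full_date, f.amount
    FROM sales_fact f
    JOIN date_dim sold      ON f.sale_date_key     = sold.date_key
    JOIN date_dim delivered ON f.delivery_date_key = delivered.date_key
""").fetchone()
print(row)
```

Only one date dimension table is maintained, yet the query sees it as both a "date of sale" and a "date of delivery" dimension.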
A fact table is one that consists of the measurements, metrics, or facts of a business process. These measurable facts are used to gauge business value and to forecast future business. The different types of facts are explained in detail below.
Additive:
Additive facts are facts that can be summed up through all of the dimensions in the fact
table. A sales fact is a good example for additive fact.
Semi-Additive:
Semi-additive facts are facts that can be summed up for some of the dimensions in the
fact table, but not the others.
Eg: A daily balances fact can be summed up across the customer dimension but not through the time dimension.
Non-Additive:
Non-additive facts are facts that cannot be summed up for any of the dimensions present
in the fact table.
Eg: Facts that contain percentages or ratios.
Factless Fact Table:
In the real world, it is possible to have a fact table that contains no measures or facts.
These tables are called "Factless Fact tables".
Eg: A fact table which has only a product key and a date key is a factless fact table. There are no measures in this table, but you can still get the number of products sold over a period of time.
Fact tables that contain aggregated facts are often called summary tables.
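The additive versus semi-additive distinction can be illustrated with the daily balances example: summing balances across customers on one day is meaningful, while summing one customer's balances across days is not (an average is typically used instead). A sketch, with invented figures:

```python
# Daily balance facts: (date, customer, balance) -- a semi-additive fact.
balances = [
    ("2003-01-01", "Christina", 100.0),
    ("2003-01-01", "Bob",       200.0),
    ("2003-01-02", "Christina", 150.0),
    ("2003-01-02", "Bob",       250.0),
]

# Summing across the customer dimension (for a single day) is meaningful:
total_jan1 = sum(b for d, _, b in balances if d == "2003-01-01")

# Summing across the time dimension is NOT meaningful; use an average instead:
christina = [b for _, c, b in balances if c == "Christina"]
avg_balance = sum(christina) / len(christina)

print(total_jan1, avg_balance)
```

A fully additive fact, such as a sales amount, could be summed along both dimensions without producing a misleading number.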