According to Ralph Kimball, a data warehouse is a relational database that is designed for querying and analyzing the business, not for transaction processing.
It usually contains historical data derived from transactional data (different source systems).
According to W. H. Inmon, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data used to support the strategic decision-making process.
1. Subject Oriented.
2. Integrated.
3. Nonvolatile.
4. Time Variant.
Subject Oriented: A data warehouse is designed around subjects so that the business can be analyzed by top-level management, middle-level management, or an individual department in an enterprise.
For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integrated: A data warehouse is an integrated database which contains the business information collected from various operational data sources. The same data is often represented differently in each source application, so it must be standardized as it is brought into the warehouse. For example:
1. Encoding: Appl. A uses M/F, Appl. B uses 1/0, Appl. C uses X/Y; the warehouse stores M/F.
2. Unit of attributes: Appl. A measures pipeline in cm, Appl. B in inches, Appl. C in mcf; the warehouse stores cm.
3. Physical attributes: Appl. A stores balance as dec(13,2), Appl. B as PIC 9(9)V99, Appl. C as float; the warehouse stores dec(13,2).
4. Naming conventions: Appl. A uses bal-on-hand, Appl. B uses current_balance, Appl. C uses balance; the warehouse uses balance.
5. Data consistency: Appl. A stores date as Julian, Appl. B as yymmdd, Appl. C as absolute; the warehouse stores Julian dates.
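This kind of standardization can be expressed in SQL during extraction. The following is a minimal sketch for the gender-encoding example above; all table and column names (appl_a_customers, etc.) are hypothetical:

-- Standardize the gender encoding from the three applications into M/F.
SELECT cust_id, gender AS gender_code               -- Appl. A already uses M/F
FROM appl_a_customers
UNION ALL
SELECT cust_id, CASE gender WHEN '1' THEN 'M' ELSE 'F' END
FROM appl_b_customers                               -- Appl. B uses 1/0
UNION ALL
SELECT cust_id, CASE gender WHEN 'X' THEN 'M' ELSE 'F' END
FROM appl_c_customers;                              -- Appl. C uses X/Y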
Time Variant: A data warehouse is a time-variant database which allows you to analyze and compare the business across various time periods (year, quarter, month, week, day) because it maintains historical data.
Nonvolatile: A data warehouse is a non-volatile database: once data has entered the data warehouse it is not changed, and the warehouse does not reflect subsequent changes in the operational database. Hence the data is static.
The main reasons for building a data warehouse are:
1. To store large volumes of historical detail data from mission-critical applications
2. Better business intelligence for end-users
3. Data Security - To prevent unauthorized access to sensitive data
4. Replacement of older, less-responsive decision support systems
5. Reduction in time to locate, access, and analyze information
Evolution:
1. 60’s: Batch reports
   1. Hard to find and analyze information
   2. Inflexible and expensive; reprogram for every new request
2. 70’s: Terminal-based DSS and EIS (executive information systems)
   1. Still inflexible, not integrated with desktop tools
3. 80’s: Desktop data access and analysis tools
   1. Query tools, spreadsheets, GUIs
   2. Easier to use, but only access operational databases
4. 90’s: Data warehousing with integrated OLAP engines and tools
In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP
systems help to analyze it.
Federated data warehouse: A federated DW is an active union and cooperation across separate DWs.
Top-Down Approach:
This approach was developed by W. H. Inmon. According to him, an enterprise data warehouse is developed first; subject-oriented databases, called data marts, are then derived from that enterprise data warehouse.
Bottom-Up Approach:
This approach was developed by Ralph Kimball. According to him, the data marts are developed first to support the business needs of middle-level management; all the data marts are then integrated into an enterprise data warehouse.
Top-Down vs Bottom-Up:
1. Top-down needs more planning and design initially; bottom-up can start without waiting for a global infrastructure.
2. Top-down involves people from different work-groups and departments; bottom-up is built incrementally.
3. In top-down, data marts may be built later from the global DW; in bottom-up, data marts can be built before or in parallel with the global DW.
4. Top-down requires the overall data model to be decided up-front; bottom-up has less complexity in design.
5. Top-down is a high-cost, lengthy, time-consuming process; bottom-up has a low cost of hardware and other resources.
Data warehouse architecture: Data Sources → ETL Software → Data Stores → Data Analysis Tools and Applications → Users.
Source System: A system which provides the data to build a DW is known as a source system. Source systems are of two types:
Internal sources
External sources
Data Acquisition: It is the process of extracting the relevant business information from the source systems, transforming the data into the required business format, and loading it into the data warehouse. Data acquisition consists of the following processes:
1. Extraction
2. Transformation
3. Loading
Extraction: The first part of an ETL process involves extracting the data from different source systems such as operational sources, XML files, flat files, COBOL files, SAP, PeopleSoft, Sybase, etc. Extraction can be done in two ways (a sketch follows the list):
1. Full Extraction
2. Periodic/Incremental Extraction
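A minimal SQL sketch of incremental extraction, assuming the source table carries a last-modified timestamp and a control table records the previous extraction time; all names (src_orders, etl_control, etc.) are hypothetical:

-- Pull only the rows changed since the last run (incremental extraction).
INSERT INTO stg_orders (order_id, customer_id, amount, modified_at)
SELECT o.order_id, o.customer_id, o.amount, o.modified_at
FROM src_orders o
WHERE o.modified_at > (SELECT last_extract_ts
                       FROM etl_control
                       WHERE table_name = 'src_orders');

-- Record this run so the next extraction starts from here.
UPDATE etl_control
SET last_extract_ts = CURRENT_TIMESTAMP
WHERE table_name = 'src_orders';

A full extraction would simply omit the WHERE condition and reload the entire source table.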
Transformation: It is the process of transforming the data into the required business format. The following data transformation activities take place in the staging area:
1. Data Merging.
2. Data Cleansing.
3. Data Scrubbing.
4. Data Aggregation.
Data Merging: It is a process of integrating the data from multiple input pipelines of similar or dissimilar structure into a single output pipeline.
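For pipelines of similar structure, merging reduces to a UNION ALL in SQL; a sketch with hypothetical regional sales tables:

-- Merge two similarly structured input pipelines into one output pipeline.
INSERT INTO stg_sales_all (store_id, product_id, sale_date, amount)
SELECT store_id, product_id, sale_date, amount FROM stg_sales_east
UNION ALL
SELECT store_id, product_id, sale_date, amount FROM stg_sales_west;

Dissimilar structures would first need column mapping and type conversion before the union.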
Data Cleansing: It is a process of identifying and correcting inconsistencies and inaccuracies. Tools used for data cleansing include:
1. Trillium: Used for cleansing the name and address data. The software is able to identify and match
households, business contacts and other relationships to eliminate duplicates in large databases using
fuzzy matching techniques.
2. First Logic
Data Scrubbing: It is a process of deriving new data definitions from existing source data definitions.
Data Aggregation: It is a process where multiple detailed values are summarized into a single summary value.
Ex: SUM, AVERAGE, MAX, MIN, etc.
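A sketch of aggregation in SQL over the hypothetical staging table from the merging example, rolling detailed rows up to one summary row per product and month:

-- Summarize many detailed values into single summary values per group.
SELECT product_id,
       EXTRACT(YEAR FROM sale_date)  AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount,
       MAX(amount) AS max_amount,
       MIN(amount) AS min_amount
FROM stg_sales_all
GROUP BY product_id,
         EXTRACT(YEAR FROM sale_date),
         EXTRACT(MONTH FROM sale_date);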
Loading: It is the process of inserting the data into the data warehouse. Loading is of two types (a sketch follows the list):
1. Initial load
2. Incremental load
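An incremental load is often implemented with an ANSI MERGE (upsert) statement; a minimal sketch with hypothetical customer tables:

-- Update existing warehouse rows, insert new ones (incremental load).
-- An initial load would instead be a plain INSERT ... SELECT of all rows.
MERGE INTO dw_customer d
USING stg_customer s
ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN
    UPDATE SET d.customer_name = s.customer_name,
               d.city = s.city
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, city)
    VALUES (s.customer_id, s.customer_name, s.city);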
ETL Products:
Code-Based ETL Tools: In these tools, the data acquisition processes are developed with the help of programming languages, for example:
1. SAS ACCESS
2. SAS BASE
3. TERADATA ETL TOOLS
ETL Approach:
Landing Area – This is the area where the source files and tables will reside.
Staging – This is where all the staging tables will reside. While loading into the staging tables, data validation will be done; if the validation is successful, the data will be loaded into the staging table and the control table will be updated simultaneously. If the validation fails, the invalid data details will be captured in the Error Table (a sketch of this routing appears after the layer descriptions).
Pre Load – This is a file layer. After all transformations, business rule applications and surrogate key assignment are complete, the data will be loaded into individual Insert and Update text files which form a ready-to-load snapshot of the Data Mart table. Once files for all the target tables are ready, these files will be bulk loaded into the Data Mart tables. Data will once again be validated against the defined business rules and key relationships; any failure will be captured in the Error Table.
Data Mart – This is the actual Data Mart. Data will be inserted/updated from respective Pre Load files
into the Data Mart Target tables.
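The staging validation described above can be sketched as a pair of routing statements; the tables, columns and the validation rule are hypothetical:

-- Valid rows go to the staging table ...
INSERT INTO stg_orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount
FROM land_orders
WHERE order_id IS NOT NULL AND amount >= 0;

-- ... invalid rows are captured in the Error Table.
INSERT INTO err_orders (order_id, customer_id, amount, err_reason)
SELECT order_id, customer_id, amount, 'failed basic validation'
FROM land_orders
WHERE order_id IS NULL OR amount < 0;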
Operational Data Store (ODS): An operational data store (or "ODS") is a database designed to
integrate data from multiple sources to make analysis and reporting easier.
Definition: The ODS is defined to be a structure that is:
1. Integrated
2. Subject oriented
3. Volatile, where update can be done
4. Current valued, containing data that is a day or perhaps a month old
5. Contains detailed data only.
The goal is to obtain a “system of record” that contains the best data (complete, up to date, accurate) that exists in a legacy environment, as a source of information.
Note: An "ODS" is not a replacement or substitute for an enterprise data warehouse but in turn could
become a source.
Characteristics of an ODS:
1. Detailed data: records of business events (e.g. order capture)
2. Data from heterogeneous sources
3. Does not store summary data
4. Contains current data
Data Mart:
A data mart is a decentralized subset of data, found either in a data warehouse or as a standalone subset, designed to support the unique business requirements of a specific decision-support system.
A data mart is a subject-oriented database which supports the business needs of middle management, such as individual departments.
Dependent data marts: In the top-down approach, data mart development depends on the enterprise data warehouse; such data marts are called dependent data marts. Dependent data marts are fed directly by the DW, sometimes supplemented with other feeds such as external data.
Independent data marts: In the bottom-up approach, data mart development is independent of the enterprise data warehouse; such data marts are called independent data marts. Independent data marts are fed directly by external sources and do not use the DW.
Embedded data marts are marts that are stored within the central DW. They can be stored relationally
as files or cubes.
Advantages of data marts:
1. Low cost
2. Contain less information than the warehouse.
3. Easily understood and navigated than an enterprise data warehouse.
4. Within the range of divisional or departmental budgets
Schema: A schema is a collection of objects including tables, views, indexes and synonyms.
Star Schema: A star schema is a logical database design which contains a centrally located fact table surrounded by dimension tables. Because the database design looks like a star, it is called a star schema.
The star schema is also called a star-join schema, data cube, or multi-dimensional schema.
A fact table contains facts, and facts are numeric. Not every numeric value is a fact; only numeric values that serve as key performance indicators are facts. Facts are business measures which are used to evaluate the performance of an enterprise. A fact table contains the fact information at the lowest level of granularity. The level at which fact information is stored in a fact table is known as the grain of the fact (also called fact granularity or fact event level).
A dimension is descriptive data about the major aspects of the business. Dimensions are stored in dimension tables, which provide the answers to the four basic business questions (who, what, when, where). A dimension table contains de-normalized business information. E.g. the "Store" dimension can have attributes such as the street and block number, the city, the region and the country where it is located, in addition to its name.
If a star schema contains more than one fact table, it is called a "Complex Star Schema".
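A minimal star schema sketch in SQL, with illustrative names: a sales fact table at the grain of one row per product per store per day, surrounded by three dimension tables joined on surrogate keys:

CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,   -- surrogate key
    store_name VARCHAR(50),
    street     VARCHAR(50),
    city       VARCHAR(30),
    region     VARCHAR(30),
    country    VARCHAR(30)
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(50),
    category     VARCHAR(30)
);

CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,
    full_date  DATE,
    year_no    INTEGER,
    quarter_no INTEGER,
    month_no   INTEGER
);

CREATE TABLE fact_sales (
    store_key   INTEGER REFERENCES dim_store (store_key),
    product_key INTEGER REFERENCES dim_product (product_key),
    date_key    INTEGER REFERENCES dim_date (date_key),
    qty_sold    INTEGER,          -- facts: numeric business measures
    sales_amt   DECIMAL(13,2)
);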
Dimension tables are ENTRY POINTS into the fact table. Typically:
1. The number of rows selected and processed from the fact table depends on the conditions ("WHERE" clauses) the user applies to the selected dimensional attributes.
2. Dimension tables are DE-NORMALIZED in order to reduce the number of joins in the resulting queries.
3. Dimension table attributes are generally STATIC, DESCRIPTIVE fields describing aspects of the dimension.
4. Dimension tables are designed to hold INFREQUENT CHANGES to attribute values over time, using SCD concepts.
5. Every column in the dimension table is typically either the primary key or a dimensional attribute.
6. Every non-key column in the dimension table is typically used in the GROUP BY clause of a SQL query (see the query sketch below).
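A typical star-join query against the schema sketched above illustrates points 1 and 6: the WHERE conditions on dimensional attributes restrict which fact rows are processed, and the selected non-key dimension columns appear in the GROUP BY clause:

-- Total sales by year and region for selected stores.
SELECT d.year_no, s.region, SUM(f.sales_amt) AS total_sales
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
JOIN dim_date  d ON f.date_key  = d.date_key
WHERE s.country = 'USA'               -- dimensional attribute filters
  AND d.year_no IN (2023, 2024)
GROUP BY d.year_no, s.region;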
Conformed dimension: If a dimension table is shared by multiple fact tables, that dimension is known as a conformed dimension.
Junk dimension: Junk dimensions are dimensions that contain miscellaneous data such as flags, gender and text values, which are not useful to generate reports on their own.
Dirty dimension: In this dimension table, records are maintained more than once, differing only in non-key attributes.
Slowly changing dimension: If the data values in a column or row change slowly over a period of time, that dimension table is called a slowly changing dimension.
Type-I SCD: keeps only the most recent data in the target.
Type-II SCD: keeps full history in the target; for every update it adds a new record in the target.
Type-III SCD: keeps the current and previous information in the target (partial history).
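A minimal sketch of a Type-II change for a customer who moved city, using illustrative names and an effective/expiry-date pattern (one of several common Type-II designs):

-- Step 1: expire the current version of the row (natural key 1001).
UPDATE dim_customer
SET expiry_date  = CURRENT_DATE,
    current_flag = 'N'
WHERE customer_id = 1001
  AND current_flag = 'Y';

-- Step 2: insert a new version with a new surrogate key.
INSERT INTO dim_customer
    (customer_key, customer_id, city, effective_date, expiry_date, current_flag)
VALUES
    (987654, 1001, 'Boston', CURRENT_DATE, DATE '9999-12-31', 'Y');
-- 987654 stands in for the next system-generated surrogate key value.

A Type-I change would instead be a single UPDATE overwriting the city in place; Type-III would copy the old city into a previous_city column before updating it.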
Surrogate key:
1. A surrogate key is an artificial identifier for an entity. Surrogate key values are generated by the system sequentially (like the Identity property in SQL Server and Sequence in Oracle). They do not describe anything.
2. Joins between fact and dimension tables should be based on surrogate keys.
3. Surrogate keys should not be composed of natural keys glued together.
4. Users should not obtain any information by looking at these keys.
5. These keys should be simple integers.
6. Using surrogate keys is faster.
7. Surrogate keys handle slowly changing dimensions well (see the sketch below).
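A sketch of system-generated surrogate keys using an Oracle-style sequence (SQL Server would use an IDENTITY column instead); all names are illustrative:

CREATE SEQUENCE dim_customer_seq START WITH 1 INCREMENT BY 1;

-- The surrogate key is a meaningless sequential integer; the natural key
-- (customer_id) is kept only as an ordinary attribute.
INSERT INTO dim_customer
    (customer_key, customer_id, city, effective_date, expiry_date, current_flag)
SELECT dim_customer_seq.NEXTVAL, s.customer_id, s.city,
       CURRENT_DATE, DATE '9999-12-31', 'Y'
FROM stg_customer s;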
Degenerate dimension: A degenerate dimension is data that is dimensional in nature but stored in the fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the fact table. Therefore, this is a degenerate dimension, and Order Number and Order Line Number are stored in the fact table.
Fast Changing Dimension: A fast changing dimension is a dimension whose attribute or attributes for a
record (row) change rapidly over time.
Facts can be detailed-level facts or summarized facts. Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process. The term FACT represents a single business measure, e.g. Sales or Qty Sold.
Fact tables express MANY-TO-MANY RELATIONSHIPS between dimensions in dimensional models:
1. One product can be sold in many stores, while a single store typically sells many different products at the same time.
2. The same customer can visit many stores, and a single store typically has many customers.
The fact table is typically the MOST NORMALIZED TABLE in a dimensional model. Fact tables can contain HUGE DATA VOLUMES running into millions of rows. All facts within the same fact table must be at the SAME GRAIN. Every foreign key in a fact table is usually a DIMENSION TABLE PRIMARY KEY, and every column in a fact table is either a foreign key to a dimension table primary key or a fact.
Types of Facts:
1. A fact may be a measure, a metric or a dollar value. Measures and metrics are non-additive facts.
2. A dollar value is an additive fact: if we want to find the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with the total amount.
3. A non-additive fact, e.g. the heights of citizens by geographical location: when we roll up city-level data to state-level data we should not add the heights of the citizens; rather, we may want to use the column to derive a count, as sketched below.
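A sketch of the height example in SQL (hypothetical table with one row per citizen): when rolling city rows up to state level, the dollar value can be summed, while height is only used to derive a count or an average:

SELECT state,
       SUM(income_dollars) AS total_income,      -- additive fact: safe to add
       COUNT(height_cm)    AS citizens_measured, -- derive a count instead of
       AVG(height_cm)      AS avg_height_cm      -- ever using SUM(height_cm)
FROM citizen_facts
GROUP BY state;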
Factless fact table: A fact table without facts (measures) is known as a factless fact table.
1. The first type of factless fact table is called an event tracking table; it records an event, e.g. the attendance of a student (see the sketch after this list). Many event tracking tables in the dimensional DWH turn out to be factless tables.
2. The second type of factless fact table is called a coverage table. Coverage tables are frequently needed in a dimensional DWH when the primary fact table is sparse.
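A sketch of an event tracking (factless) fact table for student attendance; with no measure columns, analysis is done by counting rows. Names are illustrative:

CREATE TABLE fact_attendance (
    date_key    INTEGER,
    student_key INTEGER,
    class_key   INTEGER     -- only dimension keys, no numeric facts
);

-- "How many students attended each class?" = count the event rows.
SELECT class_key, COUNT(*) AS attendance_count
FROM fact_attendance
GROUP BY class_key;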
Conformed fact table: If two fact tables contain the same measures, defined identically, they are called conformed fact tables.
Snapshot fact table: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts.
Transaction fact table: A transaction is a set of data fields that record a basic business event.
Cumulative fact table: This type of fact table describes what has happened over a period of time. For
example, this fact table may describe the total sales by product by store by day. The facts for this type
of fact tables are mostly additive facts.
Conceptual modeling:
1. In this phase the data modeler needs to understand the scope of the business and the business requirements.
2. After understanding the business requirements, the modeler needs to identify the lowest-level grains, such as entities and attributes.
Logical modeling:
1. Design the dimension tables with the lowest-level grains identified during conceptual modeling.
2. Provide the relationships between the dimension and fact tables using primary keys and foreign keys.
Physical modeling:
1. Logical design is what we draw with a pen and paper or design with Oracle Designer or ERWIN before building our warehouse.
2. During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure.
Data modeling tools include:
1. Oracle Designer
2. Erwin (Entity Relationship for windows).
3. Informatica (Cubes/Dimensions).
4. Embarcadero.
5. PowerDesigner (Sybase)
OLAP: It is a set of specifications which allows client applications (reports) to retrieve data from the data warehouse.
In OLAP, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) is a combination of MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube.
The storage is not in the relational database, but in proprietary formats.
Advantages:
1. Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for
slicing and dicing operations.
2. Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
1. Limited in the amount of data it can handle: Because all calculations are performed when the
cube is built, it is not possible to include a large amount of data in the cube itself. This is not
to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is
possible. But in this case, only summary-level information will be included in the cube itself.
2. Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.
ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement, as the sketch below shows. E.g. Oracle, MicroStrategy, SAP BW, SQL Server, Sybase, Informix, DB2.
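Against the star schema sketched earlier, a slice (one quarter) and a dice (chosen regions) are literally extra WHERE predicates:

SELECT s.city, p.category, SUM(f.sales_amt) AS total_sales
FROM fact_sales f
JOIN dim_store   s ON f.store_key   = s.store_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date    d ON f.date_key    = d.date_key
WHERE d.year_no = 2024 AND d.quarter_no = 1   -- slice: a single quarter
  AND s.region IN ('East', 'West')            -- dice: selected regions
GROUP BY s.city, p.category;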
Advantages:
1. Can handle large amounts of data: The data size limitation of ROLAP technology is the
limitation on data size of the underlying relational database. In other words, ROLAP itself
places no limitation on data amount.
2. Can leverage functionalities inherent in the relational database: Often, relational database
already comes with a host of functionalities. ROLAP technologies, since they sit on top of the
relational database, can therefore leverage these functionalities.
Disadvantages:
1. Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple
SQL queries) in the relational database, the query time can be long if the underlying data size
is large.
2. Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for
example, it is difficult to perform complex calculations using SQL), ROLAP technologies are
therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by
building into the tool out-of-the-box complex functions as well as the ability to allow users to
define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type
information, HOLAP leverages cube technology for faster performance.
When detail information is needed, HOLAP can "drill through" from the cube into the underlying
relational data.