Data Warehousing & Mining: Unit - II


Data Warehousing & Mining

UNIT – II

Prof. S. K. Pandey, I.T.S, Ghaziabad


Syllabus of Unit - II
• Data Warehousing
• Data Warehousing Components
• Building a Data Warehouse
• Warehouse Database
• Mapping the Data Warehouse to a Multiprocessor Architecture
• DBMS Schemas for Decision Support
• Data Extraction, Cleanup & Transformation Tools
• Metadata
Data Warehouse
• The data warehouse is an environment, not a product.
• It is an architectural construct of an information system that provides users with current and historical
decision support information that is hard to access or present in traditional operational data stores.
• Data warehousing is a blend of technologies and components aimed at the effective integration of
operational databases into an environment that enables the strategic use of data.
• These technologies include relational and multi-dimensional database management systems,
client/server architecture, meta data modeling and repositories, graphical user interfaces, etc.



Data Warehousing Components

• The data warehouse architecture is based on a relational
database management system server that functions as the
central repository for informational data. Operational
data and processing are completely separated from data
warehouse processing. This central information
repository is surrounded by a number of key components
designed to make the entire environment functional,
manageable and accessible, both by the operational
systems that source data into the warehouse and by
end-user query and analysis tools.



Components of Data Warehouse continued…
• A data warehouse has the following seven components:
– Data Warehouse Database
– Sourcing, Acquisition, Cleanup and Transformation Tools
– Meta Data
– Access (Query) Tools
The query tool allows executives and other users real-time access to the
Data Warehouse database for query generation, result displays, reports
and data exports
– Data Marts
– Data Warehouse Administration and Management
– Information Delivery System
Components & Framework



1. Data Warehouse Database
The central data warehouse database is the cornerstone of the data warehousing
environment. Certain data warehouse attributes, such as very large database size,
ad hoc query processing and the need for flexible user view creation including
aggregates, multi-table joins and drill-downs, have become drivers for different
technological approaches to the data warehouse database. These approaches
include:
– Parallel relational database designs for scalability, which include shared-memory,
shared-disk, or shared-nothing models implemented on various multiprocessor
configurations (symmetric multiprocessors or SMP, massively parallel processors or
MPP, and/or clusters of uni- or multiprocessors).
– An innovative approach to speed up a traditional RDBMS by using new index
structures to bypass relational table scans.
– Multidimensional databases (MDDBs) that are based on proprietary database
technology. Multi-dimensional databases are designed to overcome any limitations
placed on the warehouse by the nature of the relational data model. MDDBs enable
on-line analytical processing (OLAP) tools that architecturally belong to a group of
data warehousing components jointly categorized as the data query, reporting,
analysis and mining tools.
2. Sourcing, Acquisition, Cleanup and
Transformation Tools
The data sourcing, cleanup, transformation and migration tools
perform all of the conversions, summarizations, key changes,
structural changes and condensations needed to transform disparate
data into information that can be used by the decision support tool.
They produce the programs and control statements, including the
COBOL programs, MVS job-control language (JCL), UNIX
scripts, and SQL data definition language (DDL), needed to move
data into the data warehouse from multiple operational systems.
These tools also maintain the meta data. The functionality includes:
– Removing unwanted data from operational databases
– Converting to common data names and definitions
– Establishing defaults for missing data
– Accommodating source data definition changes
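
The first three functions listed above can be illustrated with a minimal Python sketch. The field names, the rename map and the default values are hypothetical and chosen only for illustration; real tools of this kind generate COBOL, JCL, scripts or DDL rather than Python.

```python
# Hypothetical mapping from source-specific field names to common warehouse names.
COMMON_NAMES = {"CUST_NM": "customer_name", "cust_name": "customer_name", "RGN": "region"}
# Hypothetical operational-only fields that the warehouse does not need.
UNWANTED_FIELDS = {"lock_flag", "session_id"}
# Hypothetical defaults for missing data.
DEFAULTS = {"region": "UNKNOWN"}


def clean_record(record):
    """Return a cleaned copy of one operational record."""
    cleaned = {}
    for name, value in record.items():
        if name in UNWANTED_FIELDS:
            continue  # remove unwanted operational data
        cleaned[COMMON_NAMES.get(name, name)] = value  # convert to common names
    for name, default in DEFAULTS.items():
        if cleaned.get(name) in (None, ""):
            cleaned[name] = default  # establish a default for missing data
    return cleaned


billing_row = {"CUST_NM": "Asha", "RGN": "", "lock_flag": 1}
orders_row = {"cust_name": "Ravi", "RGN": "North", "session_id": "x9"}
cleaned_rows = [clean_record(billing_row), clean_record(orders_row)]
```

Both source systems end up with the same field names, which is what allows their records to be merged in the warehouse.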



3. Meta Data
Meta data is data about data that describes the data
warehouse. It is used for building, maintaining,
managing and using the data warehouse. Meta data
can be classified into:

– Technical meta data, which contains information about
warehouse data for use by warehouse designers and
administrators when carrying out warehouse development
and management tasks.
– Business meta data, which contains information that gives
users an easy-to-understand perspective of the information
stored in the data warehouse.
4. Access (Query) Tools

Query and reporting tools can be divided into two groups: reporting tools
and managed query tools.
– Reporting tools can be further divided into production
reporting tools and report writers.
    • Production reporting tools let companies generate regular operational
    reports or support high-volume batch jobs such as calculating and
    printing paychecks.
    • Report writers, on the other hand, are inexpensive desktop tools
    designed for end users.
– Managed query tools shield end users from the complexities
of SQL and database structures by inserting a meta-layer
between users and the database. These tools are designed for
easy-to-use, point-and-click operations that either accept SQL
or generate SQL database queries.
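
The idea of a meta-layer that generates SQL can be sketched as follows. This is only an illustration: the business terms, table names and column names (`fact_sales`, `dim_customer`) are hypothetical and do not describe any particular product.

```python
# Hypothetical meta-layer: business terms mapped to physical tables and columns.
META_LAYER = {
    "Customer": "dim_customer.customer_name",
    "Region": "dim_customer.region",
    "Revenue": "SUM(fact_sales.amount)",
}


def build_query(select_terms, group_by_terms):
    """Generate SQL from business terms so the user never writes SQL directly."""
    columns = [f'{META_LAYER[t]} AS "{t}"' for t in select_terms]
    groups = [META_LAYER[t] for t in group_by_terms]
    sql = (
        "SELECT " + ", ".join(columns)
        + " FROM fact_sales JOIN dim_customer"
        + " ON fact_sales.customer_id = dim_customer.customer_id"
    )
    if groups:
        sql += " GROUP BY " + ", ".join(groups)
    return sql


# A point-and-click selection of "Revenue by Region" becomes one generated query.
generated_sql = build_query(["Region", "Revenue"], ["Region"])
```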
5. Data Mart

• The term data mart means different things to
different people. A rigorous definition of this term
is a data store that is subsidiary to a data warehouse
of integrated data. The data mart is directed at a
partition of data (often called a subject area) that is
created for the use of a dedicated group of users.
Data marts can be classified into two categories:
– Dependent Data Marts
– Independent Data Marts



Dependent Data Marts: In these data marts, data is sourced from
the data warehouse. They have a high value because, no matter
how they are deployed and how many different enabling
technologies are used, different users are all accessing
information views derived from the single integrated version of
the data.
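
A dependent data mart can be sketched with Python's built-in `sqlite3` module. The table names and sample rows are hypothetical; the point is only that the mart is derived from the warehouse table, so its figures agree with the integrated version of the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("North", "Pen", 10.0), ("North", "Ink", 5.0), ("South", "Pen", 7.0)],
)

# Dependent data mart: a subject-area subset sourced only from the warehouse table.
conn.execute(
    "CREATE TABLE mart_north_sales AS "
    "SELECT product, amount FROM warehouse_sales WHERE region = 'North'"
)

mart_total = conn.execute("SELECT SUM(amount) FROM mart_north_sales").fetchone()[0]
warehouse_total = conn.execute(
    "SELECT SUM(amount) FROM warehouse_sales WHERE region = 'North'"
).fetchone()[0]
```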

Independent Data Marts: Unfortunately, misleading
statements about the simplicity and low cost of data marts
sometimes result in organizations or vendors incorrectly
positioning them as an alternative to the data warehouse. This
viewpoint defines independent data marts, which in fact represent
fragmented point solutions to a range of business problems in the
enterprise. This type of implementation should rarely be deployed
in the context of an overall technology or applications
architecture. Indeed, it is missing the ingredient that is at the
heart of the data warehousing concept: data integration.
6. Data Warehouse Administration and Management

Managing data warehouses includes:


1. Security and priority management
2. Monitoring updates from the multiple sources
3. Data quality checks
4. Managing and updating meta data
5. Auditing and reporting data warehouse usage
and status
6. Purging data
7. Replicating, sub-setting and distributing data
8. Backup and Recovery and
9. Data warehouse storage management.
7. Information Delivery System
• The information delivery component is used to enable the process of
subscribing for data warehouse information and having it delivered to one or
more destinations according to some user-specified scheduling algorithm.

• In other words, the information delivery system distributes warehouse-stored
data and other information objects to other data warehouses and end-user
products such as spreadsheets and local databases.

• Delivery of information may be based on time of day or on the completion of
an external event.

• The rationale for the delivery systems component is based on the fact that
once the data warehouse is installed and operational, its users don't have to be
aware of its location and maintenance.

Building a Data Warehouse
Why a Data Warehouse Application – Business Perspectives
There are several reasons why organizations consider data
warehousing a critical need. From a business perspective, to
strive and succeed in today’s highly competitive global
environment, business users demand business answers, mainly
because:
• Decisions need to be made quickly and correctly, using all available
data.
• Users are business domain experts, not computer professionals.
• The amount of data in the data stores is increasing, which affects
response time and the sheer ability to comprehend its content.
• Competition is heating up in the areas of business intelligence and
added information value.



Building a Data Warehouse
Why a Data Warehouse Application – Technology Perspectives
• There are also several technology reasons for the existence of data
warehousing.
• First, the data warehouse is designed to address the incompatibility of
informational and operational transactional systems. These two classes of
information systems are designed to satisfy different, often incompatible,
requirements.
• Secondly, the IT infrastructure is changing rapidly, and its capabilities are
increasing, as evidenced by the following:
– The price of MIPS continues to decline, while the power of processors
doubles every 2 years.
– The price of digital storage is dropping rapidly.
– Network bandwidth is increasing, while the price of high bandwidth is
decreasing.
– The workplace is increasingly heterogeneous with respect to both
hardware and software.
– Legacy systems need to, and can, be integrated with new applications.
Building a Data Warehouse

1. Business Considerations (Return on Investment)


2. Design Considerations
3. Technical Considerations
4. Implementation Considerations
5. Integrated Solutions
6. Benefits of Data Warehousing



Building a Data Warehouse Contd..
1. Business Considerations (Return on Investment)
1. Approach
• The Top-down Approach, meaning that an organization has
developed an enterprise data model, collected enterprise-wide business
requirements, and decided to build an enterprise data warehouse with
subset data marts.
• The Bottom-up Approach, implying that the business priorities
resulted in developing individual data marts, which are then integrated
into enterprise data warehouse.

2. Organizational Issues
A data warehouse, in general, is not truly a technological issue. Rather, it
should be more concerned with identifying and establishing information
requirements, the data sources to fulfill these requirements, and timeliness.



Building a Data Warehouse Contd..
2. Design Consideration
To be successful, a data warehouse designer must take a
holistic approach: consider all data warehouse components as
parts of a single complex system and take into account all
possible data sources and all known usage requirements. Failing
to do so may easily result in a data warehouse design that is
skewed toward a particular business requirement, a particular
data source, or a selected access tool. This is also one of the
reasons why a data warehouse is rather difficult to build. The
main factors include:
• Heterogeneity of data sources, which affects data conversion,
quality and timeliness
• Use of historical data, which implies that data may be “old”
• Tendency of databases to grow very large
Building a Data Warehouse Contd..
2. Design Consideration - In addition to the general considerations,
there are several specific points relevant to the data warehouse
design:
• Data Content
• Metadata
• Data Distribution
One of the biggest challenges when designing a data warehouse is the data
placement and distribution strategy.
• Tools
These tools provide facilities for defining the transformation and cleanup
rules, data movement (from operational sources to the warehouse),
end-user query, reporting, and data analysis.
• Performance consideration



Building a Data Warehouse Contd..

3. Technical Considerations
A number of technical issues are to be considered when
designing and implementing a Data Warehouse environment.
1. The Hardware Platform that would house the Data Warehouse for
parallel query scalability. (Uni-Processor, Multi-processor, etc)

2. The DBMS that supports the warehouse database

3. The communication infrastructure that connects the warehouse, data
marts, operational systems, and end users

4. The hardware platform and software to support the metadata
repository

5. The systems management framework that enables centralized
management and administration of the entire environment.
Building a Data Warehouse Contd..
4. Implementation Considerations
i. Access Tools
Currently no single tool in the market can handle all possible data warehouse
access needs. Therefore, most implementations rely on a suite of tools.
Examples of Access types include:
a. Simple Tabular for reporting
b. Ranking
c. Multi-variable Analysis
d. Time Series Analysis
e. Data Visualization, Graphing, Charting and pivoting
f. Complex Textual Search
g. Statistical Analysis
h. AI Techniques for testing of hypothesis, trends discovery, definition,
validation of Data Clusters and segments
i. Information Mapping (i.e. mapping of Spatial Data in geographic information systems)
j. Ad-hoc User Specified Queries
k. Pre-defined repeatable queries
l. Interactive drill-down reporting and analysis
m. Complex queries with multiple joins, multi-level subqueries, and sophisticated
search criteria.
Building a Data Warehouse Contd..
4. Implementation Considerations
ii. Data Extraction, Cleanup, Transformation, and Migration
As a component of the data warehouse architecture, data extraction must be given
proper attention, since it represents a critical success factor for the
architecture.
1. The ability to identify data in the data source environments that can be read by
a conversion tool is important. This additional step may affect the timeliness of data
delivery to the warehouse.
2. Support for flat files (VSAM, ISM, IDMS) is critical, since the bulk of corporate
data is still maintained in this type of data storage.
3. The capability to merge data from multiple data stores is required in many
installations.
4. The specification interface to indicate the data to be extracted and the conversion
criteria is important.
5. The ability to read information from data dictionaries or import information from
repository products is desired.
6. The ability to perform data-type and character-set translation is a requirement when
moving data between incompatible systems.
7. The capability to create summarization, aggregation, and derivation records and fields
is very important.
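
Requirements 6 and 7 can be illustrated with a short Python sketch. It assumes, purely for illustration, that the source system stores text in EBCDIC (Python's standard `cp037` codec) and that amounts arrive as text; the regions and amounts are made-up sample data.

```python
from collections import defaultdict

# Simulated mainframe extract: EBCDIC-encoded text fields (Python's "cp037" codec).
source_rows = [
    ("NORTH".encode("cp037"), "0012.50".encode("cp037")),
    ("NORTH".encode("cp037"), "0007.50".encode("cp037")),
    ("SOUTH".encode("cp037"), "0003.00".encode("cp037")),
]


def translate(row):
    """Character-set translation (EBCDIC to str) and data-type conversion (text to float)."""
    region_raw, amount_raw = row
    return region_raw.decode("cp037"), float(amount_raw.decode("cp037"))


# Summarization: derive one total per region from the detail records.
totals = defaultdict(float)
for region, amount in map(translate, source_rows):
    totals[region] += amount
```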
Building a Data Warehouse Contd..
4. Implementation Considerations
iii. Data Placement Strategies

As a data warehouse grows, there are at least two options for data
placement. One is to put some of the data in the data warehouse
onto other storage media (WORM, RAID). The second option is to
distribute the data in the data warehouse across multiple servers. Some
criteria must be established for dividing it over the servers: by
geography, organization unit, time, function, etc. However the
data is divided, a single source of meta data across the entire
organization is required. Hence this configuration requires both
corporation-wide meta data and the meta data managed for any given server.
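
The second option can be sketched in Python as a placement rule plus two levels of meta data. The server names and the choice of geography as the dividing criterion are hypothetical; this is a toy model, not a description of any product.

```python
# Hypothetical placement rule: rows are divided over servers by geography.
PLACEMENT = {"north": "server_a", "south": "server_b"}
DEFAULT_SERVER = "server_c"

# Per-server meta data: what each server holds.
server_catalogs = {}
# Corporation-wide meta data: a single catalog describing where everything is.
global_catalog = {}


def place(row):
    """Route a row to a server and record the placement in both catalogs."""
    server = PLACEMENT.get(row["region"], DEFAULT_SERVER)
    server_catalogs.setdefault(server, set()).add(row["region"])
    global_catalog[row["region"]] = server
    return server


rows = [{"region": "north"}, {"region": "south"}, {"region": "east"}]
assignments = [place(r) for r in rows]
```

A query tool would consult `global_catalog` to find the right server without the user needing to know how the data was divided.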



Building a Data Warehouse Contd..

4. Implementation Considerations
iv. Metadata
A frequently occurring problem in data warehousing is
communicating to the end user what
information resides in the data warehouse and how it can be
accessed. The key to providing users and applications with
a roadmap to the information stored in the warehouse is the
metadata. It can define all data elements and their
attributes, data sources and timing, and the rules that
govern data use and data transformations. Metadata needs
to be collected as the warehouse is designed and built.



Building a Data Warehouse Contd..
4. Implementation Considerations
v. User Sophistication Levels
Data warehousing is a relatively new phenomenon, and a certain
degree of sophistication is required on the end user’s part to
use the warehouse effectively. Users can be classified on the
basis of their skill level in accessing the warehouse:
1. Casual Users: These users are most comfortable retrieving information
from the warehouse in pre-defined formats, and running pre-existing queries and
reports.
2. Power Users: In their daily activities, these users typically combine
pre-defined queries with some relatively simple ad-hoc queries that they
create themselves. These users need access tools that combine the simplicity of
pre-defined queries and reports with a certain degree of flexibility.
3. Experts: These users tend to create their own queries and perform
sophisticated analysis on the information they retrieve from the warehouse.
These users know the data, tools and database well enough to demand tools that
allow for maximum flexibility and adaptability.
Benefits of Data Warehouse
A successfully implemented data warehouse can realize some significant
benefits, which fall into two categories:
1. Tangible Benefits:
1. Product inventory turnover is improved
2. Costs of product introduction are decreased with improved
selection of target markets.
3. More cost effective decision making is enabled by separating (ad-
hoc) query processing from running against operational database.
4. Better business intelligence is enabled by increased quality and
flexibility of market analysis available through multi-level data structures, which
may range from detailed to highly summarized.
2. Intangible Benefits:
1. Improved productivity
2. Reduced redundant processing, support, and software to support
overlapping decision support applications
3. Enhanced Customer relations through improved knowledge of
individual requirements and trends, through customization, improved
communications, and tailored product offerings.
4. Enabling business process reengineering: data warehousing can
provide useful insights into the work processes themselves.
Warehouse Database
• Organizations that embark on data warehousing
development deal with ever-increasing amounts of data. Generally
speaking, the size of a data warehouse rapidly approaches the point
where the search for better performance and scalability becomes a
real necessity. This search pursues two goals:
– Speed-up: the ability to execute the same request on the same
amount of data in less time
– Scale-up: the ability to obtain the same performance on the
same request as the database size increases.
An additional and important goal is to achieve linear speed-up and scale-up:
doubling the number of processors cuts the response time in half (linear
speed-up) or provides the same performance on twice as much data (linear
scale-up).
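
These two goals are often expressed as simple ratios of elapsed times. The sketch below uses that common formulation, which is an assumption added here rather than something defined on the slide; the timings are hypothetical.

```python
def speed_up(time_small_system, time_large_system):
    """Same request, same data: elapsed time on the smaller system over the larger one."""
    return time_small_system / time_large_system


def scale_up(time_small_system_small_data, time_large_system_large_data):
    """Data grows with the system: a ratio of 1.0 means performance was maintained."""
    return time_small_system_small_data / time_large_system_large_data


# Hypothetical timings in seconds.
linear_speed_up = speed_up(100.0, 50.0)    # twice the processors, half the time
linear_scale_up = scale_up(100.0, 100.0)   # twice the processors and data, same time
```

With twice the processors, a speed-up of 2.0 or a scale-up of 1.0 is linear; smaller values indicate overhead from coordination or contention.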



Mapping the Data Warehouse to a
Multiprocessor Architecture
• The goals of linear performance and scalability (discussed on the
previous slide) can be satisfied by parallel hardware
architectures, parallel operating systems, and parallel DBMSs.
Parallel hardware architectures are based on multiprocessor
systems designed as a shared-memory model (symmetric
multiprocessors), a shared-disk model or a distributed-memory
model (MPP and clusters of SMPs). Parallelism can be achieved
in two different ways, with data partitioning as a related technique:
– Horizontal Parallelism (the database is partitioned across different disks)
– Vertical Parallelism (occurs among different tasks: all component query
operations, i.e. scan, join and sort, are executed in parallel)
– Data Partitioning
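
The three ideas can be illustrated with a toy Python sketch. It is only a conceptual model: threads and generators stand in for the parallel tasks of a real parallel DBMS, and the sample rows and the `amount > 20` predicate are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Data partitioning: rows are split across (simulated) disks by key.
rows = [{"id": i, "amount": i * 10} for i in range(1, 9)]
NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    partitions[row["id"] % NUM_PARTITIONS].append(row)


def scan_and_sum(partition):
    """The same scan-and-aggregate task, run against one partition."""
    return sum(r["amount"] for r in partition if r["amount"] > 20)


# Horizontal parallelism: one instance of the task per partition, run concurrently.
with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    partial_sums = list(pool.map(scan_and_sum, partitions))
horizontal_total = sum(partial_sums)


# Vertical parallelism: different operations form a pipeline, each consuming the
# previous one's output row by row (generators stand in for concurrent tasks).
def scan(source):
    for r in source:
        yield r


def select(source):
    for r in source:
        if r["amount"] > 20:
            yield r


pipeline_total = sum(r["amount"] for r in select(scan(rows)))
```

Both approaches produce the same answer; they differ in whether the work is divided by data (horizontal) or by operation (vertical).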



Database Architectures for Parallel Processing

• Shared-memory Architecture
• Shared-disk Architecture
• Shared-nothing Architecture
• Combined Architecture



Parallel RDBMS Features
• Data warehouse development requires a good understanding of all
architectural components, including the data warehouse DBMS
platform. Understanding the basic architecture of the warehouse
database is the first step in evaluating and selecting a product.
• State-of-the-art parallel features that developers and users of the
warehouse should demand from the DBMS vendor:
– Scope and techniques of parallel DBMS operations
– Queries (Insert/Update/Delete)
– Parallel database load, backup, reorganization and recovery: a DBMS
that supports these is much better positioned for VLDBs
– Optimizer implementation
– Application transparency
– The parallel environment
– DBMS management tools
– Price/performance
Parallel DBMS Vendors
• Oracle – Oracle supports parallel database processing with its add-on
Oracle Parallel Server Option (OPS) and Parallel Query Option (PQO) with
Query Coordinator.
• Informix – Informix developed its Dynamic Scalable Architecture (DSA) to
support shared-memory, shared-disk, and shared-nothing models. Informix
OnLine release 8, also known as XPS (eXtended Parallel Server), supports MPP
hardware platforms that include IBM SP, AT&T, Sun, HP and ICL Goldrush, along
with Sequent, Siemens, Pyramid, etc.
• IBM – DB2 Parallel Edition (DB2 PE), a database based on the DB2/6000 server
architecture; the latest version is DB2 Universal Database.
• Sybase – Sybase implemented its parallel DBMS functionality in a product
called SYBASE MPP (formerly Navigation Server). It was jointly developed by
Sybase and NCR (formerly AT&T GIS), and its first release was targeted at the
AT&T 3400, 3500 (both SMP) and 3600 (MPP) platforms.
• Other RDBMS Products: (i) NCR Teradata, (ii) Tandem NonStop SQL/MP
• Specialized Database Products: (i) Red Brick Systems,
(ii) White Cross Systems Inc.
DBMS Schemas for Decision
Support
• Data warehousing projects were forced to choose
between a data model and a corresponding database
schema that is intuitive for analysis but performs poorly
and a model-schema that performs better but is not well
suited for analysis.
• As data warehousing continued to mature, new
approaches to schema design resulted in schemas better
suited to the business analysis that is so crucial to successful
data warehousing.
• The schema methodology that is gaining widespread
acceptance for data warehousing is the Star Schema.
Data Layout for best Access
• The original objectives in developing an abstract model known as the
Relational Model were to address a number of shortcomings of
non-relational DBMSs and application development.
• The typical requirements for an RDBMS supporting operational
systems are based on the need to effectively support a large
number of small but simultaneous read and write requests.
• The demands placed on the RDBMS by a data warehouse are very
different. A data warehouse RDBMS typically needs to process
queries that are large, complex, ad hoc and data intensive.
• Solving modern business problems such as market analysis and
financial forecasting requires query-centric database schemas that
are array-oriented and multi-dimensional in nature.



Multi-dimensional Data Model

• The multi-dimensional nature of business questions
is reflected in the fact that, for example, marketing
managers are no longer satisfied by asking simple
one-dimensional questions such as “How much
revenue did the new product generate?” Instead, they
ask questions such as “How much revenue did the new
product generate by month, in the northeastern division,
broken down by user demographic, by sales office,
relative to the previous version of the product,
compared with the plan?”, which is a six-dimensional
question.



STAR SCHEMA

• The multi-dimensional view of data that is expressed using
relational database semantics is provided by the database
schema design called Star Schema.
• The basic premise of Star Schema is that information can be
classified into two groups: facts and dimensions.
• Facts are the core data elements being analyzed. For example,
units of individual items sold are facts.
• Dimensions are attributes about the facts. For example,
dimensions are the product types purchased and the date of
purchase.
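
The slide's own example (units sold as the fact, product type and purchase date as dimensions) can be sketched as a small star schema using Python's built-in `sqlite3` module. The table names, column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: attributes that describe the facts.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_type TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, purchase_date TEXT);
-- Fact table: the measure being analyzed, keyed by its dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER
);
INSERT INTO dim_product VALUES (1, 'Stationery'), (2, 'Books');
INSERT INTO dim_date VALUES (1, '2024-01-01'), (2, '2024-01-02');
INSERT INTO fact_sales VALUES (1, 1, 5), (1, 2, 3), (2, 1, 4);
""")

# A typical star query: join the fact table to a dimension and aggregate.
units_by_type = dict(conn.execute("""
    SELECT p.product_type, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.product_type
"""))
```

The fact table sits at the center and each dimension table joins to it directly, which gives the schema its star shape.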



Data Extraction, Cleanup &
Transformation Tools
• The task of capturing data from a source data system,
cleaning and transforming it and then loading the results into
a target data system can be carried out either by separate
products or by a single integrated solution. More
contemporary integrated solutions fall into one of the
categories described below:
– Code Generators
– Database Data Replication Tools
– Rule-driven Dynamic Transformation Engines (Data Mart
Builders)



Code Generator
– It creates 3GL/4GL transformation programs based on source
and target data definitions, and data transformation and
enhancement rules defined by the developer.
– This approach reduces the need for an organization to write
its own data capture, transformation, and load programs.
These products employ DML statements to capture a set of
data from the source system.
– These are used for data conversion projects, and for building
an enterprise-wide data warehouse, when there is a significant
amount of data transformation to be done involving a variety
of different flat files, non-relational, and relational data
sources.
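
The principle of a code generator can be sketched in a few lines of Python. Real products emit 3GL/4GL programs; this simplified stand-in emits a single SQL statement instead, and the table names, column names and rules are hypothetical.

```python
def generate_load_sql(source_table, target_table, column_rules):
    """Generate an INSERT ... SELECT statement from definitions and rules.

    column_rules maps each target column to a SQL expression over the source.
    """
    target_cols = ", ".join(column_rules)
    source_exprs = ", ".join(column_rules.values())
    return (
        f"INSERT INTO {target_table} ({target_cols}) "
        f"SELECT {source_exprs} FROM {source_table};"
    )


# Developer-defined transformation and enhancement rules (hypothetical).
rules = {
    "customer_name": "UPPER(cust_nm)",
    "region": "COALESCE(rgn, 'UNKNOWN')",
}
load_sql = generate_load_sql("src_customers", "dw_customer", rules)
```

The developer maintains only the definitions and rules; the load program is regenerated whenever they change.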



Database Data Replication Tools
– These tools employ database triggers or a recovery log to
capture changes to a single data source on one system and
apply the changes to a copy of the data source data located on
a different system.
– Most replication products do not support the capture of
changes to non-relational files and databases, and often do not
provide facilities for significant data transformation and
enhancement.
– These point-to-point tools are used for disaster recovery and
to build an operational data store, a data warehouse, or a data
mart when the number of data sources involved is small and
a limited amount of data transformation and enhancement is
required.
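
Log-based change capture can be sketched as follows. In this toy Python model, a list stands in for the recovery log (or a trigger-populated change table) and two dictionaries stand in for the source and the copy; all names are hypothetical.

```python
source = {}
replica = {}
change_log = []  # stands in for a recovery log or trigger-populated change table


def write(key, value):
    """Apply a write to the source and record it in the change log."""
    source[key] = value
    change_log.append(("upsert", key, value))


def delete(key):
    """Delete from the source and record the deletion in the change log."""
    source.pop(key, None)
    change_log.append(("delete", key, None))


def apply_changes(position):
    """Replay log entries from `position` onward against the replica."""
    for op, key, value in change_log[position:]:
        if op == "upsert":
            replica[key] = value
        else:
            replica.pop(key, None)
    return len(change_log)  # new position for the next run


write("c1", "Asha")
write("c2", "Ravi")
position = apply_changes(0)
delete("c1")
position = apply_changes(position)
```

Only the changes since the last position are shipped, and no transformation is applied along the way, which matches the limited role described above.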
Rule-driven Dynamic Transformation
Engines
– They are also known as Data Mart Builders and capture data from a
source system at User-defined intervals, transform data, and then send
and load the results into a target environment, typically a data mart.
– To date, most products in this category support only relational
data sources, though this trend has started to change.
– Data to be captured from the source system is usually defined using query
language statements, and data transformation and enhancement is
done using a script or function logic defined to the tool.
– With most tools in this category, data flows from source systems to
target systems through one or more servers, which perform the data
transformation and enhancement. These transformation servers can
usually be controlled from a single location, making the job of managing such
an environment much easier.
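
The capture, transform and load cycle of such an engine can be sketched in Python, with the rules held as data rather than as hand-written programs. The rules, the capture filter and the sample rows are hypothetical, and a real engine would be driven by a scheduler at user-defined intervals.

```python
# Hypothetical rules, defined as data: (field name, transformation function).
RULES = [
    ("amount", lambda v: round(float(v), 2)),        # data-type conversion
    ("region", lambda v: (v or "UNKNOWN").upper()),  # default plus normalization
]


def capture(source_rows, min_amount):
    """Stand-in for a query-language capture step (similar to a WHERE clause)."""
    return [r for r in source_rows if float(r["amount"]) >= min_amount]


def transform(row):
    """Apply every rule to a copy of the row."""
    out = dict(row)
    for field, func in RULES:
        out[field] = func(out.get(field))
    return out


def run_cycle(source_rows, target):
    """One capture-transform-load cycle; a scheduler would call this at intervals."""
    target.extend(transform(r) for r in capture(source_rows, min_amount=1))
    return target


data_mart = run_cycle(
    [
        {"amount": "12.504", "region": "north"},
        {"amount": "0.5", "region": None},
        {"amount": "3", "region": ""},
    ],
    [],
)
```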
