
UNIT-II

Data Warehouse Process & Technology

Odd Semester
B.Tech (AI)

Prepared by
Utkarsh Mishra Assistant Professor Miet Meerut

[email protected]
Warehousing Strategy
• A traditional Information Strategy Plan (ISP) addresses
operational computing needs thoroughly but does not give
sufficient attention to decisional information requirements.
• A warehouse strategy focuses on the decisional needs of the enterprise.

What will happen if management does not make a
strategy?
Strategy Components
 Preliminary data warehouse rollout plan
-Divide warehouse development into phased, successive rollouts
-Each rollout focuses on meeting an agreed set of requirements
-The process is iterative in nature
-and therefore manageable
 Preliminary data warehouse architecture
-Define the overall data warehouse architecture for the pilot and
subsequent warehouse rollouts
-Ensure scalability of the warehouse
-Define the initial technical architecture of each rollout
 Short-listed data warehouse environment and tools
-Create a short-list of the tools and environments
-Selection should be driven by warehousing needs
Warehousing Strategy Activities

 Determine Organizational Context


 Conduct Preliminary Survey of Requirements
-Interview Categories and Sample Questions
-Interviewing Tips
 Conduct Preliminary Source System Audit
 Identify External Data Sources (If Applicable)
 Define Warehouse Rollouts (Phased Implementation)
 Define Preliminary Data Warehouse Architecture
 Evaluate Development and Production Environment and
Tools

Determine Organizational Context

• Answers to organizational background questions are obtained
from the Project Sponsor, the CIO, or the Project Manager
assigned to the warehousing effort.
Typical organizational background questions include:

 Who is the Project Sponsor for this project?

 Which IS and IT groups in the organization are involved in
the data warehousing effort?

 What are the roles and responsibilities of the individuals
involved in this effort?
Conduct Preliminary Survey of
Requirements
🞂 Obtain an inventory of the users' requirements.
🞂 The requirements inventory describes the information that the
warehouse is expected to eventually provide.
🞂 The objective is to understand user needs well enough to prioritize
the requirements.
🞂 It is a critical input for identifying the scope of each data warehouse
rollout.
Interview Categories & Sample Questions
Questions related to following categories are asked:
I. Functions
• What is the mission of your group or unit?
• How do you go about fulfilling this mission?
• How do you know if you've been successful with your mission?
• What are the key performance indicators and critical success factors?
II. Customers
• How do you group or classify your customers?
• Does your grouping affect how you treat your customers?
• What kind of information do you track for each type of client?
III. Profit
• At what level do you measure profitability in your group? Per
agent? Per customer?
IV. Systems
V. Time
• Queries and reports:
• What reports do you use now?
• What information do you actually use in each of the reports you now receive?
• Can we obtain samples of these reports?
• How often are these reports produced?
• What reports do you produce for other people?
• Product:
• What products do you sell, and how do you classify them?
• Do you have a product hierarchy?
• Do you analyze data for all products at the same time, or do you analyze one
product type at a time?
• How do you handle changes in product hierarchy and product description?
• Geography:
• Does your company operate in more than one location?
• Do you divide your market into geographical areas?
• Do you track sales per geographic region?
Conduct Preliminary Source System Audit

Conduct an audit of each source system with respect to:
-data access
-network facilities
-data quality
-documentation
-possible extraction mechanisms
Identify External Data Sources (If Applicable)
• Examples of external data sources could be:
• Data from credit agencies
• Zip code or mail code data
• Statistical or census data
• Data from industry organizations
• Data from publications & news agencies
• Use of external data presents opportunities for enriching the
DW
Define Warehouse Rollouts (Phased
Implementation)
• Dividing DW development into phased, successive rollouts
• helps manage user expectations through the clear definition of scope for each
rollout.
• The figure shows a sample table listing all requirements identified during the initial
round of interviews with end users.
• Each requirement is assigned a priority level.
Define Preliminary DW Architecture
• Define the preliminary architecture of each rollout based on the approved rollout
scope.
• Explore the possibility of using a mix of relational & multidimensional databases
and tools as shown in fig.
• The preliminary architecture should indicate the following:
 Data Warehouses & Data Marts
 Number of Users
 Location

Evaluate Development & Production Environments and Tools
 Enterprise can choose from several environments &
tools for the data warehouse initiative
 Select the best combination of tools
 Produce a short-list from which each rollout or project
will choose its tool set
Warehouse Management and Support Processes

 These processes are designed to address aspects of planning and
managing a data warehouse project.
 They are critical to successful implementation and subsequent
extension.
 The processes are defined to assist the project manager and warehouse driver
during warehouse development projects.
 The following issues are covered:
1. Define Issue Tracking and Resolution Process
2. Perform Capacity Planning
3. Define Warehouse Purging Rules
4. Define Security Measures
5. Define Backup and Recovery Strategy
6. Set Up Collection of Warehouse Usage Statistics
1. Define Issue Tracking and Resolution Process

• During the course of a project, a number of business and technical issues will
surface. A sample issue log tracks all issues that arise during the project.
• Issue logs formalize the issue resolution process. They serve as a formal record
of key decisions made throughout the project.
Some issue tracking guidelines are:

• Issue description: State the issue briefly in two to three
sentences.
• Urgency: Indicate the priority level of the issue: high,
medium, or low.
• Raised by: Identify the person who raised the issue.
• Assigned to: Identify the person on the team who is
responsible for resolving the issue.
• Date opened: This is the date when the issue was first logged.
• Date closed: This is the date when the issue was finally
resolved.
• Resolved by: The person who resolved the issue.
• Resolution description: State briefly the resolution of this
issue in two or three sentences.
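The fields above can be captured in a simple issue-log record. The sketch below is illustrative only; the class and field names are assumptions, not part of the text:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Issue:
    """One entry in the project issue log (fields follow the guidelines above)."""
    description: str                     # two to three sentences
    urgency: str                         # "high", "medium", or "low"
    raised_by: str
    assigned_to: str
    date_opened: date
    date_closed: Optional[date] = None   # filled in when the issue is resolved
    resolved_by: Optional[str] = None
    resolution: Optional[str] = None

    @property
    def is_open(self) -> bool:
        return self.date_closed is None

# Usage: log a new issue; it stays open until date_closed is set.
issue = Issue(
    description="Customer IDs differ between billing and CRM sources.",
    urgency="high",
    raised_by="J. Cruz",
    assigned_to="A. Reyes",
    date_opened=date(2024, 1, 15),
)
print(issue.is_open)  # True
```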
2. Perform Capacity Planning

1. Space Requirements. Space requirements are determined by the
following:
• schema design, expected volume, and expected growth rate;
• indexing strategy used;
• backup and recovery strategy;
• aggregation strategy;
• staging and duplication area required; and
• metadata space requirements.
2. Machine Processing Power.
-MPP (massively parallel processing) and SMP (symmetric
multiprocessing) machines are ideal. Choose a configuration that is
scalable and that meets minimum processing requirements.
3. Network Bandwidth. The network bandwidth must not be allowed
to slow down the warehouse extraction and warehouse performance.
Verify all assumptions about the network bandwidth before proceeding
with each rollout.
4. Number of Concurrent Users.
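A rough space estimate can be sketched from the factors listed under Space Requirements. The multipliers below are illustrative assumptions, not recommended values:

```python
def estimate_space_gb(raw_data_gb: float,
                      annual_growth_rate: float,
                      years: int,
                      index_factor: float = 0.5,      # assumed indexing overhead
                      aggregate_factor: float = 0.3,  # assumed aggregation overhead
                      staging_factor: float = 0.2,    # assumed staging/duplication area
                      metadata_gb: float = 1.0) -> float:
    """Estimate total warehouse space from expected volume, growth, and overheads."""
    # Grow the raw data volume over the retention period.
    grown = raw_data_gb * (1 + annual_growth_rate) ** years
    # Indexing, aggregation, and staging overheads scale with the data volume.
    overheads = grown * (index_factor + aggregate_factor + staging_factor)
    return grown + overheads + metadata_gb

# 100 GB of raw data growing 25% per year over a 3-year horizon.
print(round(estimate_space_gb(100, 0.25, 3), 1))
```

Backup and recovery storage would be sized separately on top of this.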
3. Define Warehouse Purging Rules
Purging rules specify when data are to be removed from the data
warehouse.
 Companies are typically interested in tracking performance over 3 to 5 years.
 In cases where a longer retention period is required, the end users will
require only high-level summaries for comparison.
 Define the mechanisms for archiving or removing older data from the data
warehouse.
 Check for any legal, regulatory, or auditing requirements that may warrant
the storage of data in other media prior to actual purging from the
warehouse.
 Acquire the software and devices that are required for archiving.
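A minimal sketch of a purge rule follows: rows older than the retention period are archived before removal, per the archiving requirement above. The fact-table layout and 5-year retention are assumptions for illustration:

```python
from datetime import date, timedelta

RETENTION_DAYS = 5 * 365  # assumed 5-year retention, within the 3-to-5-year range

def purge(rows, today, archive):
    """Move rows older than the retention cutoff into `archive`; keep the rest."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    kept = []
    for row in rows:
        if row["sale_date"] < cutoff:
            archive.append(row)   # archive to other media prior to actual purging
        else:
            kept.append(row)
    return kept

rows = [{"sale_date": date(2015, 6, 1)}, {"sale_date": date(2023, 6, 1)}]
archive = []
rows = purge(rows, today=date(2024, 1, 1), archive=archive)
print(len(rows), len(archive))  # 1 1
```

In a real warehouse this logic would run as a scheduled DELETE with an export step, not over in-memory lists.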
4. Define Security Measures

🞂 Keep the data warehouse secure to prevent the loss of
competitive information either to unforeseen disasters or
to unauthorized users.
🞂 Define the security measures for the data warehouse,
taking into consideration both physical security (i.e., where
the data warehouse is physically located) and user-access
security.
5. Define Backup and Recovery Strategy
• Consider the following factors:
• Data to be backed up.
• Batch window of the warehouse.
• Maximum acceptable time for recovery.
• Acceptable costs for backup and recovery.
• Also consider the following when selecting the backup mechanism:
• Archive format.
• Automatic backup devices.
• Parallel data streams.
• Incremental backups.
• Offsite backups.
• Backup and recovery procedures.
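The incremental-backup factor above can be sketched with the standard library: only files changed since the last backup are archived. The function and path names are illustrative assumptions:

```python
import os
import tarfile

def incremental_backup(src_dir: str, archive_path: str, since: float) -> int:
    """Archive only files under src_dir modified after `since` (epoch seconds).

    Returns the number of files added to the gzip-compressed archive.
    """
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.path.getmtime(path) > since:
                    tar.add(path)
                    count += 1
    return count
```

A full backup is the same call with `since=0`; the `since` timestamp of each run would be recorded for the next incremental pass.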
6. Set Up Collection of Warehouse Usage Statistics

• These are collected to provide the data warehouse designer with inputs
for further refining the data warehouse design
• and to track general usage and acceptance of the warehouse.
• Define a mechanism for collecting these statistics
• and assign resources to monitor and review them regularly.
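Collecting usage statistics can be as simple as counting queries per user and per table. The sketch below is an illustration; the class and attribute names are assumptions:

```python
from collections import Counter
from datetime import datetime

class UsageStats:
    """Track which users query which tables, as input for design refinement."""
    def __init__(self):
        self.by_user = Counter()
        self.by_table = Counter()
        self.log = []  # (timestamp, user, table) for detailed review

    def record(self, user: str, table: str):
        self.by_user[user] += 1
        self.by_table[table] += 1
        self.log.append((datetime.now(), user, table))

stats = UsageStats()
stats.record("alice", "sales_fact")
stats.record("alice", "customer_dim")
stats.record("bob", "sales_fact")
print(stats.by_table.most_common(1))  # [('sales_fact', 2)]
```

Heavily hit tables are candidates for new aggregates or indexes; unused ones are candidates for redesign or purging.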
Data Warehouse Planning
 It is conducted to define the scope of one data warehouse rollout.
 A combination of top-down and bottom-up tracks gives the planning process the
best of both worlds: a requirements-driven approach that is grounded in
available data.
 A clear separation of front-end and back-end tracks encourages development
of warehouse subsystems for extracting, transporting, cleaning, and loading
data independently of the front-end tools.
 The four tracks converge when a prototype of the warehouse is created and
when the actual warehouse implementation takes place.
 Each rollout repeatedly executes four tracks:
-top-down
-bottom-up
-back-end
-front-end
Activities in Data Warehouse Planning
1. Assemble and Orient Team:
o Identify all parties who will be involved in the DW
implementation.
o Brief them about the project.
2. Conduct Decisional Requirements Analysis:
o Analyze requirements to gain an understanding of the
information needs of decision-makers.
o This is the top-down aspect of data warehousing.
3. Conduct Decisional Source System Audit:
o This is a survey of all information systems that are
potential sources of data.
o Data sources are primarily internal.
o If external data sources are available, they may be
integrated into the warehouse.
4. Design Logical and Physical Warehouse Schema:
Design the data warehouse schema that best meets the information
requirements of this rollout.
Two schema design techniques are:
Normalization: the database schema is designed using the
normalization techniques traditionally used for OLTP applications.
Dimensional modeling: produces de-normalized designs
consisting of fact and dimension tables:
-star schema
-snowflake schema
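A dimensional (star schema) design of the kind described above can be sketched in SQLite: one fact table of measures surrounded by de-normalized dimension tables. The table and column names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT      -- de-normalized: the hierarchy lives in one table
);
CREATE TABLE date_dim (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    year      INTEGER,
    month     INTEGER
);
CREATE TABLE sales_fact (   -- fact table: dimension keys plus measures
    product_key INTEGER REFERENCES product_dim(product_key),
    date_key    INTEGER REFERENCES date_dim(date_key),
    quantity    INTEGER,
    amount      REAL
);
""")
conn.execute("INSERT INTO product_dim VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO date_dim VALUES (20240101, '2024-01-01', 2024, 1)")
conn.execute("INSERT INTO sales_fact VALUES (1, 20240101, 3, 29.97)")

# A typical decisional query: sales by product category.
total = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM sales_fact f JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY p.category
""").fetchone()
print(total)
```

A snowflake variant would normalize `category` out of `product_dim` into its own table.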
5. Produce Source-to-Target Field Mapping:
The source-to-target field mapping documents how
fields in the operational (source) systems are
transformed into data warehouse fields.
 To eliminate any confusion as to how data are transformed
as data items are moved from the source systems to the warehouse
database, create a source-to-target field mapping that maps each
source field in each source system to its target field in the DW.
 The transformation is documented for each field in the
mapping.
 The mapping is critical to successful development and maintenance of the DW.
 It serves as the basis for extraction and transformation.
Example: Mapping
Many-to-many Mapping
• A single field in the data warehouse may be populated by data from
more than one source system. This is due to integration of data from
multiple sources.
• A field called Customer Name or Product Name will be populated by
data from more than one system.
• Conversely, a single field in an operational system may need to be split into several warehouse fields.
• Other examples are numeric figures or balances that have to be
allocated correctly to two or more different fields.

Example: the source fields Address line 1 and Address line 2 are split into the target fields Street name, City, Country, and Pin code.
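A one-to-many split of the address kind above can be sketched as follows. The parsing rule is an assumption for illustration; real address standardization is far messier and usually tool-assisted:

```python
def split_address(address_line_1: str, address_line_2: str) -> dict:
    """Map two free-form source address lines to structured warehouse fields.

    Assumed source layout: line 1 holds the street;
    line 2 holds 'City, Country PIN'.
    """
    city, rest = address_line_2.split(",", 1)
    country, pin_code = rest.rsplit(" ", 1)
    return {
        "street_name": address_line_1.strip(),
        "city": city.strip(),
        "country": country.strip(),
        "pin_code": pin_code.strip(),
    }

print(split_address("12 MG Road", "Meerut, India 250001"))
```

Each such rule would be recorded in the source-to-target field mapping so the transformation is unambiguous.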
Historical Data and Evolving Data Structures
If users require loading of historical data, two things must be determined:
Changes in schema:
• Determine if schemas of all source systems have changed over the relevant
time period.
• For example, if the retention period of the data warehouse is 2 years and
data from the past 2 years have to be loaded, the team must check for possible
changes in source system schemas over the past two years.
• If schemas have changed over time, the task of extracting the data
immediately becomes more complicated.
• Each different schema requires a different source-to-target field mapping.
Availability of historical data:
• Determine also if historical data are available for loading.
• Backups during the relevant time period may not contain the required data
items.
• Verify assumptions about the availability and suitability of backups for
historical data loads.
6. Select Development and Production Environment and Tools:
Finalize the computing environment and tool set for this rollout. If an exhaustive
study and selection had been performed during the strategy definition stage, this
activity becomes optional.
7. Create Prototype for This Rollout:
Using the short-listed or final tools and production environment, create a
prototype of the data warehouse.

8. Create Implementation Plan of This Rollout:
-With the scope now fully defined and the source-to-target field mapping fully
specified,
-it is now possible to draft an implementation plan for this rollout.
9. Warehouse Planning Tips and Caveats:
• The actual data warehouse planning activity will rarely be a
straightforward exercise.
• Before conducting your planning activity, understand the
concept of the data trail and the limitations imposed by currently
available data.
Data Warehouse Implementation
• The data warehouse implementation team builds or extends an existing
warehouse schema based on the final logical schema design
produced during planning.
• The team also builds the warehouse subsystems that ensure a
steady, regular flow of clean data from the operational systems into
the data warehouse. Other team members install and configure the
selected front-end tools to provide users with access to warehouse
data.
 This is the most challenging part.
 It describes the activities related to implementing one rollout of the
data warehouse.
 The team resolves the technical difficulties of moving, integrating, and
cleaning data.
 The team also addresses policy issues, resolves organizational conflicts,
and untangles logistical delays.
 An implementation project should be scoped to last between three
and six months.
Implementation Steps
1. Acquire and Set Up Development Environment
2. Obtain Copies of Operational Tables
3. Finalize Physical Warehouse Schema Design
4. Build or Configure Extraction and Transformation Subsystems
5. Build or Configure Data Quality Subsystem
6. Build Warehouse Load Subsystem
7. Set Up Warehouse Metadata
8. Set Up Data Access and Retrieval Tools
9. Perform the Production Warehouse Load
10. Conduct User Training
11. Conduct User Testing and Acceptance
Activities in Data Warehouse Implementation
1. Acquire and Set Up Development Environment: includes the following tasks:
install the hardware, the operating system, the relational database engine; install
all warehousing tools; create all necessary network connections; and create all
required user IDs and user access definitions.
2. Obtain Copies of Operational Tables
3. Finalize Physical Warehouse Schema Design: Translate the detailed logical and
physical warehouse design from the warehouse planning stage into a final
physical warehouse design, taking into consideration the specific, selected
database management system.
4. Build or Configure Extraction and Transformation Subsystems: Easily 60 percent
to 80 percent of a warehouse implementation project is devoted to the back-end
of the warehouse. The back-end subsystems must extract, transform, clean, and
load the operational data into the data warehouse.
5. Build or Configure Data Quality Subsystem:
Users will use the DW only if they know that the information they retrieve from
it is correct.
6. Build Warehouse Load Subsystem:
It takes the load images created by the extraction and transformation subsystems
and loads these images into the warehouse.
 Loading Dirty Data:
-Depending on the extent of data errors, using only clean data can be equally or
more dangerous than relying on a mix of clean and dirty data.
 The Need for Load Optimization:
-The load subsystem optimizes the load process to reduce the total time
required.
 Test Loads:
-The team tests the accuracy and performance of the load subsystem on dummy data
before attempting a real load with actual load images.
-The team should know how much load optimization work is still required.
 Set Up Data Warehouse Schema:
Create the data warehouse schema in the development environment while
the team is constructing or configuring the warehouse back-end subsystems.
7. Set Up Warehouse Metadata:
 Metadata describe the contents of the data warehouse,
 indicate where the warehouse data originally came from, and
 document the business rules that govern the transformation of the data.
 Warehousing tools also use metadata as the basis for automating
certain aspects of the warehousing project.
8. Set Up Data Access and Retrieval Tools: The data access and retrieval
tools are equivalent to the tip of the warehousing iceberg. While they may
represent as little as 10 percent of the entire warehousing effort, they are all
that users see of the warehouse. As a result, these tools are critical to the
acceptance and usability of the warehouse.
9. Perform the Production Warehouse Load: The production data
warehouse load can be performed only when the load images are ready and
both the warehouse schema and metadata are set up.
10. Conduct User Training: The IT organization is encouraged to fully take over
the responsibility of conducting user training.
11. Conduct User Testing and Acceptance: The data warehouse, like any
system, must undergo user testing and acceptance.
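The back-end flow described in steps 4 through 6 (extract, transform, and load, with dirty data set aside for review) can be sketched end to end. The source records, cleaning rule, and target layout below are all illustrative assumptions:

```python
def extract(source_rows):
    """Pull raw records from the (simulated) operational system."""
    return list(source_rows)

def transform(rows):
    """Clean and standardize: trim names, reject rows with missing keys."""
    clean, rejected = [], []
    for row in rows:
        if row.get("cust_id") is None:
            rejected.append(row)          # dirty data, set aside for review
            continue
        clean.append({"cust_id": row["cust_id"],
                      "name": row["name"].strip().title()})
    return clean, rejected

def load(warehouse, load_images):
    """Apply the load images produced by the transformation step."""
    warehouse.extend(load_images)

source = [{"cust_id": 1, "name": "  alice SMITH "},
          {"cust_id": None, "name": "orphan"}]
warehouse = []
images, rejected = transform(extract(source))
load(warehouse, images)
print(warehouse)       # [{'cust_id': 1, 'name': 'Alice Smith'}]
print(len(rejected))   # 1
```

In practice each of these functions is a substantial subsystem (often a commercial ETL tool) rather than a few lines of code, which is why 60 to 80 percent of the project effort goes into the back-end.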
Hardware and Operating Systems
for Data Warehousing
Hardware and Operating Systems

• This refers to the server platforms and operating systems that serve as the
computing environment for the warehouse.
• Warehousing environments are separate from operational computing
environments (i.e., a different machine is used) to avoid potential resource
contention between operational and decisional processing.
• Enterprises are correctly wary of computing solutions that may compromise
the performance levels of mission-critical operational systems.
Parallel Hardware Technology
The two primary categories of parallel hardware used are:
--Symmetric multiprocessing (SMP) machines
--Massively parallel processing (MPP) machines
• SMPs have multiple CPUs that share a common memory and
input/output. Known also as a "Shared Everything" architecture, this
machine is limited by the scalability and performance limits of the
bus that connects its various components. Such architectures scale
up by adding more CPUs, upgrading existing ones, or by clustering
together several SMP machines.

• MPPs, in contrast, have multiple, independent CPUs connected to
each other by a high-speed network. Each CPU has its own copy of
the operating system and can essentially function as an independent
processor. MPP architectures scale up by adding nodes or CPUs.
Unfortunately, not all applications can take advantage of the parallel
architecture of MPPs; applications that have been designed to work
on only one processor will fail to take advantage of parallel
processing on multiple processors.
Hardware Selection Criteria
The following selection criteria are recommended:
• Scalability. The solution is able to scale up in terms of space and processing power.
• Financial stability. The product vendor is a strong and visible player in the hardware
segment, and its financial performance indicates growth or stability.
• Price/performance. The product performs well in a price/performance comparison
with other vendors of similar products.
• Delivery lead time. The product vendor can deliver the hardware or an equivalent service
unit within the required time frame.
• Reference sites. The hardware vendor has a reference site that is using a similar
unit for the same purpose.
• Availability of support. Support for the hardware and its operating system is
available, and support response times are within the acceptable downtime for the
warehouse.
Client/Server Computing Model & Data
Warehousing
C/S Architecture for DW

• Mainframe and minicomputer platforms were utilized in the early
implementations of data warehouses.
• Today, warehouses are built using client/server architecture.
• These are multi-tiered, second-generation client/server
architectures.
• The data warehouse DBMS executes on the data server component.
• The data repository of the data warehouse sits on this machine.
Purpose of Application Servers:
• To run middleware and establish connectivity
• To execute management and control software
• To handle data access from the Web
• To manage metadata
• For authentication
• As front end
• For managing and running standard reports
• For sophisticated query management
• For OLAP applications
• Generally, the client workstations still handle the presentation logic and
provide the presentation services
Considerations for Client Workstations:
 Determine a minimum configuration on an appropriate platform that
would support standard information delivery tools.
 Apply this for most users.
 Add a few more functions as necessary.
 For the power users, select another configuration that would support
tools for complex analysis. Generally, this configuration for power users
also supports OLAP.
 Checklist while considering workstations:
o Workstation operating system
o Processing power
o Memory
o Disk storage
o Network and data transport
o Tool support
Parallel Processors & Cluster Systems
Parallel Processing
 The DW is a user-centric and query-intensive environment where users will
constantly be executing complex queries.
 Each query needs large volumes of data to produce its result set.
 If the data warehouse is not tuned properly for handling large, complex,
simultaneous queries efficiently, the value of the data warehouse will be lost.
Performance is of primary importance.
 Parallel processing is used to speed up query processing, data loading, and index creation.
 The problem is split into smaller tasks that are executed concurrently.

 Advantages:
Increased speed and optimized resource utilization
 Disadvantages:
Complex programming models, difficult development
• A computer cluster is a group of linked computers working
together.
• The components of a cluster are connected through fast local
area networks.
• Clusters are deployed to improve performance and availability.
• In such environments, each processing unit (PU) executes a copy of a
standard operating system, and inter-PU communications are
performed over an open, standards-based interconnect
(e.g., Ethernet with TCP/IP).

A cluster consists of:
 Nodes (master + computing)
 Network
 OS
 Cluster middleware: middleware such as MPI,
which permits compute-clustering programs to be
portable to a wide variety of clusters

[Figure: a cluster of CPUs connected by a high-speed local network, coordinated by a cluster middleware layer]
Some hardware examples are:
• Digital: 64-bit AlphaServers with Digital Unix or OpenVMS; both SMP and MPP.
• HP: HP 9000 Enterprise Parallel Server.
• IBM: RS/6000 with the AIX OS has been positioned for data warehousing;
the AS/400 is used for data mart implementations.
• Microsoft: the Windows NT operating system is successful for data mart
deployments.
• Sequent: Sequent NUMA-Q with the DYNIX operating system.
Parallel processing software performs the following steps:
1. Analyzing a large task to identify independent units that can be
executed in parallel
2. Identifying smaller units that must be executed one after the
other
3. Executing the independent units in parallel and the dependent units
in the proper sequence
4. Collecting, collating, and consolidating the results returned by the
smaller units
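The four steps above can be sketched with Python's multiprocessing pool: the task is split into independent chunks, the chunks run in parallel, and the partial results are consolidated. The workload (a sum of squares) is illustrative:

```python
from multiprocessing import Pool

def unit_of_work(chunk):
    """An independent unit: aggregate one partition of the data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1000))
    # Steps 1-2: split the large task into independent units.
    chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
    # Step 3: execute the independent units in parallel.
    with Pool(processes=4) as pool:
        partials = pool.map(unit_of_work, chunks)
    # Step 4: collect, collate, and consolidate the partial results.
    total = sum(partials)
    print(total == sum(x * x for x in data))  # True
```

The dependent-unit case (step 2) would instead be run sequentially after its prerequisites, outside the pool.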

• The parallel server option allows each node to have its own separate database
instance, and enables all database instances to access a common set of
underlying database files.
• The parallel query option allows key operations such as query processing, data
loading, and index creation to be parallelized.
Advantages of Using Parallel Processing in Data
Warehouse
• Performance improvement for query processing, data loading, and
index creation
• Scalability, allowing addition of CPUs and memory modules without
any changes to the existing application
• Fault tolerance so that database would be available even when some
of the parallel processors fail
• Single logical view of the database even though the data may reside
on the disks of multiple nodes
