
Unit 1

Data warehousing is a process that involves storing and analyzing data from multiple sources
in a centralized location:
 Purpose
Data warehouses are used to support business intelligence (BI) activities, such as reporting,
analytics, and data mining.
 Data sources
Data warehouses can store data from a variety of sources, including point-of-sale systems,
business applications, and relational databases.
 Data types
Data warehouses can store structured data, such as database tables and Excel sheets, and
semi-structured data, such as XML files and webpages.
 Benefits
Data warehouses can help organizations make better decisions by providing reliable data,
improved data consistency, and easier access to enterprise data.
 Features
Data warehouses often include data governance and security capabilities, and they can
support ad hoc analysis and custom reporting.
 Process
The process of combining data from multiple sources into a data warehouse is called extract,
transform, and load (ETL).
Operational Database vs. Data Warehouse

 Operational systems are designed to support high-volume transaction processing; data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
 Operational systems are usually concerned with current data; data warehousing systems are usually concerned with historical data.
 Data within operational systems are mainly updated regularly according to need; a data warehouse is non-volatile: new data may be added regularly, but once added it is rarely changed.
 An operational system is designed for real-time business dealings and processes; a data warehouse is designed for analysis of business measures by subject area, categories, and attributes.
 An operational system is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table; a data warehouse is optimized for bulk loads and high-volume, complex, unpredictable queries that access many rows per table.
 An operational system is optimized for validation of incoming information during transactions and uses validation data tables; a data warehouse is loaded with consistent, valid information and requires no real-time validation.
 An operational system supports thousands of concurrent clients; a data warehouse supports a few concurrent clients relative to OLTP.
 Operational systems are widely process-oriented; data warehousing systems are widely subject-oriented.
 Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data; data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.
 An operational system is about data in; a data warehouse is about data out.
 An operational system accesses a small number of records per operation; a data warehouse accesses a large number of records per query.
 Relational databases are created for on-line transaction processing (OLTP); a data warehouse is designed for on-line analytical processing (OLAP).
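The contrast between the two access patterns can be sketched with Python's built-in sqlite3 module: OLTP work is many single-row writes, while OLAP work is one aggregate query that scans many rows. The table and column names here are illustrative.

```python
import sqlite3

# In-memory database standing in for both styles of system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many small writes, one row per transaction.
for region, amount in [("east", 100.0), ("west", 250.0), ("east", 75.0)]:
    conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", (region, amount))
conn.commit()

# OLAP-style work: one analytical query that scans many rows per table.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 175.0), ('west', 250.0)]
```

In a real deployment the two workloads would run against separate systems; combining them here is purely to show the difference in query shape.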

Need for Data Warehouse


Data Warehouse is needed for the following reasons:

1. Business users: Business users require a data warehouse to view summarized data from the past. Since these users are non-technical, the data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4. Data consistency and quality: By bringing data from different sources to a common place, the user can effectively achieve uniformity and consistency in the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse


1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of a data warehouse makes it easier for end-users to navigate, understand, and query the data.
4. Queries that would be complex in highly normalized databases are often easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method of managing demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.

A data warehouse is a centralized repository for storing and analyzing data from multiple
sources. It can be a crucial tool for businesses because it helps them:
 Make data-driven decisions
A data warehouse provides access to data from multiple sources, allowing businesses to make
better decisions faster.
 Analyze large amounts of data
Data warehouses can analyze large amounts of data from different sources and extract value
from it.
 Consolidate data
Data warehouses can pull data from multiple sources and bring it together in one location.
 Enable business reporting
Data warehouses can be used to create reports and dashboards.
 Implement machine learning and AI
Data warehouses can collect historical and real-time data to develop algorithms that can
provide predictive insights.
 Analyze trends over time
Data warehouses can retain historical data, allowing organizations to analyze trends over
time.
 Process complex queries
Data warehouses are designed to process complex analytical queries efficiently.

The metadata repository stores information that defines DW objects. It includes the
following parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimension,


hierarchies, data mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data,
i.e., active, archived or purged, and warehouse monitoring information, i.e., usage
statistics, error reports, audit, etc.
3. System performance data, which includes indices used to improve data access and
retrieval performance.
4. Information about the mapping from operational databases, which includes the
source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business data, which
include business terms and definitions, ownership information, etc.

Data Warehouse Architecture: Basic


Load Manager
The load manager is the system component that performs all the operations necessary to support the extract and load process. It is constructed using a combination of off-the-shelf tools, bespoke coding, C programs, and shell scripts. It performs the following operations:
1. Extract the data from the source systems.
2. Fast-load the extracted data into a temporary data store.
3. Perform simple transformations into a structure similar to the one in the data warehouse.
Each of these functions has to operate automatically and recover from any errors it encounters, to a very large extent with no human intervention. This process tends to run overnight, at the close of the business day.
Warehouse Manager
The warehouse manager is the system component that performs all the operations necessary to support the warehouse management process. It is typically constructed using a combination of third-party systems management software, bespoke coding, C programs, and shell scripts. The complexity of the warehouse manager is driven by the extent to which the operational management of the data warehouse has been automated.
Operations performed by the warehouse manager:
1. Analyze the data to perform consistency and referential integrity checks.
2. Transform and merge the source data in the temporary data store into the published data warehouse.
3. Create indexes, business views, partition views, and business synonyms against the base data.
4. Generate denormalizations if appropriate.
5. Generate any new aggregations that may be required.
6. Update all existing aggregations.
7. Back up, incrementally or totally, the data within the data warehouse.
8. Archive data that has reached the end of its capture life.
Query Manager
The query manager is the system component that performs all the operations necessary to support the query management process. It is constructed using a combination of user access tools, specialist data warehousing monitoring tools, native database facilities, bespoke coding, C programs, and shell scripts.
Detailed Information
This is the area of the data warehouse that stores all the detailed information in the starflake schema. In many cases, the detailed information is not held online the whole time: it is aggregated to the next level of detail, and the detailed information is then offloaded into tape archive. On a rolling basis, detailed information is loaded into the warehouse to supplement the aggregated data. To implement a rolling strategy, historical data has to be loaded at regular intervals.
Summary Information
This is the area of the data warehouse that stores all the predefined aggregations generated by the warehouse manager. It should be treated as transient: it will change on an ongoing basis in order to respond to changing query profiles. Summary information is essentially a replication of detailed information already held in the data warehouse.
Meta Data
This is the area within the data warehouse that stores all the metadata definitions used by all processes within the data warehouse. Metadata is data about data: it is like a card index describing how information is structured within the data warehouse. The structure of metadata differs between processes. For example, copy management tools use metadata to understand the mapping rules they need to apply in order to convert source data into a common form, while user access tools use metadata to understand how to build up a query.
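The load manager's three steps can be sketched as plain functions. All names here (extract, fast_load, simple_transform) are illustrative, not part of any real tool; a real load manager would read from source systems and write to a staging database.

```python
def extract(source_rows):
    """Step 1: pull raw records from the source system."""
    return list(source_rows)

def fast_load(rows, temp_store):
    """Step 2: bulk-append the extracted rows into a temporary data store."""
    temp_store.extend(rows)
    return temp_store

def simple_transform(temp_store):
    """Step 3: reshape rows into a structure close to the warehouse's."""
    return [{"product": name.upper(), "qty": qty} for name, qty in temp_store]

source = [("widget", 3), ("gadget", 5)]
staged = fast_load(extract(source), [])
warehouse_ready = simple_transform(staged)
print(warehouse_ready)
# [{'product': 'WIDGET', 'qty': 3}, {'product': 'GADGET', 'qty': 5}]
```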
Operational System
In data warehousing, an operational system refers to the system that processes the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The
summarized records are updated continuously as new information is loaded into the warehouse.
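The idea of lightly and highly summarized data can be illustrated with two levels of aggregation over detail rows. The data and field names below are invented for the sketch.

```python
from collections import defaultdict

# Detailed fact rows: (day, region, amount).
detail = [
    ("2024-01-01", "east", 100.0),
    ("2024-01-01", "west", 50.0),
    ("2024-01-02", "east", 25.0),
]

# Lightly summarized: totals per (day, region) — one aggregation level up.
lightly = defaultdict(float)
for day, region, amount in detail:
    lightly[(day, region)] += amount

# Highly summarized: totals per region only, built from the lighter summary.
highly = defaultdict(float)
for (day, region), amount in lightly.items():
    highly[region] += amount

print(dict(highly))  # {'east': 125.0, 'west': 50.0}
```

A query for regional totals can now be answered from the small `highly` table instead of rescanning every detail row, which is exactly why the warehouse manager precomputes these aggregations.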
End-User access Tools
The principal purpose of a data warehouse is to provide information to business managers
for strategic decision-making. These users interact with the warehouse through end-user
access tools.
The examples of some of the end-user access tools can be:

 Reporting and Query Tools


 Application Development Tools
 Executive Information Systems Tools
 Online Analytical Processing Tools
 Data Mining Tools

Data Warehouse Modeling


Data warehouse modeling is the process of designing the schemas of the detailed and
summarized information of the data warehouse. The goal of data warehouse modeling is to
develop a schema describing the reality, or at least a part of it, that the data
warehouse is required to support.
Data warehouse modeling is an essential stage of building a data warehouse for two main
reasons. Firstly, through the schema, data warehouse clients can visualize the relationships
among the warehouse data, to use them with greater ease. Secondly, a well-designed schema
allows an effective data warehouse structure to emerge, to help decrease the cost of
implementing the warehouse and improve the efficiency of using it.
Data modeling in data warehouses is different from data modeling in operational database
systems. The primary function of data warehouses is to support DSS processes. Thus, the
objective of data warehouse modeling is to make the data warehouse efficiently support
complex queries on long term information.

Types of Data Warehouse Models

Data Warehouse Models


From the perspective of data warehouse architecture, we have the following
data warehouse models:

 Virtual Warehouse

 Data mart

 Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual
warehouse. It is easy to build a virtual warehouse. Building a virtual
warehouse requires excess capacity on operational database servers.

Data Mart
Data mart contains a subset of organization-wide data. This subset of data is
valuable to specific groups of an organization.
In other words, we can claim that data marts contain data specific to a
particular group. For example, the marketing data mart may contain data
related to items, customers, and sales. Data marts are confined to subjects.
Points to remember about data marts:

 Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

 The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.

 The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.

 Data marts are small in size.

 Data marts are customized by department.

 The source of a data mart is a departmentally structured data warehouse.

 Data marts are flexible.

Enterprise Warehouse
 An enterprise warehouse collects all the information and subjects spanning an entire organization.

 It provides enterprise-wide data integration.

 The data is integrated from operational systems and external information providers.

 This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.


ETL (Extract, Transform, and Load) Process
What is ETL?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.
The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to change as the
business changes. ETL is a recurring process (daily, weekly, monthly) in a data warehouse
system and needs to be agile, automated, and well documented.

How ETL Works


ETL consists of three separate phases:
Extraction

 Extraction is the operation of extracting information from a source system for further
use in a data warehouse environment. This is the first stage of the ETL process.
 The extraction process is often one of the most time-consuming tasks in ETL.
 The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
 The data has to be extracted several times in a periodic manner to supply all changed
data to the warehouse and keep it up-to-date.
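Periodic extraction of only the changed data is often implemented with a watermark, such as a last-modified timestamp, remembered from the previous run. A minimal sketch, assuming a hypothetical `updated_at` column on the source table:

```python
# Source rows with a last-modified timestamp (illustrative data).
source_table = [
    {"id": 1, "name": "alice", "updated_at": "2024-01-01"},
    {"id": 2, "name": "bob",   "updated_at": "2024-01-05"},
    {"id": 3, "name": "carol", "updated_at": "2024-01-09"},
]

def extract_changed(rows, last_extracted):
    """Return only the rows modified after the previous extraction watermark."""
    # ISO-8601 date strings compare correctly as plain strings.
    return [r for r in rows if r["updated_at"] > last_extracted]

changed = extract_changed(source_table, last_extracted="2024-01-03")
print([r["id"] for r in changed])  # [2, 3]
```

After each run the watermark is advanced, so the next extraction only carries the new changes to the warehouse.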

Cleansing
The cleansing stage is crucial in a data warehouse because it is supposed to
improve data quality. The primary data cleansing features found in ETL tools are rectification
and homogenization. They use specific dictionaries to rectify typing mistakes and to
recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and
define appropriate associations between values.
The following examples show why data cleansing is essential:
If an enterprise wishes to contact its users or its suppliers, a complete, accurate, and up-to-date
list of contact addresses, email addresses, and telephone numbers must be available.
If a client or supplier calls, the staff responding should be able to quickly find the person in
the enterprise database, but this requires that the caller's name or his/her company name is listed
in the database.
If a user appears in the database with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
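The dictionary-based rectification and homogenization described above can be sketched as follows. The lookup dictionaries and city values are invented examples, not from any real ETL tool.

```python
# Dictionary used to rectify typing mistakes (illustrative entries).
typo_fixes = {"nwe york": "new york", "los angelos": "los angeles"}
# Dictionary used to recognize synonyms and map them to a canonical form.
synonyms = {"ny": "new york", "nyc": "new york", "la": "los angeles"}

def cleanse_city(raw):
    value = raw.strip().lower()
    value = typo_fixes.get(value, value)   # rectification
    value = synonyms.get(value, value)     # homogenization
    return value

records = ["NYC", "nwe york", "Los Angelos", "la"]
print([cleanse_city(r) for r in records])
# ['new york', 'new york', 'los angeles', 'los angeles']
```

With all four variants mapped to one canonical value, the duplicate-customer problem in the last example above becomes detectable.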
Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational
source format into a particular data warehouse format. If we implement a three-layer
architecture, this phase outputs our reconciled data layer.
The following points must be rectified in this phase:

 Loose text may hide valuable information. For example, "XYZ PVT Ltd" does not
explicitly show that this is a private limited company.
 Different formats can be used for individual data. For example, a date can be saved as a
string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data
layer:

 Conversion and normalization that operate on both storage formats and units of
measure to make data uniform.
 Matching that associates equivalent fields in different sources.
 Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
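A minimal sketch of the three reconciliation transformations named above (conversion/normalization, matching, and selection), with invented source layouts:

```python
# Source A stores prices in cents as strings; source B stores dollars as floats.
source_a = [{"sku": "X1", "price_cents": "1999", "warehouse_bin": "A7"}]
source_b = [{"item_code": "X2", "usd": 5.5, "legacy_flag": "y"}]

def normalize_a(r):
    # Conversion/normalization: unify storage format and units
    # (string cents -> float dollars).
    return {"sku": r["sku"], "price_usd": int(r["price_cents"]) / 100}

def normalize_b(r):
    # Matching: `item_code` in source B is the equivalent of `sku` in source A.
    return {"sku": r["item_code"], "price_usd": r["usd"]}

# Selection: only the two reconciled fields are kept; extra source fields
# (warehouse_bin, legacy_flag) are dropped.
reconciled = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]
print(reconciled)
# [{'sku': 'X1', 'price_usd': 19.99}, {'sku': 'X2', 'price_usd': 5.5}]
```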

Loading
The load is the process of writing the data into the target database. During the load step, it is
necessary to ensure that the load is performed correctly and with as few resources as
possible.
Loading can be carried out in two ways:

1. Refresh: The data warehouse data is completely rewritten, meaning the older data is
replaced. Refresh is usually used in combination with static extraction to populate a
data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying
preexisting data. This method is used in combination with incremental extraction to
update data warehouses regularly.
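The two loading modes can be sketched with the warehouse modeled as a dict keyed by primary key; the function and variable names are illustrative.

```python
def refresh(warehouse, full_snapshot):
    """Refresh: completely rewrite the warehouse from a full extract."""
    warehouse.clear()
    warehouse.update(full_snapshot)

def update(warehouse, changes):
    """Update: apply only the changed rows, preserving preexisting data."""
    warehouse.update(changes)

wh = {}
refresh(wh, {1: "alice", 2: "bob"})     # initial population (static extraction)
update(wh, {2: "bobby", 3: "carol"})    # regular load (incremental extraction)
print(wh)  # {1: 'alice', 2: 'bobby', 3: 'carol'}
```

Note that the update pass changed row 2 and added row 3 without deleting row 1, matching the description above.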

Difference between ETL and ELT


ETL (Extract, Transform, and Load)
Extract, Transform and Load is the technique of extracting records from sources (which may be
external or on-premises) to a staging area, transforming or reformatting them with
business manipulation applied in order to fit operational needs or data analysis,
and then loading them into the destination database or data warehouse.
Strengths
Development Time: Designing from the output backwards ensures that only information
applicable to the solution is extracted and processed, potentially decreasing development,
extraction, and processing overhead.
Targeted data: Due to the targeted nature of the load process, the warehouse contains only
information relevant to the presentation. Reduced warehouse content simplifies the security
regime that must be enforced, and hence the administration overhead.
Tools Availability: The number of tools available that implement ETL provides
flexibility of approach and the opportunity to identify the most appropriate tool. However, the
proliferation of tools has led to a competitive functionality war, which often results in a loss
of maintainability.
Weaknesses
Flexibility: Targeting only relevant information for output means that any future
requirements needing data that was not included in the original design will have to be
added to the ETL routines. Due to the tight dependency between the methods
developed, this often leads to a need for fundamental redesign and development, which
increases the time and cost involved.
Hardware: Most third-party tools use their own engine to implement the ETL phase.
Regardless of the size of the solution, this can necessitate investment in additional
hardware to run the tool's ETL engine. The use of third-party tools to achieve the ETL
process also compels the learning of new scripting languages and processes.
Learning Curve: Implementing a third-party tool that uses foreign processes and languages
results in the learning curve that is implicit in all technologies new to an organization, and can
often lead to consecutive blind alleys in their use due to a shortage of experience.
ELT (Extract, Load and Transform)
ELT, which stands for Extract, Load and Transform, is an alternative approach to data
migration or movement. ELT involves extracting information from the
source system and loading it into the target system instead of transforming it between the
extraction and loading phases. Once the data is copied or loaded into the target system, the
transformation takes place.
The extract and load steps can be isolated from the transformation process. Isolating the load
phase from the transformation process removes an inherent dependency between these phases.
In addition to containing the data necessary for the transformations, the extract and load
process can include components of data that may be essential in the future. The load phase
could take the entire source and load it into the warehouse.
Separating the phases enables the project to be broken down into smaller chunks, thus
making it more specific and manageable.
Performing the data integrity analysis in the staging area enables a further phase in the
process to be isolated and dealt with at the most appropriate point. This also helps to
ensure that only cleansed and checked information is loaded into the warehouse
for transformation.
Isolating the transformations from the load steps helps to encourage a more staged approach to
warehouse design and implementation.
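A minimal ELT sketch using SQLite as the target engine: raw rows are loaded unchanged, and the transformation is then performed inside the target with the engine's own SQL, as the text describes. Table and column names are illustrative.

```python
import sqlite3

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE raw_orders (region TEXT, amount REAL)")

# Extract + Load: raw source rows are copied as-is into the target system.
target.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 30.0)],
)

# Transform: performed afterwards, inside the target, using the engine's SQL.
target.execute(
    "CREATE TABLE orders_by_region AS "
    "SELECT region, SUM(amount) AS total FROM raw_orders GROUP BY region"
)
rows = target.execute(
    "SELECT region, total FROM orders_by_region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 30.0)]
```

Because `raw_orders` keeps the untransformed rows, a future requirement can be met by adding another SQL transformation over the same raw data, without re-extracting from the source.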
Strengths
Project Management: Being able to divide the warehouse process into specific, isolated
functions enables a project to be designed on a smaller functional basis; therefore the project
can be broken down into feasible chunks.
Flexible & Future Proof: In general, in an ELT implementation, all records from the sources
are loaded into the data warehouse as part of the extract and load process. This, combined
with the isolation of the transformation phase, means that future requirements can easily be
incorporated into the data warehouse architecture.
Risk minimization: Removing the close interdependencies between each stage of the
warehouse build process enables each development step to be isolated, and the individual
process designs can thus also be separated. This provides a good platform for change,
maintenance, and management.
Utilize Existing Hardware: In implementing ELT as a warehouse build process, the
essential tools provided with the database engine can be used.
Utilize Existing Skill sets: By using the functionality supported by the database engine, the
existing investment in database skills is re-used to develop the warehouse. No new skills
need to be learned, and the full weight of experience in the engine's
technology is utilized, further reducing the cost and risk in the development process.
Weaknesses
Against the Norm: ELT is a newer approach to data warehouse design and development. While
it has proven itself many times over through abundant use in implementations throughout
the world, it requires a change in mentality and design approach compared with traditional
methods.
Tools Availability: Being an emergent technology approach, ELT suffers from the limited
availability of tools.
What is Metadata?

Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book. In other words, we can say that metadata is the summarized data that leads us to
detailed data. In terms of data warehouse, we can define metadata as follows.

 Metadata is the road-map to a data warehouse.


 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.

Categories of Metadata

Metadata can be broadly categorized into three categories −

 Business Metadata − It has the data ownership information, business definition, and
changing policies.
 Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
 Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
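The three categories above can be illustrated as one metadata record describing a hypothetical `sales` table in the warehouse; every field value here is invented for the example.

```python
metadata = {
    "business": {
        "owner": "Sales Operations",
        "definition": "One row per completed customer order.",
    },
    "technical": {
        "table": "sales",
        "columns": {"order_id": "INTEGER", "amount": "REAL"},
        "primary_key": "order_id",
    },
    "operational": {
        "currency": "active",  # active / archived / purged
        "lineage": ["extracted from orders_db", "amounts converted to USD"],
    },
}
print(sorted(metadata))  # ['business', 'operational', 'technical']
```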

Role of Metadata

Metadata has a very important role in a data warehouse. The role of metadata in a warehouse
is different from the warehouse data, yet it plays an important role. The various roles of
metadata are explained below.

 Metadata acts as a directory.


 This directory helps the decision support system to locate the contents of the data
warehouse.
 Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
 Metadata helps in summarization between current detailed data and highly
summarized data.
 Metadata also helps in summarization between lightly detailed data and highly
summarized data.
 Metadata is used for query tools.
 Metadata is used in extraction and cleansing tools.
 Metadata is used in reporting tools.
 Metadata is used in transformation tools.
 Metadata plays an important role in loading functions.

The following diagram shows the roles of metadata.


Metadata Repository

Metadata repository is an integral part of a data warehouse system. It has the following
metadata −

 Definition of data warehouse − It includes the description of structure of data


warehouse. The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
 Business metadata − It contains the data ownership information, business
definition, and changing policies.
 Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
 Data for mapping from operational environment to data warehouse − It includes
the source databases and their contents, data extraction, data partitioning, cleaning and
transformation rules, and data refresh and purging rules.
 Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.
Data warehouses provide many benefits to businesses, including:
 Better data quality: Data warehouses cleanse data and create a consistent format for
analytics.
 Improved business analytics: Data warehouses provide access to data from multiple
sources, which helps decision-makers avoid incomplete information.
 Faster queries: Data warehouses are designed for fast data retrieval and analysis.
 Historical insight: Data warehouses store historical data, which can help businesses learn
from past trends and make predictions.
 Enhanced customer insights: Data warehouses can integrate customer data from multiple
sources, which can help businesses create more personalized marketing strategies.
 Improved decision-making: Data warehouses provide information for data-driven decisions,
such as new product development and inventory levels.
 Increased ROI: Data warehouses can help businesses increase their overall return on
investment.
