NOTES
Data warehousing is a process that involves storing and analyzing data from multiple sources
in a centralized location:
Purpose
Data warehouses are used to support business intelligence (BI) activities, such as reporting,
analytics, and data mining.
Data sources
Data warehouses can store data from a variety of sources, including point-of-sale systems,
business applications, and relational databases.
Data types
Data warehouses can store structured data, such as database tables and Excel sheets, and
semi-structured data, such as XML files and webpages.
Benefits
Data warehouses can help organizations make better decisions by providing reliable data,
improved data consistency, and easier access to enterprise data.
Features
Data warehouses often include data governance and security capabilities, and they can
support ad hoc analysis and custom reporting.
Process
The process of combining data from multiple sources into a data warehouse is called extract,
transform, and load (ETL).
Operational Database vs. Data Warehouse
Operational database: relational databases are created for online transaction processing (OLTP).
Data warehouse: a data warehouse is designed for online analytical processing (OLAP).
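To make the OLTP/OLAP contrast concrete, here is a minimal sketch using Python's built-in
sqlite3 module. The `sales` table and its rows are hypothetical, invented for illustration. An
OLTP-style statement touches a single record, while an OLAP-style query summarizes many.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [
        ("North", 2022, 100.0),
        ("North", 2023, 150.0),
        ("South", 2022, 80.0),
        ("South", 2023, 120.0),
    ],
)

# OLTP style: a short transaction that changes one record.
conn.execute("UPDATE sales SET amount = 155.0 WHERE id = 2")
conn.commit()

# OLAP style: a read-only query that aggregates over many records.
totals_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals_by_region)
```

In practice the two workloads usually run on separate systems, so that heavy analytical queries
do not slow down day-to-day transactions.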
1. Business users: Business users require a data warehouse to view summarized data from the past.
Since these users are non-technical, the data may be presented to them in an elementary form.
2. Storing historical data: A data warehouse is required to store time-variant data from the past.
This data is used for various purposes.
3. Making strategic decisions: Some strategies depend on the data in the data warehouse, so the
data warehouse contributes to strategic decision-making.
4. Data consistency and quality: By bringing data from different sources into a common place,
the user can effectively bring uniformity and consistency to the data.
5. Fast response time: A data warehouse has to be ready for somewhat unexpected loads and types
of queries, which demands a significant degree of flexibility and quick response time.
A data warehouse is a centralized repository for storing and analyzing data from multiple
sources. It can be a crucial tool for businesses because it helps them:
Make data-driven decisions
A data warehouse provides access to data from multiple sources, allowing businesses to make
better decisions faster.
Analyze large amounts of data
Data warehouses can analyze large amounts of data from different sources and extract value
from it.
Consolidate data
Data warehouses can pull data from multiple sources and bring it together in one location.
Enable business reporting
Data warehouses can be used to create reports and dashboards.
Implement machine learning and AI
Data warehouses can collect historical and real-time data to develop algorithms that can
provide predictive insights.
Analyze trends over time
Data warehouses can retain historical data, allowing organizations to analyze trends over
time.
Process complex questions
Data warehouses are designed to process complex questions that may be distributed to
several AI tools.
The metadata repository stores information that defines DW objects. It includes the
following parameters and information for the middle and the top-tier applications:
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual
warehouse. It is easy to build a virtual warehouse. Building a virtual
warehouse requires excess capacity on operational database servers.
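As a rough illustration, a virtual warehouse can be pictured as a SQL view defined directly on
an operational table. The sketch below uses Python's built-in sqlite3 module; the `orders` table
and the `customer_totals` view are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational table, as maintained by the transactional system.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Asha", 40.0), ("Ben", 25.0), ("Asha", 10.0)],
)

# The "virtual warehouse": a view that stores no data of its own.
conn.execute(
    """
    CREATE VIEW customer_totals AS
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)
```

Because the view is evaluated against the operational table each time it is queried, the
analytical work runs on the operational server, which is why excess capacity is needed there.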
Data Mart
A data mart contains a subset of organization-wide data. This subset of data is
valuable to specific groups of an organization.
In other words, data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers,
and sales. Data marts are confined to subjects.
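A minimal sketch of the idea, assuming the warehouse rows carry a hypothetical `subject` field:
the marketing data mart is simply the subset of rows whose subject matters to marketing.

```python
# Hypothetical organization-wide rows; field names are invented for illustration.
warehouse_rows = [
    {"subject": "sales", "item": "pen", "customer": "Asha", "amount": 4.0},
    {"subject": "payroll", "employee": "E17", "salary": 3000.0},
    {"subject": "sales", "item": "ink", "customer": "Ben", "amount": 9.5},
]

# Subjects that the marketing group cares about.
MARKETING_SUBJECTS = {"sales"}

# The data mart keeps only the rows for those subjects.
marketing_mart = [row for row in warehouse_rows if row["subject"] in MARKETING_SUBJECTS]
print(len(marketing_mart))
```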
Points to remember about data marts:
The life cycle of a data mart may be complex in long run, if its planning
warehouse.
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects
information providers.
Extraction is the operation of extracting information from a source system for further
use in a data warehouse environment. This is the first stage of the ETL process.
The extraction process is often one of the most time-consuming tasks in ETL.
The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
The data has to be extracted several times in a periodic manner to supply all changed
data to the warehouse and keep it up-to-date.
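Periodic extraction of changed data can be sketched as follows. This assumes, purely for
illustration, that every source row carries an `updated_at` timestamp; real source systems may
expose changes differently (for example through logs or triggers).

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp.
source_rows = [
    {"id": 1, "name": "Asha", "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "name": "Ben", "updated_at": datetime(2024, 2, 10)},
    {"id": 3, "name": "Chen", "updated_at": datetime(2024, 3, 1)},
]


def extract_changed(rows, last_run):
    """Return only the rows modified after the previous extraction run."""
    return [row for row in rows if row["updated_at"] > last_run]


last_run = datetime(2024, 2, 1)
changed = extract_changed(source_rows, last_run)
print([row["id"] for row in changed])
```

After each run, the time of that run would be stored and passed as `last_run` to the next one.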
Cleansing
The cleansing stage is crucial in a data warehouse technique because it is supposed to
improve data quality. The primary data cleansing features found in ETL tools are rectification
and homogenization. They use specific dictionaries to rectify typing mistakes and to
recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and
define appropriate associations between values.
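The sketch below illustrates dictionary-based rectification, synonym recognition, and one
rule-based check. All dictionaries, field names, and the city-to-country rule are hypothetical
examples, not taken from any particular ETL tool.

```python
# Dictionary for rectifying typing mistakes.
TYPO_FIXES = {"Londn": "London", "Pariss": "Paris"}

# Dictionary for recognizing synonyms and mapping them to one canonical value.
SYNONYMS = {"UK": "United Kingdom", "U.K.": "United Kingdom"}

# Domain-specific rule: an association between city and country values.
CITY_COUNTRY = {"London": "United Kingdom", "Paris": "France"}


def cleanse(record):
    """Return a copy of the record with typos fixed and synonyms homogenized."""
    cleaned = dict(record)
    cleaned["city"] = TYPO_FIXES.get(cleaned["city"], cleaned["city"])
    cleaned["country"] = SYNONYMS.get(cleaned["country"], cleaned["country"])
    return cleaned


def rule_violations(record):
    """Return a list of messages for broken city/country associations."""
    expected = CITY_COUNTRY.get(record["city"])
    if expected is not None and record["country"] != expected:
        return [f"{record['city']} should be in {expected}"]
    return []


print(cleanse({"city": "Londn", "country": "UK"}))
print(rule_violations({"city": "Paris", "country": "United Kingdom"}))
```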
The following examples show why data cleansing is essential:
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date
list of contact addresses, email addresses and telephone numbers must be available.
If a client or supplier calls, the staff responding should be able to find the person quickly in
the enterprise database, but this requires that the caller's name or company name be listed
in the database.
If a user appears in the databases with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
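One simple way to flag such near-duplicate names is approximate string matching. The sketch
below uses `difflib.SequenceMatcher` from the Python standard library; the customer names and
the 0.9 threshold are assumptions chosen for illustration, and real cleansing tools may use
other matching techniques.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical customer names from the database.
customers = ["John Smith", "Jon Smith", "Maria Lopez"]
THRESHOLD = 0.9  # assumed cut-off; would need tuning on real data

# Compare every pair of names and keep those that look alike.
possible_duplicates = [
    (a, b) for a, b in combinations(customers, 2) if similarity(a, b) >= THRESHOLD
]
print(possible_duplicates)
```

Flagged pairs would normally be reviewed or merged before the data is loaded.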
Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational
source format into a particular data warehouse format. If we implement a three-layer
architecture, this phase outputs our reconciled data layer.
The following points must be rectified in this phase:
Loose text may hide valuable information. For example, XYZ PVT Ltd does not
explicitly show that this is a private limited company.
Different formats can be used for individual data. For example, a date can be saved as a
string or as three integers.
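As an illustration of the format problem, assume the value in question is a calendar date that
one source stores as a string and another as three integers. Both can be converted to a single
representation, here Python's `datetime.date`. The day/month/year string layout is an
assumption made for this example.

```python
from datetime import date, datetime


def from_string(text):
    """Parse a date stored as a 'DD/MM/YYYY' string (assumed layout)."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def from_parts(day, month, year):
    """Build a date from three separate integers."""
    return date(year, month, day)


# Both source formats end up as the same warehouse value.
print(from_string("05/03/2024") == from_parts(5, 3, 2024))
```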
The following are the main transformation processes aimed at populating the reconciled data
layer:
Conversion and normalization that operate on both storage formats and units of
measure to make data uniform.
Matching that associates equivalent fields in different sources.
Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
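The three transformation processes can be sketched together in a few lines. The source names
(`crm`, `shop`), field names, and the choice of kilograms as the target unit are all
hypothetical. This sketch only selects fields; record selection would be a similar filter.

```python
# Matching: map each source's field names to the warehouse's field names.
FIELD_MAP = {
    "crm": {"cust_name": "customer", "wt_lb": "weight_lb"},
    "shop": {"client": "customer", "weight_kg": "weight_kg"},
}

# Selection: the only fields kept in the reconciled layer.
KEEP = ("customer", "weight_kg")

LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms


def transform(source, record):
    """Rename fields, normalize units to kilograms, and drop unused fields."""
    renamed = {FIELD_MAP[source].get(key, key): value for key, value in record.items()}
    # Conversion and normalization: express every weight in kilograms.
    if "weight_lb" in renamed:
        renamed["weight_kg"] = round(renamed.pop("weight_lb") * LB_TO_KG, 2)
    return {field: renamed[field] for field in KEEP}


print(transform("crm", {"cust_name": "Asha", "wt_lb": 10, "fax": "n/a"}))
print(transform("shop", {"client": "Ben", "weight_kg": 3.0, "note": "x"}))
```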
Loading
Loading is the process of writing the data into the target database. During the load step, it is
necessary to ensure that the load is performed correctly and with as few resources as
possible.
Loading can be carried out in two ways:
1. Refresh: Data warehouse data is completely rewritten. This means that the older data is
replaced. Refresh is usually used in combination with static extraction to populate a
data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying
preexisting data. This method is used in combination with incremental extraction to
update data warehouses regularly.
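The difference between the two loading modes can be sketched with a plain Python list standing
in for the warehouse table. The row layout, including the `version` field, is a hypothetical
choice for this example.

```python
def refresh(warehouse, source_rows):
    """Refresh: replace the entire contents of the warehouse."""
    warehouse[:] = [dict(row) for row in source_rows]


def update(warehouse, changed_rows):
    """Update: append the changes without deleting or modifying existing rows."""
    warehouse.extend(dict(row) for row in changed_rows)


warehouse = []

# Initial population, paired with a static (full) extraction.
refresh(warehouse, [{"id": 1, "city": "Pune", "version": 1}])

# Regular load, paired with an incremental extraction.
update(
    warehouse,
    [{"id": 1, "city": "Delhi", "version": 2}, {"id": 2, "city": "Goa", "version": 1}],
)

print(len(warehouse))
```

After the update, the original row for `id` 1 is still present next to its newer version, which
is what preserves history in the warehouse.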
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents of
the book. In other words, we can say that metadata is the summarized data that leads us to
detailed data. In terms of a data warehouse, we can define metadata as follows.
Categories of Metadata
Business Metadata − It has the data ownership information, business definition, and
changing policies.
Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
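One possible way to picture the three categories is a single record per warehouse table. The
`TableMetadata` class and the example values below are hypothetical, meant only to show which
kind of information falls under which category.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    # Business metadata
    owner: str
    business_definition: str
    # Technical metadata
    table_name: str
    columns: dict  # column name -> data type
    primary_key: str
    # Operational metadata
    currency: str  # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)  # history of migrations and transformations


sales_meta = TableMetadata(
    owner="Sales department",
    business_definition="One row per completed sale",
    table_name="fact_sales",
    columns={"sale_id": "INTEGER", "amount": "REAL"},
    primary_key="sale_id",
    currency="active",
    lineage=["extracted from the orders system", "amounts converted to one currency unit"],
)
print(sales_meta.table_name, sales_meta.currency)
```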
Role of Metadata
Metadata plays an important role in a data warehouse, and that role is different from the role
of the warehouse data itself. The various roles of metadata are explained below.
Metadata repository is an integral part of a data warehouse system. It has the following
metadata −