DW Module-1
The primary goal of a data warehouse is to provide a single, consistent view of an organization's
data that can be easily queried and analyzed to support decision-making processes. This is
achieved through the use of a process called Extract, Transform, and Load (ETL), which
involves extracting data from various sources, transforming it into a consistent format, and
loading it into the data warehouse.
*(Figure in notes)
Processes involved in DW
(A concise form of this list is available in the college notes.)
1. Data Extraction: This process involves gathering data from various sources, such as
operational databases, external systems, files, APIs, or other data repositories. The data
extraction can be performed through techniques like batch processing, real-time streaming, or
change data capture (CDC).
2. Data Transformation: Once the data is extracted, it goes through a transformation process.
This includes cleaning, filtering, validating, and converting the data into a consistent format
suitable for analysis. Data transformation may also involve data enrichment, aggregation,
normalization, and data quality checks.
3. Data Loading: After transformation, the prepared data is loaded into the data warehouse. The
loading process can be done using different strategies, such as full load or incremental load.
Full load involves loading all the transformed data into the data warehouse, while incremental
load only loads the changes or updates since the last load.
4. Data Modeling: Data modeling involves designing the structure and relationships of the data
within the data warehouse. It includes creating entities, attributes, relationships, hierarchies,
dimensions, and measures. Data modeling provides a logical representation of the data, which
aids in efficient querying and analysis.
5. Data Storage: The data is stored in the data warehouse in a structured manner, typically in a
relational database or a specialized data warehousing platform. The storage structure may
include tables, indexes, partitions, and other optimization techniques to enhance performance.
6. Data Integration: Data integration involves combining data from multiple sources and systems
into a unified view within the data warehouse. It includes reconciling data inconsistencies,
resolving conflicts, and ensuring data quality and consistency across different sources. Data
integration enables a comprehensive and unified view of the organization's data.
7. Data Querying and Analysis: Users can perform queries and analysis on the data stored in
the data warehouse. This involves using business intelligence (BI) tools, SQL queries, or data
visualization tools to extract insights, generate reports, and analyze trends within the data.
8. Data Governance: Data governance refers to the overall management and control of data
assets within the data warehouse. It involves defining data standards, policies, and procedures
to ensure data quality, security, privacy, and compliance. Data governance also includes
establishing data ownership, access controls, and data lifecycle management.
9. Data Maintenance and Monitoring: Ongoing maintenance and monitoring activities are
performed to ensure the data warehouse's reliability and performance. This includes data
updates, error handling, backup and recovery, performance monitoring, and capacity planning to
ensure optimal functioning of the data warehouse.
These basic processes collectively enable organizations to store, manage, integrate, and
analyze data in a data warehouse to support decision-making, reporting, and business
intelligence activities.
*(figure in notes)
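The extraction, transformation, and loading steps described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming an in-memory SQLite database as the "warehouse" and a hard-coded list as the source system; the names `SOURCE_ROWS`, `extract`, `transform`, `load`, and the `sales` table are hypothetical and not part of the notes.

```python
import sqlite3

# Hypothetical source rows, standing in for an operational database or file.
SOURCE_ROWS = [
    {"id": 1, "customer": " alice ", "amount": "120.50", "updated_at": "2024-01-01"},
    {"id": 2, "customer": "BOB", "amount": "80", "updated_at": "2024-01-02"},
    {"id": 3, "customer": "carol", "amount": "not-a-number", "updated_at": "2024-01-03"},
]


def extract(rows, since=None):
    """Return all rows (full load) or only rows changed after `since` (incremental)."""
    if since is None:
        return list(rows)
    return [r for r in rows if r["updated_at"] > since]


def transform(rows):
    """Clean and validate: trim and title-case names, convert amounts, drop bad rows."""
    cleaned = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # data quality check: skip rows with an invalid amount
        cleaned.append((r["id"], r["customer"].strip().title(), amount, r["updated_at"]))
    return cleaned


def load(conn, rows):
    """Insert or update the prepared rows in the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL, updated_at TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")

# Full load: everything that passes validation.
load(conn, transform(extract(SOURCE_ROWS)))

# Incremental load: only rows changed since the last load date.
SOURCE_ROWS.append({"id": 4, "customer": "dave", "amount": "42", "updated_at": "2024-01-05"})
load(conn, transform(extract(SOURCE_ROWS, since="2024-01-03")))

loaded = conn.execute("SELECT id, customer, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

The row with the invalid amount is dropped during transformation, and the second run picks up only the row added after the last load date. A real pipeline would usually track that date in a control table rather than pass it by hand.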
Unlike operational databases, which are optimized for transactional processing, data
warehouses are optimized for analytical processing. This means that data is organized and
structured for querying, analysis, and reporting across many rows at once, rather than for
fast updates and inserts of individual records.
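The difference between the two workloads can be illustrated with a small sketch. It assumes an in-memory SQLite database and a hypothetical `orders` table; a real warehouse would run on a dedicated platform, so this only shows the shape of each query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "North", 100.0), (2, "South", 250.0), (3, "North", 50.0)],
)

# Transactional style: touch one row identified by its key.
conn.execute("UPDATE orders SET amount = 120.0 WHERE order_id = 1")

# Analytical style: scan many rows and summarize them.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The first statement is typical of an operational database (one record, changed in place), while the second is typical of a warehouse query (many records, read and aggregated).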
Data warehouses typically consolidate data from multiple sources, including operational
databases, external sources, and other data warehouses. The data is then transformed and
loaded into the data warehouse using a process called Extract, Transform, and Load (ETL).
One of the key features of a data warehouse is that it provides a single, consistent view of an
organization's data. This means that data is standardized, and inconsistencies and
redundancies are removed. The data is also structured in a way that enables analysts and
decision-makers to easily query and analyze it to gain insights into the organization's operations
and performance.
Data warehouses typically use a specific type of data modeling called dimensional modeling.
This involves organizing data into dimensions (e.g., time, geography, product) and measures
(e.g., sales, revenue) to support analysis.
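A minimal sketch of dimensional modeling, assuming an in-memory SQLite database: two dimension tables (time and product) surround one fact table that holds the revenue measure. The table and column names (`dim_time`, `dim_product`, `fact_sales`) and the sample figures are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue REAL
);
INSERT INTO dim_time VALUES (1, 2023, 'Q4'), (2, 2024, 'Q1');
INSERT INTO dim_product VALUES (10, 'Laptop'), (20, 'Phone');
INSERT INTO fact_sales VALUES
    (1, 10, 1000.0), (1, 20, 500.0), (2, 10, 1200.0), (2, 20, 400.0);
""")

# Analyze the measure (revenue) along one dimension (time).
revenue_by_year = conn.execute("""
    SELECT t.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY t.year
    ORDER BY t.year
""").fetchall()
print(revenue_by_year)
```

Joining the fact table to `dim_product` instead would summarize the same measure by product, which is the main convenience of keeping dimensions and measures separate.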
The evolution of data storage that led to data warehousing can be viewed as a hierarchy. Here
is a brief explanation of each step in this hierarchy:
1. Punch cards: In the early days of computing, data was stored on punch cards. These cards
contained holes that represented binary data and could be read by specialized machines.
Punch cards were slow and cumbersome, but they were a major innovation at the time.
2. Magnetic tape: In the 1950s and 1960s, magnetic tape became the standard for data storage.
This technology enabled faster data access and larger storage capacity than punch cards.
3. Disk storage: In the 1970s, disk storage became the dominant technology for data storage.
Hard drives provided faster access to data than magnetic tape and enabled random access to
data.
4. DBMS: In the 1980s and 1990s, relational database management systems (RDBMS) emerged
as a way to manage structured data. These systems enabled businesses to store, query, and
manage data more efficiently than flat files or spreadsheets.
5. Online applications: In the 1990s and 2000s, businesses began to develop online
applications that enabled users to access data and applications over the internet. These
applications provided real-time access to data and enabled collaboration among remote teams.
6. Data warehouse: In the 1990s, data warehousing emerged as a way to consolidate data from
multiple sources and provide a centralized repository for analysis and reporting. Data
warehouses were optimized for complex queries and provided a historical perspective on
business data.
Overall, this hierarchy shows how data storage and management technologies have evolved
over time, with each new innovation building upon the previous one. The emergence of data
warehousing marked a major shift in how businesses approached data management and
analysis, enabling them to make better use of their data and drive more informed
decision-making.
DW Module-2
Features of DW
Bill Inmon, often referred to as the "father of data warehousing," has outlined several features or
principles of a data warehouse. Here are some key features according to Bill Inmon:
1. Subject-Oriented: A data warehouse is designed to focus on specific subject areas that are
relevant to the organization's business operations. It organizes data around subjects such as
customers, products, sales, finance, or any other domain-specific areas of interest.
2. Integrated: A data warehouse integrates data from multiple, heterogeneous sources into a
consistent format. Naming conventions, units of measure, and encodings are standardized so
that the same item is represented the same way regardless of the source it came from.
3. Non-Volatile: Data in a data warehouse is non-volatile, meaning it is stable and not frequently
changed. Once data is loaded into the data warehouse, it is rarely updated or deleted. Instead,
historical data is preserved and new data is appended, allowing for analysis of trends and
patterns over time.
4. Time-Variant: A data warehouse stores and manages data across different points in time. It
captures historical snapshots of data, allowing users to analyze trends and perform time-based
comparisons. Time-variant data enables analysis of changes and patterns over specific periods.
These features outlined by Bill Inmon provide a framework for designing and implementing a
data warehouse that meets the requirements of data integration, analysis, and decision support
within an organization.
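The non-volatile and time-variant features can be illustrated with an append-only table of dated snapshots. This is a minimal sketch, assuming an in-memory SQLite database; the `customer_snapshot` table and the helper functions `append_snapshot` and `city_as_of` are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_snapshot (customer_id INTEGER, city TEXT, snapshot_date TEXT)"
)


def append_snapshot(customer_id, city, snapshot_date):
    """Append a new dated row; earlier rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        (customer_id, city, snapshot_date),
    )


def city_as_of(customer_id, as_of_date):
    """Return the city recorded in the latest snapshot on or before `as_of_date`."""
    row = conn.execute(
        "SELECT city FROM customer_snapshot "
        "WHERE customer_id = ? AND snapshot_date <= ? "
        "ORDER BY snapshot_date DESC LIMIT 1",
        (customer_id, as_of_date),
    ).fetchone()
    return row[0] if row else None


append_snapshot(1, "Pune", "2023-01-01")
append_snapshot(1, "Mumbai", "2024-01-01")  # the customer moved; the old row is kept

print(city_as_of(1, "2023-06-30"))
print(city_as_of(1, "2024-06-30"))
```

Because the old row is preserved rather than overwritten, the same question can be answered for any point in time, which is what makes time-based comparisons possible.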