Data Mining and Warehousing
INFORMATION CRISIS
Organisations generate data. Think of all the various computer applications in a company
(order apps, human resources apps, payment apps, result compilation apps, hostel
accommodation apps etc). Think of all the databases and the quantities of data that support
the operations of a company. How many years’ worth of customer/student data is saved and
available? How many years’ worth of financial data is kept in storage? Ten years? Fifteen
years? Where is all this data? On one platform? In client/server applications?
We are faced with two startling facts: (1) organizations have lots of data, (2) information
technology resources and systems are not effective at turning all that data into useful strategic
information. Over the past two decades, companies have accumulated tons and tons of data
about their operations. Mountains of data exist. Why then do we talk about an information
crisis? Most companies are faced with an information crisis not because of lack of sufficient
data, but because the available data is not readily usable for strategic decision making. There
is a need to turn these mountains of data into strategic information to aid quality decisions for
organisations’ benefits.
Most organisations store only operational data – data created by business operations
involved in daily management processes such as student registration, purchase management,
sales management and so on. However, organisations need quick, comprehensive access to
the information required by decision-making processes.
DATA WAREHOUSE
One of the most important assets of any organization is its information. This asset is almost
always used for two purposes: operational record keeping and analytical decision making.
The definition of a data warehouse by Bill Inmon, accepted as the industry standard, states
that a data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection
of data in support of management's decisions. A data warehouse is a repository of
subjectively selected and adapted operational data, which can successfully answer any ad
hoc, complex, statistical or analytical queries. A data warehouse is a relational database that is
designed for query and analysis rather than transaction processing. It usually contains
historical data that is derived from transaction data, but it can include data from other
sources. It separates analysis workload from transaction workload and enables an
organization to consolidate data from several sources. Facebook is an example of a data
warehouse in practice: it gathers data including friends, likes, photos, posts and so on. A data
warehouse is vital for the decision support system of an organisation and contains integrated
data. Data warehousing is a new paradigm specifically intended to provide vital strategic
information.
Building such a warehouse raises the questions of what to extract, where to extract it from, and how to extract it.
Data warehouses are used across an organisation while data marts (DM) are used for
individual, customised departmental reporting. Data marts are subsets of a data warehouse.
Definition of Concepts
Data Mart – A departmental data warehouse that stores only relevant data.
Dependent data mart – A subset that is created directly from a data warehouse.
Independent data mart – A small data warehouse designed for a strategic
business unit or a department.
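As a minimal sketch (all database, table and column names here are hypothetical), a dependent data mart for a registry department could be populated as a filtered subset of the central warehouse:

    -- Build a departmental (dependent) data mart as a subset of a warehouse fact table.
    SELECT student_id, course_id, session_id, score
    INTO registry_mart.dbo.student_results          -- departmental mart table created here
    FROM warehouse.dbo.fact_student_results
    WHERE faculty = 'Engineering';                  -- only the data relevant to this department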
Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve this,
they are said to be integrated.
NonVolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This
is logical because the purpose of a warehouse is to enable you to analyze what has
occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A data
warehouse's focus on change over time is what is meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP)
databases into a data warehouse on a monthly, weekly, or daily basis.
A snowflake schema is a variation on the star schema in which the dimension tables of a star
schema are organised into a hierarchy by normalization. Due to normalization, a snowflake
schema normally has more tables.
A fact constellation is a set of fact tables that share some dimension tables. A fact constellation
has multiple fact tables.
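As an illustration of the normalisation that distinguishes the two schemas (table and column names are hypothetical), a product dimension kept as one wide table in a star schema can be split into a hierarchy in a snowflake schema:

    -- Star schema: category attributes kept inside the product dimension.
    CREATE TABLE dim_product_star (
        product_key   INT PRIMARY KEY,
        product_name  VARCHAR(100),
        category_name VARCHAR(50),
        category_desc VARCHAR(200)
    );

    -- Snowflake schema: the same dimension normalised into a hierarchy,
    -- so the schema ends up with more (smaller) tables.
    CREATE TABLE dim_category (
        category_key  INT PRIMARY KEY,
        category_name VARCHAR(50),
        category_desc VARCHAR(200)
    );

    CREATE TABLE dim_product_snowflake (
        product_key   INT PRIMARY KEY,
        product_name  VARCHAR(100),
        category_key  INT REFERENCES dim_category(category_key)
    );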
Regardless of how they are implemented, all ETL systems have a common purpose: they move data
from one database to another. Generally, ETL systems move data from OLTP systems to a data
warehouse, but they can also be used to move data from one data warehouse to another. An ETL
system consists of four distinct functional elements:
• Extraction
• Transformation
• Loading
• Meta data
Extraction
The ETL extraction element is responsible for extracting data from the source system. During
extraction, data may be removed from the source system or a copy made and the original data
retained in the source system. It is common to move historical data that accumulates in an
operational OLTP system to a data warehouse to maintain OLTP performance and efficiency.
Legacy systems may require too much effort to implement such offload processes, so legacy
data is often copied into the data warehouse, leaving the original data in place. Extracted data
is loaded into the data warehouse staging area (a relational database usually separate from the
data warehouse database), for manipulation by the remaining ETL processes.
Data extraction is generally performed within the source system itself, especially if it is a
relational database to which extraction procedures can easily be added. It is also possible for
the extraction logic to exist in the data warehouse staging area and query the source system
for data using ODBC, OLE DB, or other APIs. For legacy systems, the most common method
of data extraction is for the legacy system to produce text files, although many newer systems
offer direct query APIs or accommodate access through ODBC or OLE DB.
Data extraction processes can be implemented using Transact-SQL stored procedures, Data
Transformation Services (DTS) tasks, or custom applications developed in programming or
scripting languages.
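For instance, a minimal Transact-SQL extraction sketch, assuming a hypothetical OrderHeader source table and a separate staging database, might look like this:

    -- Copy yesterday's new rows from the source OLTP table into the staging area,
    -- leaving the original data in place in the source system.
    INSERT INTO staging.dbo.stg_order_header (order_id, customer_id, order_date, order_total)
    SELECT order_id, customer_id, order_date, order_total
    FROM sales_oltp.dbo.OrderHeader
    WHERE order_date >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE));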
Transformation
Transformation involves converting the extracted data into a suitable format that can be easily
loaded into a DW system. Data transformation involves applying calculations, joins, and
defining primary and foreign keys on the data. For example, if you want the percentage of
total revenue, which is not stored in the database, you apply the percentage formula during
transformation and then load the data. Similarly, if you have the first name and the last name
of users in different columns, you can apply a concatenation operation before loading the data.
Some data does not require any transformation; such data is known as direct move or
pass-through data.
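The two examples above could be expressed in SQL roughly as follows (a sketch only; the table and column names are assumptions):

    -- Percentage of total revenue, a value not stored in the source database.
    SELECT product_id,
           revenue,
           100.0 * revenue / SUM(revenue) OVER () AS pct_of_total_revenue
    FROM staging.dbo.stg_sales;

    -- Concatenate first name and last name before loading.
    SELECT customer_id,
           first_name + ' ' + last_name AS full_name
    FROM staging.dbo.stg_customer;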
Data transformation also involves data correction and cleansing: removing incorrect data,
completing incomplete data, and fixing data errors. It also includes enforcing data integrity and
formatting incompatible data before loading it into a DW system. The ETL transformation
element is responsible for data validation, data accuracy, data type conversion, and business
rule application. It is the most complicated of the ETL elements. It may appear to be more
efficient to perform some transformations as the data is being extracted (inline
transformation); however, an ETL system that uses inline transformations during extraction is
less robust and flexible than one that confines transformations to the transformation element.
Transformations performed in the OLTP system impose a performance burden on the OLTP
database. They also split the transformation logic between two ETL elements and add
maintenance complexity when the ETL logic changes. Listed below are some basic examples
that illustrate the types of transformations performed by this element:
Data Validation
Check that all rows in the fact table match rows in dimension tables to enforce data integrity.
Data Accuracy
Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.
Data Type Conversion
Ensure that all values for a specified field are stored the same way in the data warehouse
regardless of how they were stored in the source system. For example, if one source system
stores "off" or "on" in its status field and another source system stores "0" or "1" in its status
field, then a data type conversion transformation converts the content of one or both of the
fields to a specified common value such as "off" or "on".
Business Rule Application
Ensure that the rules of the business are enforced on the data stored in the warehouse. For
example, check that all customer records contain values for both FirstName and LastName
fields.
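Hedged SQL sketches of the checks described above, using hypothetical staging and dimension tables:

    -- Data validation: fact rows whose product has no matching dimension row.
    SELECT f.*
    FROM stg_sales_fact AS f
    LEFT JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE p.product_key IS NULL;

    -- Data type conversion: standardise a status field stored as '0'/'1' in one source.
    SELECT order_id,
           CASE status WHEN '1' THEN 'on' WHEN '0' THEN 'off' ELSE status END AS status
    FROM stg_orders_source2;

    -- Business rule application: customer records missing a first or last name.
    SELECT customer_id
    FROM stg_customer
    WHERE FirstName IS NULL OR LastName IS NULL;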
Loading
The ETL loading element is responsible for loading transformed data into the data warehouse
database. Data warehouses are usually updated periodically rather than continuously, and
large numbers of records are often loaded to multiple tables in a single data load. The data
warehouse is often taken offline during update operations so that data can be loaded faster
and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data.
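A minimal loading sketch, assuming the transformed data sits in a staging table and the warehouse holds a sales fact table (names are hypothetical):

    -- Periodic batch load from the staging area into the warehouse fact table.
    INSERT INTO dw.dbo.fact_sales (date_key, product_key, store_key, sales_quantity, sales_amount)
    SELECT date_key, product_key, store_key, sales_quantity, sales_amount
    FROM staging.dbo.stg_sales_transformed;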
Meta Data
The ETL meta data functional element is responsible for maintaining information (meta data)
about the movement and transformation of data, and the operation of the data warehouse. It
also documents the data mappings used during the transformations. Meta data logging
provides possibilities for automated administration, trend prediction, and code reuse.
Examples of data warehouse meta data that can be recorded and used to analyze the activity and
performance of a data warehouse include:
• Data Lineage, such as the time that a particular set of records was loaded into the data
warehouse.
• Schema Changes, such as changes to table definitions.
• Data Type Usage, such as identifying all tables that use the "Birthdate" user-defined data
type.
• Transformation Statistics, such as the execution time of each stage of a transformation,
the number of rows processed by the transformation, the last time the transformation was
executed, and so on.
• DTS Package Versioning, which can be used to view, branch, or retrieve any historical
version of a particular DTS package.
• Data Warehouse Usage Statistics, such as query times for reports.
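One simple way to capture such meta data is a logging table written to by each ETL run; the following is only a sketch with hypothetical column names:

    -- Transformation statistics recorded per ETL run.
    CREATE TABLE etl_transformation_log (
        run_id         INT IDENTITY(1,1) PRIMARY KEY,
        package_name   VARCHAR(100),   -- e.g. the DTS package that executed
        stage_name     VARCHAR(100),   -- extraction, transformation, or loading step
        rows_processed INT,
        started_at     DATETIME,
        finished_at    DATETIME
    );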
The most common ETL tools include SAP BO Data Services (BODS), Informatica
PowerCenter, Microsoft SSIS, Oracle Data Integrator (ODI), Talend Open Studio, CloverETL
(open source), etc.
A ROLAP architecture includes the following components:
Database server
ROLAP server
Front-end tool
In a typical star schema example, the Sales table is the fact table while Product, Store and Period are the dimension tables.
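That star schema could be sketched in SQL as follows (column names are illustrative, not taken from any particular system):

    CREATE TABLE dim_product (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(100),
        brand        VARCHAR(50)
    );
    CREATE TABLE dim_store (
        store_key  INT PRIMARY KEY,
        store_name VARCHAR(100),
        city       VARCHAR(50)
    );
    CREATE TABLE dim_period (
        period_key INT PRIMARY KEY,
        day_date   DATE,
        month_name VARCHAR(20),
        year_no    INT
    );

    -- Fact table: one row per product sold per store per period.
    CREATE TABLE fact_sales (
        product_key    INT REFERENCES dim_product(product_key),
        store_key      INT REFERENCES dim_store(store_key),
        period_key     INT REFERENCES dim_period(period_key),
        sales_quantity INT,
        sales_amount   DECIMAL(12,2)
    );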
Advantages of ROLAP
ROLAP servers can be easily used with existing RDBMS.
Data can be stored efficiently, since zero facts (empty cells) need not be stored.
Disadvantages of ROLAP
Poor query performance.
Some limitations of scalability, depending on the technology architecture that is
utilized.
Advantages of MOLAP
MOLAP allows the fastest indexing of pre-computed summarized data.
It helps users connected to a network who need to analyze larger, less-defined data.
Disadvantages of MOLAP
MOLAP servers are not capable of storing detailed data.
The storage utilization may be low if the data set is sparse.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations:
Roll-up
Drill-down
Slice
Dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
For example, given total sales by city, we can roll up to get total sales by state.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.
For example, given total sales by state, we can drill down to get total sales by city.
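Roll-up and drill-down can be pictured as SQL aggregations over a location hierarchy (city rolls up to state); the table and column names below are assumptions:

    -- Detailed level: total sales by city.
    SELECT state, city, SUM(sales_amount) AS total_sales
    FROM fact_sales_by_location
    GROUP BY state, city;

    -- Roll-up: climb the location hierarchy from city to state.
    SELECT state, SUM(sales_amount) AS total_sales
    FROM fact_sales_by_location
    GROUP BY state;

    -- Drill-down is the reverse: from the state totals back to the city-level query above.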
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. For example, slicing the sales cube on time = Q1 yields the sub-cube of sales for the first quarter only.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. For
example, dicing on time in (Q1, Q2) and location in (Lokoja, Abuja) yields a sub-cube restricted to those values.
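Slice and dice correspond roughly to restricting the cube with WHERE conditions before aggregating; the sketch below assumes a cube-like sales table with time, location and item dimensions:

    -- Slice: fix a single dimension (time = 'Q1') to obtain a sub-cube.
    SELECT location, item, SUM(sales_amount) AS total_sales
    FROM fact_sales_cube
    WHERE time_quarter = 'Q1'
    GROUP BY location, item;

    -- Dice: restrict two or more dimensions at once.
    SELECT time_quarter, location, item, SUM(sales_amount) AS total_sales
    FROM fact_sales_cube
    WHERE time_quarter IN ('Q1', 'Q2')
      AND location IN ('Lokoja', 'Abuja')
    GROUP BY time_quarter, location, item;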
Figure: Sample cash register receipt from Poly Grocery (Transaction 649, 20/09/2016, 10:30 am; item count 4; total 2,450).
At the grocery store, management is concerned with the logistics of ordering, stocking, and
selling products while maximizing profit. The profit ultimately comes from charging as much
as possible for each product, lowering costs for product acquisition and overhead, and at the
same time attracting as many customers as possible in a highly competitive environment.
Some of the most significant management decisions have to do with pricing and promotions.
Both store management and headquarters marketing spend a great deal of time tinkering with
pricing and promotions. Promotions in a grocery store include temporary price reductions,
ads in newspapers and newspaper inserts, displays in the grocery store, and coupons. The
most direct and effective way to create a surge in the volume of product sold is to lower the
price dramatically. A 50-cent reduction in the price of paper towels, especially when coupled
with an ad and display, can cause the sale of the paper towels to jump by a factor of 10.
Unfortunately, such a big price reduction usually is not sustainable because the towels
probably are being sold at a loss. As a result of these issues, the visibility of all forms of
promotion is an important part of analysing the operations of a grocery store.
Now that we have described our business case study, we’ll begin to design the dimensional
model.
Step 1: Select the Business Process
The first step in the design is to decide what business process to model by
combining an understanding of the business requirements with an understanding of
the available source data.
In our retail case study, management wants to better understand customer
purchases as captured by the POS system. Thus the business process you’re
modelling is POS retail sales transactions. This data enables the business users to
analyse which products are selling in which stores on which days under what
promotional conditions in which transactions.
Step 2: Declare the Grain
In our case study, the most granular data is an individual product on a POS
transaction, assuming the POS system rolls up all sales for a given product within a
shopping cart into a single line item. Although users probably are not interested in
analysing single items associated with a specific POS transaction, you can’t predict
all the ways they’ll want to cull through that data. For example, they may want to
understand the difference in sales on Monday versus Sunday. Or they may want to
assess whether it’s worthwhile to stock so many individual sizes of certain brands. Or
they may want to understand how many shoppers took advantage of the 50-cents-
off promotion on shampoo. Or they may want to determine the impact of decreased
sales when a competitive diet soda product was promoted heavily. Although none of
these queries calls for data from one specific transaction, they are broad questions
that require detailed data sliced in precise ways. None of them could have been
answered if you elected to provide access only to summarized data.
Step 3: Identify the Dimensions
After the grain of the fact table has been chosen, the choice of dimensions is
straightforward.
The product and transaction dimensions fall out of the grain immediately. Within the framework of the
primary dimensions, you can ask whether other dimensions can be attributed to the
POS measurements, such as the date of the sale, the store where the sale occurred,
the promotion under which the product is sold, the cashier who handled the sale, and
potentially the method of payment. The following descriptive dimensions apply to the case:
date, product, store, promotion, cashier, and method of payment. In addition, the POS
transaction ticket number is included as a special dimension.
Step 4: Identify the Facts
The fourth and final step in the design is to make a careful determination of which
facts will appear in the fact table. Again, the grain declaration helps anchor your
thinking. Simply put, the facts must be true to the grain: the individual product line
item on the POS transaction in this case. When considering potential facts, you may
again discover adjustments need to be made to either your earlier grain assumptions
or choice of dimensions. The facts collected by the POS system include the sales quantity
(for example, the number of cans of chicken noodle soup); the per-unit regular, discount, and
net paid prices; and the extended discount and sales naira amounts. The extended sales naira
amount equals the sales quantity multiplied by the net unit price. Likewise, the extended
discount naira amount is the sales quantity multiplied by the unit discount amount.
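Pulling the four steps together, the retail sales fact table at the declared grain (one row per product line item on a POS transaction) could be sketched in Transact-SQL as follows; the names are illustrative:

    CREATE TABLE fact_retail_sales (
        date_key            INT,            -- references the date dimension
        product_key         INT,
        store_key           INT,
        promotion_key       INT,
        cashier_key         INT,
        payment_method_key  INT,
        pos_transaction_no  INT,            -- POS ticket number kept as a special dimension on the fact
        sales_quantity      INT,
        regular_unit_price  DECIMAL(10,2),
        unit_discount_amount DECIMAL(10,2),
        net_unit_price      DECIMAL(10,2),
        -- extended amounts are true to the grain: quantity times the unit figures
        extended_discount_amount AS (sales_quantity * unit_discount_amount),
        extended_sales_amount    AS (sales_quantity * net_unit_price)
    );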
Exercise
1. You are the data design specialist on the Data Warehouse Project team for Kogi State
Polytechnic, Lokoja. Design a star schema to track student attendance for lectures.
Student attendance can be analysed along the Date, Course, Student, Lecturer and
Hall dimensions. List possible attributes and designate a primary key for each
dimension table.
2. A fact table is narrow while a dimension table is wide. Explain.
3. Suppose that a data warehouse for a company consists of four dimensions, namely
Location, Supplier, Time, and Product, and three measures: count, average cost and total
cost. Draw a snowflake schema for the data warehouse.