Data Mining and Warehousing
INFORMATION CRISIS
Organisations generate data. Think of all the various computer applications in a company
(order apps, human resources apps, payment apps, result compilation apps, hostel
accommodation apps etc). Think of all the databases and the quantities of data that support
the operations of a company. How many years’ worth of customer/student data is saved and
available? How many years’ worth of financial data is kept in storage? Ten years? Fifteen
years? Where is all this data? On one platform? In client/server applications?
We are faced with two startling facts: (1) organizations have lots of data, (2) information
technology resources and systems are not effective at turning all that data into useful strategic
information. Over the past two decades, companies have accumulated tons and tons of data
about their operations. Mountains of data exist. Why then do we talk about an information
crisis? Most companies are faced with an information crisis not because of lack of sufficient
data, but because the available data is not readily usable for strategic decision making. There
is a need to turn these mountains of data into strategic information to aid quality decisions for
organisations’ benefits.
Most organisations store only operational data – data created by business operations
involved in daily management processes such as student registration, purchase management,
sales management and so on. However, organisations need quick, comprehensive access to
the information required by decision-making processes.
DATA WAREHOUSE
One of the most important assets of any organization is its information. This asset is almost
always used for two purposes: operational record keeping and analytical decision making.
The definition of a data warehouse by Bill Inmon, accepted as the industry standard, states
that a data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection
of data in support of management's decisions. A data warehouse is a repository of
subjectively selected and adapted operational data, which can successfully answer any ad
hoc, complex, statistical or analytical queries. A data warehouse is a relational database that is
designed for query and analysis rather than transaction processing. It usually contains
historical data that is derived from transaction data, but it can include data from other
sources. It separates analysis workload from transaction workload and enables an
organization to consolidate data from several sources. Facebook is an example of a data
warehouse in practice: it gathers data including friends, likes, photos, posts and so on. A data
warehouse is vital for the decision support system of an organisation and contains integrated
data. Data warehousing is a new paradigm specifically intended to provide vital strategic
information.
Building such a warehouse raises the questions of what to extract, where to extract it from, and how to extract it.
Data warehouses are used across an organisation while data marts (DM) are used for
individual, customised departmental reporting. Data marts are subsets of a data warehouse.
Definition of Concepts
Data Mart – A departmental data warehouse that stores only relevant data.
Dependent data mart – A subset that is created directly from a data warehouse.
Independent data mart – A small data warehouse designed for a strategic
business unit or a department.
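As a minimal sketch (all database, table and column names here are hypothetical), a dependent data mart for a registry department could be populated as a filtered subset of the central warehouse:

    -- Build a departmental (dependent) data mart as a subset of a warehouse fact table.
    SELECT student_id, course_id, session_id, score
    INTO registry_mart.dbo.student_results          -- departmental mart table created here
    FROM warehouse.dbo.fact_student_results
    WHERE faculty = 'Engineering';                  -- only the data relevant to this department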
Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve this,
they are said to be integrated.
NonVolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This
is logical because the purpose of a warehouse is to enable you to analyze what has
occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A data
warehouse's focus on change over time is what is meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP)
databases into a data warehouse on a monthly, weekly, or daily basis.
A snowflake schema is a variation on the star schema in which the dimension tables of a star
schema are organised into a hierarchy by normalization. Due to normalization, a snowflake
schema normally has more tables.
A fact constellation is a set of fact tables that share some dimension tables. A fact constellation
has multiple fact tables.
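As an illustration of the normalisation that distinguishes the two schemas (table and column names are hypothetical), a product dimension kept as one wide table in a star schema can be split into a hierarchy in a snowflake schema:

    -- Star schema: category attributes kept inside the product dimension.
    CREATE TABLE dim_product_star (
        product_key   INT PRIMARY KEY,
        product_name  VARCHAR(100),
        category_name VARCHAR(50),
        category_desc VARCHAR(200)
    );

    -- Snowflake schema: the same dimension normalised into a hierarchy,
    -- so the schema ends up with more (smaller) tables.
    CREATE TABLE dim_category (
        category_key  INT PRIMARY KEY,
        category_name VARCHAR(50),
        category_desc VARCHAR(200)
    );

    CREATE TABLE dim_product_snowflake (
        product_key   INT PRIMARY KEY,
        product_name  VARCHAR(100),
        category_key  INT REFERENCES dim_category(category_key)
    );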
Regardless of how they are implemented, all ETL systems have a common purpose: they move data
from one database to another. Generally, ETL systems move data from OLTP systems to a data
warehouse, but they can also be used to move data from one data warehouse to another. An ETL
system consists of four distinct functional elements:
• Extraction
• Transformation
• Loading
• Meta data
Extraction
The ETL extraction element is responsible for extracting data from the source system. During
extraction, data may be removed from the source system or a copy made and the original data
retained in the source system. It is common to move historical data that accumulates in an
operational OLTP system to a data warehouse to maintain OLTP performance and efficiency.
Legacy systems may require too much effort to implement such offload processes, so legacy
data is often copied into the data warehouse, leaving the original data in place. Extracted data
is loaded into the data warehouse staging area (a relational database usually separate from the
data warehouse database), for manipulation by the remaining ETL processes.
Data extraction is generally performed within the source system itself, especially if it is a
relational database to which extraction procedures can easily be added. It is also possible for
the extraction logic to exist in the data warehouse staging area and query the source system
for data using ODBC, OLE DB, or other APIs. For legacy systems, the most common method
of data extraction is for the legacy system to produce text files, although many newer systems
offer direct query APIs or accommodate access through ODBC or OLE DB.
Data extraction processes can be implemented using Transact-SQL stored procedures, Data
Transformation Services (DTS) tasks, or custom applications developed in programming or
scripting languages.
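For instance, a minimal Transact-SQL extraction sketch, assuming a hypothetical OrderHeader source table and a separate staging database, might look like this:

    -- Copy yesterday's new rows from the source OLTP table into the staging area,
    -- leaving the original data in place in the source system.
    INSERT INTO staging.dbo.stg_order_header (order_id, customer_id, order_date, order_total)
    SELECT order_id, customer_id, order_date, order_total
    FROM sales_oltp.dbo.OrderHeader
    WHERE order_date >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE));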
Transformation
Transformation involves converting the extracted data into a suitable format that can be easily
loaded into a DW system. Data transformation involves applying calculations, joins, and
defining primary and foreign keys on the data. For example, if you want the percentage of
total revenue, which is not stored in the database, you apply the percentage formula during
transformation and then load the data. Similarly, if you have the first name and the last name
of users in different columns, you can apply a concatenation operation before loading the data.
Some data does not require any transformation; such data is known as direct move or
pass-through data.
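The two examples above could be expressed in SQL roughly as follows (a sketch only; the table and column names are assumptions):

    -- Percentage of total revenue, a value not stored in the source database.
    SELECT product_id,
           revenue,
           100.0 * revenue / SUM(revenue) OVER () AS pct_of_total_revenue
    FROM staging.dbo.stg_sales;

    -- Concatenate first name and last name before loading.
    SELECT customer_id,
           first_name + ' ' + last_name AS full_name
    FROM staging.dbo.stg_customer;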
Data transformation also involves data correction and cleansing: removing incorrect data,
completing incomplete data, and fixing data errors. It also includes enforcing data integrity and
formatting incompatible data before loading it into a DW system. The ETL transformation
element is responsible for data validation, data accuracy, data type conversion, and business
rule application. It is the most complicated of the ETL elements. It may appear to be more
efficient to perform some transformations as the data is being extracted (inline
transformation); however, an ETL system that uses inline transformations during extraction is
less robust and flexible than one that confines transformations to the transformation element.
Transformations performed in the OLTP system impose a performance burden on the OLTP
database. They also split the transformation logic between two ETL elements and add
maintenance complexity when the ETL logic changes. Listed below are some basic examples
that illustrate the types of transformations performed by this element:
Data Validation
Check that all rows in the fact table match rows in dimension tables to enforce data integrity.
Data Accuracy
Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.
Data Type Conversion
Ensure that all values for a specified field are stored the same way in the data warehouse
regardless of how they were stored in the source system. For example, if one source system
stores "off" or "on" in its status field and another source system stores "0" or "1" in its status
field, then a data type conversion transformation converts the content of one or both of the
fields to a specified common value such as "off" or "on".
Business Rule Application
Ensure that the rules of the business are enforced on the data stored in the warehouse. For
example, check that all customer records contain values for both FirstName and LastName
fields.
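Hedged SQL sketches of the checks described above, using hypothetical staging and dimension tables:

    -- Data validation: fact rows whose product has no matching dimension row.
    SELECT f.*
    FROM stg_sales_fact AS f
    LEFT JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE p.product_key IS NULL;

    -- Data type conversion: standardise a status field stored as '0'/'1' in one source.
    SELECT order_id,
           CASE status WHEN '1' THEN 'on' WHEN '0' THEN 'off' ELSE status END AS status
    FROM stg_orders_source2;

    -- Business rule application: customer records missing a first or last name.
    SELECT customer_id
    FROM stg_customer
    WHERE FirstName IS NULL OR LastName IS NULL;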
Loading
The ETL loading element is responsible for loading transformed data into the data warehouse
database. Data warehouses are usually updated periodically rather than continuously, and
large numbers of records are often loaded to multiple tables in a single data load. The data
warehouse is often taken offline during update operations so that data can be loaded faster
and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data.
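A minimal loading sketch, assuming the transformed data sits in a staging table and the warehouse holds a sales fact table (names are hypothetical):

    -- Periodic batch load from the staging area into the warehouse fact table.
    INSERT INTO dw.dbo.fact_sales (date_key, product_key, store_key, sales_quantity, sales_amount)
    SELECT date_key, product_key, store_key, sales_quantity, sales_amount
    FROM staging.dbo.stg_sales_transformed;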
Meta Data
The ETL meta data functional element is responsible for maintaining information (meta data)
about the movement and transformation of data, and the operation of the data warehouse. It
also documents the data mappings used during the transformations. Meta data logging
provides possibilities for automated administration, trend prediction, and code reuse.
Examples of data warehouse meta data that can be recorded and used to analyze the activity and
performance of a data warehouse include:
• Data Lineage, such as the time that a particular set of records was loaded into the data
warehouse.
• Schema Changes, such as changes to table definitions.
• Data Type Usage, such as identifying all tables that use the "Birthdate" user-defined data
type.
• Transformation Statistics, such as the execution time of each stage of a transformation,
the number of rows processed by the transformation, the last time the transformation was
executed, and so on.
• DTS Package Versioning, which can be used to view, branch, or retrieve any historical
version of a particular DTS package.
• Data Warehouse Usage Statistics, such as query times for reports.
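One simple way to capture such meta data is a logging table written to by each ETL run; the following is only a sketch with hypothetical column names:

    -- Transformation statistics recorded per ETL run.
    CREATE TABLE etl_transformation_log (
        run_id         INT IDENTITY(1,1) PRIMARY KEY,
        package_name   VARCHAR(100),   -- e.g. the DTS package that executed
        stage_name     VARCHAR(100),   -- extraction, transformation, or loading step
        rows_processed INT,
        started_at     DATETIME,
        finished_at    DATETIME
    );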
The most common ETL tools include SAP BO Data Services (BODS), Informatica
PowerCenter, Microsoft SSIS, Oracle Data Integrator (ODI), Talend Open Studio, CloverETL
(open source), etc.
A ROLAP architecture includes the following components:
Database server
ROLAP server
Front-end tool
In a typical star schema example, the Sales table is the fact table while Product, Store and Period are the dimension tables.
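That star schema could be sketched in SQL as follows (column names are illustrative, not taken from any particular system):

    CREATE TABLE dim_product (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(100),
        brand        VARCHAR(50)
    );
    CREATE TABLE dim_store (
        store_key  INT PRIMARY KEY,
        store_name VARCHAR(100),
        city       VARCHAR(50)
    );
    CREATE TABLE dim_period (
        period_key INT PRIMARY KEY,
        day_date   DATE,
        month_name VARCHAR(20),
        year_no    INT
    );

    -- Fact table: one row per product sold per store per period.
    CREATE TABLE fact_sales (
        product_key    INT REFERENCES dim_product(product_key),
        store_key      INT REFERENCES dim_store(store_key),
        period_key     INT REFERENCES dim_period(period_key),
        sales_quantity INT,
        sales_amount   DECIMAL(12,2)
    );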
Advantages of ROLAP
ROLAP servers can be easily used with existing RDBMS.
Data can be stored efficiently, since zero facts (empty cells) need not be stored.
Disadvantages of ROLAP
Poor query performance.
Some limitations of scalability, depending on the technology architecture that is
utilized.
Advantages of MOLAP
MOLAP allows the fastest indexing of pre-computed summarized data.
It helps users connected to a network who need to analyze larger, less-defined data.
Disadvantages of MOLAP
MOLAP servers are not capable of storing detailed data.
The storage utilization may be low if the data set is sparse.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations:
Roll-up
Drill-down
Slice
Dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
For example, given total sales by city, we can roll up to get total sales by state.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.
For example, given total sales by state, we can drill down to get total sales by city.
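Roll-up and drill-down can be pictured as SQL aggregations over a location hierarchy (city rolls up to state); the table and column names below are assumptions:

    -- Detailed level: total sales by city.
    SELECT state, city, SUM(sales_amount) AS total_sales
    FROM fact_sales_by_location
    GROUP BY state, city;

    -- Roll-up: climb the location hierarchy from city to state.
    SELECT state, SUM(sales_amount) AS total_sales
    FROM fact_sales_by_location
    GROUP BY state;

    -- Drill-down is the reverse: from the state totals back to the city-level query above.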
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. For example, slicing the sales cube on time = Q1 yields the sub-cube of sales for the first quarter only.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. For
example, dicing on time in (Q1, Q2) and location in (Lokoja, Abuja) yields a sub-cube restricted to those values.
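Slice and dice correspond roughly to restricting the cube with WHERE conditions before aggregating; the sketch below assumes a cube-like sales table with time, location and item dimensions:

    -- Slice: fix a single dimension (time = 'Q1') to obtain a sub-cube.
    SELECT location, item, SUM(sales_amount) AS total_sales
    FROM fact_sales_cube
    WHERE time_quarter = 'Q1'
    GROUP BY location, item;

    -- Dice: restrict two or more dimensions at once.
    SELECT time_quarter, location, item, SUM(sales_amount) AS total_sales
    FROM fact_sales_cube
    WHERE time_quarter IN ('Q1', 'Q2')
      AND location IN ('Lokoja', 'Abuja')
    GROUP BY time_quarter, location, item;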
Figure: Sample cash register receipt from Poly Grocery (Transaction 649, 20/09/2016, 10:30 am; item count 4; total 2,450).
At the grocery store, management is concerned with the logistics of ordering, stocking, and
selling products while maximizing profit. The profit ultimately comes from charging as much
as possible for each product, lowering costs for product acquisition and overhead, and at the
same time attracting as many customers as possible in a highly competitive environment.
Some of the most significant management decisions have to do with pricing and promotions.
Both store management and headquarters marketing spend a great deal of time tinkering with
pricing and promotions. Promotions in a grocery store include temporary price reductions,
ads in newspapers and newspaper inserts, displays in the grocery store, and coupons. The
most direct and effective way to create a surge in the volume of product sold is to lower the
price dramatically. A 50-cent reduction in the price of paper towels, especially when coupled
with an ad and display, can cause the sale of the paper towels to jump by a factor of 10.
Unfortunately, such a big price reduction usually is not sustainable because the towels
probably are being sold at a loss. As a result of these issues, the visibility of all forms of
promotion is an important part of analysing the operations of a grocery store.
Now that we have described our business case study, we’ll begin to design the dimensional
model.
Step 1: Select the Business Process
The first step in the design is to decide what business process to model by
combining an understanding of the business requirements with an understanding of
the available source data.
In our retail case study, management wants to better understand customer
purchases as captured by the POS system. Thus the business process you’re
modelling is POS retail sales transactions. This data enables the business users to
analyse which products are selling in which stores on which days under what
promotional conditions in which transactions.
Step 2: Declare the Grain
In our case study, the most granular data is an individual product on a POS
transaction, assuming the POS system rolls up all sales for a given product within a
shopping cart into a single line item. Although users probably are not interested in
analysing single items associated with a specific POS transaction, you can’t predict
all the ways they’ll want to cull through that data. For example, they may want to
understand the difference in sales on Monday versus Sunday. Or they may want to
assess whether it’s worthwhile to stock so many individual sizes of certain brands. Or
they may want to understand how many shoppers took advantage of the 50-cents-
off promotion on shampoo. Or they may want to determine the impact of decreased
sales when a competitive diet soda product was promoted heavily. Although none of
these queries calls for data from one specific transaction, they are broad questions
that require detailed data sliced in precise ways. None of them could have been
answered if you elected to provide access only to summarized data.
Step 3: Identify the Dimensions
After the grain of the fact table has been chosen, the choice of dimensions is
straightforward.
The product and transaction dimensions fall out of the grain immediately. Within the framework of the
primary dimensions, you can ask whether other dimensions can be attributed to the
POS measurements, such as the date of the sale, the store where the sale occurred,
the promotion under which the product is sold, the cashier who handled the sale, and
potentially the method of payment. The following descriptive dimensions apply to the case:
date, product, store, promotion, cashier, and method of payment. In addition, the POS
transaction ticket number is included as a special dimension.
Step 4: Identify the Facts
The fourth and final step in the design is to make a careful determination of which
facts will appear in the fact table. Again, the grain declaration helps anchor your
thinking. Simply put, the facts must be true to the grain: the individual product line
item on the POS transaction in this case. When considering potential facts, you may
again discover adjustments need to be made to either your earlier grain assumptions
or choice of dimensions. The facts collected by the POS system include the sales quantity
(for example, the number of cans of chicken noodle soup); the per-unit regular, discount, and
net paid prices; and the extended discount and sales naira amounts. The extended sales naira
amount equals the sales quantity multiplied by the net unit price. Likewise, the extended
discount naira amount is the sales quantity multiplied by the unit discount amount.
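Pulling the four steps together, the retail sales fact table at the declared grain (one row per product line item on a POS transaction) could be sketched in Transact-SQL as follows; the names are illustrative:

    CREATE TABLE fact_retail_sales (
        date_key            INT,            -- references the date dimension
        product_key         INT,
        store_key           INT,
        promotion_key       INT,
        cashier_key         INT,
        payment_method_key  INT,
        pos_transaction_no  INT,            -- POS ticket number kept as a special dimension on the fact
        sales_quantity      INT,
        regular_unit_price  DECIMAL(10,2),
        unit_discount_amount DECIMAL(10,2),
        net_unit_price      DECIMAL(10,2),
        -- extended amounts are true to the grain: quantity times the unit figures
        extended_discount_amount AS (sales_quantity * unit_discount_amount),
        extended_sales_amount    AS (sales_quantity * net_unit_price)
    );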
Exercise
1. You are the data design specialist on the Data Warehouse Project team for Kogi State
Polytechnic, Lokoja. Design a star schema to track student attendance for lectures.
Student attendance can be analysed along the Date, Course, Student, Lecturer and
Hall dimensions. List possible attributes and designate a primary key for each
dimension table.
2. A fact table is narrow while a dimension table is wide. Explain.
3. Suppose that a data warehouse for a company consists of four dimensions, namely
Location, Supplier, Time, and Product, and three measures: count, average cost and total
cost. Draw a snowflake schema for the data warehouse.