Data Warehouse & Data Mining


Lecture 1

Data Warehouse & Data Mining


Introduction
• Data: Facts and information about
something
• Warehouse: A location or facility for
storing goods and merchandise
Data Warehouse
• A large electronic repository of information that is generated and updated in a structured manner by an enterprise.
• Aids business intelligence and supports decision making.
• A relational database designed for query and
analysis rather than for transaction processing.
• Contains historical data derived from transaction
data
Data warehousing
• The coordinated, architected, and periodic
copying of data from various sources, both
inside and outside the enterprise, into an
environment optimized for analytical and
informational processing.
• Data is copied (duplicated) in a controlled manner, periodically (batch-oriented processing).
Features
• It provides
– centralization of corporate data assets.
– a well-managed environment.
– consistent and repeatable processes for
loading data from corporate applications.
– open and scalable architecture, can handle
future expansion.
– tools that allow its users to effectively process
the data into information without a high
degree of technical support.
Features
• Allows business leaders to make informed decisions based on previous business data.
• Provides a platform for analytical and informational processing.
• Breaks down the barriers created by non-enterprise, process-focused applications.
• Consolidates information into a single view for users to access.
Data asset

• In an enterprise, data can be managed in three groups:
• Run-the-business data:
– Produced by corporate applications
– For example, the applications a company uses to fill customer orders for its products or to manage financial transactions.
– The raw materials for a data warehouse.
Data asset
• Integrate-the-business data:
– Built to improve the quality of and synchronize two or
more corporate applications
– such as a master list of customers.
– integrate applications that weren’t designed to work
with each other.
• Monitor-the-business data:
– Presented to end users for reporting and decision
support
– such as your financial reports.
– The data is cleansed so that users can better understand progress and evaluate cause and effect.
Data asset
• A data asset is the result of taking the raw
material from the run-the-business data
and producing higher-quality-data end
products to integrate the business and
monitor the business.
Guidelines (or principles) For Data
Warehouse
• Subject Orientation:
– Data will be grouped by subject, rather than
author, department, or physical location.
– So, all manufacturing data goes together, and
the sales data, and the promotions data, etc.,
regardless of where it came from.
Guidelines (or principles) For Data
Warehouse
• Data Integration:
– Even though data comes from separate applications, departments, etc., differences should be smoothed out so they have the same look and feel.
– Form: When two data elements (e.g., phone numbers) have different layouts (e.g., 123-123-1234 and (123) 123-1234), one layout will be superimposed on both of them.
– Function: When two data elements identify the same thing (e.g., a hammer) with two different names (e.g., part 32G and part B49), these two names will be replaced with one name.
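The form and function rules above can be sketched in a few lines of Python. This is only an illustration: the target layout 123-123-1234, the shared name "HAMMER", and the helper names `normalize_phone` and `normalize_part` are assumptions, not part of the lecture.

```python
import re

# Hypothetical mapping from source-specific part codes to one warehouse name.
PART_NAME_MAP = {"32G": "HAMMER", "B49": "HAMMER"}

def normalize_phone(raw: str) -> str:
    """Form: impose one layout (123-123-1234) on any 10-digit phone number."""
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {raw!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def normalize_part(code: str) -> str:
    """Function: replace source-specific part codes with a single shared name."""
    return PART_NAME_MAP.get(code, code)

print(normalize_phone("(123) 123-1234"))
print(normalize_part("32G"), normalize_part("B49"))
```

Both phone layouts from the slide end up in the same form, and both part codes resolve to the same name.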
Guidelines (or principles) For Data
Warehouse
• Nonvolatility:
– Data in operational applications is discarded once the company is finished using it.
– In contrast, the data in a data warehouse remains in the warehouse.
• Time Variant:
– All data has a context at a moment in time.
– A data warehouse will keep that context.
– So, all data from 1995 will retain its context within
1995.
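One way to picture nonvolatile, time-variant data is an append-only store in which every row carries the date it was loaded. The sketch below assumes a hypothetical customer table with made-up cities; `load_snapshot` and `city_as_of` are illustrative names only.

```python
from datetime import date

warehouse = []  # append-only list of (as_of, customer_id, city) rows

def load_snapshot(as_of, rows):
    """Append a dated copy of the source rows; existing rows are never changed."""
    for customer_id, city in rows:
        warehouse.append((as_of, customer_id, city))

def city_as_of(customer_id, when):
    """Return the most recent city recorded on or before `when`, if any."""
    matches = [(as_of, city) for as_of, cid, city in warehouse
               if cid == customer_id and as_of <= when]
    return max(matches)[1] if matches else None

load_snapshot(date(1995, 6, 30), [(1, "Boston")])
load_snapshot(date(1996, 6, 30), [(1, "Denver")])

print(city_as_of(1, date(1995, 12, 31)))  # the 1995 context is still available
```

The 1996 load does not overwrite the 1995 row, so a question asked "as of 1995" still gets the 1995 answer.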
Guidelines (or principles) For Data
Warehouse
• One Version of the Truth:
– The proliferation of data in the 1980s and 1990s produced many copies of the same data.
– Only the one true, gold-standard copy of each data element should be included in a data warehouse.
• Long-Term Investment:
– A data warehouse should be flexible enough to
absorb changes in the company and the world, and
scalable enough to grow with the company.
Data Mining
Introduction
• Data mining is the process of applying analytical approaches to large data sets to discover implicit, previously unknown, and potentially useful information.
• This process often involves three steps: data preprocessing, data mining, and postprocessing.
Introduction
• The first step is to transform the raw data
into a more suitable format for subsequent
data mining.
• The second step conducts the actual
mining while the last one is implemented
to validate and interpret the mining results.
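The three steps can be sketched as a tiny pipeline. Frequency counting stands in for the mining step here purely for illustration; the sample data, the function names, and the `min_count` threshold are assumptions.

```python
from collections import Counter

# Hypothetical raw purchase records with inconsistent spacing and case.
raw = [" Milk ", "bread", "MILK", "", "eggs", "milk", "Bread"]

def preprocess(records):
    """Step 1: clean raw values into a consistent format and drop empties."""
    return [r.strip().lower() for r in records if r.strip()]

def mine(items):
    """Step 2: a stand-in mining step that counts how often each item occurs."""
    return Counter(items)

def postprocess(counts, min_count=2):
    """Step 3: keep only patterns frequent enough to be worth interpreting."""
    return {item: n for item, n in counts.items() if n >= min_count}

result = postprocess(mine(preprocess(raw)))
print(result)
```

Real mining steps are far more involved, but the shape (clean, mine, then filter and interpret) is the same.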
Purpose
• The main purpose of data mining is to
extract knowledge from the data at hand,
increasing its intrinsic value and making
the data useful.
Business Problems for Data
Mining
• Recommendation generation
– Generating recommendations is an important
business challenge for retailers and service providers.
– Customers who are provided appropriate and timely
recommendations are likely to be more valuable
(because they purchase more) and more loyal
(because they feel a stronger relationship to the
vendor).
– For example, Amazon.com recommends products to its customers.
– These recommendations are derived from using data
mining to analyze purchase behavior of all of the
retailer’s customers, and applying the derived rules to
your personal information.
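A very small sketch of this idea counts which items are bought together across all customers and then applies those counts to one customer's purchases. Co-purchase counting is a simplification chosen for illustration, not the method any particular retailer uses; the baskets and the `recommend` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one set of items per customer.
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"lamp", "desk"},
]

# Count how often each ordered pair of items is bought by the same customer.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

def recommend(owned, top_n=2):
    """Rank items the customer lacks by co-purchase count with items they own."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]

print(recommend({"pen"}))
```

The first loop is the "analyze all customers" part; `recommend` is the "apply the derived rules to one customer" part.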
Business Problems for Data
Mining
• Anomaly detection
– How do you know whether your data is "good" or not?
– Analyze your data and pick out those items that don’t fit with the
rest.
– Credit card companies use data mining–driven anomaly
detection to determine if a particular transaction is valid.
– Insurance companies also use anomaly detection to determine if
claims are fraudulent.
– Because these companies process thousands of claims a day, it
is impossible to investigate each case, and data mining can
identify which claims are likely to be false.
– Anomaly detection can even be used to validate data entry—
checking to see if the data entered is correct at the point of
entry.
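"Picking out the items that don't fit with the rest" can be illustrated with a simple z-score rule: flag any value that lies far from the mean in units of standard deviation. This is a textbook simplification, not what credit card or insurance companies actually run; the threshold of 2.0 and the sample amounts are assumptions.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is clearly out of line with the rest.
amounts = [20, 22, 19, 21, 20, 23, 18, 21, 500]
print(find_anomalies(amounts))
```

A flagged value is only a candidate for investigation; deciding whether it is actually fraudulent or a data-entry error still takes a follow-up check.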
Business Problems for Data
Mining
• Churn analysis
– Which customers are most likely to switch to
a competitor?
– The telecom, banking, and insurance
industries face severe competition.
– Churn analysis can help marketing managers identify the customers who are likely to leave and why, so that they can improve customer relations and retain customers.
Business Problems for Data
Mining
• Risk management
– Should a loan be approved for a particular
customer?
– Data mining techniques are used to determine
the risk of a loan application, helping the loan
officer make appropriate decisions on the cost
and validity of each application.
Business Problems for Data
Mining
• Customer segmentation
– Customer segmentation determines the
behavioral and descriptive profiles for your
customers.
– These profiles are then used to provide
personalized marketing programs and
strategies that are appropriate for each group.
Business Problems for Data
Mining
• Targeted ads
– Web retailers or portal sites like to personalize
their content for their Web customers.
– Using navigation or online purchase patterns,
these sites can use data mining solutions to
display targeted advertisements to their Web
navigators.
Business Problems for Data
Mining
• Forecasting
– How many products will you sell next week?
– What will the inventory level be in one
month?
– Data mining forecasting techniques can be
used to answer these types of time-related
questions.
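As a minimal illustration of a time-related forecast, the sketch below predicts next week's sales as the average of the most recent weeks. A moving average is one of the simplest possible techniques and is used here only as an example; the sales figures and the window of 3 are assumptions.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window

# Hypothetical units sold in each of the last five weeks.
weekly_sales = [120, 130, 125, 140, 135]
next_week = moving_average_forecast(weekly_sales)
print(round(next_week, 1))
```

A larger window smooths out week-to-week noise but reacts more slowly to a real change in demand.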
Schema
• The word schema comes from the Greek
word "σχήμα" (skhēma), which means
shape, or more generally, plan.
• Schema may refer to:
– Model or Diagram
– Schematic, a diagram that represents the
elements of a system using abstract, graphic
symbols
Database Schema
• The schema (pronounced skee-ma) of a database is its structure, described in a formal language supported by the database management system (DBMS).
• In a relational database, the schema defines the tables, fields, relationships, views, indexes, packages, procedures, functions, queues, triggers, types, sequences, materialized views, synonyms, and other elements.
• Schemas are generally stored in a data dictionary.
• Although a schema is defined in a text database language, the term is often used to refer to a graphical depiction of the database structure.
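To make this concrete, the sketch below defines a small schema in SQL text and then reads it back from the catalog. It uses Python's built-in `sqlite3` module as a convenient DBMS; the `student` table, the index, and the view are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is defined in a text database language (SQL DDL).
conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        grade      INTEGER
    );
    CREATE INDEX idx_student_name ON student(name);
    CREATE VIEW top_students AS
        SELECT student_id, name FROM student WHERE grade >= 90;
""")

# sqlite_master is SQLite's catalog of schema objects (its data dictionary).
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
for kind, name in objects:
    print(kind, name)
```

Other database systems expose their data dictionary differently, but the idea of querying the schema as data is the same.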
Logical Schema
• A logical schema is a data model of a specific problem domain that is not specific to a particular database management product.
• It is expressed in terms of data structures such as relational tables and columns.
• This is in contrast to a physical data model, which describes the particular physical mechanisms used to capture data in a storage medium.
Physical Schema
• The technical description of a database
where all the physical constructs (such as
indexes) and parameters (such as page
size or buffer management policy) are
specified.
• The physical schema of a database is the
implementation of its logical schema.
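The split between logical and physical schema can be seen by adding an index: the table and the query stay the same, but the way the DBMS reaches the data changes. The sketch below uses Python's `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the table, the index name, and the `plan` helper are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, grade INTEGER)"
)

def plan(sql):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

query = "SELECT * FROM student WHERE name = 'Ada'"

before = plan(query)  # no index on name yet, so the table is scanned
conn.execute("CREATE INDEX idx_student_name ON student(name)")
after = plan(query)   # same logical query, now answered through the index

print(before)
print(after)
```

Nothing about the logical schema (tables and columns) changed between the two plans; only a physical construct was added.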
Entity Relationship Diagrams
Entity Relationship Model
• In 1976, Chen developed the entity-relationship (ER) model, a high-level data model that is useful in developing a conceptual design for a database.
• It is a database modeling method used to produce a logical schema or semantic data model of a system.
Entity
• An entity is a real-world item or concept
that exists on its own.
• For example, a student, team, lab section, or experiment is an entity.
• The set of all possible values for an entity,
such as all possible students, is the entity
type.
• In an ER model, we diagram an entity type
as a rectangle containing the type name
Attribute
• Each entity has attributes, or particular
properties that describe the entity.
• For example, a student has the properties student identification number, name, and grade.
• A particular value of an attribute, such as
93 for the grade, is a value of the attribute.
Attribute
• Most of the data in a database consists of
values of attributes.
• The set of all possible values of an attribute,
such as integers from 0 to 100 for a grade, is
the attribute domain.
• In an ER model, an attribute name appears in an oval that has a line to the corresponding entity box.
Attribute Classification
• A simple attribute is one component that is atomic.
• A composite attribute has multiple components, each of which is atomic or composite.
Attribute Classification
• An entity attribute that holds exactly one value is a single-valued attribute.
• A multi-valued attribute has more than one value for a particular entity.
• A derived attribute can be obtained from other attributes or related entities.
Attribute Classification
• An attribute or set of attributes that uniquely
identifies a particular entity is a key.
• A composite key is a key that is a composite of
several attributes.
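The attribute kinds above can be mirrored in a small Python class for the student entity. This is a sketch only: the `phones` attribute, the split of a name into first and last parts, and the pass mark of 50 are assumptions added for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Name:
    """Composite attribute: made of atomic components."""
    first: str
    last: str

@dataclass
class Student:
    student_id: int   # simple, single-valued attribute; also the key
    name: Name        # composite attribute
    grade: int        # attribute domain assumed to be integers 0..100
    phones: list = field(default_factory=list)  # multi-valued attribute (hypothetical)

    def __post_init__(self):
        # Reject values that fall outside the attribute domain.
        if not 0 <= self.grade <= 100:
            raise ValueError("grade is outside the attribute domain 0..100")

    @property
    def passed(self) -> bool:
        """Derived attribute: computed from grade (pass mark of 50 is assumed)."""
        return self.grade >= 50

s = Student(1, Name("Ada", "Lovelace"), 93, ["555-0100", "555-0101"])
print(s.student_id, s.name.last, s.passed)
```

A derived attribute such as `passed` is not stored; it is recomputed from `grade` whenever it is read.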
Relationship

• A relationship type is a set of associations among entity types.
• For example, the student entity type is related to
the team entity type because each student is a
member of a team
• We use a diamond to illustrate the relationship
type in an ER diagram
Degree of Relation
• The degree of a relationship type is the number
of entity types that participate.
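• Binary: two entity types participate.
• Ternary: three entity types participate.
• Unary: one entity type participates (an entity type is related to itself).
• A role name indicates the purpose of an entity in a relationship.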
Degree of Relation
• A relationship type can also have attributes.
Relationship Cardinality
• One-to-One Relationships
• One-to-Many Relationships
• Many-to-Many Relationships
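One common way to realize these three cardinalities in a relational database is with foreign keys, a uniqueness constraint, and a junction table. The sketch below uses Python's `sqlite3` module; the `locker` table and all sample rows are hypothetical, and other designs are possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE team (team_id INTEGER PRIMARY KEY, name TEXT);

    -- One-to-many: each student row points at exactly one team.
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT,
        team_id    INTEGER NOT NULL REFERENCES team(team_id)
    );

    -- One-to-one: UNIQUE limits each student to at most one locker.
    CREATE TABLE locker (
        locker_id  INTEGER PRIMARY KEY,
        student_id INTEGER UNIQUE REFERENCES student(student_id)
    );

    CREATE TABLE experiment (experiment_id INTEGER PRIMARY KEY, title TEXT);

    -- Many-to-many: a junction table pairs students with experiments.
    CREATE TABLE student_experiment (
        student_id    INTEGER REFERENCES student(student_id),
        experiment_id INTEGER REFERENCES experiment(experiment_id),
        PRIMARY KEY (student_id, experiment_id)
    );

    INSERT INTO team VALUES (1, 'Red');
    INSERT INTO student VALUES (1, 'Ada', 1), (2, 'Alan', 1);
    INSERT INTO experiment VALUES (1, 'Titration'), (2, 'Pendulum');
    INSERT INTO student_experiment VALUES (1, 1), (1, 2), (2, 1);
""")

team_size = conn.execute(
    "SELECT COUNT(*) FROM student WHERE team_id = 1"
).fetchone()[0]
pairings = conn.execute("SELECT COUNT(*) FROM student_experiment").fetchone()[0]
print(team_size, pairings)
```

The junction table's primary key is also an example of a composite key: neither column alone identifies a row, but the pair does.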
