Data Mining

MODULE 1

Overview and concepts Data Warehousing and Business Intelligence:


Why reporting and Analysing data, Raw data to valuable information-Lifecycle of Data
- What is Business Intelligence - BI and DW in today’s perspective - What is data
warehousing - The building Blocks: Defining Features - Data warehouses and data
marts - Overview of the components - Metadata in the data warehouse - Need for data
warehousing - Basic elements of data warehousing - trends in data warehousing.
The Architecture of BI and DW
BI and DW architectures and its types - Relation between BI and DW - OLAP (Online
analytical processing) definitions - Difference between OLAP and OLTP - Dimensional
analysis - What are cubes? Drill-down and roll-up - slice and dice or rotation - OLAP
models - ROLAP versus MOLAP - defining schemas: Stars, snowflakes and fact
constellations

Why reporting and Analysing data

The amount of data stored in industry databases is growing exponentially. Raw data by
itself does not provide much useful information, so it needs to be converted into meaningful
information for businesses and their customers.
General methods of analysis/reporting can be classified into two categories:
1. Non-parametric analysis 2. Parametric analysis
1. Non-parametric analysis
It includes information that has not been, or cannot be, rigorously processed or analysed. It is
mostly used by managers, as it usually requires no special technical know-how to interpret.
Financial data mostly falls into this category.
2. Parametric analysis

It includes very detailed information about the behaviour of the product, based on the
process used to gather the data. It is mostly used by engineers.
Business Intelligence in today’s perspective
“Set of methodologies, processes, architectures, and technologies that transform raw data
into meaningful and useful information that allows business users to make informed
business decisions with real-time data that can put a company ahead of its competitors”.
Raw data to valuable information
Lifecycle of Data

Business Understanding: Understanding every aspect of the topic and working accordingly. It is
the most important step in the lifecycle, since how the whole cycle works depends on the
knowledge gained at this stage by going through various data sets and cases.

Data Selection: Choosing the best data set from which data can be extracted most beneficially.
The basic function of this phase of the data cycle is to choose data that will make the system
more efficient and can satisfy every case accordingly.

Data Preparation: This process includes preparing the extracted data to be used in the next
stages. The data selected in the previous phase is not necessarily in a ready-to-use state; some
data needs to be processed before use. So in the data preparation phase we transform the data
according to our requirements.

Modeling: It is the process of remodeling the given data according to the requirements of the
user. After the data is properly understood and cleaned, a suitable model is selected. Selecting a
model depends entirely on the type of data that has been extracted.

Evaluation: It includes going through every aspect of the process to check for possible faults or
data leakage. It is one of the most necessary processes in data mining. Fault analysis is the basic
function of this phase. Every condition is checked to see whether it fulfils the requirements or
not. If not, the process is repeated and processing starts again.

Deployment: Once everything has passed evaluation, the data is ready to be deployed and can
be used in further processes.

What is Business Intelligence


Businesses and organizations face constantly changing circumstances and challenges; nothing
remains static for long. Decisions have to be taken continuously by the business or
organization to adjust its actions and stay profitable.
BI(Business Intelligence) is a set of processes, architectures, and technologies that convert raw data
into meaningful information that drives profitable business actions. It is a suite of software and
services to transform data into actionable intelligence and knowledge.

BI has a direct impact on organization’s strategic, tactical and operational business decisions. BI
supports fact-based decision making using historical data rather than assumptions and gut feeling.

BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts
to provide users with detailed intelligence about the nature of the business.

Why is BI important?

Measurement: creating KPIs (Key Performance Indicators) based on historical data.

Identify and set benchmarks for varied processes.

With BI systems, organizations can identify market trends and spot business problems that need
to be addressed.

BI helps with data visualization, which enhances data quality and thereby the quality of
decision making.

BI systems can be used not just by large enterprises but also by SMEs (Small and Medium
Enterprises).

Reduce labor costs by generating reports automatically.

Make information actionable, as users can get data as per their needs.
Decision makers can take better decisions.
Multiple data sources can be combined through BI, so decisions can be taken faster.

BI and DW in today’s perspective


1. Data Visualization: Visual presentation of data is very important for understanding
business problems and trends.
2. Data extraction and aggregation: Data is retrieved from various database sources in
different forms such as files, spreadsheets, and unstructured data. It is necessary to
aggregate the extracted data into the desired format.
3. Scalability: Day by day, the number of users and the volume of data keep
increasing.
4. Interoperability: Interface protocols must be designed to connect many databases.
5. Interface: Interfaces should be compatible across various devices and browsers with
the same functionality.
What is data warehousing
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data
warehouse systems are valuable tools in today’s competitive, fast-evolving world. In the last
several years, many firms have spent millions of dollars in building enterprise-wide data
warehouses. Many people feel that with competition mounting in every industry, data
warehousing is the latest must-have marketing weapon—a way to retain customers by
learning more about their needs.
“Then, what exactly is a data warehouse?” Data warehouses have been defined in many
ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data
warehouse refers to a data repository that is maintained separately from an organization's
operational databases. Data warehouse systems allow for integration of a variety of
application systems. They support information processing by providing a solid platform
of consolidated historic data for analysis.
According to William H. Inmon, a leading architect in the construction of data
warehouse systems, “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision making process”
[Inm96]. This short but comprehensive definition presents the major features of a data
warehouse. The four keywords—subject-oriented, integrated, time-variant, and
nonvolatile—distinguish data warehouses from other data repository systems, such as
relational database systems, transaction processing systems, and file systems.

KEY FEATURES
Subject-oriented: A data warehouse is organized around major subjects such as customer,
supplier, product, and sales. Rather than concentrating on the day-to-day operations and
transaction processing of an organization, a data warehouse focuses on the modeling and
analysis of data for decision makers. Hence, data warehouses typically provide a simple
and concise view of particular subject issues by excluding data that are not useful in the
decision support process.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
Time-variant: Data are stored to provide information from an historic perspective (e.g., the
past 5–10 years). Every key structure in the data warehouse contains, either implicitly or
explicitly, a time element.
Nonvolatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a data
warehouse does not require transaction processing, recovery, and concurrency control
mechanisms. It usually requires only two operations in data
accessing: initial loading of data and access of data.

The building Blocks: Defining Features - Data warehouses and data marts
Types of Data warehouses
• Enterprise Data Warehouse
• Operational Data Store
• Data Mart
Types of Data warehouses
Enterprise Data Warehouse:
• Enterprise Data Warehouse is a centralized warehouse, which provides decision support
service across the enterprise.
• It offers a unified approach to organizing and representing data.
• It also provides the ability to classify data according to the subject and give access
according to those divisions.
Operational Data Store:
• An Operational Data Store, also called an ODS, is a data store required when neither the data
warehouse nor the OLTP systems support the organization's reporting needs.
• It is widely preferred for routine activities like storing records.
• An ODS is refreshed in real time.
Data Mart:
• A Data Mart is a subset of the data warehouse.
• It is specially designed for specific segments such as sales or finance.
• In an independent data mart, data can be collected directly from sources.

INTRODUCTION TO DATA MARTS


• A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as Sales or Finance or Marketing.
• Data marts are often built and controlled by a single department within an organization;
given their single-subject focus, data marts usually draw data from only a few sources.
• The sources could be internal operational systems, a central data warehouse, or external
data.
• A data mart is a repository of data that is designed to serve a particular community of
knowledge workers.
• The difference between a data warehouse and a data mart can be confusing because the
two terms are sometimes used incorrectly as synonyms.
• A data warehouse is a central repository for all an organization's data.
• The goal of a data mart, however, is to meet the particular demands of a specific group
of users within the organization, such as human resource management (HRM).
• Generally, an organization's data marts are subsets of the
organization's data warehouse.

REASONS FOR CREATING A DATA MARTS


• Easy access to frequently needed data
• Creates collective view by a group of users
• Improves end-user response time
• Ease of creation
• Lower cost than implementing a full data warehouse
• Potential users are more clearly defined than in a full data warehouse
• Contains only business essential data and is less cluttered

Data warehouse or Data Mart?


Data Warehouse:
• Holds multiple subject areas
• Holds very detailed information
• Works to integrate all data sources
• Does not necessarily use a dimensional model but feeds dimensional models.
Data Mart
• Often holds only one subject area- for example, Finance, or Sales
• May hold more summarized data (although many hold full detail)
• Concentrates on integrating information from a given subject area or set of source
systems
• Is built focused on a dimensional model using a star schema.

TYPES OF DATA MARTS


There are three kinds of Data-Marts (DMs):
• Dependent DM: Created from a data warehouse into a separate physical data
store (built over the data warehouse physically).
• Independent DM: Created from operational systems and has a separate physical
data store.
• Logical or Hybrid DM: Exists as a subset of the data warehouse (built over the data
warehouse logically).
Dependent data mart:
• A dependent data mart allows you to unite your organization's data in one data
warehouse.
• This gives you the usual advantages of centralization.
Independent data mart:
• An independent data mart is created without the use of a central data warehouse.
• This could be desirable for smaller groups within an organization.
Hybrid data mart:
• A hybrid data mart allows you to combine input from sources other than a data
warehouse.
• This could be useful for many situations, especially when you need ad hoc
integration, such as after a new group or product is added to the organization.

Overview of the components

1. Production Data
(financial systems, manufacturing systems, systems along the supply chain, and customer
relationship management systems)
• In operational systems, information queries are narrow.
• The queries are all predictable (e.g., the name and address of a single customer, or the orders
placed by a single customer in a single week).
• The significant and disturbing characteristic of production data is disparity.
• A great challenge is to standardize and transform the disparate data from the various
production systems, convert the data, and integrate the pieces into useful data for storage
in the data warehouse.
• It is the integration of these various sources that provides value to the data in the data
warehouse.
2. Internal Data

• Users keep their “private” spreadsheets, documents, customer profiles, and sometimes
even departmental databases.
• Profiles of individual customers become very important for consideration
• The IT department must work with the user departments to gather the internal data
• Internal data adds additional complexity to the process of transforming and integrating
the data before it can be stored in the data warehouse.
• Determine strategies for collecting data from spreadsheets, find ways of taking data from
textual documents, and tie into departmental databases to gather pertinent data from
those sources

Archived Data
• Different methods of archiving exist.
• There are staged archival methods.
• At the first stage, recent data is archived to a separate archival database that may still be
online.
• At the second stage, the older data is archived to flat files on disk storage.
• At the next stage, the oldest data is archived to tape cartridges or microfilm and even
kept off-site.
• A data warehouse keeps historical snapshots of data
• Depending on your data warehouse requirements, you have to include sufficient
historical data. This type of data is useful for discerning patterns and analyzing trends.
External Data
• Data warehouse of a car rental company contains data on the current production
schedules of the leading automobile manufacturers. This external data in the data
warehouse helps the car rental company plan for its fleet management.
• In order to spot industry trends and compare performance against other organizations,
you need data from external sources
• Data from outside sources do not conform to your formats.
• Devise ways to convert data into your internal formats and data types.
• Organize the data transmissions from the external sources. Some sources may provide
information at regular, stipulated intervals. Others may give you the data on request. We
need to accommodate the variations.

Data staging
• Data staging provides a place and an area with a set of functions to clean, change,
combine, convert, deduplicate, and prepare source data for storage and use in the data
warehouse.
• Why do you need a separate place or component to perform the data preparation?
• Can you not move the data from the various sources into the data warehouse storage
itself and then prepare the data?
Data Extraction:
• Tools/ in-house programs
• Data warehouse implementation teams extract the source into a separate physical
environment from which moving the data into the data warehouse would be easier.
• Extract the source data into a group of flat files, or a data-staging relational database, or
a combination of both.
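As a rough illustration of this extraction step, the sketch below (Python with the standard sqlite3 and csv modules) pulls a source table into a flat file in a staging area. The database file, table name, and staging path are hypothetical, not taken from the text.

# Minimal sketch: extract one source table into a staging-area flat file.
# "orders_oltp.db", "orders", and "staging/orders.csv" are invented names.
import csv
import sqlite3

def extract_to_flat_file(source_db: str, table: str, staging_path: str) -> None:
    conn = sqlite3.connect(source_db)
    cursor = conn.execute(f"SELECT * FROM {table}")   # table name assumed trusted
    with open(staging_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)                                  # data rows
    conn.close()

# Usage (paths and table name are illustrative):
# extract_to_flat_file("orders_oltp.db", "orders", "staging/orders.csv")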

Data Transformation
• Clean the data extracted from each source.
✓ Cleaning may just be correction of misspellings, or may include resolution of conflicts
between state codes and zip codes in the source data, or may deal with providing default
values for missing data elements, or elimination of duplicates when you bring in the
same data from multiple source systems.
• Standardization of data elements forms a large part of data transformation.
✓ standardize the data types and field lengths for same data elements retrieved from the
various sources.
✓ Semantic standardization is another major task.
✓ Resolve synonyms and homonyms. When two or more terms from different source
systems mean the same thing, you resolve the synonyms.
✓ When a single term means many different things in different source systems, you resolve
the homonym.
Data transformation involves many forms of combining pieces of data from the different
sources.
• Combine data from a single source record or related data elements from many source
records.
• Involves purging source data that is not useful and separating out source records into
new combinations.
• Sorting and merging of data takes place on a large scale in the data staging area.
• Keys chosen for the operational systems are often field values with built-in meanings,
e.g., a product key.
• Data transformation also includes the assignment of surrogate keys derived from the
source system primary keys.
• Data transformation function would include appropriate summarization

When the data transformation function ends, we have a collection of integrated data that is
cleaned, standardized, and summarized.
Data is ready to load into each data set in your data warehouse.
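The following is a minimal sketch, using pandas, of the kind of cleaning, standardization, and surrogate-key assignment described above. The column names (cust_name, state, zip) and the key scheme are illustrative assumptions, not part of the text.

# Staging-area transformation sketch: cleaning, standardization, surrogate keys.
import pandas as pd

def transform_source(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Cleaning: default values for missing elements, drop exact duplicates
    out["state"] = out["state"].fillna("UNKNOWN")
    out = out.drop_duplicates()

    # Standardization: consistent formats for the same data elements
    out["cust_name"] = out["cust_name"].str.strip().str.upper()
    out["zip"] = out["zip"].astype(str).str.zfill(5)

    # Surrogate keys derived from (but not equal to) the source primary key
    out = out.reset_index(drop=True)
    out["customer_sk"] = out.index + 1
    return out

# Usage with a small invented source extract
raw = pd.DataFrame({
    "cust_id": [101, 102, 102],
    "cust_name": [" alice ", "Bob", "Bob"],
    "state": ["IL", None, None],
    "zip": ["60601", "945", "945"],
})
print(transform_source(raw))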

Data Storage Component


• Data storage in operational systems vs. the data warehouse
• Data storage for the data warehouse is kept separate from the data storage for operational
systems.
• In the databases that support operational systems, updates to data happen as transactions occur.
• These transactions hit the databases in a random fashion.
• How and when the transactions change the data in the databases is not completely within
our control.
• The data in the operational databases could change from moment to moment.
• Analysts use the data in the data warehouse for analysis; they need to know that the data
is stable and that it represents snapshots at specified periods.
• Data warehouses are “read-only” data repositories
• Data warehouses employ relational/multidimensional database management systems.
• Data extracted from the data warehouse storage is aggregated in many ways and the
summary data is kept in the multidimensional databases (MDDBs).

Information Delivery Component


• Who are the users that need information from the data warehouse?
• novice user, casual user, & power user

Metadata Component
• Similar to the data dictionary /data catalog in a database management system
Information about the logical data structures, the information about the files and addresses, the
information about the indexes, and so on.
Contains data about the data in the database.
Management and Control Component
• Coordinates the services and activities within the data warehouse.
• Control the data transformation and the data transfer into the data warehouse storage.
• Moderates the information delivery to the users.
• Works with the database management systems and enables data to be properly stored in
the repositories.
• Monitors the movement of data into the staging area and from there into the data
warehouse storage itself.
• Interacts with the metadata component to perform the management and control
functions.
• The metadata component contains information about the data warehouse itself; hence, the
metadata is the source of information for the management module.

METADATA IN THE DATA WAREHOUSE


• Metadata component serves as a directory of the contents of your data warehouse.
Types of Metadata in a data warehouse
➢ Operational metadata :
• Source data come from different data structures and have various field lengths and data types.
• Operational metadata contains all of this information about the operational data sources.
➢ Extraction and transformation metadata
• Extraction frequencies, extraction methods, and business rules for the data extraction
• Contains information about all the data transformations that take place in the data staging
area
➢ End-user metadata
• navigational map of the data warehouse.
• enables the end-users to find information from the data warehouse.
• Allows the end-users to use their own business terminology and look for information in
those ways in which they normally think of the business

Why is metadata especially important in a data warehouse?


• First, it acts as the glue that connects all parts of the data warehouse.
• Provides information about the contents and structures to the developers.
• Finally, it opens the door to the end-users and makes the contents recognizable in their
own terms
Metadata – Example
To Describe Meta Data of a Book Store:
• Name of Book
• Summary of the Book
• The Date of publication
• High level description of what it contains
• How you can find the book
• Author of the book
• Whether the book is available OR not
The information helps you to:
• Search for the book
• Access the book
• Understand the book before you access OR buy it.
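To make the analogy concrete, the book-store attributes listed above could be held in a simple record, as in the Python sketch below; the field names and values are invented for illustration.

# Illustrative only: the book-store metadata attributes above as one record.
book_metadata = {
    "name": "Example Book on Data Warehousing",
    "summary": "Introductory notes on data warehousing and OLAP.",
    "publication_date": "2020-01-01",
    "description": "High-level overview of warehouse components and schemas.",
    "location": "Shelf 12, Section: Computer Science",
    "author": "A. Author",
    "available": True,
}

# Searching, accessing, and understanding the book all work off this record,
# just as warehouse metadata lets users find and interpret warehouse data.
print(book_metadata["location"], book_metadata["available"])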

NEED FOR DATA WAREHOUSING


The first question that arises is: why do we need a Data Warehouse, and why spend lots of money
and time on it, when BI tools could be fed directly from the transaction systems? There are many
limitations to that approach, and enterprises gradually came to understand the need for a Data
Warehouse. Let’s see some of the points that make using a Data Warehouse so important for
Business Analytics.

• It serves as a Single Source of Truth for all the data within the company. Using a Data
Warehouse eliminates the following issues:
o Data quality issues
o Unstable data in reports
o Data Inconsistency
o Low query performance
• Data Warehouse gives the ability to quickly run analysis on huge volumes of datasets.
• If there is any change in the structure of the data available in the operational or
transactional Databases, it will not break the business reports running on top of the warehouse,
because those Databases are not directly connected to the BI or Reporting tools.
• Cloud Data Warehouse (such as Amazon Redshift and Google BigQuery) offer an added
advantage that you need not invest in them upfront. Instead, you pay as you go as the
size of your data increases. You can refer to this article on Amazon Redshift vs Google
BigQuery for a comparison of the two.
• When companies want to make the data available for all, they will understand the need
for Data Warehouse. You can expose the data within the company for analysis. While
you do so you can hide certain sensitive information (such as PII – Personally
Identifiable Information about your customers, or Partners).
• There is always a need for a Data Warehouse as the complexity of queries increases and
users need faster query processing. Transactional Databases are built to store data in a
normalized form, whereas fast query processing can be achieved with the denormalized data
that is available in a Data Warehouse.

BASIC ELEMENTS OF DATA WAREHOUSING - TRENDS IN DATA WAREHOUSING

ARCHITECTURAL TYPES

Centralized Data Warehouse


• Enterprise-level information requirements.
• An overall infrastructure is established.
• Atomic level normalized data at the lowest level of granularity is stored in the third
normal form.
• Some summarized data is included.
• Queries and applications access the normalized data in the central data warehouse.
• No separate data marts
Independent Data Marts
• This architectural type evolves in companies where the organizational units develop their
own data marts for their own specific purposes.
• Although each data mart serves the particular organizational unit, these separate data
marts do not provide “a single version of the truth.”
• The data marts are independent of one another.
• Different data marts are likely to have inconsistent data definitions and standards.
• Such variances hinder analysis of data across data marts.
• For example, if there are two independent data marts, one for sales and the other for
shipments, although sales and shipments are related subjects, the independent data marts
would make it difficult to analyze sales and shipments data together
Federated
• Some companies get into data warehousing with an existing legacy of an assortment of
decision-support structures in the form of operational systems, extracted datasets,
primitive data marts, and so on.
• For such companies, it may not be prudent to discard all that huge investment and start
from scratch.
• The practical solution is a federated architectural type where data may be physically or
logically integrated through shared key fields, overall global metadata, distributed
queries, and such other methods.
• In this architectural type, there is no one overall data warehouse.

Hub-and-Spoke
• This is the Inmon Corporate Information Factory approach.
• Similar to the centralized data warehouse architecture, here too is an overall enterprise-
wide data warehouse.
• Atomic data in the third normal form is stored in the centralized data warehouse.
• The major and useful difference is the presence of dependent data marts in this
architectural type.
• Dependent data marts obtain data from the centralized data warehouse.
• The centralized data warehouse forms the hub to feed data to the data marts on the
spokes.
• The dependent data marts may be developed for a variety of purposes.
• Each dependent data mart may have normalized, denormalized, summarized, or
dimensional data structures based on individual requirements.
• Most queries are directed to the dependent data marts although the centralized data
warehouse may itself be used for querying.
• This architectural type results from adopting a top-down approach to data warehouse
development
Data-Mart Bus
• This is the Kimball conformed supermarts approach.
• Initiate with analyzing requirements for a specific business subject such as orders,
shipments, billings, insurance claims, car rentals, and so on.
• Build the first data mart (supermart) using business dimensions and metrics.
• These business dimensions will be shared in the future data marts.
• The principal notion is that by conforming dimensions among the various data marts, the
result would be logically integrated supermarts that will provide an enterprise view of
the data.
• The data marts contain atomic data organized as a dimensional data model.
• This architectural type results from adopting an enhanced bottom-up approach to data
warehouse development.

BENEFITS OF DATA WAREHOUSE


• Understand business trends and make better forecasting decisions.
• Data Warehouses are designed to perform well with enormous amounts of data.
• The structure of data warehouses is more accessible for end-users to navigate,
understand, and query.
• Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.
• Data warehousing is an efficient method to manage demand for lots of information from
lots of users.
• Data warehousing provides the capability to analyze a large amount of historical data.
The Architecture of BI and DW
BI and DW architectures and its types - Relation between BI and DW - OLAP (Online
analytical processing) definitions - Difference between OLAP and OLTP - Dimensional
analysis - What are cubes? Drill-down and roll-up - slice and dice or rotation - OLAP
models - ROLAP versus MOLAP - defining schemas: Stars, snowflakes and fact
constellations

BI and DW architectures and its types

How does Data Warehouse work?


• A Data Warehouse is like a central repository where data comes from different
data sources
Data could be in one of the following formats:
• Structured
• Semi-structured
• Unstructured data
• The data is processed and transformed so that users and analysts can access the
processed data in the Data Warehouse through Business Intelligence tools, SQL
clients, and spreadsheets.
• A data warehouse merges all information coming from various sources into one
global and complete database. By merging all of this information in one place, it
becomes easier for an organization to analyze its customers more
comprehensively.

RELATION BETWEEN BUSINESS INTELLIGENCE AND DATA WAREHOUSE

1. Business Intelligence: It is a set of tools and methods to analyze data and discover, extract
and formulate actionable information that would be useful for business decisions.
Data Warehouse: It is a system for storage of data from various sources in an orderly manner
so as to facilitate business-minded reads and writes.

2. Business Intelligence: It is a Decision Support System (DSS).
Data Warehouse: It is a data storage system.

3. Business Intelligence: Serves at the front end.
Data Warehouse: Serves at the back end.

4. Business Intelligence: Collects data from the data warehouse for analysis.
Data Warehouse: Collects data from various disparate sources and organises it for efficient
BI analysis.

5. Business Intelligence: Comprises business reports, charts, graphs, etc.
Data Warehouse: Comprises data held in “fact tables” and “dimensions” with business
meaning incorporated into them.

6. Business Intelligence: BI as such doesn’t have much use without a data warehouse, as large
amounts of varied and useful data are required for analysis.
Data Warehouse: BI is one of many use-cases for data warehouses; there are more
applications for this system.

7. Business Intelligence: Handled by executives and analysts relatively higher up in the
hierarchy.
Data Warehouse: Handled and maintained by data engineers and system administrators who
report to/work for the executives and analysts.

8. Business Intelligence: Examples of BI software: SAP, Sisense, Datapine, Looker, etc.
Data Warehouse: Examples of Data warehouse software: BigQuery, Snowflake, Amazon
Redshift, Panoply, etc.

Data Warehousing: A Multitiered Architecture


1. The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources (e.g., customer profile information
provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse
2. The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or (2) a
multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information providers, and is
cross-functional in scope. It typically contains detailed data as well as summarized data,
and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. For example,
a marketing data mart may confine its subjects to customer, item, and sales. The data
contained in data marts tend to be summarized.
Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.

Data Cube: A Multidimensional Data Model

“What is a data cube?”


A data cube allows data to be modeled and viewed in multiple dimensions. It is defined
by dimensions and facts. In general terms, dimensions are the perspectives or entities
with respect to which an organization wants to keep records. For example, All
Electronics may create a sales data warehouse in order to keep records of the store’s
sales with respect to the dimensions time, item, branch, and location. These dimensions
allow the store to keep track of things like monthly sales of items and the branches and
locations at which the items were sold. Each dimension may have a table associated
with it, called a dimension table, which further describes the dimension. For example, a
dimension table for item may contain the attributes item name, brand, and type.
Dimension tables can be specified by users or experts, or automatically generated and
adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such
as sales. This theme is represented by a fact table. Facts are numeric measures. Think
of them as the quantities by which we want to analyze relationships between dimensions.
Examples of facts for a sales data warehouse include dollars sold (sales amount
in dollars), units sold (number of units sold), and amount budgeted. The fact table
contains the names of the facts, or measures, as well as keys to each of the related
dimension tables.
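A minimal sketch of this idea, with pandas DataFrames standing in for the tables: the sales fact table holds keys to the dimensions plus the numeric measures, and a dimension table describes one dimension. The sample rows are invented for illustration.

# Fact table + one dimension table, following the sales example above.
import pandas as pd

item_dim = pd.DataFrame({
    "item_key": [1, 2],
    "item_name": ["MegaView TV", "PocketPhone"],
    "brand": ["MegaView", "Pocket"],
    "type": ["home entertainment", "phone"],
})

sales_fact = pd.DataFrame({
    "time_key":     [10, 10, 11],
    "item_key":     [1, 2, 1],
    "branch_key":   [1, 1, 2],
    "location_key": [5, 5, 6],
    "dollars_sold": [1200.0, 300.0, 2400.0],   # measure
    "units_sold":   [1, 1, 2],                 # measure
})

# Joining on item_key attaches descriptive attributes for analysis,
# e.g. monthly sales of items by item type.
report = sales_fact.merge(item_dim, on="item_key")
print(report.groupby("type")["dollars_sold"].sum())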
OLAP (Online analytical processing) definitions
• Demand for Online Analytical Processing
• Need for Multidimensional Analysis
- business model of a large retail operation,
- time is a critical dimension
• Fast Access and Powerful Calculations
-business analyst looking for reasons why profitability dipped sharply in the recent
months in the entire enterprise
OLAP AND OLAP SERVER DEFINITIONS
On-Line Analytical Processing (OLAP) is a category of software technology that
enables analysts, managers and executives to gain insight into data through fast,
consistent, interactive access to a wide variety of possible views of information that has
been transformed from raw data to reflect the real dimensionality of the enterprise as
understood by the user.
Oracle OLAP uses a multidimensional data model to perform complex statistical,
mathematical, and financial analysis of historical data in real time.
• OLAP databases are divided into one or more cubes, and each cube is organized
and designed by a cube administrator to fit the way that you retrieve and analyze
data so that it is easier to create and use the PivotTable reports and PivotChart
reports that you need.
• OLAP functionality is characterized by dynamic multi-dimensional analysis of
consolidated enterprise data supporting end user analytical and navigational
activities.
• OLAP is implemented in a multi-user client/server mode and offers consistently
rapid response to queries, regardless of database size and complexity.
OLAP SERVER
• An OLAP server is a high-capacity, multi-user data manipulation engine
specifically designed to support and operate on multi-dimensional data
structures.
• A multi-dimensional structure is arranged so that every data item is located and
accessed based on the intersection of the dimension members which define that
item.

What is Online Analytical Processing (OLAP)?


Online Analytical Processing (OLAP) databases facilitate business-intelligence queries.
OLAP is a database technology that has been optimized for querying and reporting,
instead of processing transactions.
The source data for OLAP is Online Transactional Processing (OLTP) databases that
are commonly stored in data warehouses. OLAP data is derived from this historical
data, and aggregated into structures that permit sophisticated analysis.
OLAP data is also organized hierarchically and stored in cubes instead of tables. It is a
sophisticated technology that uses multidimensional structures to provide rapid access
to data for analysis.
Why a separate OLAP tool?
o Empowers end users to do their own analysis
o Frees up the IS backlog of report requests
o Ease of use
o No knowledge of tables or SQL required
Why is OLAP useful?
o Facilitates multidimensional data analysis by pre-computing aggregates across
many sets of dimensions
o Provides for:
o Greater speed and responsiveness
o Improved user interactivity
Features of OLAP
• Enables analysts, executives, and managers to gain useful insights from the
presentation of data.
• Can reorganize metrics along several dimensions and allow data to be viewed
from different perspectives.
• Supports multidimensional analysis.
• Is able to drill down or roll up within each dimension.
• Is capable of applying mathematical formulas and calculations to measures.
• Provides fast response, facilitating speed-of-thought analysis.
• Complements the use of other information delivery techniques such as data
mining.
• Improves the comprehension of result sets through visual presentations using
graphs and charts.
• Can be implemented on the Web.
• Designed for highly interactive analysis.
Difference between OLAP and OLTP

DIMENSIONAL ANALYSIS - WHAT ARE CUBES?

DRILL-DOWN AND ROLL-UP - SLICE AND DICE OR ROTATION


Typical OLAP Operations
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
A number of OLAP data cube operations exist to materialize these different views,
allowing interactive querying and analysis of the data at hand.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors)
performs aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.
Figure shows the result of a roll-up operation performed on the central cube by climbing
up the concept hierarchy for location. This hierarchy was defined as the total order “street
< city < province or state < country.” The roll-up operation shown aggregates the data
by ascending the location hierarchy from the level of city to the level of country. In other
words, rather than grouping the data by city, the resulting cube groups the data by
country.
When roll-up is performed by dimension reduction, one or more dimensions are removed
from the given cube.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.
Slice and dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a sub cube. Figure shows a slice operation where the sales data are selected from
the central cube for the dimension time using the criterion time = “Q1.” The dice operation
defines a sub cube by performing a selection on two or more dimensions. Figure shows a dice
operation on the central cube based on the following selection criteria that involve three
dimensions: (location = “Toronto”or “Vancouver”) and (time = “Q1” or “Q2”) and (item =
“home entertainment” or “computer”).
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in
view to provide an alternative data presentation.
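The sketch below illustrates these operations on a tiny invented data set, with a pandas DataFrame standing in for the cube; real OLAP servers operate on multidimensional structures, so this is only an analogy for the operations, not an OLAP implementation.

# Roll-up, slice, dice, and pivot on a toy "cube" (invented sample data).
import pandas as pd

cube = pd.DataFrame({
    "city":    ["Toronto", "Toronto", "Vancouver", "Chicago"],
    "country": ["Canada", "Canada", "Canada", "USA"],
    "time":    ["Q1", "Q2", "Q1", "Q1"],
    "item":    ["computer", "computer", "home entertainment", "computer"],
    "dollars_sold": [1000, 1500, 600, 900],
})

# Roll-up: climb the location hierarchy from city to country
rollup = cube.groupby(["country", "time", "item"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (time = "Q1")
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions
dice = cube[cube["city"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["home entertainment", "computer"])]

# Pivot (rotate): swap the axes of the presentation
pivot = cube.pivot_table(index="city", columns="time",
                         values="dollars_sold", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")

Drill-down would be the reverse of the roll-up shown: grouping by city (or an even finer level such as street) instead of country.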

OLAP models - ROLAP versus MOLAP - defining schemas: Stars, snowflakes and
fact constellations
OLAP databases contain two basic types of data: measures, which are numeric data,
the quantities and averages that you use to make informed business decisions, and
dimensions, which are the categories that you use to organize these measures.
OLAP databases help organize data by many levels of detail.
Types of OLAP Servers
We have four types of OLAP servers:
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers
Relational OLAP (ROLAP)

Relational On-Line Analytical Processing (ROLAP) works mainly on data that resides in a
relational database, where the base data and dimension tables are stored as relational tables.

ROLAP servers are placed between the relational back-end server and client front-end tools.

ROLAP servers use RDBMS to store and manage warehouse data, and OLAP middleware to
support missing pieces.

Advantages of ROLAP

• ROLAP can handle large amounts of data.


• Can be used with data warehouse and OLTP systems.

Disadvantages of ROLAP

• Limited by SQL functionalities.


• Hard to maintain aggregate tables.

Multidimensional OLAP (MOLAP)


Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views
of data through array-based multidimensional storage engines.
Advantages of MOLAP

• Optimal for slice and dice operations.
• Performs better than ROLAP when data is dense (heavy).
• Can perform complex calculations.

Disadvantages of MOLAP

• Difficult to change dimensions without re-aggregation.
• MOLAP can handle only a limited amount of data.

Hybrid OLAP (HOLAP)

Hybrid On-Line Analytical Processing (HOLAP) is a combination of ROLAP and MOLAP.

HOLAP provides the greater scalability of ROLAP and the faster computation of MOLAP.

Advantages of HOLAP

• HOLAP provides the advantages of both MOLAP and ROLAP.
• Provides fast access at all levels of aggregation.

Disadvantages of HOLAP
HOLAP architecture is very complex because it supports both MOLAP and ROLAP servers.
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional
Data Models
The entity-relationship data model is commonly used in the design of relational databases,
where a database schema consists of a set of entities and the relationships between them. Such
a data model is appropriate for online transaction processing. A data warehouse, however,
requires a concise, subject-oriented schema that facilitates online data analysis.
The most popular data model for a data warehouse is a multidimensional model, which can
exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let’s look
at each of these.
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
Sales are considered along four dimensions: time, item, branch, and location. The schema
contains a central fact table for sales that contains keys to each of the four dimensions, along
with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension
identifiers (e.g., time key and item key) are system-generated identifiers. In the star schema, each
dimension is represented by only one table, and each table contains a set of attributes. For
example, the location dimension table contains the attribute set {location key, street, city,
province or state, country}. This constraint may introduce some redundancy. For example,
“Urbana” and “Chicago” are both cities in the state of Illinois, USA. Entries for such cities in
the location dimension table will create redundancy among the attributes province or state and
country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes
within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
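As an illustration, the star schema described above might be declared as in the sketch below, which issues SQLite DDL from Python; the column data types and the exact attribute lists inside each dimension are assumptions for the example.

# Star schema sketch: four dimension tables around a central sales fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                           quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                           branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, province_or_state TEXT, country TEXT);

-- Central fact table: one key per dimension plus the numeric measures
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
print("star schema created")

In the snowflake variant discussed next, item_dim would instead carry a supplier_key referencing a separate supplier table, and location_dim could be split into location and city tables.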
Snowflake schema: The snowflake schema is a variant of the star schema model, in which some
dimension tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce redundancies. Such a
table is easy to maintain and saves storage space.
The main difference between the two schemas is in the definition of dimension tables. The
single dimension table for item in the star schema is normalized in the snowflake schema,
resulting in new item and supplier tables. For example, the item dimension table now contains
the attributes item key, item name, brand, type, and supplier key, where supplier key is
linked to the supplier dimension table, containing supplier key and supplier type information.
Similarly, the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to the city
dimension. Notice that, when desirable, further normalization can be performed on
province or state and country in the snowflake schema
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called
a galaxy schema or a fact constellation. This schema specifies two fact tables, sales and
shipping. The sales table definition is identical to that of the star schema . The shipping table
has five dimensions, or keys—item key, time key, shipper key, from location, and to location—
and two measures—dollars cost and units shipped. A fact constellation schema allows
dimension tables to be shared between fact tables. For example, the dimensions tables for time,
item, and location are shared between the sales and shipping fact tables.
Dimensions: The Role of Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts. Consider a concept hierarchy for the dimension location. City
values for location include Vancouver, Toronto, New York, and Chicago.
Each city, however, can be mapped to the province or state to which it belongs. For example,
Vancouver can be mapped to British Columbia, and Chicago to Illinois. The provinces and
states can in turn be mapped to the country (e.g., Canada or the United States) to which they
belong. These mappings form a concept hierarchy for the dimension location, mapping a set of
low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
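The location hierarchy in this example can be written down directly as a pair of mappings; the short Python sketch below is illustrative only.

# Concept hierarchy for location: city -> province/state -> country.
city_to_state = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
state_to_country = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

def roll_up_city(city: str) -> str:
    """Map a low-level concept (city) to a higher-level one (country)."""
    return state_to_country[city_to_state[city]]

print(roll_up_city("Chicago"))  # -> USA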
With multidimensional data stores, the storage utilization may be low if the data set is
sparse.

MODULE 2
Introduction to data mining (DM):
Motivation for Data Mining - Data Mining-Definition and Functionalities –
Classification of DM Systems - DM task primitives - Integration of a Data Mining
system with a Database or a Data Warehouse - Issues in DM – KDD Process
Data Pre-processing: Why to pre-process data? - Data cleaning: Missing Values,
Noisy Data - Data Integration and transformation - Data Reduction: Data cube
aggregation, Dimensionality reduction - Data Compression - Numerosity Reduction
- Data Mining Primitives - Languages and System Architectures: Task relevant data
- Kind of Knowledge to be mined - Discretization and Concept Hierarchy
Motivation for Data Mining
Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting
through a large amount of data saved in repositories, using pattern recognition technologies
including statistical and mathematical techniques. It is the analysis of factual datasets to discover
unsuspected relationships and to summarize the records in novel ways that are both
understandable and helpful to the data owner.
It is the procedure of selection, exploration, and modeling of high quantities of information to
find regularities or relations that are at first unknown to obtain clear and beneficial results for the
owner of the database.
It is not limited to the use of computer algorithms or statistical techniques. It is a process of
business intelligence that can be used together with information technology to support company
decisions.
Data Mining is similar to Data Science. It is carried out by a person, in a particular situation, on
a specific data set, with an objective. It encompasses several types of mining, including text
mining, web mining, audio and video mining, pictorial data mining, and social media mining. It
is carried out using software that may be simple or highly specialized.
Data mining has attracted a great deal of attention in the information industry and society as a
whole in recent years, because of the wide availability of huge amounts of data and the imminent
need for turning such data into useful information and knowledge. The information and knowledge
gained can be used for applications ranging from market analysis, fraud detection, and customer
retention, to production control and science exploration.
Data mining can be considered a result of the natural evolution of information technology. The
database systems industry has witnessed an evolutionary path in the development of the following
functionalities: data collection and database creation, data management, and advanced
data analysis.
For example, the early development of data collection and database creation mechanisms served
as a prerequisite for the later development of effective mechanisms for data storage and retrieval,
and query and transaction processing. With numerous database systems offering query and
transaction processing as common practice, advanced data analysis has naturally become the next
target.
Data can be stored in many types of databases and data repositories. One data repository
architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data
sources organized under a unified schema at a single site to support management decision
making.
Data warehouse technology involves data cleaning, data integration, and online analytical
processing (OLAP), especially, analysis techniques with functionalities including
summarization, consolidation, and aggregation, and the ability to view data from multiple angles.
What motivated data mining? Why is it important?

The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and
science exploration.
The evolution of database technology

Data Collection and Database Creation (1960s and earlier)
- Primitive file processing

Database Management Systems (1970s - early 1980s)
1) Hierarchical and network database systems
2) Relational database systems
3) Data modeling tools: entity-relationship models, etc.
4) Indexing and accessing methods: B-trees, hashing, etc.
5) Query languages: SQL, etc.; user interfaces, forms and reports
6) Query processing and query optimization
7) Transactions, concurrency control and recovery
8) Online transaction processing (OLTP)

Advanced Database Systems (mid 1980s - present)
1) Advanced data models: extended-relational, object-relational, etc.
2) Advanced applications: spatial, temporal, multimedia, active, stream and sensor,
knowledge-based

Advanced Data Analysis: Data Warehousing and Data Mining (late 1980s - present)
1) Data warehouse and OLAP
2) Data mining and knowledge discovery: generalization, classification, association,
clustering, frequent pattern and outlier analysis, etc.
3) Advanced data mining applications: stream data mining, bio-data mining, etc.

Web-based databases (1990s - present)
1) XML-based database systems
2) Integration with information retrieval
3) Data and information integration

New Generation of Integrated Data and Information Systems (present - future)
What is data mining?

Data mining refers to extracting or "mining" knowledge from large amounts of data.
There are many other terms related to data mining, such as knowledge mining,
knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term,
"Knowledge Discovery in Databases", or KDD.

Data Mining means “knowledge mining from data”.


Data Mining is processing data to identify patterns and establish relationships. Data mining is the
process of analyzing large amounts of data stored in a data warehouse for useful information
which makes use of artificial intelligence techniques, neural networks, and advanced statistical
tools to reveal trends, patterns and relationships, which otherwise may be undetected.
In addition, many other terms have a similar meaning to data mining—for example, knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging.
Many people treat data mining as a synonym for another popularly used term, knowledge
discovery from data, or KDD, while others view data mining as merely an essential step in the
process of knowledge discovery.
The knowledge discovery process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation
7. Knowledge presentation (where visualization and knowledge representation techniques are
used to present mined knowledge to users)
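
To make these seven steps concrete, the following is a minimal, purely illustrative sketch of the whole pipeline on a toy sales table. It assumes the pandas library is available, and every table and column name (store_a, store_b, customer_id, item, region) is hypothetical.

# A minimal, illustrative sketch of the seven knowledge discovery steps.
import pandas as pd

# Raw, noisy source data from two hypothetical stores
store_a = pd.DataFrame({"customer_id": [1, 2, 2, None],
                        "item": ["milk", "bread", "bread", "eggs"],
                        "region": ["CA", "CA", "CA", "US"]})
store_b = pd.DataFrame({"customer_id": [3, 3],
                        "item": ["milk", "butter"],
                        "region": ["CA", "CA"]})

# 1. Data cleaning: drop tuples with a missing customer id
store_a = store_a.dropna(subset=["customer_id"])

# 2. Data integration: combine the two sources into one table
data = pd.concat([store_a, store_b], ignore_index=True)

# 3. Data selection: keep only the task-relevant portion (say, region "CA")
relevant = data[data["region"] == "CA"]

# 4. Data transformation: consolidate into one basket (set of items) per customer
baskets = relevant.groupby("customer_id")["item"].apply(set)

# 5. Data mining: count how often each item appears across baskets
support_counts = pd.Series([i for b in baskets for i in b]).value_counts()

# 6. Pattern evaluation: keep only items appearing in at least 2 baskets
frequent_items = support_counts[support_counts >= 2]

# 7. Knowledge presentation: report the result to the user
print(frequent_items)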

Data Mining-Definition and Functionalities

Data mining is a technical methodology for discovering information in huge data sets.
The main objective of data mining is to identify patterns, trends, or rules that
explain data behavior contextually. Data mining uses mathematical analysis to deduce
patterns and trends that could not be found through older methods of data exploration.
Data mining is a handy and extremely convenient methodology when it comes to dealing
with huge volumes of data. In this section, we explore the data mining functionalities
that are used to specify the kinds of patterns to be found in data sets.

Data mining functionalities are used to specify the kinds of patterns that are to be
discovered in data mining tasks. In general, data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks characterize the general properties of
the data in the database, while predictive mining tasks perform inference on the current
data in order to make predictions.
There are various data mining functionalities which are as follows −
Data characterization − It is a summarization of the general characteristics of an object class
of data. The data corresponding to the user-specified class is generally collected by a database
query. The output of data characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of target class data
objects with the general characteristics of objects from one or a set of contrasting classes. The
target and contrasting classes can be represented by the user, and the equivalent data objects
fetched through database queries.
Association Analysis − It analyses the set of items that generally occur together in a
transactional dataset. Two parameters are used for determining the association rules −
Support, which identifies how frequently the itemset occurs in the database.
Confidence, which is the conditional probability that an item occurs in a transaction when another
item occurs.
Classification − Classification is the procedure of discovering a model that represents and
distinguishes data classes or concepts, for the objective of being able to use the model to
predict the class of objects whose class label is anonymous. The derived model is established
on the analysis of a set of training data (i.e., data objects whose class label is common).
Prediction − It predicts some unavailable data values or future trends. An object can
be anticipated based on the attribute values of the object and the attribute values of the classes. It
can be a prediction of missing numerical values or of increase/decrease trends in time-related
data.
Clustering − It is similar to classification, but the classes are not predefined; they are
derived from the data attributes. It is unsupervised learning. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing the
interclass similarity.
Outlier analysis − Outliers are data elements that cannot be grouped into a given class or
cluster. These are data objects whose behaviour deviates from the general behaviour of the
other data objects. The analysis of this type of data can be essential for mining knowledge.
Evolution analysis − It defines the trends for objects whose behaviour changes over some
time.

Classification of DM Systems –

DM task primitives
Data Mining Primitives:
A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during the
discovery of knowledge.
The data mining task primitives includes the following:
• Task-relevant data
• Kind of knowledge to be mined
• Background knowledge
• Interestingness measurement
• Presentation for visualizing the discovered patterns
Task-relevant data
This specifies the portions of the database or the dataset of data in which the user is interested.
This includes the database attributes or data warehouse dimensions of interest (referred to as the
relevant attributes or dimensions).
The kind of knowledge to be mined
This specifies the data mining functions to be performed. Such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.
The background knowledge to be used in the discovery process
The knowledge about the domain is useful for guiding the knowledge discovery process for
evaluating the interesting patterns. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
An example is a concept hierarchy for the attribute (or dimension) age. User beliefs
regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation:
Different kinds of knowledge may have different interestingness measures.
For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered
uninteresting.
The expected representation for visualizing the discovered patterns: this refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to
flexibly interact with data mining systems.

Data Mining Task Primitives


A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during
discovery to direct the mining process or examine the findings from different angles or depths.
The data mining primitives specify the following,

1. Set of task-relevant data to be mined.


2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a
foundation on which user-friendly graphical interfaces can be built.
Designing a comprehensive data mining language is challenging because data mining covers a
wide spectrum of tasks, from data characterization to evolution analysis. Each task has different
requirements. The design of an effective data mining query language requires a deep
understanding of the power, limitation, and underlying mechanisms of the various kinds of data
mining tasks. This facilitates a data mining system's communication with other information
systems and integrates with the overall information processing environment.

List of Data Mining Task Primitives


A data mining query is defined in terms of the following primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (the relevant attributes
or dimensions).

In a relational database, the set of task-relevant data can be collected via a relational query
involving operations like selection, projection, join, and aggregation.

The data collection process results in a new data relation called the initial data relation. The
initial data relation can be ordered or grouped according to the conditions specified in the query.
This data retrieval can be thought of as a subtask of the data mining task.

This initial relation may or may not correspond to a physical relation in the database. Since virtual
relations are called views in the field of databases, the set of task-relevant data for data mining is
called a minable view.

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.

3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and evaluating the patterns found. Concept hierarchies are a popular form of
background knowledge, which allows data to be mined at multiple levels of abstraction.

Concept hierarchy defines a sequence of mappings from low-level concepts to higher-level,


more general concepts.
o Rolling Up - Generalization of data: Allows viewing data at more meaningful and explicit
levels of abstraction and makes the data easier to understand. It compresses the data and
requires fewer input/output operations.
o Drilling Down - Specialization of data: Concept values replaced by lower-level
concepts. Based on different user viewpoints, there may be more than one concept
hierarchy for a given attribute or dimension.

An example is a concept hierarchy for the attribute (or dimension) age, for instance mapping
age values into the higher-level concepts youth, middle_aged, and senior. User beliefs
regarding relationships in the data are another form of background knowledge.

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interesting measures. They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns. For example,
interesting measures for association rules include support and confidence. Rules whose support
and confidence values are below user-specified thresholds are considered uninteresting.

o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall


simplicity for human comprehension. For example, the more complex the structure of a
rule is, the more difficult it is to interpret, and hence, the less interesting it is likely to be.
Objective measures of pattern simplicity can be viewed as functions of the pattern
structure, defined in terms of the pattern size in bits or the number of attributes or operators
appearing in the pattern.
o Certainty (Confidence): Each discovered pattern should have a measure of certainty
associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty
measure for association rules of the form "A =>B" where A and B are sets of items is
confidence. Confidence is a certainty measure. Given a set of task-relevant data tuples, the
confidence of "A => B" is defined as
Confidence (A=>B) = # tuples containing both A and B /# tuples containing A
o Utility (Support): The potential usefulness of a pattern is a factor defining its
interestingness. It can be estimated by a utility function, such as support. The support of
an association pattern refers to the percentage of task-relevant data tuples (or transactions)
for which the pattern is true.
Utility (support): usefulness of a pattern
Support (A=>B) = # tuples containing both A and B / total #of tuples
o Novelty: Novel patterns are those that contribute new information or increased
performance to the given pattern set, for example a data exception. Another strategy
for detecting novelty is to remove redundant patterns.
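
The certainty (confidence) and utility (support) formulas above translate directly into code. The sketch below is only an illustration: it assumes each task-relevant tuple is represented as a Python set of items, and the item names are invented.

# Certainty and utility measures for a rule A => B over a set of task-relevant tuples.
def support(A, B, transactions):
    # fraction of all tuples that contain both A and B
    both = sum(1 for t in transactions if A <= t and B <= t)
    return both / len(transactions)

def confidence(A, B, transactions):
    # fraction of tuples containing A that also contain B
    contains_a = [t for t in transactions if A <= t]
    both = sum(1 for t in contains_a if B <= t)
    return both / len(contains_a) if contains_a else 0.0

# Hypothetical task-relevant data
transactions = [{"computer", "antivirus"}, {"computer"},
                {"printer"}, {"computer", "antivirus"}]
print(support({"computer"}, {"antivirus"}, transactions))     # 0.5
print(confidence({"computer"}, {"antivirus"}, transactions))  # about 0.67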

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.

Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of
knowledge.

For example, generalized relations and their corresponding cross tabs or pie/bar charts are good
for presenting characteristic descriptions, whereas decision trees are common for classification.

Integration of a Data Mining System with a Database or a Data Warehouse

The data mining system is integrated with a database or data warehouse system so that it can do
its tasks in an effective presence. A data mining system operates in an environment that needed
it to communicate with other data systems like a database system. There are the possible
integration schemes that can integrate these systems which are as follows −
No coupling − No coupling defines that a data mining system will not use any function of a
database or data warehouse system. It can retrieve data from a specific source (including a file
system), process data using some data mining algorithms, and therefore save the mining results
in a different file.
Such a system, though simple, suffers from various limitations. First, a database system
offers a great deal of flexibility and efficiency at storing, organizing, accessing, and processing
data. Without using a database/data warehouse system, a data mining system may spend a
large amount of time finding, collecting, cleaning, and transforming data.
Loose Coupling − In this scheme, the data mining system uses some services of a database or data warehouse
system. The data is fetched from a data repository handled by these systems. Data mining
approaches are used to process the data, and the processed data is then saved either in a file or in
a designated area in a database or data warehouse. Loose coupling is better than no coupling because
it can fetch any portion of the data stored in databases by using query processing or other system
facilities.
Semitight Coupling − In this scheme, efficient execution of a few essential data mining primitives is
provided in the database/data warehouse system. These primitives can include sorting,
indexing, aggregation, histogram analysis, multi-way join, and pre-computation of some
important statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling defines that a data mining system is smoothly integrated into
the database/data warehouse system. The data mining subsystem is considered as one functional
element of an information system.
Data mining queries and functions are developed and established on mining query analysis, data
structures, indexing schemes, and query processing methods of database/data warehouse systems.
It is hugely desirable because it supports the effective implementation of data mining functions,
high system performance, and an integrated data processing environment.

Issues in DM

Major Issues in Data Mining


1. Mining Methodology
Mining various and new kinds of knowledge
Data mining covers a wide spectrum of data analysis and knowledge discovery tasks, from data
characterization and discrimination to association and correlation analysis, classification,
regression, clustering, outlier analysis, sequence analysis, and trend and evolution analysis. These
tasks may use the same database in different ways and require the development of numerous data
mining techniques.
Mining knowledge in multidimensional space
When searching for knowledge in large data sets, we can explore the data in multidimensional
space. That is, we can search for interesting patterns among combinations of dimensions
(attributes) at varying levels of abstraction. Such mining is known as (exploratory)
multidimensional data mining.
Data mining: an interdisciplinary effort
The power of data mining can be substantially enhanced by integrating new methods from
multiple disciplines. For example, to mine data with natural language text, it makes sense to fuse
data mining methods with methods of information retrieval and natural language processing. As
another example, consider the mining of software bugs in large programs. This form of mining,
known as bug mining, benefits from the incorporation of software engineering knowledge into
the data mining process.
Boosting the power of discovery in a networked environment
Most data objects reside in a linked or interconnected environment, whether it be the Web,
database relations, files, or documents. Semantic links across multiple data objects can be used to
advantage in data mining. Knowledge derived in one set of objects can be used to boost the
discovery of knowledge in a “related” or semantically linked set of objects.
Handling uncertainty, noise, or incompleteness of data
Data often contain noise, errors, exceptions, or uncertainty, or are incomplete. Errors and noise
may confuse the data mining process, leading to the derivation of erroneous patterns. Data
cleaning, data preprocessing, outlier detection and removal, and uncertainty reasoning are
examples of techniques that need to be integrated with the data mining process.
Pattern evaluation and pattern- or constraint-guided mining
Not all the patterns generated by data mining processes are interesting. What makes a pattern
interesting may vary from user to user. Therefore, techniques are needed to assess the
interestingness of discovered patterns based on subjective measures. These estimate the value of
patterns with respect to a given user class, based on user beliefs or expectations. Moreover, by
using interestingness measures or user-specified constraints to guide the discovery process, we
may generate more interesting patterns and reduce the search space.
2. User Interaction
Interactive mining
Interactive mining should allow users to dynamically change the focus of a search, to refine
mining requests based on returned results, and to drill, dice, and pivot through the data and
knowledge space interactively, dynamically exploring “cube space” while mining.
Incorporation of background knowledge
Background knowledge, constraints, rules, and other information regarding the domain under
study should be incorporated into the knowledge discovery process. Such knowledge can be used
for pattern evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages
High-level data mining query languages or other high-level flexible user interfaces will give users
the freedom to define ad hoc data mining tasks. This should facilitate specification of the relevant
sets of data for analysis, the domain knowledge, the kinds of knowledge to be mined, and the
conditions and constraints to be enforced on the discovered patterns. Optimization of the
processing of such flexible mining requests is another promising area of study.
Presentation and visualization of data mining results
How can a data mining system present data mining results, vividly and flexibly, so that the
discovered knowledge can be easily understood and directly usable by humans? This is especially
crucial if the data mining process is interactive. It requires the system to adopt expressive
knowledge representations, user-friendly interfaces, and visualization techniques.
3. Efficiency and Scalability
Efficiency and scalability of data mining algorithms Data mining algorithms must be efficient and
scalable in order to effectively extract information from huge amounts of data in many data
repositories or in dynamic data streams.
Parallel, distributed, and incremental mining algorithms
Such algorithms first partition the data into “pieces.” Each piece is processed, in parallel, by
searching for patterns. The parallel processes may interact with one another. The patterns from
each partition are eventually merged. The high cost of some data mining processes and the
incremental nature of input promote incremental data mining, which incorporates new data
updates without having to mine the entire data “from scratch.” Such methods perform knowledge
modification incrementally to amend and strengthen what was previously discovered.
4. Diversity of Database Types
Handling complex types of data
Diverse applications generate a wide spectrum of new data types, from structured data such as
relational and data warehouse data to semi-structured and unstructured data; from stable data
repositories to dynamic data streams; from simple data objects to temporal data, biological
sequences, sensor data, spatial data, hypertext data, multimedia data, software program code, Web
data, and social network data.
Mining dynamic, networked, and global data repositories
Multiple sources of data are connected by the Internet and various kinds of networks, forming
gigantic, distributed, and heterogeneous global information systems and networks. The discovery
of knowledge from different sources of structured, semi-structured, or unstructured yet
interconnected data with diverse data semantics poses great challenges to data mining.
5. Data Mining and Society
Social impacts of data mining
With data mining penetrating our everyday lives, it is important to study the impact of data mining
on society. How can we use data mining technology to benefit society? How can we guard against
its misuse? The improper disclosure or use of data and the potential violation of individual privacy
and data protection rights are areas of concern that need to be addressed.
Privacy-preserving data mining
Data mining will help scientific discovery, business management, economic recovery, and security
protection (e.g., the real-time discovery of intruders and cyberattacks). At the same time, it poses
the risk of disclosing an individual's personal information, so methods that preserve privacy while
still producing useful mining results are an important area of study.
Invisible data mining
We cannot expect everyone in society to learn and master data mining techniques. More and more
systems should have data mining functions built within so that people can perform data mining or
use data mining results simply by mouse clicking, without any knowledge of data mining
algorithms. Intelligent search engines and Internet-based stores perform such invisible data
mining by incorporating data mining into their components to improve their functionality and
performance.

KDD Process
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data Mining
techniques. It is a field of interest to researchers in various fields, including artificial intelligence,
machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert
systems, and data visualization.

The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.

The Knowledge Discovery in Databases process is considered a programmed, exploratory analysis
and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful,
and understandable patterns from huge and complex data sets. Data mining is the core of the KDD
procedure, involving the algorithms that investigate the data, develop the model, and discover
previously unknown patterns. The model is used for extracting knowledge from the data,
analyzing the data, and predicting the data.

The availability and abundance of data today make knowledge discovery and Data Mining a
matter of impressive significance and need. In the recent development of the field, it isn't
surprising that a wide variety of techniques is presently accessible to specialists and experts.
The KDD Process
The knowledge discovery process is iterative and interactive and comprises nine steps. The
process is iterative at each stage, implying that moving back to previous steps might be required.
The process also has many creative aspects, in the sense that no single formula or complete
scientific categorization exists for making the correct decisions for each step and application type.
Thus, it is necessary to understand the process and the different requirements and possibilities
at each stage.

The process begins with determining the KDD objectives and ends with the implementation of
the discovered knowledge. At that point the loop is closed, and active data mining starts.
Subsequently, changes may need to be made in the application domain, for example offering
various features to cell phone users in order to reduce churn. The impacts are then measured on
the new data repositories, and the KDD process is started again. Following is a concise
description of the nine-step KDD process, beginning with a managerial step:

1. Building up an understanding of the application domain

This is the initial preliminary step. It sets the scene for understanding what should be done
with the various decisions (transformation, algorithms, representation, etc.). The individuals
in charge of a KDD project need to understand and characterize the objectives of the
end user and the environment in which the knowledge discovery process will take place
(including relevant prior knowledge).

2. Choosing and creating a data set on which discovery will be performed

Once the objectives are defined, the data that will be utilized for the knowledge discovery process
should be determined. This incorporates discovering what data is accessible, obtaining important
data, and afterwards integrating all the data for knowledge discovery into one data set, including
the attributes that will be considered for the process. This step is important because data mining
learns and discovers from the accessible data: it is the evidence base for building the models.
If some significant attributes are missing, the entire study may be unsuccessful, so the more
attributes that are considered, the better. On the other hand, organizing, collecting, and operating
advanced data repositories is expensive, so there is a trade-off against the opportunity for best
understanding the phenomena. This trade-off is one place where the interactive and iterative
nature of KDD shows itself: the process begins with the best available data sets and later expands,
observing the impact in terms of knowledge discovery and modeling.

3. Preprocessing and cleansing

In this step, data reliability is improved. It incorporates data cleaning, for example handling
missing values and removing noise or outliers. It might involve complex statistical techniques
or the use of a data mining algorithm in this context. For example, when one suspects that a specific
attribute is of insufficient reliability or has many missing values, this attribute could become
the target of a supervised data mining algorithm: a prediction model for the attribute is created,
and the missing values can then be predicted. The extent to which one pays attention to this
step depends on many factors. Regardless, studying these aspects is important and often
revealing in itself with respect to enterprise data systems.

4. Data Transformation

In this stage, data appropriate for data mining is prepared and developed.
Techniques here include dimension reduction (for example, feature selection and extraction
and record sampling) and attribute transformation (for example, discretization of numerical
attributes and functional transformations). This step can be essential for the success of the entire
KDD project, and it is typically very project-specific. For example, in medical assessments, the
ratio between attributes may often be the most significant factor rather than each attribute by
itself. In business, we may need to consider effects beyond our control as well as efforts and
transient issues, such as studying the cumulative effect of advertising. However, even if we do not
use the right transformation at the start, we may still observe a surprising effect that hints at
the transformation needed in the next iteration. Thus, the KDD process feeds back on itself and
leads to an understanding of the transformation required.

5. Prediction and description


We are now prepared to decide which kind of data mining to use, for example classification,
regression, or clustering. This mainly depends on the KDD objectives and also on the previous
steps. There are two major goals in data mining: the first is prediction and the
second is description. Prediction is usually referred to as supervised data mining, while
descriptive data mining includes the unsupervised and visualization aspects of data mining.
Most data mining techniques depend on inductive learning, where a model is built explicitly or
implicitly by generalizing from an adequate number of training examples. The fundamental
assumption of the inductive approach is that the trained model applies to future cases. The
technique also takes into account the level of meta-learning for the specific set of accessible data.

6. Selecting the Data Mining algorithm

Having chosen the technique, we now decide on the strategy. This stage involves choosing a
particular method to be used for searching for patterns, possibly involving multiple inducers. For example,
considering precision versus understandability, the former is better with neural networks, while
the latter is better with decision trees. For each strategy of meta-learning, there are several
possibilities for how it can be applied. Meta-learning focuses on explaining what causes a data
mining algorithm to be successful or not on a specific problem. Thus, this methodology attempts to
understand the conditions under which a data mining algorithm is most suitable. Each algorithm
has parameters and learning strategies, such as ten-fold cross-validation or another division into
training and testing sets.

7. Utilizing the Data Mining algorithm

At last, the implementation of the data mining algorithm is reached. In this stage, we may need
to run the algorithm several times until a satisfying outcome is obtained, for example by
tuning the algorithm's control parameters, such as the minimum number of instances in a single
leaf of a decision tree.

8. Evaluation

In this step, we assess and interpret the mined patterns and rules with respect to the objectives
defined in the first step. Here we consider the preprocessing steps in terms of their impact on the
data mining algorithm results (for example, adding a feature in step 4 and repeating from there).
This step focuses on the comprehensibility and utility of the induced model. The identified
knowledge is also documented for further use. The final steps are the use of the discovered
knowledge and overall feedback on the results acquired by data mining.

9. Using the discovered knowledge

Now we are prepared to incorporate the knowledge into another system for further action. The
knowledge becomes active in the sense that we may make changes to the system and measure
the effects. The accomplishment of this step determines the effectiveness of the whole KDD process.
There are numerous challenges in this step, such as losing the "laboratory conditions" under which
we have worked. For example, the knowledge was discovered from a certain static snapshot
(usually a sample) of the data, but now the data becomes dynamic. Data structures may change,
certain quantities may become unavailable, and the data domain may be modified, for example an
attribute may take a value that was not expected previously.

Data Pre-processing: Why to pre-process data? - Data cleaning: Missing Values,


Noisy Data - Data Integration and transformation - Data Reduction: Data cube
aggregation, Dimensionality reduction - Data Compression - Numerosity Reduction
- Data Mining Primitives - Languages and System Architectures: Task relevant data
- Kind of Knowledge to be mined - Discretization and Concept Hierarchy

Preprocessing in Data Mining:


Data preprocessing is a data mining technique which is used to transform the raw data into a
useful and efficient format.
Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format. It
is also an important step in data mining as we cannot work with raw data. The quality of the
data should be checked before applying machine learning or data mining algorithms.

Why is Data preprocessing important?

Preprocessing of data is mainly to check the data quality. The quality can be checked by the
following

• Accuracy: To check whether the data entered is correct or not.


• Completeness: To check whether all required data is recorded and available.
• Consistency: To check whether the same data kept in different places matches.
• Timeliness: The data should be updated correctly.
• Believability: The data should be trustable.
• Interpretability: The understandability of the data.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.

• (a). Missing Data:


This situation arises when some data is missing in the data. It can be handled in
various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.
• (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled in the
following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.

2. Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may then
be detected as values that fall outside the clusters.
2. Data Integration:
This step combines data from multiple sources (such as databases, data cubes, or flat
files) into a coherent data store, resolving naming conflicts and redundancies along the way.

3. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)
Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
Min-max normalization performs a linear transformation on the original data. Suppose that minA
and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps
a value v of A to v' in the range [new_minA, new_maxA]:
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
In z-score normalization (or zero-mean normalization), the values of an attribute A are
normalized based on the mean and standard deviation of A. A value v of A is normalized to v':
v' = (v - meanA) / stdevA
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute
A. The number of decimal points moved depends on the maximum absolute value of A. A value
v of A is normalized to v':
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(A short Python sketch of these three normalization methods appears after this list.)
2. Attribute Selection:
In this strategy, new attributes are selected or constructed from the given set of
attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.

4. Concept Hierarchy Generation:


Here attributes are converted from lower level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.
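
As referenced under Normalization above, here is a minimal sketch of the three normalization methods using plain Python; the sample attribute values are hypothetical.

# Min-max, z-score, and decimal-scaling normalization of one numeric attribute.
import math

values = [200, 300, 400, 600, 1000]            # hypothetical attribute values

# Min-max normalization to the new range [0.0, 1.0]
mn, mx = min(values), max(values)
minmax = [(v - mn) / (mx - mn) * (1.0 - 0.0) + 0.0 for v in values]

# Z-score normalization: (v - mean) / standard deviation
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
zscore = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# every normalized value has absolute value below 1
j = len(str(int(max(abs(v) for v in values))))
decimal = [v / (10 ** j) for v in values]

print(minmax)    # [0.0, 0.125, 0.25, 0.5, 1.0]
print(zscore)
print(decimal)   # [0.02, 0.03, 0.04, 0.06, 0.1]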

4. Data Reduction:
Data mining is a technique used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder. In order
to deal with this, we use data reduction techniques, which aim to increase storage
efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and the p-value of
the attribute: an attribute having a p-value greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example
regression models.

4. Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless:
if the original data can be retrieved after reconstruction from the compressed data,
the reduction is called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and
PCA (Principal Component Analysis).
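
As a small illustration of dimensionality reduction, the sketch below projects hypothetical 3-attribute records onto their top two principal components (PCA) using numpy; it is not tied to any particular data mining tool.

# Dimensionality reduction with PCA: keep the 2 directions of largest variance.
import numpy as np

X = np.array([[2.5, 2.4, 0.5],        # hypothetical records, 3 attributes each
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])

X_centered = X - X.mean(axis=0)        # center each attribute on its mean
# Singular value decomposition of the centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T      # project onto the top 2 components

print(X_reduced.shape)                 # (5, 2): same records, fewer attributes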

Binning method - Example
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Binning method - Example (Cont..)
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
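
The worked binning example above can be reproduced with a few lines of Python (standard library only):

# Equal-depth binning with smoothing by bin means and by bin boundaries.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of its bin's min or max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]

The tie-handling choice for boundary smoothing (ties go to the lower boundary here) is arbitrary; the example data contains no ties, so the output matches the worked steps above.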

Data Mining Primitives

A data mining query is defined in terms of the following primitives


Task-relevant data: This is the database portion to be investigated. For example,
suppose that you are a manager of AllElectronics in charge of sales in the United
States and Canada. In particular, you would like to study the buying trends of
customers in Canada, so rather than mining the entire database, you specify only the
data relevant to that task. The attributes involved are referred to as the relevant attributes.
The kinds of knowledge to be mined: This specifies the data mining functions to
be performed, such as characterization, discrimination, association, classification,
clustering, or evolution analysis. For instance, if studying the buying habits of
customers in Canada, you may choose to mine associations between customer
profiles and the items that these customers like to buy
Background knowledge: Users can specify background knowledge, or knowledge
about the domain to be mined. This knowledge is useful for guiding the knowledge
discovery process, and for evaluating the patterns found. There are several kinds of
background knowledge.
Interestingness measures: These functions are used to separate uninteresting
patterns from knowledge. They may be used to guide the mining process, or after
discovery, to evaluate the discovered patterns. Different kinds of knowledge may
have different interestingness measures.

Languages and System Architectures:

Architecture of a typical data mining system/Major Components

Data mining is the process of discovering interesting knowledge from large amounts
of data stored either in databases, data warehouses, or other information repositories.
Based on this view, the architecture of a typical data mining system may have the
following major components:

1. A database, data warehouse, or other information repository, which
consists of the set of databases, data warehouses, spreadsheets, or other
kinds of information repositories containing the data to be mined.
2. A database or data warehouse server which fetches the relevant data
based on users’ data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the
search or to evaluate the interestingness of resulting patterns. For
example, the knowledge base may contain metadata which describes data
from multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for
tasks such as characterization, association, classification, cluster analysis, and
evolution and deviation analysis.
5. A pattern evaluation module that works in tandem with the data mining
modules by employing interestingness measures to help focus the search
towards interestingness patterns.
6. A graphical user interface that allows the user an interactive
approach to thedata mining system.

Architecture of a typical data mining system (figure): a graphical user interface sits on top of a
pattern evaluation module and a data mining engine, both supported by a knowledge base; beneath
them, a database or data warehouse server fetches the relevant data, after data cleansing, data
integration, and filtering, from the underlying databases and data warehouses.

Task relevant data


Data mining tasks are designed to be semi-automatic or fully automatic and are run on large data sets to
uncover patterns such as groups or clusters, unusual or exceptional records (anomaly detection),
and dependencies such as associations and sequential patterns. Once patterns are uncovered, they
can be thought of as a summary of the input data, and further analysis may be carried out using
machine learning and predictive analytics. For example, the data mining step might help identify
multiple groups in the data that a decision support system can use. Note that data collection,
preparation, and reporting are not part of data mining.

There is a lot of confusion between data mining and data analysis. Data mining functions are used
to discover the trends or correlations contained in the data. While data analysis is used
to test statistical models that fit the dataset, for example the analysis of a marketing campaign, data
mining uses machine learning and mathematical and statistical models to discover patterns
hidden in the data. Data mining activities can be divided into two categories:

o Descriptive Data Mining: It describes what is happening within the data without any prior
hypothesis, highlighting the common features of the data set, for example counts,
averages, etc.
o Predictive Data Mining: It helps developers work with unlabeled data. Using
previously available or historical data, data mining can make predictions about critical
business metrics, for example predicting the volume of business next quarter based on
performance in the previous quarters over several years, or judging from the findings
of a patient's medical examinations whether he is suffering from a particular disease.

Kind of Knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.

Discretization and Concept Hierarchy

Data Discretization

• Dividing the range of a continuous attribute into intervals.


• Interval labels can then be used to replace actual data values.
• Reduce the number of values for a given continuous attribute.
• Some classification algorithms only accept categorical attributes.
• This leads to a concise, easy-to-use, knowledge-level representation of mining
results.
• Discretization techniques can be categorized based on whether it uses class
information or not such as follows:
o Supervised Discretization - This discretization process uses class
information.
o Unsupervised Discretization - This discretization process does not use class
information.
• Discretization techniques can be categorized based on which direction it proceeds as
follows:

Top-down Discretization -

• The process starts by first finding one or a few points (called split points or cut
points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals.

Bottom-up Discretization -

• Starts by considering all of the continuous values as potential split-points.


• Removes some by merging neighborhood values to form intervals, and then
recursively applies this process to the resulting intervals.

Concept Hierarchies

• Discretization can be performed rapidly on an attribute to provide a hierarchical


partitioning of the attribute values, known as a Concept Hierarchy.
• Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
• In the multidimensional model, data are organized into multiple dimensions, and
each dimension contains multiple levels of abstraction defined by concept
hierarchies.
• This organization provides users with the flexibility to view data from different
perspectives.
• Data mining on a reduced data set means fewer input and output operations and is
more efficient than mining on a larger data set.
• Because of these benefits, discretization techniques and concept hierarchies are
typically applied before data mining, rather than during mining.
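
A minimal sketch of unsupervised (equal-width) discretization of an age attribute followed by a roll-up through a two-level concept hierarchy; the interval labels and hierarchy levels (youth, middle_aged, senior) are illustrative choices, not a fixed standard.

# Equal-width discretization of age, then roll-up through a concept hierarchy.
ages = [13, 15, 22, 25, 33, 35, 42, 46, 52, 70]

def discretize(age, width=10):
    # replace the raw value by an interval label such as "20-29"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

hierarchy = {            # interval -> higher-level concept
    "10-19": "youth", "20-29": "youth",
    "30-39": "middle_aged", "40-49": "middle_aged",
    "50-59": "senior", "60-69": "senior", "70-79": "senior",
}

intervals = [discretize(a) for a in ages]
concepts = [hierarchy[i] for i in intervals]
print(intervals)   # ['10-19', '10-19', '20-29', ...]
print(concepts)    # ['youth', 'youth', 'youth', ...]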
MODULE 3
Concept Description and Association Rule Mining
What is concept description? - Data Generalization and summarization-based
characterization - Attribute relevance - class comparisons Association Rule Mining:
Market basket analysis - basic concepts - Finding frequent item sets: Apriori
algorithm - generating rules – Improved Apriori algorithm – Incremental ARM –
Associative Classification – Rule Mining
WHAT IS CONCEPT DESCRIPTION?
Concept Description is a definitive type of data mining. It defines a set of data including
frequent buyers, graduate candidates, etc. It describes the characterization and comparison of
the data. It is also known as a class description when the concept to be described is defined
as a class of objects. These descriptions can be determined with the support of data
characterization.
Data characterization is a summarization of the general characteristics of the target class of
data. The data relating to a specific user-defined class is usually recovered by a database
query. The output of data characterization can be presented in several forms such as bar
charts, curves, pie charts, and live graphs, etc.
Characterization supports a concise summarization of a given set of data, while concept or
class comparison supports descriptions comparing two or more sets of data.
OLAP
OLAP stands for On-Line Analytical Processing. OLAP is a category of software
technology that enables analysts, managers, and executives to gain insight into data
through fast, consistent, interactive access to a wide variety of possible views of data that has
been transformed from raw information to reflect the real dimensionality of the enterprise as
understood by the users.
OLAP servers present business users with multidimensional data from data warehouses or data
marts, without concerns regarding how or where the data are stored. The physical structure
and implementation of OLAP servers must consider data storage issues.

Data can be associated with classes or concepts. For example, in the AllElectronics store, classes
of items for sale include computers and printers, and concepts of customers include big
spenders and budget spenders. It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions. These descriptions can be derived via:
Data Characterization –

It is a summarization of the general characteristics or features of a target class of data. The


data corresponding to the user-specified class are typically collected by a database query: For
example, to study the characteristics of software, products whose sales are increased by 10%
in the last year, the data related to such products can be collected by executing an SQL query.
The output of data characterization can be represented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized relations
or in rule form (called characteristic rules).

Data Discrimination –

It is a comparison of general features of target class data objects with the general features from
one or a set of contrasting classes. The target and contrasting classes can be specified by the
user, the corresponding data objects retrieved through database queries. For example, the user
may like to compare the general features of software products whose sales are increased by
10% in the last year with those whose, sales are decreased by at least 30% during the same
period: The method used for data discrimination are similar to those used for data
characterization.

The simplest kind of descriptive data mining is called concept description. A concept usually
refers to a collection of data such as frequent_buyers, graduate_students, and so on.
As a data mining task, concept description is not a simple enumeration of the data. Instead,
concept description generates descriptions for characterization and comparison of the data.
It is sometimes called class description when the concept to be described refers to a class of
objects.
• Characterization: It provides a concise and succinct summarization of the given
collection of data.
• Comparison: It provides descriptions comparing two or more collections of data.

Data Generalization and summarization-based characterization


Data and objects in databases contain detailed information at the primitive concept level.
For example, the item relation in a sales database may contain attributes describing low-
level item information such as item_ID, name, brand, category, supplier, place_made and
price.
It is useful to be able to summarize a large set of data and present it at a high conceptual
level.
For example, summarizing a large set of items relating to Christmas season sales provides a
general description of such data, which can be very helpful for sales and marketing
managers.
This requires an important functionality called data generalization.

Data Generalization

A process that abstracts a large set of task-relevant data in a database from a low conceptual
level to higher ones.
Data Generalization is a summarization of general features of objects in a target class and
produces what is called characteristic rules.
The data relevant to a user-specified class are normally retrieved by a database query and
run through a summarization module to extract the essence of the data at different levels of
abstractions.
For example, one may want to characterize the "OurVideoStore" customers who regularly
rent more than 30 movies a year. With concept hierarchies on the attributes describing the
target class, the attribute-oriented induction method can be used, for example, to carry out
data summarization.
Note that with a data cube containing a summarization of data, simple OLAP operations fit
the purpose of data characterization.

Approaches:
• Data cube approach(OLAP approach).
• Attribute-oriented induction approach.

Presentation Of Generalized Results


Generalized Relation:
• Relations where some or all attributes are generalized, with counts or other
aggregation values accumulated.

Cross-Tabulation:
• Mapping results into cross-tabulation form (similar to contingency tables).
Visualization Techniques:
• Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
• Mapping generalized results in characteristic rules with quantitative information
associated with it.

Data Cube Approach

It is nothing but performing computations and storing results in data cubes.


Strength
• An efficient implementation of data generalization.
• Computation of various kinds of measures, e.g., count( ), sum( ), average( ), max( ).
• Generalization and specialization can be performed on a data cube by roll-up and
drill-down.
Limitations
• It handles only dimensions of simple non-numeric data and measures of simple
aggregated numeric values.
• Lack of intelligent analysis, can’t tell which dimensions should be used and what
levels should the generalization reach.

Summary

Data generalization is the process that abstracts a large set of task-relevant data in a database
from a low conceptual level to higher ones.
It is a summarization of general features of objects in a target class and produces what is
called characteristic rules.

ATTRIBUTE RELEVANCE
Reasons for attribute relevance analysis
There are several reasons for attribute relevance analysis are as follows −
• It can decide which dimensions must be included.
• It can produce a high level of generalization.
• It can reduce the number of attributes that support us to read patterns easily.
The basic concept behind attribute relevance analysis is to evaluate some measure that can
compute the relevance of an attribute regarding a given class or concept. Such measures
involve information gain, ambiguity, and correlation coefficient.
Attribute relevance analysis for concept description is implemented as follows −
Data collection − It can collect data for both the target class and the contrasting class by
query processing.
Preliminary relevance analysis using conservative AOI − This step recognizes a set of
dimensions and attributes on which the selected relevance measure is to be used.
AOI can be used to perform preliminary analysis on the data by eliminating attributes
having a large number of distinct values. To be conservative, the AOI performed should
employ attribute generalization thresholds that are set reasonably large so as to allow more
attributes to be considered in further relevance analysis by the selected measure.
Remove − This process removes irrelevant and weakly relevant attributes using the selected
relevance analysis measure.
Generate the concept description using AOI − AOI is applied again, this time using a less
conservative set of attribute generalization thresholds. If the descriptive mining function is
class characterization, only the initial target class working relation is included here. If the
descriptive mining function is class comparison, both the initial target class working relation
and the initial contrasting class working relation are included.
CLASS COMPARISONS
ASSOCIATION RULE MINING:
Market basket analysis - basic concepts
Market Basket Analysis: A Motivating Example
Frequent itemset mining leads to the discovery of associations and correlations
among items in large transactional or relational data sets. With massive amounts of data
continuously being collected and stored, many industries are becoming interested in
mining such patterns from their databases. The discovery of interesting correlation
relationships among huge amounts of business transaction records can help in
many business decision-making processes such as catalog design, cross-marketing,
and customer shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets” (Figure 6.1). The discovery of these
associations can help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of
bread) on the same trip

to the supermarket?

[Figure 6.1: Market basket analysis. A market analyst studies which items customers frequently purchase together; the example shopping baskets of Customers 1, 2, and 3 contain combinations of milk, bread, cereal, sugar, eggs, and butter.]

This information can lead to increased sales by helping retailers do
selective marketing and plan their shelf space.
Let’s look at an example of how market basket analysis can be useful.

Market basket analysis. Suppose, as manager of an All Electronics branch, you would like
to learn more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?” To answer your
question, market basket analysis may be performed on the retail data of customer
transactions at your store. You can then use the results to plan marketing or advertising
strategies, or in the design of a new catalog. For instance, market basket analysis may help you
design different store layouts. In one strategy, items that are frequently purchased together
can be placed in proximity to further encourage the combined sale of such items. If customers
who purchase computers also tend to buy antivirus software at the same time, then placing
the hardware display close to the software display may help increase the sales of both
items.

In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale
while heading toward the software display to purchase antivirus software, and may decide to
purchase a home security system as well. Market basket analysis can also help retailers plan
which items to put on sale at reduced prices. If customers tend to purchase computers and
printers together, then having a sale on printers may encourage the sale of printers as well as
computers.
If we think of the universe as the set of items available at the store, then each item has a Boolean
variable representing the presence or absence of that item. Each basket can then be
represented by a Boolean vector of values assigned to these variables. The Boolean vectors
can be analyzed for buying patterns that reflect items that are frequently associated or
purchased together. These patterns can be represented in the form of association rules. For
example, the information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following association rule:

computer ⇒ antivirus software [support = 2%, confidence = 60%]

Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for Rule (6.1)
means that 2% of all the transactions under analysis show that computer and antivirus
software are purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold. These thresholds can be set by users or domain experts.
Additional analysis can be performed to discover interesting statistical correlations
between associated items.
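
As a small illustration, support and confidence can be computed directly from a list of transactions. The following Python sketch uses invented transactions and item names; it is not part of the textbook example.

# Illustrative sketch: computing support and confidence for a rule A => B.
# The transactions and item names below are invented for this example.
transactions = [
    {"computer", "antivirus software"},
    {"computer", "printer"},
    {"computer", "antivirus software", "printer"},
    {"milk", "bread"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(antecedent U consequent) / support(antecedent)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

a, b = {"computer"}, {"antivirus software"}
print("support    =", support(a | b, transactions))      # 2/5 = 0.4
print("confidence =", confidence(a, b, transactions))    # 0.4 / 0.6 ≈ 0.67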
ASSOCIATION RULE MINING
Association rule learning is a rule-based machine learning method for discovering interesting
relations between variables in large databases.
It is intended to identify strong rules discovered in databases using some measures of
interestingness
❖ Learning of Association rules is used to find relationships between attributes in large databases.
An association rule, A => B, is of the form: "for a set of transactions, some value of itemset
A determines the values of itemset B, under the condition that minimum support and
confidence are met".
❖ Support and Confidence can be represented by the following example:

bread ⇒ butter [support = 2%, confidence = 60%]

❖ The above statement is an example of an association rule. It means that 2% of the
transactions contain bread and butter together, and 60% of the customers who bought
bread also bought butter.
Association rule mining consists of 2 steps:
1. Find all the frequent itemsets.
2. Generate association rules from the above frequent itemsets
Finding frequent item sets: Apriori algorithm - generating rules – Improved
Apriori algorithm – Incremental ARM – Associative Classification – Rule
Mining
APRIORI ALGORITHM

❖ With the rapid growth of e-commerce applications, vast quantities of data accumulate in
months rather than years. Data Mining, also known as Knowledge Discovery in
Databases (KDD), is used to find anomalies, correlations, patterns, and trends in this data and to predict outcomes.

❖ Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets
and relevant association rules. It is devised to operate on a database containing a lot of
transactions, for instance, items brought by customers in a store.

❖ It is very important for effective Market Basket Analysis and it helps the customers in
purchasing their items with more ease which increases the sales of the markets. It has also been
used in the field of healthcare for the detection of adverse drug reactions (ADRs). It produces association
rules that indicate which combinations of medications and patient characteristics lead to ADRs.
WHAT IS AN ITEMSET?

❖ A set of items together is called an itemset. An itemset containing k items is called a k-itemset.
An itemset that occurs frequently is called a frequent itemset. Thus, frequent itemset mining is a
data mining technique for identifying items that often occur together.

➢ For Example, Bread and butter, Laptop and Antivirus software, etc
WHAT IS A FREQUENT ITEMSET?

❖ A set of items is called frequent if it satisfies a minimum threshold value for support and
confidence. Support is the fraction of transactions in which the items are purchased together.
Confidence is the fraction of transactions containing one itemset that also contain the other.

❖ For frequent itemset mining method, we consider only those transactions which meet minimum
threshold support and confidence requirements. Insights from these mining algorithms offer a lot
of benefits, cost-cutting and improved competitive advantage.

❖ There is a tradeoff between the time taken to mine the data and the volume of data for frequent
mining. An efficient frequent mining algorithm mines the hidden patterns of itemsets in a short
time and with low memory consumption.
FREQUENT PATTERN MINING

❖ The frequent pattern mining algorithm is one of the most important techniques of data mining
to discover relationships between different items in a dataset. These relationships are represented
in the form of association rules. It helps to find the irregularities in data.

❖ FPM has many applications in the field of data analysis, software bugs, cross-marketing, sale
campaign analysis, market basket analysis, etc.

❖ Frequent itemsets discovered through Apriori have many applications in data mining tasks,
such as finding interesting patterns in the database and finding sequences; mining of
association rules is the most important of them.

❖ Association rules apply to supermarket transaction data, that is, to examine the customer
behavior in terms of the purchased products. Association rules describe how often the items are
purchased together
WHY FREQUENT ITEMSET MINING?

❖ Frequent itemset or pattern mining is broadly used because of its wide applications in mining
association rules, correlations and graph patterns constraint that is based on frequent patterns,
sequential patterns, and many other data mining tasks.

❖ The Apriori property says:

➢ If P(I) < minimum support threshold, then I is not frequent.

➢ If P(I + A) < minimum support threshold, then I + A is not frequent, where A is another item
that also belongs to the itemset.

➢ If an itemset has support less than the minimum support, then all of its supersets will also fall
below the minimum support and can therefore be ignored. This property is called the antimonotone property.

STEPS TO FOLLOW IN APRIORI ALGORITHM


1. Join Step: This step generates (k+1)-itemsets from k-itemsets by joining each itemset with
itself.
2. Prune Step: This step scans the count of each item in the database. If the candidate item
does not meet minimum support, then it is regarded as infrequent and thus it is removed.
This step is performed to reduce the size of the candidate itemsets.
Apriori algorithm is a sequence of steps to be followed to find the most frequent itemset in the
given database. This data mining technique follows the join and the prune steps iteratively until
the most frequent itemset is achieved. A minimum support threshold is given in the problem or it
is assumed by the user.
1) In the first iteration of the algorithm, each item is taken as a 1-itemsets candidate. The algorithm
will count the occurrences of each item.
2) Let there be some minimum support, min_sup ( eg 2). The set of 1 – itemsets whose occurrence
is satisfying the min sup are determined. Only those candidates which count more than or equal
to min_sup, are taken ahead for the next iteration and the others are pruned.
3) Next, frequent 2-itemsets satisfying min_sup are discovered. For this, in the join step, candidate
2-itemsets are generated by combining the frequent 1-itemsets with each other.
4) The candidate 2-itemsets are pruned using the min_sup threshold value. Now the table will have
only 2-itemsets that satisfy min_sup.
5) The next iteration forms 3-itemsets using the join and prune steps. This iteration uses the
antimonotone property: the 2-itemset subsets of each candidate 3-itemset must themselves meet
min_sup. If all 2-itemset subsets are frequent, the candidate is retained; otherwise it is pruned.
6) Next step will follow making 4-itemset by joining 3-itemset with itself and pruning if its subset
does not meet the min_sup criteria. The algorithm is stopped when the most frequent itemset is
achieved.

Example refer classwork

Transactional data for an AllElectronics branch (minimum support ≈ 22%, i.e., a support count of 2).
Note: L1 ⋈ L1 is equivalent to L1 × L1, since the definition of Lk ⋈ Lk requires the two joining
itemsets to share k − 1 = 0 items.
TID List of item IDs
T100 I1, I2, I5

T200 I2, I4

T300 I2, I3

T400 I1, I2, I4

T500 I1, I3

T600 I2, I3

T700 I1, I3

T800 I1, I2, I3, I5

T900 I1, I2, I3
Generation of the candidate itemsets and frequent itemsets, where the minimum support count is 2.

1. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated, as shown in the middle table of the second row in
Figure 6.2.
2. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
3. The generation of the set of candidate 3-itemsets, C3, is detailed in Figure 6.3. From
the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine that the four latter candidates
cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort
of unnecessarily obtaining their counts during the subsequent scan of D to determine L3.
Note that when given a candidate k-itemset, we only need to check if its (k − 1)-subsets
are frequent, since the Apriori algorithm uses a level-wise search strategy.

(a) Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}}
⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}}
= {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
(b) Prune using the Apriori property: All nonempty subsets of a frequent itemset must
also be frequent. Do any of the candidates have a subset that is not frequent?

The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}. All 2-item subsets of
{I1, I2, I3} are members of L2. Therefore, keep {I1, I2, I3} in C3.
The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}. All 2-item subsets of
{I1, I2, I5} are members of L2. Therefore, keep {I1, I2, I5} in C3.
The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, and {I3, I5}. {I3, I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I1, I3, I5} from C3.
The 2-item subsets of {I2, I3, I4} are {I2, I3}, {I2, I4}, and {I3, I4}. {I3, I4} is not a
member of L2, and so it is not frequent. Therefore, remove {I2, I3, I4} from C3.
The 2-item subsets of {I2, I3, I5} are {I2, I3}, {I2, I5}, and {I3, I5}. {I3, I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I2, I3, I5} from C3.
The 2-item subsets of {I2, I4, I5} are {I2, I4}, {I2, I5}, and {I4, I5}. {I4, I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I2, I4, I5} from C3.
( c ) Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning.

Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property.
The resulting pruned version of C3 is shown in the first table of the bottom row of Figure 6.2.
1. The transactions in D are scanned to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support (Figure 6.2).
2. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned
because its subset {I2, I3, I5} is not frequent. Thus, C4 = φ, and the algorithm
terminates, having found all of the frequent itemsets.

Advantages
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large itemsets in large databases
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum
support is kept very low.
2. The entire database needs to be scanned.
PSEUDO CODE OF APRIORI ALGORITHM
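
As a stand-in for the pseudocode, the following is a minimal, self-contained Python sketch of the level-wise Apriori search (join step, prune step, and support counting), run on the AllElectronics-style transactions above with a minimum support count of 2. It is an illustrative implementation, not the textbook's exact pseudocode; the function and variable names are my own.

from itertools import combinations

def apriori(transactions, min_sup):
    # Returns a dict mapping each frequent itemset (frozenset) to its support count.
    # transactions: list of sets of items; min_sup: minimum support count.
    counts = {}
    for t in transactions:                       # scan 1: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: v for k, v in counts.items() if v >= min_sup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):               # join step: build size-k candidates
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                # prune step: every (k-1)-subset must be frequent (Apriori property)
                if len(union) == k and all(
                        frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                    candidates.add(union)
        counts = {c: 0 for c in candidates}
        for t in transactions:                   # scan the database to count candidates
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Transactions from the AllElectronics table above; minimum support count = 2
D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
for itemset, cnt in sorted(apriori(D, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), cnt)                  # ends with ['I1','I2','I3'] 2 and ['I1','I2','I5'] 2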

IMPROVED APRIORI ALGORITHM


• Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and its corresponding count. It uses a hash function for
generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in
later iterations. Transactions that do not contain frequent items are marked or
removed.
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it should
be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from Database D and then searches for
frequent itemset in S. It may be possible to lose a global frequent itemset. This can be
reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any
marked start point of the database during the scanning of the database
Refer class work for problems
APPLICATION OF ALGORITHM
1. In Education Field: Extracting association rules in data mining of admitted students
through characteristics and specialties.
2. In the Medical field: For example Analysis of the patient's database.
3. In Forestry: Analysis of probability and intensity of forest fire with the forest fire data.
4. Apriori is used by many companies like Amazon in the Recommender System and by
Google for the auto-complete feature.
INCREMENTAL ARM

❖ It is noted that analysis of past transaction data can provide very valuable information on
customer buying behavior, and thus improve the quality of business decisions.

❖ Record-based databases, whose data is continuously being added, updated, or deleted,
are increasingly used.

❖ Examples of such applications include Web log records, stock market data, grocery sales
data, transactions in e-commerce, and daily weather/traffic records etc.

❖ In many applications, we would like to mine the transaction database for a fixed amount of
most recent data (say, data in the last 12 months)
❖ Since mining is not a one-time operation, a naive approach to the incremental mining
problem is to re-run the mining algorithm on the entire updated database; incremental ARM
techniques instead aim to update the previously mined rules without a full re-run.
ASSOCIATIVE CLASSIFICATION

❖ Associative classification (AC) is a branch of a wide area of scientific study known as data
mining. Associative classification makes use of association rule mining for extracting efficient
rules, which can precisely generalize the training data set, in the rule discovery process.

❖ An associative classifier (AC) is a kind of supervised learning model that uses association
rules to assign a target value. The term associative classification was coined by Bing Liu et al., in
which the authors defined a model made of rules "whose right-hand side are restricted to the
classification class attribute"
FP GROWTH ALGORITHM
• The FP-Growth Algorithm, proposed by Han et al., is an efficient and scalable method for
mining the complete set of frequent patterns by pattern fragment growth, using an
extended prefix-tree structure for storing compressed and crucial information about
frequent patterns, called the frequent-pattern tree (FP-tree).
• This algorithm is an improvement to the Apriori method. A frequent pattern is
generated without the need for candidate generation. FP growth algorithm represents
the database in the form of a tree called a frequent pattern tree or FP tree.
• This tree structure will maintain the association between the itemsets. The database is
fragmented using one frequent item. This fragmented part is called “pattern fragment”.
The itemsets of these fragmented patterns are analyzed. Thus with this method, the
search for frequent itemsets is reduced comparatively
FP TREE

❖ Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the
database. The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP
tree represents an item of the itemset.

❖ The root node represents null while the lower nodes represent the itemsets. The association
of the nodes with the lower nodes that is the itemsets with the other itemsets are maintained
while forming the tree
FP GROWTH STEPS
1) The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in the
database is called support count or frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.
3) The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the itemset in it. The itemset with the max count is taken
at the top, the next itemset with lower count and so on. It means that the branch of the
tree is constructed with transaction itemsets in descending order of count.
4) The next transaction in the database is examined. The itemsets are ordered in
descending order of count. If any itemset of this transaction is already present in
another branch (for example in the 1st transaction), then this transaction branch would
share a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in this
transaction
5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked according
to transactions.
6) The next step is to mine the created FP Tree. For this, the lowest node is examined first along
with the links of the lowest nodes. The lowest node represents the frequency pattern length 1.
From this, traverse the path in the FP Tree. This path or paths are called a conditional pattern
base. Conditional pattern base is a sub-database consisting of prefix paths in the FP tree
occurring with the lowest node (suffix).
7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree
1. The lowest node item I5 is not considered as it does not have a min support count,
hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2, I1, I3, I4: 1} and {I2, I3, I4: 1}. Therefore,
considering I4 as the suffix, the prefix paths will be {I2, I1, I3: 1} and {I2, I3: 1}. This forms the
conditional pattern base.
3. The conditional pattern base is considered a transaction database, an FP-tree is
constructed. This will contain {I2:2, I3:2}, I1 is not considered as it does not meet the
min support count.
4. This path will generate all combinations of frequent patterns :
{I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix paths would be {I2, I1: 3} and {I2: 1}; this will generate a 2-node conditional
FP-tree {I2: 4, I1: 3}, and the frequent patterns generated are {I2, I3: 4}, {I1, I3: 3}, {I2, I1, I3: 3}.
6. For I1, the prefix path would be {I2: 4}; this will generate a single-node conditional FP-tree
{I2: 4}, and the frequent pattern generated is {I2, I1: 4}.
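
In practice, FP-Growth is usually run through a library rather than coded by hand. The sketch below assumes the third-party mlxtend package (and pandas) is installed; it mines the AllElectronics-style transactions from the Apriori example with a minimum support of 2/9. The API calls are mlxtend's, while the data layout is illustrative.

# Sketch assuming the mlxtend package (pip install mlxtend) is available.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
                ["I1","I3"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"],
                ["I1","I2","I3"]]

# One-hot encode the transactions into a Boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets with minimum support 2/9 and derive rules from them
itemsets = fpgrowth(df, min_support=2/9, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(itemsets)
print(rules[["antecedents", "consequents", "support", "confidence"]])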
Advantages Of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which
scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
1. FP Tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the algorithm may not fit in the shared memory
MODULE 4
Classification and prediction:
What is classification and prediction? – Issues regarding Classification and prediction:
Classification methods: Decision tree, Bayesian Classification, Rule based, CART, Neural Network
Prediction methods: Linear and nonlinear regression, Logistic Regression. Introduction of tools
such as DB Miner /WEKA/DTREG DM Tools.
What is classification and prediction?

There are two forms of data analysis that can be used to extract models describing important
classes or predict future data trends. These two forms are as follows:

1. Classification
2. Prediction

We use classification and prediction to extract a model, representing the data classes to
predict future data trends. Classification predicts the categorical labels of data with the
prediction models. This analysis provides us with the best understanding of the data at a large
scale.

Classification models predict categorical class labels, and prediction models predict
continuous-valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky or a prediction model to predict the expenditures
in dollars of potential customers on computer equipment given their income and occupation.
What is Classification?

Classification is to identify the category or the class label of a new observation. First, a set
of data is used as training data. The set of input data and the corresponding outputs are
given to the algorithm. So, the training data set includes the input data and their
associated class labels. Using the training dataset, the algorithm derives a model or the
classifier. The derived model can be a decision tree, mathematical formula, or a neural
network. In classification, when unlabeled data is given to the model, it should find the
class to which it belongs. The new data provided to the model is the test data set.

Classification is the process of classifying a record. One simple example of classification is


to check whether it is raining or not. The answer can either be yes or no. So, there is a
particular number of choices. Sometimes there can be more than two classes to classify.
That is called multiclass classification.

The bank needs to analyze whether giving a loan to a particular customer is risky or
not. For example, based on observable data for multiple loan borrowers, a classification
model may be established that forecasts credit risk. The data could track job records,
homeownership or leasing, years of residency, number, type of deposits, historical credit
ranking, etc. The goal would be credit ranking, the predictors would be the other
characteristics, and the data would represent a case for each consumer. In this example,
a model is constructed to find the categorical label. The labels are risky or safe.

How does Classification Works?

The functioning of classification has been illustrated above with the bank loan application.
There are two stages in a data classification system: building the classifier (model creation)
and applying the classifier for classification.
1. Developing the Classifier or model creation: This level is the learning stage or the
learning process. The classification algorithms construct the classifier in this stage. A
classifier is constructed from a training set composed of the records of databases and their
corresponding class names. Each category that makes up the training set is referred to as
a category or class. We may also refer to these records as samples, objects, or data points.
2. Applying classifier for classification: The classifier is used for classification at this level.
The test data are used here to estimate the accuracy of the classification algorithm. If the
accuracy is considered acceptable, the classification rules can be applied to new
data records. Applications include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media
monitoring. We can use it to extract social media insights. We can build sentiment
analysis models to read and analyze misspelled words with advanced machine
learning algorithms. Accurately trained models provide consistently accurate
outcomes in a fraction of the time.
o Document Classification: We can use document classification to organize the
documents into sections according to the content. Document classification refers
to text classification; we can classify the words in the entire document. And with
the help of machine learning classification algorithms, we can execute it
automatically.
o Image Classification: Image classification assigns an image to one of a set of trained
categories, such as a caption, a statistical value, or a theme. You
can tag images to train your model for relevant categories by applying supervised
learning algorithms.
o Machine Learning Classification: It uses the statistically demonstrable algorithm
rules to execute analytical tasks that would take humans hundreds of more hours
to perform.
3. Data Classification Process: The data classification process can be categorized into five
steps:
o Define the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential data that we store.
o Apply labels by tagging the data.
o Use the results to improve security and compliance.
o Data is dynamic, and classification is an ongoing process.

What is Data Classification Lifecycle?


The data classification life cycle produces an excellent structure for controlling the flow of
data to an enterprise. Businesses need to account for data security and compliance at
each level. With the help of data classification, we can perform it at every stage, from
origin to deletion. The data life-cycle has the following stages, such as:

1. Origin: It produces sensitive data in various formats, with emails, Excel, Word, Google
documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it
according to in-house protection policies and compliance rules.
3. Storage: Here, we have the obtained data, including access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from
various devices and platforms.
5. Archive: Here, data is eventually archived within an industry's storage systems.
6. Publication: Through the publication of data, it can reach customers. They can then view
and download in the form of dashboards.

What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. Same
as in classification, the training dataset contains the inputs and corresponding numerical
output values. The algorithm derives the model or a predictor according to the training
dataset. The model should find a numerical output when the new data is given. Unlike in
classification, this method does not have a class label. The model predicts a continuous-
valued function or ordered value.

Regression is generally used for prediction. Predicting the value of a house depending on
the facts such as the number of rooms, the total area, etc., is an example for prediction.

For example, suppose the marketing manager needs to predict how much a particular
customer will spend at his company during a sale. We want to forecast a numerical
value in this case, so this is an example of numeric prediction. A model or a predictor
will be developed that forecasts a continuous-valued or ordered value.

Classification and Prediction Issues


The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities, such as:

1. Data Cleaning: Data cleaning involves removing the noise and treatment of
missing values. The noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the
following methods.
o Normalization: The data is transformed using normalization.
Normalization involves scaling all values for a given attribute to make them
fall within a small specified range. Normalization is used when the neural
networks or the methods involving measurements are used in the learning
step.
o Generalization: The data can also be transformed by generalizing it to the
higher concept. For this purpose, we can use the concept hierarchies

Comparison of Classification and Prediction Methods


Here are the criteria for comparing the methods of Classification and Prediction, such as:

o Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to
predict the class label correctly, and the accuracy of the predictor can be referred to as
how well a given predictor can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and
using the classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the
context of data mining, robustness is the ability of the classifier or predictor to make
correct predictions from incoming unknown data.
o Scalability: Scalability refers to an increase or decrease in the performance of the classifier
or predictor based on the given data.
o Interpretability: Interpretability is how readily we can understand the reasoning behind
predictions or classification made by the predictor or classifier.

Difference between Classification and Prediction


The decision tree, applied to existing data, is a classification model. We can get a class
prediction by applying it to new data for which the class is unknown. The assumption is
that the new data comes from a distribution similar to the data we used to construct our
decision tree. In many instances, this is a correct assumption, so we can use the decision
tree to build a predictive model. Classification or prediction is the process of finding a
model that describes the classes or concepts of the data. The purpose is to predict the
class of objects whose class label is unknown using this model. Below are some major
differences between classification and prediction.

o Classification is the process of identifying which category a new observation belongs to, based on a
training data set containing observations whose category membership is known. Prediction is the
process of identifying the missing or unavailable numerical data for a new observation.

o In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy
depends on how well a given predictor can guess the value of a predicted attribute for new data.

o In classification, the model can be known as the classifier. In prediction, the model can be known as
the predictor.

o In classification, a model or classifier is constructed to find categorical labels. In prediction, a model or
predictor is constructed that predicts a continuous-valued function or ordered value.

o For example, the grouping of patients based on their medical records can be considered classification,
whereas predicting the correct treatment for a particular disease for a person can be thought of as
prediction.

Classification methods:
DECISION TREE

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


• Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.

• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.

• Branch/Sub Tree: A tree formed by splitting the tree.

• Pruning: Pruning is the process of removing the unwanted branches from the tree.

• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; these final nodes are the leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with
the root node (Salary attribute by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:
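
A hedged scikit-learn sketch of such a tree is shown below; the features (salary, distance, cab facility), their values, and the labels are invented for illustration and are not real data.

# Sketch assuming scikit-learn is installed; the toy data is invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary_in_lakhs, distance_from_office_km, cab_facility (1 = yes)]
X = [[12, 5, 1], [12, 25, 0], [6, 5, 1], [15, 30, 1], [7, 40, 0], [14, 8, 0]]
y = ["Accept", "Decline", "Decline", "Accept", "Decline", "Accept"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the learned decision rules and classify a new offer
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print(tree.predict([[13, 20, 1]]))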

Attribute Selection Measures

While implementing a Decision tree, the main issue arises that how to select the best
attribute for the root node and for sub-nodes. So, to solve such problems there is a
technique which is called as Attribute selection measure or ASM. By this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

1. Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
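
A small Python sketch of the entropy and information gain formulas; the yes/no counts in the example call are hypothetical.

import math

def entropy(pos, neg):
    # Entropy(S) = -p(yes) log2 p(yes) - p(no) log2 p(no); a term is 0 when its count is 0.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(parent, splits):
    # Gain = Entropy(parent) - weighted average entropy of the splits.
    # parent and each split are (pos, neg) count pairs.
    total = sum(p + n for p, n in splits)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(*parent) - weighted

# Hypothetical attribute that splits 9 yes / 5 no records into three branches
print(information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]))   # ≈ 0.247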

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σj (Pj)²


Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two types
of tree pruning technology used:

o Cost Complexity Pruning


o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Refer presentation for problems on gini index

BAYESIAN CLASSIFICATION
Data Mining Bayesian Classifiers

In numerous applications, the connection between the attribute set and the class variable
is non-deterministic. In other words, the class label of a test record cannot be
predicted with certainty even though its attribute set is the same as that of some of the training
examples. These circumstances may emerge due to noisy data or the presence of
certain confounding factors that influence classification but are not included in the analysis.
For example, consider the task of predicting whether an individual is at
risk of liver illness based on the individual's eating habits and exercise. Although
most people who eat healthily and exercise consistently have a lower probability of
liver disease, they may still develop it due to other factors, for example the
consumption of high-calorie street food or alcohol abuse. Determining whether an
individual's eating routine is healthy or the amount of exercise is sufficient is also subject
to interpretation, which in turn may introduce uncertainties into the learning problem.

Bayesian classification uses Bayes theorem to predict the occurrence of any event.
Bayesian classifiers are statistical classifiers based on Bayesian probability. The theorem
expresses how a degree of belief, expressed as a probability, should be updated to account for new evidence.

Bayes theorem came into existence after Thomas Bayes, who first utilized conditional
probability to provide an algorithm that uses evidence to calculate limits on an unknown
parameter.

Bayes's theorem is expressed mathematically by the following equation:

P(X/Y) = P(Y/X) · P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is a conditional probability that describes the occurrence of event X is given


that Y is true.

P(Y/X) is a conditional probability that describes the occurrence of event Y is given


that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This
is known as the marginal probability.

Bayesian interpretation:

In the Bayesian interpretation, probability determines a "degree of belief." Bayes theorem


connects the degree of belief in a hypothesis before and after accounting for evidence.
For example, let us consider a coin. If we toss a coin, we get either heads or tails, and each
outcome has a probability of 50%. If the coin is flipped a number of times and the outcomes
are observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.
For proposition X and evidence Y,

o P(X), the prior, is the primary degree of belief in X


o P(X/Y), the posterior is the degree of belief having accounted for Y.

o The quotient P(Y/X) / P(Y) represents the support Y provides for X.

Bayes theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X ⋂ Y) / P(Y)  and  P(Y/X) = P(X ⋂ Y) / P(X),

where P(X ⋂ Y) is the joint probability of both X and Y being true. Equating P(X ⋂ Y) from the two expressions gives Bayes theorem.

Bayesian network:

A Bayesian Network falls under the classification of Probabilistic Graphical Modelling


(PGM) procedure that is utilized to compute uncertainties by utilizing the probability
concept. Generally known as Belief Networks, Bayesian Networks are used to show
uncertainties using Directed Acyclic Graphs (DAG)

A Directed Acyclic Graph is used to show a Bayesian Network, and like some other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
connection between the nodes.
The nodes here represent random variables, and the edges define the relationship
between these variables.

Example

• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability he/she has meningitis?

P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
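
The same calculation in a few lines of Python (numbers taken from the example above):

# Bayes' theorem applied to the stiff-neck example
p_s_given_m = 0.5        # P(S | M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50000          # prior P(M)
p_s = 1 / 20             # prior P(S)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002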
RULE BASED,
• Classify records by using a collection of “if…then…” rules
• Rule: (Condition) → y, where
• Condition is a conjunction of attribute tests
• y is the class label
• LHS: rule antecedent or condition
• RHS: rule consequent
• Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

example

• R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
• R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
• R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
• R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
• R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
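
A minimal sketch of how an ordered rule list such as R1–R5 can be applied to a record; the dictionary encoding of records and the default class are assumptions made for this illustration.

# Each rule: (condition function, class label); rules are checked in order, first match fires.
rules = [
    (lambda r: r["give_birth"] == "no"  and r["can_fly"] == "yes",        "birds"),       # R1
    (lambda r: r["give_birth"] == "no"  and r["live_in_water"] == "yes",  "fishes"),      # R2
    (lambda r: r["give_birth"] == "yes" and r["blood_type"] == "warm",    "mammals"),     # R3
    (lambda r: r["give_birth"] == "no"  and r["can_fly"] == "no",         "reptiles"),    # R4
    (lambda r: r["live_in_water"] == "sometimes",                          "amphibians"), # R5
]

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label
    return "unknown"   # default class when no rule covers the record

hawk = {"blood_type": "warm", "give_birth": "no", "can_fly": "yes", "live_in_water": "no"}
print(classify(hawk))   # birds (R1 fires)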

IF-THEN Rules

Rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following from −
IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes

Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part (the condition) consists of one or more attribute tests, and
these tests are logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm

Sequential Covering Algorithm can be used to extract IF-THEN rules form the training
data. We do not require to generate a decision tree first. In this algorithm, each rule for
a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the
general strategy, the rules are learned one at a time. Each time a rule is learned, the
tuples covered by the rule are removed, and the process continues for the rest of the tuples.
Note − Decision tree induction can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The Following is the sequential learning Algorithm where rules are learned for one class
at a time. When learning a rule from a class Ci, we want the rule to cover all the tuples
from class C only and no tuple form any other class.
Algorithm: Sequential Covering

Input:
D, a data set class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
Output: A Set of IF-THEN rules.
Method:
Rule_set={ }; // initial set of rules learned is empty

for each class c do

repeat
Rule = Learn_One_Rule(D, Att_valls, c);
remove tuples covered by Rule form D;
until termination condition;

Rule_set=Rule_set+Rule; // add a new rule to rule-set


end for
return Rule_Set;
Rule Pruning

A rule is pruned due to the following reasons −


• The Assessment of quality is made on the original set of training data. The
rule may perform well on training data but less well on subsequent data.
That's why the rule pruning is required.
• The rule is pruned by removing conjunct. The rule R is pruned, if pruned
version of R has greater quality than what was assessed on an independent
set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
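
A small illustrative helper for this measure (the example counts are invented):

def foil_prune(pos, neg):
    # FOIL_Prune = (pos - neg) / (pos + neg) for a rule covering
    # pos positive and neg negative tuples on the pruning set.
    return (pos - neg) / (pos + neg)

# Prune R only if the pruned version scores higher, e.g.:
print(foil_prune(20, 5))   # original rule: 0.6
print(foil_prune(18, 2))   # pruned rule:   0.8 -> keep the pruned version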

CART

• CART is an alternative decision tree building algorithm. It can handle both classification
and regression tasks.
• This algorithm uses a new metric named gini index to create decision points for
classification tasks. A step-by-step CART decision tree example is worked out below.

Gini index
• Gini index is a metric for classification tasks in CART. It is based on the sum of squared
probabilities of each class and can be formulated as illustrated below.
• Gini = 1 − Σ (Pi)², for i = 1 to the number of classes.
• Outlook is a nominal feature. It can be sunny, overcast or rain. The final decisions for the
outlook feature are summarized below.

• Gini(Outlook=Sunny) = 1 − (2/5)² − (3/5)² = 1 − 0.16 − 0.36 = 0.48
• Gini(Outlook=Overcast) = 1 − (4/4)² − (0/4)² = 0
• Gini(Outlook=Rain) = 1 − (3/5)² − (2/5)² = 1 − 0.36 − 0.16 = 0.48
• Then, we calculate the weighted sum of Gini indexes for the outlook feature.
• Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342

• Temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild. The
decisions for the temperature feature are summarized below.
• Gini(Temp=Hot) = 1 − (2/4)² − (2/4)² = 0.5
• Gini(Temp=Cool) = 1 − (3/4)² − (1/4)² = 1 − 0.5625 − 0.0625 = 0.375
• Gini(Temp=Mild) = 1 − (4/6)² − (2/6)² = 1 − 0.444 − 0.111 = 0.445
• The weighted sum of the Gini index for the temperature feature:
• Gini(Temp) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.445 = 0.142 + 0.107 + 0.190 = 0.439
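
The weighted Gini values above can be reproduced with a short Python sketch; the yes/no counts per branch are the ones quoted in the calculations above.

def gini(counts):
    # Gini = 1 - sum(p_i^2) over the class counts of one branch.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(branches, total):
    # Weighted sum of branch Gini values for one candidate feature.
    return sum(sum(c) / total * gini(c) for c in branches)

# Outlook: Sunny (2 yes / 3 no), Overcast (4 / 0), Rain (3 / 2) out of 14 records
print(round(weighted_gini([(2, 3), (4, 0), (3, 2)], 14), 3))   # 0.343 (0.342 with the rounding above)
# Temperature: Hot (2 / 2), Cool (3 / 1), Mild (4 / 2)
print(round(weighted_gini([(2, 2), (3, 1), (4, 2)], 14), 3))   # 0.440 (0.439 with the rounding above)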

Advantages of CART algorithm


1. The CART algorithm is nonparametric, so it does not depend on the data coming from a certain
type of distribution.
2. The CART algorithm combines testing with a test data set and cross-validation to measure the
goodness of fit more precisely.
3. CART allows the same variables to be used many times in different regions of the tree. This
ability can reveal intricate interdependencies between groups of variables.
4. Outliers in the input variables have no meaningful effect on CART.
5. One can loosen the halting restrictions to allow the decision tree to overgrow and then prune the
tree back to its ideal size. This approach reduces the likelihood of missing important structure in
the data set by terminating too soon.
6. CART can be used in combination with other prediction algorithms to choose the input set of
variables.

The CART algorithm is a building block of Random Forest, which is one of the most powerful
algorithms of machine learning. The CART algorithm is organized as a series of questions, the
responses to which decide the following question, if any. The ultimate outcome of these questions
is a tree-like structure with terminal nodes where there are no more questions.

This algorithm is widely used for making decision trees through classification and regression.
Decision trees are widely used in data mining to create a model that predicts the value of a target
based on the values of many input (independent) variables.
NEURAL NETWORK

PREDICTION METHODS: Linear and nonlinear regression, Logistic Regression.

REGRESSION

• Is a data mining function that predicts a number or value.


• Age, weight, distance, temperature, income, or sales attributes can be predicted using
regression techniques.
• For example, a regression model could be used to predict children’s height, given their age,
weight and other factors.
• A regression task begins with a data set in which the target values are known.
• For example, a regression model that predicts children's height could be developed based
on observed data for many children over a period of time.
• The data might track age, height, weight, developmental milestones, family history etc.
• Height would be the target, the other attributes would be the predictors, and the data for
each child would constitute a case.
• Regression models are tested by computing various statistics that measure the difference
between the predicted values and the expected values.
• The goal of regression analysis is to determine the values of parameters for a function that
cause the function to fit best in a set of data observations that we provide.
• Regression refers to a data mining technique that is used to predict the numeric
values in a given data set. For example, regression might be used to predict the
product or service cost or other variables. It is also used in various industries for
business and marketing behavior, trend analysis, and financial forecast.

Linear Regression

Linear regression is the type of regression that forms a relationship between the target
variable and one or more independent variables utilizing a straight line. The given
equation represents the equation of linear regression

Y = a + b*X + e.

Where,

a represents the intercept

b represents the slope of the regression line

e represents the error

X and Y represent the predictor and target variables, respectively.


If X is made up of more than one variable, the model is termed a multiple linear regression.

In linear regression, the best fit line is achieved utilizing the least squared method, and it
minimizes the total sum of the squares of the deviations from each data point to the line
of regression. Here, the positive and negative deviations do not get canceled as all the
deviations are squared.
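
A minimal least-squares fit in Python; the sample data points are invented for illustration.

import numpy as np

# Invented sample data: X = predictor, Y = target
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Least-squares estimates of slope b and intercept a for Y = a + b*X + e
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
print(f"Y ≈ {a:.2f} + {b:.2f} * X")

# Prediction for a new value of X
print(a + b * 6)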

NON LINEAR REGRESSION

• Often the relationship between x and y cannot be approximated with a straight line or simple
curve; in that case, a nonlinear regression technique may be used.
• Alternatively, the data could be preprocessed to make the relationship linear.

LOGISTIC REGRESSION
A linear regression is not appropriate for predicting the value of a binary variable for two
reasons:

• A linear regression will predict values outside the acceptable range (e.g. predicting
probabilities outside the range 0 to 1).
• Since the experiments can only have one of two possible values for each experiment, the
residuals(random errors) will not be normally distributed about the predicted line.

A logistic regression produces a logistic curve, which is limited to values between 0 and 1.

• Logistic regression is similar to a linear regression, but the curve is constructed using the
natural logarithm “odds” of the target variable, rather than the probability.
Logistic regression is basically a supervised classification algorithm. In a classification
problem, the target variable(or output), y, can take only discrete values for a given set
of features(or inputs), X.
Contrary to popular belief, logistic regression IS a regression model. The model builds
a regression model to predict the probability that a given data entry belongs to the
category numbered as “1”. Just like Linear regression assumes that the data follows a
linear function, Logistic regression models the data using the sigmoid function.
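
A short sketch of the sigmoid function and of a scikit-learn logistic regression; the toy data (hours studied vs. pass/fail) is an invented example.

import math
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # The sigmoid maps any real value into (0, 1), interpreted as a probability.
    return 1 / (1 + math.exp(-z))

print(sigmoid(0), sigmoid(2), sigmoid(-2))   # 0.5, ~0.88, ~0.12

# Toy binary classification: hours studied -> pass (1) / fail (0)
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))   # probabilities of fail / pass
print(model.predict([[3.5]]))         # class after applying the 0.5 threshold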

Logistic regression becomes a classification technique only when a decision threshold


is brought into the picture. The setting of the threshold value is a very important
aspect of Logistic regression and is dependent on the classification problem itself.
The decision for the value of the threshold value is majorly affected by the values
of precision and recall. Ideally, we want both precision and recall to be 1, but this
seldom is the case.
In the case of a Precision-Recall tradeoff, we use the following arguments to decide
upon the threshold:-
1. Low Precision/High Recall: In applications where we want to reduce the number of
false negatives without necessarily reducing the number of false positives, we choose
a decision value that has a low value of Precision or a high value of Recall. For
example, in a cancer diagnosis application, we do not want any affected patient to be
classified as not affected without giving much heed to if the patient is being wrongfully
diagnosed with cancer. This is because the absence of cancer can be confirmed by
further medical tests, but the presence of the disease cannot be detected in an
already rejected candidate.
2. High Precision/Low Recall: In applications where we want to reduce the number of
false positives without necessarily reducing the number of false negatives, we choose
a decision value that has a high value of Precision or a low value of Recall. For
example, if we are classifying customers whether they will react positively or
negatively to a personalized advertisement, we want to be absolutely sure that the
customer will react positively to the advertisement because otherwise, a negative
reaction can cause a loss of potential sales from the customer.
Based on the number of categories, Logistic regression can be classified as:

1. binomial: target variable can have only 2 possible types: “0” or “1” which
may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
2. multinomial: target variable can have 3 or more possible types which are
not ordered(i.e. types have no quantitative significance) like “disease A” vs
“disease B” vs “disease C”.
3. ordinal: it deals with target variables with ordered categories. For example, a test score can be categorized as: “very poor”, “poor”, “good”, “very good”. Here, each category can be given a score like 0, 1, 2, 3.
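A minimal sketch of binomial logistic regression used as a classifier is given below (in Python, assuming the scikit-learn library is installed). The toy data, the single feature and the 0.5 threshold are illustrative assumptions only, not part of any particular dataset.

# Minimal sketch: binomial logistic regression with an explicit decision threshold.
# The data is an invented toy example (one feature, binary target).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])  # e.g. hours studied
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                  # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(X, y)

# The sigmoid keeps predicted probabilities between 0 and 1
probs = model.predict_proba(X)[:, 1]

# Classification happens only once a decision threshold is applied;
# lowering it raises recall, raising it favours precision.
threshold = 0.5
predictions = (probs >= threshold).astype(int)
print(probs.round(3))
print(predictions)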

Introduction to tools such as DB Miner / WEKA / DTREG data mining tools


Weka

• It is open source and free software.


• It is best suited for data analysis and predictive modelling.
• It contains algorithms and visualization tools that support data mining tasks
and machine learning.
• Weka has a GUI that gives easy access to all its features.
• It is written in JAVA language.
• Weka supports major data mining tasks including preprocessing, classification, regression, clustering, association rules and visualization.
• Weka can also provide access to SQL Databases through database
connectivity and can further process the data returned by the query.
MODULE 5
Data Mining for Business Intelligence Applications: Data mining for business Applications
like Balanced Scorecard, Fraud Detection, Clickstream Mining, Market Segmentation, retail
industry, telecommunications industry, banking & finance and CRM etc., Data Analytics Life
Cycle: Introduction to Big data Business Analytics - State of the practice in analytics role of
data scientists Key roles for successful analytic project - Main phases of life cycle - Developing
core deliverables for stakeholders.

Data Mining for Business Intelligence Applications


Data mining applications span several major domains, including financial data analysis, the retail and telecommunication industries, science and engineering, intrusion detection and prevention, and recommender systems.

Data mining for business Applications like Balanced Scorecard


What Is a Balanced Scorecard (BSC)?

• The term balanced scorecard (BSC) refers to a strategic management performance metric used to identify and improve various internal business functions and their resulting external outcomes. Used to measure and provide
feedback to organizations, balanced scorecards are common among companies in the
United States, the United Kingdom, Japan, and Europe. Data collection is crucial to
providing quantitative results as managers and executives gather and interpret the
information. Company personnel can use this information to make better decisions for
the future of their organizations. A balanced scorecard is a performance metric used
to identify, improve, and control a business's various functions and resulting outcomes.

• The concept of BSCs was first introduced in 1992 by David Norton and Robert Kaplan,
who took previous metric performance measures and adapted them to include
nonfinancial information.
• BSCs were originally developed for for-profit companies but were later adapted for
use by nonprofits and government agencies.
• The balanced scorecard involves measuring four main aspects of a business: Learning
and growth, business processes, customers, and finance.
• BSCs allow companies to pool information in a single report, to provide information
into service and quality in addition to financial performance, and to help improve
efficiencies.
Understanding Balanced Scorecards (BSCs)
Accounting academic Dr. Robert Kaplan and business executive and theorist Dr. David
Norton first introduced the balanced scorecard. The Harvard Business Review first published
it in the 1992 article "The Balanced Scorecard—Measures That Drive Performance." Both
Kaplan and Norton worked on a year-long project involving 12 top-performing companies.
Their study took previous performance measures and adapted them to include nonfinancial
information.

BSCs were originally meant for for-profit companies but were later adapted for nonprofit
organizations and government agencies. The scorecard is meant to measure the intellectual capital of a
company, such as training, skills, knowledge, and any other proprietary information that gives
it a competitive advantage in the market. The balanced scorecard model reinforces good
behavior in an organization by isolating four separate areas that need to be analyzed. These
four areas, also called legs, involve:

• Learning and growth


• Business processes
• Customers
• Finance

The BSC is used to gather important information, such as objectives, measurements, initiatives, and goals, that result from these four primary functions of a business. Companies
can easily identify factors that hinder business performance and outline strategic changes
tracked by future scorecards.

The scorecard can provide information about the firm as a whole when viewing company
objectives. An organization may use the balanced scorecard model to implement strategy
mapping to see where value is added within an organization. A company may also use a BSC
to develop strategic initiatives and strategic objectives. This can be done by assigning tasks
and projects to different areas of the company in order to boost financial and operational
efficiencies, thus improving the company's bottom line.

Characteristics of the Balanced Scorecard Model (BSC)


Information is collected and analyzed from four aspects of a business:
1. Learning and growth are analyzed through the investigation of training and
knowledge resources. This first leg handles how well information is captured and how
effectively employees use that information to convert it to a competitive
advantage within the industry.
2. Business processes are evaluated by investigating how well products are
manufactured. Operational management is analyzed to track any gaps, delays,
bottlenecks, shortages, or waste.
3. Customer perspectives are collected to gauge customer satisfaction with the quality,
price, and availability of products or services. Customers provide feedback about their
satisfaction with current products.
4. Financial data, such as sales, expenditures, and income are used to understand
financial performance. These financial metrics may include dollar amounts, financial
ratios, budget variances, or income targets.

These four legs encompass the vision and strategy of an organization and require active
management to analyze the data collected.

The balanced scorecard is often referred to as a management tool rather than a measurement tool because of its application by a company's key personnel.
Benefits of a Balanced Scorecard (BSC)
There are many benefits to using a balanced scorecard. For instance, the BSC allows
businesses to pool together information and data into a single report rather than having to deal
with multiple tools. This allows management to save time, money, and resources when they
need to execute reviews to improve procedures and operations.

Scorecards provide management with valuable insight into their firm's service and quality in
addition to its financial track record. By measuring all of these metrics, executives are able to
train employees and other stakeholders and provide them with guidance and support. This
allows them to communicate their goals and priorities in order to meet their future goals.

Another key benefit of BSCs is that they help companies reduce suboptimization, that is, reliance on inefficient processes. Suboptimization often results in reduced productivity or output, which can lead to higher costs, lower revenue, and a breakdown in company brand names and their reputations.

Examples of a Balanced Scorecard (BSC)


Corporations can use their own, internal versions of BSCs. For example, banks often contact
customers and conduct surveys to gauge how well they do in their customer service. These
surveys include rating recent banking visits, with questions ranging from wait times,
interactions with bank staff, and overall satisfaction. They may also ask customers to make
suggestions for improvement. Bank managers can use this information to help retrain staff if
there are problems with service or to identify any issues customers have with products,
procedures, and services.
In other cases, companies may use external firms to develop reports for them. For instance,
the J.D. Power survey is one of the most common examples of a balanced scorecard. This
firm provides data, insights, and advisory services to help companies identify problems in
their operations and make improvements for the future. J.D. Power does this through surveys
in various industries, including the financial services and automotive industries. Results are
compiled and reported back to the hiring firm.

Balanced Scorecard (BSC) FAQs


What Is a Balanced Scorecard and How Does It Work?
A balanced scorecard is a strategic management performance metric that helps companies
identify and improve their internal operations to help their external outcomes. It measures
past performance data and provides organizations with feedback on how to make better
decisions in the future.

What Are the Four Perspectives of the Balanced Scorecard?


The four perspectives of a balanced scorecard are learning and growth, business processes,
customer perspectives, and financial data. These four areas, which are also called legs, make
up a company's vision and strategy. As such they require a firm's key personnel, whether that's
the executive and/or its management team(s), to analyze the data collected in the scorecard.

How Do You Use a Balanced Scorecard?


Balanced scorecards allow companies to measure their intellectual capital along with their
financial data to break down successes and failures in their internal processes. By compiling
data from past performance in a single report, management can identify inefficiencies, devise
plans for improvement, and communicate goals and priorities to their employees and other
stakeholders.

What Are the Balanced Scorecard Benefits?


There are many benefits to using a scorecard. The most important advantages include the
ability to bring information into a single report, which can save time, money, and resources.
It also allows companies to track their performance in service and quality in addition to
tracking their financial data. Scorecards also allow companies to recognize and reduce
inefficiencies.

What Is a Balanced Scorecard Example?


Corporations may use internal methods to develop scorecards. For instance, they may conduct
customer service surveys to identify the successes and failures of their products and services
or they may hire external firms to do the work for them. J.D. Power is an example of one such
firm that is hired by companies to conduct research on their behalf.

The Bottom Line


Companies have a number of options available to help identify and resolve issues with their
internal processes so they can improve their financial success. Balanced scorecards allow
companies to collect and study data from four key areas, including learning and growth,
business processes, customers, and finance. By pooling together information in just one report, companies can save time, money, and resources to better train staff, communicate with
stakeholders, and improve their financial position in the market.

What is Fraud Detection and Prevention?


Fraudulent activities can encompass a wide range of cases, including money laundering, cybersecurity threats, tax evasion, fraudulent insurance claims, forged bank checks, identity theft, and terrorist financing, and they are prevalent throughout the financial services, government, healthcare, public, and insurance sectors.
To combat this growing list of opportunities for fraudulent transactions, organizations are
implementing modern fraud detection and prevention technologies and risk management
strategies, which combine big data sources with real-time monitoring, and apply adaptive and
predictive analytics techniques, such as Machine Learning, to create a risk of fraud score.
Detecting fraud with data analytics, fraud detection software and tools, and a fraud detection
and prevention program enables organizations to predict conventional fraud tactics, cross-
reference data through automation, manually and continually monitor transactions and crimes
in real time, and decipher new and sophisticated schemes.
Fraud detection and prevention software is available in both proprietary and open source
versions. Common features in fraud analytics software include: a dashboard, data import and
export, data visualization, customer relationship management integration, calendar
management, budgeting, scheduling, multi-user capabilities, password and access
management, Application Programming Interfaces (API), two-factor authentication, billing,
and customer database management.
Fraud Detection and Prevention Techniques
Fraud data analytics methodologies can be categorized as either statistical data analysis
techniques or artificial intelligence (AI).
Statistical data analysis techniques include:
• Statistical parameter calculation, such as averages, quantiles, and performance metrics
• Regression analysis - estimates relationships between independent variables and a dependent variable
• Probability distributions and models
• Data matching - used to compare two sets of collected data, remove duplicate records, and identify links between sets
• Time-series analysis
AI techniques include (a brief illustrative sketch follows these lists):
• Data mining - data mining for fraud detection and prevention classifies and segments data groups in which millions of transactions can be performed to find patterns and detect fraud
• Neural networks - suspicious patterns are learned and used to detect further repeats
• Machine Learning - fraud analytics Machine Learning automatically identifies characteristics found in fraud
• Pattern recognition - detects patterns or clusters of suspicious behavior
The four most crucial steps in the fraud prevention and detection process include:
• Capture and unify all manner of data types from every channel and incorporate them into the analytical process.
• Continually monitor all transactions and employ behavioral analytics to facilitate real-time decisions.
• Incorporate an analytics culture into every facet of the enterprise through data visualization.
• Employ layered security techniques.
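As a small illustration of the machine-learning techniques listed above, the hedged Python sketch below scores transactions for fraud risk with an unsupervised outlier detector (scikit-learn's Isolation Forest). The simulated transactions, the two features and the contamination rate are assumptions, not part of any specific fraud product.

# Hedged sketch: unsupervised fraud scoring with an Isolation Forest.
# Transactions are simulated; features are [amount, hour_of_day].
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = np.column_stack([rng.normal(50, 15, 500),       # typical purchase amounts
                          rng.integers(8, 22, 500)])     # daytime hours
anomalies = np.array([[900, 3], [1200, 2], [750, 4]])    # large, late-night purchases
transactions = np.vstack([normal, anomalies])

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(transactions)

# Lower scores mean more anomalous; flag the five most suspicious transactions
scores = detector.score_samples(transactions)
flagged = np.argsort(scores)[:5]
print(transactions[flagged])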
CLICK STREAM MINING
A clickstream is a record of a user's activity on the internet, including every web site and every page of every web site that the user visits, how long the user was on a page or site, in what order the pages were visited, any newsgroups that the user participates in, and even the email addresses of mail that the user sends and receives. Clickstream mining is the analysis of this record.
• Both ISPs and individual web sites are capable of tracking a user's clickstream. Clickstream data is becoming increasingly valuable to internet marketers and advertisers. Be aware of the large amount of data a clickstream generates.
• These 'footprints' that visitors leave at a site have grown wildly - large businesses may gather a terabyte of it every day. But the ability to analyse such data hasn't kept pace with the ability to capture it.
• The next frontier of web data analysis is better integration of clickstream data with other
customer information such as purchase history and even demographic profiles, to form what's
often called a "360-degree view" of a site visitor.
• Clickstream analysis can be seen as a four-stage process of collection, storage, analysis and reporting. The first two concentrate on gathering and formatting information, and the latter two on making sense of it.
• There are two levels of clickstream data analysis: web traffic analysis, which is movement-related, and commerce-based analysis, which looks at e-business-related activities.
 Web traffic analysis
o Web traffic analysis operates at the web server level and concentrates on how visitors navigate through the site.
o It measures the number of pages delivered to the customer as opposed to pages sent by the server. It determines how often visitors hit the browser stop button, how much of the page was delivered until they hit the button and how long they waited before they hit it.
o Performance parameters are also logged, such as the length of time it took to load a page and how much data was transmitted.
o Commerce or e-business analysis can use higher level information out of clickstreams, such
as tracking visitors' responses to pages and their content.
o One of the main reasons for measuring clickstream data at this level is to analyze the effectiveness of the web as a channel to market.
o Measuring the success of commerce activities is much more difficult than evaluating web traffic, because it looks at why visitors behave in a particular way, not just where they went. So with high-level clickstream analysis it is even possible to see the reactions of customers: what items people buy and which items they take out of their shopping basket.
o This provides business-level information about how visitors interact with the site which can
be helpful to aid further site development.
o With clickstream data more values can be gained by combining them with information from
other sources like direct marketing or sales.
o A direct mail campaign may be used to encourage customers to visit the web site. Now the
effectiveness of the mail campaign can be measured by collecting clickstream data from users
who have been sent mail shots and those who have not.
 E-business feedback
o The e-business analysis cycle is more sophisticated. This process combines web site activity
with data from other sources, such as visitor profile information, sales databases and
campaigns that include links to the web site.
o It provides higher-level information, more focused answers and information that can be used to enhance e-commerce activities across the business, as well as improving the web site.
o The e-business cycle is a continual process, involving the integration of web and other data
with web-site activity data analysis, followed by improvements.
o The integration of e-business and enterprise data with web traffic and other type of data
allows discoveries and insights that cannot be gained by observing web activity alone, and
increases the potential for qualitative analysis.
o When attempting clickstream analysis it is important to distinguish between the two types of tools on the market: analysis tools interpret actions on web sites, while straightforward reporting tools will only log actions.
o A second consideration is whether the analysis tool supports real-time data feeds or uses a batch processing model. Batch processing can only ever analyse historical data, and this lengthens the time between customer actions and a firm's reaction.
o Take into consideration that real-time data feeds are more in tune with the move towards dynamic web pages, customer profiling and the use of personalization engines.
o Real-time data feeds do not restrict the company to generating weekly or monthly reports, but can support real-time reporting, which can speed up the decision-making process when tuning the web site. However, there are performance and bandwidth issues associated with real-time reporting.
o Clickstream analysis automates much of the analysis process, but even with the best tools,
some human intervention and analysis will be necessary, especially if the clickstream data is
used in conjunction with other data sources.
o For example, if site visits peak at a certain time on a particular day, the tool can readily
recognize the spike but will not necessarily discern the reason, which may be that a special
marketing campaign ran just beforehand.
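A hedged Python/pandas sketch of basic web-traffic analysis on raw clickstream records is given below. The column names (user_id, page, timestamp) and the tiny log are assumptions about how a web-server log might be exported.

# Hedged sketch: page views and time-on-page from a raw clickstream log.
import pandas as pd

clicks = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "page":      ["/home", "/product", "/cart", "/home", "/product"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:00:40", "2024-01-01 10:02:10",
        "2024-01-01 11:00:00", "2024-01-01 11:05:00"]),
})

clicks = clicks.sort_values(["user_id", "timestamp"])
# Time on a page = gap until the same user's next click (unknown for the last page)
next_click = clicks.groupby("user_id")["timestamp"].shift(-1)
clicks["time_on_page"] = next_click - clicks["timestamp"]

print(clicks["page"].value_counts())                    # pages delivered per URL
print(clicks.groupby("page")["time_on_page"].mean())    # average dwell time per page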

MARKET SEGMENTATION
Before businesses and companies release their products and services, they first decide on
whom to cater those. Different products and services, even if from the same company, can be
marketed towards different group of people.

That is why companies and organizations use different methods to know where they can best
market their product. One of the most used methods is market research. This is where
businesses gather, analyze and interpret all the information collected. After that, companies
now need to group their customers for an easier marketing. That method is called market
segmentation.

What is Market Segmentation?

According to Statistical Concepts, Market Segmentation is the “process of partitioning markets into groups of customers and prospects with similar needs and/or characteristics who are likely
to exhibit similar purchase behavior.”

Market segmentation is an important task in marketing. It permits marketing staff to know what marketing method they can use for a particular group in the market. They can mix and
match different combinations of product’s price, promotion, and place. They also use market
segmentation to know which customer can maintain its loyalty or the customers that will likely
be more willing to buy their products.

Market segmentation is not only used for business-customers relationship. They are also used
for business-business relationship. The same statistical method is used for segmentation but
the characteristics like the demographics and categories will be different.

Data used for market segmentation is usually from an account or CRM (Customer Relationship
Management) database. Sometimes they also use external databases like researches and
surveys. The ones from databases are usually the product of data mining.

What Is Data Mining?

In simpler terms, data mining is where systems study large databases to derive new information that can be used for business. This new information is used to forecast and calculate new trends.

A lot of benefits can be derived from using data mining. One obvious benefit is that it will be
easier to discover unseen relationships and patterns in databases. It can be used for making
predictions about future trends and for marketing teams to devise their tactics to fit in with it.
Also, it will set apart a company from its competitor because of the data they know.

How Can Data Mining Improve Market Segmentation?

For marketing purposes, data mining is such a huge help. Using the database of Customer
Relationship Management (CRM), the demographics — age, sex, religion, income, occupation
and education, geographic, psychographic, and behavioral information of the customers will
be helpful in segmenting them. The segmentation process will be faster and easier. Also, with
the new data and information that comes with data mining, it can also help with the market
segmentation.

As for its business purpose, knowing your target market and their needs and wants is a lot cheaper than releasing different products and services that will cater to different customers. It will help businesses and companies use the full potential of their resources while still making sales, rather than relying on trial and error.

Moreover, it will be easier for marketing teams to sell the services and products because each segment of the market shares a common set of desires and wishes.
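A hedged Python sketch of clustering-based segmentation follows. The CRM-style features (age, annual income, purchase frequency) and the choice of three segments are illustrative assumptions.

# Hedged sketch: customer segmentation with k-means clustering (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = np.array([
    [22, 18000, 40], [25, 21000, 35], [23, 20000, 38],   # young, frequent buyers
    [45, 80000,  5], [50, 95000,  4], [47, 87000,  6],   # affluent, infrequent buyers
    [33, 40000, 15], [36, 42000, 18], [31, 39000, 14],   # mid-range customers
])

scaled = StandardScaler().fit_transform(customers)        # put features on one scale
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

print(kmeans.labels_)           # segment assigned to each customer
print(kmeans.cluster_centers_)  # segment profiles in scaled feature space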
RETAIL INDUSTRY

The retail industry is a major application area for data mining because it collects huge amounts of records on sales, users' shopping history, goods transportation, consumption, and service. The quantity of data collected continues to expand rapidly, especially because of the increasing ease, accessibility, and popularity of business conducted on the internet, or e-commerce.
Today, multiple stores also have websites where users can make purchases online. Some
businesses, including Amazon.com (www.amazon.com), exist solely online, without any
brick-and-mortar (i.e., physical) store areas. Retail data support a rich source for data mining.
Retail data mining can help identify user buying behaviors, find user shopping patterns and
trends, enhance the quality of user service, achieve better user retention and satisfaction,
increase goods consumption ratios, design more effective goods transportation and
distribution policies, and decrease the cost of business.
There are a few examples of data mining in the retail industry are as follows −
Design and construction of data warehouses based on the benefits of data mining − Because retail data cover a broad spectrum (such as sales, customers, employees, goods transportation, consumption, and services), there can be several methods to design a data warehouse for this market.
The levels of detail to contain can also vary substantially. The results of preliminary data mining exercises can be used to help guide the design and development of the data warehouse architecture. This involves deciding which dimensions and levels to include and what pre-processing to implement in order to facilitate effective data mining.
Multidimensional analysis of sales, customers, products, time, and region − The retail market needs timely data regarding customer requirements, product sales, trends, and fashions, and the quality, cost, profit, and service of commodities. It is essential to provide dynamic multidimensional analysis and visualization tools, such as the construction of sophisticated data cubes according to the requirements of data analysis.
Analysis of the effectiveness of sales campaigns − The retail market conducts sales
campaigns using advertisements, coupons, and several types of discounts and bonuses to
promote products and attract users. Careful analysis of the efficiency of sales campaigns can help improve company profits.
Multidimensional analysis can be used for these goals by comparing the number of sales and
the multiple transactions including the sales items during the sales period versus those
including the same items before or after the sales campaign.
TELECOMMUNICATIONS INDUSTRY
The telecommunications industry is expanding and growing at a fast pace, especially with the advent of the internet. Data mining can enable key industry players to improve their service quality to stay ahead in the game. Pattern analysis of spatiotemporal databases can play a huge role in mobile telecommunication, mobile computing, and also web and information services. Techniques like outlier analysis can detect fraudulent users, and OLAP and visualization tools can help compare information such as user group behaviour, profit, data traffic, and system overloads.

Data Mining for Retail and Telecommunication Industries


The retail industry is a well-fit application area for data mining, since it collects huge
amounts of data on sales, customer shopping history, goods transportation, consumption, and
service. The quantity of data collected continues to expand rapidly, especially due to the
increasing availability, ease, and popularity of business conducted on the Web, or e-commerce.
Today, most major chain stores also have web sites where customers can make purchases
online. Some businesses, such as Amazon.com (www.amazon.com), exist solely online, without
any brick-and-mortar (i.e., physical) store locations. Retail data provide a rich source for data
mining.
Retail data mining can help identify customer buying behaviors, discover customer shopping
patterns and trends, improve the quality of customer service, achieve better customer retention
and satisfaction, enhance goods consumption ratios, design more effective goods
transportation and distribution policies, and reduce the cost of business.
A few examples of data mining in the retail industry are outlined as follows:

Design and construction of data warehouses: Because retail data cover a wide spectrum
(including sales, customers, employees, goods transportation, consumption,
and services), there can be many ways to design a data warehouse for this industry. The
levels of detail to include can vary substantially. The outcome of preliminary data
mining exercises can be used to help guide the design and development of data warehouse
structures. This involves deciding which dimensions and levels to include and what
preprocessing to perform to facilitate effective data mining.

Multidimensional analysis of sales, customers, products, time, and region: The retail industry requires timely information regarding customer needs, product sales, trends,
and fashions, as well as the quality, cost, profit, and service of commodities. It is
therefore important to provide powerful multidimensional analysis and visualization
tools, including the construction of sophisticated data cubes according to the needs of
data analysis. The advanced data cube structures introduced in Chapter 5 are useful in
retail data analysis because they facilitate analysis on multidimensional aggregates with
complex conditions.
Analysis of the effectiveness of sales campaigns: The retail industry conducts sales
campaigns using advertisements, coupons, and various kinds of discounts and bonuses
to promote products and attract customers. Careful analysis of the effectiveness of
sales campaigns can help improve company profits. Multidimensional analysis can be
used for this purpose by comparing the amount of sales and the number of transactions
containing the sales items during the sales period versus those containing the same items
before or after the sales campaign. Moreover, association analysis may disclose which
items are likely to be purchased together with the items on sale, especially in comparison
with the sales before or after the campaign.
Customer retention—analysis of customer loyalty: We can use customer loyalty
card information to register sequences of purchases of particular customers. Customer
loyalty and purchase trends can be analyzed systematically. Goods purchased at different
periods by the same customers can be grouped into sequences. Sequential pattern mining
can then be used to investigate changes in customer consumption or loyalty and suggest
adjustments on the pricing and variety of goods to help retain customers and attract
new ones.
Product recommendation and cross-referencing of items: By mining associations
from sales records, we may discover that a customer who buys a digital camera is likely
to buy another set of items. Such information can be used to form product
recommendations. Collaborative recommender systems (Section 13.3.5) use data mining
techniques to make personalized product recommendations during live customer
transactions, based on the opinions of other customers. Product recommendations can
also be advertised on sales receipts, in weekly flyers, or on the Web to help improve
customer service, aid customers in selecting items, and increase sales. Similarly,
information, such as “hot items this week” or attractive deals, can be displayed together
with the associative information to promote sales.
Fraudulent analysis and the identification of unusual patterns: Fraudulent activity
costs the retail industry millions of dollars per year. It is important to (1) identify
potentially fraudulent users and their atypical usage patterns; (2) detect attempts to
gain fraudulent entry or unauthorized access to individual and organizational accounts; and (3) discover unusual patterns that may need special attention. Many of these
patterns can be discovered by multidimensional analysis, cluster analysis, and outlier analysis.
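As one way to carry out the association mining behind the product-recommendation example above, the hedged Python sketch below computes support and confidence for item pairs by hand. The transactions and the thresholds (support 0.4, confidence 0.6) are invented for illustration; libraries such as mlxtend offer full Apriori implementations.

# Hedged sketch: support and confidence of item-pair rules from sales transactions.
from itertools import combinations

transactions = [
    ["digital camera", "memory card", "tripod"],
    ["digital camera", "memory card"],
    ["digital camera", "camera bag"],
    ["memory card", "tripod"],
    ["digital camera", "memory card", "camera bag"],
]

n = len(transactions)
item_counts = {}
pair_counts = {}
for basket in transactions:
    items = set(basket)
    for item in items:
        item_counts[item] = item_counts.get(item, 0) + 1
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

# A rule such as {digital camera} -> {memory card} can drive recommendations
for (a, b), together in pair_counts.items():
    support = together / n
    for lhs, rhs in ((a, b), (b, a)):
        confidence = together / item_counts[lhs]
        if support >= 0.4 and confidence >= 0.6:
            print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")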

As another industry that handles huge amounts of data, the telecommunication industry
has quickly evolved from offering local and long-distance telephone services to providing
many other comprehensive communication services. These include cellular phone, smart
phone, Internet access, email, text messages, images, computer and web data transmissions, and
other data traffic. The integration of telecommunication, computer network, Internet, and
numerous other means of communication and computing has been under way, changing the face
of telecommunications and computing. This has created a great demand for data mining to help
understand business dynamics, identify telecommunication patterns, catch fraudulent activities,
make better use of resources, and improve service quality.
Data mining tasks in telecommunications share many similarities with those in the retail
industry. Common tasks include constructing large-scale data warehouses, performing
multidimensional visualization, OLAP, and in-depth analysis of trends, customer patterns,
and sequential patterns. Such tasks contribute to business improvements, cost reduction,
customer retention, fraud analysis, and sharpening the edges of competition. There are many
data mining tasks for which customized data mining tools for telecommunication have been
flourishing and are expected to play increasingly important roles in business.
Data mining has been popularly used in many other industries, such as insurance,
manufacturing, and health care, as well as for the analysis of governmental and institutional
administration data. Although each industry has its own characteristic data sets and
application demands, they share many common principles and methodologies. Therefore,
through effective mining in one industry, we may gain experience and methodologies that can
be transferred to other industrial applications.
banking & finance and CRM

Financial Analysis
The banking and finance industry relies on high-quality, reliable data. In loan markets,
financial and user data can be used for a variety of purposes, like predicting loan payments and
determining credit ratings. And data mining methods make such tasks more manageable.
Classification techniques facilitate the separation of crucial factors that influence customers’
banking decisions from the irrelevant ones. Further, multidimensional clustering techniques
allow the identification of customers with similar loan payment behaviours. Data analysis and
mining can also help detect money laundering and other financial crimes.
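A hedged Python sketch of a classification model for a simple credit decision (one of the loan-market uses mentioned above) follows. The features (income, existing debt, years employed), the labels and the tree depth are illustrative assumptions, not real lending criteria.

# Hedged sketch: a decision-tree classifier for a toy credit-approval decision.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [65000,  5000, 10],   # income, existing debt, years employed
    [30000, 20000,  1],
    [80000,  2000,  8],
    [25000, 15000,  2],
    [55000,  8000,  6],
    [28000, 18000,  1],
]
y = ["approve", "reject", "approve", "reject", "approve", "reject"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "debt", "years_employed"]))
print(tree.predict([[40000, 12000, 3]]))   # score a new applicant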

Role of Data Mining in CRM


Although it’s still a new technology, businesses from many industries have invested in it to
make the most of their data. Data mining techniques in CRM assist your business in finding
and selecting relevant information. This can then be used to get a clear view of the customer
life-cycle. The life-cycle includes customer identification, attraction, retention, and
development. The more data in the database, the more accurate the models created will be and
hence more value gained.
Data mining usually involves the use of predictive modeling, forecasting, and descriptive
modeling techniques as its key elements. CRM in the age of data analytics enables an
organization to engage in many useful activities. You can manage customer retention, choose the right segments, set optimal pricing policies, and rank suppliers according to your needs.
Applications of Data Mining in CRM
Basket Analysis
Find out which items customers tend to purchase together. This knowledge can improve
stocking, store layout strategies, and promotions.
Sales Forecasting
Examining time-based patterns helps businesses make re-stocking decisions. Furthermore, it
helps you in supply chain management, financial management and gives complete control over
internal operations.
Database Marketing
Retailers can design profiles of customers based on demographics, tastes, preferences, and
buying behavior. It will also aid the marketing team in designing the right marketing
campaigns and promotional offers. This will result in enhanced productivity, optimal
allocation of resources, and desirable ROI.
Predictive Life-Cycle Management
Data mining helps an organization predict each customer’s lifetime value and service each
segment properly.
Market Segmentation
Learn which customers are interested in purchasing your products. Design your marketing
campaigns and promotions keeping their tastes and preferences in mind. This will increase
efficiency and result in the desired ROI since you won’t be targeting customers who are not
interested in your product.
Product Customization
Manufacturers can customize products according to the exact needs of customers. To do this,
they must be able to predict which features should be bundled to meet customer demand.
Fraud Detection
By analyzing past transactions that turned out to be fraudulent, you can take precautions to
stop that from happening again. Banks and other financial institutions will benefit from this
feature immensely, by reducing the number of bad debts.
Warranties
Manufacturers need to predict the number of customers who will make warranty claims and
the average cost of those claims. This will ensure the best management of company funds.
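As a tiny illustration of the Sales Forecasting application listed above, the hedged sketch below fits a linear trend to monthly sales with NumPy. The figures are invented, and real forecasting would typically use richer time-series models.

# Hedged sketch: projecting next month's sales from a simple linear trend.
import numpy as np

months = np.arange(1, 13)                          # Jan..Dec
sales  = np.array([100, 105, 98, 110, 120, 118,
                   125, 130, 128, 140, 138, 150])  # units sold per month

slope, intercept = np.polyfit(months, sales, deg=1)
next_month = 13
forecast = slope * next_month + intercept
print(round(forecast, 1))                          # projected sales for month 13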
Data Analytics Life Cycle:

Phase 1: Discovery
• The data science team learns and investigates the problem.
• Develop context and understanding.
• Come to know about the data sources needed and available for the project.
• The team formulates an initial hypothesis that can later be tested with data.
Phase 2: Data Preparation
• Steps to explore, preprocess, and condition data prior to modeling and analysis.
• It requires the presence of an analytic sandbox; the team executes extract, load, and transform (ELT) to get data into the sandbox.
• Data preparation tasks are likely to be performed multiple times and not in a predefined order.
• Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
Phase 3: Model Planning
• The team explores the data to learn about relationships between variables and subsequently selects key variables and the most suitable models.
• Several tools commonly used for this phase are MATLAB and STATISTICA.
Phase 4: Model Building
• The team develops datasets for training, testing, and production purposes.
• The team builds and executes models based on the work done in the model planning phase.
• The team also considers whether its existing tools will suffice for running the models or if it needs a more robust environment for executing models.
• Free or open-source tools – R and PL/R, Octave, WEKA.
• Commercial tools – MATLAB, STATISTICA.
Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
• The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account caveats and assumptions.
• The team should identify key findings, quantify business value, and develop a narrative to summarize and convey findings to stakeholders.
Phase 6: Operationalize
• The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise of users.
• This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and make adjustments before full deployment.
• The team delivers final reports, briefings, and code.
• Free or open source tools – Octave, WEKA, SQL, MADlib.
Introduction to Big data Business Analytics
• Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.
Big data analytics helps businesses to get insights from today's huge data resources. People, organizations, and machines now produce massive amounts of data. Social media, cloud applications, and machine sensor data are just some examples.

• Five Key Types of Big Data Analytics Every Business Analyst Should Know
• Prescriptive Analytics.
• Diagnostic Analytics.
• Descriptive Analytics.
• Predictive Analytics.
• Cyber Analytics.
BIG DATA vs. DATA ANALYTICS
1. Big data refers to the large volume of data, which is also increasing at a rapid speed with respect to time; data analytics refers to the process of analyzing the raw data and finding out conclusions about that information.
2. Big data includes three types of data: structured, unstructured and semi-structured; descriptive, diagnostic, predictive and prescriptive are the four basic types of data analytics.
3. The purpose of big data is to store a huge volume of data and to process it; the purpose of data analytics is to analyze the raw data and find out insights from the information.
4. Parallel computing and other complex automation tools are used to handle big data; predictive and statistical modelling with relatively simple tools is used for data analytics.
5. Big data operations are handled by big data professionals; data analytics is performed by skilled data analysts.
6. Big data analysts need knowledge of programming, NoSQL databases, distributed systems and frameworks; data analysts need knowledge of programming, statistics, and mathematics.
7. Big data is mainly found in financial services, media and entertainment, communication, banking, information technology, retail, etc.; data analytics is mainly used in business for risk detection and management, science, travelling, health care, gaming, energy management, and information technology.
8. Big data supports dealing with huge volumes of data; data analytics supports examining raw data and recognizing useful information.
9. Big data is considered the first step, as the big data is first generated and then stored; data analytics is considered the second step, as it performs analysis on the large data sets.
10. Some of the big data tools are Apache Hadoop, Cloudera Distribution for Hadoop, Cassandra, MongoDB, etc.; some of the data analytics tools are Tableau Public, Python, Apache Spark, Excel, RapidMiner, KNIME, etc.

State of the practice in analytics, the role of data scientists, and key roles for a successful analytics project
Roles & Responsibilities of a Data Scientist
• Management: The Data Scientist plays an insignificant managerial role where he
supports the construction of the base of futuristic and technical abilities within the Data
and Analytics field in order to assist various planned and continuing data analytics
projects.
• Analytics: The Data Scientist represents a scientific role where he plans, implements,
and assesses high-level statistical models and strategies for application in the business’s
most complex issues. The Data Scientist develops econometric and statistical models
for various problems including projections, classification, clustering, pattern analysis,
sampling, simulations, and so forth.
• Strategy/Design: The Data Scientist performs a vital role in the advancement of
innovative strategies to understand the business’s consumer trends and management as
well as ways to solve difficult business problems, for instance, the optimization of
product fulfillment and overall profit.
• Collaboration: The role of the Data Scientist is not a solitary role and in this position,
he collaborates with superior data scientists to communicate obstacles and findings to
relevant stakeholders in an effort to drive business performance and enhance decision-making.
• Knowledge: The Data Scientist also takes leadership to explore different technologies
and tools with the vision of creating innovative data-driven insights for the business at
the most agile pace feasible. In this situation, the Data Scientist also uses initiative in
assessing and utilizing new and enhanced data science methods for the business, which
he delivers to senior management for approval.
• Other Duties: A Data Scientist also performs related duties and tasks as assigned by the Senior Data Scientist, Head of Data Science, Chief Data Officer, or the Employer.
Data Scientist vs. Data Analyst vs. Data Engineer

Data Scientist:
• Focus: the futuristic display of data.
• Work: presents both supervised and unsupervised learning of data, for example regression and classification of data, neural networks, etc.
• Skills required: Python, R, SQL, Pig, SAS, Apache Hadoop, Java, Perl, Spark.

Data Analyst:
• Focus: optimization of scenarios, for example how an employee can enhance the company's product growth.
• Work: formation and cleaning of raw data, interpretation and visualization of data to perform the analysis and to produce a technical summary of the data.
• Skills required: Python, R, SQL, SAS.

Data Engineer:
• Focus: optimization techniques and the construction of data in a conventional manner; the purpose of a data engineer is continuously advancing data consumption.
• Work: formation and cleaning of raw data, interpretation and visualization of data to perform the analysis and to produce a technical summary of the data.
• Skills required: MapReduce, Hive, Pig, and Hadoop techniques.

Key Roles for a Data analytics project :


1. Business User :
• The business user is the one who understands the main area of the project and is
also basically benefited from the results.
• This user gives advice and consults the team working on the project about the value of the results obtained and how the operations on the outputs are done.
• The business manager, line manager, or deep subject matter expert in the project domain typically fulfills this role.
2. Project Sponsor :
• The Project Sponsor is the one who is responsible to initiate the project. Project
Sponsor provides the actual requirements for the project and presents the basic
business issue.
• He generally provides the funds and measures the degree of value from the final output of the team working on the project.
• This person introduces the prime concern and shapes the desired output.
3. Project Manager :
• This person ensures that key milestone and purpose of the project is met on time
and of the expected quality.
4. Business Intelligence Analyst :
• The Business Intelligence Analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
• This person generally creates dashboards and reports and knows about the data feeds and sources.
5. Database Administrator (DBA) :
• The DBA facilitates and arranges the database environment to support the analytics needs of the team working on the project.
• His responsibilities may include providing permissions to key databases or tables and making sure that the appropriate security levels are in place for the data repositories.
6. Data Engineer :
• The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
• The data engineer works jointly with the data scientist to help shape the data in the correct ways for analysis.
7. Data Scientist :
• The data scientist provides subject matter expertise for analytical techniques, data modelling, and applying the correct analytical techniques to given business issues.
• He ensures overall analytical objectives are met.
• Data scientists outline and apply analytical methods and approaches to the data available for the concerned project.
Main phases of life cycle of big data
This section discusses the life cycle phases of Big Data Analytics. It differs from traditional data analysis mainly due to the fact that in big data, volume, variety, and velocity form the basis of the data.
The Big Data Analytics Life cycle is divided into nine phases, named as :
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and filtration
4. Data Extraction
5. Data Munging(Validation and Cleaning)
6. Data Aggregation & Representation(Storage)
7. Exploratory Data Analysis
8. Data Visualization(Preparation for Modeling and Assessment)
9. Utilization of analysis results.
Let us discuss each phase :
Phase I Business Problem Definition –
In this stage, the team learns about the business domain, which presents the motivation and
goals for carrying out the analysis. In this stage, the problem is identified, and assumptions are made about how much potential gain the company will make after carrying out the analysis. Important activities in this step include framing the business problem as an
analytics challenge that can be addressed in subsequent phases. It helps the decision-
makers understand the business resources that will be required to be utilized thereby
determining the underlying budget required to carry out the project.
Moreover, it can be determined, whether the problem identified, is a Big Data problem or
not, based on the business requirements in the business case. To qualify as a big data
problem, the business case should be directly related to one(or more) of the characteristics
of volume, velocity, or variety.

Phase II Data Identification –


Once the business case is identified, now it’s time to find the appropriate datasets to work
with. In this stage, analysis is done to see what other companies have done for a similar
case.
Depending on the business case and the scope of analysis of the project being addressed,
the sources of datasets can be either external or internal to the company. In the case of
internal datasets, the datasets can include data collected from internal sources, such as feedback forms or existing software. On the other hand, for external datasets, the list includes datasets from third-party providers.

Phase III Data Acquisition and filtration –


Once the source of data is identified, now it is time to gather the data from such sources.
This kind of data is mostly unstructured. Then it is subjected to filtration, such as removal
of the corrupt data or irrelevant data, which is of no scope to the analysis objective. Here
corrupt data means data that may have missing records, or the ones, which include
incompatible data types.
After filtration, a copy of the filtered data is stored and compressed, as it can be of use in
the future, for some other analysis.

Phase IV Data Extraction –


Now the data is filtered, but there might be a possibility that some of the entries of the data are incompatible. To rectify this issue, a separate phase is created, known as the data extraction phase. In this phase, the data which don't match the underlying scope of the analysis are extracted and transformed into a compatible form.

Phase V Data Munging –


As mentioned in phase III, the data is collected from various sources, which results in the
data being unstructured. There might be a possibility that the data has constraints that are unsuitable, which can lead to false results. Hence there is a need to clean and validate the data.
It includes removing any invalid data and establishing complex validation rules. There are
many ways to validate and clean the data. For example, a dataset might contain few rows,
with null entries. If a similar dataset is present, then those entries are copied from that
dataset, else those rows are dropped.
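A hedged Python/pandas sketch of the cleaning rule described above (fill missing entries from a similar dataset where possible, otherwise drop the row) is shown below; the column names and values are invented.

# Hedged sketch: validating and cleaning rows with missing values.
import pandas as pd
import numpy as np

primary = pd.DataFrame({
    "roll_no": [1, 2, 3, 4],
    "marks":   [78, np.nan, 64, np.nan],
})
similar = pd.DataFrame({          # a second source that may hold the missing values
    "roll_no": [2, 4],
    "marks":   [55, 70],
})

# Fill missing marks from the similar dataset where possible...
lookup = similar.set_index("roll_no")["marks"]
primary["marks"] = primary["marks"].fillna(primary["roll_no"].map(lookup))
# ...then drop any rows that are still incomplete
cleaned = primary.dropna(subset=["marks"])
print(cleaned)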

Phase VI Data Aggregation & Representation –


The data is cleansed and validated against certain rules set by the enterprise. But the data
might be spread across multiple datasets, and it is not advisable to work with multiple
datasets. Hence, the datasets are joined together. For example: If there are two datasets,
namely that of a Student Academic section and Student Personal Details section, then both
can be joined together via common fields, i.e. roll number.
This phase calls for intensive operations since the amount of data can be very large. Automation can be brought in so that these steps are executed without any human intervention.
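A hedged pandas sketch of the student example above, joining two datasets on the common roll-number field, is given below; the data is invented.

# Hedged sketch: aggregating two datasets on a common key with pandas.
import pandas as pd

academic = pd.DataFrame({
    "roll_no": [1, 2, 3],
    "grade":   ["A", "B", "A"],
})
personal = pd.DataFrame({
    "roll_no": [1, 2, 3],
    "city":    ["Kochi", "Chennai", "Pune"],
})

# Inner join on the common key produces one unified dataset for analysis
students = academic.merge(personal, on="roll_no", how="inner")
print(students)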

Phase VII Exploratory Data Analysis –


Here comes the actual step, the analysis task. Depending on the nature of the big data
problem, analysis is carried out. Data analysis can be classified as confirmatory analysis and exploratory analysis. In confirmatory analysis, an assumption about the cause of a phenomenon is made in advance; this assumption is called the hypothesis. The data is then analyzed to confirm or refute the hypothesis.
This kind of analysis provides definitive answers to specific questions and confirms whether an assumption was true or not. In exploratory analysis, the data is explored to obtain information about why a phenomenon occurred. This type of analysis answers "why" a phenomenon occurred. It doesn't provide definitive answers, but it supports the discovery of patterns.
Phase VIII Data Visualization –
Now we have the answers to some questions, using the information from the data in the datasets. But these answers are still in a form that can't be presented to business users. A form of representation is required to obtain value or some conclusion from the analysis. Hence, various tools are used to visualize the data in graphic form, which can easily be interpreted by business users.
Visualization is said to influence the interpretation of the results. Moreover, it allows the users to discover answers to questions that are yet to be formulated.
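A hedged Python sketch using matplotlib shows how an analysis result can be turned into a chart business users can read; the segment names and revenue figures are invented placeholders.

# Hedged sketch: presenting an analysis result as a simple bar chart.
import matplotlib.pyplot as plt

segments = ["Young frequent", "Affluent infrequent", "Mid-range"]
revenue  = [120000, 340000, 210000]

plt.bar(segments, revenue)
plt.ylabel("Revenue")
plt.title("Revenue by customer segment")
plt.tight_layout()
plt.show()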

Phase IX Utilization of analysis results –


The analysis is done and the results are visualized; now it's time for the business users to make decisions to utilize the results. The results can be used for optimization, to refine the business process. They can also be used as an input for systems to enhance performance.


DEVELOPING CORE DELIVERABLES FOR STAKEHOLDERS.


• Machine learning implementation − This could be a classification algorithm, a
regression model or a segmentation model.
• Recommender system − The objective is to develop a system that recommends
choices based on user behavior. Netflix is the characteristic example of this data
product, where based on the ratings of users, other movies are recommended.
• Dashboard − Business normally needs tools to visualize aggregated data. A dashboard
is a graphical mechanism to make this data accessible.
• Ad-Hoc analysis − Normally business areas have questions, hypotheses or myths that can be answered by doing ad-hoc analysis with data.
