Data Mining
The amount of data stored in industry databases is growing exponentially. Raw data by itself does not provide much useful information, so it needs to be converted into meaningful information for businesses and their customers.
General methods of analysis/reporting can be classified into two categories:
1. Non-parametric analysis 2. Parametric analysis
1. Non-Parametric analysis
It includes information that has not been, or cannot be, rigorously processed or analysed. It is mostly used by managers, as it usually requires no special technical know-how to interpret. Financial data mostly falls into this category.
2. Parametric analysis
It includes very detailed information about the behavior of the product, based on the process used to gather the data. Engineers mostly use this kind of analysis.
Business Intelligence in today’s perspective
“Set of methodologies, processes, architectures, and technologies that transform raw data
into meaningful and useful information that allows business users to make informed
business decisions with real-time data that can put a company ahead of its competitors”.
Raw data to valuable information
Lifecycle of Data
Business Understanding: Understanding every aspect of the problem and working accordingly. It is the most important step in the life cycle, since how the whole cycle works depends on the knowledge gained at this stage by going through the various data sets and cases.
Data Selection: Choosing the best data set from which we can extract the data that will be most beneficial. The basic function of this phase of the data cycle is to choose data that will make the system more efficient and can serve every case accordingly.
Data Preparation: This process includes preparing the extracted data for use in the further processes. The data selected in the previous phase is not necessarily ready to use; some data needs to be processed before it can be used. So, in the data preparation phase, we transform the data into a form compatible with the later stages.
Modeling: It is the process of modeling the given data according to the requirements of the user. After proper understanding and cleaning of the data, a suitable model is selected. Selecting a model depends entirely on the type of data that has been extracted.
Evaluation: It includes going through every aspect of the process to check for possible faults or data leakage. It is one of the most necessary processes in data mining, and fault analysis is its basic function. Every condition is checked to see whether it fulfils the required criteria; if not, the process is recycled and processing starts again.
Deployment: After going through evaluation, once everything is checked, the data is ready to be deployed and can be used in further processes.
BI has a direct impact on an organization's strategic, tactical, and operational business decisions. BI
supports fact-based decision making using historical data rather than assumptions and gut feeling.
BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts
to provide users with detailed intelligence about the nature of the business.
Why is BI important?
With BI systems organizations can identify market trends and spot business problems that need to
be addressed.
BI helps with data visualization, which enhances data quality and thereby the quality of decision making.
BI systems can be used not just by large enterprises but also by SMEs (Small and Medium Enterprises).
KEY FEATURES
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time-variant: Data are stored to provide information from an historic perspective (e.g., the
past 5–10 years). Every key structure in the data warehouse contains, either implicitly or
explicitly, a time element.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
The Building Blocks: Defining Features - Data warehouses and data marts
Types of Data warehouses
• Enterprise Data Warehouse
• Operational Data Store
• Data Mart
Enterprise Data Warehouse:
• Enterprise Data Warehouse is a centralized warehouse, which provides decision support
service across the enterprise.
• It offers a unified approach to organizing and representing data.
• It also provides the ability to classify data according to the subject and give access
according to those divisions.
Operational Data Store:
• An Operational Data Store, also called an ODS, is a data store required when neither the data warehouse nor the OLTP systems support an organization's reporting needs.
• It is widely preferred for routine activities like storing records.
• In an ODS, the data is refreshed in real time.
Data Mart:
• A Data Mart is a subset of the data warehouse.
• It is specially designed for specific business segments, such as sales or finance.
• In an independent data mart, data is collected directly from the sources.
• An independent data mart is created without the use of a central data warehouse.
• This could be desirable for smaller groups within an organization.
Hybrid data mart:
• A hybrid data mart allows you to combine input from sources other than a data
warehouse.
• This could be useful for many situations, especially when you need ad hoc
integration, such as after a new group or product is added to the organization.
1. Production Data
(financial systems, manufacturing systems, systems along the supply chain, and customer
relationship management systems)
• In operational systems, information queries are narrow
• The queries are all predictable (e.g., the name and address of a single customer, or the orders placed by a single customer in a single week).
• The significant and disturbing characteristic of production data is disparity.
• A great challenge is to standardize and transform the disparate data from the various
production systems, convert the data, and integrate the pieces into useful data for storage
in the data warehouse.
• Integration of these various sources that provide the value to the data in the data
warehouse
2. Internal Data
• users keep their “private” spreadsheets, documents, customer profiles, and sometimes
even departmental databases.
• Profiles of individual customers become very important for consideration
• The IT department must work with the user departments to gather the internal data
• Internal data adds additional complexity to the process of transforming and integrating
the data before it can be stored in the data warehouse.
• Determine strategies for collecting data from spreadsheets, find ways of taking data from
textual documents, and tie into departmental databases to gather pertinent data from
those sources
Archived Data
• Different methods of archiving exist.
• There are staged archival methods.
• At the first stage, recent data is archived to a separate archival database that may still be
online.
• At the second stage, the older data is archived to flat files on disk storage.
• At the next stage, the oldest data is archived to tape cartridges or microfilm and even
kept off-site.
• A data warehouse keeps historical snapshots of data
• Depending on your data warehouse requirements, you have to include sufficient
historical data. This type of data is useful for discerning patterns and analyzing trends.
External Data
• Data warehouse of a car rental company contains data on the current production
schedules of the leading automobile manufacturers. This external data in the data
warehouse helps the car rental company plan for its fleet management.
• In order to spot industry trends and compare performance against other organizations,
you need data from external sources
• Data from outside sources do not conform to your formats.
• Devise ways to convert data into your internal formats and data types.
• Organize the data transmissions from the external sources. Some sources may provide
information at regular, stipulated intervals. Others may give you the data on request. We
need to accommodate the variations.
Data staging
• Data staging provides a place and an area with a set of functions to clean, change,
combine, convert, deduplicate, and prepare source data for storage and use in the data
warehouse.
• Why do you need a separate place or component to perform the data preparation?
• Can you not move the data from the various sources into the data warehouse storage
itself and then prepare the data?
Data Extraction:
• Tools/ in-house programs
• Data warehouse implementation teams extract the source data into a separate physical environment, from which moving the data into the data warehouse is easier.
• Extract the source data into a group of flat files, or a data-staging relational database, or
a combination of both.
Data Transformation
• Clean the data extracted from each source.
✓ Cleaning may just be correction of misspellings, or may include resolution of conflicts
between state codes and zip codes in the source data, or may deal with providing default
values for missing data elements, or elimination of duplicates when you bring in the
same data from multiple source systems.
• Standardization of data elements forms a large part of data transformation.
✓ standardize the data types and field lengths for same data elements retrieved from the
various sources.
✓ Semantic standardization is another major task.
✓ Resolve synonyms and homonyms. When two or more terms from different source
systems mean the same thing, you resolve the synonyms.
✓ When a single term means many different things in different source systems, you resolve
the homonym.
Data transformation involves many forms of combining pieces of data from the different
sources.
• Combine data from a single source record or related data elements from many source
records.
• Involves purging source data that is not useful and separating out source records into
new combinations.
• Sorting and merging of data takes place on a large scale in the data staging area.
• Keys chosen for the operational systems are field values with built-in meanings
Eg: product key
• Data transformation also includes the assignment of surrogate keys derived from the
source system primary keys.
• Data transformation function would include appropriate summarization
When the data transformation function ends, we have a collection of integrated data that is
cleaned, standardized, and summarized.
Data is ready to load into each data set in your data warehouse.
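As a rough illustration of the transformation steps described above, the sketch below cleans, standardizes, deduplicates, assigns surrogate keys to, and summarizes two invented source extracts using pandas; every table name, column name, and value is made up for the example, not taken from any real warehouse.

```python
import pandas as pd

# Two illustrative source extracts with inconsistent conventions (invented data).
orders_a = pd.DataFrame({"cust_id": [101, 102, 102], "state": ["IL", "il ", "IL"],
                         "amount": [250.0, None, 90.0]})
orders_b = pd.DataFrame({"cust_id": [103, 101], "state": ["Illinois", "IL"],
                         "amount": [40.0, 250.0]})

# Standardize encodings and supply default values for missing data elements.
state_codes = {"IL": "IL", "ILLINOIS": "IL"}
for df in (orders_a, orders_b):
    df["state"] = df["state"].str.strip().str.upper().map(state_codes)
    df["amount"] = df["amount"].fillna(0.0)          # default value for missing amounts

# Combine the sources, eliminate duplicates, and assign a surrogate key
# (not the key built into the source systems).
combined = pd.concat([orders_a, orders_b], ignore_index=True).drop_duplicates()
combined["order_key"] = range(1, len(combined) + 1)

# Appropriate summarization before loading into the warehouse.
summary = combined.groupby("state", as_index=False)["amount"].sum()
print(summary)
```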
Metadata Component
• Similar to the data dictionary /data catalog in a database management system
Information about the logical data structures, the information about the files and addresses, the
information about the indexes, and so on.
Contains data about the data in the database.
Management and Control Component
• Coordinates the services and activities within the data warehouse.
• Control the data transformation and the data transfer into the data warehouse storage.
• Moderates the information delivery to the users.
• Works with the database management systems and enables data to be properly stored in
the repositories.
• Monitors the movement of data into the staging area and from there into the data
warehouse storage itself.
• Interacts with the metadata component to perform the management and control
functions.
• Since the metadata component contains information about the data warehouse itself, the metadata is the source of information for the management module.
• It serves as a Single Source of Truth for all the data within the company. Using a Data
Warehouse eliminates the following issues:
o Data quality issues
o Unstable data in reports
o Data Inconsistency
o Low query performance
• Data Warehouse gives the ability to quickly run analysis on huge volumes of datasets.
• If there is any change in the structure of the data available in the operational or transactional databases, it will not break the business reports running on top of the warehouse, because those databases are not directly connected to the BI or reporting tools.
• Cloud Data Warehouses (such as Amazon Redshift and Google BigQuery) offer the added advantage that you need not invest in them upfront; instead, you pay as you go as the size of your data increases.
• When companies want to make the data available for all, they will understand the need
for Data Warehouse. You can expose the data within the company for analysis. While
you do so you can hide certain sensitive information (such as PII – Personally
Identifiable Information about your customers, or Partners).
• There is always a need for a Data Warehouse as the complexity of queries increases and users need faster query processing, because transactional databases store data in a normalized form, whereas fast query processing is achieved with the denormalized data available in a Data Warehouse.
ARCHITECTURAL TYPES
Hub-and-Spoke
• This is the Inmon Corporate Information Factory approach.
• Similar to the centralized data warehouse architecture, here too is an overall enterprise-
wide data warehouse.
• Atomic data in the third normal form is stored in the centralized data warehouse.
• The major and useful difference is the presence of dependent data marts in this
architectural type.
• Dependent data marts obtain data from the centralized data warehouse.
• The centralized data warehouse forms the hub to feed data to the data marts on the
spokes.
• The dependent data marts may be developed for a variety of purposes.
• Each dependent data mart may have normalized, denormalized, summarized, or dimensional data structures based on individual requirements.
• Most queries are directed to the dependent data marts although the centralized data
warehouse may itself be used for querying.
• This architectural type results from adopting a top-down approach to data warehouse
development
Data-Mart Bus
• This is the Kimball conformed supermarts approach.
• Start by analyzing requirements for a specific business subject such as orders, shipments, billings, insurance claims, car rentals, and so on.
• Build the first data mart (supermart) using business dimensions and metrics.
• These business dimensions will be shared in the future data marts.
• The principal notion is that by conforming dimensions among the various data marts, the
result would be logically integrated supermarts that will provide an enterprise view of
the data.
• The data marts contain atomic data organized as a dimensional data model.
• This architectural type results from adopting an enhanced bottom-up approach to data
warehouse development.
Business Intelligence collects data from the data warehouse for analysis, whereas a data warehouse collects data from various disparate sources and organises it for efficient BI analysis.
OLAP models - ROLAP versus MOLAP - defining schemas: Stars, snowflakes and
fact constellations
OLAP databases contain two basic types of data: measures, which are numeric data,
the quantities and averages that you use to make informed business decisions, and
dimensions, which are the categories that you use to organize these measures.
OLAP databases help organize data by many levels of detail.
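The minimal sketch below illustrates this distinction on an invented sales table (all column names and numbers are made up): region and quarter act as dimensions that organize the numeric measures dollars_sold and units_sold at a chosen level of detail.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":       ["East", "East", "West", "West"],   # dimension
    "quarter":      ["Q1", "Q2", "Q1", "Q2"],           # dimension
    "dollars_sold": [1200, 1500, 900, 1100],            # measure
    "units_sold":   [30, 40, 25, 28],                   # measure
})

# Organize the measures by the dimensions at the region x quarter level of detail.
cube = sales.pivot_table(index="region", columns="quarter",
                         values=["dollars_sold", "units_sold"], aggfunc="sum")
print(cube)
```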
Types of OLAP Servers
We have four types of OLAP servers:
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers
Relational OLAP (ROLAP)
Relational On-Line Analytical Processing (ROLAP) works mainly on data that resides in a
relational database, where the base data and dimension tables are stored as relational tables.
ROLAP servers are placed between the relational back-end server and client front-end tools.
ROLAP servers use RDBMS to store and manage warehouse data, and OLAP middleware to
support missing pieces.
Advantages of ROLAP
Disadvantages of ROLAP
Advantages of HOLAP
Disadvantages of HOLAP
HOLAP architecture is very complex because it supports both MOLAP and ROLAP servers.
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional
Data Models
The entity-relationship data model is commonly used in the design of relational databases,
where a database schema consists of a set of entities and the relationships between them. Such
a data model is appropriate for online transaction processing. A data warehouse, however,
requires a concise, subject-oriented schema that facilitates online data analysis.
The most popular data model for a data warehouse is a multidimensional model, which can
exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let’s look
at each of these.
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
Sales are considered along four dimensions: time, item, branch, and location. The schema
contains a central fact table for sales that contains keys to each of the four dimensions, along
with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension
identifiers (e.g., time key and item key) are system-generated identifiers. In the star schema, each
dimension is represented by only one table, and each table contains a set of attributes. For
example, the location dimension table contains the attribute set {location key, street, city,
province or state, country}. This constraint may introduce some redundancy. For example,
“Urbana” and “Chicago” are both cities in the state of Illinois, USA. Entries for such cities in
the location dimension table will create redundancy among the attributes province or state and
country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes
within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
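As a small sketch of how such a star schema is queried, the pandas stand-ins below mimic a sales fact table joined to its item and location dimension tables; all keys and values are illustrative, not taken from any real schema.

```python
import pandas as pd

# Fact table: foreign keys to each dimension plus the measures.
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2], "item_key": [10, 11, 10], "location_key": [100, 101, 100],
    "dollars_sold": [500.0, 120.0, 640.0], "units_sold": [5, 2, 8],
})

# Dimension tables: one table per dimension, each with its own attribute set.
item_dim = pd.DataFrame({"item_key": [10, 11],
                         "item_name": ["laptop", "printer"], "brand": ["A", "B"]})
location_dim = pd.DataFrame({"location_key": [100, 101],
                             "city": ["Urbana", "Chicago"],
                             "province_or_state": ["IL", "IL"],   # note the redundancy
                             "country": ["USA", "USA"]})

# A typical star-schema query: join the fact table to its dimensions, then aggregate.
result = (sales_fact
          .merge(item_dim, on="item_key")
          .merge(location_dim, on="location_key")
          .groupby(["city", "item_name"], as_index=False)["dollars_sold"].sum())
print(result)
```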
Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce redundancies. Such a
table is easy to maintain and saves storage space.
The main difference between the two schemas is in the definition of dimension tables. The
single dimension table for item in the star schema is normalized in the snowflake schema,
resulting in new item and supplier tables. For example, the item dimension table now contains
the attributes item key, item name, brand, type, and supplier key, where supplier key is
linked to the supplier dimension table, containing supplier key and supplier type information.
Similarly, the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to the city
dimension. Notice that, when desirable, further normalization can be performed on
province or state and country in the snowflake schema
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called
a galaxy schema or a fact constellation. This schema specifies two fact tables, sales and
shipping. The sales table definition is identical to that of the star schema . The shipping table
has five dimensions, or keys—item key, time key, shipper key, from location, and to location—
and two measures—dollars cost and units shipped. A fact constellation schema allows
dimension tables to be shared between fact tables. For example, the dimensions tables for time,
item, and location are shared between the sales and shipping fact tables.
Dimensions: The Role of Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts. Consider a concept hierarchy for the dimension location. City
values for location include Vancouver, Toronto, New York, and Chicago.
Each city, however, can be mapped to the province or state to which it belongs. For example,
Vancouver can be mapped to British Columbia, and Chicago to Illinois. The provinces and
states can in turn be mapped to the country (e.g., Canada or the United States) to which they
belong. These mappings form a concept hierarchy for the dimension location, mapping a set of
low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
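A minimal sketch of this location hierarchy as plain Python mappings; the roll_up helper is a hypothetical name used only to show how a low-level concept (city) maps to higher-level concepts (province/state, country).

```python
# Concept hierarchy for the location dimension: city -> province/state -> country.
city_to_state = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                 "New York": "New York", "Chicago": "Illinois"}
state_to_country = {"British Columbia": "Canada", "Ontario": "Canada",
                    "New York": "USA", "Illinois": "USA"}

def roll_up(city):
    """Map a low-level concept (city) to higher-level concepts (state, country)."""
    state = city_to_state[city]
    return state, state_to_country[state]

print(roll_up("Chicago"))    # ('Illinois', 'USA')
print(roll_up("Vancouver"))  # ('British Columbia', 'Canada')
```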
With multidimensional data stores, the storage utilization may be low if the data set is
sparse.
MODULE 2
Introduction to data mining (DM):
Motivation for Data Mining - Data Mining-Definition and Functionalities –
Classification of DM Systems - DM task primitives - Integration of a Data Mining
system with a Database or a Data Warehouse - Issues in DM – KDD Process
Data Pre-processing: Why to pre-process data? - Data cleaning: Missing Values,
Noisy Data - Data Integration and transformation - Data Reduction: Data cube
aggregation, Dimensionality reduction - Data Compression - Numerosity Reduction
- Data Mining Primitives - Languages and System Architectures: Task relevant data
- Kind of Knowledge to be mined - Discretization and Concept Hierarchy
Motivation for Data Mining
Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting through large amounts of data saved in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. It is the analysis of factual datasets to discover unsuspected relationships and to summarize the records in novel ways that are both logical and helpful to the data owner.
It is the procedure of selection, exploration, and modeling of large quantities of information to find regularities or relations that are at first unknown, in order to obtain clear and beneficial results for the owner of the database.
It is not limited to the use of computer algorithms or statistical techniques. It is a process of
business intelligence that can be used together with information technology to support company
decisions.
Data Mining is similar to Data Science. It is carried out by a person, in a particular situation, on a specific data set, with an objective. It covers several types of services, including text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is carried out through software that may be simple or highly specialized.
Data mining has attracted a great deal of attention in the information market and society as a whole in recent years, because of the wide availability of huge amounts of data and the imminent need for turning such data into beneficial information and knowledge. The information and knowledge gained can be used for applications ranging from industry analysis, fraud detection, and customer retention, to production control and science exploration.
Data mining can be considered as a result of the natural progress of data technology. The database
system market has supported an evolutionary direction in the development of the following
functionalities including data collection and database creation, data management, and advanced
data analysis.
For example, the early development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval, and for query and transaction processing. With numerous database systems offering query and transaction processing as common practice, advanced data analysis has naturally become the next target.
Data can be saved in several types of databases and data repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site to support management decision making.
Data warehouse technology involves data cleaning, data integration, and online analytical
processing (OLAP), especially, analysis techniques with functionalities including
summarization, consolidation, and aggregation, and the ability to view data from multiple angles.
What motivated data mining? Why is it important?
The major reason that data mining has attracted a great deal of attention in information
industry in recent years is due to the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and
science exploration.
The evolution of database technology
(1970s-early 1980s)
Data mining refers to extracting or "mining" knowledge from large amounts of data. There are many other terms related to data mining, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD.
Data mining is a technical methodology to detect information from huge data sets.
The main objective of data mining is to identify patterns, trends, or rules that
explain data behavior contextually. The data mining method uses mathematical
analysis to deduce patterns and trends, which were not possible through the old
methods of data exploration. Data mining is a handy and extremely convenient
methodology when it comes to dealing with huge volumes of data. Below, we explore some data mining functionalities that are used to describe the types of patterns to be found in data sets.
Data mining functionalities are used to represent the type of patterns that have to be
discovered in data mining tasks. In general, data mining tasks can be classified into two types
including descriptive and predictive. Descriptive mining tasks define the common features of
the data in the database and the predictive mining tasks act inference on the current
information to develop predictions.
There are various data mining functionalities which are as follows −
Data characterization − It is a summarization of the general characteristics of an object class
of data. The data corresponding to the user-specified class is generally collected by a database
query. The output of data characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of target class data
objects with the general characteristics of objects from one or a set of contrasting classes. The
target and contrasting classes can be represented by the user, and the equivalent data objects
fetched through database queries.
Association Analysis − It analyses the set of items that generally occur together in a
transactional dataset. There are two parameters that are used for determining the association
rules −
Support, which identifies the itemsets that occur frequently (commonly) in the database.
Confidence, which is the conditional probability that an item occurs in a transaction given that another item occurs.
Classification − Classification is the procedure of discovering a model that represents and
distinguishes data classes or concepts, for the objective of being able to use the model to
predict the class of objects whose class label is anonymous. The derived model is established
on the analysis of a set of training data (i.e., data objects whose class label is common).
Prediction − It is used to predict missing or unavailable data values or pending trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes. It can be a prediction of missing numerical values or of increasing/decreasing trends in time-related information.
Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes. It is unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. These are data objects whose behaviour differs from the general behaviour of the other data objects. The analysis of this type of data can be essential for mining knowledge.
Evolution analysis − It defines the trends for objects whose behaviour changes over some
time.
Classification of DM Systems –
DM task primitives
Data Mining Primitives:
A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during the discovery of knowledge.
The data mining task primitives include the following:
• Task-relevant data
• Kind of knowledge to be mined
• Background knowledge
• Interestingness measurement
• Presentation for visualizing the discovered patterns
Task-relevant data
This specifies the portions of the database or the set of data in which the user is interested.
This includes the database attributes or data warehouse dimensions of interest (referred to as the
relevant attributes or dimensions).
The kind of knowledge to be mined
This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process
The knowledge about the domain is useful for guiding the knowledge discovery process for
evaluating the interesting patterns. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
An example is a concept hierarchy for the attribute (or dimension) age, which groups raw age values into higher-level concepts such as youth, middle-aged, and senior. User beliefs regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation:
Different kinds of knowledge may have different interestingness measures.
For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered
uninteresting.
The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a
foundation on which user-friendly graphical interfaces can be built.
Designing a comprehensive data mining language is challenging because data mining covers a
wide spectrum of tasks, from data characterization to evolution analysis. Each task has different
requirements. The design of an effective data mining query language requires a deep
understanding of the power, limitation, and underlying mechanisms of the various kinds of data
mining tasks. This facilitates a data mining system's communication with other information
systems and integrates with the overall information processing environment.
This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (the relevant attributes
or dimensions).
In a relational database, the set of task-relevant data can be collected via a relational query
involving operations like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation. The
initial data relation can be ordered or grouped according to the conditions specified in the query.
This data retrieval can be thought of as a subtask of the data mining task.
This initial relation may or may not correspond to physical relation in the database. Since virtual
relations are called Views in the field of databases, the set of task-relevant data for data mining is
called a minable view.
This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.
This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and evaluating the patterns found. Concept hierarchies are a popular form of
background knowledge, which allows data to be mined at multiple levels of abstraction.
An example of a concept hierarchy for the attribute (or dimension) age would group raw age values into higher-level concepts such as youth, middle-aged, and senior. User beliefs regarding relationships in the data are another form of background knowledge.
Different kinds of knowledge may have different interestingness measures. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.
Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of
knowledge.
For example, generalized relations and their corresponding cross tabs or pie/bar charts are good
for presenting characteristic descriptions, whereas decision trees are common for classification.
The data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. A data mining system operates in an environment that requires it to communicate with other data systems, such as a database system. The possible integration schemes are as follows −
No coupling − No coupling defines that a data mining system will not use any function of a
database or data warehouse system. It can retrieve data from a specific source (including a file
system), process data using some data mining algorithms, and therefore save the mining results
in a different file.
Such a system, though simple, suffers from various limitations. First, a database system offers a great deal of flexibility and efficiency in storing, organizing, accessing, and processing data. Without using a database/data warehouse system, a data mining system may spend a large amount of time finding, collecting, cleaning, and transforming data.
Loose Coupling − In this scheme, the data mining system uses some services of a database or data warehouse system. The data is fetched from a data repository handled by these systems. Data mining approaches are used to process the data, and the processed data is then saved either in a file or in a designated area in a database or data warehouse. Loose coupling is better than no coupling, as it can fetch portions of the data stored in databases by using query processing or other system facilities.
Semitight Coupling − In this scheme, efficient execution of a few essential data mining primitives can be supported in the database/data warehouse system. These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of some important statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling defines that a data mining system is smoothly integrated into
the database/data warehouse system. The data mining subsystem is considered as one functional
element of an information system.
Data mining queries and functions are developed and established on mining query analysis, data
structures, indexing schemes, and query processing methods of database/data warehouse systems.
It is hugely desirable because it supports the effective implementation of data mining functions,
high system performance, and an integrated data processing environment.
Issues in DM
KDD Process
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data Mining
techniques. It is a field of interest to researchers in various fields, including artificial intelligence,
machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert
systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.
The availability and abundance of data today make knowledge discovery and Data Mining a
matter of impressive significance and need. In the recent development of the field, it isn't
surprising that a wide variety of techniques is presently accessible to specialists and experts.
The KDD Process
The knowledge discovery process is iterative and interactive and comprises nine steps. The process is iterative at each stage, implying that moving back to previous steps might be required. The process has many imaginative aspects in the sense that one cannot present a single formula or make a complete scientific categorization of the correct decisions for each step and application type. Thus, it is necessary to understand the process and the different requirements and possibilities in each stage.
The process begins with determining the KDD objectives and ends with the implementation of
the discovered knowledge. At that point, the loop is closed, and the Active Data Mining starts.
Subsequently, changes would need to be made in the application domain, for example, offering additional features to cell phone users in order to reduce churn. This closes the loop, the impacts are then measured on the new data repositories, and the KDD process starts again. The following is a concise description of the nine-step KDD process, beginning with a managerial step:
1. Building up an understanding of the application domain
This is the initial preliminary step. It sets the scene for understanding what should be done with the various decisions (transformation, algorithms, representation, etc.). The individuals who are in charge of a KDD venture need to understand and characterize the objectives of the end-user and the environment in which the knowledge discovery process will occur (this involves relevant prior knowledge).
2. Choosing and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be utilized for the knowledge discovery process should be determined. This incorporates discovering what data is accessible, obtaining important data, and afterwards integrating all the data for knowledge discovery into one data set, including the attributes that will be considered for the process. This process is important because Data Mining learns and discovers from the accessible data; this is the evidence base for building the models. If some significant attributes are missing, the entire study may be unsuccessful; from this respect, the more attributes that are considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off with the opportunity for best understanding the phenomena. This trade-off is one aspect where the interactive and iterative nature of the KDD process comes into play: it begins with the best available data sets and later expands and observes the impact in terms of knowledge discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It incorporates data cleaning, for example handling missing values and removing noise or outliers. It might involve complex statistical techniques or the use of a Data Mining algorithm in this context. For example, when one suspects that a specific attribute is of insufficient reliability or has much missing data, this attribute could become the target of a supervised Data Mining algorithm. A prediction model for this attribute is created, and the missing data can then be predicted. The extent to which one pays attention to this step depends on many factors. Regardless, studying these aspects is important and often revealing in itself regarding enterprise data systems.
4. Data Transformation
In this stage, appropriate data for Data Mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection and extraction and record sampling) as well as attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the ratio of attributes may often be the most significant factor, rather than each attribute by itself. In business, we may need to think about impacts beyond our control as well as efforts and transient issues, for example, studying the accumulated impact of advertising. However, if we do not use the right transformation at the start, we may still obtain a surprising effect that hints at the transformation required in the next iteration. Thus, the KDD process feeds back on itself and prompts an understanding of the transformation required.
6. Choosing the Data Mining algorithm
Having decided on the technique, we now decide on the strategy. This stage includes choosing a particular method to be used for searching patterns, possibly from among multiple inducers. For example, considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each technique there are several possibilities for how it can be applied. Meta-learning focuses on clarifying what causes a Data Mining algorithm to succeed or fail on a specific problem; thus, this methodology attempts to understand the conditions under which a Data Mining algorithm is most suitable. Each algorithm also has parameters and strategies of learning, such as ten-fold cross-validation or another division into training and testing sets.
7. Employing the Data Mining algorithm
At last, the implementation of the Data Mining algorithm is reached. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example, by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
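As a hedged illustration of this tuning loop, the sketch below (assuming scikit-learn is available) re-runs a decision tree with different values of the minimum-instances-per-leaf parameter and scores each run with ten-fold cross-validation; the bundled iris data is used purely as a stand-in data set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Re-run the algorithm with different control parameters (here: the minimum
# number of instances allowed in a single leaf) until results are satisfactory.
for min_leaf in (1, 5, 20):
    model = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    score = cross_val_score(model, X, y, cv=10).mean()   # ten-fold cross-validation
    print(f"min_samples_leaf={min_leaf}: mean accuracy={score:.3f}")
```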
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. We also consider the preprocessing steps with regard to their impact on the Data Mining algorithm results, for example, adding a feature in step 4 and repeating from there. This step focuses on the comprehensibility and utility of the induced model. The identified knowledge is also recorded for further use. The last step is the use of, and overall feedback on, the discovery results acquired by Data Mining.
9. Using the discovered knowledge
Now, we are prepared to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the impacts. The success of this step determines the effectiveness of the whole KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a certain static snapshot (usually a fixed set of data), but now the data becomes dynamic: data structures may change, certain quantities may become unavailable, and the data domain might be modified, such as an attribute acquiring a value that was not expected previously.
Data preprocessing is the process of transforming raw data into an understandable format. It
is also an important step in data mining as we cannot work with raw data. The quality of the
data should be checked before applying machine learning or data mining algorithms.
Preprocessing of data is mainly to check the data quality. The quality can be checked by the
following
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables). (See the sketch after this list.)
3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
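The sketch referenced in the regression item above smooths invented, noisy observations by fitting a simple linear regression and replacing each value with its fitted value.

```python
import numpy as np

# Noisy observations of a roughly linear relationship y ~ a*x + b (invented numbers).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.7, 30.0, 14.2, 15.8])   # 30.0 looks like noise

# Fit a linear regression and replace each value by the fitted (smoothed) value.
a, b = np.polyfit(x, y, deg=1)
y_smooth = a * x + b
print(np.round(y_smooth, 1))
```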
2. Data Integration
3. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)
Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] using v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA.
In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A: a value v of A is normalized to v' = (v - meanA) / stdA.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A: a value v of A is normalized to v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1. (A sketch of these three normalization methods appears after this list of transformation ways.)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
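The sketch referenced above implements the three normalization methods on an invented attribute; the numbers are illustrative only.

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # an attribute A (invented)

def min_max(v, new_min=0.0, new_max=1.0):
    """Linear transformation of A onto [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Normalize based on the mean and standard deviation of A."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Move the decimal point so that the largest absolute value falls below 1."""
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

print(min_max(values))          # [0.    0.125 0.25  0.5   1.   ]
print(z_score(values))
print(decimal_scaling(values))  # [0.02 0.03 0.04 0.06 0.1 ]
```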
4. Data Reduction:
Data mining is a technique used to handle huge amounts of data. While working with a huge volume of data, analysis becomes harder. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis). (A sketch of data cube aggregation and PCA follows this list.)
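The sketch referenced in the list above illustrates two of the reduction strategies on invented data: rolling quarterly sales up to a yearly data cube (aggregation) and projecting four correlated attributes onto two principal components (dimensionality reduction with PCA, assuming scikit-learn is available).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Data cube aggregation: roll quarterly sales (detailed level) up to the yearly level.
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 402, 380, 600],
})
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)                            # two rows instead of eight

# Dimensionality reduction: keep 2 principal components out of 4 attributes (lossy).
X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.2, 2.9],
              [2.2, 2.9, 0.6, 1.1],
              [1.9, 2.2, 0.8, 0.9],
              [0.3, 0.4, 2.5, 3.0]])
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (5, 2) -- the original X is not fully recoverable
print(pca.explained_variance_ratio_)     # fraction of the variance retained
```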
Binning method - Example
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Binning method - Example (Cont..)
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
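A short sketch that reproduces the two worked examples above: equal-depth partitioning of the given data, followed by smoothing by bin means and by bin boundaries.

```python
# Equal-depth binning and smoothing, reproducing the worked example above.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```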
Data mining is the process of discovering interesting knowledge from large amounts
of data stored either in databases, data warehouses, or other information repositories.
Based on this view, the architecture of a typical data mining system may have the
following major components:
Pattern evaluation
Knowledge base
Data cleansing
Data integration and filtering
There is a lot of confusion between data mining and data analysis. Data mining functions are used
to define the trends or correlations contained in data mining activities. While data analysis is used
to test statistical models that fit the dataset, for example, analysis of a marketing campaign, data
mining uses Machine Learning and mathematical and statistical models to discover patterns
hidden in the data. Data mining activities can be divided into two categories: descriptive and predictive (as described earlier).
Data Discretization
Top-down Discretization -
• If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization (or splitting).
Bottom-up Discretization -
• If the process starts by considering all of the continuous values as potential split points, removes some by merging neighbouring values to form intervals, and then applies this recursively to the resulting intervals, it is called bottom-up discretization (or merging).
Concept Hierarchies
Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via:
Data Characterization –
It is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class is typically collected by a database query.
Data Discrimination –
It is a comparison of general features of target class data objects with the general features from
one or a set of contrasting classes. The target and contrasting classes can be specified by the
user, the corresponding data objects retrieved through database queries. For example, the user
may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period. The methods used for data discrimination are similar to those used for data characterization.
The simplest kind of descriptive data mining is called concept description. A concept usually refers to a collection of data such as frequent_buyers, graduate_students, and so on.
As a data mining task, concept description is not a simple enumeration of the data. Instead, concept description generates descriptions for characterization and comparison of the data.
It is sometimes called class description when the concept to be described refers to a class of
objects
• Characterization: It provides a concise and succinct summarization of the given
collection of data.
• Comparison: It provides descriptions comparing two or more collections of data.
Data Generalization
A process that abstracts a large set of task-relevant data in a database from a low conceptual
level to higher ones.
Data Generalization is a summarization of general features of objects in a target class and
produces what is called characteristic rules.
The data relevant to a user-specified class are normally retrieved by a database query and
run through a summarization module to extract the essence of the data at different levels of
abstractions.
For example, one may want to characterize the "OurVideoStore" customers who regularly
rent more than 30 movies a year. With concept hierarchies on the attributes describing the
target class, the attribute-oriented induction method can be used, for example, to carry out
data summarization.
Note that with a data cube containing a summarization of data, simple OLAP operations fit
the purpose of data characterization.
Approaches:
• Data cube approach(OLAP approach).
• Attribute-oriented induction approach.
Cross-Tabulation:
• Mapping results into cross-tabulation form (similar to contingency tables).
Visualization Techniques:
• Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
• Mapping generalized results in characteristic rules with quantitative information
associated with it.
Summary
Data generalization is the process that abstracts a large set of task-relevant data in a database
from a low conceptual level to higher ones.
It is a summarization of general features of objects in a target class and produces what is
called characteristic rules.
ATTRIBUTE RELEVANCE
Reasons for attribute relevance analysis
There are several reasons for attribute relevance analysis are as follows −
• It can decide which dimensions must be included.
• It can produce a high level of generalization.
• It can reduce the number of attributes that support us to read patterns easily.
The basic concept behind attribute relevance analysis is to evaluate some measure that can
compute the relevance of an attribute regarding a given class or concept. Such measures
involve information gain, ambiguity, and correlation coefficient.
Attribute relevance analysis for concept description is implemented as follows −
Data collection − It can collect data for both the target class and the contrasting class by
query processing.
Preliminary relevance analysis using conservative AOI − This step recognizes a set of
dimensions and attributes on which the selected relevance measure is to be used.
AOI can be used to implement preliminary analysis on the data by eliminating attributes
having a high number of distinct values. It can be conservative, the AOI implemented should
employ attribute generalization thresholds that are set reasonably large to enable more
attributes to be treated in further relevance analysis by the selected measure.
Remove − This process removes irrelevant and weakly relevant attributes using the selected
relevance analysis measure.
Generate the concept description using AOI − It can implement AOI using a less
conservative set of attribute generalization thresholds. If the descriptive mining function is
class characterization, only the original target class working relation is included now.
If the descriptive mining function is class comparison, both the original target class working relation and the original contrasting class working relation are included.
CLASS COMPARISONS
ASSOCIATION RULE MINING:
Market basket analysis - basic concepts
Market Basket Analysis: A Motivating Example
Frequent itemset mining leads to the discovery of associations and correlations
among items in large transactional or relational data sets. With massive amounts of data
continuously being collected and stored, many industries are becoming interested in
mining such patterns from their databases. The discovery of interesting correlation
relationships among huge amounts of business transaction records can help in many business decision-making processes such as catalog design, cross-marketing,
and customer shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets” (Figure 6.1). The discovery of these
associations can help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of
bread) on the same trip to the supermarket? This information can lead to increased sales by
helping retailers do selective marketing and plan their shelf space.
(Figure 6.1: Market basket analysis. Shopping baskets of different customers, containing items
such as milk, bread, cereal, sugar, eggs, and butter, are examined by a market analyst.)
Let’s look at an example of how market basket analysis can be useful.
Market basket analysis. Suppose, as manager of an All Electronics branch, you would like
to learn more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?” To answer your
question, market basket analysis may be performed on the retail data of customer
transactions at your store. You can then use the results to plan marketing or advertising
strategies, or in the design of a new catalog. For instance, market basket analysis may help you
design different store layouts. In one strategy, items that are frequently purchased together
can be placed in proximity to further encourage the combined sale of such items. If customers
who purchase computers also tend to buy antivirus software at the same time, then placing
the hardware display close to the software display may help increase the sales of both
items.
In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale
while heading toward the software display to purchase antivirus software, and may decide to
purchase a home security system as well. Market basket analysis can also help retailers plan
which items to put on sale at reduced prices. If customers tend to purchase computers and
printers together, then having a sale on printers may encourage the sale of printers as well as
computers.
If we think of the universe as the set of items available at the store, then each item has a Boolean
variable representing the presence or absence of that item. Each basket can then be
represented by a Boolean vector of values assigned to these variables. The Boolean vectors
can be analyzed for buying patterns that reflect items that are frequently associated or
purchased together. These patterns can be represented in the form of association rules. For
example, the information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following association rule (Rule 6.1):
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for Rule (6.1)
means that 2% of all the transactions under analysis show that computer and antivirus
software are purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold. These thresholds can be set by users or domain experts.
Additional analysis can be performed to discover interesting statistical correlations
between associated items.
ASSOCIATION RULE MINING
Association rule learning is a rule-based machine learning method for discovering interesting
relations between variables in large databases.
It is intended to identify strong rules discovered in databases using some measures of
interestingness
❖ Learning of Association rules is used to find relationships between attributes in large databases.
An association rule, A => B, will be of the form "for a set of transactions, some value of itemset
A determines the values of itemset B under the condition in which minimum support and
confidence are met".
❖ Support and Confidence can be represented by the following example:
Bread ⇒ Butter [support = 2%, confidence = 60%]
❖ The above statement is an example of an association rule. It means that 2% of all
transactions show bread and butter purchased together, and 60% of the customers who bought
bread also bought butter.
Association rule mining consists of 2 steps:
1. Find all the frequent itemsets.
2. Generate association rules from the above frequent itemsets
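Since both steps are driven by support and confidence, the following is a minimal Python sketch (not part of the original notes; the transaction list and item names are invented for illustration) that computes the two measures for a candidate rule A => B:

# Minimal sketch: support and confidence of a rule A => B over a toy transaction list.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union B) / support(A)."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))      # 0.6  -> 60% of baskets hold both items
print(confidence({"bread"}, {"butter"}, transactions))  # ~0.75 -> 75% of bread baskets also hold butter

A rule is reported only if both values clear the user-chosen minimum support and minimum confidence thresholds.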
Finding frequent item sets: Apriori algorithm - generating rules – Improved
Apriori algorithm – Incremental ARM – Associative Classification – Rule
Mining
APRIORI ALGORITHM
❖ With the quick growth in e-commerce applications, vast quantities of data accumulate in
months rather than years. Data Mining, also known as Knowledge Discovery in
Databases (KDD), is used to find anomalies, correlations, patterns, and trends to predict outcomes.
❖ Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets
and relevant association rules. It is devised to operate on a database containing a lot of
transactions, for instance, items brought by customers in a store.
❖ It is very important for effective Market Basket Analysis and it helps the customers in
purchasing their items with more ease which increases the sales of the markets. It has also been
used in the field of healthcare for the detection of adverse drug reactions (ADRs). It produces
association rules that indicate which combinations of medications and patient characteristics lead to ADRs.
WHAT IS AN ITEMSET?
❖ A set of items together is called an itemset. If any itemset has k-items it is called a k-itemset.
An itemset consists of one or more items. An itemset that occurs frequently is called a frequent
itemset. Thus frequent itemset mining is a data mining technique to identify the items that often
occur together.
➢ For Example, Bread and butter, Laptop and Antivirus software, etc
WHAT IS A FREQUENT ITEMSET?
❖ A set of items is called frequent if it satisfies a minimum threshold value for support and
confidence. Support shows how often the items appear together in a single transaction.
Confidence shows how often the items in the consequent appear in transactions that already
contain the antecedent.
❖ For frequent itemset mining method, we consider only those transactions which meet minimum
threshold support and confidence requirements. Insights from these mining algorithms offer a lot
of benefits, cost-cutting and improved competitive advantage.
❖ There is a tradeoff between the time taken to mine the data and the volume of data for frequent
mining. The frequent mining algorithm is an efficient algorithm that mines the hidden patterns
of itemsets within a short time and with low memory consumption.
FREQUENT PATTERN MINING
❖ The frequent pattern mining algorithm is one of the most important techniques of data mining
to discover relationships between different items in a dataset. These relationships are represented
in the form of association rules. It helps to find the irregularities in data.
❖ FPM has many applications in the field of data analysis, software bugs, cross-marketing, sale
campaign analysis, market basket analysis, etc.
❖ Frequent itemsets discovered through Apriori have many applications in data mining tasks,
such as finding interesting patterns in the database and finding sequences; mining of
association rules is the most important of them.
❖ Association rules apply to supermarket transaction data, that is, to examine the customer
behavior in terms of the purchased products. Association rules describe how often the items are
purchased together
WHY FREQUENT ITEMSET MINING?
❖ Frequent itemset or pattern mining is broadly used because of its wide applications in mining
association rules, correlations and graph patterns constraint that is based on frequent patterns,
sequential patterns, and many other data mining tasks.
❖ Apriori says:
➢ If P(I) < minimum support threshold, then the itemset I is not frequent.
➢ If P(I + A) < minimum support threshold, then I + A is not frequent either, where A also
belongs to the itemset.
➢ If an itemset has support less than the minimum support, then all of its supersets will also fall
below minimum support, and thus can be ignored. This property is called the antimonotone property.
Transactional Data for an AllElectronics Branch (minimum support = 22%, i.e. a minimum
support count of 2 out of 9 transactions).
Note: L1 ⋈ L1 is equivalent to L1 × L1, since the definition of Lk ⋈ Lk requires the two joining
itemsets to share k − 1 = 0 items.
TID List of item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
Generation of the candidate itemsets and frequent itemsets, where the minimum support count is 2.
1. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated, as shown in the middle table of the second row in
Figure 6.2.
2. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
3. The generation of the set of the candidate 3-itemsets, C3, is detailed in Figure 6.3. From
the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine that the four latter candidates
cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort
of unnecessarily obtaining their counts during the subsequent scan of D to determine L3.
Note that when given a candidate k-itemset, we only need to check if its (k − 1)-subsets
are frequent since the Apriori algorithm uses a level-wise
(a) Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}}
⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}}
= {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
(b) Prune using the Apriori property: All nonempty subsets of a frequent itemset must
also be frequent. Do any of the candidates have a subset that is not frequent?
The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}. All 2-item subsets of
{I1, I2, I3} are members of L2. Therefore, keep {I1, I2, I3} in C3.
The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}. All 2-item subsets of
{I1, I2, I5} are members of L2. Therefore, keep {I1, I2, I5} in C3.
The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, and {I3, I5}. {I3, I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I1, I3, I5} from C3.
The 2-item subsets of {I2, I3, I4} are {I2, I3}, {I2, I4}, and {I3, I4}. {I3, I4} is not a
member of L2, and so it is not frequent. Therefore, remove {I2, I3, I4} from C3.
The 2-item subsets of {I2, I3, I5} are {I2, I3}, {I2, I5}, and {I3, I5}. {I3, I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I2, I3, I5} from C3.
The 2-item subsets of {I2, I4, I5} are {I2, I4}, {I2, I5}, and {I4, I5}. {I4, I5} is not a
member of L2, and so it is not frequent. Therefore, remove {I2, I4, I5} from C3.
(c) Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning.
Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property.
search strategy. The resulting pruned version of C3 is shown in the first table of the
bottom row of Figure 6.2
1. The transactions in D are scanned to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support (Figure 6.2).
2. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned
because its subset {I2, I3, I5} is not frequent. Thus, C4 = φ, and the algorithm
terminates, having found all of the frequent itemsets.
Advantages
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large itemsets in large databases
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum
support is kept very low.
2. The entire database needs to be scanned.
PSEUDO CODE OF APRIORI ALGORITHM
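The pseudocode itself is not reproduced in these notes, so the sketch below is only an illustrative Python version of the level-wise Apriori loop (join, prune, and support counting). The helper names and the representation of transactions as sets are assumptions made for this example, not the textbook's exact pseudocode.

from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise frequent itemset mining (illustrative sketch)."""
    def count(itemset):
        # Scan the database and count the transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support_count}  # L1
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: build candidate k-itemsets from the frequent (k-1)-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count step: keep only candidates meeting the minimum support count.
        Lk = {c for c in candidates if count(c) >= min_support_count}
        frequent |= Lk
        k += 1
    return frequent

# The nine AllElectronics-style transactions with minimum support count 2:
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
     {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(sorted(map(sorted, apriori(D, 2))))

Run on the transaction table above, this returns the same frequent itemsets (up to {I1, I2, I3} and {I1, I2, I5}) that were derived by hand.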
INCREMENTAL ARM
❖ It is noted that analysis of past transaction data can provide very valuable information on
customer buying behavior, and thus improve the quality of business decisions.
❖ With the increasing use of record-based databases, data is being continuously
added, updated, and deleted.
❖ Examples of such applications include Web log records, stock market data, grocery sales
data, transactions in e-commerce, and daily weather/traffic records etc.
❖ In many applications, we would like to mine the transaction database for a fixed amount of
most recent data (say, data in the last 12 months)
❖ Mining is not a one-time operation; a naive approach to solve the incremental mining
problem is to re-run the mining algorithm on the entire updated database, which wastes the
results of earlier mining runs.
ASSOCIATIVE CLASSIFICATION
❖ Associative classification (AC) is a branch of a wide area of scientific study known as data
mining. Associative classification makes use of association rule mining for extracting efficient
rules, which can precisely generalize the training data set, in the rule discovery process.
❖ An associative classifier (AC) is a kind of supervised learning model that uses association
rules to assign a target value. The term associative classification was coined by Bing Liu et al., in
which the authors defined a model made of rules "whose right-hand side are restricted to the
classification class attribute"
FP GROWTH ALGORITHM
• The FP-Growth Algorithm, proposed by Han et al. in 2000, is an efficient and scalable method for
mining the complete set of frequent patterns by pattern fragment growth, using an
extended prefix-tree structure for storing compressed and crucial information about
frequent patterns named frequent-pattern tree (FP-tree).
• This algorithm is an improvement to the Apriori method. A frequent pattern is
generated without the need for candidate generation. FP growth algorithm represents
the database in the form of a tree called a frequent pattern tree or FP tree.
• This tree structure will maintain the association between the itemsets. The database is
fragmented using one frequent item. This fragmented part is called “pattern fragment”.
The itemsets of these fragmented patterns are analyzed. Thus with this method, the
search for frequent itemsets is reduced comparatively
FP TREE
❖ Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the
database. The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP
tree represents an item of the itemset.
❖ The root node represents null while the lower nodes represent the itemsets. The association
of the nodes with the lower nodes that is the itemsets with the other itemsets are maintained
while forming the tree
FP GROWTH STEPS
1) The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in the
database is called support count or frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.
3) The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the itemset in it. The itemset with the max count is taken
at the top, the next itemset with lower count and so on. It means that the branch of the
tree is constructed with transaction itemsets in descending order of count.
4) The next transaction in the database is examined. The itemsets are ordered in
descending order of count. If any itemset of this transaction is already present in
another branch (for example in the 1st transaction), then this transaction branch would
share a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in this
transaction
5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked according
to transactions.
6) The next step is to mine the created FP Tree. For this, the lowest node is examined first along
with the links of the lowest nodes. The lowest node represents the frequency pattern length 1.
From this, traverse the path in the FP Tree. This path or paths are called a conditional pattern
base. Conditional pattern base is a sub-database consisting of prefix paths in the FP tree
occurring with the lowest node (suffix).
7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree
1. The lowest node item I5 is not considered as it does not have a min support count,
hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2, I1, I3, I4: 1} and {I2, I3, I4: 1}.
Therefore, considering I4 as the suffix, the prefix paths will be {I2, I1, I3: 1} and {I2, I3: 1}.
This forms the conditional pattern base.
3. The conditional pattern base is considered a transaction database, and an FP-tree is
constructed from it. This will contain {I2: 2, I3: 2}; I1 is not considered as it does not meet
the min support count.
4. This path will generate all combinations of frequent patterns:
{I2, I4: 2}, {I3, I4: 2}, {I2, I3, I4: 2}.
5. For I3, the prefix paths would be {I2, I1: 3} and {I2: 1}; this will generate a 2-node FP-tree:
{I2: 4, I1: 3}, and the frequent patterns generated are {I2, I3: 4}, {I1, I3: 3}, {I2, I1, I3: 3}.
6. For I1, the prefix path would be {I2: 4}; this will generate a single-node FP-tree: {I2: 4},
and the frequent pattern generated is {I2, I1: 4}.
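For readers who want to run FP-growth in practice, one option (an assumption on our part, since the notes do not prescribe any tool) is the third-party mlxtend library; the sketch below applies its fpgrowth function to the AllElectronics-style transactions used earlier for Apriori.

# Sketch assuming pandas and mlxtend are installed (pip install pandas mlxtend).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"], ["I1", "I3"],
    ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# A minimum support count of 2 out of 9 transactions corresponds to min_support = 2/9.
frequent_itemsets = fpgrowth(onehot, min_support=2/9, use_colnames=True)
print(frequent_itemsets)

The same library also provides an association_rules helper for turning the frequent itemsets into rules with support and confidence, mirroring step 2 of association rule mining.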
Advantages Of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which
scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
1. FP Tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the algorithm may not fit in the shared memory
MODULE 4
Classification and prediction:
What is classification and prediction? – Issues regarding Classification and prediction:
Classification methods: Decision tree, Bayesian Classification, Rule based, CART, Neural Network
Prediction methods: Linear and nonlinear regression, Logistic Regression. Introduction of tools
such as DB Miner /WEKA/DTREG DM Tools.
What is classification and prediction?
There are two forms of data analysis that can be used to extract models describing important
classes or predict future data trends. These two forms are as follows:
1. Classification
2. Prediction
We use classification and prediction to extract a model, representing the data classes to
predict future data trends. Classification predicts the categorical labels of data with the
prediction models. This analysis provides us with the best understanding of the data at a large
scale.
Classification models predict categorical class labels, and prediction models predict
continuous-valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky or a prediction model to predict the expenditures
in dollars of potential customers on computer equipment given their income and occupation.
What is Classification?
Classification is to identify the category or the class label of a new observation. First, a set
of data is used as training data. The set of input data and the corresponding outputs are
given to the algorithm. So, the training data set includes the input data and their
associated class labels. Using the training dataset, the algorithm derives a model or the
classifier. The derived model can be a decision tree, mathematical formula, or a neural
network. In classification, when unlabeled data is given to the model, it should find the
class to which it belongs. The new data provided to the model is the test data set.
The bank needs to analyze whether giving a loan to a particular customer is risky or
not. For example, based on observable data for multiple loan borrowers, a classification
model may be established that forecasts credit risk. The data could track job records,
homeownership or leasing, years of residency, number, type of deposits, historical credit
ranking, etc. The goal would be credit ranking, the predictors would be the other
characteristics, and the data would represent a case for each consumer. In this example,
a model is constructed to find the categorical label. The labels are risky or safe.
The functioning of classification with the assistance of the bank loan application has been
mentioned above. There are two stages in the data classification system: classifier or
model creation, and applying the classifier for classification.
1. Developing the Classifier or model creation: This level is the learning stage or the
learning process. The classification algorithms construct the classifier in this stage. A
classifier is constructed from a training set composed of the records of databases and their
corresponding class names. Each category that makes up the training set is referred to as
a category or class. We may also refer to these records as samples, objects, or data points.
2. Applying classifier for classification: The classifier is used for classification at this level.
The test data are used here to estimate the accuracy of the classification rules. If the
accuracy is deemed sufficient, the rules can be applied to classify new
data records. Applications of classification include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media
monitoring. We can use it to extract social media insights. We can build sentiment
analysis models to read and analyze misspelled words with advanced machine
learning algorithms. Accurately trained models provide consistently accurate
outcomes in a fraction of the time.
o Document Classification: We can use document classification to organize the
documents into sections according to the content. Document classification refers
to text classification; we can classify the words in the entire document. And with
the help of machine learning classification algorithms, we can execute it
automatically.
o Image Classification: Image classification is used for the trained categories of an
image. These could be the caption of the image, a statistical value, a theme. You
can tag images to train your model for relevant categories by applying supervised
learning algorithms.
o Machine Learning Classification: It uses the statistically demonstrable algorithm
rules to execute analytical tasks that would take humans hundreds of more hours
to perform.
3. Data Classification Process: The data classification process can be categorized into five
steps:
o Create the goals of data classification, strategy, workflows, and architecture of data
classification.
o Classify confidential details that we store.
o Apply labels to the data (data labelling).
o Use the results to improve security and compliance.
o Treat classification as a continuous process, since data keeps changing.
1. Origin: Sensitive data is produced in various formats, including emails, Excel, Word, Google
documents, social media, and websites.
2. Role-based practice: Role-based security restrictions apply to all sensitive data, by tagging
based on in-house protection policies and compliance rules.
3. Storage: Here, we have the obtained data, including access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from
various devices and platforms.
5. Archive: Here, data is eventually archived within an industry's storage systems.
6. Publication: Through the publication of data, it can reach customers. They can then view
and download in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. Same
as in classification, the training dataset contains the inputs and corresponding numerical
output values. The algorithm derives the model or a predictor according to the training
dataset. The model should find a numerical output when the new data is given. Unlike in
classification, this method does not have a class label. The model predicts a continuous-
valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house depending on
the facts such as the number of rooms, the total area, etc., is an example for prediction.
For example, suppose the marketing manager needs to predict how much a particular
customer will spend at his company during a sale. We are asked to forecast a numerical
value in this case; therefore, this data processing activity is an example of numeric prediction.
Here, a model or a predictor will be developed that forecasts a continuous
or ordered value function.
1. Data Cleaning: Data cleaning involves removing the noise and treatment of
missing values. The noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the
following methods.
o Normalization: The data is transformed using normalization.
Normalization involves scaling all values for a given attribute to make them
fall within a small specified range. Normalization is used when the neural
networks or the methods involving measurements are used in the learning
step.
o Generalization: The data can also be transformed by generalizing it to the
higher concept. For this purpose, we can use the concept hierarchies
o Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to
predict the class label correctly, and the accuracy of the predictor can be referred to as
how well a given predictor can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and
using the classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the
context of data mining, robustness is the ability of the classifier or predictor to make
correct predictions from noisy data or data with missing values.
o Scalability: Scalability refers to an increase or decrease in the performance of the classifier
or predictor based on the given data.
o Interpretability: Interpretability is how readily we can understand the reasoning behind
predictions or classification made by the predictor or classifier.
Classification vs. Prediction
• Classification is the process of identifying which category a new observation belongs to, based
on a training data set containing observations whose category membership is known.
Prediction is the process of identifying the missing or unavailable numerical data for a new
observation.
• In classification, the accuracy depends on finding the class label correctly. In prediction, the
accuracy depends on how well a given predictor can guess the value of a predicted attribute
for new data.
• In classification, the model can be known as the classifier. In prediction, the model can be
known as the predictor.
• In classification, a model or classifier is constructed to find the categorical labels. In
prediction, a model or predictor is constructed that predicts a continuous-valued function or
ordered value.
• For example, the grouping of patients based on their medical records can be considered a
classification, while predicting the correct treatment for a particular disease for a person can
be thought of as prediction.
Classification methods:
DECISION TREE
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
step 3. Continue this process until a stage is reached where you cannot further classify
the nodes; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with
the root node (Salary attribute by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:
While implementing a Decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems, there is a
technique called the Attribute Selection Measure (ASM). By this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:
Information Gain = Entropy(S) − [(Weighted Average) × Entropy(each feature)]
Where,
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
(S is the set of training samples, P(yes) is the probability of "yes", and P(no) is the probability of "no".)
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj (Pj)^2
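A short, illustrative Python sketch (not from the original notes) showing how entropy, information gain, and the Gini index defined above can be computed from class-count distributions:

from math import log2

def entropy(counts):
    """Entropy of a class-count distribution, e.g. counts = [9, 5] for 9 yes / 5 no."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def information_gain(parent_counts, child_counts_list):
    """Entropy(parent) minus the weighted average entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Example: a node with 9 yes / 5 no, split by some attribute into [6, 2] and [3, 3].
print(round(entropy([9, 5]), 3))                            # ~0.940
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))
print(round(gini([9, 5]), 3))                               # ~0.459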
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two types
of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning
BAYESIAN CLASSIFICATION
Data Mining Bayesian Classifiers
In numerous applications, the connection between the attribute set and the class variable
is non-deterministic. In other words, the class label of a test record cannot be
predicted with certainty even though its attribute set is the same as some of the training
examples. These circumstances may emerge due to noisy data or the presence of
certain confounding factors that influence classification but are not included in the analysis.
For example, consider the task of predicting whether an individual is at
risk for liver illness based on the individual's eating habits and working efficiency. Although
most people who eat healthily and exercise consistently have a lower probability of
liver disease, they may still develop it due to other factors, for example
consumption of high-calorie street food or alcohol abuse. Determining whether an
individual's eating routine is healthy or the workout efficiency is sufficient is also subject
to analysis, which in turn may introduce uncertainties into the learning problem.
Bayesian classification uses Bayes theorem to predict the occurrence of any event.
Bayesian classifiers are the statistical classifiers with the Bayesian probability
understandings. The theorem expresses how a degree of belief, expressed as a probability,
should rationally change to account for the availability of related evidence.
Bayes' theorem is named after Thomas Bayes, who first used conditional
probability to provide an algorithm that uses evidence to calculate limits on an unknown
parameter.
Bayes' theorem states that
P(X | Y) = P(Y | X) · P(X) / P(Y)
where P(X) and P(Y) are the probabilities of observing X and Y independently of each other.
These are known as the marginal probabilities.
Bayesian interpretation:
P(X | Y) = P(X ⋂ Y) / P(Y), where P(X ⋂ Y) is the joint probability of both X and Y being true,
because P(X ⋂ Y) = P(Y | X) · P(X).
Bayesian network:
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
connection between the nodes.
The nodes here represent random variables, and the edges define the relationship
between these variables.
Example
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability he/she has meningitis?
P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
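A quick numeric check of this example in Python (the values are taken directly from the statement above):

# Bayes' theorem applied to the meningitis / stiff-neck example.
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50000          # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002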
RULE BASED
• Classify records by using a collection of “if…then…” rules
• Rule: (Condition) → y where
• Condition is a conjunction of attribute tests
• y is the class label
• LHS: rule antecedent or condition
• RHS: rule consequent
• Examples of classification rules:
• (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
example
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following from −
IF condition THEN conclusion
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and
these tests are logically ANDed.
• The consequent part consists of class prediction.
For example, consider the rule R1: IF age = youth AND student = yes THEN buys_computer = yes.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) → (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
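As an illustration (the rule and the record below are hypothetical), applying such an IF-THEN rule to a tuple simply means checking every ANDed attribute test in the antecedent:

# Sketch: applying rule R1 to a tuple represented as a dict of attribute values.
def rule_matches(condition, record):
    """True when every attribute test in the antecedent is satisfied (logical AND)."""
    return all(record.get(attr) == value for attr, value in condition.items())

R1_condition = {"age": "youth", "student": "yes"}
R1_consequent = ("buys_computer", "yes")

record = {"age": "youth", "student": "yes", "income": "medium"}
if rule_matches(R1_condition, record):
    attr, value = R1_consequent
    print(f"{attr} = {value}")   # the rule fires: buys_computer = yes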
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm
Sequential Covering Algorithm can be used to extract IF-THEN rules form the training
data. We do not require to generate a decision tree first. In this algorithm, each rule for
a given class covers many of the tuples of that class.
Some of the sequential Covering Algorithms are AQ, CN2, and RIPPER. As per the
general strategy, the rules are learned one at a time. Each time a rule is learned, the
tuples covered by the rule are removed and the process continues for the rest of the tuples.
This is because the path to each leaf in a decision tree corresponds to a rule.
Note − The Decision tree induction can be considered as learning a set of rules
simultaneously.
The Following is the sequential learning Algorithm where rules are learned for one class
at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples
from class Ci only and no tuple from any other class.
Algorithm: Sequential Covering
Input:
D, a data set class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
Output: A Set of IF-THEN rules.
Method:
Rule_set = { }; // initial set of rules learned is empty
repeat
    Rule = Learn_One_Rule(D, Att_vals, c);
    remove tuples covered by Rule from D;
    Rule_set = Rule_set + Rule; // add the new rule to the set of rules learned
until termination condition;
return Rule_set;
FOIL_Prune(R) = (pos − neg) / (pos + neg),
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
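A tiny helper (illustrative only; the pos/neg counts are hypothetical) that computes the FOIL_Prune value and shows how the original and pruned versions of a rule would be compared:

def foil_prune(pos, neg):
    """(pos - neg) / (pos + neg): higher means the rule generalises better on the pruning set."""
    return (pos - neg) / (pos + neg)

print(foil_prune(pos=40, neg=10))   # 0.60 for the original rule R
print(foil_prune(pos=38, neg=5))    # ~0.77 for a pruned version of R, so pruning is preferred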
CART
• CART is an alternative decision tree building algorithm. It can handle both classification
and regression tasks.
• This algorithm uses a new metric named gini index to create decision points for
classification tasks. We will go through a step by step CART decision tree example based on
the Gini index.
Gini index
• Gini index is a metric for classification tasks in CART. It stores the sum of squared
probabilities of each class. We can formulate it as illustrated below.
• Gini = 1 − Σ (Pi)^2, for i = 1 to the number of classes.
• Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final
decisions for outlook feature.
• Temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild.
• Summarize decisions for temperature feature.
• Gini(Temp=Hot) = 1 − (2/4)^2 − (2/4)^2 = 0.5
• Gini(Temp=Cool) = 1 − (3/4)^2 − (1/4)^2 = 1 − 0.5625 − 0.0625 = 0.375
• Gini(Temp=Mild) = 1 − (4/6)^2 − (2/6)^2 = 1 − 0.444 − 0.111 = 0.445
• We’ll calculate weighted sum of gini index for temperature feature
• Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 =
0.439
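The weighted Gini computation above can be re-checked with a few lines of Python; the (yes, no) decision counts per temperature value follow the worked example.

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# (yes, no) counts for each value of the Temperature feature (14 instances in total).
splits = {"Hot": (2, 2), "Cool": (3, 1), "Mild": (4, 2)}
n = sum(sum(c) for c in splits.values())

weighted_gini = sum(sum(c) / n * gini(c) for c in splits.values())
print(round(weighted_gini, 3))   # ~0.440, matching the 0.439 obtained above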
• The CART algorithm combines testing with a test data set and cross-validation to assess
the goodness of fit of the tree.
• CART allows one to utilize the same variables many times in various regions of the tree,
which can reveal intricate interdependencies between groups of variables.
• Stopping rules can be relaxed to let the tree overgrow, and the tree is then trimmed
down to its ideal size. This method reduces the likelihood of missing important structure
in the data set by stopping too soon.
• To choose the input set of variables, CART can be used in combination with other
prediction algorithms.
The CART algorithm is a subpart of Random Forest, which is one of the most powerful
machine learning algorithms. A CART tree is built by asking a sequence of
questions, the responses to which decide the following question, if any. The ultimate
outcome of these questions is a tree-like structure with terminal nodes when there are
no more questions.
This algorithm is widely used in making Decision Trees through Classification and
Regression. Decision Trees are widely used in data mining to create a model that
predicts the value of a target based on the values of many input variables (or
independent variables).
NEURAL NETWORK
PREDICTION METHODS:
Linear and nonlinear regression, Logistic Regression.
REGRESSION
Linear Regression
Linear regression is the type of regression that forms a relationship between the target
variable and one or more independent variables utilizing a straight line. The given
equation represents the equation of linear regression
Y = a + b*X + e.
Where Y is the dependent (target) variable, X is the independent (predictor) variable, a is the
intercept, b is the slope of the line, and e is the error term.
In linear regression, the best fit line is achieved utilizing the least squared method, and it
minimizes the total sum of the squares of the deviations from each data point to the line
of regression. Here, the positive and negative deviations do not get canceled as all the
deviations are squared.
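A minimal NumPy sketch of the least-squares fit (the X and Y values are invented for illustration):

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

b, a = np.polyfit(X, Y, deg=1)      # slope b and intercept a of the best-fit line
print(a, b)

Y_hat = a + b * X
print(np.sum((Y - Y_hat) ** 2))     # residual sum of squares that the fit minimises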
• Often the relationship between x and y cannot be approximated with a straight line; in that
case, a nonlinear regression technique may be used.
• Alternatively, the data could be preprocessed to make the relationship linear.
LOGISTIC REGRESSION
A linear regression is not appropriate for predicting the value of a binary variable for two
reasons:
• A linear regression will predict values outside the acceptable range (e.g. predicting
probabilities outside the range 0 to 1).
• Since the experiments can only have one of two possible values for each experiment, the
residuals(random errors) will not be normally distributed about the predicted line.
A logistic regression produces a logistic curve, which is limited to values between 0 and 1.
• Logistic regression is similar to a linear regression, but the curve is constructed using the
natural logarithm “odds” of the target variable, rather than the probability.
Logistic regression is basically a supervised classification algorithm. In a classification
problem, the target variable(or output), y, can take only discrete values for a given set
of features(or inputs), X.
Contrary to popular belief, logistic regression IS a regression model. The model builds
a regression model to predict the probability that a given data entry belongs to the
category numbered as “1”. Just like Linear regression assumes that the data follows a
linear function, Logistic regression models the data using the sigmoid function.
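A minimal sketch of the sigmoid function and of how it turns a linear score into a probability (the coefficients below are hypothetical, not fitted to any data):

import numpy as np

def sigmoid(z):
    """Maps any real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

w0, w1 = -4.0, 1.2                      # hypothetical intercept and feature weight
x = np.array([1.0, 3.0, 5.0])           # feature values for three data entries
p = sigmoid(w0 + w1 * x)                # estimated P(y = 1 | x)
print(p)                                # probabilities strictly between 0 and 1
print((p >= 0.5).astype(int))           # predicted class labels with a 0.5 threshold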
1. binomial: target variable can have only 2 possible types: “0” or “1” which
may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
2. multinomial: target variable can have 3 or more possible types which are
not ordered(i.e. types have no quantitative significance) like “disease A” vs
“disease B” vs “disease C”.
3. ordinal: it deals with target variables with ordered categories. For example,
a test score can be categorized as:“very poor”, “poor”, “good”, “very good”.
Here, each category can be given a score like 0, 1, 2, 3.
Data Mining Applications (diagram): Recommender Systems, Intrusion Detection and Prevention.
• The concept of BSCs was first introduced in 1992 by David Norton and Robert Kaplan,
who took previous metric performance measures and adapted them to include
nonfinancial information.
• BSCs were originally developed for for-profit companies but were later adapted for
use by nonprofits and government agencies.
• The balanced scorecard involves measuring four main aspects of a business: Learning
and growth, business processes, customers, and finance.
• BSCs allow companies to pool information in a single report, to provide information
into service and quality in addition to financial performance, and to help improve
efficiencies.
Understanding Balanced Scorecards (BSCs)
Accounting academic Dr. Robert Kaplan and business executive and theorist Dr. David
Norton first introduced the balanced scorecard. The Harvard Business Review first published
it in the 1992 article "The Balanced Scorecard—Measures That Drive Performance." Both
Kaplan and Norton worked on a year-long project involving 12 top-performing companies.
Their study took previous performance measures and adapted them to include nonfinancial
information.1
Companies can easily identify factors hindering business performance and outline strategic
changes tracked by future scorecards.
BSCs were originally meant for for-profit companies but were later adapted for nonprofit
organizations and government agencies.2 It is meant to measure the intellectual capital of a
company, such as training, skills, knowledge, and any other proprietary information that gives
it a competitive advantage in the market. The balanced scorecard model reinforces good
behavior in an organization by isolating four separate areas that need to be analyzed. These
four areas, also called legs, involve learning and growth, business processes, customers, and
finance.
The scorecard can provide information about the firm as a whole when viewing company
objectives. An organization may use the balanced scorecard model to implement strategy
mapping to see where value is added within an organization. A company may also use a BSC
to develop strategic initiatives and strategic objectives.1 This can be done by assigning tasks
and projects to different areas of the company in order to boost financial and operational
efficiencies, thus improving the company's bottom line.
These four legs encompass the vision and strategy of an organization and require active
management to analyze the data collected.
The balanced scorecard is often referred to as a management tool rather than a
measurement tool because of its application by a company's key personnel.
Benefits of a Balanced Scorecard (BSC)
There are many benefits to using a balanced scorecard. For instance, the BSC allows
businesses to pool together information and data into a single report rather than having to deal
with multiple tools. This allows management to save time, money, and resources when they
need to execute reviews to improve procedures and operations.1
Scorecards provide management with valuable insight into their firm's service and quality in
addition to its financial track record. By measuring all of these metrics, executives are able to
train employees and other stakeholders and provide them with guidance and support. This
allows them to communicate their goals and priorities in order to meet their future goals.
Another key benefit of BSCs is how they help companies reduce their reliance on inefficiencies
in their processes. This is referred to as suboptimization. This often results in
reduced productivity or output, which can lead to higher costs, lower revenue, and a
breakdown in company brand names and their reputations.1
MARKET SEGMENTATION
Before businesses and companies release their products and services, they first decide
whom to cater to. Different products and services, even if from the same company, can be
marketed towards different groups of people.
That is why companies and organizations use different methods to know where they can best
market their product. One of the most used methods is market research. This is where
businesses gather, analyze and interpret all the information collected. After that, companies
need to group their customers for easier marketing. That method is called market
segmentation.
Market segmentation is not only used for business-to-customer relationships. It is also used
for business-to-business relationships. The same statistical methods are used for segmentation,
but characteristics like the demographics and categories will be different.
Data used for market segmentation is usually from an account or CRM (Customer Relationship
Management) database. Sometimes they also use external databases like researches and
surveys. The ones from databases are usually the product of data mining.
In simpler terms, data mining is where systems study large databases to derive new information
that can be used by businesses. This new information is used to forecast and calculate new
trends.
A lot of benefits can be derived from using data mining. One obvious benefit is that it will be
easier to discover unseen relationships and patterns in databases. It can be used for making
predictions about future trends and for marketing teams to devise their tactics to fit in with it.
Also, it will set apart a company from its competitor because of the data they know.
For marketing purposes, data mining is such a huge help. Using the database of Customer
Relationship Management (CRM), the demographics — age, sex, religion, income, occupation
and education, geographic, psychographic, and behavioral information of the customers will
be helpful in segmenting them. The segmentation process will be faster and easier. Also, with
the new data and information that comes with data mining, it can also help with the market
segmentation.
As for its business purpose, knowing your target market and their needs and wants is a lot
cheaper than releasing different products and services that will cater to different customers. It
will help businesses and companies use the full potential of their resources while still making
sales, rather than relying on trial and error.
Moreover, it will be easier for marketing teams to sell the services and products because each
market segment shares a common set of desires and expectations from companies.
RETAIL INDUSTRY
The retail industry is a major application area for data mining because it collects huge amounts
of records on sales, users shopping history, goods transportation, consumption, and service.
The quantity of data collected continues to expand promptly, especially because of the
increasing ease, accessibility, and popularity of business conducted on the internet, or e-
commerce.
Today, multiple stores also have websites where users can create purchases online. Some
businesses, including Amazon.com (www.amazon.com), exist solely online, without any
brick-and-mortar (i.e., physical) store areas. Retail data support a rich source for data mining.
Retail data mining can help identify user buying behaviors, find user shopping patterns and
trends, enhance the quality of user service, achieve better user retention and satisfaction,
increase goods consumption ratios, design more effective goods transportation and
distribution policies, and decrease the cost of business.
A few examples of data mining in the retail industry are as follows −
Design and construction of data warehouses based on the benefits of data mining −
Because retail data cover a broad spectrum (such as sales, customers, employees, goods
transportation, consumption, and services), there can be several methods to design a data
warehouse for this market.
The levels of detail to contain can also vary substantially. The results of preliminary data
mining exercises can be used to support guide the design and development of data warehouse
architecture. This contains deciding which dimensions and levels to contain and what pre-
processing to implement in order to facilitate effective data mining.
Multidimensional analysis of sales, customers, products, time, and region − The retail
market needed timely data regarding customer requirements, product sales, trends, and
fashions, and the quality, cost, profit, and service of commodities. It is essential to provide
dynamic multidimensional analysis and visualization tools, such as the construction of
sophisticated data cubes according to the requirement of data analysis.
Analysis of the effectiveness of sales campaigns − The retail market conducts sales
campaigns using advertisements, coupons, and several types of discounts and bonuses to
promote products and attract users. Careful analysis of the efficiency of sales campaigns can
support improve company profits.
Multidimensional analysis can be used for these goals by comparing the number of sales and
the multiple transactions including the sales items during the sales period versus those
including the same items before or after the sales campaign.
TELECOMMUNICATIONS INDUSTRY
The telecommunications industry is expanding and growing at a fast pace, especially with the
advent of the internet. Data mining can enable key industry players to improve their service
quality to stay ahead in the game.
Pattern analysis of spatiotemporal databases can play a huge role in mobile telecommunication,
mobile computing, and also web and information services. And techniques like outlier analysis
can detect fraudulent users. Also, OLAP and visualization tools can help compare information,
such as user group behaviour, profit, data traffic, system overloads, etc.
Fraudulent pattern analysis and the identification of unusual patterns: data mining can help
(1) identify potentially fraudulent users and their atypical usage patterns; (2) detect attempts
to gain fraudulent entry to customer accounts; and (3) discover unusual patterns that may need
special attention. Many of these patterns can be discovered by multidimensional analysis,
cluster analysis, and outlier analysis.
As another industry that handles huge amounts of data, the telecommunication industry
has quickly evolved from offering local and long-distance telephone services to providing
many other comprehensive communication services. These include cellular phone, smart
phone, Internet access, email, text messages, images, computer and web data transmissions, and
other data traffic. The integration of telecommunication, computer network, Internet, and
numerous other means of communication and computing has been under way, changing the face
of telecommunications and computing. This has created a great demand for data mining to help
understand business dynamics, identify telecommunication patterns, catch fraudulent activities,
make better use of resources, and improve service quality.
Data mining tasks in telecommunications share many similarities with those in the retail
industry. Common tasks include constructing large-scale data warehouses, performing
multidimensional visualization, OLAP, and in-depth analysis of trends, customer patterns,
and sequential patterns. Such tasks contribute to business improvements, cost reduction,
customer retention, fraud analysis, and sharpening the edges of competition. There are many
data mining tasks for which customized data mining tools for telecommunication have been
flourishing and are expected to play increasingly important roles in business.
Data mining has been popularly used in many other industries, such as insurance,
manufacturing, and health care, as well as for the analysis of governmental and institutional
administration data. Although each industry has its own characteristic data sets and
application demands, they share many common principles and methodologies. Therefore,
through effective mining in one industry, we may gain experience and methodologies that can
be transferred to other industrial applications.
BANKING & FINANCE AND CRM
Financial Analysis
The banking and finance industry relies on high-quality, reliable data. In loan markets,
financial and user data can be used for a variety of purposes, like predicting loan payments and
determining credit ratings. And data mining methods make such tasks more manageable.
Classification techniques facilitate the separation of crucial factors that influence customers’
banking decisions from the irrelevant ones. Further, multidimensional clustering techniques
allow the identification of customers with similar loan payment behaviours. Data analysis and
mining can also help detect money laundering and other financial crimes
Phase 1: Discovery
• –The data science team learns and investigates the problem.
• Develop context and understanding.
• Come to know about data sources needed and available for the project.
• The team formulates initial hypothesis that can be later tested with data.
Phase 2: Data Preparation
• –Steps to explore, preprocess, and condition data prior to modeling and analysis.
• It requires the presence of an analytic sandbox; the team executes extract, load, and
transform (ELT) operations to get data into the sandbox.
• Data preparation tasks are likely to be performed multiple times and not in predefined
order.
• Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine,
etc.
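A minimal sketch of the conditioning-and-loading step referenced above, using pandas and a local SQLite file as a stand-in for the analytic sandbox; the raw columns and cleaning rules are hypothetical:

    import sqlite3
    import pandas as pd

    # Hypothetical raw extract (e.g., a CSV dropped into the sandbox)
    raw = pd.DataFrame({
        "customer_id":   [101, 102, 102, 103],
        "signup_date":   ["2023-01-05", "2023-02-10", "2023-02-10", None],
        "monthly_spend": ["120.5", "80", "80", "95.2"],
    })

    # Condition the data: drop duplicates, fix types, handle missing values
    clean = (raw.drop_duplicates()
                .assign(signup_date=lambda d: pd.to_datetime(d["signup_date"]),
                        monthly_spend=lambda d: d["monthly_spend"].astype(float))
                .dropna(subset=["signup_date"]))

    # Load the conditioned data into the sandbox (here a local SQLite file)
    with sqlite3.connect("sandbox.db") as conn:
        clean.to_sql("customers_clean", conn, if_exists="replace", index=False)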
Phase 3: Model Planning
• The team explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models (a brief exploration sketch
follows below).
• In this phase, the data science team also develops the data sets to be used for training,
testing, and production purposes.
• Based on this exploratory work, the team plans the models that will be built and executed
in the next phase.
• Tools commonly used in this phase include Matlab and STATISTICA.
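A brief exploration sketch, as referenced above: correlating candidate variables with a hypothetical target and splitting the data for later phases (all column names are illustrative):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical customer data (column names are assumptions)
    df = pd.DataFrame({
        "age":        [25, 40, 31, 52, 46, 29, 60, 35],
        "income":     [30, 70, 45, 90, 80, 38, 95, 50],
        "web_visits": [20, 5, 12, 3, 4, 18, 2, 10],
        "churned":    [1, 0, 1, 0, 0, 1, 0, 1],
    })

    # Explore relationships between variables to pick candidate predictors
    print(df.corr()["churned"].sort_values())

    # Split the data into training and testing sets for the later phases
    train, test = train_test_split(df, test_size=0.25, random_state=0)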
Phase 4: Model Building
• The team develops data sets for training, testing, and production purposes.
• It also considers whether its existing tools will suffice for running the models or whether a
more robust environment is needed for executing them (a minimal model-building sketch
follows below).
• Free or open-source tools: R and PL/R, Octave, WEKA.
• Commercial tools: Matlab, STATISTICA.
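A minimal model-building sketch, as referenced above, using scikit-learn as an open-source stand-in for the tools listed; the synthetic data simply stands in for the prepared data sets:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for the training/testing sets prepared earlier
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Build and execute the model on the training set, then check the holdout
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))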
Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of modeling to the
criteria established for success and failure (a small comparison sketch follows below).
• The team considers how best to articulate the findings and outcomes to the various team
members and stakeholders, taking into account caveats and assumptions.
• The team should identify key findings, quantify the business value, and develop a narrative
to summarize and convey the findings to stakeholders.
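A small comparison sketch, as referenced above: checking hypothetical holdout predictions against illustrative success criteria (the thresholds and values are assumptions, not from the text):

    from sklearn.metrics import precision_score, recall_score

    # Hypothetical holdout labels/predictions and criteria agreed in Discovery
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
    criteria = {"precision": 0.70, "recall": 0.70}   # illustrative thresholds

    results = {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    for metric, threshold in criteria.items():
        verdict = "meets" if results[metric] >= threshold else "misses"
        print(f"{metric}: {results[metric]:.2f} ({verdict} the {threshold} target)")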
Phase 6: Operationalize
• The team communicates the benefits of the project more broadly and sets up a pilot project
to deploy the work in a controlled way before broadening it to the full enterprise of users.
• This approach enables the team to learn about the performance and related constraints of
the model in a production environment on a small scale, and to make adjustments before
full deployment (a minimal deployment sketch follows below).
• The team delivers final reports, briefings, and code.
• Free or open-source tools: Octave, WEKA, SQL, MADlib.
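A minimal deployment sketch, as referenced above: persisting an approved model so a pilot can load and score with it (the model, file name, and sample records are hypothetical):

    import joblib
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Tiny stand-in for the model approved in the earlier phases
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression().fit(X, y)

    # Persist the model so the pilot deployment can load it and score new records
    joblib.dump(model, "churn_model.joblib")
    scorer = joblib.load("churn_model.joblib")
    print(scorer.predict(np.array([[2.5, 2.5]])))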
Introduction to Big Data Business Analytics
• Big data analytics is the use of advanced analytic techniques against very large, diverse
data sets that include structured, semi-structured, and unstructured data, from different
sources, and in sizes ranging from terabytes to zettabytes.
Big data analytics helps businesses gain insights from today's huge data resources. People,
organizations, and machines now produce massive amounts of data; social media, cloud
applications, and machine sensor data are just some examples.
• Five Key Types of Big Data Analytics Every Business Analyst Should Know
• Prescriptive Analytics.
• Diagnostic Analytics.
• Descriptive Analytics.
• Predictive Analytics.
• Cyber Analytics.
Big Data vs. Data Analytics
1. Big data refers to large volumes of data that grow rapidly over time; data analytics refers to
the process of analyzing raw data and drawing conclusions from that information.
2. Big data includes structured, unstructured, and semi-structured data; descriptive, diagnostic,
predictive, and prescriptive are the four basic types of data analytics.
3. The purpose of big data is to store huge volumes of data and to process it; the purpose of data
analytics is to analyze the raw data and find insights in the information.
4. Parallel computing and other complex automation tools are used to handle big data; predictive
and statistical modelling with relatively simple tools is used for data analytics.
5. Big data operations are handled by big data professionals; data analytics is performed by
skilled data analysts.
6. Big data analysts need knowledge of programming, NoSQL databases, distributed systems,
and frameworks; data analysts need knowledge of programming, statistics, and mathematics.
7. Big data is mainly found in financial services, media and entertainment, communications,
banking, information technology, and retail; data analytics is mainly used in business for risk
detection and management, science, travel, health care, gaming, energy management, and
information technology.
8. Big data supports dealing with huge volumes of data; data analytics supports examining raw
data and recognizing useful information.
9. Big data is considered the first step, as the data is first generated and stored; data analytics is
considered the second step, as it performs analysis on the large data sets.
10. Some big data tools are Apache Hadoop, Cloudera Distribution for Hadoop, Cassandra,
MongoDB, etc.; some data analytics tools are Tableau Public, Python, Apache Spark, Excel,
RapidMiner, KNIME, etc.
State of the Practice in Analytics: Role of Data Scientists and Key Roles for a Successful
Analytics Project
Roles & Responsibilities of a Data Scientist
• Management: The Data Scientist plays a supporting managerial role, helping to build the
base of forward-looking technical capabilities within the Data and Analytics function in
order to assist various planned and ongoing data analytics projects.
• Analytics: The Data Scientist represents a scientific role where he plans, implements,
and assesses high-level statistical models and strategies for application in the business’s
most complex issues. The Data Scientist develops econometric and statistical models
for various problems including projections, classification, clustering, pattern analysis,
sampling, simulations, and so forth.
• Strategy/Design: The Data Scientist performs a vital role in the advancement of
innovative strategies to understand the business's consumer trends and management, as
well as ways to solve difficult business problems, for instance, the optimization of
product fulfillment and overall profit.
• Collaboration: The role of the Data Scientist is not a solitary one; in this position, he
collaborates with senior data scientists to communicate obstacles and findings to
relevant stakeholders in an effort to enhance business performance and decision-
making.
• Knowledge: The Data Scientist also takes the lead in exploring different technologies
and tools with the vision of creating innovative data-driven insights for the business at
the most agile pace feasible. In this situation, the Data Scientist also takes the initiative
in assessing and utilizing new and enhanced data science methods for the business,
which he delivers to senior management for approval.
• Other Duties: A Data Scientist also performs related tasks and duties as assigned by the
Senior Data Scientist, Head of Data Science, Chief Data Officer, or the Employer.
Data Scientist vs. Data Analyst vs. Data Engineer

Data Scientist
• Focus: the futuristic display of data.
• Work: presents both supervised and unsupervised learning on data, such as regression and
classification of data, neural networks, etc.
• Skills: Python, R, SQL, Pig, SAS, Apache Hadoop, Java, Perl, Spark.

Data Analyst
• Focus: optimization of scenarios, for example how an employee can enhance the company's
product growth.
• Work: formation and cleaning of raw data, and interpretation and visualization of the data to
perform the analysis and the technical summary of the data.
• Skills: Python, R, SQL, SAS.

Data Engineer
• Focus: optimization techniques and the construction of data in a conventional manner; the
purpose of a data engineer is continuously advancing data consumption.
• Work: formation and cleaning of raw data, and interpretation and visualization of the data to
perform the analysis and the technical summary of the data.
• Skills: MapReduce, Hive, Pig, Hadoop techniques.