Data Mining


By: Dr. Anup A. S.

School of Management
D. Y. Patil University, Ambi,
Pune
CHAPTER 001

DATA MINING

Key points:
A. Data Mining Task Primitives:
I. Data
II. Information
III. Knowledge
B. Attribute Types:
I. Nominal
II. Binary
III. Ordinal and Numeric attributes
IV. Discrete versus Continuous Attributes
C. Data Mining Applications
D. Intro to Data Pre-processing
E. Data Cleaning:
I. Missing Values
II. Noisy Data
F. Data Integration:
I. Redundancy & Correlation Analysis
G. Data Reduction:
I. Overview of Data Reduction Strategies

I. Data:

Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, suitable for communication,
interpretation, or processing by humans or electronic machines.

II. Information:

Information is organized or classified data that has meaningful
value for the receiver. Information is the processed data on
which decisions and actions are based.

For a decision to be meaningful, the processed data must have the
following characteristics −

Timely − Information should be available when required.

Accuracy − Information should be accurate.

Completeness − Information should be complete.



III. Knowledge:

Knowledge is the information, understanding, and skills that you
have gained through learning or experience.



B. Attribute Types:
An attribute is a property or characteristic of an object, for example, a person's hair colour or the
air humidity.

A set of attributes defines an object. An object is also referred to as a record, an instance, or an entity.

1. Nominal Attribute:

Nominal attributes provide only enough information to distinguish one object from another,
such as a student's roll number or a person's sex.

2. Binary Attribute:

These take the values 0 and 1, where 0 indicates the absence of a characteristic and 1 indicates its presence.

3. Ordinal Attribute:

The values of an ordinal attribute provide enough information to order objects, such as rankings,
grades, or height categories (e.g., short, medium, tall).

4. Discrete versus Continuous Attributes:

A discrete attribute has a finite or countably infinite set of values, such as the number of courses
taken or a zip code; such values may or may not be represented as integers. A continuous attribute
takes real-number values, such as temperature, height, or weight.


C. Data Mining Applications:

• Data is a set of discrete, objective facts about an event or a process that have little use by
themselves unless converted into information. We collect enormous amounts of data, from simple
numerical measurements and text documents to more complex information such as spatial data,
multimedia channels, and hypertext documents.

• Technically, data mining is the computational process of analyzing data from different
perspectives, dimensions, and angles and categorizing/summarizing it into meaningful information.
Data mining can be applied to any type of data, e.g. data warehouses, transactional databases,
relational databases, multimedia databases, spatial databases, time-series databases, and the World Wide Web.


Data Mining Applications:

1. Scientific Analysis:

Scientific simulations generate huge volumes of data every day. This includes data collected
from nuclear laboratories, data about human psychology, etc.

Examples:

• Sequence analysis in bioinformatics

• Classification of astronomical objects

• Medical decision support



2. Intrusion Detection: A network intrusion is any unauthorized activity on a digital network.
Network intrusions often involve stealing valuable network resources. Data mining techniques play a
vital role in intrusion detection and in spotting network attacks and anomalies.

Examples:

• Detect security violations

• Misuse Detection

• Anomaly Detection

3. Business Transactions:

Every business transaction is recorded and kept for perpetuity. Such transactions are usually
time-related and can be inter-business deals or intra-business operations.

Examples:

• Direct mail targeting

• Stock trading

• Customer segmentation

• Churn prediction (Churn prediction is one of the most popular Big Data use cases in business)

4. Market Basket Analysis:

Market Basket Analysis is a technique that carefully studies the purchases made by customers
in a supermarket. It identifies patterns of items that customers frequently purchase together
(see the sketch after the examples below).

Examples:

• Data mining is used in sales and marketing to provide better customer service, improve
cross-selling opportunities, and increase direct-mail response rates.

• Data mining supports customer retention through pattern identification and prediction of likely
defections.

• Risk assessment and fraud detection also use data mining concepts to identify inappropriate or
unusual behavior.
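
A minimal sketch of how frequent purchase patterns can be mined from basket data. It uses the apriori and association_rules functions from the mlxtend library, which is an assumption here (these notes do not prescribe a tool); the items and transactions are illustrative:

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded basket data: each row is a transaction, each column an item
    baskets = pd.DataFrame({
        "bread":  [1, 1, 0, 1],
        "butter": [1, 1, 0, 0],
        "milk":   [0, 1, 1, 1],
    }).astype(bool)

    # Find itemsets that appear in at least 50% of all transactions
    frequent = apriori(baskets, min_support=0.5, use_colnames=True)

    # Derive rules such as {bread} -> {butter}, kept if confidence >= 0.6
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence"]])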

5. Education:

For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method.
This method generates patterns that can be used by both learners and educators. Using EDM,
we can perform educational tasks such as:

Examples:

• Predicting student admission in higher education

• Student profiling

• Predicting student performance

• Assessing teachers' teaching performance

• Curriculum development

• Predicting student placement opportunities



6. Research:

Data mining techniques can perform prediction, classification, clustering, association, and
grouping of data with great precision in research. The rules generated by data mining offer unique
ways of finding results. In most technical research involving data mining, we create a training model and a testing model.

Examples:

• Classification of uncertain data.

• Information-based clustering.

• Decision support system

• Web Mining

• Domain-driven data mining

• IoT (Internet of Things)and Cybersecurity

• Smart farming IoT(Internet of Things)



7. Healthcare and Insurance:

The pharmaceutical sector can examine its recent sales force activity and its outcomes to
improve the targeting of high-value physicians and to figure out which promotional activities will have the
greatest effect in the coming months. In the insurance sector, data mining can help to
predict which customers will buy new policies, identify behavior patterns of risky customers, and
identify fraudulent behavior of customers.

Example:

• Claims analysis, i.e. which medical procedures are claimed together.

• Identifying successful medical therapies for different illnesses.

• Characterizing patient behavior to predict office visits.



8. Transportation:

A diversified transportation company with a large direct sales force can apply data mining to
identify the best prospects for its services. A large consumer merchandise organization can apply
data mining to improve its sales process to retailers.

Examples

• Determine the distribution schedules among outlets.

• Analyze loading patterns.



9. Financial/Banking Sector:

A credit card company can leverage its vast warehouse of customer transaction data to
identify customers most likely to be interested in a new credit product.

• Credit card fraud detection.

• Identify ‘Loyal’ customers.

• Extraction of information related to customers.

• Determine credit card spending by customer groups.



D. Intro to Data Pre-processing:

What is data preprocessing?

For machine learning, we need data. Lots of it. The more we have, the better our model. Machine
learning algorithms are data-hungry. But there’s a catch. They need data in a specific format.

Why data preprocessing?

Real-world data is often noisy, incomplete with missing entries, and more often than not unsuitable for
direct use in building models or solving complex data-related problems. There might be erroneous
data, or the data might be unordered, unstructured, and unformatted.

Data pre-processing steps

Data pre-processing involves several stages or steps. All the steps are listed below –

1. Data Collection

2. Data import

3. Data Inspection

4. Data Encoding

5. Data interpolation

6. Data splitting into train and test sets

7. Feature scaling

1. Data Collection:

Data collection is the stage in which we gather data from various sources. Data might be lying
across several stores or several servers, and we need to collect all of it in one single
location for ease of access.

2. Data import:

Data import is the process of importing data into software such as R or Python for data cleaning
purposes. Sometimes the data is so huge that we have to take special care when importing it into the
processing server/software. Tools like pandas, dask, NumPy, and matplotlib are handy when operating on
such huge volumes of data.

Pandas

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on
top of the Python programming language.

NumPy

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and
matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
Installing and importing both libraries is shown in the sketch below.
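
A minimal sketch of installing and importing these libraries and reading a dataset; the file name data.csv is a placeholder:

    # Install once from the command line:
    #   pip install pandas numpy

    import pandas as pd   # data loading and manipulation
    import numpy as np    # multi-dimensional arrays and math functions

    # Read a dataset into a pandas DataFrame ("data.csv" is a placeholder)
    df = pd.read_csv("data.csv")
    print(df.head())      # preview the first five rows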

3. Data Inspection:

After the data is imported, it is inspected for missing values, and several sanity checks are done to
ensure the consistency of the data. Domain knowledge comes in handy in such scenarios.

Checking for missing data

• To check for missing data, we look for rows and columns that have null or no data.

• If any such scenarios are found, we have to make decisions based on the scenario and intuition.

• Again, domain knowledge comes in handy in deciding the importance of certain columns.

• If a column has more than 40 percent of its data missing, it is considered good practice to discard
the column completely (see the sketch below).
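
A minimal sketch of these checks with pandas; data.csv is a placeholder and the threshold follows the 40-percent rule of thumb above:

    import pandas as pd

    df = pd.read_csv("data.csv")   # placeholder file name

    # Count the missing (null) values in each column
    print(df.isnull().sum())

    # Discard columns with more than 40 percent of their data missing
    missing_frac = df.isnull().mean()                # fraction of nulls per column
    df = df.drop(columns=missing_frac[missing_frac > 0.4].index)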

4. Data Encoding

Data is in general of two types: quantitative and qualitative.

Quantitative data deals with numbers and things we can measure:

• Dimensions (height, width, and length)

• Temperature

• Humidity

• Prices

• Area and volume
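
Qualitative (categorical) data, by contrast, consists of labels such as colours, city names, or grades, and must be encoded as numbers before most machine learning algorithms can use it. A minimal sketch of one-hot encoding with pandas; the column names and values are illustrative:

    import pandas as pd

    # Illustrative data with one qualitative (categorical) column
    df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune"],
                       "price": [120.0, 250.0, 130.0]})

    # One-hot encode the qualitative column into binary indicator columns
    encoded = pd.get_dummies(df, columns=["city"])
    print(encoded)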



5. Data Interpolation:

• Interpolation is the process of using known data values to estimate unknown data values. Various
interpolation techniques are often used in the atmospheric sciences. One of the simplest methods,
linear interpolation, requires knowledge of two points and the constant rate of change between them.

• Data interpolation is used to fill in missing values in columns that have cells with missing data.

• Many different strategies can be used for interpolation; the most prominent are mean (average)
interpolation, kNN interpolation, etc. (see the sketch below).
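
A minimal sketch of mean and kNN interpolation of missing values using scikit-learn's imputers; the array values are illustrative:

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    # Illustrative data with missing entries marked as np.nan
    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

    # Mean interpolation: replace each missing value with its column's mean
    mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

    # kNN interpolation: estimate each missing value from the nearest rows
    knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
    print(mean_filled, knn_filled, sep="\n")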

6. Data splitting into train and test sets:

• Before being fed into machine learning algorithms, data is divided into training and validation (test) sets.

• The sklearn library in Python provides a special function, train_test_split, for this. We can specify the
percentage of data we want as the test set, and the function divides the given data into train and test sets.

• It returns four values: the training independent variables, the testing independent variables,
the training dependent variable, and the testing dependent variable (see the sketch below).
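
A minimal sketch with scikit-learn; the arrays are illustrative:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Illustrative data: X = independent variables, y = dependent variable
    X = np.arange(20).reshape(10, 2)
    y = np.arange(10)

    # Hold out 20 percent of the rows as the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)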

7. Feature Scaling:

• Feature scaling is the standardization of the data. It is done so that no independent variable
carries more weight than any other simply because of its scale.

• All columns are standardized individually so that they have the same mean and variance. This is the
last step in data preprocessing.

• In Python this can be done with sklearn's StandardScaler, as sketched below.
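
A minimal sketch; the feature matrix is illustrative:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Illustrative feature matrix with columns on very different scales
    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

    # Standardize each column to zero mean and unit variance
    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # ~0 and ~1 per column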



E. Data Cleaning:

I. Missing Values:

A missing value can signify a number of different things. Perhaps the field was not applicable,
the event did not happen, or the data was not available. It could be that the person who entered the
data did not know the right value, or did not care whether the field was filled in.

II. Noisy data:

Noisy data is data with a large amount of additional meaningless information, called noise.
This includes data corruption, and the term is often used as a synonym for corrupt data. It also
includes any data that a user's system cannot understand and interpret correctly. Many systems, for
example, cannot use unstructured text. Noisy data can adversely affect the results of any data analysis
and skew conclusions if not handled properly. Statistical analysis is sometimes used to weed the noise
out of noisy data (see the sketch below).
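
One classic statistical smoothing technique (an addition here for illustration, not prescribed by these notes) is binning: sorted values are partitioned into equal-size bins and each value is replaced by its bin's mean. A minimal sketch; the values are illustrative:

    import numpy as np

    # Illustrative noisy values, already sorted
    values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Smoothing by bin means: partition into bins of 4 values each
    bins = values.reshape(-1, 4).astype(float)
    smoothed = np.repeat(bins.mean(axis=1), 4)   # each value -> its bin's mean
    print(smoothed)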

F. Data Integration:
I. Redundancy & Correlation analysis:
What is Data Redundancy?

During data integration in data mining, various data stores are used, which can lead to redundancy
in the data. An attribute (a column or feature of a data set) is called redundant if it can be derived
from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also
lead to redundancies in the data set.

Data redundancy refers to the duplication of data in a computer system. This duplication can occur at
various levels, such as at the hardware or software level, and can be intentional or unintentional. The
main purpose of data redundancy is to provide a backup copy of data in case the primary copy is lost
or becomes corrupted. This can help to ensure the availability and integrity of the data in the event of a
failure or other problem.
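
Attribute redundancies like these can be detected by correlation analysis, which measures how strongly one attribute implies another (for example, the Pearson correlation coefficient for numeric attributes, or a chi-square test for nominal ones). A minimal sketch with pandas; the data is illustrative:

    import pandas as pd

    # Illustrative data: "total_price" is derivable from "qty" and "unit_price"
    df = pd.DataFrame({"qty":         [1, 2, 3, 4],
                       "unit_price":  [10.0, 10.0, 12.0, 12.0],
                       "total_price": [10.0, 20.0, 36.0, 48.0]})

    # Pairwise Pearson correlations; values near +1 or -1 suggest redundancy
    print(df.corr())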

Advantages of data redundancy include:

1. Increased data availability and reliability, as there are multiple copies of the data that can be used
in case the primary copy is lost or becomes unavailable.

2. Improved data integrity, as multiple copies of the data can be compared to detect and correct
errors.

3. Increased fault tolerance, as the system can continue to function even if one copy of the data is lost
or corrupted.

Disadvantages of data redundancy include:

1. Increased storage requirements, as multiple copies of the data must be maintained.

2. Increased complexity of the system, as managing multiple copies of the data can be difficult and
time-consuming.

3. Increased risk of data inconsistencies, as multiple copies of the data may become out of sync if
updates are not properly propagated to all copies.

4. Reduced performance, as the system may have to perform additional work to maintain and access
multiple copies of the data.

G. Data Reduction:

Data reduction aims to obtain a condensed description of the original data that is much smaller
in volume but preserves the quality of the original data.

Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving
the most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant
information.

There are several different data reduction techniques that can be used in data mining, including:

• Data Sampling: This technique involves selecting a subset of the data to work with, rather than using the
entire dataset. This can be useful for reducing the size of a dataset while still preserving the overall trends
and patterns in the data.

• Dimensionality Reduction: This technique involves reducing the number of features in the dataset, either by
removing features that are not relevant or by combining multiple features into a single feature.

• Data Compression: This technique involves using techniques such as lossy or lossless compression to
reduce the size of a dataset.

• Data Discretization: This technique involves converting continuous data into discrete data by partitioning
the range of possible values into intervals or bins.

• Feature Selection: This technique involves selecting a subset of features from the dataset that are most
relevant to the task at hand.

• It's important to note that data reduction involves a trade-off between accuracy and the size of the
data: the more the data is reduced, the more information may be lost, which can make the resulting
models less accurate and less generalizable. Two of these strategies are sketched below.
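
A minimal sketch of two of these strategies, data sampling and dimensionality reduction (here via PCA from scikit-learn); the dataset is randomly generated for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))       # illustrative data: 1000 rows, 10 features

    # Data sampling: keep a random 10 percent subset of the rows
    sample = X[rng.choice(len(X), size=100, replace=False)]

    # Dimensionality reduction: project the 10 features onto 3 components
    X_reduced = PCA(n_components=3).fit_transform(X)
    print(sample.shape, X_reduced.shape)  # (100, 10) (1000, 3)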

Thank You!
