
UNIT-III EXPLORATORY DATA ANALYSIS

Data analysis involves different processes of cleaning, transforming, and analyzing data, and building models to extract specific, relevant insights. These insights are beneficial for making important business decisions in real-time situations. Exploratory Data Analysis is important for any business: it lets data scientists analyze the data before reaching any conclusion, and it ensures that the results produced are valid and applicable to business outcomes and goals. This article provides in-depth information on the techniques and learning models used in exploratory data analysis in Data Science.

What is Exploratory Data Analysis in Data Science?

Exploratory Data Analysis (EDA) is one of the techniques used for extracting the vital features and trends that machine learning and deep learning models in Data Science rely on. EDA has thus become an important milestone for anyone working in data science. This article covers the concept, meaning, tools, and techniques of EDA to give complete awareness to a beginner wanting to launch a career in data science. The article also lists the fields that regularly apply EDA to promote their business activities.

Importance of EDA in Data Science

The Data Science field is now very important in the business world, as it provides many opportunities to make vital business decisions by analyzing the huge volumes of gathered data. Understanding the data thoroughly requires exploring it from every aspect. Impactful features enable making meaningful and beneficial decisions; therefore, EDA occupies an invaluable place in Data Science.

Objective of Exploratory Data Analysis

The overall objective of exploratory data analysis is to obtain vital insights and hence usually
includes the following sub-objectives:

● Identifying and removing data outliers
● Identifying trends in time and space
● Uncovering patterns related to the target
● Creating hypotheses and testing them through experiments
● Identifying new sources of data

Role of EDA in Data Science

The role of exploratory data analysis is based on achieving the objectives listed above. After the data is formatted, the analysis performed indicates patterns and trends that help in taking the proper actions required to meet the expected goals of the business. Just as we expect specific tasks to be done by an executive in a particular job position, proper EDA is expected to fully answer the queries behind a particular business decision. As data science involves building models for prediction, those models require the optimum data features. EDA thus ensures that the correct ingredients, in the form of patterns and trends, are made available for training the model to achieve the correct outcome, like a successful recipe. Therefore, carrying out the right EDA with the correct tool on befitting data will help achieve the expected goal.

Steps Involved in Exploratory Data Analysis (EDA)

The key components in an EDA are the main steps undertaken to perform the EDA. These
are as follows:

1. Data Collection

Nowadays, data is generated in huge volumes and various forms belonging to every sector of
human life, like healthcare, sports, manufacturing, tourism, and so on. Every business knows
the importance of using data beneficially by properly analyzing it. However, this depends on
collecting the required data from various sources through surveys, social media, and customer
reviews, to name a few. Without collecting sufficient and relevant data, further activities
cannot begin.

2. Finding all Variables and Understanding Them

When the analysis process starts, the first focus is on the available data, which provides a great deal of information. This information contains the changing values of various features or characteristics, which helps in understanding them and gaining valuable insights. This step requires first identifying the important variables that affect the outcome, along with their possible impact, and it is crucial for the final result expected from any analysis.

3. Cleaning the Dataset

The next step is to clean the data set, which may contain null values and irrelevant information. These are removed so that the data contains only values that are relevant and important from the target point of view. This not only saves time but also reduces the computational power required for estimation. Preprocessing takes care of issues such as null values, outliers, and anomaly detection.
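As a minimal sketch of this step, assuming a pandas DataFrame loaded from a hypothetical data.csv file with a hypothetical target column (both names are illustrative, not from the original text), cleaning might look like this:

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("data.csv")

# Inspect missing values per column before deciding how to handle them.
print(df.isnull().sum())

# Drop rows where the (assumed) target column is missing.
df = df.dropna(subset=["target"])

# Fill remaining numeric gaps with the column median, one common simple choice.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Remove exact duplicate rows, another frequent source of irrelevant data.
df = df.drop_duplicates()
```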

4. Identify Correlated Variables

Finding the correlation between variables helps to know how a particular variable is related to another. The correlation matrix method gives a clear picture of how the different variables correlate, which further helps in understanding the vital relationships among them.
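As a minimal sketch, a correlation matrix can be computed with pandas; the example below assumes the Auto MPG dataset that ships with seaborn (the same dataset referenced later in this document):

```python
import seaborn as sns

# Load the Auto MPG dataset bundled with seaborn (downloaded on first use).
df = sns.load_dataset("mpg")

# Correlate only the numerical columns; values near +1 or -1 indicate a
# strong linear relationship, values near 0 indicate little or none.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```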

5. Choosing the Right Statistical Methods

As will be seen in later sections, different statistical tools are employed depending on whether the data is categorical or numerical, its size, the type of variables, and the purpose of the analysis. Statistical formulae applied to numerical outputs give fair information, but graphical visuals are more appealing and easier to interpret.

6. Visualizing and Analyzing Results


Once the analysis is over, the findings must be observed cautiously and carefully so that a proper interpretation can be made. The trends in the spread of the data and the correlations between variables give good insights for making suitable changes in the data parameters. The data analyst should have the requisite capability to analyze the data and be well-versed in all analysis techniques. The results obtained will be appropriate to the data of that particular domain, whether retail, healthcare, or agriculture.

Aspiring data science professionals must understand and practice the above steps to master exploratory data analysis.

Types of Exploratory Data Analysis

There are two main types of EDA:

1. Univariate
2. Multivariate

Bivariate analysis, which involves exactly two variables, is often treated as its own intermediate case. In univariate analysis, the output is a single variable, and all the data collected is for it; there is no cause-and-effect relationship at all. For example, data showing the products produced each month for twelve months. In bivariate analysis, the outcome depends on two variables examined together, e.g., an employee's age analyzed against his monthly salary. In multivariate analysis, more than two variables are involved, e.g., the type and quantity of product sold against the product price, advertising expenses, and discounts offered. The variables analyzed can be numerical or categorical, and the result of the analysis can be represented in numerical values, visualizations, or graphical form. Accordingly, the techniques can be further classified as non-graphical or graphical.

1. Univariate Non-Graphical

It is the simplest of all types of data analysis used in practice. As the name suggests, 'uni' means that only one variable is considered, whose data (referred to as the population) is compiled and studied. The main aim of univariate non-graphical EDA is to find out the details of the distribution of the population data and to estimate some specific statistical parameters. The significant parameters estimated from a distribution point of view are as follows:

● Central Tendency: This term refers to values located at the data's central position or middle zone. The three generally estimated parameters of central tendency are the mean, median, and mode. The mean is the average of all values in the data, the mode is the value that occurs the maximum number of times, and the median is the middle value, with an equal number of observations to its left and right.
● Range: The range is the difference between the maximum and minimum values in the data, indicating how far the data extends from the central value on the higher and lower sides.
● Variance and Standard Deviation: Two more useful parameters are the variance and standard deviation. The variance is the most commonly used measure of dispersion and indicates the spread of all data points in a data set: it is the mean squared difference between each data point and the mean, while the standard deviation is its square root. The larger the standard deviation, the farther the spread of the data, while a low value indicates more values clustering near the mean (a short sketch of these parameters follows this list).
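A short sketch of these parameters using pandas; the numbers are dummy values chosen only for illustration:

```python
import pandas as pd

# Dummy values used only to illustrate the parameters described above.
s = pd.Series([12, 15, 15, 18, 21, 24, 30])

print("Mean:", s.mean())            # average of all values
print("Median:", s.median())        # middle value
print("Mode:", s.mode().tolist())   # most frequent value(s)
print("Range:", s.max() - s.min())  # maximum minus minimum
print("Variance:", s.var())         # mean squared deviation (sample, n - 1)
print("Std dev:", s.std())          # square root of the variance
```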

2. Univariate Graphical

The graphs in this section are based on the Auto MPG dataset available in the UCI repository.
Some common types of univariate graphics are:

● Stem-and-leaf Plots: This is a very simple but powerful EDA method used to display quantitative data in a shortened format. It displays the values in the data set, keeping each observation intact but separating it into a stem (the leading digits) and a leaf (the remaining or trailing digits). However, the histogram is mostly used in its place now.
● Histograms (Bar Charts): These plots are used to display both grouped and ungrouped data. Values of the variable are plotted on the x-axis, and the number of observations (frequencies) on the y-axis. The histogram is the simplest fundamental graph: a bar plot in which each bar represents the frequency, i.e., the count or proportion (the ratio of the count to the total count of occurrences), of values falling in each interval. Histograms make it very quick to understand characteristics of the data such as central tendency, dispersion, and outliers.
There are many variations of such bar charts, a few of which are listed below:

1. Simple Bar Charts: These are used to represent categorical variables with rectangular bars, where the different lengths correspond to the values of the variables.
2. Multiple or Grouped Bar Charts: Grouped bar charts represent multiple sets of data items for comparison, where a single color is used to denote one specific series in the dataset.
3. Percentage Bar Charts: These are bar graphs that depict the data in the form of percentages for each observation.

● Box Plots: These are used to display the distribution of a quantitative variable in the data. If the data set contains categorical variables, the plots can show the comparison between their levels, and if outliers are present in the data, they can be easily identified. These graphs are very useful for showing comparisons in quartiles, i.e., the 25%, 50%, and 75% values (see the sketch after this list).
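As a minimal sketch of a histogram and a box plot, assuming the Auto MPG dataset mentioned above (the mpg and origin column names come from that dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Auto MPG data, as referenced earlier in this section.
df = sns.load_dataset("mpg")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: frequencies of mpg values, showing central tendency and spread.
axes[0].hist(df["mpg"].dropna(), bins=20)
axes[0].set(title="Histogram of MPG", xlabel="mpg", ylabel="frequency")

# Box plot: quartiles of mpg per origin; points beyond the whiskers are outliers.
sns.boxplot(data=df, x="origin", y="mpg", ax=axes[1])
axes[1].set(title="MPG by origin")

plt.tight_layout()
plt.show()
```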

3. Multivariate Non-Graphical

The multivariate non-graphical exploratory data analysis technique is usually used to show the connection between two or more variables with the help of either cross-tabulation or statistics.

● For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share an equivalent pair of levels.
● For one categorical variable and one quantitative variable, we can generate statistical information for the quantitative variable separately for every level of the categorical variable. We then compare the statistics across the levels of the categorical variable.
4. Multivariate Graphical

Graphics are used in multivariate graphical EDA to show the relationships between two or more variables. Here the outcome depends on more than two variables, while the change-causing variables can also be multiple.

Some common types of multivariate graphics include:

A) Scatter Plot

The essential graphical EDA technique for two quantitative variables is the scatter plot: one variable appears on the x-axis and the other on the y-axis, with one point for every case in your dataset. This can be used for bivariate analysis.
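A minimal scatter plot sketch, again assuming the Auto MPG dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("mpg")

# One quantitative variable per axis; each point is one car in the dataset.
plt.scatter(df["horsepower"], df["mpg"], alpha=0.6)
plt.xlabel("horsepower")
plt.ylabel("mpg")
plt.title("Scatter plot: horsepower vs. mpg")
plt.show()
```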

B) Multivariate Chart

A multivariate chart is a type of control chart used to monitor two or more interrelated process variables. This is beneficial in situations such as process control, where engineers are likely to benefit from using multivariate charts. These charts allow multiple parameters to be monitored together in a single chart, and a notable advantage of using them is that they help minimize the total number of control charts needed for organizational processes. Pair plots generated using the Seaborn library are a good example of multivariate charts, as they help visualize the relationships between all the numerical variables in the entire dataset at once.
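A pair plot sketch with seaborn, assuming the Auto MPG dataset and a hand-picked subset of its numerical columns:

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("mpg")

# One scatter plot per pair of columns, with histograms on the diagonal.
sns.pairplot(df[["mpg", "horsepower", "weight", "acceleration"]].dropna())
plt.show()
```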

C) Run Chart

A run chart is a line chart of data drawn over time. In other words, a run chart visually illustrates process performance or data values in a time sequence. Seeing data across time yields a more accurate conclusion than summary statistics alone. A trend chart or time series plot is another name for a run chart. The sketch below plots dummy sales values over a period of time.
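A minimal run chart sketch; the monthly sales figures below are dummy values invented for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Dummy monthly sales values, purely for illustration.
sales = pd.Series(
    [110, 120, 115, 130, 145, 140, 155, 160, 150, 170, 180, 175],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# Plotting the values in time order shows the trend that summary
# statistics alone would hide.
plt.plot(sales.index, sales.values, marker="o")
plt.xlabel("month")
plt.ylabel("sales")
plt.title("Run chart of dummy sales values")
plt.show()
```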

D) Bubble Chart

Bubble charts are scatter plots that display multiple circles (bubbles) in a two-dimensional plot. They are used to assess the relationships between three or more numeric variables. In a bubble chart, every dot corresponds to one data point, and the values of the variables for each point are encoded by horizontal position, vertical position, dot size, and dot color.
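A bubble chart sketch on the Auto MPG data, using weight for bubble size and model year for color (the size-scaling factor is an arbitrary choice for readability):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("mpg").dropna()

# x and y carry two variables; bubble size encodes a third (vehicle weight)
# and color a fourth (model year).
plt.scatter(df["horsepower"], df["mpg"],
            s=df["weight"] / 25,   # arbitrary scale factor for sensible sizes
            c=df["model_year"], alpha=0.5)
plt.colorbar(label="model year")
plt.xlabel("horsepower")
plt.ylabel("mpg")
plt.title("Bubble chart: horsepower vs. mpg")
plt.show()
```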

E) Heat Map
A heat map is a colored graphical representation of multivariate data structured as a matrix of columns and rows. The heat map transforms the correlation matrix into color coding and represents the coefficients visually so that the strength of the correlation among variables is easy to see. It assists in finding the best features for building accurate Machine Learning models.
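A heat map sketch of the correlation matrix, assuming the same Auto MPG data:

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("mpg")

# Color-code the correlation matrix; annot prints each coefficient in its cell.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map (Auto MPG)")
plt.show()
```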

Apart from the above, there is also the ‘Classification or Clustering analysis’ technique used in EDA. It is an unsupervised type of machine learning used to classify input data into categories or clusters that exhibit similar characteristics within each group. This can be further used to draw important interpretations in EDA.

Exploratory Data Analysis Tools

1. Python

Python is used for different tasks in EDA, such as finding missing values in the collected data, describing the data, handling outliers, and obtaining insights through charts. The syntax of EDA libraries like Matplotlib, Pandas, Seaborn, NumPy, and Altair in Python is fairly simple and easy for beginners to use. You can also find many open-source Python packages, such as D-Tale, AutoViz, and PandasProfiling, that can automate the entire exploratory data analysis process and save time.

2. R

The R programming language is a regularly used option among data scientists and statisticians for making statistical observations and analyzing data, i.e., performing detailed EDA. Like Python, R is an open-source programming language suitable for statistical computing and graphics. Apart from commonly used libraries like ggplot, Leaflet, and Lattice, there are several powerful R libraries for automated EDA, such as DataExplorer, SmartEDA, and GGally.

3. MATLAB

MATLAB is a well-known commercial tool among engineers because of its very strong mathematical calculation abilities. It is therefore possible to use MATLAB for EDA, but doing so requires some basic knowledge of the MATLAB programming language.

Advantages of Using EDA

Here are a few advantages of using Exploratory Data Analysis -

1. Gain Insights into Underlying Trends and Patterns

EDA assists data analysts in identifying crucial trends quickly through data visualizations
using various graphs, such as box plots and histograms. Businesses also expect to make some
unexpected discoveries in the data while performing EDA, which can help improve certain
existing business strategies.

2. Improved Understanding of Variables


Data analysts can significantly improve their comprehension of the many factors related to the dataset. Using EDA, they can extract various information, such as averages, minimums, and maximums, which is required for preprocessing the data appropriately.

3. Better Preprocess Data to Save Time

EDA can assist data analysts in identifying significant mistakes, abnormalities, or missing values in the existing dataset. Handling these issues before beginning a full study is critical for any organization, as it ensures correct preprocessing of the data and may save a significant amount of time by avoiding mistakes later when applying machine learning models.

4. Make Data-driven Decisions

The most significant advantage of employing EDA in an organization is that it helps businesses improve their understanding of data. With EDA, they can use the available tools to extract critical insights and draw conclusions, which assists in making decisions based on the insights from the EDA.

Statistics:
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics
Descriptive statistics are brief informational coefficients that summarize a given data set, which can represent either the entire population or a sample of a population. Put differently, descriptive statistics is a means of describing the features of a data set by generating summaries about data samples; it is often depicted as a summary that explains the contents of the data. For example, a population census may include descriptive statistics regarding the ratio of men and women in a specific city. In short, descriptive statistics help describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation, variance, minimum and maximum values, kurtosis, and skewness (see the sketch below).
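As a short sketch, pandas bundles most of these summaries into one call; the test scores below are dummy values:

```python
import pandas as pd

# Dummy sample of test scores, for illustration only.
scores = pd.Series([55, 62, 65, 65, 70, 74, 78, 81, 90, 95])

# describe() reports count, mean, std, min, quartiles, and max together.
print(scores.describe())

# The shape measures mentioned above: skewness and (excess) kurtosis.
print("Skewness:", scores.skew())
print("Kurtosis:", scores.kurt())
```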

Types of Descriptive Statistics

All descriptive statistics are either measures of central tendency or measures of variability,
also known as measures of dispersion.

1. Central Tendency (Mean, Median, Mode)

Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus on the dispersion of data. These two measures use graphs, tables, and general discussions to help people understand the meaning of the analyzed data. Measures of central tendency describe the center position of a distribution for a data set: a person analyzes the frequency of each data point in the distribution and describes it using the mean, median, or mode, which measure the most common patterns of the analyzed data set.

2.Measures of Variability

Measures of variability (or measures of spread) aid in analyzing how dispersed the distribution is for a set of data. For example, while the measures of central tendency may give a person the average of a data set, they do not describe how the data is distributed within the set.

So while the average of the data may be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. The range, quartiles, absolute deviation, and variance are all examples of measures of variability.

Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which
is calculated by subtracting the lowest number (5) in the data set from the highest (100).
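A minimal sketch computing these measures of variability on the same data set with pandas:

```python
import pandas as pd

# The data set from the example above.
data = pd.Series([5, 19, 24, 62, 91, 100])

print("Range:", data.max() - data.min())                 # 100 - 5 = 95
print("Quartiles:", data.quantile([0.25, 0.5, 0.75]).tolist())
print("Mean absolute deviation:", (data - data.mean()).abs().mean())
print("Variance:", data.var())                           # sample variance (n - 1)
```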

3.Distribution

Distribution (or frequency distribution) refers to the number of times a data point occurs or, alternatively, a measurement of how often a data point fails to occur. Consider a data set: male, male, female, female, female, other. The distribution of this data can be summarized as follows (see the sketch after this list):

● The number of males in the data set is 2.
● The number of females in the data set is 3.
● The number of individuals identifying as other is 1.
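A one-line sketch of the same frequency distribution with pandas:

```python
import pandas as pd

# The categorical data set from the example above.
data = pd.Series(["male", "male", "female", "female", "female", "other"])

# value_counts() returns the frequency distribution directly:
# female 3, male 2, other 1.
print(data.value_counts())
```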

Inferential Statistics

Inferential statistics helps us study a sample of data and draw conclusions about its population. In inferential statistics, predictions are made by taking any group of data in which you are interested: a random sample is taken from a population and used to describe and make inferences about that population. Any group of data that includes all of the data you are interested in is known as a population. Inferential statistics basically allows you to make predictions by taking a small sample instead of working on the whole population.
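As a minimal sketch of inference from a sample, the example below estimates a confidence interval for a population mean with scipy; the sample values are dummy data:

```python
import numpy as np
from scipy import stats

# Hypothetical sample assumed to be drawn at random from a larger population.
sample = np.array([48, 52, 55, 47, 51, 53, 50, 49, 54, 52])

# A 95% confidence interval for the population mean, based only on the sample.
ci = stats.t.interval(0.95, len(sample) - 1,
                      loc=np.mean(sample), scale=stats.sem(sample))
print("Estimated population mean lies in:", ci)
```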

Descriptive Statistics vs. Inferential Statistics

1. Descriptive statistics gives information about raw data, describing the data in some manner. Inferential statistics makes inferences about the population using data drawn from the population.
2. Descriptive statistics helps in organizing, analyzing, and presenting data in a meaningful manner. Inferential statistics allows us to compare data and make hypotheses and predictions.
3. Descriptive statistics is used to describe a situation. Inferential statistics is used to explain the chance of occurrence of an event.
4. Descriptive statistics explains already known data and is limited to a sample or population of small size. Inferential statistics attempts to reach conclusions about the population.
5. Descriptive statistics can be achieved with the help of charts, graphs, tables, etc. Inferential statistics can be achieved by probability.

Descriptive statistics have a different function than inferential statistics, which use data sets to make decisions or to apply characteristics from one data set to another.

Imagine another example where a company sells hot sauce. The company gathers data such
as the count of sales, average quantity purchased per transaction, and average sale per day of
the week. All of this information is descriptive, as it tells a story of what actually happened in
the past. In this case, it is not being used beyond being informational.

Let's say the same company wants to roll out a new hot sauce. It gathers the same sales data above, but it crafts the information to make predictions about what the sales of the new hot sauce will be. The act of taking descriptive statistics and applying their characteristics to a different data set makes the analysis inferential. We are no longer simply summarizing data; we are using it to predict what will happen.

Benefits of descriptive statistics

One of the main benefits of using descriptive statistics is that they can simplify and organize
large amounts of data into a few numbers or graphs. This can make it easier to grasp the main
features and patterns of your data, as well as identify any outliers or errors. Descriptive
statistics can also help you compare different groups or variables within your data, such as by
using frequency tables, bar charts, or box plots. Descriptive statistics can provide a useful
overview and foundation for further analysis of your data.

Limitations of descriptive statistics

One of the main limitations of using descriptive statistics is that they cannot tell you anything
about the relationships, causes, or effects of your data. Descriptive statistics only describe
what the data is, not why it is that way or what it means. For example, you can use
descriptive statistics to calculate the mean and standard deviation of test scores, but you
cannot use them to infer whether the test was easy or hard, or whether the scores were
influenced by other factors. To answer these questions, you need to use inferential statistics,
which test hypotheses and estimate probabilities based on your data.

What is Hypothesis Generation?

Hypothesis generation is an educated “guess” about the various factors impacting the business problem that needs to be solved using machine learning. In framing a hypothesis, the data scientist must not yet know, based on any evidence, the outcome of the hypothesis that has been generated. A hypothesis may be simply defined as a guess; a scientific hypothesis is an intelligent guess.
Hypothesis Testing:

Hypothesis testing is a formal procedure for investigating our ideas about the world
using statistics. It is most often used by scientists to test specific predictions, called
hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

1. State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
2. Collect data in a way designed to test the hypothesis.
3. Perform an appropriate statistical test.
4. Decide whether to reject or fail to reject your null hypothesis.
5. Present the findings in your results and discussion section.
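A minimal sketch of steps 3 and 4 using a one-sample t-test from scipy; the sample values, the hypothesized mean of 50, and the 5% significance level are all assumptions made for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample; H0: the population mean equals 50.
sample = np.array([52, 49, 55, 51, 53, 50, 54, 48, 56, 52])

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Step 4: reject H0 at the chosen significance level if p < alpha.
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```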

Null Hypothesis and Alternate Hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no
bearing on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the
alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.

The null hypothesis is typically an equality hypothesis between population parameters; for example, a null hypothesis may claim that the population mean return equals zero. The alternate hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, they are mutually exclusive, and only one can be correct. One of the two possibilities, however, will always be correct.

What is Clustering?

It is basically a type of unsupervised learning method: a method in which we draw inferences from datasets consisting of input data without labeled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them. Or:

Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group."

It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, and behavior, and divides the data according to the presence and absence of those patterns.

What are the types of Clustering Methods?

Clustering itself can be categorized into two types viz. Hard Clustering and Soft Clustering.
In hard clustering, one data point can belong to one cluster only. But in soft clustering, the
output provided is a probability likelihood of a data point belonging to each of the
pre-defined numbers of clusters.

The clustering technique is commonly used for statistical data analysis.

Types of Clustering Methods

The clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft clustering (a data point can belong to more than one group). Beyond this division, various other approaches to clustering exist. Below are the main clustering methods used in Machine Learning:

1. Partitioning Clustering (Centroid Clustering)
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering(Centroid Clustering)

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method, and the most common example of partitioning clustering is the K-Means Clustering algorithm. In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. The cluster centers are created in such a way that each data point is closer to its own cluster centroid than to any other cluster centroid.
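A minimal K-Means sketch with scikit-learn on dummy two-dimensional points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Dummy 2-D points drawn around two separated centers, for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# n_clusters is the pre-defined number of groups (K) described above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster centers:\n", kmeans.cluster_centers_)
print("First ten labels:", kmeans.labels_[:10])
```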
Density-Based Clustering

In this method, the clusters are created based upon the density of the data points in the data space. Regions that become dense due to a huge number of data points residing in them are considered clusters, while data points in sparse regions (regions with very few data points) are considered noise or outliers. The clusters created by these methods can be of arbitrary shape. The following are examples of density-based clustering algorithms:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups data points together based on a distance metric and a criterion for a minimum number of data points. It can discover clusters of different shapes and sizes from a large amount of data containing noise and outliers. It takes two parameters, eps and minimum points: eps indicates how close data points should be to be considered neighbors, and the minimum-points criterion must be met for a region to be considered dense.
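A minimal DBSCAN sketch with scikit-learn; the eps and min_samples values are arbitrary choices for this toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: arbitrarily shaped clusters that
# centroid-based methods typically miss.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: minimum points for a dense region.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# The label -1 marks points classified as noise/outliers.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int((db.labels_ == -1).sum()))
```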

OPTICS (Ordering Points to Identify Clustering Structure)

OPTICS follows a similar process to DBSCAN but overcomes one of its drawbacks: the inability to form clusters from data of varying density. It considers two additional parameters, the core distance and the reachability distance. The core distance indicates whether the data point being considered is a core point, by setting a minimum value for it. The reachability distance is the maximum of the core distance and the value of the distance metric between two data points. One thing to note about the reachability distance is that it is undefined when the reference point is not a core point.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)

HDBSCAN is a density-based clustering method that extends the DBSCAN methodology by converting it into a hierarchical clustering algorithm.

Distribution Model-Based Clustering:

In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution. An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
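A minimal Gaussian Mixture Model sketch with scikit-learn, fitted by Expectation-Maximization on dummy data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Dummy data drawn from two Gaussian distributions, for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1.5, (100, 2))])

# EM fits the mixture; predict assigns each point to its most likely component.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
print("Component means:\n", gmm.means_)
```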

Hierarchical Clustering:

Hierarchical Clustering groups (agglomerative, also called the bottom-up approach) or divides (divisive, also called the top-down approach) the clusters based on distance metrics.

In agglomerative clustering, each data point initially acts as its own cluster, and the clusters are then merged one by one. This is among the most sought-after clustering methods.

Divisive clustering is the opposite of agglomerative: it starts with all the points in one cluster and divides them to create more clusters. These algorithms create a distance matrix of all the existing clusters and perform the linkage between clusters depending on the linkage criterion. The clustering of the data points is represented using a dendrogram. There are different types of linkage:
o Single Linkage: In single linkage, the distance between two clusters is the shortest distance between points in those two clusters.

o Complete Linkage: In complete linkage, the distance between two clusters is the farthest distance between points in those two clusters.

o Average Linkage: In average linkage, the distance between two clusters is the average distance of every point in one cluster to every point in the other cluster (see the sketch after this list).
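A minimal agglomerative sketch with scipy on dummy points; the "average" method matches the average linkage described above, and "single" or "complete" can be swapped in:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Dummy points around two centers, for illustration.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Bottom-up (agglomerative) merging using average linkage.
Z = linkage(X, method="average")

# The dendrogram shows the order and distance of the merges.
dendrogram(Z)
plt.title("Dendrogram (average linkage)")
plt.show()
```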


Fuzzy Clustering

In fuzzy clustering, the assignment of data points to clusters is not decisive: one data point can belong to more than one cluster. The outcome is provided as the probability of the data point belonging to each of the clusters. One of the algorithms used in fuzzy clustering is Fuzzy C-Means clustering.

This algorithm is similar in approach to K-Means clustering but differs in the parameters involved in the computation, such as the fuzzifier and the membership values. In this type of clustering method, each data point can belong to more than one cluster, and the technique allocates membership values to each point for each cluster center based on the distance between the cluster center and the data point.
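As a sketch of Fuzzy C-Means, assuming the third-party scikit-fuzzy package (imported as skfuzzy), which expects data with shape (features, samples); the data and the fuzzifier m=2 are illustrative choices:

```python
import numpy as np
import skfuzzy as fuzz

# Dummy 2-D data; skfuzzy expects shape (features, samples), hence the .T.
rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))]).T

# m=2 is the fuzzifier; u holds the membership of every point in every cluster.
cntr, u, _, _, _, _, fpc = fuzz.cluster.cmeans(
    data, c=2, m=2, error=0.005, maxiter=1000)

print("Cluster centers:\n", cntr)
print("Memberships of first point:", u[:, 0])  # sums to 1 across clusters
```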

What is association?
Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the result can be more profitable. It tries to find interesting relations or associations among the variables of a dataset.
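As a minimal sketch of association rule mining, assuming the third-party mlxtend package and a tiny invented one-hot transaction table:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Invented one-hot transactions: True if the item appears in the basket.
transactions = pd.DataFrame({
    "bread":  [True, True, False, True, True],
    "butter": [True, True, False, False, True],
    "milk":   [False, True, True, True, True],
})

# Frequent itemsets first, then rules of the form {antecedent} -> {consequent}.
itemsets = apriori(transactions, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```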
