Descriptive Statistics in Data Science

Role of Descriptive Statistics in Data Science
Nikita Singam (20MSM3068) & Kiranbala Nongthombam (20MSM3070)

Department of Mathematics,
University Institute of Sciences,
Chandigarh University, Gharuan,
Mohali, Punjab-140413, India
Abstract
Descriptive statistical analysis is an important part of machine learning
and helps to interpret data. Statistics, a required first stage of any
analysis, is about drawing conclusions from data. This paper discusses the
most important principles of descriptive statistics.
1. Data Science
Data science extracts meaningful information from massive amounts of
structured data or big data. Data science, or data-driven science, brings
together several fields of work in mathematics and computing to analyze data
for decision-making. It draws on various disciplines, including statistics,
scientific methods and data processing, to derive information from data.
2. Statistics
Statistics is the process of gathering and analyzing data from a sample in
order to infer properties of the population that the sample represents. In
other words, statistics organizes data so that predictions can be made about
the population. The two branches of statistics are:
• Descriptive Statistics
• Inferential Statistics
3. Descriptive Statistics
Descriptive statistics are statistics, or measures, that describe the data.
They summarize the data at hand through numbers such as the mean and the
median, so that the data are easier to interpret. They do not involve any
generalization or inference beyond what is available. This means that
descriptive statistics simply represent the available data (the sample) and
are not based on any theory of probability.
4. Inferential Statistics
Inferential statistics are often used to compare differences between
treatment groups. They use the measurements taken on the sample of subjects in
a study to compare the treatment groups and to generalize the findings to the
broader population of subjects.
5. Descriptive statistics
Descriptive statistics describe, present, summarize and organize the data
(population) by means of mathematical formulas, graphs or tables. They are
grouped into:
I. Measures of Central Tendency
II. Measures of Dispersion (or Variability)
6. Measure of Central Tendency

A measure of central tendency describes the data with a single number,
usually one that represents the center of the data. The three kinds of
single-number description are:
a. Mean
The mean is the sum of all observations in the data divided by the total
number of observations. It is also known as the average. The mean is
therefore a number around which the whole data set is spread.
b. Median
The median is the point that divides the whole data set into two equal
halves. Half of the data is less than the median, while the other half is
greater. The median is found by first arranging the data in ascending or
descending order: with an odd number of observations it is the middle value,
and with an even number it is the mean of the two middle values.
c. Mode
The mode is the number in the data set that has the highest frequency; in
other words, it is the number that appears the greatest number of times.
A data set may have one mode or more than one.
7. Analysing the Measure of Central Tendency using R

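Taking the vector x = c(14, 7, 7, 1.8, 18, 5, 78, -21, 0, -5), the following
functions are applied.
Analysis    Function
Mean        mean()
Median      median()
Mode        Mode() from library(DescTools); base R mode() returns the
            storage type of an object, not the statistical mode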
▪ Mean
▪ Median
▪ Mode
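As a minimal sketch, the three measures can be computed for the vector x in base R as follows. Base R has no built-in function for the statistical mode, so the helper stat_mode() below is a hypothetical name introduced only for this illustration.

```r
# Example vector from the text
x <- c(14, 7, 7, 1.8, 18, 5, 78, -21, 0, -5)

# Mean: sum of the observations divided by their count
x_mean <- mean(x)

# Median: middle value of the sorted data (the average of the two middle
# values here, because the number of observations is even)
x_median <- median(x)

# stat_mode() is a hypothetical helper, not part of base R:
# it returns the value(s) with the highest frequency
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
x_mode <- stat_mode(x)

print(c(mean = x_mean, median = x_median, mode = x_mode))
```

For this vector the mean is 10.48, the median is 6 and the mode is 7.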
8. Measures of Dispersion (Variability)
Measures of dispersion describe how the data are spread around the central
value (the measure of central tendency).
a. Absolute Deviation from Mean
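The Absolute Deviation from Mean, also called the Mean Absolute Deviation
(MAD), describes the variation in the data set as the average absolute
distance of each data point from the mean. It is calculated as
$$ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| $$
where x_i are the observations, \bar{x} is their mean and n is the number of
observations.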
b. Variance
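Variance measures how far the data points are spread out from the mean. A
high variance means that the data points are widely scattered, and a low
variance means that they lie close to the mean of the data set. For a sample
it is calculated as
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
where \bar{x} is the mean and n is the number of observations. This is the
form computed by var() in R; the population variance divides by n instead of
n - 1.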
c. Standard Deviation
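The standard deviation is the square root of the variance. For a sample it
is calculated as
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
where \bar{x} is the mean and n is the number of observations.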
d. Range
The range is the difference between the maximum value and the minimum value
in the data set. It is given as
$$ \mathrm{Range} = \max_i x_i - \min_i x_i $$
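To make the four definitions above concrete, the following base R sketch computes each of them directly from its standard formula. The vectors tight and spread are made-up illustration data, not taken from the text, and the helper names are hypothetical. The variance and standard deviation are assumed to be the sample versions (dividing by n - 1). Both vectors have the same mean, so the difference in the results comes only from how far the values lie from it.

```r
# Two made-up vectors with the same mean (10) but different spread
tight  <- c(9, 10, 10, 11)
spread <- c(0, 5, 15, 20)

# Each measure written out from its definition (hypothetical helper names)
mean_abs_dev <- function(v) mean(abs(v - mean(v)))
sample_var   <- function(v) sum((v - mean(v))^2) / (length(v) - 1)
sample_sd    <- function(v) sqrt(sample_var(v))
value_range  <- function(v) max(v) - min(v)

summarise_spread <- function(v) {
  c(mad = mean_abs_dev(v), variance = sample_var(v),
    sd = sample_sd(v), range = value_range(v))
}

print(rbind(tight = summarise_spread(tight),
            spread = summarise_spread(spread)))

# The hand-written variance agrees with R's built-in var(),
# which also divides by n - 1
stopifnot(isTRUE(all.equal(sample_var(spread), var(spread))))
```

Every measure is larger for spread than for tight, which matches the reading of a high variance as widely scattered data.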
e. Skewness
Skewness measures the asymmetry of a probability distribution. It can be
positive, negative or undefined.
• Positive Skew: the tail on the right side of the curve is longer than
that on the left side. The mean is typically greater than the mode for
these distributions.
• Negative Skew: the tail on the left side of the curve is longer than that
on the right side. The mean is typically smaller than the mode for these
distributions.
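As a rough illustration of the sign of the skew, the sketch below computes a moment-based skewness coefficient in base R for the example vector x used in the R sections of this paper. The helper name moment_skewness() is hypothetical, and the formula used (the third central moment divided by the second central moment raised to the power 3/2) is one common definition. Packages may apply small-sample corrections, so their values can differ slightly.

```r
x <- c(14, 7, 7, 1.8, 18, 5, 78, -21, 0, -5)

# Hypothetical helper: moment-based skewness m3 / m2^(3/2)
moment_skewness <- function(v) {
  d  <- v - mean(v)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3/2)
}

skew_x <- moment_skewness(x)
print(skew_x)

# A positive value indicates a longer right tail; here it is driven by
# the single large observation 78, which also pulls the mean (10.48)
# above the median (6)
print(skew_x > 0)

# Mirroring the data flips the sign of the skew
print(moment_skewness(-x))
```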
f. Kurtosis
Kurtosis describes whether the data are light-tailed (lack of outliers) or
heavy-tailed (outliers present) compared with a normal distribution. There
are three types of kurtosis:
• Mesokurtic: the excess kurtosis (kurtosis minus 3) is zero, as for the
normal distribution.
• Leptokurtic: the tails of the distribution are heavy (outliers present)
and the kurtosis is greater than that of the normal distribution.
• Platykurtic: the tails of the distribution are light (no outliers) and
the kurtosis is smaller than that of the normal distribution.
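The three cases above can be sketched in base R with a moment-based excess kurtosis. Both helper names and the tolerance tol are assumptions made for this illustration. Real data almost never give exactly zero, so some band around zero has to be chosen before a sample is called mesokurtic, and the value 0.5 used here is arbitrary.

```r
# Hypothetical helper: excess kurtosis m4 / m2^2 - 3
# (the normal distribution has kurtosis 3, i.e. excess kurtosis 0)
excess_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hypothetical helper: label a sample by its excess kurtosis;
# tol is an arbitrary band around zero treated as "close to normal"
kurtosis_type <- function(v, tol = 0.5) {
  k <- excess_kurtosis(v)
  if (abs(k) <= tol) {
    "mesokurtic"
  } else if (k > 0) {
    "leptokurtic"
  } else {
    "platykurtic"
  }
}

x    <- c(14, 7, 7, 1.8, 18, 5, 78, -21, 0, -5)  # contains outliers
flat <- 1:10                                      # evenly spread values

print(kurtosis_type(x))     # heavy tails
print(kurtosis_type(flat))  # light tails
```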
9. Analysing the Measures of Dispersion (Variability) using R

Taking the vector x = c(14, 7, 7, 1.8, 18, 5, 78, -21, 0, -5), the following
functions are applied.
Analysis                       Function     Library Used
Absolute Deviation from Mean   MeanAD()     library(DescTools)
Variance                       var()        base R
Standard Deviation             sd()         base R
Range                          range()      base R
Skewness                       skewness()   library(e1071)
Kurtosis                       kurtosis()   library(e1071)
Note that range() returns the minimum and the maximum value; diff(range(x))
gives their difference.
• Absolute Deviation from Mean

• Variance
• Standard Deviation
• Range
• Skewness
• Kurtosis
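A minimal sketch of these computations for x, assuming only base R for the first four measures. The calls to MeanAD() from DescTools and to skewness() and kurtosis() from e1071 are guarded with requireNamespace(), so the script still runs when those packages are not installed.

```r
x <- c(14, 7, 7, 1.8, 18, 5, 78, -21, 0, -5)

# Absolute deviation from the mean, written out in base R
abs_dev <- mean(abs(x - mean(x)))

# Sample variance and standard deviation (both divide by n - 1)
x_var <- var(x)
x_sd  <- sd(x)

# range() returns c(min, max); diff() turns that into max - min
x_range <- diff(range(x))

print(c(abs_dev = abs_dev, variance = x_var, sd = x_sd, range = x_range))

# Same absolute deviation via DescTools, if the package is installed
if (requireNamespace("DescTools", quietly = TRUE)) {
  print(DescTools::MeanAD(x))
}

# Skewness and kurtosis via e1071, if the package is installed
if (requireNamespace("e1071", quietly = TRUE)) {
  print(e1071::skewness(x))
  print(e1071::kurtosis(x))
}
```

For this vector the absolute deviation from the mean is 15.712 and the range is 99; the variance is about 677.55 and the standard deviation about 26.03.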
Conclusion
Descriptive statistics allow analysts to describe, present or summarize
information in a meaningful way. They do not allow conclusions to be drawn
beyond the data examined, or about any hypotheses that may have been made.
Descriptive statistics thus present the data in a more meaningful form, which
makes the data simpler to interpret. They provide information only about the
immediate group of data.
