Statistics PDF
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In
other words, it is the mathematical discipline concerned with collecting and summarizing data; it can also
be described as a branch of applied mathematics. Two important and basic ideas are involved in statistics:
uncertainty and variation. The uncertainty and variation in different fields can be determined only through
statistical analysis, and these uncertainties are in turn quantified by probability, which plays an important
role in statistics.
Statistics is used in business for: appraisal of value, consumer surveys, hiring decisions, insurance,
manufacturing, online business, real estate investing, rental housing, sales, and stock markets. Data
analysis, regression, forecasting, hypothesis testing, and more are used in these fields.
Of course, statistics is a tool that serves several purposes. It can give you insight into business operations,
help you examine what went well (or what went wrong), and make predictions about the future.
Business statistics is a method of using statistics to gain valuable information from the data available to a
company. Various statistical techniques and principles are applied to gain insights that support better
decisions. It works with numerical data collected from various sources: surveys, experiments, or other
information systems within the company. It helps organizations understand the reasons for events in the
present and predict the future. It can be used in marketing, production planning, human resource planning,
finance, etc.
Business statistics is a sub-field within the broader field of statistics. It consists of creating and
interpreting numerical data from various sources, such as surveys, experiments, or business
information systems, and it touches many aspects of business and society. It is a subject that combines
statistical probability with logic, making it possible to put data to practical use. It is widely regarded as
one of the more demanding subjects to understand and study.
Companies benefit from seeing patterns in their activity, and this can be done with business statistics.
By looking at past sales patterns, an organization can predict sales volumes in various situations. Using
this technique, one can determine whether the company’s business proposition is viable, something that
affects the performance of the whole company. Businesses can also find out whether a particular
marketing campaign has helped to attract more customers, which will help them plan future campaigns
better.
Business statistics is the foundation for business analytics.
To understand ways to optimise a team’s performance, the company must first know its present
productivity levels and weak areas. This information can be gathered from data already generated by
previous projects. But simply collecting data will not give management any idea of how to improve
performance; this is where business statistics helps. The data must be analysed and interpreted using
statistical methods. Though software programmes can do this, a human mind is needed to understand the
significance of the analysis and take the necessary action.
Types of Statistics
On a broader scale, statistics is classified into Descriptive Statistics and Inferential Statistics.
Descriptive Statistics
It involves the collection, organization, and presentation of data in such a way that it is easy to understand
and interpret. Descriptive statistics is used to answer questions such as: what is the highest or lowest
value, the middle value (median), or the spread of the data (range and variance)?
It helps us better understand the data we are looking at by summarizing the important information, such as
the highest and lowest values, the middle value (median), and how spread out the data is (range and
variance). These pieces of information help us draw conclusions about the group we are studying. A key
idea here is central tendency; the three most common measures of central tendency are the mean, median,
and mode.
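As a minimal sketch, assuming some made-up data values, these summaries can be computed with Python’s standard statistics module:

```python
# A minimal sketch of descriptive statistics using Python's standard library.
# The data values are made up for illustration.
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 5, 8]

print("Mean:    ", statistics.mean(data))       # arithmetic average
print("Median:  ", statistics.median(data))     # middle value
print("Mode:    ", statistics.mode(data))       # most frequent value
print("Range:   ", max(data) - min(data))       # highest minus lowest
print("Variance:", statistics.pvariance(data))  # population variance
```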
Inferential Statistics
Inferential statistics is the branch of statistics that makes inferences (or predictions) about a population
based on a sample drawn from it. It involves hypothesis testing, the process of using statistical methods
to determine whether a hypothesis about the population is likely to be true. Inferential statistics is widely
used in scientific and market research and in the social sciences to make predictions, test hypotheses, and
make decisions based on a solid understanding of the data. It also helps to minimize errors and biases in
the results.
The difference between descriptive and inferential statistics can be drawn clearly on the following
grounds:
1. Descriptive statistics is concerned with describing the data under study, while inferential statistics
focuses on drawing conclusions about the population on the basis of sample analysis and
observation.
2. Descriptive statistics collects, organises, analyses and presents data in a meaningful way. By
contrast, inferential statistics compares data, tests hypotheses and makes predictions about future
outcomes.
3. In descriptive statistics the final result is presented diagrammatically or in tables, whereas in
inferential statistics the final result is expressed as a probability.
4. Descriptive statistics describes a situation, while inferential statistics estimates the likelihood of
the occurrence of an event.
5. Descriptive statistics explains data that is already known, in order to summarise a sample.
Conversely, inferential statistics attempts to reach conclusions about the population that extend
beyond the data available.
What is a variable?
In programming, a variable is a value that can change, depending on conditions or on information passed
to the program. Typically, a program consists of instructions that tell the computer what to do and data
that the program uses while it is running. The data consists of constants, or fixed values that never change,
and variable values (which are usually initialized to 0 or some default value because the actual values
will be supplied by the program’s user). Usually, both constants and variables are defined as a certain
data type. Each data type prescribes and limits the form of the data. Examples of data types include an
integer expressed as a decimal number, or a string of text characters, usually limited in length.
A “constant” simply means a fixed value or a value that does not change. A constant has a known
value.
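As a small sketch of this distinction, here is how a variable and a constant might look in Python (Python has no enforced constants; the uppercase name is a convention, and the values are invented):

```python
# Sketch of a constant vs. a variable in Python.
TAX_RATE = 0.15          # "constant": a fixed, known value (by convention)

total = 0                # variable, initialized to a default of 0
for price in [10.0, 24.5, 7.25]:   # values that would come from the user
    total += price * (1 + TAX_RATE)
print(total)
```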
A frequency distribution is a representation, either in a graphical or tabular format, that displays the
number of observations within a given interval. The frequency is how often a value occurs in an interval
while the distribution is the pattern of frequency of the variable.
The interval size depends on the data being analyzed and the goals of the analyst. The intervals must
be mutually exclusive and exhaustive. Frequency distributions are typically used within a statistical
context. Generally, frequency distributions can be associated with the charting of a normal distribution.
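As a minimal sketch, a frequency distribution can be built with the standard library’s collections.Counter; the exam scores and the interval width of 10 are assumptions for illustration:

```python
# Frequency distribution sketch: made-up exam scores grouped into
# mutually exclusive, exhaustive intervals of width 10.
from collections import Counter

scores = [52, 67, 71, 58, 66, 75, 81, 69, 73, 64, 90, 77]

# Map each score to the lower bound of its 10-point interval, e.g. 67 -> 60.
intervals = [(s // 10) * 10 for s in scores]
freq = Counter(intervals)

for lower in sorted(freq):
    print(f"{lower}-{lower + 9}: {freq[lower]}")
```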
Cumulative Frequency
Cumulative frequency is the total of a frequency and all frequencies in a frequency distribution until a
certain defined class interval. Cumulative frequency is used to determine the number of observations that
lie above (or below) a particular value in a data set. The cumulative frequency is calculated using a
frequency distribution table, which can be constructed from stem and leaf plots or directly from the data.
The cumulative frequency is calculated by adding each frequency from a frequency distribution table to
the sum of its predecessors. The last value will always be equal to the total for all observations, since all
frequencies will already have been added to the previous total.
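A short sketch of this running total, assuming a hypothetical frequency table, using itertools.accumulate:

```python
# Cumulative frequency: each entry is the frequency plus all its predecessors.
from itertools import accumulate

classes = ["0-9", "10-19", "20-29", "30-39"]
freqs = [3, 7, 6, 4]  # hypothetical frequencies per class interval

for cls, cum in zip(classes, accumulate(freqs)):
    print(cls, cum)
# The last cumulative value (20) equals the total number of observations.
```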
What is relative frequency distribution?
A related distribution is known as a relative frequency distribution, which shows the relative frequency of
each value in a dataset as a percentage of all frequencies.
A relative frequency distribution shows the proportion of the total number of observations associated with
each value or class of values and is related to a probability distribution, which is extensively used in
statistics.
For example, consider a frequency table of pets per household covering 400 households in total. To find
the relative frequency of each value in the distribution, we simply divide each individual frequency by 400:
Relative frequency distributions are useful because they allow us to understand how common a value is in
a dataset relative to all other values.
In the previous example we saw that 150 households had just one pet. But this number by itself isn’t
particularly useful.
Instead, knowing that 37.5% of all households in the sample had just one pet is more useful to know.
This helps us understand that a little more than 1 in 3 households had just one pet, which gives us some
perspective on how “common” it is to own just one pet.
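As a sketch of this calculation: the full frequency table is not reproduced here, so the counts below are assumed for illustration, chosen to be consistent with the stated 400 total households and 150 one-pet households:

```python
# Relative frequencies for the household/pet example described above.
pet_counts = {0: 120, 1: 150, 2: 90, 3: 40}  # pets per household -> frequency

total = sum(pet_counts.values())  # 400
for pets, freq in pet_counts.items():
    print(f"{pets} pet(s): {freq / total:.1%}")
# 1 pet -> 150 / 400 = 37.5%, matching the text.
```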
What is a pie chart?
A pie chart, sometimes called a circle chart, is a way of summarizing a set of nominal data or displaying
the different values of a given variable (e.g. percentage distribution). This type of chart is a circle divided
into a series of segments. Each segment represents a particular category. The area of each segment is the
same proportion of a circle as the category is of the total data set.
A pie chart usually shows the component parts of a whole. Sometimes you will see a segment of the
drawing separated from the rest of the pie in order to emphasize an important piece of information; this is
called an exploded pie chart.
In other words, a pie chart is a type of graph that visually displays data in a circular chart, divided into
sectors that each show a particular part of the data out of the whole.
Bar graphs represent data using rectangular bars of uniform width along with equal spacing between
the rectangular bars.
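A quick sketch of both chart types, assuming matplotlib is available and using invented category counts:

```python
# Pie chart and bar graph sketch with matplotlib (categories are made up).
import matplotlib.pyplot as plt

labels = ["Light", "Regular", "Dark"]
counts = [70, 60, 70]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
# explode=... pulls the first slice out: an "exploded" pie chart.
ax1.pie(counts, labels=labels, autopct="%1.1f%%", explode=(0.1, 0, 0))
ax2.bar(labels, counts, width=0.6)  # uniform bar width, equal spacing
plt.show()
```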
What is sampling?
In statistics, quality assurance, and survey methodology, sampling is the selection of a subset or
a statistical sample (termed sample for short) of individuals from within a statistical population to
estimate characteristics of the whole population. Statisticians attempt to collect samples that are
representative of the population. Sampling has lower costs and faster data collection compared
to recording data from the entire population, and thus, it can provide insights in cases where it is
infeasible to measure an entire population.
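As a minimal sketch of simple random sampling, assuming a stand-in population of 1,000 units:

```python
# Simple random sampling: draw 10 members from a hypothetical population
# of 1,000, without replacement, each member equally likely to be chosen.
import random

population = list(range(1000))          # stand-in for a real population
sample = random.sample(population, 10)
print(sample)
```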
What is Levels of measurement?
Levels of measurement also called scales of measurement, tell you how precisely variables are
recorded. In scientific research, a variable is anything that can take on different values across
your data set (e.g., height or test scores).
There are 4 levels of measurement: nominal, ordinal, interval, and ratio.
Depending on the level of measurement of the variable, what you can do to analyze your data
may be limited. There is a hierarchy in the complexity and precision of the level of
measurement, from low (nominal) to high (ratio).
Level of measurement is important, as it determines the type of statistical analysis you can carry
out. As a result, it affects both the nature and the depth of insights you’re able to glean from
your data.
Certain statistical tests can only be performed where more precise levels of measurement have
been used, so it’s essential to plan in advance how you’ll gather and measure your data.
What are the four levels of measurement? Nominal, ordinal, interval, and ratio scales
explained
There are four types of measurement (or scales) to be aware of: nominal, ordinal, interval,
and ratio.
Each scale builds on the previous, meaning that each scale not only “ticks the same boxes” as
the one before it, but also adds something new.
Let’s go through each in turn to give you an idea of what they are, and how they interact.
Nominal
The nominal scale simply categorizes variables according to qualitative labels (or names).
These labels and groupings don’t have any order or hierarchy to them, nor do they convey any
numerical value.
For example, the variable “hair color” could be measured on a nominal scale according to the
following categories: blonde hair, brown hair, gray hair, and so on.
Ordinal
The ordinal scale also categorizes variables into labeled groups, and these categories have an
order or hierarchy to them.
For example, you could measure the variable “income” on an ordinal scale as follows:
• low income
• medium income
• high income
Another example is the variable “education level”:
• high school
• master’s degree
• doctorate
These are still qualitative labels (as with the nominal scale), but you can see that they follow a
hierarchical order.
Interval
The interval scale is a numerical scale which labels and orders variables, with a known, evenly
spaced interval between each of the values. A commonly cited example is temperature in degrees
Fahrenheit: the difference between 10 and 20 degrees Fahrenheit is exactly the same as the difference
between, say, 50 and 60 degrees.
Ratio
The ratio scale is exactly the same as the interval scale, with one key difference: the ratio
scale also has a true zero point, which means that ratios of values are meaningful (for example,
a weight of 20 kg is twice as heavy as one of 10 kg).
What is Mean?
Mean is an essential concept in mathematics and statistics. The mean is the average of a collection of
numbers: a central value obtained by summing the values and dividing by their count.
In statistics, it is a measure of central tendency of a probability distribution, alongside the median and
mode. It is also referred to as an expected value.
It is a statistical concept that carries a major significance in finance. The concept is used in various
financial fields, including but not limited to portfolio management and business valuation.
There are different ways of measuring the central tendency of a set of values, and there are multiple ways
to calculate the mean itself. The two most popular are the arithmetic mean and the geometric mean
(discussed later in this section).
The arithmetic mean is the sum of all values in a collection of numbers divided by the count of numbers
in the collection. It is calculated in the following way:
Arithmetic Mean = ∑X/N, where N is the number of values in the collection.
In finance, the arithmetic mean may be misleading in the calculations of returns, as it does not consider
the effects of volatility and compounding, producing an inflated value for the central point of the
distribution.
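As a sketch of that caveat, consider a hypothetical investment that returns +50% one year and −50% the next; the arithmetic mean suggests a 0% average return, but the compound result is a loss:

```python
# Arithmetic mean vs. actual compound growth for hypothetical returns.
returns = [0.50, -0.50]

arithmetic = sum(returns) / len(returns)
print(arithmetic)  # 0.0, i.e. a misleading 0% "average" return

growth = 1.0
for r in returns:
    growth *= 1 + r
print(growth - 1)  # -0.25: the portfolio actually lost 25%
```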
Mode Definition in Statistics
A mode is defined as the value that has the highest frequency in a given set of values. It is the value that
appears the most often.
Example: In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5 since it has appeared in the
set twice.
Statistics deals with the presentation, collection and analysis of data and information for a particular
purpose. We use tables, graphs, pie charts, bar graphs, pictorial representations, and so on to present data.
After the proper organization of the data, it must be further analyzed to infer helpful information.
For this purpose, frequently in statistics, we tend to represent a set of data by a representative value that
roughly defines the entire data collection. This representative value is known as the measure of central
tendency. By the name itself, it suggests that it is a value around which the data is centred. These
measures of central tendency allow us to create a statistical summary of the vast, organized data. One
such measure of central tendency is the mode of data.
Weighted Mean is an average computed by giving different weights to some of the individual values. If all
the weights are equal, then the weighted mean is the same as the arithmetic mean.
It represents the average of the given data. The weighted mean is similar to the arithmetic mean or
sample mean, but it is calculated when the data values do not all contribute equally, i.e. when each value
carries its own weight.
Weighted means generally behave in a similar fashion to arithmetic means, though they do have a few
counterintuitive properties. Data elements with a high weight contribute more to the weighted mean than
elements with a low weight.
The weights cannot be negative. Some may be zero, but not all of them, since division by zero is not
allowed. Weighted means play an important role in systems of data analysis and in weighted differential
and integral calculus.
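A minimal sketch of the weighted mean, assuming made-up course grades weighted by credit hours:

```python
# Weighted mean: each grade contributes in proportion to its weight.
grades = [85, 90, 78]
credits = [3, 4, 2]  # weights; none negative, not all zero

weighted_mean = sum(g * w for g, w in zip(grades, credits)) / sum(credits)
print(weighted_mean)  # (85*3 + 90*4 + 78*2) / 9 ≈ 85.7
```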
The geometric mean is a measure of central tendency that averages a set of products. Its formula takes the
nth root of the product of n numbers.
Like the arithmetic mean, the geometric mean finds the center of a dataset. While the arithmetic mean
finds the center by summing the values and dividing by the number of observations, the geometric mean
finds the center by multiplying and then taking a root of the product.
Based on the calculation methods, the arithmetic mean is the better statistic when adding data is
appropriate, while the geometric mean is better when you need to multiply the data.
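As a sketch, Python’s standard library (3.8+) provides statistics.geometric_mean; the growth factors here are assumed values for three periods:

```python
# Geometric mean: the nth root of the product of n numbers.
import statistics

factors = [1.10, 0.95, 1.20]  # e.g. +10%, -5%, +20% growth per period
print(statistics.geometric_mean(factors))  # cube root of the product
```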
Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data, i.e. to know how
homogeneous or heterogeneous the data is. In simple terms, they show how squeezed or scattered the
variable is.
There are two main types of dispersion methods in statistics: absolute measures of dispersion and relative
measures of dispersion.
An absolute measure of dispersion is expressed in the same unit as the original data set. The absolute
dispersion method expresses variation in terms of the average deviation of the observations, as with the
standard or mean deviation. It includes the range, standard deviation, quartile deviation, etc.
1. Range: It is simply the difference between the maximum value and the minimum value given in
a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
2. Variance: Deduct the mean from each value in the set, square each of the results, add them up,
and finally divide the total by the number of values in the data set to get the variance.
Variance: σ² = ∑(X − μ)²/N
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e.
S.D. = √σ² = σ.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into
quarters. The quartile deviation is half of the distance between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and the arithmetic
mean of the absolute deviations of the observations from a measure of central tendency is
known as the mean deviation (also called mean absolute deviation).
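As a minimal sketch, all five absolute measures listed above can be computed on made-up data with the standard library (statistics.quantiles requires Python 3.8+):

```python
# Absolute measures of dispersion on a small made-up data set.
import statistics

data = [1, 3, 5, 6, 7]
mean = statistics.mean(data)

rng = max(data) - min(data)                        # 1. Range = 6
var = statistics.pvariance(data)                   # 2. Variance
sd = statistics.pstdev(data)                       # 3. Standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)       # 4. Quartiles (cut points)
qd = (q3 - q1) / 2                                 #    Quartile deviation
md = sum(abs(x - mean) for x in data) / len(data)  # 5. Mean deviation

print(rng, var, sd, qd, md)
```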
The relative measures of dispersion are used to compare the distribution of two or more data sets. This
measure compares values without units. Common relative dispersion methods include:
1. Coefficient of Range
2. Coefficient of Variation
3. Coefficient of Standard Deviation
4. Coefficient of Quartile Deviation
5. Coefficient of Mean Deviation
What is range?
The range in statistics for a given data set is the difference between the highest and lowest values. For
example, if the given data set is {2,5,8,10,3}, then the range will be 10 – 2 = 8.
Thus, the range can also be defined as the difference between the highest observation and the lowest
observation, and the result obtained is called the range of the observations. The range in statistics
represents the spread of the observations.
Range Formula
The formula for the range is simply the difference between the highest and lowest values:
Range = Highest value − Lowest value
What Is Variance?
The term variance refers to a statistical measurement of the spread between numbers in a data set. More
specifically, variance measures how far each number in the set is from the mean (average), and thus from
every other number in the set. Variance is often depicted by the symbol σ². It is used by both analysts
and traders to gauge the volatility of a market or security.
The square root of the variance is the standard deviation (SD or σ), which helps determine the consistency
of an investment’s returns over a period of time.
Standard deviation is a measure which shows how much variation (spread or dispersion) from the mean
exists. The standard deviation indicates a “typical” deviation from the mean. It is a popular measure of
variability because it is expressed in the original units of measure of the data set. As with the variance,
if the data points are close to the mean the variation is small, whereas if the data points are highly spread
out from the mean the variation is high. Standard deviation calculates the extent to which the values
differ from the average. Standard deviation, the most widely used measure of dispersion, is based on all
the values; therefore a change in even one value affects the value of the standard deviation. It is
independent of origin but not of scale. It is also useful in certain advanced statistical problems.
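As a small sketch of a related practical point, the standard library distinguishes between treating the data as a whole population and as a sample (the data values are invented):

```python
# Population vs. sample standard deviation on made-up data.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pstdev(data))  # population SD (divides by N): 2.0
print(statistics.stdev(data))   # sample SD (divides by N-1): slightly larger
```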
Absolute Measures of Dispersion
If the dispersion of data within an experiment has to be determined then absolute measures of dispersion
should be used. These measures usually express variations in a data set with respect to the average of the
deviations of the observations. The most commonly used absolute measures of deviation are listed below.
Range: Given a data set, the range can be defined as the difference between the maximum value and the
minimum value.
Variance: The average squared deviation from the mean of the given data set is known as the variance.
This measure of dispersion checks the spread of the data about the mean.
Standard Deviation: The square root of the variance gives the standard deviation. Thus, the standard
deviation also measures the variation of the data about the mean.
Mean Deviation: The mean deviation gives the average of the data's absolute deviation about the central
points. These central points could be the mean, median, or mode.
Quartile Deviation: Quartile deviation can be defined as half of the difference between the third quartile
and the first quartile in a given data set.
In statistics, a percentile is a term that describes how a score compares to other scores from the same set.
While there is no universal definition of percentile, it is commonly expressed as the percentage of values
in a set of data scores that fall below a given value.
Imagine you have the marks of 20 students. Now, try to calculate the 90th percentile.
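A sketch of that calculation, assuming numpy is available and using 20 hypothetical marks:

```python
# 90th percentile of 20 hypothetical student marks.
import numpy as np

marks = np.array([35, 42, 48, 51, 55, 58, 60, 62, 65, 67,
                  70, 72, 74, 77, 80, 83, 86, 90, 94, 98])
print(np.percentile(marks, 90))  # value below which ~90% of marks fall
```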
What is Decile?
Decile, percentile, quartile, and quintile are different types of quantiles in statistics. A quantile refers to a
value that divides the observations in a sample into equal subsections. There will always be 1 lesser
quantile than the number of subsections created.
Decile Formula
The position of the x-th decile in a data set of n ordered values is given by:
D(x) = x(n + 1)/10
Decile Example
Suppose a data set consists of the following numbers: 24, 32, 27, 32, 23, 62, 45, 80, 59, 63, 36, 54, 57, 36,
72, 55, 51, 32, 56, 33, 42, 55, 30. The value of the first two deciles has to be calculated. The steps
required are as follows:
• Step 1: Arrange the data in increasing order. This gives 23, 24, 27, 30, 32, 32, 32, 33, 36, 36, 42,
45, 51, 54, 55, 55, 56, 57, 59, 62, 63, 72, 80.
• Step 2: Identify the total number of points. Here, n = 23
• Step 3: Apply the decile formula to calculate the position of the required data point. D(1) =
(n + 1)/10 = 24/10 = 2.4. This implies the value of the 2.4th data point has to be determined. This will lie
between the scores in the 2nd and 3rd positions. In other words, the 2.4th data point is 0.4 of the way
between the scores 24 and 27.
• Step 4: The value of the decile can be determined as [lower score + (distance) × (higher score − lower
score)]. This is given as 24 + 0.4 × (27 − 24) = 25.2.
• Step 5: Apply steps 3 and 4 to determine the rest of the deciles. D(2) = 2(n + 1)/10 = 4.8, i.e. the 4.8th
data point, which lies between the 4th and 5th values. Thus, 30 + 0.8 × (32 − 30) = 31.6.
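A minimal sketch that reproduces the steps above in code (the helper function decile is written here for illustration):

```python
# Decile by linear interpolation at position x(n+1)/10 in the sorted data.
def decile(sorted_data, x):
    pos = x * (len(sorted_data) + 1) / 10     # e.g. D(1): 24/10 = 2.4
    lower = int(pos)                          # 1-based position of lower score
    frac = pos - lower
    lo, hi = sorted_data[lower - 1], sorted_data[lower]
    return lo + frac * (hi - lo)

data = sorted([24, 32, 27, 32, 23, 62, 45, 80, 59, 63, 36, 54,
               57, 36, 72, 55, 51, 32, 56, 33, 42, 55, 30])
print(decile(data, 1))  # 25.2, matching Step 4
print(decile(data, 2))  # 31.6, matching Step 5
```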
What Is a Quartile?
A quartile is a statistical term that describes a division of observations into four defined intervals based on
the values of the data and how they compare to the entire set of observations. Quartiles are organized into
lower quartiles, median quartiles, and upper quartiles.
When the data points are arranged in increasing order, the data are divided into four sections of 25% of
the data each.
There are three quartile values—a lower quartile, median, and upper quartile—which divide the data set
into four ranges, each containing 25% of the data points:
• First quartile: The set of data points between the minimum value and the first quartile.
• Second quartile: The set of data points between the lower quartile and the median.
• Third quartile: The set of data between the median and the upper quartile.
• Fourth quartile: The set of data points between the upper quartile and the maximum value of the
data set.
Quartile manual calculation requires more effort as there are formulas involved. Using the same values as
in the spreadsheet example:
• 59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98
The quartile positions are given by Q1 = (n + 1)/4, Q2 = (n + 1)/2 and Q3 = 3(n + 1)/4, where n is the
number of values in your dataset and each result is the position of that quartile in the sorted dataset. So:
Here, with n = 19, we have the Q1 (fifth) value of 68, the Q2 (tenth, and the median) value of 75, and the
Q3 (fifteenth) value of 84. The results may differ slightly from spreadsheet results because spreadsheets
calculate quartiles differently.
The difference between the upper and lower quartile is known as the interquartile range. The formula for
the interquartile range is:
IQR = Q3 − Q1
where Q1 is the first quartile and Q3 is the third quartile of the series.
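A short sketch verifying the quartile positions and the IQR for the 19-value dataset above:

```python
# Quartiles via the (n+1)/4 position rule described in the text.
data = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
        76, 77, 81, 82, 84, 87, 90, 95, 98]
n = len(data)  # 19

q1 = data[(n + 1) // 4 - 1]        # 5th value  -> 68
q2 = data[(n + 1) // 2 - 1]        # 10th value -> 75 (the median)
q3 = data[3 * (n + 1) // 4 - 1]    # 15th value -> 84
print(q1, q2, q3, "IQR =", q3 - q1)  # IQR = 84 - 68 = 16
```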
Semi Interquartile Range
The semi-interquartile range is a measure of dispersion, defined as half of the interquartile range. It is
computed as one half the difference between the 75th percentile (Q3) and the 25th percentile (Q1). The
formula for the semi-interquartile range is:
Semi-interquartile range = (Q3 − Q1)/2
The median is the middle value of the distribution of the given data. The interquartile range (IQR) is the
range of values that resides in the middle of the scores. When a distribution is skewed, and the median is
used instead of the mean to show a central tendency, the appropriate measure of variability is the
Interquartile range.
Note that Q2 is the median.
It is a measure of dispersion based on the lower and upper quartiles. Quartile deviation is obtained by
dividing the interquartile range by 2, hence it is also known as the semi-interquartile range.
Question:
Determine the interquartile range value for the first ten prime numbers.
Solution:
The first ten prime numbers are: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. Now we split the data into two parts:
the lower half, to find Q1, and the upper half, to find Q3.
Q1 part: 2, 3, 5, 7, 11, so Q1 (the median of the lower half) = 5.
Q3 part: 13, 17, 19, 23, 29, so Q3 (the median of the upper half) = 19.
Therefore, IQR = Q3 − Q1 = 19 − 5 = 14.
What is Probability?
Probability denotes the possibility of the outcome of any random event. The meaning of this term is to
check the extent to which any event is likely to happen. For example, when we flip a coin in the air, what
is the possibility of getting a head? The answer to this question is based on the number of possible
outcomes. Here the possibility is either head or tail will be the outcome. So, the probability of a head to
come as a result is 1/2.
Probability is the measure of the likelihood of an event happening; it measures the certainty of the
event. The formula for probability is given by:
P(E) = Number of Favourable Outcomes/Number of total outcomes
P(E) = n(E)/n(S)
Here, n(E) is the number of outcomes favourable to the event E and n(S) is the total number of outcomes
in the sample space S.
Types of Probability
Depending on the nature of the outcome or the method used to calculate the chance of an event occurring,
several views or types of probabilities may exist. There are four major different types of probabilities:
• Classical Probability
• Axiomatic Probability
• Subjective Probability
• Empirical Probability
Classical Probability
Classical probability, also known as theoretical probability, states that if there are B equally likely
outcomes in an experiment and event X comprises exactly A of them, then the probability of X is A/B.
Typical examples involve tossing a coin or rolling dice: the possible outcomes (heads or tails, or the six
faces of the die) can be listed in advance and are all equally likely.
Axiomatic Probability
A series of rules or axioms established by Kolmogorov applies to all kinds of axiomatic probability. The
probability of each event occurring or not occurring can be calculated using these axioms, which can be
stated as:
1. The probability of any event is a non-negative real number: P(E) ≥ 0.
2. The probability of the entire sample space is 1: P(S) = 1.
3. For mutually exclusive events A and B, P(A ∪ B) = P(A) + P(B).
Subjective Probability
Subjective probability takes into account a person’s personal belief on the probability of an event
occurring. For example, a fan’s opinion on the probability of a specific side winning a football match is
based on their personal beliefs and feelings rather than a rigorous quantitative calculation.
Empirical Probability
Empirical (or experimental) probability is based on what actually happens when an experiment is carried
out: it is the ratio of the number of times an event occurs to the total number of trials. For example, if you
toss a coin ten times and keep track of which outcome occurs each time, the proportion of heads observed
is the empirical probability of heads.
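A minimal sketch of an empirical estimate, simulating coin flips and counting the relative frequency of heads:

```python
# Empirical probability: estimate P(head) from simulated trials.
import random

trials = 10_000
heads = sum(random.choice(["H", "T"]) == "H" for _ in range(trials))
print(heads / trials)  # approaches the classical value 1/2 as trials grow
```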
What is intersection?
The intersection of sets A and B is the set of all elements which are common to both A and B.
Suppose A is the set of even numbers less than 10 and B is the set of the first five multiples of 4, then the
intersection of these two can be identified as given below:
A = {2, 4, 6, 8}
B = {4, 8, 12, 16, 20}
A ∩ B = {4, 8}
What is union?
Union of two or more sets is the set containing all the elements of the given sets. Union of sets can be
written using the symbol “⋃”. Suppose the union of two sets X and Y can be represented as X ⋃ Y.
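A quick sketch verifying both operations on the sets from the example above:

```python
# Set intersection and union for the example sets.
A = {2, 4, 6, 8}          # even numbers less than 10
B = {4, 8, 12, 16, 20}    # first five multiples of 4

print(A & B)  # intersection: {8, 4}
print(A | B)  # union: {2, 4, 6, 8, 12, 16, 20}
```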
In probability theory and logic, a set of events is jointly or collectively exhaustive if at least one of the
events must occur. For example, when rolling a six-sided die, the events 1, 2, 3, 4, 5, and 6, each
consisting of a single outcome, are collectively exhaustive, because they encompass the entire range of
possible outcomes.
Sometimes a small change can make a set that is not collectively exhaustive into one that is. A random
integer generated by a computer may be greater than or less than 5, but those are not collectively
exhaustive options. Changing one option to “greater than or equal to five” or adding five as an option
makes the set collectively exhaustive.
Independent events are those events whose occurrence does not depend on any other event. For
example, if we flip a coin and get heads, and then flip it again and get tails, the outcome of the second
flip does not depend on the outcome of the first: the two events are independent of each other. This is
one of the types of events in probability. Let us look at the complete definition of independent events,
along with examples and how they differ from mutually exclusive events.
EXAMPLE
Suppose a coin is flipped two times. Previously, we found the sample space for this
experiment: S = {HH, HT, TH, TT}. Find the complement of the event “exactly one head” and of the
event “at least one tail”.
Solution:
• The event “exactly one head” consists of the outcomes HT and TH. The complement of “exactly one
head” consists of the outcomes HH and TT. These are the outcomes in the sample space S
that are NOT in the original event “exactly one head.”
• The event “at least one tail” consists of the outcomes HT, TH, and TT. The complement of “at least
one tail” consists of the single outcome HH. This is the outcome in the sample space S
that is NOT in the original event “at least one tail.”
Contingency Table:
A contingency table is a tabular representation of categorical data. A contingency table usually shows
frequencies for particular combinations of values of two discrete random variables X and Y. Each cell in
the table represents a mutually exclusive combination of X-Y values.
For example, consider a sample of N=200 beer-drinkers. For each drinker we have information on sex
(variable X, taking on 2 possible values: “Male” and “Female”) and preferred category of beer (variable
Y, taking on 3 possible values: “Light”, “Regular”, “Dark”). A contingency table for these data might
look like the following
         Light   Regular   Dark   Total
Male       20        40     50     110
Female     50        20     20      90
Total      70        60     70     200
This is a two-way 2×3 contingency table (i.e. two rows and three columns).
Sometimes three-way (and more) contingency tables are used. Suppose the beer-drinkers data, besides sex
and preference, are also stratified by age group. The third discrete variable Z (“Age”) in this case might,
for example, take on 4 values (four age brackets).
In this case we would have a three-way 2x3x4 contingency table, equivalent to 4 two-way 2×3
contingency tables (one 2×3 table for each of the 4 age-groups).
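As a sketch, such a table can be built with pandas.crosstab; the raw records below are invented, but the resulting table has the same 2×3 shape (plus totals) as the example above:

```python
# Contingency table sketch with pandas (records are made up).
import pandas as pd

df = pd.DataFrame({
    "sex":  ["Male", "Male", "Female", "Female", "Male", "Female"],
    "beer": ["Light", "Dark", "Light", "Regular", "Regular", "Light"],
})
table = pd.crosstab(df["sex"], df["beer"], margins=True)  # adds totals
print(table)
```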
Regression and correlation are two of the most powerful and versatile statistical tools we can use to
solve common business problems. They are based on the belief that we can identify and quantify some
functional relationship between two or more variables. One variable is said to depend on another. We
might say Y depends on X where Y and X are any two variables.
Since Y depends on X, Y is the dependent variable and X is the independent variable. It is important to
identify which is the dependent variable and which is the independent variable in the regression model.
This depends on logic and on what the statistician is trying to measure. Suppose the dean of a college
wishes to examine the relationship between students’ grades and the time they spend studying, and data
are collected on both variables. It is only logical to presume that grades depend on the amount of quality
time students spend with the books! Thus, “grades” is the dependent variable and “time” is the
independent variable.
What is Regression?
Regression is a statistical method that tries to determine the strength and character of the
relationship between one dependent variable and a series of other variables. It is used in
finance, investing, and other disciplines.
What Are the Assumptions That Must Hold for Regression Models?
To properly interpret the output of a regression model, the following main assumptions about the
underlying data process must hold: the relationship between the variables is linear; the observations (and
their errors) are independent; the residuals have constant variance (homoscedasticity); and the residuals
are approximately normally distributed.
What is correlation?
Correlation is a statistical measure that expresses the extent to which two variables are linearly
related (meaning they change together at a constant rate). It’s a common tool for describing simple
relationships without making a statement about cause and effect.
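A minimal sketch, assuming Python 3.10+ (for statistics.correlation) and hypothetical study-time and grade data in the spirit of the dean example above:

```python
# Pearson correlation between hypothetical study hours and exam grades.
import statistics

hours = [1, 2, 3, 4, 5, 6]
grade = [52, 58, 63, 70, 74, 81]
print(statistics.correlation(hours, grade))  # close to +1: strong linear link
```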
What is a variable?
A variable is any kind of attribute or characteristic that you are trying to measure,
manipulate or control in statistics and research. All studies analyze a variable, which can
describe a person, place, thing or idea. A variable’s value can change between groups or
over time.
Independent variables
• Definition: A variable that stands alone and isn’t changed by the other variables or factors that are
measured.
• Example: Age. Other variables such as where someone lives, what they eat or how much they exercise
are not going to change their age.
Dependent variables
• Definition: A variable that relies on and can be changed by other factors that are measured.
• Example: A grade someone gets on an exam depends on factors such as how much sleep they got and
how long they studied.
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modeling the future
relationship between them.
We should distinguish between simple regression and multiple regression. In simple regression, Y is said
to be a function of only one independent variable. Often referred to as bivariate regression because
there are only two variables, one dependent and one independent, simple regression is represented by
Formula (11.1):
Y = b₀ + b₁X + e   (11.1)
In a multiple regression model, Y is a function of two or more independent variables. A
regression model with k independent variables can be expressed as
Y = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ + e
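As a sketch of simple (bivariate) regression, assuming scipy is available and reusing the hypothetical study-time data from the correlation example:

```python
# Fit Y = a + bX with scipy.stats.linregress on hypothetical data.
from scipy.stats import linregress

hours = [1, 2, 3, 4, 5, 6]          # independent variable X
grade = [52, 58, 63, 70, 74, 81]    # dependent variable Y

fit = linregress(hours, grade)
print(f"grade ≈ {fit.intercept:.1f} + {fit.slope:.1f} * hours, "
      f"r = {fit.rvalue:.3f}")
```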
Aspect: Predictability
• Linear correlation: predicts changes in one variable based on changes in another with relative accuracy
within a linear framework.
• Curvilinear correlation: also predicts change, but the relationship is more intricate, making predictions
complex and context-dependent.