
Statistics Notes Part 2

Statistical Series of Two Variables


Statistical series of two variables, also known as bivariate data, is a fundamental concept in statistics.
It involves the analysis of two variables to determine the empirical relationship between them.

Definition
A statistical series of two variables consists of pairs of observations corresponding to each individual
or object under study. Each pair of observations includes a value for the first variable and a value
for the second variable. For example, in a study of the relationship between height and weight in a
group of individuals, each individual's height and weight would constitute a pair of observations.

Types of Relationships
There are three main types of relationships that can exist between two variables:

1. Positive Relationship: As the value of one variable increases, the value of the other variable also
increases.
2. Negative Relationship: As the value of one variable increases, the value of the other variable
decreases.
3. No Relationship: There is no apparent pattern between the values of the two variables.

Cloud Points
Cloud points, also known as scatter plots, are graphical representations of two-variable data. Each
point on the plot represents a pair of observations.

Example

Consider the following data representing the heights and weights of five individuals:

Individual  Height (cm)  Weight (kg)


1 170 65
2 180 75
3 175 70
4 185 80
5 165 60

A scatter plot of this data would place each individual's height and weight as a point on a two-
dimensional graph.
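As an illustration, here is a minimal Python sketch (matplotlib is one common plotting choice) that draws the scatter plot for the height/weight data above:

import matplotlib.pyplot as plt

# Height/weight pairs from the example above
heights = [170, 180, 175, 185, 165]  # cm
weights = [65, 75, 70, 80, 60]       # kg

plt.scatter(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs. weight")
plt.show()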
Exercise

Given a dataset of students' scores in Mathematics and English, plot a scatter plot to
visualize the relationship between the two scores.

Student Mathematics Score English Score


1 85 78
2 90 92
3 78 81
4 92 88
5 88 85
6 75 80
7 82 78
8 90 92
9 78 75
10 85 88

Average Point
The average point of a two-variable data set is the point whose coordinates are the means of the
respective variables.

Example

Using the same data from the previous example, the average point would be the average height and
the average weight:

Average height = (170 + 180 + 175 + 185 + 165) / 5 = 175 cm

Average weight = (65 + 75 + 70 + 80 + 60) / 5 = 70 kg

So, the average point is (175, 70).
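The same computation in a short Python sketch, using the standard-library statistics module:

from statistics import mean

heights = [170, 180, 175, 185, 165]
weights = [65, 75, 70, 80, 60]

# The coordinates of the average point are the means of the two variables
average_point = (mean(heights), mean(weights))
print(average_point)  # (175, 70)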

Exercise

Calculate the average point for the dataset below, which gives the ages and incomes of a
group of individuals.

Individual  Age  Income ($)


1 22 32000
2 25 65000
3 30 45000
4 35 28000
5 40 32000
6 45 60000
7 50 25000
8 55 65000
9 60 40000
10 65 32000

Covariance and Covariance Matrix of 2 Variables


Covariance is a measure of how much two variables change together. If the variables tend to increase
and decrease together, the covariance is positive. If one variable tends to increase when the other
decreases, the covariance is negative. A covariance of 0 indicates no linear relationship.

The covariance matrix of two variables is a matrix that contains the variances of the variables along
the main diagonal, and the covariances between each pair of variables in the other positions.

The covariance of X and Y can be calculated using the formula:

Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

where n is the number of observations and x̄, ȳ are the means of X and Y. Dividing by
(n − 1) gives the sample covariance; for population data, divide by n instead.

Example: Let's consider two variables, X and Y, with the following data:

X=[2,4,6,8,10]

Y=[1,3,5,7,9]

• Calculate the covariance between X and Y.

Here x̄ = 6 and ȳ = 5, so Cov(X, Y) = [(−4)(−4) + (−2)(−2) + (0)(0) + (2)(2) + (4)(4)] / 4
= 40/4 = 10.

The covariance matrix of X and Y is then:

| Cov(X,X) Cov(X, Y) |

| Cov(Y, X) Cov(Y,Y) |
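A minimal NumPy sketch of the same computation; note that np.cov divides by n − 1 by default, matching the sample formula above:

import numpy as np

X = [2, 4, 6, 8, 10]
Y = [1, 3, 5, 7, 9]

# 2x2 covariance matrix: variances on the diagonal, covariances off it
cov_matrix = np.cov(X, Y)
print(cov_matrix)        # [[10. 10.]
                         #  [10. 10.]]
print(cov_matrix[0, 1])  # Cov(X, Y) = 10.0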

Exercises

The Coefficient of Correlation and Regression


The coefficient of correlation, also known as Pearson's correlation coefficient, is a measure of the
strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where
-1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship,
and 0 indicates no linear relationship.
Regression analysis is used to model the relationship between two variables. The regression line,
also known as the line of best fit, is the line that minimizes the sum of the squared residuals (the
differences between the observed and predicted values).

Example

Using the same data for X and Y from the previous example, the correlation coefficient can be
calculated using the formula:

r = Cov(X, Y) / (σₓ σᵧ)

where σₓ and σᵧ are the standard deviations of X and Y. Here Cov(X, Y) = 10 and
σₓ = σᵧ = √10, so r = 10/10 = 1: a perfect positive linear relationship.

The regression line can be calculated using the formula:

𝑦 = 𝑎 + 𝑏𝑥

where a is the y-intercept (a = ȳ − b·x̄), b is the slope of the line (which can be calculated
as b = Cov(X, Y) / Var(X)), and x is the independent variable. For the data above,
b = 10/10 = 1 and a = 5 − 1×6 = −1, so the regression line is y = −1 + x.
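A short Python sketch that reproduces these values (np.polyfit is one of several ways to fit the line):

import numpy as np

X = np.array([2, 4, 6, 8, 10])
Y = np.array([1, 3, 5, 7, 9])

# Pearson correlation coefficient
r = np.corrcoef(X, Y)[0, 1]
print(r)  # 1.0

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
b, a = np.polyfit(X, Y, 1)
print(a, b)  # a = -1.0, b = 1.0 (up to floating-point rounding)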

Exercise

Calculate the correlation coefficient of the given data.

Linear Adjustment by the Method of Least Squares

Least Square Method Definition

The least-squares method is a statistical method used to find the line of best fit, in the form
of an equation such as y = mx + b, for given data. The curve of this equation is called the
regression line. The main objective of the method is to make the sum of the squares of the
errors as small as possible, which is why it is called the least-squares method. It is widely
used in data fitting, where the best-fit result minimizes the sum of squared errors, each error
being the difference between an observed value and the corresponding fitted value. The sum of
squared errors also measures the variation in the observed data. For example, given 4 data
points, the method produces a graph like the one below.

Figure 1: Least square method

The two basic categories of least-square problems are ordinary or linear least squares and nonlinear
least squares.

Limitations for Least Square Method

Even though the least-squares method is considered the best method to find the line of best fit, it
has a few limitations. They are:

• This method captures only the relationship between the two variables; all other causes and
effects are not taken into consideration.
• This method is unreliable when data is not evenly distributed.
• This method is very sensitive to outliers. In fact, this can skew the results of the least-squares
analysis.

Least Square Method Graph

In the graph below, the straight line shows the potential relationship between the independent
variable and the dependent variable. The goal of the method is to reduce the difference between
each observed response and the response predicted by the regression line; smaller residuals
mean the model fits better. The method therefore minimizes the residual of each point from the
line. Residuals can be measured vertically or perpendicularly to the line: vertical residuals
are the ones used in ordinary least squares, including polynomial and hyperplane fitting, while
perpendicular residuals are used in total (orthogonal) least squares, as seen in the image below.
Figure 2: Least square method graph

Least Square Method Formula

The least-squares method finds the curve that best fits a set of observations with the minimum
sum of squared residuals, or errors. Let us assume the given data points are (x1, y1), (x2, y2),
(x3, y3), …, (xn, yn), in which all x's are values of the independent variable and all y's are
values of the dependent one. The method is used to find a linear line of the form y = mx + b,
where y and x are the variables, m is the slope, and b is the y-intercept. The formulas for the
slope m and the intercept b are:

m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)

b = (∑y − m∑x) / n

Here, n is the number of data points.

Following are the steps to calculate the least square using the above formulas.

• Step 1: Draw a table with 4 columns where the first two columns are for the x and y points.
• Step 2: In the next two columns, compute xy and x².
• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
• Step 4: Find the value of the slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the values of m and b into the equation y = mx + b.
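These steps translate directly into code. Below is a minimal Python sketch of the same procedure, written with plain loops so each step stays visible (the function name least_squares is our own choice):

def least_squares(xs, ys):
    # Steps 2-3: accumulate the sums the formulas need
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Step 4: slope m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Step 5: intercept b = (∑y − m∑x) / n
    b = (sum_y - m * sum_x) / n
    return m, b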

Let us look at an example to understand this better.

Example: Let's say we have 5 data points for which ∑x = 15, ∑y = 25, ∑xy = 88, and ∑x² = 55.

Solution: We will follow the steps to find the linear line.

Find the value of m by using the formula,

m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)

m = [(5×88) − (15×25)] / [(5×55) − 15²]

m = (440 - 375)/(275 - 225)

m = 65/50 = 13/10

Find the value of b by using the formula,

b = (∑y - m∑x)/n

b = (25 - 1.3×15)/5

b = (25 - 19.5)/5

b = 5.5/5 = 1.1

So, the required least-squares equation is y = mx + b = 1.3x + 1.1.
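Plugging the same sums into the formulas in code confirms the result:

# Sums taken from the worked example above
n, sum_x, sum_y, sum_xy, sum_x2 = 5, 15, 25, 88, 55
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(m, b)  # 1.3 1.1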

Important Notes

• The least-squares method is used to predict the behaviour of the dependent variable with respect to
the independent variable.
• The sum of the squared errors measures the variation of the observed values around the fitted line.
• The main aim of the least-squares method is to minimize the sum of the squared errors.
