Statistics 02
Definition
A statistical series of two variables consists of pairs of observations corresponding to each individual
or object under study. Each pair of observations includes a value for the first variable and a value
for the second variable. For example, in a study of the relationship between height and weight in a
group of individuals, each individual's height and weight would constitute a pair of observations.
Types of Relationships
There are three main types of relationships that can exist between two variables:
1. Positive Relationship: As the value of one variable increases, the value of the other variable also
increases.
2. Negative Relationship: As the value of one variable increases, the value of the other variable
decreases.
3. No Relationship: There is no apparent pattern between the values of the two variables.
Scatter Plots (Point Clouds)
A scatter plot, also known as a point cloud, is a graphical representation of two-variable data. Each
point on the plot represents one pair of observations.
Example
Consider a dataset giving the heights and weights of five individuals. A scatter plot of this data
would place each individual's (height, weight) pair as a point on a two-dimensional graph.
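As a sketch, the plot below uses hypothetical height and weight values (the original data table is not
reproduced here) to show how such a scatter plot can be drawn with matplotlib:

```python
# A minimal sketch using hypothetical height/weight values,
# since the original data table is not reproduced in the notes.
import matplotlib.pyplot as plt

heights = [160, 165, 170, 175, 180]  # cm (hypothetical)
weights = [55, 62, 68, 74, 80]       # kg (hypothetical)

plt.scatter(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter plot of height vs. weight")
plt.show()
```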
Exercise
Given a dataset of students' scores in Mathematics and English, plot a scatter plot to visualize the
relationship between the two scores.
Average Point
The average point of a two-variable dataset is the point whose coordinates are the means of the
respective variables: G = (x̄, ȳ), where x̄ = (1/n)∑xᵢ and ȳ = (1/n)∑yᵢ.
Example
Using the same data from the previous example, the average point is (mean height, mean weight).
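A minimal sketch, reusing the same hypothetical height and weight values as above:

```python
# Average (mean) point of a two-variable dataset,
# computed on the hypothetical height/weight values used earlier.
heights = [160, 165, 170, 175, 180]  # cm (hypothetical)
weights = [55, 62, 68, 74, 80]       # kg (hypothetical)

mean_height = sum(heights) / len(heights)
mean_weight = sum(weights) / len(weights)

print((mean_height, mean_weight))  # (170.0, 67.8)
```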
Exercise
Calculate the average point for a dataset representing the ages and incomes of a group of
individuals.
Covariance Matrix
The covariance matrix of two variables is a matrix that contains the variances of the variables along
the main diagonal, and the covariances between each pair of variables in the other positions. The
sample covariance of X and Y is:
𝐶𝑜𝑣(𝑋, 𝑌) = ∑(𝑥ᵢ − x̄)(𝑦ᵢ − ȳ) / (𝑛 − 1)
Note that n is the number of observations. Dividing by (n − 1) means we are using sample data (for
population data, divide by n).
Example
Let's consider two variables, X and Y, with the following data:
X=[2,4,6,8,10]
Y=[1,3,5,7,9]
The sample means are x̄ = 6 and ȳ = 5, and Var(X) = Var(Y) = Cov(X, Y) = 40/4 = 10, so:
| Cov(X, X)  Cov(X, Y) |   | 10  10 |
| Cov(Y, X)  Cov(Y, Y) | = | 10  10 |
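This result can be checked numerically; note that numpy's cov divides by (n − 1) by default, matching
the sample convention used here:

```python
import numpy as np

X = [2, 4, 6, 8, 10]
Y = [1, 3, 5, 7, 9]

# np.cov uses ddof=1 by default, i.e. the sample (n - 1) convention.
C = np.cov(X, Y)
print(C)
# [[10. 10.]
#  [10. 10.]]
```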
Correlation Coefficient
Example
Using the same data for X and Y from the previous example, the correlation coefficient can be
calculated using the formula:
𝑟 = 𝐶𝑜𝑣(𝑋, 𝑌) / (𝜎𝑋 𝜎𝑌)
where 𝜎𝑋 and 𝜎𝑌 are the standard deviations of X and Y. Here r = 10 / (√10 × √10) = 1, indicating a
perfect positive linear relationship.
The corresponding regression line has the form:
𝑦 = 𝑎 + 𝑏𝑥
where a is the y-intercept (a = ȳ − b x̄), b is the slope of the line (which can be calculated as
𝐶𝑜𝑣(𝑋, 𝑌) / 𝑉𝑎𝑟(𝑋)), and x is the independent variable.
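A short numerical check of these values for the example data:

```python
import numpy as np

X = np.array([2, 4, 6, 8, 10])
Y = np.array([1, 3, 5, 7, 9])

cov_xy = np.cov(X, Y)[0, 1]               # sample covariance
r = cov_xy / (X.std(ddof=1) * Y.std(ddof=1))
b = cov_xy / X.var(ddof=1)                # slope = Cov(X, Y) / Var(X)
a = Y.mean() - b * X.mean()               # intercept = ȳ - b·x̄

print(round(r, 4))        # 1.0: perfect positive linear relationship
print(round(a, 4), round(b, 4))  # -1.0 1.0, i.e. the line y = x - 1
```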
Least-Squares Method
The least-squares method is a statistical method used to find the line of best fit, of the form
y = mx + b, for a given set of data. The fitted curve is called the regression line. The main
objective of this method is to make the sum of the squared errors as small as possible; this is why
it is called the least-squares method. The method is often used in data fitting, where the best fit
is taken to be the one that minimizes the sum of squared errors, each error being the difference
between an observed value and the corresponding fitted value. The sum of squared errors measures the
variation in the observed data around the fitted line. For example, given 4 data points, the method
produces the single line that minimizes the total squared error to those points.
The two basic categories of least-squares problems are ordinary (linear) least squares and nonlinear
least squares.
Even though the least-squares method is considered the best method to find the line of best fit, it
has a few limitations:
• It captures only the relationship between the two variables; all other causes and effects are not
taken into consideration.
• It is unreliable when the data are not evenly distributed.
• It is very sensitive to outliers, which can skew the results of the analysis.
In the graph below, the straight line shows the potential relationship between the independent
variable and the dependent variable. The ultimate goal of this method is to reduce the difference
between the observed responses and the responses predicted by the regression line: smaller residuals
mean that the model fits better. The residuals of the points from the line are what the method
minimizes. Residuals can be measured vertically or perpendicularly; vertical residuals are mostly
used in polynomial and hyperplane problems, while perpendicular residuals are used in the general
case, as seen in the image below.
Figure 2: Least square method graph
The least-squares method finds the line that best fits a set of observations with a minimum sum of
squared residuals, or errors. Let us assume that the given data points are (x1, y1), (x2, y2),
(x3, y3), …, (xn, yn), in which all x's are independent variables, while all y's are dependent ones.
The method finds a line of the form y = mx + b, where y and x are variables, m is the slope, and b
is the y-intercept. The formulas for the slope m and the intercept b are:
𝑚 = (𝑛∑𝑥𝑦 − ∑𝑥 ∑𝑦) / (𝑛∑𝑥² − (∑𝑥)²)
𝑏 = (∑𝑦 − 𝑚∑𝑥)/𝑛
Following are the steps to calculate the least-squares line using the above formulas.
• Step 1: Draw a table with 4 columns, where the first two columns are for the x and y values.
• Step 2: In the next two columns, compute xy and x².
• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
• Step 4: Find the value of the slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the values of m and b into the equation y = mx + b.
For example, for a dataset of n = 5 points with ∑x = 15, ∑y = 25, ∑xy = 88, and ∑x² = 55:
m = (5 × 88 − 15 × 25) / (5 × 55 − 15²) = 65/50 = 1.3
b = (∑y − m∑x)/n = (25 − 1.3 × 15)/5 = (25 − 19.5)/5 = 5.5/5 = 1.1
so the line of best fit is y = 1.3x + 1.1.
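The steps above can be carried out in a few lines of Python. The data points below are hypothetical,
chosen so that their sums match the worked values (∑x = 15, ∑y = 25, ∑xy = 88, ∑x² = 55):

```python
# Least-squares fit following the tabular steps above.
# The data points are hypothetical, chosen so that their sums
# match the worked example (Σx = 15, Σy = 25, Σxy = 88, Σx² = 55).
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 5, 8]

n = len(x)
sum_x = sum(x)                                  # Σx  = 15
sum_y = sum(y)                                  # Σy  = 25
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # Σxy = 88
sum_x2 = sum(xi ** 2 for xi in x)               # Σx² = 55

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(round(m, 2), round(b, 2))  # 1.3 1.1 -> y = 1.3x + 1.1
```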
Important Notes
• The least-squares method is used to predict the behaviour of the dependent variable with respect to
the independent variable.
• The sum of the squared errors measures the variation of the observed values around the fitted line.
• The main aim of the least-squares method is to minimize the sum of the squared errors.