A dataset’s sum of squares shows how dispersed the set’s data points are from its mean.
What Is the Sum of Squares?
The term “sum of squares” refers to a statistical measure, used in regression analysis, that quantifies how spread out data points are around the mean or around predicted values. The sum of squares helps identify the function that best fits the data, because the best-fitting function is the one that deviates least, in total, from the observed values.
In a regression analysis, the goal is to determine how well a data series can be fitted to a function that might help to explain how the data series was generated. The sum of squares can be used in the financial world to determine the variance in asset values.
Key Takeaways
- The sum of squares measures the deviation of data points away from the mean value.
- A higher sum of squares indicates higher variability, while a lower result indicates lower variability from the mean.
- To calculate the sum of squares, subtract the mean from the data points, square the differences, and add them together.
- There are three types of sum of squares: total, residual, and regression.
- Investors can use the sum of squares to help make better decisions about their investments.
Understanding the Sum of Squares
The sum of squares measures how widely a set of data points is spread out from the mean. It is also known as variation. It is calculated by adding together the squared differences between each data point and the mean. In a regression setting, you instead square the distance between each data point and the line of best fit and add those squares together; the line of best fit is the line that minimizes this value.
A low sum of squares indicates little variation within a dataset, while a higher one indicates more variation. Variation refers to the difference between each data point and the mean. You can visualize this in a chart. If the line doesn’t pass through all the data points, then there is some unexplained variability. We go into a little more detail about this in the next section below.
In statistics, the sum of squares is used to calculate the variance and standard deviation of a dataset, which are in turn used in regression analysis. Analysts and investors can use these techniques to make better decisions about their investments. Keep in mind, though, that using them means assuming past performance is a useful guide to future behavior. For instance, this measure can help you determine the level of volatility in a stock’s price or how the share prices of two companies compare.
Let’s say an analyst wants to know if Microsoft (MSFT) share prices tend to move in tandem with those of Apple (AAPL). The analyst can list out the daily prices for both stocks for a certain period (say, one, two, or 10 years) and create a linear model or a chart. If the relationship between the two variables (i.e., the prices of AAPL and MSFT) is not a straight line, then there are variations in the dataset that must be scrutinized.
Sum of Squares Formula
The following is the formula for the total sum of squares.
For a set $X$ of $n$ items:

$$\text{Sum of squares} = \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2$$

where:
- $X_i$ = the $i$th item in the set
- $\bar{X}$ = the mean of all items in the set
- $X_i - \bar{X}$ = the deviation of each item from the mean
Important
Variation is a statistical measure that is calculated or measured by using squared differences.
How to Calculate the Sum of Squares
You can see why the measurement is called the sum of squared deviations, or the sum of squares for short. You can use the following steps to calculate the sum of squares:
- Gather all the data points.
- Determine the mean/average.
- Subtract the mean/average from each individual data point.
- Square each total from Step 3.
- Add up the figures from Step 4.
In statistics, the mean is the average of a set of numbers, which is calculated by adding the values in the dataset together and dividing by the number of values. However, knowing the mean may not be enough to understand your data and draw conclusions. As such, it helps to know the variation in a set of measurements. How far individual values are from the mean may provide insight into how much variation exists and how well the values fit a regression line.
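To make these steps concrete, here is a minimal Python sketch that follows them one by one; the function name and the data points are hypothetical, chosen only for illustration.

```python
def total_sum_of_squares(values):
    """Compute the total sum of squares for a list of numbers."""
    mean = sum(values) / len(values)          # Step 2: the average
    deviations = [x - mean for x in values]   # Step 3: distance of each point from the mean
    squared = [d ** 2 for d in deviations]    # Step 4: square each deviation
    return sum(squared)                       # Step 5: add them up

# Hypothetical data points (Step 1)
data = [4.0, 7.0, 9.0, 3.0, 7.0]
print(total_sum_of_squares(data))  # 24.0
```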
Types of Sum of Squares
The formula we highlighted earlier is used to calculate the total sum of squares. The total sum of squares is used to arrive at other types. The following are the other types of sum of squares.
Residual Sum of Squares
As noted above, if the line in the linear model created does not pass through all the measurements of value, then some of the variability that has been observed in the share prices is unexplained. The sum of squares is used to calculate whether a linear relationship exists between two variables, and any unexplained variability is referred to as the residual sum of squares (RSS).
The RSS allows you to determine the amount of error left between a regression function and the dataset after the model has been run. You can interpret a smaller RSS figure as a regression function that fits well with the data, while the opposite is true of a larger RSS figure.
Here is the formula for calculating the residual sum of squares:
$$\text{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$

where:
- $y_i$ = the observed value
- $\hat{y}_i$ = the value estimated by the regression line
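As a rough illustration, the Python sketch below computes the RSS from a set of observed values and the values a model predicts for them; the numbers are purely illustrative and not from any real regression.

```python
def residual_sum_of_squares(observed, predicted):
    """Sum of squared differences between observed values and model predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

# Illustrative values only: observed data and a model's predictions for them
observed = [2.0, 4.1, 5.9, 8.2]
predicted = [2.1, 4.0, 6.0, 8.0]
print(residual_sum_of_squares(observed, predicted))  # roughly 0.07
```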
Regression Sum of Squares
The regression sum of squares, sometimes called the explained sum of squares, is used to denote the relationship between the modeled data and a regression model. A regression model establishes whether there is a relationship between one or multiple variables. The regression sum of squares measures how much of the variation in the observed data the model accounts for: the closer it is to the total sum of squares, the better the model fits the data, while a value far below the total sum of squares means the model explains little of the variation.
Here is the formula for calculating the regression sum of squares:
$$\text{SSR} = \sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2$$

where:
- $\hat{y}_i$ = the value estimated by the regression line
- $\bar{y}$ = the mean value of the observed sample
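The sketch below, again with illustrative numbers only, computes the regression sum of squares. Note that for an ordinary least-squares fit with an intercept, the total sum of squares splits exactly into the regression and residual pieces (SST = SSR + RSS); the predictions here are not from an actual fit, so that identity is not expected to hold for these particular numbers.

```python
def regression_sum_of_squares(predicted, observed_mean):
    """Sum of squared differences between model predictions and the sample mean."""
    return sum((y_hat - observed_mean) ** 2 for y_hat in predicted)

# Illustrative values only
observed = [2.0, 4.1, 5.9, 8.2]
predicted = [2.1, 4.0, 6.0, 8.0]
mean_y = sum(observed) / len(observed)   # mean of the observed sample
print(regression_sum_of_squares(predicted, mean_y))  # roughly 19.41
```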
Tip
Adding up the deviations alone, without squaring them, gives a result of zero (or very nearly zero once rounding enters), because the negative deviations exactly offset the positive ones. To get a meaningful measure of spread, the deviations must be squared before they are summed. The sum of squares is therefore never negative, because the square of any number, whether positive or negative, is never negative.
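A quick way to see this is the short sketch below (hypothetical numbers): the raw deviations from the mean cancel to zero, while the squared deviations do not.

```python
values = [4.0, 7.0, 9.0, 3.0, 7.0]           # hypothetical data
mean = sum(values) / len(values)

deviations = [x - mean for x in values]
print(sum(deviations))                       # 0.0 -- positives and negatives cancel
print(sum(d ** 2 for d in deviations))       # 24.0 -- squaring removes the cancellation
```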
Limitations of Using the Sum of Squares
Making an investment decision on what stock to purchase requires many more observations than the ones listed here. An analyst may have to work with years of data to know with higher certainty how high or low the variability of an asset is. Keep in mind that as more data points are added to the set, the sum of squares grows simply because more squared deviations are being added, so it is not directly comparable across samples of different sizes.
The most widely used measurements of variation are the standard deviation and variance. However, to calculate either of the two metrics, the sum of squares must first be calculated. The variance is the sum of squares divided by the number of observations (or by the number of observations minus one, when working with a sample). The standard deviation is the square root of the variance.
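As a minimal sketch of that relationship, the Python below derives the population variance and standard deviation from the sum of squares; the data points are hypothetical, and a sample variance would divide by n − 1 instead of n.

```python
import math

values = [4.0, 7.0, 9.0, 3.0, 7.0]                     # hypothetical data
n = len(values)
mean = sum(values) / n
sum_of_squares = sum((x - mean) ** 2 for x in values)  # 24.0

variance = sum_of_squares / n    # population variance; use n - 1 for a sample
std_dev = math.sqrt(variance)    # standard deviation

print(variance, std_dev)         # 4.8 and about 2.19
```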
There are two methods of regression analysis that use the sum of squares:
- The linear least squares method
- The nonlinear least squares method
The least squares method refers to the fact that the regression function minimizes the sum of the squared deviations between the actual data points and the values the function predicts. In this way, it is possible to draw the function that statistically provides the best fit for the data. Note that a regression function can either be linear (a straight line) or nonlinear (a curving line).
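Here is a hedged sketch of the linear least squares method, using the standard closed-form slope and intercept formulas on a few illustrative points and then reporting the residual sum of squares for the fitted line; the data and the helper function name are assumptions made for demonstration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (co-variation of x and y) divided by (variation of x), both unscaled
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative data only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]
rss = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predictions))
print(f"slope={slope:.2f}, intercept={intercept:.2f}, RSS={rss:.4f}")
```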
Example of Sum of Squares
Let’s use Microsoft as an example to show how you can arrive at the sum of squares.
Using the steps listed above, we gather the data. So if we’re looking at the company’s performance over a five-day period, we’ll need the daily closing prices for that time frame:
- $374.01
- $374.77
- $373.94
- $373.61
- $373.40
Now let’s figure out the average price. The sum of the prices is $1,869.73, so the mean or average price is $1,869.73 ÷ 5 ≈ $373.95.
Then, to figure out the sum of squares, we find the difference of each price from the average, square the differences, and add them together:
- SS = ($374.01 - $373.95)² + ($374.77 - $373.95)² + ($373.94 - $373.95)² + ($373.61 - $373.95)² + ($373.40 - $373.95)²
- SS = (0.06)² + (0.82)² + (-0.01)² + (-0.34)² + (-0.55)²
- SS = 1.0942
In the example above, 1.0942 shows that the variability in the stock price of MSFT over five days is very low, and investors looking to invest in stocks characterized by price stability and low volatility may opt for Microsoft.
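If you want to check the arithmetic, the short sketch below reproduces it; it rounds the mean to the nearest cent, as the example does, so it matches the 1.0942 figure above (using the unrounded mean gives roughly 1.0941).

```python
prices = [374.01, 374.77, 373.94, 373.61, 373.40]

mean = round(sum(prices) / len(prices), 2)               # 373.95, rounded to the cent
sum_of_squares = sum(round(p - mean, 2) ** 2 for p in prices)
print(round(sum_of_squares, 4))                          # 1.0942
```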
How Do You Define the Sum of Squares?
The sum of squares is a statistical measure, used in regression analysis, of how far data points vary from the mean. If there is a low sum of squares, it means there’s low variation. A higher sum of squares indicates higher variance. This can be used to help make more informed decisions by determining investment volatility or to compare groups of investments with one another.
How Do You Calculate the Sum of Squares?
To calculate the sum of squares, gather all your data points. Then, determine the mean or average by adding them all together and dividing that figure by the total number of data points. Next, figure out the differences between each data point and the mean. Then square those differences and add them together to give you the sum of squares.
How Does the Sum of Squares Help in Finance?
Investors and analysts can use the sum of squares to make comparisons between different investments or make decisions about how to invest. For instance, you can use the sum of squares to gauge stock volatility. A low sum generally indicates low volatility, while a higher sum of squares points to higher volatility.
The Bottom Line
As an investor, you want to make informed decisions about where to put your money. While you can certainly do so using your gut instinct, there are tools at your disposal that can help you. The sum of squares uses historical data to give you an indication of a stock’s past volatility. Use it to see whether a stock is a good fit for you or to help you choose between two different assets if you’re on the fence.
Keep in mind, though, that the sum of squares uses past performance as an indicator and doesn’t guarantee future performance.