I have located a table that breaks down Pearson's correlation values into 3 categories. See below:
My question is this: a value of 1 would count as a large positive correlation, but it is also a perfect correlation. Is this table valid?
First, the table isn't really "valid" or "invalid"; it's a general guideline that applies in some situations and not in others. In some fields an $r$ of $\pm 0.5$ is very strong, in some it is very weak.
Second, a perfect correlation is, indeed, large.
I would expand on @Peter's remark:
"In some fields an r of 0.5 is very strong, in some it is very weak."
Clearly, in social sciences based on surveys/testing, a correlation of 0.5 between questions or items is quite large; that same value would be seen as negligibly small in some branches of physics.
But another facet of the issue is the shape of the distributions and the degree of discretization of the scale.
Pearson $r$ theoretically varies between $-1$ and $+1$. However, with real data the empirical range within which $r$, computed between two given variables, can fall is usually narrower than the theoretical one. The linear correlation between X and Y can attain $+1$ if and only if the distributions of X and Y are totally identical in shape (in other words, the two distributions differ by no more than a linear transformation). Otherwise, the upper bound for $r$ will be lower than $+1$. Analogously, $r$ can attain $-1$ only if the shape of the distribution of X is fully identical to that of $-Y$; otherwise the lower bound for $r$ will be higher than $-1$. For $r$ to have the full theoretical range of variation, $-1$ to $+1$, the two distributions must not only be identical in shape, they must also both be symmetric.
Real-life correlating variables often have different and asymmetric distributional shapes. That means that $r$ between them has a narrowed range in which to vary and can never reach the values $+1$ or $-1$. In the following example, X and Y correlate with $r=.573$, but these data could never give $r>.808$.
X (sorted)   Y (as is, paired with X)   Y (sorted ascendingly, like X)
4 4 3
4 4 4
4 5 4
5 3 4
5 4 4
5 4 5
5 6 6
5 6 6
6 6 6
6 7 7
r(Xsorted,Yasis)= .573 - actual observed correlation
r(Xsorted,Ysorted)= .808 - maximal correlation attainable with the observed data
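These two numbers are easy to check numerically. A minimal sketch using NumPy (the three arrays are the columns of the small table above):

```python
import numpy as np

# The three columns of the data table
x = np.array([4, 4, 4, 5, 5, 5, 5, 5, 6, 6])       # X, sorted ascending
y_asis = np.array([4, 4, 5, 3, 4, 4, 6, 6, 6, 7])  # Y as observed with X
y_sorted = np.sort(y_asis)                         # Y sorted ascendingly

r_actual = np.corrcoef(x, y_asis)[0, 1]   # observed correlation, about .573
r_max = np.corrcoef(x, y_sorted)[0, 1]    # maximal attainable, about .808
```

Pairing the sorted X with the sorted Y is what maximizes Pearson $r$ for these fixed marginal distributions, which is why the second value is the empirical ceiling.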
What does a considerable narrowing of the range for $r$ mean? It means that the strength of the association between the two variables is underestimated by the traditional $r$, and the more so the stronger the association is, because a strong association is then inevitably nonlinear, while $r$ skims off only its linear portion. An analyst facing this problem and wishing to scoop out more of the association would probably think of using the nonparametric Spearman $\rho$.
But an alternative and seemingly reasonable solution would be to rescale the observed $r$ value to its upper (or lower, or both) empirical bound(s). With the example above, $.573/.808=.709$ may be regarded as the "corrected" or "true" magnitude of $r$ in the context of our specific data. Spearman $\rho$ measures the linear relationship after straightening a curved relationship (by ranking). The rescaled $r$ instead references itself to the ceiling of linearity possible with the given data. The ceiling, or bound, value of $.808$ could in a sense be labeled "perfect association", despite being less than $1$.
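The rescaling itself is just a division of the observed $r$ by its attainable ceiling; a quick sketch (NumPy, same data as in the table above):

```python
import numpy as np

x = np.array([4, 4, 4, 5, 5, 5, 5, 5, 6, 6])
y = np.array([4, 4, 5, 3, 4, 4, 6, 6, 6, 7])

r_obs = np.corrcoef(x, y)[0, 1]               # observed r, about .573
r_ceiling = np.corrcoef(x, np.sort(y))[0, 1]  # empirical upper bound, about .808
# "corrected" r, about .71 (the .709 quoted above comes from dividing
# the already-rounded values .573 and .808)
r_rescaled = r_obs / r_ceiling
```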
Such rescaling has, of course, its shortcomings, the main one being that a matrix of such coefficients is often not positive semidefinite.
If the measurement scale of the data is coarse, such as dichotomous, the data react more sharply to inequality of the marginal distributional shapes, and the empirical range for $r$ is narrower than it is with a fine scale. As seen below:
Histograms
Two interval variables with different shapes of distribution:
ooo
ooo    oooooo
oooooo oooooo
oooooo oooooo
123456 123456 <- 6-grade scale
The upper empirical bound for r of these data = .956
The same two variables binned into dichotomous scale
(inequality of distribution shape preserved):
o
o
o
o   o o
o   o o
o   o o
o o o o
o o o o
o o o o
o o o o
o o o o
o o o o
2 5 2 5 <- dichotomous scale
The upper empirical bound for r of these data = .707
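Both bounds can be reproduced from the frequencies the histograms depict; a sketch with NumPy (counts read off the pictures above):

```python
import numpy as np

# 6-grade scale: X has frequencies 4,4,4,2,2,2 over grades 1..6,
# Y is uniform with 3 cases per grade (18 cases each)
x6 = np.repeat([1, 2, 3, 4, 5, 6], [4, 4, 4, 2, 2, 2])
y6 = np.repeat([1, 2, 3, 4, 5, 6], [3, 3, 3, 3, 3, 3])
# Both arrays are already sorted ascending, so this pairing is the
# r-maximizing one: the upper empirical bound, about .956
bound_fine = np.corrcoef(x6, y6)[0, 1]

# The same variables dichotomized: X becomes 12 vs 6 cases, Y 9 vs 9
# (the 0/1 coding is arbitrary; r is unchanged by linear recoding)
x2 = np.repeat([0, 1], [12, 6])
y2 = np.repeat([0, 1], [9, 9])
bound_coarse = np.corrcoef(x2, y2)[0, 1]  # about .707
```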
And this is the reason why factor analysis done on binary data via Pearson $r$ often gives a flat scree plot, suggesting too many factors to extract: strong correlations simply have little chance to occur in binary data with varying item difficulties (i.e., marginal shapes). Torgerson therefore advised computing the rescaled $r$ values described above, which are higher. Another way out is to compute the tetrachoric $r$, which is also higher than the ordinary $r$.