Hi Giovanni,
thanks, good that you mention that. One question, though: did you by chance mistake the scatter plot I posted in my last mail (from the random 5000-article sample, corr: 0.96) for the one from the 5800-6000 byte sample (corr: 0.04)? The latter had not been posted yet; it looks like this: [1], or in a version cut off at ~20000 chars like this: [2].
By the way: I split the random 5000-article sample I posted last time at the median (3709 bytes) into two parts of 2500 articles each. For the "higher byte size" part (>3709 bytes) the correlation is 0.964. For the "lower byte size" part (<3710 bytes) the correlation is only 0.295. See plots [3].
--> In that sample, the bigger articles seem to exhibit a higher correlation than the smaller ones.
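A minimal sketch of such a median split, for anyone who wants to repeat it on their own data. The actual sample is not included in this mail, so the data below is synthetic and every parameter (the log-normal byte sizes, the slope of 0.6, the noise of 800 chars) is an assumption chosen only for illustration; `pearson` and `split_at_median` are hypothetical helper names.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def split_at_median(pairs):
    """Sort (page_len, display_chars) pairs by page_len and cut them in half."""
    ordered = sorted(pairs, key=lambda p: p[0])
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


# Synthetic stand-in for the 5000-article sample (all parameters are assumptions).
rng = random.Random(0)
pairs = []
for _ in range(5000):
    page_len = rng.lognormvariate(8.2, 1.0)             # skewed byte sizes
    display_chars = 0.6 * page_len + rng.gauss(0, 800)  # linear trend plus constant noise
    pairs.append((page_len, display_chars))

lower, upper = split_at_median(pairs)
r_all = pearson(*zip(*pairs))
r_lower = pearson(*zip(*lower))
r_upper = pearson(*zip(*upper))
print(f"all: {r_all:.3f}  lower half: {r_lower:.3f}  upper half: {r_upper:.3f}")
```

In this synthetic data the noise is the same for all sizes, so any gap between the two halves comes only from the narrower spread of byte sizes below the median. Real articles may differ in other ways as well (for example in their share of template markup), which this sketch does not model.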
I also plotted the residuals; they "kind of" show homogeneity of variance, although I would have to check that with a test. See [4].
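A rough way to put a number on the "kind of" before running a formal test: fit a straight line, take the residuals, and compare their variance in the upper half of the byte sizes with that in the lower half. This is only in the spirit of a Goldfeld-Quandt style comparison and is not a significance test; the function names and the synthetic data are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Observed y minus the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def variance_ratio(xs, res):
    """Residual variance in the upper half of x divided by that in the lower half."""
    ordered = [r for _, r in sorted(zip(xs, res), key=lambda p: p[0])]
    half = len(ordered) // 2

    def sample_variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    return sample_variance(ordered[half:]) / sample_variance(ordered[:half])


# Synthetic examples (assumed parameters): constant noise vs. noise growing with size.
rng = random.Random(1)
xs = [rng.uniform(500, 20000) for _ in range(2000)]
ys_const = [0.6 * x + rng.gauss(0, 500) for x in xs]
ys_growing = [0.6 * x + rng.gauss(0, 0.05 * x) for x in xs]

ratio_const = variance_ratio(xs, residuals(xs, ys_const))
ratio_growing = variance_ratio(xs, residuals(xs, ys_growing))
print(f"constant noise: {ratio_const:.2f}  growing noise: {ratio_growing:.2f}")
```

A ratio near 1 is what homogeneous variance would look like; a ratio far from 1 suggests the spread of the residuals changes with article size. How far is "far" would still need a proper test.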
All in all, I feel that the question "How far and when can we rely on byte size to tell us the actual display character length of an article?" can perhaps be answered better by looking at the distribution of the ratio of bytes to display chars in different byte size segments. I'll try to look into that or encourage someone else to do it.
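A possible starting point for that analysis, as a sketch only: group articles into fixed-width byte size segments and summarise the bytes-to-chars ratio per segment. The segment width of 2000 bytes, the function name `ratio_by_segment`, and the four example pairs are all assumptions, not values from the sample.

```python
import statistics
from collections import defaultdict


def ratio_by_segment(pairs, width=2000):
    """Summarise page_len / display_chars per byte-size segment of the given width."""
    buckets = defaultdict(list)
    for page_len, display_chars in pairs:
        if display_chars <= 0:
            continue  # ratio is undefined for pages without display characters
        segment_start = int(page_len // width) * width
        buckets[segment_start].append(page_len / display_chars)
    return {
        start: {
            "n": len(ratios),
            "median": statistics.median(ratios),
            "min": min(ratios),
            "max": max(ratios),
        }
        for start, ratios in sorted(buckets.items())
    }


# Tiny made-up example: (page_len in bytes, display chars).
sample = [(1000, 500), (1500, 500), (2500, 2500), (3900, 1300)]
for start, stats in ratio_by_segment(sample).items():
    print(start, stats)
```

Median and range are only a first look; quartiles or a histogram per segment would show the shape of the distribution better.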
Best,
Fabian
[1] https://dl.dropboxusercontent.com/u/3021002/scatter_5800-6000_complete1.png
[2] https://dl.dropboxusercontent.com/u/3021002/scatter_5800-6000_cutoff20T1.png
[3] over median: https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_corrOVERmedia... under median: https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_corrUNDERmedi... (DIFFERENT SCALE!) under median with the scale from the "over median" plot: https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_corrUNDERmedi...
[4] complete version: https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_residual1.png , cut-off at 20000 bytes page_len: https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_residual_shor...
________________________________________
From: Giovanni Luca Ciampaglia [[email protected]]
Sent: Wednesday, August 07, 2013 3:55 PM
To: Floeck, Fabian (AIFB)
Subject: Re: [Wiki-research-l] Readable characters vs. size in bytes of articles
Hi Fabian,
in principle you should be able to recover the same correlation in the 5800-6000 byte range as well, provided that you control for the noise in the data. From your scatterplot it looks as if the variance of the residuals is constant (a scatter plot of the residuals should be enough to confirm this), so if you standardize the residuals by their standard deviation you should be able to recover the correlation, though the significance of such an estimate might be at risk if the sample size is small.
Thanks for the interesting discussion.
Best,
Giovanni
On Wed 07 Aug 2013 08:02:21 AM EDT, Floeck, Fabian (AIFB) wrote:
Update:
Not surprisingly, and congruent with Aaron's results, I also get a high linear correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000 sample, even if I filter out disambiguation articles. See scatterplot [1].
So first of all, we can conclude with fair certainty that our cleaning methods are quite similar and are not the cause of major differences in the measurements. Secondly, this seems to be a very interesting distribution where the overall correlation is very high, but in certain sections (I'm using a speculative plural here) of the distribution it is very low. That means we can make the statement "In general, byte size and display char length of an article are highly correlated". This is, however, not automatically valid once you limit the byte size of the articles you look at in any way (for example, by looking only at Featured Articles, Stubs, etc.; I have yet to check these myself). Then you will have to check again whether the statement holds true for the given subsample.
These results show very nicely that random sampling in a population is not an infallible "universal weapon" for inferring information about all subsamples of a population and should, where possible, always be cross-checked with non-random selective sample analysis.
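One way to see how a narrow byte-size band can show a low correlation while the full sample shows a high one: if display chars follow a straight line in byte size plus independent noise, the expected correlation is slope * sd_x / sqrt((slope * sd_x)^2 + sd_noise^2), so it shrinks as the spread of byte sizes (sd_x) shrinks. The sketch below just evaluates that formula; the slope of 0.6, the noise of 800 chars, and the uniform spread within each range are illustrative assumptions and are not fitted to the sample.

```python
import math


def expected_correlation(slope, sd_x, sd_noise):
    """Correlation implied by y = slope * x + noise, with noise independent of x."""
    signal = slope * sd_x
    return signal / math.sqrt(signal ** 2 + sd_noise ** 2)


def uniform_sd(low, high):
    """Standard deviation of values spread uniformly between low and high."""
    return (high - low) / math.sqrt(12)


# Assumed, illustrative parameters (not estimated from the article sample).
slope = 0.6
sd_noise = 800.0

narrow = expected_correlation(slope, uniform_sd(5800, 6000), sd_noise)
wide = expected_correlation(slope, uniform_sd(0, 20000), sd_noise)
print(f"5800-6000 bytes: {narrow:.3f}  0-20000 bytes: {wide:.3f}")
```

Under these assumptions the same linear relationship produces a high correlation over a wide range and a much lower one inside a 200-byte band, without the relationship itself changing.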
The next step I'd like to take is to look at articles above or below a certain size to see if the correlation differs (maybe because a different proportion of their content is made up of templates).
I'll give an update asap.
Best,
Fabian
[1] https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_NOdisamb1.png
-- Giovanni Luca Ciampaglia
Postdoctoral fellow Center for Complex Networks and Systems Research Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408 ☞ http://cnets.indiana.edu/ ✉ [email protected]