Ok, I don't want to spam, but...
I just finished downloading another random sample of 5000 articles with sizes of 5000 bytes or less (this time not excluding disambiguation articles, as the respective DB query is quite slow). For these, the correlation is 0.514. Plot: [1]
This seems to support my earlier observation that the smaller articles (in byte size) *might* show a weaker correlation.
Very interesting topic, too bad I have to get back to project work now.. :)
Best,
Fabian
[1] https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_UNDER5001byte...
________________________________________
From: Giovanni Luca Ciampaglia [[email protected]]
Sent: Wednesday, August 07, 2013 3:55 PM
To: Floeck, Fabian (AIFB)
Subject: Re: [Wiki-research-l] Readable characters vs. size in bytes of articles
Hi Fabian,
in principle you should be able to recover the same correlation in the 5800-6000 byte range as well, provided that you control for the noise in the data. From your scatterplot it looks as though the variance of the residuals is constant (a scatter plot of the residuals should be enough to confirm this), so if you standardize the residuals by their standard deviation you should be able to recover the correlation, though the significance of such an estimate might be at risk if the sample size is small.
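The residual check mentioned above could be sketched roughly as follows. This is only an illustration on synthetic data, not the real article sample: the linear model (display chars = 0.6 * bytes + noise), the noise level, and the helper names `fit_line` and `stdev` are all assumptions made for the example. It fits a least-squares line, standardizes the residuals by their standard deviation, and compares the residual spread for smaller and larger articles.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def stdev(values):
    """Sample standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


rng = random.Random(1)  # fixed seed so the run is repeatable

# Synthetic stand-in for the data (assumed model, NOT the real sample).
sizes = [rng.uniform(500, 100_000) for _ in range(5000)]
chars = [0.6 * b + rng.gauss(0, 1500) for b in sizes]

slope, intercept = fit_line(sizes, chars)
residuals = [c - (slope * b + intercept) for b, c in zip(sizes, chars)]

# Standardize the residuals by their standard deviation.
resid_sd = stdev(residuals)
standardized = [r / resid_sd for r in residuals]

# Rough constant-variance check: residual spread below vs. above the median size.
median_size = sorted(sizes)[len(sizes) // 2]
low = [r for b, r in zip(sizes, residuals) if b < median_size]
high = [r for b, r in zip(sizes, residuals) if b >= median_size]
spread_ratio = stdev(high) / stdev(low)

print(f"slope={slope:.3f}, residual sd={resid_sd:.1f}, spread ratio={spread_ratio:.2f}")
```

A spread ratio close to 1 is what constant residual variance would look like; on real data, a scatter plot of `standardized` against `sizes` would be the more informative check.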
Thanks for the interesting discussion.
Best,
Giovanni
On Wed 07 Aug 2013 08:02:21 AM EDT, Floeck, Fabian (AIFB) wrote:
Update:
Not surprisingly, and consistent with Aaron's results, I also get a high linear correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000 sample, even if I filter out disambiguation articles. See scatterplot [1].
So first of all, we can conclude with fair certainty that our cleaning methods are quite similar and are not the cause of major differences in the measurements. Secondly, this seems to be a very interesting distribution where the overall correlation is very high, but in certain sections (I'm using a speculative plural here) of the distribution it is very low. That means we can make the statement "In general, byte size and display char length of an article are highly correlated". This is, however, not automatically valid once you limit the byte size of the articles you look at in any way, for example by just looking at Featured Articles, Stubs, etc. (I have yet to check these myself). Then you will have to check again whether the statement holds true for the given subsample.
These results show very nicely that random sampling in a population is not an infallible "universal weapon" for inferring information about all subsamples of a population and should, where possible, always be cross-checked with non-random selective sample analysis.
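The pattern described above (a high overall correlation that drops sharply inside a narrow size band) can be reproduced on synthetic data. The sketch below is only illustrative: the linear model, the noise level, the sample size of 50,000, and the helper `pearson` are assumptions chosen for the example, not the actual article data or the method used for the plots.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic stand-in for the article data (assumed model, NOT the real sample):
# byte sizes spread over a wide range, display chars = 0.6 * bytes + noise.
sizes = [rng.uniform(500, 100_000) for _ in range(50_000)]
chars = [0.6 * b + rng.gauss(0, 1500) for b in sizes]

r_all = pearson(sizes, chars)

# Restrict to a narrow size band, analogous to the 5800-6000 sample.
band = [(b, c) for b, c in zip(sizes, chars) if 5800 <= b <= 6000]
r_band = pearson([b for b, _ in band], [c for _, c in band])

print(f"all articles:   n={len(sizes)}, r={r_all:.3f}")
print(f"5800-6000 band: n={len(band)}, r={r_band:.3f}")
```

Inside the band, the byte size varies by at most 200 while the noise stays the same, so the noise dominates and the correlation within the band is much weaker than the overall one, even though the underlying relationship is identical everywhere.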
The next step I'd like to take is to look at articles above or below a certain size to see if the correlation differs (maybe because a different proportion of their content is made up of templates).
I'll give an update asap.
Best,
Fabian
[1] https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_NOdisamb1.png
--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ [email protected]