Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

6 Aug 2013

Hello,
When I made some observations on language versions in 2008, it struck me
that in some cases the wikisyntax and the "meta article information" took up
more KB than the whole encyclopedic content of an article. For example, more
than 50 % of the characters in the wikicode of the article "Berlin" in Upper
Sorabian were used for categories, interwiki links etc. This led me to
largely disregard the corresponding features of the Wikimedia statistics.
Kind regards
Ziko
On Tuesday, 6 August 2013, Aaron Halfaker wrote:
...
I am removing all HTML tags and comments to include only those characters
that are shown on the screen.  This will include the content of tables
without including the markup contained within.  In other words, I stripped
anything out of the HTML that looked like a tag (e.g. "<foo>" and "</bar>")
or a comment ("<!-- [...] -->") but kept the in-between characters,
whitespace and all.
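
A minimal sketch of the stripping described above, assuming two simple regular expressions are enough to recognise "anything that looks like a tag" and comments. The names `strip_markup`, `COMMENT_RE` and `TAG_RE` are illustrative and not taken from Aaron's actual script:

```python
import re

# Hypothetical patterns: comments are removed first, so that a ">" inside
# a comment cannot be mistaken for the end of a tag.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(html: str) -> str:
    """Drop comments and tag-like spans, keep everything in between."""
    without_comments = COMMENT_RE.sub("", html)
    return TAG_RE.sub("", without_comments)


sample = (
    "<table><tr><td>Berlin</td></tr></table>"
    "<!-- hidden --> <p>is a <b>city</b>.</p>"
)
visible = strip_markup(sample)
print(repr(visible), len(visible))
```

The table cell's content survives while the table markup and the comment disappear, and whitespace between elements is kept as it is.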
It seems much more reasonable to me that the difference is due to the fact
that Fabian's dataset is limited to a very narrow range of bytes.  To check
this hypothesis, I drew a new sample of pages with byte length between 5800
and 6000.
The Pearson correlation that I found for that sample is 0.06466406. This
corresponds nicely to the poor correlation that Fabian found.
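
The range-restriction effect can be illustrated with a small sketch on purely synthetic data. Everything below is an assumption made for illustration: the linear relation, the deterministic pseudo-noise and the page sizes are invented and do not reproduce either Aaron's or Fabian's dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic pages (NOT real Wikipedia data): byte lengths from 3000 to
# 22000, readable length = half the bytes plus a bounded, deterministic
# pseudo-noise term so the example is reproducible without a random seed.
pages = []
for i in range(1901):
    byte_length = 3000 + 10 * i
    pseudo_noise = ((i * 37) % 201 - 100) * 10  # between -1000 and 1000
    content_length = 0.5 * byte_length + pseudo_noise
    pages.append((byte_length, content_length))


def correlation(sample):
    """Correlation between byte length and readable length of a sample."""
    xs = [b for b, _ in sample]
    ys = [c for _, c in sample]
    return pearson(xs, ys)


# Same restriction as in the message above: only pages of 5800-6000 bytes.
narrow = [(b, c) for b, c in pages if 5800 <= b <= 6000]

r_full = correlation(pages)
r_narrow = correlation(narrow)
print(f"full range: {r_full:.3f}, 5800-6000 bytes only: {r_narrow:.3f}")
```

In this toy setup the coefficient over the full range is high, while inside the 200-byte band the noise dwarfs the trend and the coefficient is close to zero, even though the underlying relation is the same.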

I've updated the plot[1] to show the difference visually.
-Aaron

[1] http://commons.wikimedia.org/wiki/File:Bytes.content_length.scatter.correlat...
On Tue, Aug 6, 2013 at 6:04 AM, WereSpielChequers <
[email protected]> wrote:
Thanks both of you,
I suspect that you two are using very different rules to define "readable
characters", and for Aaron to get a close correlation and Fabian not to get
any correlation implies to me that Fabian is stripping out the things that
are not linked to article size, and that Aaron may be leaving such things
in.
For reasons that I'm going to pretend I don't understand, we have some
articles with a lot of redundant spaces. Others with so few you'd be
correct in thinking that certain editors have been making semiautomated
edits to strip out those spaces. I suspect that Fabian's formula ignores
redundant spaces, and that Aaron's does not.
I picked on alt text because it is very patchy across the pedia, but
usually consistent at article level. I.e., if someone has written a whole
paragraph of alt text for one picture they have probably done so for every
picture in an article, and conversely many articles will have no alt text
at all.
Similarly we have headings, and counterintuitively it is the subheadings
that add most non-display characters. So an article like Peasants' Revolt
will have 32 equals signs for its 8 headings, but 60 equals signs for its 10
subheadings. 92 bytes which I suspect one or both of you will have stripped
out. The actual display text of course omits all 92 of those bytes, but
repeats the content of those headings and subheadings in the contents
section.
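
The arithmetic above can be written out as a short sketch. It assumes that headings use level-2 wikitext markup ("== Heading ==") and subheadings level-3 ("=== Subheading ==="); the helper name `heading_markup_bytes` is made up for illustration.

```python
def heading_markup_bytes(count: int, level: int) -> int:
    """Bytes spent on '=' signs for `count` headings at wikitext `level`.

    A level-2 heading is written "== Title ==", i.e. `level` equals
    signs on each side of the title.
    """
    return count * level * 2


heading_bytes = heading_markup_bytes(8, level=2)      # 8 headings
subheading_bytes = heading_markup_bytes(10, level=3)  # 10 subheadings
total_bytes = heading_bytes + subheading_bytes
print(heading_bytes, subheading_bytes, total_bytes)
```

This gives 32 and 60 equals signs respectively, 92 bytes in total, matching the figures quoted for the example article.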
The size of sections varies enormously from one article to another, and
if there are three or fewer sections the contents section is not generated
at all. I suspect that the average length of section headings also has
quite a bit of variance as it is a stylistic choice. So I would expect that
a "display bytes" count that simply stripped out the multiple equals signs
would still be a pretty good correlation with article size, but a display
bytes count that factored in the complication that headings and subheadings
are displayed twice as they are repeated in the contents field, would have
another factor drifting it away from a good correlation with raw byte count.
But probably the biggest variance will be over infoboxes, tables, picture
captions, hidden comments and the like. If you strip all of them out,
including perhaps even the headings, captions and table contents, then you
are going to get a very poor fit between article length and readable byte
size. But I would be surprised if you could get Fabian's minimum display
size of 95 bytes from 6,000 byte articles without having at least one
article that consisted almost entirely of tables and which had been reduced
to a sentence or two of narrative. So my suspicion is that Aaron's plot is
at least including the displayed contents of tables et al whilst Fabian is
only measuring the prose sections and completely stripping out anything in
a table.
Both approaches of course have their merits, and there are even some
editors who were recently edit warring to keep articles they cared about
free from clutter caused by infoboxes and tables.
Regards
Jonathan
On 5 August 2013 21:16, Floeck, Fabian (AIFB) [email protected] wrote:
Hi,
thanks for your feedback Jonathan and Aaron.
@Jonathan: You are rightfully pointing at some things that could have been
done differently, as this was just an ad-hoc experiment. What I did was
fetch the result of "
http://en.wikipedia.org/w/api.php?action=parse&prop=text&pageid=X" with curl
and run it through BeautifulSoup [1] in Python.
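
As a rough sketch of this kind of parser-based extraction: the snippet below substitutes Python's standard-library `html.parser` for BeautifulSoup so that it runs without third-party packages, and it works on an inline sample string instead of a live API response. It is not Fabian's script; `VisibleTextExtractor`, `readable_text` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect only text nodes; tags, attributes and comments are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags. Attribute values such as alt text
        # never arrive here, and comments go to handle_comment instead.
        self.chunks.append(data)


def readable_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical stand-in for the HTML returned by action=parse.
sample = (
    '<p>See <img src="a.png" alt="A long description">'
    "the film<sup>[1]</sup>.</p><!-- note -->"
)
text = readable_text(sample)
print(repr(text), len(text))
```

Because the alt text lives in an attribute, it drops out of the count automatically, while the visible reference marker "[1]" stays in. Note that this simple version would also keep the contents of script or style elements if the input contained any.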
Regarding references: yes, all the markup that a human reader cannot see as
readable characters when looking at an article was stripped away. Take [2]
as an example: in the final output (which was the base
for counting chars) what is left in characters of this reference is the
readable "[1]" and " ^ William Goldenberg at the Internet Movie Database".
Regarding alt text: it was completely stripped out. This can arguably be
done differently, if you see it as "readable main article text" as well.
You are surely right that including these would lead to a higher
correlation. Looking at samples from the output, the increase in
correlation will however not be very big, but
