User:Mill 1/Project Chaining back the Years

Preface

The articles that list the recent deaths consistently rank among the most popular on Wikipedia.[1] However, it must have been in the summer of 2018 that I first got interested in the older versions of them. At the time the dead were listed per month ('dpm's') and per year ('dpy's').[2] I noticed wild differences between them in formatting, guidelines, coverage and sourcing. One explanation is that the dpm of the current month is edited intensively as the month progresses, and a lot of watchers make sure the guidelines are followed during and after the running month. However, this was not the case for pages listing the deceased of the pre-Wikipedia era; they were put in dpy's afterwards.[3] Annoyed by the discrepancies between these types of pages I set out to standardize the formatting of the dpy's first.[4]

During that time I noticed something else which would become the main motivation for the initial phase of this endeavour: days were missing! There seemed to be days on which nobody had died. This could not be and my OCD-tendencies immediately kicked in. An idea formed in my head: why not create dpm's for all months going back to 1995? It would solve the issue of the dpy's becoming very long and I could add missing days when processing a month. I would take the year 2005 as a starting point because I noticed that from 2006 onwards at least one deceased is listed for every day until the present.[5] I remember flinching at the idea when I realised I had to process more than 4000 days. "This is going to take me ten years," I thought. In the end it took only six.

This article tries to give an overview of the activities involved during the project that I dubbed Chaining back the Years.[6] It also states some interesting milestones and statistics.[7]

Three rounds

In hindsight, improving the deaths lists fell into three separate rounds of activities, during each of which every existing dpm and dpy was processed.[8]

You can find information on the initial versions of the dpm's per round here.

Round 1: Breaking up the Deaths in Years

Period: September 2018 – October 2020
Articles: Deaths in January 1997 – Deaths in December 2005

The first phase started by making the dpy's even longer before forking them into twelve separate dpm's. Regarding every month I needed to perform checks, find missing notable deceased for the list ('entries') and compile the wikitext that I could paste in a dpm. Obviously this was way too much work to accomplish by hand.

So before beginning I extended the functionality of the Excel application that I had already used for several other projects. It would prove to be indispensable when processing a month:

Processing a month using the Excel application

Screenshot of the Excel tool which generated most of the wikitext

1. Dpm checks

Before entries would be added/updated, the month at hand would be checked. Existing entries in the list would be cross-referenced with their corresponding bio's to look for discrepancies:

  • Are the existing entries in the correct day sub section?[9]
  • Do the existing entries link to a valid biography article?
  • Do the corresponding bio's contain the correct "[YEAR] deaths" and "[YEAR] births" categories?
  • Does every day sub section contain at least two (later three) entries?
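The last check in the list above can be sketched in a few lines of C#. This is only an illustration, not the Excel tool's actual logic: the class name, the method name and the dictionary that maps a day of the month to its listed names are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; names and data shapes are assumptions for illustration.
public static class DpmChecks
{
    // entriesPerDay maps a day of the month to the names listed in that day sub section.
    // Returns every day of the month that has fewer than minimumEntries entries,
    // including days that have no sub section at all.
    public static List<int> DaysBelowMinimum(
        IDictionary<int, List<string>> entriesPerDay, int year, int month, int minimumEntries)
    {
        int daysInMonth = DateTime.DaysInMonth(year, month);

        return Enumerable.Range(1, daysInMonth)
            .Where(day => !entriesPerDay.TryGetValue(day, out var names) || names.Count < minimumEntries)
            .ToList();
    }
}
```

Called with a minimum of two (later three), the returned list is the set of days that still needs work.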

2. Process specific days of the dpm

After the initial checks the actual work on the article could commence. At first, I focused on filling the gaps in the days of death but soon I decided every day should contain at least two entries.[10] Processing a specific date started by clicking the 'Chk'-button in the 'Death per date' worksheet. The following tasks would then be executed:

  • Resolve the list of bio's whose subject had died on a specific date.[11] More info can be found here.
  • Show the list alphabetized and per bio display if it is a stub or has any 'problem flags' like {{multiple issues}}, {{Notability}}, {{Unreferenced}}, {{mcn}}, {{One source}}.
  • Apply custom filtering to the list. More info on that here.
  • Per bio try to resolve the following parts of an entry by analyzing the bio's wikitext:
    • The (linked) name of the article
    • The date of birth and death to determine the subject's age
    • The nationality and reason for notability (by analyzing the opening sentence of the bio)[12]
    • The cause of death[13]
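Two of these parts lend themselves to a short sketch: the age follows from the two dates, and the footnote on causes of death describes showing the first sentence containing " murdered", " killed" or " died" (in that order). The C# below is one possible reading of that description, not the Excel tool's code. The names are hypothetical, the sentence splitting on ". " is a deliberate simplification, and I assume the keywords are tried one after the other.

```csharp
using System;
using System.Linq;

// Hypothetical helper; the original tool was an Excel application.
public static class EntryParts
{
    // Age in completed years on the date of death.
    public static int AgeAtDeath(DateTime dateOfBirth, DateTime dateOfDeath)
    {
        int age = dateOfDeath.Year - dateOfBirth.Year;

        // One year less if the birthday had not yet been reached in the year of death.
        if (dateOfDeath.Month < dateOfBirth.Month ||
            (dateOfDeath.Month == dateOfBirth.Month && dateOfDeath.Day < dateOfBirth.Day))
            age--;

        return age;
    }

    // Suggests a cause-of-death sentence: the keywords are tried in the stated order and
    // the first sentence containing the current keyword is returned (null if none matches).
    public static string SuggestCauseOfDeath(string bioText)
    {
        string[] keywords = { " murdered", " killed", " died" };
        string[] sentences = bioText.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string keyword in keywords)
        {
            string match = sentences.FirstOrDefault(
                s => s.IndexOf(keyword, StringComparison.Ordinal) >= 0);

            if (match != null)
                return match.Trim();
        }

        return null;
    }
}
```

The suggestion is crude by design: a sentence about someone else being killed wins over a later sentence about the subject's own death, so it still needs a human eye.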

Result filtering

From the start it was also clear to me that some inclusion filtering needed to be applied to the found new (and existing) entries. On some days more than 30 persons with a bio died. Stating them all would make the dpm's unwieldy and error prone. And lesser figures (often stubs and virtual orphans) distract from more notable entries.

So I experimented with conditions like not being a stub or not having problem flags. That did not work. However, the tool looked for the date of death (DoD) only in the infobox of the person's bio. As a consequence, a biography having an infobox acted as a first filter. I also made the application look at the bio's text size (excluding the text in the infobox and the stated categories: the 'net size'). This was the second filter. I settled on 4000 characters as the minimum 'net size' of a biography. This first attempt at grading WP:N worked, but it never sat well with me (and others). It was one of the reasons to initiate round 3.
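The two filters can be illustrated with a small C# sketch. How the Excel tool actually measured the 'net size' is not documented here, so everything below is an assumption: the infobox is cut out by matching nested braces, category links are removed with a regular expression, and the class and method names are made up for the example. Only the 4000-character threshold comes from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch of the Round 1 inclusion filter.
public static class NetSizeFilter
{
    public const int MinimumNetSize = 4000; // threshold named in the text

    // Removes the first {{Infobox ...}} template, taking nested templates into account.
    public static string RemoveInfobox(string wikiText)
    {
        int start = wikiText.IndexOf("{{Infobox", StringComparison.OrdinalIgnoreCase);
        if (start < 0)
            return wikiText;

        int depth = 0;
        int i = start;
        while (i < wikiText.Length - 1)
        {
            if (wikiText[i] == '{' && wikiText[i + 1] == '{') { depth++; i += 2; }
            else if (wikiText[i] == '}' && wikiText[i + 1] == '}')
            {
                depth--;
                i += 2;
                if (depth == 0)
                    return wikiText.Remove(start, i - start);
            }
            else i++;
        }

        // Unbalanced braces: drop everything from the infobox onwards.
        return wikiText.Substring(0, start);
    }

    // Text length without the infobox and without [[Category:...]] links.
    public static int NetSize(string wikiText)
    {
        string text = RemoveInfobox(wikiText);
        text = Regex.Replace(text, @"\[\[Category:[^\]]*\]\]", string.Empty, RegexOptions.IgnoreCase);
        return text.Trim().Length;
    }

    // First filter: an infobox must be present. Second filter: the minimum net size.
    public static bool PassesFilter(string wikiText)
    {
        bool hasInfobox = wikiText.IndexOf("{{Infobox", StringComparison.OrdinalIgnoreCase) >= 0;
        return hasInfobox && NetSize(wikiText) >= MinimumNetSize;
    }
}
```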

3. Concluding processing a month

Processing a month would be concluded by two manual activities:

  • Search for additional causes of death regarding the entries.[14]
  • Are there any 'reason for notability'-descriptions in the entries that need trimming?

Chronology of activities

Work started on 1 September 2018 by applying the same format to the sections and list entries of all the dpy's.[4] For the first couple of months I worked on the 24 existing dpm's of 2004 and 2005, processing each month assisted by the Excel tool as described. At the same time I was also in the process of finishing another project.
On 2 February 2019 I standardized the guidelines and day sub sections of all dpm's between 2004 and 2015. Applying those changes finalized the first round of improvements regarding the dpm's of 2004 and 2005.

I could now focus on the Deaths in Year-pages. The following list shows when an entire year was completed, after which it could be split up into 12 dpm's, finalizing their first round of improvements:

  • 2003: 10 February 2019
  • 2002: 17 February 2019
  • 2001: 17 February 2019
  • 2000: 23 February 2019
  • 1999: 19 May 2019
  • 1998: 9 November 2019
  • 1997: 4 October 2020

Regarding 1998 and 1997 (and 1996) a new dpm was created right after a month had been processed. The 12 dpm's were not created simultaneously anymore as is explained here. Processing of the years 1993-1998 was done in this processing page which would be initialized every time after a dpm was completed.[15][16]

Round 1 saw one final improvement. From the beginning I had noticed that the dpm's lacked references citing the deceased's date (and cause) of death. I had started adding some citations to entries but it seemed to be a drop in the bucket. That's why I introduced a new feasible requirement: at least one reference per day sub section.[17] Around June 2020 I first started thinking about automating citations. The archive API of The New York Times especially offered great possibilities. So I wrote some code to experiment with the NYTimes API, retrieving obituary data and creating citations from it. I pasted the output in another processing user page: /References/The New York Times. The results were spectacular. I could now use this list of generated references as a source. So after processing a day I would also manually add citations of matching entries to the day sub section of the dpm. The first month I processed this way was September 1997. I worked my way back to January 1997, improving and bugfixing the code.

Eventually the software evolved into the WikipediaReferences-application. You can read more about it here. On November 14, 2020 (I learned from the GitHub commit) the application was finally able to add NYTimes-references to the corresponding entries of an entire dpm automatically. I decided to reprocess all the existing dpm's (1997 – 2005) so that their number of stated references would increase considerably. Work started with January 1997 on the same day, heralding the start of the next round.

Milestones

After 25 months round 1 was concluded by creating the last dpm of 1997.[18] By this time I must already have decided to extend the 'chaining back' period back to January 1990.

Round 2: Adding NYTimes references

Screenshot of the WikipediaReferences console application

Period: November 2020 – November 2021
Articles: Deaths in January 1995 – Deaths in December 2005

As already described at the end of the previous section, the success of the WikipediaReferences application prompted me to re-process all dpm's that existed at that time (November 2020). Automatically adding NYTimes references using the tool would also become the additional third activity when wrapping up a month (see 3. Concluding processing a month in Round 1 for the other two activities).

Processing a month using the WikipediaReferences application

Processing a particular dpm usually consisted of these steps:

  • After the regular processing of a dpm was concluded and the last entries were added/updated I would run the software to evaluate a dpm. See screenshot: I would select 'p', followed by some input to tell the application which month and which Wikipedia source page to process.
  • First the app would perform initial checks like looking for duplicate entries. The process is aborted if any issues are encountered.
  • If the initial issues are resolved the month in question is evaluated by comparing the NYTimes obit data with the entries in the dpm. After that the app offers to generate the wikitext, including the added/updated references. However, this was seldom possible right away. In most cases other actions were required first, after which the evaluation was run again. Two types of actions exist:
    • If NYTimes obituary data exists for a listed entry then the resolved death date in the obituary is compared with the date of death in the entry's corresponding bio. Very often discrepancies would exist. One reason is that the death date stated in the bio is wrong.[19] These discrepancies had to be corrected first.
    • The software would also spot potential entries: regarding the particular month NYTimes obit data would exist for bio's that were not present in the dpm. In fact, so many potential entries were suggested that I applied a notability filter on them.[20] I would add most of the suggested entries manually to the dpm source page.
  • After the correction/additions step I would re-run option 'Print month of death', sometimes several times, until no more issues were encountered by the application.
  • After successful evaluation of the dpm I would instruct the app to generate the wikicode in a text file.
  • Processing a specific dpm is concluded by pasting the contents of the text file in the source page of the dpm and checking the result.

Chronology of activities

Right after I uploaded the last code changes I started using the software on the existing dpm's. I really hit the ground running processing the years 1997 - 2000 within 6 weeks, adding and updating over a thousand citations (as well as adding quite a few entries suggested by the application).

By September 2021 I had processed all existing dpm's, increasing the number of references on a page considerably.[21] I could now resume my efforts in the processing page where I prepared brand new dpm's, starting with Deaths in December 1996. By now the software was firmly embedded in the way of working.

1995

However, work was interrupted by another job. An editor had forked Deaths in 1995 into 12 dpm's without any regard for the different style and format, after which he added many entries. It took me a sh*tload of time to bring the new dpm's up to par.[22] The task involved a lot of corrections by hand as well: adding causes of death and shortening entry descriptions, meanwhile battling this lunatic. When cleaning up 1995 I also identified many non-notable entries, many of whom didn't even have an enwiki bio. And by this time I had already decided to reprocess all the days of existing dpm's, partly to apply the new notability algorithm to entries. This would mean that many 1995 entries would be cleansed from the lists. That's why I decided it would be a huge waste of time to apply the WikipediaReferences tool to the 1995 entries; it would take a lot of effort correcting entries that would be removed at a later stage anyway. This is the reason why (although chronologically incorrect) this was actually a Round 1 job.

Milestones

Still using the wiki_client Excel tool, Round 2 came to an end on 31 October 2021 with the creation of Deaths in September 1996.

More details on the progress regarding Round 2 can be found here. In the table click on the title 'Round 2' to sort on the date when the processing of a dpm was finished.

Round 3: New rules: let's process every day (again)

Period: November 2021 – November 2024
Articles: Deaths in November 1989 – Deaths in December 2005

So by now I've been at it for a couple of years and during that period two issues started bugging me more and more:

  1. The notability algorithm is faulty; I'm adding entries whose bio's are semi orphans. At the same time I miss notable entries because their bio's don't have infoboxes.
  2. Most entries do not have citations. After completing Round 2 this was improved somewhat but many dpm's now contain references that almost exclusively point to The New York Times as a source.

Wikidata

Screenshot of the web application Wikipedia Deaths Pages

During my activities I had come across Wikidata when inspecting bio's. At some point I must have noticed that the data stored in a human Wikidata item could serve my purposes, especially these data properties:

  • Item's description (= reason for notability regarding humans)
  • Date of death (DoD) statement
  • Date of birth (DoB) statement (needed to resolve an entry's age)
  • Cause of death statement
  • Number of wiki's in which the human is present

Investigating the Wikidata query capabilities made me realise that using Wikidata as a source offered huge advantages over using an entry's corresponding Wikipedia page. It would help me with the two issues, resolve the cause of death automatically and offer an alternative source for generating the description part of an entry.[23] There was also one final perk of using Wikidata as a source: the death date statement of many items contained references supporting the claim. This information could be used to generate references for entries automatically when processing a dpm. These were all great improvements. I realised that I had to re-process every day between 1990 and 2005 AGAIN. But since it was clear that it would hugely increase the quality and reliability of the dpm's I decided in a heartbeat I would do it. I still had to create the software though, which ultimately would become the WikipediaDeathsPages web application.

At the heart of the app would be the query that would fetch the Wikidata data regarding a specific date of death. Unfortunately I am unfamiliar with the SPARQL query language. Luckily Wikidata:Request a query exists. With the help of volunteers over the course of a couple of months I was finally able to define the query. As input it would only require the date of death. The output is shown below as a table. As you can see it contains the basic data (alphabetized by article name!) I needed to generate the entries for a specific day (in this case 25 August 2001):[24]

item | articlename | itemLabel | itemDescription | sl[25] | dob | dod | dod_refs[26] | cod[27] | mod[28]
Q11617 | Aaliyah | Aaliyah | American singer and actress (1979–2001) | 69 | 1979-01-16T00:00:00Z | 2001-08-25T00:00:00Z | stated in: Nederlandse Top 40~!stated in: Find a Grave~!Find a Grave memorial ID: 5727911~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Aaliyah~!subject named as: Aaliyah Dana Haughton~!Nederlandse Top 40 artist ID: aaliyah~!stated in: Integrated Authority File~!retrieved: 2014-04-09T00:00:00Z | aviation accident | accidental death
Q3298163 | Madge Adam | Madge Adam | English solar astronomer (1912-2001) | 15 | 1912-03-06T00:00:00Z | 2001-08-25T00:00:00Z | stated in: Who's Who~!Who's Who UK ID: U4983~!imported from Wikimedia project: English Wikipedia | |
Q6779010 | Mary Barnard | Mary Barnard | American poet and translator (1909-2001) | 3 | 1909-12-06T00:00:00Z | 2001-08-25T00:00:00Z | stated in: SNAC~!stated in: Find a Grave~!Find a Grave memorial ID: 6318601~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Mary Barnard~!subject named as: Mary Ethel Barnard~!SNAC ARK ID: w60s047j | |
Q1037163 | Carl Brewer (ice hockey) | Carl Brewer | Canadian ice hockey player (1938-2001) | 9 | 1938-10-21T00:00:00Z | 2001-08-25T00:00:00Z | stated in: SNAC~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Carl Brewer~!SNAC ARK ID: w6f76nsq~!stated in: Find a Grave~!Find a Grave memorial ID: 8466339~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Carl Thomas Brewer | |
Q10294559 | Helmut Bruck | Helmut Bruck | German officer and Knight's Cross recipient | 3 | 1913-02-16T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | |
Q93784 | John Chambers (make-up artist) | John Chambers | American make-up artist and prosthetic makeup expert | 12 | 1923-09-12T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Italian Wikipedia | |
Q8079499 | Üzeyir Garih | Üzeyir Garih | Turkish businessman | 4 | 1929-01-01T00:00:00Z | 2001-08-25T00:00:00Z | | |
Q3547943 | Diana Golden (skier) | Diana Golden | American alpine skier (1963-2001) | 6 | 1963-03-20T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | breast cancer | natural causes
Q6033955 | Inigo Jackson | Inigo Jackson | actor (1933-2001) | 1 | 1933-07-19T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | |
Q155493 | Philippe Léotard | Philippe Léotard | French singer and actor (1940-2001) | 21 | 1940-08-28T00:00:00Z | 2001-08-25T00:00:00Z | GND ID: 119002469~!stated in: Roglo~!stated in: Integrated Authority File~!stated in: GeneaStar~!stated in: Who's Who in France~!stated in: Find a Grave~!Find a Grave memorial ID: 5860980~!retrieved: 2015-10-18T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Philippe Leotard~!Who's Who in France biography ID: 25159~!Roglo person ID: p=philippe;n=leotard~!GeneaStar person ID: leotardp~!stated in: filmportal.de~!stated in: BnF authorities~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2015-10-10T00:00:00Z~!reference URL: http://data.bnf.fr/ark:/12148/cb12070631t ~!subject named as: Philippe Léotard~!Filmportal ID: 0216ac0cf8fb4ce3a3e417812c4a5a72 | respiratory failure | natural causes
Q3764794 | Ginzō Matsuo | Ginzō Matsuo | Japanese actor, voice actor and narrator | 8 | 1951-12-26T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | |
Q6243659 | John L. Nelson | John L. Nelson | American jazz musician, songwriter, father of Prince | 6 | 1916-06-29T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | |
Q862381 | Bill Pratney | Bill Pratney | New Zealand cyclist (1909-2001) | 2 | 1909-05-20T00:00:00Z | 2001-08-25T00:00:00Z | | |
Q5671841 | Harry Ramberg | Harry Ramberg | Swedish tennis player | 4 | 1909-04-06T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Swedish Wikipedia | |
Q4807036 | Asit Sen (director) | Asit Sen | film director | 6 | 1922-09-24T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | |
Q106222009 | Ben Oumar Sy | Ben Oumar Sy | Guinean footballer and manager | 1 | 1926-01-08T00:00:00Z | 2001-08-25T00:00:00Z | | |
Q173413 | Ken Tyrrell | Ken Tyrrell | Racing driver and Formula one team owner (1924-2001) | 18 | 1924-05-03T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Russian Wikipedia~!stated in: Encyclopædia Britannica Online~!retrieved: 2017-10-09T00:00:00Z~!Encyclopædia Britannica Online ID: biography/Ken-Tyrrell~!subject named as: Ken Tyrrell | pancreatic cancer | natural causes
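The dod_refs column packs all reference claims into one string, with "~!" between the claims and "name: value" inside each claim. A minimal C# sketch of how such a string could be taken apart is given below. The class and method names are hypothetical; the actual application may well parse the field differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical parser for the dod_refs column of the query result.
public static class DodRefsParser
{
    // Splits "stated in: X~!retrieved: Y~!..." into a lookup of claim name -> values.
    // A claim name can occur several times, hence the list of values per name.
    public static Dictionary<string, List<string>> Parse(string dodRefs)
    {
        var result = new Dictionary<string, List<string>>();

        if (string.IsNullOrWhiteSpace(dodRefs))
            return result;

        foreach (string part in dodRefs.Split(new[] { "~!" }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the first ": " only; values such as URLs contain colons themselves.
            int separator = part.IndexOf(": ", StringComparison.Ordinal);
            if (separator < 0)
                continue;

            string name = part.Substring(0, separator).Trim();
            string value = part.Substring(separator + 2).Trim();

            if (!result.TryGetValue(name, out List<string> values))
            {
                values = new List<string>();
                result[name] = values;
            }

            values.Add(value);
        }

        return result;
    }
}
```

For the row of Madge Adam this yields three claim names: "stated in", "Who's Who UK ID" and "imported from Wikimedia project".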

Rethinking notability

As already explained, the algorithm that decided if a deceased should be listed was flawed. I had already noticed that more relevant people appear on more wiki's (winner). I also came to believe that more links to a bio suggest greater notability. The Wikidata query returned the number of site links per entry. The Wikipedia link count API could resolve the number of incoming links. At some point I came up with the concept of the "notability score" of a potential entry. This score is expressed as the product of the two aforementioned data points. Take for instance John Chambers (make-up artist):

Number of site links: 12 (see column 'sl' in above table)
Number of pages linking to the bio: 237 (Link Count tool result, API result)
Hence John's notability score would be 12 * 237 = 2,844

After much experimenting I settled for a minimum score of 48[29] for an entry to be listed. Although still not perfect it worked way better than the previous algorithm, with this as the end result.
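Expressed in C#, the rule comes down to one multiplication and one comparison. The class and member names below are made up for the example, and I assume the minimum of 48 is inclusive.

```csharp
// Hypothetical helper illustrating the notability score.
public static class Notability
{
    public const int MinimumScore = 48; // minimum named in the text, assumed inclusive

    // Score = number of Wikidata site links * number of pages linking to the bio.
    public static int Score(int siteLinks, int incomingLinks)
    {
        return siteLinks * incomingLinks;
    }

    public static bool IsListable(int siteLinks, int incomingLinks)
    {
        return Score(siteLinks, incomingLinks) >= MinimumScore;
    }
}
```

With the John Chambers numbers, Notability.Score(12, 237) gives 2844, far above the minimum.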

References, revisited

Wikidata references

When I was building the Wikidata-query I had noticed that some online sources were stated quite often as references for death date statements for humans. Because of the structured way this information was stored I could use it to generate citations for my entries. Obviously the online source is checked for existence and its contents searched for the date of death (DoD) before the information is used to create a reference.

The following sources are evaluated, in this specific order:

  1. Encyclopædia Britannica
  2. The Guardian
  3. The Independent
  4. Internet Broadway Database
  5. DB~e
  6. Biografisch Portaal
  7. FemBio
  8. filmportal.de
  9. Fichier des personnes décédées

This is an example of a generated reference based on the Wikidata DoD statement claims of José Craveirinha:
<ref>{{cite web |last1= |first1= |title=José Craveirinha |url=https://www.britannica.com/biography/Jose-Craveirinha |website=britannica.com |publisher=Encyclopædia Britannica Online |access-date=24 December 2023 |language= |date=}}</ref> [30]
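Generating such a reference is mostly string building. The C# sketch below reproduces the layout of the example above; the class name and parameters are hypothetical, and deriving the website field from the URL's host name (minus "www.") is my assumption about how that field was filled.

```csharp
using System;
using System.Globalization;

// Hypothetical builder for the {{cite web}} references shown in this article.
public static class CiteWeb
{
    public static string Build(string title, string url, string publisher, DateTime accessDate)
    {
        // Assumption: the website field is the host name without a leading "www.".
        string website = new Uri(url).Host;
        if (website.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
            website = website.Substring(4);

        // Access date in the day-month-year style used in the examples.
        string accessed = accessDate.ToString("d MMMM yyyy", CultureInfo.InvariantCulture);

        return "<ref>{{cite web |last1= |first1= |title=" + title +
               " |url=" + url +
               " |website=" + website +
               " |publisher=" + publisher +
               " |access-date=" + accessed +
               " |language= |date=}}</ref>";
    }
}
```

Calling Build with the title, Britannica URL and publisher from the José Craveirinha example and an access date of 24 December 2023 returns the same wikitext as shown above.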

Sports sites references

While implementing this I discovered an alternative way of automatically utilizing online sources. Websites use specific URL patterns to identify resources on the host. Some websites use name-based patterns. For instance, the site Cycling statistics uses the following URL to identify rider Jacques Anquetil:

https://www.procyclingstats.com/rider/jacques-anquetil

Knowing the specific pattern I could 'guess' URLs using the label name of an entry. When processing DoD November 2, 2004 for instance, rider Gerrie Knetemann would be one of the deceased returned by the Wikidata-query.

The software would send https://www.procyclingstats.com/rider/gerrie-knetemann as a request. If the web page exists its html is searched for the DoD.[31] If the DoD is found the web page can act as a citation and the following web reference is generated:

<ref>{{cite web |last1= |first1= |title=Gerrie Knetemann |url=https://www.procyclingstats.com/rider/gerrie-knetemann |website=procyclingstats.com |publisher= |access-date=16 December 2023 |language= |date=}}</ref> [32]
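A rough C# sketch of the guessing step follows. It only covers what the two examples show: lower-casing the label and replacing spaces with hyphens. Names with accents or punctuation would need extra handling that I do not know the details of. The class name, the method names and the idea of passing the date formats to look for as parameters are assumptions; the actual http request is left out.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical sketch of guessing a name-based url and checking a page for the DoD.
public static class UrlGuesser
{
    // "Gerrie Knetemann" -> "gerrie-knetemann" (accents and punctuation are not handled).
    public static string ToSlug(string label)
    {
        var slug = new StringBuilder();

        foreach (char c in label.Trim().ToLowerInvariant())
            slug.Append(char.IsWhiteSpace(c) ? '-' : c);

        return slug.ToString();
    }

    public static string GuessRiderUrl(string label)
    {
        return "https://www.procyclingstats.com/rider/" + ToSlug(label);
    }

    // True if the html contains the date of death written in one of the given formats.
    public static bool MentionsDate(string html, DateTime dateOfDeath, params string[] formats)
    {
        return formats.Any(format =>
            html.IndexOf(dateOfDeath.ToString(format, CultureInfo.InvariantCulture),
                         StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

A plain text search like this can produce false positives ("12 November 2004" also contains "2 November 2004"), so a real check would have to be stricter.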

This way of looking for citation sources is used when no Wikidata DoD-references were encountered. The mechanism was applied to the following (sports) web sites, in this order:

Note: To decrease the number of http requests per entry I first looked in the entry's bio to determine if the person was known for any of the sports being evaluated. Only then the url would be compiled and called.

Second tier Wikidata references

If no sports site reference could be resolved the following Wikidata reference sources are evaluated (in that order):

Since these sources are stated very often as Wikidata DoD claims they now appear in abundance as references in the dpm's:
<ref>{{cite web |last1= |first1= |title=Jeanne Stuart - Social Networks and Archival Context |url=https://snaccooperative.org/ark:/99166/w6qp9q9c |website=snaccooperative.org |publisher= |access-date=24 December 2023 |language= |date=}}</ref> [34]

I had finally established an acceptable way of resolving notability and generating citations. Now I only had to cast it into a user-friendly solution (the user being me).

Wikipedia Deaths Pages

From the start it was clear the solution was to be a web application. Because of the amount of text a console app would not be suitable, and by then I had enough experience with the web application framework Angular to feel comfortable creating a single-page application to meet my front end needs.
I cannot determine when I started developing the web site. What is certain is that the new software was first used on 16 November 2021 (see Milestones). A lot of tweaking of the code followed in the weeks after. I remember expanding the citations functionality and bugfixing the Wikidata query.

When the first version was released the site contained all the functionality to process a dpm the way the Excel tool did, but with the implemented improvements.

To achieve this, functionality present in the Excel tool had to be programmed again, for instance:

  • Initial dpm checks
  • Resolving data in the entry's bio, for instance the entry's description
  • Numerous text manipulation functions

More in-depth information on the app can be found here. But how was the web site used when processing a dpm?

Processing a month using the Web application

Results of the initial article checks in the web application

A dpm article would be updated by the following steps:

  1. Perform the initial dpm checks. Consult #1. Dpm checks in Round 1 for specifics. Additional checks looked for article redirects and named references. See the screenshot for an example of the check results.
  2. Any issues found have to be solved first e.g. moving an entry to the correct day-subsection in the dpm, fixing redirects, removing nowiki-entries, adding categories or correcting the DoD in the entry's bio.
  3. If all issues are solved processing the days in the dpm can commence.

Code excerpt

Example of the C# code handling a piece of the challenge to determine the description part of the entry (which denotes the reason for a person being notable).

public string ResolveDescription(string wikiText)
{
    // Strip the references before analysing the text
    wikiText = RemoveReferences(wikiText);

    // Text following " was a ", " was an ", etc. in the opening sentence (null if not found)
    string description = GetInitialDescription(wikiText);

    if (description == null)
        return null;

    // Replace the abbreviation first: the '.' in "U.S." would otherwise match the end candidate '.'
    description = description.Replace("U.S. ", "American ", StringComparison.OrdinalIgnoreCase);
    description = description.Replace("United States ", "American ", StringComparison.OrdinalIgnoreCase);

    // Truncate string;  [,] [perhaps/probably] [best] known [mostly] for  ..  etc.
    string[] endCandidates = new string[] { "Infobox", "infobox", "{|", "{{", " who ", " whose ", " notable ", " noted ", " known ", " better ", " spanning ", " originally ", " widely ", " responsible ", " remembered ",  " best ", " most ", " perhaps ", " reputed ", " born ", " considered ", " particularly ", "." };

    int posEnd = GetPositionDescriptionEnd(description, endCandidates);

    // No end candidate found: the opening sentence of the bio has to be fixed by hand
    if (posEnd == InitialPosEnd)
        throw new InvalidWikipediaPageException($"None of the {endCandidates.Length} 'description end' candidates found (including '.') within {InitialMaxLengthDescription} chars from 'description start'. Change the opening sentence of the article. Description: \r\n{description}");

    description = description.Substring(0, posEnd);

    // Return plain text: wiki links are removed from the description
    return RemoveWikiLinks(description);
}

private string GetInitialDescription(string wikiText)
{
    string[] descriptionStarts = new string[] { " was a ", " was an ", " was the ", " was one of ", " was " }; // " was " LAST!

    int pos = GetPositioninWikiText(wikiText, descriptionStarts);

    if (pos == -1)
        return null;

    return wikiText.Substring(pos, Math.Min(InitialMaxLengthDescription, wikiText.Length - pos));
}

Milestones

Temp (under construction)

Additional dependency: Wikidata deviates..

algorithm based on Wikidata

  • New tooling: web application
  • New notability rules
  • Add cause of death automatically

Wikidata editor

Chronology of activities

From Oct 2023:

Number of pages linking to the bio new method: 46 instead of 237 (result, API result[36])

Milestones

  • The first day generated using the new software was 1 August 1996 on 16 November 2021

Side effects (under construction)

During the entire process I would find many errors in the analyzed bio's. I must have corrected thousands of bio's during the course of this project. The most common fixes to bio's:

Also:

  • Wikidata is not magically updated when Wikipedia content changes. As a consequence I made some 3,000 edits in Wikidata to sync the death (and often birth) data.
  • Seven repositories on GitHub containing Wikipedia-related software
  • Created articles, e.g. Lesley Cunliffe and Kambara Tai
  • Created new dpy's for the years Deaths in 1980 – Deaths in 1989 because on 'date' they were removed from the Year-pages (half of them have already been redirected to dpm's)
  • Currently (15 Nov. 2024) I rank number 1639 as Wikipedian with the most number of edits (64,720)!

Statistics (under construction)

/Statistics

The statistics cover the period 1990-2005.

Statistics per 6 November 2024

  • Total number of entries: 42,765
  • Total number of references: 27,268
  • Overall reference density[35]: 63.76% (27,268/42,765)
  • Total number of bytes (approx.): 10.7 million
  • Which translates to approx. 1,850 pages (A4)
  • Average number of entries per death day: 7.32 (42,765/5,843)
  • Average number of references per death day: 4.67 (27,268/5,843)
  • Average number of entries per dpm: 222.73 (42,765/(12 * 16))
  • Average number of references per dpm: 142.02 (27,268/(12 * 16))
  • Death day with the most entries (74): Deaths in September 2001#11
  • Death day with the 2nd most entries (24): Deaths in December 2004#26
  • Death day with the 3rd most entries (23): Deaths in April 1993#27
  • Dpm with the most entries (310): Deaths in December 2005
  • Dpm with the most references (229): Deaths in December 1995
  • Dpm with highest reference density[35] (82.33%): Deaths in January 1999
  • Minimum reference density regarding all processed days: 30%
  • Month with the most deaths: December (3,980) (2nd: January (3,901))
  • Month with the least deaths: June (3,326)
  • Total number of views for all dpm's per year (2023): 846,402 (details)
  • Which translates to 4,408 views per page per year (= 367 views per dpm per month)
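The derived figures in this list are simple ratios of the totals. As a small worked example in C# (the class and method names are made up for the occasion), the reference density and the averages can be recomputed like this:

```csharp
using System;

// Hypothetical helper that recomputes the derived statistics from the totals.
public static class DeathsPagesStats
{
    // Reference density: references as a percentage of entries, rounded to two decimals.
    public static double ReferenceDensity(int references, int entries)
    {
        return Math.Round(100.0 * references / entries, 2);
    }

    // Average per unit (death day or dpm), rounded to two decimals.
    public static double AveragePerUnit(int total, int units)
    {
        return Math.Round((double)total / units, 2);
    }
}
```

ReferenceDensity(27268, 42765) gives 63.76, AveragePerUnit(42765, 5843) gives 7.32 and AveragePerUnit(42765, 12 * 16) gives 222.73, matching the figures in the list.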

Statistics regarding the project

  • Duration: 6 years and 2 months (September 2018 – November 2024)
  • Number of death days processed: 5,843
  • Number of created dpm's: 170
  • Number of added entries (approx.): 21,200
  • Number of added references (approx.): 22,700
  • Number of added bytes (approx.): 7.9 million
  • Which translates to approx. 1,400 pages (A4)
  • Number of edits (approx.): 22,000[38] (details)
  • Number of edits on Wikidata (approx.) (manual[39] and automated[40]): 3,000

Epilogue

One question remains: Why? Why would anyone spend that much time on these trivial lists? Sure, I stumbled across a mess when I was looking for a challenge to help me become a better programmer. And in a way I became a slave of the applications I created; the custom and personal software worked so well that I felt the responsibility of seeing it through. Perhaps I just wanted to leave something behind, albeit insignificant.
Or maybe, as Tony Stark put it: "Everybody needs a hobby."

References

  1. ^ "Announcing Wikipedia's most popular articles of 2023". Wikimedia Foundation. 5 December 2023. Retrieved 20 January 2024.
  2. ^ In 2018 dpm's only existed for 2004 and later. Older deceased were organised in dpy's that existed for the years 1995–2003 (most of which were getting very long at the time).
  3. ^ Adding recent deaths has more or less been going on since November 2001 starting with the (red link) addition of Melanie Thornton (strangely in article Deaths in 2003). From December 2003 onwards it took off in earnest, accelerating in the following years.
  4. ^ a b I wrote some code to help me accomplish the task.
  5. ^ I wrote some code to check that as well.
  6. ^ Named after Holding Back the Years, a hit song on the first vinyl album I ever bought.
  7. ^ Probably interesting for me exclusively ("Wikipedia" Activities available; just add meaning.)
  8. ^ Apart from the three main rounds other smaller improvement iterations were done as well like:
  9. ^ The cause of these errors is very often that the date of death in corresponding bio's had changed but was not reflected in the list.
  10. ^ And even later I decided every day sub-section should list a minimum of three entries. After that I did the same regarding the minimum number of references per day, and so on.
  11. ^ Another way would have been to go through everyone listed in the category of deaths of a specific year. However, this would have meant processing the months of an entire year simultaneously. And I still would have had to query the bio's in search of the subject's date of death. Also, as I would find out, many bio's stated incorrect categories regarding the year of death (and birth).
  12. ^ In a lot of cases the nationality of a person was missing in the opening sentence so I had to fix the bio. Americans especially forget that the English Wikipedia is an international venture.
  13. ^ Causes of death of a person were suggested by displaying the first sentence in the bio that contained the string literals " murdered", " killed" or " died" (in that order). Although crude, this algorithm worked well and saved me a lot of time.
  14. ^ I found out that above the age of about 65 the cause of death is often not stated in a bio because, well, they just die of old age and 'natural causes' is not a valid cause of death (approaching the age mentioned made this work a tad confronting at times).
  15. ^ During the course of the project a whopping total of 5,078 edits were made to this page.
  16. ^ For undisclosed reasons 1993 and 1995 were partially processed in two other pages.
  17. ^ This minimum was increased to two during round 3.
  18. ^ Actually this round was concluded when dpm Deaths in December 1995 was completed. This is explained here.
  19. ^ It's staggering how many editors confuse the date of publication of a cited source with the date of demise.
  20. ^ During the course of the project the notability filter was subject to change. First I used the 'net article size filter'. This was later changed to the filter applied in Round 3: the number of incoming links to the corresponding article.
  21. ^ This corresponds with the gap of 10 months during which no work was done in the processing page.
  22. ^ I wrote some code to fix the format and some other stuff.
  23. ^ In almost all cases the information in the opening sentence of a bio proved to be more useful than the Wikidata description, however.
  24. ^ Actually the data was returned by Wikidata as JSON, after which it was deserialized to fitting objects.
  25. ^ Site links; the number of wiki's (including the English Wikipedia) in which the item is present.
  26. ^ References regarding the DoD (date of death). Data is delimited by the text '~!'
  27. ^ Cause of death
  28. ^ Manner of death
  29. ^ Initially this limit was 50 but soon I changed it to 48 because of its factorization qualities.
  30. ^ "José Craveirinha". britannica.com. Encyclopædia Britannica Online. Retrieved 24 December 2023.
  31. ^ The web sites used specific date formats to display the death date. Obviously this had to be taken into account when looking for the date.
  32. ^ "Gerrie Knetemann". procyclingstats.com. Retrieved 16 December 2023.
  33. ^ Not very successful, only 10 generated citations in total.
  34. ^ "Jeanne Stuart - Social Networks and Archival Context". snaccooperative.org. Retrieved 24 December 2023.
  35. ^ a b c The reference density is the number of refs / number of entries
  36. ^ Increase parameter 'srlimit' to see more link search results
  37. ^ Sometimes the person was still deemed alive in the bio until the correction
  38. ^ This breaks down to an average of slightly less than one added entry per edit but slightly more than one added reference per edit!
  39. ^ My Wikidata Preferences page states 2,817 edits (as of 15 Nov 2024).
  40. ^ The address of the client changes so only a limited set of edits are shown per session. 40 sessions * 5 edits per session = 200 automated edits
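
The cause-of-death heuristic described in note 13 can be sketched roughly as follows. This is not the project's actual code: the function names, the naive sentence splitter and the reading of "in that order" as a keyword priority (any " murdered" sentence wins over " killed", which wins over " died") are all assumptions made for illustration.

```python
import re

# String literals from note 13, in priority order (leading space included).
KEYWORDS = (" murdered", " killed", " died")


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def suggest_cause_sentence(bio_text):
    """Return the first sentence containing the highest-priority keyword, or None."""
    sentences = split_sentences(bio_text)
    for keyword in KEYWORDS:
        for sentence in sentences:
            if keyword in sentence:
                return sentence
    return None


if __name__ == "__main__":
    # Hypothetical bio text, for illustration only.
    bio = (
        "John Doe was a British actor. "
        "He died in 2001. "
        "He was killed in a car crash near Leeds."
    )
    print(suggest_cause_sentence(bio))
```

As in the note, the match is a plain case-sensitive substring test, so the suggestion is only a hint for the editor to confirm or reject.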