User:Mill 1/Project Chaining back the Years
"This is going to take me ten years," I thought. In the end it took only six.
Preface
The articles that list recent deaths consistently rank among the most popular on Wikipedia.[1] It must have been in the summer of 2018, however, that I first got interested in the older versions of them. At the time the dead were listed per month ('dpm's') and per year ('dpy's').[2] I noticed wild differences between them in formatting, guidelines, coverage and sourcing. One explanation is that the dpm of the current month is edited intensively as the month progresses, and a lot of watchers make sure the guidelines are followed during and after the running month. This was not the case for pages listing the deceased of the pre-Wikipedia era; they were put in dpy's afterwards.[3] Annoyed by the discrepancies between these types of pages, I set out to standardize the formatting of the dpy's first.[4]
During that time I noticed something else which would become the main motivation for the initial phase of this endeavour: days were missing! There seemed to be days on which nobody had died. This could not be, and my OCD tendencies immediately kicked in. An idea formed in my head: why not create dpm's for all months going back to 1995? It would solve the issue of the dpy's becoming very long and I could add missing days when processing a month. I would take the year 2005 as a starting point because I noticed that from 2006 onwards at least one deceased is listed for every day until the present.[5] I remember flinching at the idea when I realised I had to process more than 4000 days. "This is going to take me ten years," I thought. In the end it took only six.
This article tries to give an overview of the activities involved in the project that I dubbed Chaining back the Years.[6] It also lists some interesting milestones and statistics.[7]
Three rounds
In hindsight, improving the deaths lists fell into three separate rounds of activities, during each of which every existing dpm and dpy was processed.[8]
- Round 1: Breaking up the Deaths in Years (September 2018 – October 2020)
- Round 2: Adding NYTimes references (November 2020 – October 2021)
- Round 3: New rules: let's process every day (again) (November 2021 – November 2024)
You can find information on the initial versions of the dpm's per round here.
Round 1: Breaking up the Deaths in Years
Period: September 2018 – October 2020
Articles: Deaths in January 1997 – Deaths in December 2005
The first phase started by making the dpy's even longer before forking them into twelve separate dpm's. Regarding every month I needed to perform checks, find missing notable deceased for the list ('entries') and compile the wikitext that I could paste in a dpm. Obviously this was way too much work to accomplish by hand.
So before beginning I extended the functionality of the Excel application that I had already used for several other projects. It would prove to be indispensable when processing a month:
Processing a month using the Excel application
1. Dpm checks
Before entries would be added/updated, the month at hand would be checked. Existing entries in the list would be cross-referenced with their corresponding bio's to look for discrepancies:
- Are the existing entries in the correct day sub section?[9]
- Do the existing entries link to a valid biography article?
- Do the corresponding bio's contain the correct "[YEAR] deaths" and "[YEAR] births" categories?
- Does every day sub section contain at least two (later three) entries?
2. Process specific days of the dpm
After the initial checks the actual work on the article could commence. At first I focused on filling the gaps in the days of death, but soon I decided every day should contain at least two entries.[10] Processing a specific date started by clicking the 'Chk' button in the 'Death per date' worksheet. The following tasks would then be executed:
- Resolve the list of bio's whose subject had died on a specific date.[11] More info can be found here.
- Show the list alphabetized and, per bio, display whether it is a stub or has any 'problem flags' like {{multiple issues}}, {{Notability}}, {{Unreferenced}}, {{mcn}}, {{One source}}.
- Apply custom filtering to the list. More info on that here.
- Per bio try to resolve the following parts of an entry by analyzing the bio's wikitext:
Result filtering
From the start it was also clear to me that some inclusion filtering needed to be applied to the newly found (and existing) entries. On some days more than 30 persons with a bio died. Listing them all would make the dpm's unwieldy and error-prone. And lesser figures (often stubs and virtual orphans) distract from more notable entries.
So I experimented with conditions like not being a stub or not having problem flags. That did not work. However, the tool looked for the date of death (DoD) only in the infobox of the person's bio. As a consequence, a biography having an infobox acted as a first filter. I also made the application look at the bio's text size (excluding the text in the infobox and the stated categories: the 'net size'). This was the second filter. I settled for 4000 characters as the minimum 'net size' of a biography. This first attempt at grading WP:N worked, but it never sat well with me (and others). It was one of the reasons to initiate round 3.
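The two filters described above can be illustrated with a small C# sketch. It is not the Excel tool's actual logic: the class name `BioFilter`, the brace-matching approach and the category pattern are assumptions made for illustration; only the infobox requirement and the 4000-character minimum come from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Usage: a very short bio with an infobox passes filter 1 but fails filter 2.
string bio = "{{Infobox person|name=X|death_date={{death date|2001|8|25}}}}\n"
           + "X was a painter.\n[[Category:2001 deaths]]";
Console.WriteLine(BioFilter.HasInfobox(bio));
Console.WriteLine(BioFilter.NetSize(bio));
Console.WriteLine(BioFilter.PassesFilters(bio));

public static class BioFilter
{
    // Minimum 'net size' mentioned in the text.
    public const int MinNetSize = 4000;

    // Filter 1: the bio must contain an infobox.
    public static bool HasInfobox(string wikiText) =>
        wikiText.Contains("{{Infobox", StringComparison.OrdinalIgnoreCase);

    // 'Net size': length of the wikitext without the infobox and category links.
    public static int NetSize(string wikiText)
    {
        string text = RemoveInfobox(wikiText);
        text = Regex.Replace(text, @"\[\[Category:[^\]]*\]\]", "", RegexOptions.IgnoreCase);
        return text.Trim().Length;
    }

    // Filter 2 on top of filter 1.
    public static bool PassesFilters(string wikiText) =>
        HasInfobox(wikiText) && NetSize(wikiText) >= MinNetSize;

    // Removes the first {{Infobox ...}} template, honouring nested templates.
    private static string RemoveInfobox(string wikiText)
    {
        int start = wikiText.IndexOf("{{Infobox", StringComparison.OrdinalIgnoreCase);
        if (start < 0) return wikiText;

        int depth = 0;
        int i = start;
        while (i < wikiText.Length - 1)
        {
            if (wikiText[i] == '{' && wikiText[i + 1] == '{') { depth++; i += 2; }
            else if (wikiText[i] == '}' && wikiText[i + 1] == '}')
            {
                depth--; i += 2;
                if (depth == 0) return wikiText.Remove(start, i - start);
            }
            else i++;
        }
        return wikiText.Substring(0, start); // unbalanced braces: drop the rest
    }
}
```

A size threshold like this is cheap to compute, which may explain its appeal, but it measures article length rather than notability, which is the weakness the text points out.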
3. Concluding processing a month
Processing a month would be concluded by two manual activities:
- Search for additional causes of death regarding the entries[14].
- Are there any 'reason for notability' descriptions in the entries that need trimming?
Chronology of activities
Work started on 1 September 2018 by applying the same format to sections and list entries in all the dpy's.[4] The first couple of months I worked on the 24 existing dpm's of 2004 and 2005, processing each month assisted by the Excel tool as described. At the same time I was also in the process of finishing another project.
On 2 February 2019 I standardized the guidelines and day sub sections of all dpm's between 2004 and 2015. Applying those changes finalized the first round of improvements regarding the dpm's of 2004 and 2005.
I could now focus on the Deaths in Year pages. The following list shows when an entire year was completed, after which it could be split up into 12 dpm's, finalizing their first round of improvements:
- 2003: 10 February 2019
- 2002: 17 February 2019
- 2001: 17 February 2019
- 2000: 23 February 2019
- 1999: 19 May 2019
- 1998: 9 November 2019
- 1997: 4 October 2020
Regarding 1998 and 1997 (and 1996) a new dpm was created right after a month had been processed. The 12 dpm's were not created simultaneously anymore as is explained here. Processing of the years 1993-1998 was done in this processing page which would be initialized every time after a dpm was completed.[15][16]
Round 1 saw one final improvement. From the beginning I had noticed that the dpm's lacked references citing the deceased's date (and cause) of death. I had started adding some citations to entries but it seemed to be a drop in the bucket. That's why I introduced a new, feasible requirement: at least one reference per day sub section.[17] Around June 2020 I first started thinking about automating citations. The archive API of The New York Times especially offered great possibilities. So I wrote some code to experiment with the NYTimes API, retrieving obituary data and creating citations from it. I pasted the output in another processing user page: /References/The New York Times. The results were spectacular. I could now use this list of generated references as a source. So after processing a day I would also manually add citations of matching entries to the day sub section of the dpm. The first month I processed this way was September 1997. I worked my way back to January 1997, improving and bugfixing the code.
Eventually the software evolved into the WikipediaReferences application. You can read more about it here. On 14 November 2020 (I learned from the GitHub commit) the application was finally able to add NYTimes references to the corresponding entries of an entire dpm automatically. I decided to reprocess all the existing dpm's (1997 – 2005) so that their number of stated references would increase considerably. Work started with January 1997 on the same day, heralding the start of the next round.
Milestones
- 1 September 2018: the first edit is made
- 10 February 2019: the first dpm is created
- 10 February 2019: the first dpy is nuked
- 2 March 2019: all dates since the start of the millennium to date have been accounted for
- 11 July 2020: the first day sub section is processed including generated NYTimes references
After 25 months round 1 was concluded by creating the last dpm of 1997.[18] By this time I must already have decided to extend the 'chaining back' period to January 1990.
Round 2: Adding NYTimes references
Period: November 2020 – October 2021
Articles: Deaths in January 1995 – Deaths in December 2005
As already described at the end of the previous section, the success of the WikipediaReferences application prompted me to re-process all dpm's that existed at that time (November 2020). Automatically adding NYTimes references using the tool would also become the additional third activity when wrapping up a month (see 3. Concluding processing a month in Round 1 for the other two activities).
Processing a month using the WikipediaReferences application
Processing a particular dpm usually consisted of these steps:
- After the regular processing of a dpm was concluded and the last entries were added/updated I would run the software to evaluate a dpm. See screenshot: I would select 'p', followed by some input to tell the application which month and which Wikipedia source page to process.
- First the app would perform initial checks like looking for duplicate entries. The process is aborted if any issues are encountered.
- If the initial issues are resolved, the month in question is evaluated by comparing the NYTimes obit data with the entries in the dpm. After that the app offers to generate the wikitext, including the added/updated references. However, this was seldom the case on the first run. In most cases other actions were required first, after which the evaluation was run again. Two types of actions exist:
- If NYTimes obituary data exists for a listed entry, then the resolved death date in the obituary is compared with the date of death in the entry's corresponding bio. Very often discrepancies would exist. One reason is that the death date stated in the bio is wrong.[19] These discrepancies had to be corrected first.
- The software would also spot potential entries: for the particular month NYTimes obit data would exist for bio's that were not present in the dpm. In fact, so many potential entries were suggested that I applied a notability filter to them.[20] I would add most of the suggested entries manually to the dpm source page.
- After the correction/additions step I would re-run option 'Print month of death'. Sometimes several times until no more issues were encountered by the application.
- After successful evaluation of the dpm I would instruct the app to generate the wikicode in a text file.
- Processing a specific dpm is concluded by pasting the contents of the text file in the source page of the dpm and checking the result.
Chronology of activities
Right after I uploaded the last code changes I started using the software on the existing dpm's. I really hit the ground running, processing the years 1997–2000 within 6 weeks, adding and updating over a thousand citations (as well as adding quite a few entries suggested by the application).
By September 2021 I had processed all existing dpm's, increasing the number of references on a page considerably.[21] I could now resume my efforts in the processing page where I prepared brand new dpm's, starting with Deaths in December 1996. By now the software was firmly embedded in the way of working.
1995
However, work was interrupted by another job. An editor had forked Deaths in 1995 into 12 dpm's without any regard for the different style and format, after which he added many entries. It took me a sh*tload of time bringing the new dpm's up to par.[22] The task involved a lot of corrections by hand as well, adding causes of death and shortening entry descriptions, meanwhile battling this lunatic. When cleaning up 1995 I also identified many unnotable entries, many of whom didn't even have an enwiki bio. And by this time I had already decided to reprocess all the days of existing dpm's, partly to apply the new notability algorithm to entries. This would mean that many 1995 entries would be cleansed from the lists. That's why I decided it would be a huge waste of time applying the WikipediaReferences tool to the 1995 entries; it would take a lot of effort correcting entries that would be removed at a later stage anyway. This is the reason why (although chronologically incorrect) this was actually a Round 1 job.
Milestones
- 14 November 2020: the first dpm is processed using the fully functional WikipediaReferences application
- 12 September 2021: processing the first new dpm using the tool
Still using the wiki_client Excel tool, Round 2 came to an end on 31 October 2021 with the creation of Deaths in September 1996.
More details on the progress regarding Round 2 can be found here. In the table, click on the title 'Round 2' to sort on the date when the processing of a dpm was finished.
Round 3: New rules: let's process every day (again)
Period: November 2021 – November 2024
Articles: Deaths in November 1989 – Deaths in December 2005
So by now I've been at it for a couple of years and during that period two issues started bugging me more and more:
- The notability algorithm is faulty; I'm adding entries whose bio's are semi orphans. At the same time I miss notable entries because their bio's don't have infoboxes.
- Most entries do not have citations. After completing Round 2 this was improved somewhat but many dpm's now contain references that almost exclusively point to The New York Times as a source.
Wikidata
During my activities I had come across Wikidata when inspecting bio's. At some point I must have noticed that the data stored in a human Wikidata item could serve my purposes, especially these data properties:
- Item's description (= reason for notability regarding humans)
- Date of death (DoD) statement
- Date of birth (DoB) statement (needed to resolve an entry's age)
- Cause of death statement
- Number of wiki's in which the human is present
Investigating the Wikidata query capabilities made me realise that using Wikidata as a source offered huge advantages over using an entry's corresponding Wikipedia page. It would help me with both issues, resolve the cause of death automatically and offer an alternative for generating the description part of an entry.[23] There was also one final perk of using Wikidata as a source: the death date statement of many items contained references supporting the claim. This information could be used to generate references for entries automatically when processing a dpm. These were all great improvements. I realised that I had to re-process every day between 1990 and 2005 AGAIN. But since it was clear that it would hugely increase the quality and reliability of the dpm's, I decided in a heartbeat I would do it. I still had to create the software though, which ultimately would become the WikipediaDeathsPages web application.
At the heart of the app would be the query that would fetch the Wikidata data regarding a specific date of death. Unfortunately I am unfamiliar with the SPARQL query language. Luckily Wikidata:Request a query exists. With the help of volunteers over the course of a couple of months I was finally able to define the query. As input it would only require the date of death. The output is shown below as a table. As you can see it contains the basic data (alphabetized by article name!) I needed to generate the entries for a specific day (in this case 25 August 2001):[24]
item | articlename | itemLabel | itemDescription | sl[25] | dob | dod | dod_refs[26] | cod[27] | mod[28] |
---|---|---|---|---|---|---|---|---|---|
Q11617 | Aaliyah | Aaliyah | American singer and actress (1979–2001) | 69 | 1979-01-16T00:00:00Z | 2001-08-25T00:00:00Z | stated in: Nederlandse Top 40~!stated in: Find a Grave~!Find a Grave memorial ID: 5727911~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Aaliyah~!subject named as: Aaliyah Dana Haughton~!Nederlandse Top 40 artist ID: aaliyah~!stated in: Integrated Authority File~!retrieved: 2014-04-09T00:00:00Z | aviation accident | accidental death |
Q3298163 | Madge Adam | Madge Adam | English solar astronomer (1912-2001) | 15 | 1912-03-06T00:00:00Z | 2001-08-25T00:00:00Z | stated in: Who's Who~!Who's Who UK ID: U4983~!imported from Wikimedia project: English Wikipedia | ||
Q6779010 | Mary Barnard | Mary Barnard | American poet and translator (1909-2001) | 3 | 1909-12-06T00:00:00Z | 2001-08-25T00:00:00Z | stated in: SNAC~!stated in: Find a Grave~!Find a Grave memorial ID: 6318601~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Mary Barnard~!subject named as: Mary Ethel Barnard~!SNAC ARK ID: w60s047j | ||
Q1037163 | Carl Brewer (ice hockey) | Carl Brewer | Canadian ice hockey player (1938-2001) | 9 | 1938-10-21T00:00:00Z | 2001-08-25T00:00:00Z | stated in: SNAC~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Carl Brewer~!SNAC ARK ID: w6f76nsq~!stated in: Find a Grave~!Find a Grave memorial ID: 8466339~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Carl Thomas Brewer | ||
Q10294559 | Helmut Bruck | Helmut Bruck | German officer and Knight's Cross recipient | 3 | 1913-02-16T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q93784 | John Chambers (make-up artist) | John Chambers | American make-up artist and prosthetic makeup expert | 12 | 1923-09-12T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Italian Wikipedia | ||
Q8079499 | Üzeyir Garih | Üzeyir Garih | Turkish businessman | 4 | 1929-01-01T00:00:00Z | 2001-08-25T00:00:00Z | |||
Q3547943 | Diana Golden (skier) | Diana Golden | American alpine skier (1963-2001) | 6 | 1963-03-20T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | breast cancer | natural causes |
Q6033955 | Inigo Jackson | Inigo Jackson | actor (1933-2001) | 1 | 1933-07-19T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q155493 | Philippe Léotard | Philippe Léotard | French singer and actor (1940-2001) | 21 | 1940-08-28T00:00:00Z | 2001-08-25T00:00:00Z | GND ID: 119002469~!stated in: Roglo~!stated in: Integrated Authority File~!stated in: GeneaStar~!stated in: Who's Who in France~!stated in: Find a Grave~!Find a Grave memorial ID: 5860980~!retrieved: 2015-10-18T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Philippe Leotard~!Who's Who in France biography ID: 25159~!Roglo person ID: p=philippe;n=leotard~!GeneaStar person ID: leotardp~!stated in: filmportal.de~!stated in: BnF authorities~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2015-10-10T00:00:00Z~!reference URL: http://data.bnf.fr/ark:/12148/cb12070631t ~!subject named as: Philippe Léotard~!Filmportal ID: 0216ac0cf8fb4ce3a3e417812c4a5a72 | respiratory failure | natural causes |
Q3764794 | Ginzō Matsuo | Ginzō Matsuo | Japanese actor, voice actor and narrator | 8 | 1951-12-26T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q6243659 | John L. Nelson | John L. Nelson | American jazz musician, songwriter, father of Prince | 6 | 1916-06-29T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q862381 | Bill Pratney | Bill Pratney | New Zealand cyclist (1909-2001) | 2 | 1909-05-20T00:00:00Z | 2001-08-25T00:00:00Z | |||
Q5671841 | Harry Ramberg | Harry Ramberg | Swedish tennis player | 4 | 1909-04-06T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Swedish Wikipedia | ||
Q4807036 | Asit Sen (director) | Asit Sen | film director | 6 | 1922-09-24T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q106222009 | Ben Oumar Sy | Ben Oumar Sy | Guinean footballer and manager | 1 | 1926-01-08T00:00:00Z | 2001-08-25T00:00:00Z | |||
Q173413 | Ken Tyrrell | Ken Tyrrell | Racing driver and Formula one team owner (1924-2001) | 18 | 1924-05-03T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Russian Wikipedia~!stated in: Encyclopædia Britannica Online~!retrieved: 2017-10-09T00:00:00Z~!Encyclopædia Britannica Online ID: biography/Ken-Tyrrell~!subject named as: Ken Tyrrell | pancreatic cancer | natural causes |
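The dod_refs column in the table above packs all reference claims into one string, separated by `~!`, with each claim written as `property: value`. A minimal C# sketch of how such a string could be unpacked follows; the class name `DodRefs` and its methods are hypothetical and are not taken from the WikipediaDeathsPages code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage with the dod_refs value of Madge Adam from the table.
string raw = "stated in: Who's Who~!Who's Who UK ID: U4983~!imported from Wikimedia project: English Wikipedia";
foreach (var (property, value) in DodRefs.Parse(raw))
    Console.WriteLine($"{property} => {value}");

public static class DodRefs
{
    // Splits the packed string into (property, value) pairs.
    public static List<(string Property, string Value)> Parse(string raw)
    {
        var result = new List<(string Property, string Value)>();
        if (string.IsNullOrWhiteSpace(raw)) return result;

        foreach (string part in raw.Split("~!", StringSplitOptions.RemoveEmptyEntries))
        {
            // Split at the first ": " only, so URLs in the value stay intact.
            int pos = part.IndexOf(": ", StringComparison.Ordinal);
            if (pos < 0) continue;
            result.Add((part.Substring(0, pos).Trim(), part.Substring(pos + 2).Trim()));
        }
        return result;
    }

    // Distinct sources named in 'stated in' claims, in order of appearance.
    public static List<string> StatedIn(string raw) =>
        Parse(raw).Where(r => r.Property == "stated in")
                  .Select(r => r.Value)
                  .Distinct()
                  .ToList();
}
```

Note that the packed string does not group claims per reference, so a sketch like this can tell which sources are cited but not which ID or retrieval date belongs to which source.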
Rethinking notability
As already explained, the algorithm that decided if a deceased should be listed was flawed. I had already noticed that more relevant people appear on more wiki's (winner). I also came to believe that more links to a bio suggest greater notability. The Wikidata query returned the number of site links per entry. The Wikipedia link count API could resolve the number of incoming links. At some point I came up with the concept of the "notability score" of a potential entry. This score is expressed as the product of the two aforementioned data points. For instance take John Chambers (make-up artist):
Number of site links: 12 (see column 'sl' in above table)
Number of pages linking to the bio: 237 (Link Count tool result, API result)
Hence John's notability score would be 12 * 237 = 2,844
After much experimenting I settled for a minimum score of 48[29] for an entry to be listed. Although still not perfect it worked way better than the previous algorithm, with this as the end result.
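The score and threshold described above amount to a one-line calculation. The C# sketch below reproduces the John Chambers example; the class and method names are made up for illustration, and treating the minimum of 48 as inclusive is an assumption.

```csharp
using System;

// Usage with the John Chambers example: 12 site links, 237 incoming links.
Console.WriteLine(Notability.Score(12, 237));      // 2844
Console.WriteLine(Notability.IsListable(12, 237)); // True

public static class Notability
{
    // Minimum score for an entry to be listed, as stated in the text.
    public const int MinimumScore = 48;

    // Notability score: site links (Wikidata) times incoming links (Wikipedia).
    public static long Score(int siteLinks, int incomingLinks) =>
        (long)siteLinks * incomingLinks;

    // Assumes the threshold is inclusive (a score of exactly 48 qualifies).
    public static bool IsListable(int siteLinks, int incomingLinks) =>
        Score(siteLinks, incomingLinks) >= MinimumScore;
}
```

Because the score is a product, a bio present on a single wiki needs at least 48 incoming links to qualify, while one present on 12 wikis needs only 4.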
References, revisited
Wikidata references
When I was building the Wikidata query I had noticed that some online sources were stated quite often as references for death date statements of humans. Because of the structured way this information was stored I could use it to generate citations for my entries. Obviously the online source is checked for existence and its contents searched for the date of death (DoD) before the information is used to create a reference.
The following sources are evaluated, in this specific order:
- Encyclopædia Britannica
- The Guardian
- The Independent
- Internet Broadway Database
- DB~e
- Biografisch Portaal
- FemBio
- filmportal.de
- Fichier des personnes décédées
This is an example of a generated reference based on the Wikidata DoD statement claims of José Craveirinha:
<ref>{{cite web |last1= |first1= |title=José Craveirinha |url=https://www.britannica.com/biography/Jose-Craveirinha |website=britannica.com |publisher=Encyclopædia Britannica Online |access-date=24 December 2023 |language= |date=}}</ref>
[30]
Sports sites references
During implementation of this I discovered an alternative way of automatically utilizing online sources. Websites use specific URL patterns to identify resources on the host. Some websites use name-based patterns. For instance the site Cycling statistics uses the following URL to identify rider Jacques Anquetil:
https://www.procyclingstats.com/rider/jacques-anquetil
Knowing the specific pattern I could 'guess' url's using the label name of an entry. When processing DoD November 2, 2004 for instance rider Gerrie Knetemann would be one of the deceased returned by the Wikidata-query.
The software would send https://www.procyclingstats.com/rider/gerrie-knetemann as a request. If the web page exists its html is searched for the DoD.[31] If encountered the web page can now act as a citation and next web reference is generated:
<ref>{{cite web |last1= |first1= |title=Gerrie Knetemann |url=https://www.procyclingstats.com/rider/gerrie-knetemann |website=procyclingstats.com |publisher= |access-date=16 December 2023 |language= |date=}}</ref>
[32]
This way of looking for citation sources is used when no Wikidata DoD references were encountered. The mechanism was applied to the following (sports) web sites, in this order:
- baseball-reference.com
- pro-football-reference.com
- basketball-reference.com
- hockey-reference.com
- olympedia.org
- worldfootball.net
- procyclingstats.com
- where2golf.com[33]
Note: To decrease the number of http requests per entry I first looked in the entry's bio to determine if the person was known for any of the sports being evaluated. Only then the url would be compiled and called.
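The URL guessing and reference generation described above can be sketched in C# as follows. This is an illustration, not the application's code: the class `SportsReference`, the slug rules (lower case, hyphens, accents dropped) and the date formats passed to `MentionsDateOfDeath` are assumptions, and the HTTP request itself is left out so the example runs without network access. The cite web layout copies the generated reference shown earlier in this section.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Usage with the Gerrie Knetemann example from the text.
string url = SportsReference.GuessUrl("https://www.procyclingstats.com/rider/", "Gerrie Knetemann");
Console.WriteLine(url);
Console.WriteLine(SportsReference.ToCiteWeb("Gerrie Knetemann", url, new DateTime(2023, 12, 16)));

public static class SportsReference
{
    // Turns an entry label into a lower-case, hyphen-separated slug (assumed pattern).
    public static string ToSlug(string label)
    {
        var slug = new StringBuilder();
        foreach (char c in label.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop combining accents, so an accented e becomes a plain e
            if (char.IsLetterOrDigit(c))
                slug.Append(char.ToLowerInvariant(c));
            else if (slug.Length > 0 && slug[slug.Length - 1] != '-')
                slug.Append('-');
        }
        return slug.ToString().TrimEnd('-');
    }

    public static string GuessUrl(string baseUrl, string label) => baseUrl + ToSlug(label);

    // True if the page HTML contains the date of death in any of the given formats.
    public static bool MentionsDateOfDeath(string html, DateTime dod, params string[] formats) =>
        formats.Any(f => html.Contains(
            dod.ToString(f, CultureInfo.InvariantCulture), StringComparison.OrdinalIgnoreCase));

    // Builds a web reference in the same layout as the generated examples.
    public static string ToCiteWeb(string title, string url, DateTime accessDate)
    {
        string host = new Uri(url).Host;
        string website = host.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? host.Substring(4) : host;
        return "<ref>{{cite web |last1= |first1= |title=" + title
             + " |url=" + url
             + " |website=" + website
             + " |publisher= |access-date="
             + accessDate.ToString("d MMMM yyyy", CultureInfo.InvariantCulture)
             + " |language= |date=}}</ref>";
    }
}
```

Checking the fetched page for the DoD before citing it matters here, because a guessed URL can resolve to a different person with the same name.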
Second tier Wikidata references
If no sports site reference could be resolved, the following Wikidata reference sources are evaluated (in that order):
Since these sources are stated very often as Wikidata DoD claims they now appear in abundance as references in the dpm's:
<ref>{{cite web |last1= |first1= |title=Jeanne Stuart - Social Networks and Archival Context |url=https://snaccooperative.org/ark:/99166/w6qp9q9c |website=snaccooperative.org |publisher= |access-date=24 December 2023 |language= |date=}}</ref>
[34]
I finally had established an acceptable way of resolving notability and generating citations. Now I only had to cast it into a user-friendly solution (the user being me).
Wikipedia Deaths Pages
From the start it was clear the solution was to be a web application. Because of the amount of text a console app would not be suitable, and by then I had enough experience with the web application framework Angular that I felt comfortable creating a single-page application to meet my front end needs.
I cannot determine when I started developing the web site. Fact is that the new software was first used on 16 November 2021 (see Milestones). A lot of tweaking to the code followed in the weeks after. I remember expanding the citations functionality and bugfixing the Wikidata query.
When the first version was released the site contained all the functionality to process a dpm the way the Excel tool did, but with the implemented improvements.
To achieve this, functionality present in the Excel tool had to be programmed again, for instance:
- Initial dpm checks
- Resolving data in the entry's bio, for instance the entry's description
- Numerous text manipulation functions
More in-depth information on the app can be found here. But how was the web site used when processing a dpm?
Processing a month using the Web application
A dpm article would be updated by the following steps:
- Perform the initial dpm checks. Consult #1. Dpm checks in Round 1 for specifics. Additional checks looked for article redirects and named references. See the screenshot for an example of the check results.
- Any issues found have to be solved first e.g. moving an entry to the correct day-subsection in the dpm, fixing redirects, removing nowiki-entries, adding categories or correcting the DoD in the entry's bio.
- If all issues are solved processing the days in the dpm can commence.
Code excerpt
Example of the C# code handling a piece of the challenge to determine the description part of the entry (which denotes the reason for a person being notable).

    public string ResolveDescription(string wikiText)
    {
        wikiText = RemoveReferences(wikiText);
        string description = GetInitialDescription(wikiText);

        if (description == null)
            return null;

        description = description.Replace("U.S. ", "American ", StringComparison.OrdinalIgnoreCase); // because of the end candidate '.'
        description = description.Replace("United States ", "American ", StringComparison.OrdinalIgnoreCase);

        // Truncate string; [,] [perhaps/probably] [best] known [mostly] for .. etc.
        string[] endCandidates = new string[] { "Infobox", "infobox", "{|", "{{", " who ", " whose ", " notable ", " noted ", " known ", " better ", " spanning ", " originally ", " widely ", " responsible ", " remembered ", " best ", " most ", " perhaps ", " reputed ", " born ", " considered ", " particularly ", "." };

        int posEnd = GetPositionDescriptionEnd(description, endCandidates);

        if (posEnd == InitialPosEnd)
            throw new InvalidWikipediaPageException($"None of the {endCandidates.Length} 'description end' candidates found (including '.') within {InitialMaxLengthDescription} chars from 'description start'. Change the opening sentence of the article. Description: \r\n{description}");

        description = description.Substring(0, posEnd);

        return RemoveWikiLinks(description);
    }

    private string GetInitialDescription(string wikiText)
    {
        string[] descriptionStarts = new string[] { " was a ", " was an ", " was the ", " was one of ", " was " }; // " was " LAST!

        int pos = GetPositioninWikiText(wikiText, descriptionStarts);

        if (pos == -1)
            return null;

        return wikiText.Substring(pos, Math.Min(InitialMaxLengthDescription, wikiText.Length - pos));
    }
Milestones
- 2 November 2024: the last dpm is processed using the new rules
- 6 November 2024: the last reference is added manually, bringing the minimum reference density[35] to 30% for all the days of 1990–2005
Temp (under construction)
Additional dependency: Wikidata deviates..
algorithm based on Wikidata
- New tooling: web application
- New notability rules
- Add cause of death automatically
Wikidata editor
Chronology of activities
Number of pages linking to the bio new method: 46 instead of 237 (result, API result[36])
Milestones
- The first day generated using the new software was 1 August 1996 on 16 November 2021
Side effects (under construction)
During the entire process I would find many errors in the analyzed bio's. I must have corrected thousands of bio's during the course of this project. The most common fixes to bio's:
- Adding the nationality of a person in the opening sentence
- Adding the date of death of a person[37]
- Correcting the date of death of a person
- Correcting categories regarding the year of death (and birth)
Also:
- Wikidata is not magically updated when Wikipedia content changes. As a consequence I made some 3,000 edits in Wikidata to sync the death (and often birth) data.
- Seven repositories on GitHub containing Wikipedia-related software
- Created articles, e.g. Lesley Cunliffe and Kambara Tai
- Created new dpy's for the years Deaths in 1980 – Deaths in 1989 because on 'date' they were removed from the Year pages (half of them have already been redirected to dpm's)
- Currently (15 Nov. 2024) I rank number 1639 among Wikipedians by number of edits (64,720)!
Statistics (under construction)
The statistics cover the period 1990-2005.
Statistics per 6 November 2024
- Total number of entries: 42,765
- Total number of references: 27,268
- Overall reference density[35]: 63.76% (27,268/42,765)
- Total number of bytes (approx.): 10.7 million
- Which translates to approx. 1,850 pages (A4)
- Average number of entries per death day: 7.32 (42,765/5,843)
- Average number of references per death day: 4.67 (27,268/5,843)
- Average number of entries per dpm: 222.73 (42,765/(12 * 16))
- Average number of references per dpm: 142.02 (27,268/(12 * 16))
- Death day with the most entries (74): Deaths in September 2001#11
- Death day with the 2nd most entries (24): Deaths in December 2004#26
- Death day with the 3rd most entries (23): Deaths in April 1993#27
- Dpm with the most entries (310): Deaths in December 2005
- Dpm with the most references (229): Deaths in December 1995
- Dpm with highest reference density[35] (82.33%): Deaths in January 1999
- Minimum reference density regarding all processed days: 30%
- Month with the most deaths: December (3,980) (2nd: January (3,901))
- Month with the least deaths: June (3,326)
- Total number of views for all dpm's per year (2023): 846,402 (details)
- Which translates to 4,408 views per page per year (= 367 views per dpm per month)
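The derived figures in this list follow from a handful of totals. The C# sketch below recomputes them from the numbers stated above; the names `Stats`, `Density` and `Average` are chosen for illustration only.

```csharp
using System;
using System.Globalization;

// Totals as stated in the list (period 1990-2005).
const int entries = 42765;
const int references = 27268;
const int deathDays = 5843;
const int dpms = 12 * 16;       // twelve months, sixteen years
const int viewsIn2023 = 846402;

var inv = CultureInfo.InvariantCulture;
Console.WriteLine("Reference density (%): " + Stats.Density(references, entries).ToString(inv));
Console.WriteLine("Entries per death day: " + Stats.Average(entries, deathDays).ToString(inv));
Console.WriteLine("References per death day: " + Stats.Average(references, deathDays).ToString(inv));
Console.WriteLine("Entries per dpm: " + Stats.Average(entries, dpms).ToString(inv));
Console.WriteLine("References per dpm: " + Stats.Average(references, dpms).ToString(inv));
Console.WriteLine("Views per dpm per year: " + Stats.Average(viewsIn2023, dpms).ToString(inv));

public static class Stats
{
    // Reference density: references as a percentage of entries, two decimals.
    public static double Density(int references, int entries) =>
        Math.Round(100.0 * references / entries, 2);

    // Plain average rounded to two decimals.
    public static double Average(int total, int count) =>
        Math.Round((double)total / count, 2);
}
```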
Statistics regarding the project
- Duration: 6 years and 2 months (September 2018 – November 2024)
- Number of death days processed: 5,843
- Number of created dpm's: 170
- Number of added entries (approx.): 21,200
- Number of added references (approx.): 22,700
- Number of added bytes (approx.): 7.9 million
- Which translates to approx. 1,400 pages (A4)
- Number of edits (approx.): 22,000[38] (details)
- Number of edits on Wikidata (approx.) (manual[39] and automated[40]): 3,000
Epilogue
One question remains: Why? Why would anyone spend that much time on these trivial lists? Sure, I stumbled across a mess when I was looking for a challenge to help me become a better programmer. And in a way I became a slave to the applications I created; the custom and personal software worked so well that I felt the responsibility to see it through. Perhaps I just wanted to leave something behind, albeit insignificant.
Or maybe, as Tony Stark put it: "Everybody needs a hobby."
References
- ^ "Announcing Wikipedia's most popular articles of 2023". Wikimedia Foundation. 5 December 2023. Retrieved 20 January 2024.
- ^ In 2018 dpm's only existed for 2004 and later. Older deceased were organised in dpy's that existed for the years 1995–2003 (most of which were getting very long at the time).
- ^ Adding recent deaths has more or less been going on since November 2001 starting with the (red link) addition of Melanie Thornton (strangely in article Deaths in 2003). From December 2003 onwards it took off in earnest, accelerating in the following years.
- ^ a b I wrote some code to help me accomplish the task.
- ^ I wrote some code to check that as well.
- ^ Named after Holding Back the Years, a hit song on the first vinyl album I ever bought.
- ^ Probably interesting for me exclusively ("Wikipedia" Activities available; just add meaning.)
- ^ Apart from the three main rounds other smaller improvement iterations were done as well like:
- ^ The cause of these errors is very often that the date of death in corresponding bio's had changed but was not reflected in the list.
- ^ And even later I decided every day sub-section should list a minimum of three entries. After that I did the same regarding the minimum number of references per day, and so on.
- ^ Another way would have been to go through everyone listed in the category of deaths of a specific year. However, this would have meant processing the months of an entire year simultaneously. And I still would have had to query the bio's in search of the subject's date of death. Also, as I would find out, many bio's stated incorrect categories regarding the year of death (and birth).
- ^ In a lot of cases the nationality of a person was missing in the opening sentence so I had to fix the bio. Americans especially forget that the English Wikipedia is an international venture.
- ^ Causes of death of a person were suggested by displaying the first sentence in the bio that contained the string literals " murdered", " killed" or " died" (in that order). Although crude, this algorithm worked well and saved me a lot of time.
- ^ I found out that above around the age of 65 the cause of death is often not stated in a bio because, well, they just die of old age and 'natural causes' is not a valid cause of death (approaching the age mentioned made this work a tad confronting at times).
- ^ During the course of the project a whopping total of 5,078 edits were made to this page.
- ^ For undisclosed reasons 1993 and 1995 were partially processed in two other pages.
- ^ This minimum was increased to two during round 3.
- ^ Actually this round was concluded when dpm Deaths in December 1995 was completed. This is explained here
- ^ It's staggering how many editors confuse the date of publication of a cited source with the date of demise.
- ^ During the course of the project the notability filter was subject to change. First I used the 'net article size filter'. This was later changed to the filter applied in Round 3: the number of incoming links to the corresponding article.
- ^ This corresponds with the gap of 10 months during which no work was done in the processing page.
- ^ I wrote some code to fix the format and some other stuff.
- ^ In almost all cases the information in the opening sentence of a bio proved to be more useful than the Wikidata description, however.
- ^ Actually the data was returned by Wikidata as JSON, after which it was deserialized into fitting objects.
- ^ Site links; the number of wiki's (including the English Wikipedia) in which the item is present.
- ^ References regarding the DoD (date of death). Data is delimited by the text '~!'
- ^ Cause of death
- ^ Manner of death
- ^ Initially this limit was 50 but soon I changed it to 48 because of its factorization qualities.
- ^ "José Craveirinha". britannica.com. Encyclopædia Britannica Online. Retrieved 24 December 2023.
- ^ The web sites used specific date formats to display the death date. Obviously this had to be taken into account when looking for the date.
- ^ "Gerrie Knetemann". procyclingstats.com. Retrieved 16 December 2023.
- ^ Not very successful, only 10 generated citations in total.
- ^ "Jeanne Stuart - Social Networks and Archival Context". snaccooperative.org. Retrieved 24 December 2023.
- ^ a b c The reference density is the number of references divided by the number of entries.
- ^ Increase parameter 'srlimit' to see more link search results
- ^ Sometimes the person was still deemed alive in the bio until the correction
- ^ This breaks down to an average of slightly less than one added entry per edit but slightly more than one added reference per edit!
- ^ Wikidata: my Preferences page states 2,817 edits (as of 15 Nov 2024).
- ^ The address of the client changes so only a limited set of edits are shown per session. 40 sessions * 5 edits per session = 200 automated edits
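The cause-of-death heuristic described in the notes above (show the first sentence of a bio containing " murdered", " killed" or " died", in that order) could look roughly like the Python sketch below. This is not the project's actual code. The function names, the naive sentence splitter and the reading of "in that order" as a priority order are all assumptions made for illustration.

```python
import re
from typing import List, Optional

# Keywords from the note, in assumed priority order.
# The leading space is intentional: it is part of the string literal.
KEYWORDS = (" murdered", " killed", " died")


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.

    Abbreviations such as 'Dr. Smith' will be split incorrectly.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def suggest_cause_sentence(bio_text: str) -> Optional[str]:
    """Return the first sentence matching the highest-priority keyword, if any."""
    sentences = split_sentences(bio_text)
    for keyword in KEYWORDS:          # try " murdered" first, then " killed", ...
        for sentence in sentences:    # first matching sentence in document order
            if keyword in sentence:
                return sentence
    return None                       # no keyword found: nothing to suggest
```

A short usage example with a made-up bio:

```python
bio = "Jane Doe was a painter. She died in Paris in 1999. She was killed in a car crash."
print(suggest_cause_sentence(bio))
```

Because " killed" ranks above " died" in this reading, the sentence about the car crash is suggested even though the sentence containing " died" comes first in the text.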