Wikidata:Requests for permissions/Bot/fromCrossrefBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Ymblanter (talk) 20:08, 6 March 2023 (UTC)
Bot: fromCrossrefBot
Operator: Carlinmack
Task/s: Importing licenses for 1.45 million CC-licensed papers from the Crossref April 2022 dump.
Code: https://github.com/carlinmack/qid-id/blob/main/crossref.py#L123
Function details:
Background information
Crossref is a central index of research items across publishers[1], and in April 2022 it released a metadata dump with 134 million records. Crossref states that bibliographic metadata is not subject to copyright, so I believe we are free to annotate Wikidata with information derived from the dump.
In short, my code iterates over the full dump and extracts the DOI and license information for every record whose license contains "creativecommons". I then query WDQS with the resulting list of DOIs to find which of them exist in Wikidata and which are already licensed. Finally, I create a list of QuickStatements for the DOIs which are in Wikidata but do not have a license.
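The three steps above can be sketched roughly as follows. This is a minimal illustration, not the linked script: the function names, the sample records, the item IDs and the license mapping are all assumptions made for the example. It assumes Crossref records carry a `DOI` field and a `license` list whose entries have a `URL`, and that licenses go into P275 (copyright license) with DOIs matched on P356 (DOI). The license QID in the usage lines is believed to be CC BY 4.0 but should be verified before any real run.

```python
import json


def extract_cc_licenses(records):
    """Map each DOI to its first license URL containing 'creativecommons'."""
    found = {}
    for record in records:
        doi = record.get("DOI")
        if not doi:
            continue
        for lic in record.get("license", []):
            url = lic.get("URL", "")
            if "creativecommons" in url:
                found[doi] = url
                break
    return found


def build_sparql(dois):
    """Build a WDQS query returning items for the DOIs, plus any existing license."""
    # json.dumps yields double-quoted, escaped literals usable in a VALUES clause.
    values = " ".join(json.dumps(d) for d in dois)
    return (
        "SELECT ?item ?doi ?license WHERE { "
        f"VALUES ?doi {{ {values} }} "
        "?item wdt:P356 ?doi . "
        "OPTIONAL { ?item wdt:P275 ?license } }"
    )


def to_quickstatements(doi_to_qid, licensed_qids, doi_to_url, url_to_license_qid):
    """Emit one tab-separated QuickStatements line per unlicensed item."""
    lines = []
    for doi, qid in sorted(doi_to_qid.items()):
        if qid in licensed_qids:
            continue  # item already has a license statement
        license_qid = url_to_license_qid.get(doi_to_url.get(doi))
        if license_qid:
            lines.append(f"{qid}\tP275\t{license_qid}")
    return lines


# Made-up sample data; the Q-numbers for items are placeholders.
records = [
    {"DOI": "10.5555/1", "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}]},
    {"DOI": "10.5555/2", "license": [{"URL": "https://example.org/tdm-license"}]},
    {"DOI": "10.5555/3", "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}]},
]
cc = extract_cc_licenses(records)
query = build_sparql(sorted(cc))
statements = to_quickstatements(
    doi_to_qid={"10.5555/1": "Q1000001", "10.5555/3": "Q1000003"},
    licensed_qids={"Q1000003"},
    doi_to_url=cc,
    url_to_license_qid={"https://creativecommons.org/licenses/by/4.0/": "Q20007257"},
)
print("\n".join(statements))
```

In practice license URLs in the dump vary (http vs https, trailing slashes, version numbers), so a real mapping would need to cover more variants than the single entry used here.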
In December 2022 I ran this code, found 300k items to annotate, and processed them via QuickStatements batches. After subsequent discussion I discovered that Wikidata normalises DOIs to uppercase, so my lookups had missed every DOI that contains letters. I re-ran my code with this fixed and found 1.45 million DOIs which could be annotated on Wikidata with licenses.
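The case mismatch described above can be illustrated with a small sketch. The helper name `normalise_doi` and the DOI strings are made up for the example, and the sketch assumes only that the stored Wikidata values are uppercase while the dump values are not.

```python
def normalise_doi(doi):
    """Uppercase a DOI so it matches the form stored on Wikidata."""
    return doi.strip().upper()


# Illustrative values only: the dump side in mixed/lower case, Wikidata in uppercase.
crossref_dois = ["10.5555/12345678", "10.5555/example.abc"]
wikidata_dois = {"10.5555/12345678", "10.5555/EXAMPLE.ABC"}

# A naive exact match only finds DOIs without letters.
naive = [d for d in crossref_dois if d in wikidata_dois]

# Normalising first also finds DOIs that contain letters.
normalised = [d for d in crossref_dois if normalise_doi(d) in wikidata_dois]

print(len(naive), len(normalised))
```

Here the naive comparison matches one DOI and the normalised comparison matches both, which is the same effect that separated the first run from the second.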
I have written and tested the code to make these edits, linked above. I have not included the data file as it is 134 MB, but I can share it if it is of interest.
A more complete write-up can be found on my user page.
Carlinmack (talk) 14:27, 27 February 2023 (UTC)
- have you done test edits? BrokenSegue (talk) 20:39, 28 February 2023 (UTC)
- Yep: https://www.wikidata.org/w/index.php?title=Q54803016&action=history . That's with the script. I also did ~100 QuickStatements batches previously ( https://quickstatements.toolforge.org/#/batches/Carlinmack ), so I have confidence in the statements executing as expected. Carlinmack (talk) 20:49, 28 February 2023 (UTC)
- ok LGTM. Support BrokenSegue (talk) 21:08, 28 February 2023 (UTC)