Wikidata:Requests for permissions/Bot/So9qBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Not done No follow-up on the request to see if this is still active. @So9q: feel free to re-open this if you want to follow up on it (revert this edit, add it back to the list of bot requests). Thanks. Mike Peel (talk) 21:27, 4 February 2022 (UTC)[reply]
So9qBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: So9q (talk • contribs • logs)
Task/s:
- import DOIs found in Wikipedia
- import ISBNs found in Wikipedia
- import JSTORIDs found in Wikipedia
Code: https://github.com/dpriskorn/asseeibot
Function details: Find DOIs and ISBNs and upload them using the sourceMD tool (if possible, if not, read from CrossRef API and mimic sourceMD). Only upload DOIs that are found in Wikipedia and missing in Wikidata.
The bot will use WikidataIntegrator (which defaults to and respects maxlag=5) if sourceMD cannot be used.
See also https://www.wikidata.org/wiki/Wikidata:Bot_requests#request_to_import_DOI_and_ISBN_as_items_when_present_in_any_Wikipedia_article_(2021-02-11) --So9q (talk) 13:59, 5 April 2021 (UTC)[reply]
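(A minimal sketch of the CrossRef lookup step mentioned above, assuming the public api.crossref.org/works/{doi} endpoint; the function name and field selection are illustrative and not the actual asseeibot code.)

```python
# Sketch of fetching DOI metadata from the public CrossRef REST API.
# Function name and field selection are illustrative only.
from typing import Optional

import requests


def fetch_crossref_metadata(doi: str) -> Optional[dict]:
    """Return basic metadata for a DOI, or None if CrossRef does not know it."""
    response = requests.get(
        f"https://api.crossref.org/works/{doi}",
        headers={"User-Agent": "asseeibot-sketch/0.1 (bot request example)"},
        timeout=30,
    )
    if response.status_code != 200:
        return None
    work = response.json()["message"]
    return {
        "title": work.get("title", [None])[0],
        "authors": [
            f"{a.get('given', '')} {a.get('family', '')}".strip()
            for a in work.get("author", [])
        ],
        "year": work.get("issued", {}).get("date-parts", [[None]])[0][0],
        "journal": work.get("container-title", [None])[0],
    }
```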
- So this is gonna make a ton of new items? Can you show an example? How many new items are we talking about per day? How will this code attempt to determine if an item already exists? BrokenSegue (talk) 13:36, 8 April 2021 (UTC)[reply]
- @BrokenSegue: Yes, this will create a lot of new items, one for every DOI in WP. I have no estimate of how many, but say 1-10 million (only counting DOIs; I have not investigated the ISBNs, but they are probably a few million as well). It will immensely help WP editors to cite scientific articles using the Cite Q template (after the article has been used in any Wikipedia and found by the bot). Currently the script runs on my local machine finding DOIs (not yet uploading), and that results in about 40 new DOIs an hour, looking only at enWP. These DOIs are all missing in WD. Extrapolating, the bot would create about 28,800 new items per month. You can install and run it yourself if you feel like it; I posted instructions in the git repo for how to install the requirements. :)--So9q (talk) 05:19, 12 April 2021 (UTC)[reply]
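(Regarding the question of how existing items are detected: one possible approach, not necessarily the exact one asseeibot uses, is to ask WDQS whether any item already carries the DOI as a P356 value. A hedged sketch:)

```python
# One possible duplicate check: ask the Wikidata Query Service whether an
# item with this DOI (P356) already exists. Not necessarily what asseeibot does.
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def doi_exists_in_wikidata(doi: str) -> bool:
    # DOI values on Wikidata are conventionally stored in upper case.
    query = 'ASK { ?item wdt:P356 "%s" }' % doi.upper()
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "asseeibot-sketch/0.1 (bot request example)"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["boolean"]
```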
- Please perform some 50-250 test edits. Lymantria (talk) 14:12, 8 April 2021 (UTC)[reply]
- @Lymantria: I will finish writing the upload part (since sourceMD is broken and not viable) and ping you when the test edits have been done.--So9q (talk)
- It would be a really big benefit if you could import books from ISBNs, though the SourceMD tool provides minimal information (as of 2019) and other sources would be needed.--GZWDer (talk) 15:33, 14 April 2021 (UTC)[reply]
- The best source for ISBNs that I have found is the OCLC website; see for example the result for 978-0-486-61272-0. The title, author and year are not copyrightable under US copyright law, so we can scrape them. I might get blocked, though, but we will see.--So9q (talk) 19:03, 18 April 2021 (UTC)[reply]
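(A rough sketch of what such a lookup could look like; the URL pattern and CSS selectors below are placeholders that would need to be checked against the real OCLC/WorldCat pages before use, and the pause between requests is there to reduce the risk of being blocked.)

```python
# Rough scraping sketch; the URL pattern and CSS selectors are placeholders
# and must be adapted to the actual OCLC/WorldCat page structure.
import time

import requests
from bs4 import BeautifulSoup


def fetch_isbn_metadata(isbn: str) -> dict:
    url = f"https://www.worldcat.org/isbn/{isbn}"  # placeholder lookup URL
    response = requests.get(
        url, headers={"User-Agent": "asseeibot-sketch/0.1"}, timeout=30
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.select_one("h1.title")    # placeholder selector
    author = soup.select_one("td.author")  # placeholder selector
    return {
        "title": title.get_text(strip=True) if title else None,
        "author": author.get_text(strip=True) if author else None,
    }


if __name__ == "__main__":
    for isbn in ["978-0-486-61272-0"]:
        print(fetch_isbn_metadata(isbn))
        time.sleep(5)  # be polite between requests to reduce the risk of a block
```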
- Before reusing any code from the SourceMD tool, please be aware of the serious issues described in its GitHub issues.--GZWDer (talk) 14:29, 18 April 2021 (UTC)[reply]
- @GZWDer: Thanks for the warning!--So9q (talk) 16:37, 18 April 2021 (UTC)[reply]
- Strong oppose we don't need more scientific papers, we should be moving them somewhere else and delete them here. Multichill (talk) 10:35, 25 April 2021 (UTC)[reply]
- @Multichill: Since you propose "moving them somewhere else", I wonder whether you are interested in working towards that becoming a reality? Maybe a new partnership project with a group of universities and/or a new grant from the Sloan Foundation to WikiCite/Shared Citations? Based on the Telegram group and the wiki pages, the WikiCite project seems very stale to me.
- I actually think that, from a narrow Wikipedian perspective, most of the 34 million scientific articles imported by others (mostly by @Daniel Mietchen: and @GZWDer: from what I have seen) are pretty useless. On the other hand, a scientific article usually has references, and if we recursively import references we might end up with 34 million articles or more, even if we choose to delete all the ones that are not specifically mentioned in Wikipedia (or reached recursively as a reference of one of those). There are now 4 million references to scientific articles with an ID in enWP, and according to the small sampling done with my bot, my estimate is that 60-70% of those are in Wikidata already.
- I'm pretty sure that the 34 million items, with a total of maybe 500 million triples, that scientific articles now consist of point to a deeper infrastructure issue with Blazegraph, even if they were to be "moved" or deleted. Deleting these 500 million triples will not solve the problem for long, as imports of all the world's patents (or just the ones already mentioned in Wikipedia), or of all the world's registered beaches, or similar might fill the gap pretty fast.
- @Lydia_Pintscher_(WMDE): recently mentioned the issue with Blazegraph in the Telegram chat. She pointed out that, in her opinion, it's a fact that we are now at the limit of the number of nodes that Blazegraph was designed for. I interpret her statements in the chat this way: adding more statements/items on the order of a couple of million (as this bot proposal would) might pose a risk to the whole WDQS infrastructure. If that is correct, then this is a big infrastructure problem, and one I suggest we put a lot of effort into solving no matter the future of the scientific articles in Wikidata.
- See the recent WDQS disk space issues and the high-priority epic bug about finding an alternative to Blazegraph (open since 2018), which had seen very little activity until recently (after I bumped it :)). See also this recent comment about BlazeGraph from a WMF employee.
- IMO the whole idea of Wikidata is to support other Wikimedia projects with centralized structured data, which is exactly what this bot job is about if you ask me, but I can see that, in the bigger picture, a fruitful WikiCite project that could easily be linked from special(?) properties in Wikidata might be a better solution.
- I invite others to join this discussion and state their views.--So9q (talk) 17:51, 4 May 2021 (UTC)[reply]
- We have no rush. Let's see what comes from the Shared Citations project, that seems anything but stale. Ainali (talk) 21:19, 4 May 2021 (UTC)[reply]
- Shared Citations is not intended to be a "bibliographic commons" (i.e. a collection of all books and articles ever published), while some users have proposed that Wikidata should be.--GZWDer (talk) 02:50, 5 May 2021 (UTC)[reply]
- We use a common knowledge base for a number of different purposes (such as the Cite Q template), and Scholia will be more usable if we have a complete corpus of papers (currently we have not even reached 20% of them). Also @Multichill: what is the benefit of importing all artistic works compared with papers?--GZWDer (talk) 18:02, 4 May 2021 (UTC)[reply]
- Why are you asking me about importing all artistic works? Are you planning to do so or is this a straw man? Multichill (talk) 20:12, 4 May 2021 (UTC)[reply]
- You may find it beneficial to do so (and indeed Commons may make good use of them), while many others consider importing all articles useful. (In my own estimation, this means 200 million new items.)--GZWDer (talk) 02:26, 5 May 2021 (UTC)[reply]
- Strong oppose. What Multichill said. Also, this "Import It All!" mentality has come up repeatedly on the Wikidata Telegram channel (the "it" changes, but the idea is always the same), and he has been told every time that it is not a good idea because of the limitations of hardware and resources, the fact that these are always imports without any kind of maintenance, etc., yet he refuses to listen. At this point this has long since turned into a game of pigeon chess that the community doesn't need to be playing. -Yupik (talk) 15:27, 25 April 2021 (UTC)[reply]
- BTW: See phab:T281854 - it is proposed that WMF introduce new endpoints dedicated to scientific articles.--GZWDer (talk) 18:03, 4 May 2021 (UTC)[reply]
- Oppose Hold off until we have more clarity of direction around the Shared Citations proposal. - PKM (talk) 22:19, 4 May 2021 (UTC)[reply]
- While I like the idea of Shared Citations, the goal is different: m:WikiCite/Shared_Citations#Database - It is not a place to compile completed sets of citation corpora (also known as "stamp collecting") or an attempt at a universal "bibliographic commons". Plus, Shared Citations does not cover the relationships between articles (how they reference each other).--GZWDer (talk) 02:29, 5 May 2021 (UTC)[reply]
- Good point. I guess a stamp collection could be made in any Wikibase hosted by anyone (preferably by scientists or a scientific organization) when federated properties become a reality/stable feature.--So9q (talk) 05:54, 20 July 2021 (UTC)[reply]
- Thanks for chipping in. I am not going to go forward with this until I feel more sure that the infrastructure can handle all the extra items/triples. Currently Blazegraph is a weak link; it would be nice to fix that before going forward, if possible.--So9q (talk) 05:54, 20 July 2021 (UTC)[reply]
- Support But please do not repeat the common mistakes of other bots:
- External databases have many kinds of special cases. — Ivan A. Krestinin (talk) 22:47, 19 July 2021 (UTC)[reply]
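(As an illustration of the kind of special cases meant here, a hedged sketch of the identifier clean-up the bot could do before creating items; the helper names are illustrative and not taken from the asseeibot code. DOIs are case-insensitive and conventionally stored upper-case on Wikidata, and ISBN-13s carry a check digit that catches many copy-and-paste errors.)

```python
# Illustrative identifier clean-up; helper names are not from the asseeibot code.

def normalize_doi(raw: str) -> str:
    """Strip common URL prefixes and upper-case the DOI
    (DOIs are case-insensitive; Wikidata stores P356 values upper-case)."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "https://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.upper()


def is_valid_isbn13(raw: str) -> bool:
    """Check the ISBN-13 check digit: digits weighted 1,3,1,3,... must sum to a multiple of 10."""
    digits = [int(c) for c in raw if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0
```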
@Ivan A. Krestinin, GZWDer, PKM, Yupik, Ainali, BrokenSegue: During the last couple of months I have investigated the technical limits of Wikidata. WMF has, it seems, started to take the issue with BlazeGraph more seriously and has hired a new employee to help analyse how WDQS is used today, as a basis for good decisions in case BG starts failing because of the sheer number of triples it has to handle. The Search Platform team has also conducted a poll to better understand what WDQS users value and want.
I have found a column-based SPARQL engine with query optimizations implemented in MapReduce (Apache Rya) that scales to petabytes of RDF data (we currently have ~100 GB) and have proposed that WMF evaluate it as a future replacement for BlazeGraph. I also recently explored QLever and proposed that WMF evaluate whether a dual-backend strategy is a good mitigation that we can start implementing now, with much less effort than replacing BlazeGraph. See the phabricator tasks concerning QLever.
I firmly believe that it is very important to continue improving the graph NOW (given that we have tons of users constantly asking for data, 23,000 eager and active contributors, and a world on fire that needs knowledge to find good solutions to all the problems we created for ourselves by destroying the environment since the industrial revolution), and thus also to keep creating new items that fall under our notability criteria, be they chemicals, books or scientific articles.
The opposition above seems to me mostly related to fear of a technical breakdown. That fear is not, IMO, something the community should base decisions on, but it is something we need to keep an eye on, and we should make sure we have spaces where we explore the underlying needs expressed by the users who feel it.
BlazeGraph is still keeping up fine from what I can see in Grafana, and from 18/10, when the new importer is deployed, the system will be able to handle changes much faster, including those that would be introduced by this bot.
The suggestion to hold off until the Shared Citations proposal might become a reality is a really bad idea IMO. The WikiCite grant is about to expire and I don't see WMF prioritizing this currently; do you, PKM? It could be years before the Shared Citations proposal becomes a reality. We have the Cite Q template now, and WP users want to use it to avoid having to manually enter all the details about their references on every article they curate. If Shared Citations ever becomes a reality, we can just move the data there at that point.
(In the case of a catastrophic failure of BG I will of course stop this bot, and we can discuss how to move forward. My guess is that WMF would just delete all descriptions from BG to buy time for finding and implementing a replacement.)
I'm eager to hear your responses to this. I will improve the code so that I can make some test edits, as Lymantria asked, and I hope the community will then approve this bot request so we can start getting better coverage of books and scientific literature and make Wikidata the best and most beautiful knowledge graph the world has ever seen.
I am also willing to throttle the bot so that it runs as slowly as the community wants it to (see the sketch below).
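(To make the throttling offer concrete: a minimal sketch, with the delay as a placeholder value the community could set; maxlag back-off itself is already handled by WikidataIntegrator.)

```python
# Illustrative throttle wrapper; EDIT_DELAY_SECONDS is a placeholder the
# community could set. maxlag back-off is handled by WikidataIntegrator itself.
import time

EDIT_DELAY_SECONDS = 60  # e.g. at most one new item per minute


def create_items_throttled(create_item, candidates):
    """Create one item at a time, pausing a fixed interval between edits."""
    for candidate in candidates:
        create_item(candidate)
        time.sleep(EDIT_DELAY_SECONDS)
```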
Sincerely--So9q (talk) 09:18, 28 September 2021 (UTC)[reply]
- Sorry, but I don't feel qualified to answer this. What if we asked for a statement from the WMF dev team? Would they feel confident in adding 100M more statements (or whatever we estimate this will add)? My guess is that that isn't actually a huge percentage increase from where we are now. BrokenSegue (talk) 14:02, 28 September 2021 (UTC)[reply]
@So9q, BrokenSegue, Lymantria: and all. This looks stale - is anything happening here? Is there a phabricator ticket to discuss this with WMF? Thanks. Mike Peel (talk) 21:48, 18 January 2022 (UTC)[reply]