Wikidata:Requests for permissions/Bot/RegularBot 2
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Not done @GZWDer: This request seems to be abandoned, please reopen it if that is not the case. Thanks. Mike Peel (talk) 21:35, 18 January 2022 (UTC)[reply]
RegularBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: GZWDer (talk • contribs • logs)
Task/s: Mass creation of items from Wikimedia articles and categories.
Code: Using a modified version of newitem.py
Function details: I intend to move all mass sitelink import features to this bot. Feel free to raise your concerns. --GZWDer (talk) 07:28, 8 August 2020 (UTC) @Tagishsimon, SCIdude, Hjart, Jheald, Edoderoo, Animalparty: @Charles Matthews, Voyagerim, Sabas88, Jean-Frédéric, Ymblanter: Should we import unconnected pages from Wikipedia at all?[reply]
- Previously this was cleaned up annually or semi-annually
- In my opinion, even if importing new items results in duplicates, not importing them at all defeats the purpose of Wikidata
- Not importing results in a large, infinitely growing backlog which includes many duplicates, with few people fixing them (such as the cebwiki one)
- Importing them at least allows users to find them using various tools (including when the item is improved)
- Some wikis have people cleaning up unconnected pages, but many wikis do not
- Alternatively, we could only import pages of a specific age (the default of newitem.py is 14 days since creation and 7 days since last edit, and there are bots importing from nlwiki and cswiki using such settings; but I use a 1/0 setting)
--GZWDer (talk) 07:46, 8 August 2020 (UTC)[reply]
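The age settings debated here (14/7 default versus 1/0) amount to a simple two-part check on page age and edit recency. A minimal sketch of that logic, assuming nothing about newitem.py's actual internals (function and parameter names are illustrative):

```python
from datetime import datetime, timedelta

def is_eligible(created, last_edited, now,
                min_days_since_creation=14, min_days_since_edit=7):
    """Return True if an unconnected page is old and quiet enough for
    automatic item creation. The 14/7 default requires the page to be
    at least 14 days old and untouched for 7 days; a 1/0 setting
    imports pages almost immediately after creation."""
    old_enough = now - created >= timedelta(days=min_days_since_creation)
    quiet_enough = now - last_edited >= timedelta(days=min_days_since_edit)
    return old_enough and quiet_enough

now = datetime(2020, 8, 8)
# 20 days old, last edited 10 days ago: passes the 14/7 default.
print(is_eligible(now - timedelta(days=20), now - timedelta(days=10), now))  # True
# 3 days old: fails 14/7, but would pass a 1/0 setting.
print(is_eligible(now - timedelta(days=3), now - timedelta(days=1), now,
                  min_days_since_creation=1, min_days_since_edit=0))  # True
```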
Discussion
- Support if you use the default 14/7 - I'm editing such items right now. However, it would help if you could make a quick page that lists tools to find them. I used PetScan for the titles and a WD dump, but this is out of reach for most people. --SCIdude (talk) 07:59, 8 August 2020 (UTC)[reply]
- Note I will use different settings for different wikis, and I don't think 14/7 is the right solution (especially for actively edited pages). For most wikis I suggest 1/0 with a template skip list, unless someone is actively cleaning up unconnected pages and suggests a different setting (in many wikis there is no one).--GZWDer (talk) 08:38, 8 August 2020 (UTC)[reply]
- Before you import unconnected pages and especially if you do 1/0 you should definitely talk with affected communities. You should not just import them without notifying anyone. --Hjart (talk) 08:59, 8 August 2020 (UTC)[reply]
- Oppose without much better efforts to identify matches to existing items, and (if new items must be created) to much more comprehensively import information to new items, to make them properly identifiable from their statements. Also support delay to see whether new items get moved, merged, added to, or deleted. And above all, please can we encourage creators of new articles to link their own new articles, rather than assume that a bot (might) come and do it for them. Oppose any bot doing this, unless there is such an information campaign on major wikis. Jheald (talk) 09:05, 8 August 2020 (UTC)[reply]
- @Jheald: "encourage creators of new articles to link their own new articles" - but as long as some users do not do this, there will always be a backlog of unconnected pages. Moreover, many tools (PetScan, HarvestTemplate, projectmerge, etc.) require an existing item to work, so for best effect items should be created beforehand.--GZWDer (talk) 09:15, 8 August 2020 (UTC)[reply]
- Oppose Duplicates will only be found if many properties are filled in on these items. Creating empty duplicates just moves the problem from area A to area B, and you can even debate how big the issue is if some items from the Zulu wiki (or any other low-volume wiki) do not get connected to any Wikidata item within a week. This should ALSO be discussed with the communities of these wikis. Edoderoo (talk) 09:18, 8 August 2020 (UTC)[reply]
- @Edoderoo: This also means that if a page is never connected, a duplicate is never found. Creating an item allows using tools like projectmerge or KrBot's number merge, or, more crudely, a search; in addition, when more data are added, duplicates will surface (User_talk:Ghuron/Archives/2018#Extra_US_presidents). In most wikis, there is nobody taking care of unconnected pages; even in the most active ones like nlwiki, a bot doing so is still required ([1]).--GZWDer (talk) 09:25, 8 August 2020 (UTC)[reply]
- For me it is NOT an issue if page XYZ on the Zulu wiki is not connected to Wikidata or to any other wiki. And again, you didn't answer the part: creating EMPTY items will not be any help in finding duplicates. So how many properties will your bot add, and how will it define those properties? I know from personal experience that a bot adding properties leads to other HUGE issues and tremendous additional manual double-checking work. Edoderoo (talk) 09:38, 8 August 2020 (UTC)[reply]
- Agree. Duplicates do not exist in the meta-space of WD plus WP. It is either a WD duplicate or not. Before you create the item there is no WD duplicate. --SCIdude (talk) 09:42, 8 August 2020 (UTC)[reply]
- @Edoderoo: In my opinion a hidden duplicate is still a duplicate. This task only covers creating new items; adding statements is another thing. (User:NoclaimsBot is a good idea, but we need to generalize it - it currently only works on a few wikis and there is no workflow for suggesting a new template or category to add.) @SCIdude: This basically means we will have an infinitely growing number of unconnected pages until someone takes care of them (unlikely in smaller wikis), which defeats the purpose of Wikidata as a centralized interlanguage links store.--GZWDer (talk) 09:49, 8 August 2020 (UTC)[reply]
- Creating blank-shitty-items that have no value at all can be done by anyone using PetScan. But if your bot cannot add any value, I am absolutely against. You'd better put your energy into creating value, instead of shitty volume. Edoderoo (talk) 10:09, 8 August 2020 (UTC)[reply]
- Not every page will be handled by a human in due time (see the BotMultichillT example).--GZWDer (talk) 10:21, 8 August 2020 (UTC)[reply]
- @GZWDer: regarding NoclaimsBot: more wikis can be added, it's easy, and anyone can add other templates. Categories produced too many false positives so I didn't implement that part. Multichill (talk) 13:18, 8 August 2020 (UTC)[reply]
- @Mike Peel: I think PiBot is approved for some activities in this kind of area. Any thoughts on best practice, minimum requirements, what other bots are active, and how to function appropriately in this area? Jheald (talk) 09:35, 8 August 2020 (UTC)[reply]
- I'm generally in support of this bot, I come across quite a lot of items that it's usefully created. I think waiting for 2 weeks after page creation is a good idea if the items are going to be blank - pi bot creates them within an hour for humans, but it's also matching pages with existing ones wherever possible, plus it's importing basic data about the humans from enwp at the same time. It would be nice if there was a way of finding matches between unconnected pages and existing items, to avoid duplicates, but this is tricky. I have some scripts that search Wikidata for the page title to find matches, skipping items that already have a sitelink, but they then need a human to check through the matches to see if they are correct. Thanks. Mike Peel (talk) 10:03, 8 August 2020 (UTC)[reply]
- Yeah, though Pi bot is only run on some wikis.--GZWDer (talk) 10:21, 8 August 2020 (UTC)[reply]
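Mike Peel's matching scripts above search Wikidata for the page title and skip items that already have a sitelink, leaving a human to confirm the matches. A toy sketch of that filtering step over pre-fetched search results (the record layout here is invented for illustration; a real script would query the Wikidata API):

```python
def match_candidates(title, wiki, search_results):
    """From a list of candidate item records (dicts with 'id', 'labels'
    and 'sitelinks'), keep those whose label equals the unconnected
    page's title and that have no sitelink on that wiki yet -- these
    still need a human to confirm before connecting or merging."""
    hits = []
    for item in search_results:
        label_match = title in item.get("labels", {}).values()
        already_linked = wiki in item.get("sitelinks", {})
        if label_match and not already_linked:
            hits.append(item["id"])
    return hits

# Hypothetical search results for an unconnected enwiki page "Foo Bar":
results = [
    {"id": "Q100", "labels": {"en": "Foo Bar"}, "sitelinks": {}},
    {"id": "Q101", "labels": {"en": "Foo Bar"}, "sitelinks": {"enwiki": "Foo Bar"}},
    {"id": "Q102", "labels": {"en": "Foo Baz"}, "sitelinks": {}},
]
print(match_candidates("Foo Bar", "enwiki", results))  # ['Q100']
```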
- Oppose for wikis that mass-generate articles automatically like cebwiki (not sure if there are others?). Support for other wikis, with the current age restrictions (14/7). − Pintoch (talk) 10:10, 8 August 2020 (UTC)[reply]
- Another point: if items are not created at all, few people will notice the existence of local pages. Many people have imported locations from various sources, where many duplicates of cebwiki pages exist. If duplicates (specifically hidden ones) are going to happen, let them happen as early as possible.--GZWDer (talk) 10:21, 8 August 2020 (UTC)[reply]
- Oppose until a satisfactory explanation is offered about why it is better than the failed proposal Wikidata:Requests for permissions/Bot/GZWDer (flood) 2. – The preceding unsigned comment was added by Jc3s5h (talk • contribs) at 10:53, 8 August 2020 (UTC).[reply]
- Duplicates will exist, but creating items makes them visible and allows people to actively work on them. In previous years there were significant imports from other sources without notice of possible duplicates.--GZWDer (talk) 11:17, 8 August 2020 (UTC)[reply]
- I wrote newitem.py to make sure the backlog doesn't get too large. You want to give users the time to connect articles to existing items or create new items with some statements. If that doesn't happen in time, the bot comes along to create an empty item so that we don't get an ever-growing backlog. The current settings for the Dutch Wikipedia are: the article has to be at least 49 days old and the last edit has to be at least 35 days ago. Running this bot with creation set to 1 day and last edit set to 0 days is ridiculous. Why this rush? Oppose with those crazy settings. Multichill (talk) 13:18, 8 August 2020 (UTC)[reply]
- @Multichill: What is the purpose of a last-edit threshold?--GZWDer (talk) 14:20, 8 August 2020 (UTC)[reply]
- If you're going to run this on many wikis I would take a conservative approach and set creation around 30 days and last edit around 45 days. Multichill (talk) 14:28, 8 August 2020 (UTC)[reply]
- @Multichill: I still do not get the point of the last-edit threshold. Many local wiki users do not care about Wikidata, and I can find many articles in DYK, GA or FA without a connected Wikidata item.--GZWDer (talk) 14:33, 8 August 2020 (UTC)[reply]
- To increase the chance that the article is in a stable situation. For example, you don't want to create items for articles that are nominated for deletion. Multichill (talk) 14:39, 8 August 2020 (UTC)[reply]
- @Multichill: FA, GA and DYK articles are usually not "stable" in this sense, as they are current hot topics. Instead, the bot skips pages containing certain defined templates.--GZWDer (talk) 15:55, 8 August 2020 (UTC)[reply]
- Oppose in the same sense as Multichill. I have a page with some tools I use successfully to find matches of particular interest to me. If a duplicate item is created rather than an available match, then this PetScan query, related to s:Dictionary of National Biography, will not pick it up (within its scope) until it is merged here, which I know can take a year. I use this PetScan query to patrol for likely candidates, and this works well. When an item for a person needs to be created, I use https://mix-n-match.toolforge.org/ first, to see whether I can create an item with some identifiers on it, as a start. These workflows have worked fine for me. They have been disrupted by the short lead time: there may be some trade-off here, but I will not be sure about the effect of creating the needing-to-be-merged-here items until they start appearing in the medium term.
- Where, please, is the urgency of a change in the status quo? Waiting a few weeks is sensible, in the general case. There needs to be a clear explanation of what is currently broken and requiring to be fixed. Charles Matthews (talk) 14:08, 8 August 2020 (UTC)[reply]
- @Charles Matthews: Currently there is no bot at all cleaning up the backlog on most wikis. If possible, I can run the enwiki one with the default settings, which will at least significantly reduce the backlog. For other wikis, alternative settings may be used.--GZWDer (talk) 14:22, 8 August 2020 (UTC)[reply]
- Three points here:
- As is usual in discussions with you, you do not answer the question directly, but start another line of discussion. While this may be politeness in some ways, it does not correspond to the needs of wiki culture.
- When I read this diff from your user talk page, I thought that you simply don't understand the merging issue for duplicates. Technically, merging is fairly easy on Wikidata. But to do it responsibly, particularly for human (Q5) items but also in other areas such as medical terms, is hard work.
- When you ask for a bot permission that gives you many options that you might use, I'm inclined to refuse. I think you should specify what you will do, not talk about what you might do. If you define some backlogs you want to clear, and say how you might clear them, that might be OK.
- Charles Matthews (talk) 14:48, 8 August 2020 (UTC)[reply]
- @Charles Matthews: For this task only, it has only one job - fully automatically creating new items from Wikimedia pages. For point #2: it is not a good thing either to have many hidden duplicates in an infinitely growing backlog of unconnected pages that nobody is cleaning up. Things get even worse when many new items are created, increasing the number of hidden duplicates. (Mix'n'Match can only find pages in one wiki.) --GZWDer (talk) 15:21, 8 August 2020 (UTC)[reply]
- Well, I gave a link to a serious dispute on your user talk page. This dispute actually needs to be resolved. It changes the situation. Let me explain: I do not always agree with the idea that bot tasks should be completely specified. It is usually better if the bot operator agrees to stay within the bot policy. The dispute is about good practice in the creation of duplicates here, which is not currently mentioned in the bot policy.
- But the way you are arguing is likely to have the result that not creating too many duplicates is added to the bot policy. Because many people disagree with you. In the end, disputes are resolved by addressing the issues.
- A possible solution here is to divide up the language codes into groups, and try to get some agreement on how long to wait for each group of codes. If you can give details of wikis that "nobody is cleaning up", probably that could be a basis for discussion. If you are really saying this is a "long tail" problem, where there is more in the "infinitely growing backlog" of "hidden duplicates", as you call it, than we all know, then we do need to understand how fat the tail is. If there are 250 out of ~300 wikipedias involved, generally the smallest, then maybe it is comprehensible as an issue. The ceb wikipedia is clearly an edge case, and we should exclude it at present. Charles Matthews (talk) 16:14, 8 August 2020 (UTC)[reply]
- @Charles Matthews: See User:GZWDer_(flood)/Automatic_creation_schedule, currently involving 151 different wikis (including all Wikipedias with more than 10000 articles except four). See also here for the number of unconnected pages older than two weeks per wiki, and here for the history of the backlog on enwiki.
- Previously, when PetScan was used, pages with a title identical to the label of an existing item were skipped by default. However, I don't think this is good practice, as the set of skipped pages is itself infinitely growing. So I decided to import all of them; duplicates can be found once items are created (especially when more statements are added).--GZWDer (talk) 16:45, 8 August 2020 (UTC)[reply]
- It seems you are missing the point of what I am saying, and also the point that almost everyone here is opposing. Charles Matthews (talk) 18:41, 8 August 2020 (UTC)[reply]
- @Charles Matthews: Creating items from unconnected pages will result in duplicates unless they are checked one by one, which is not possible in an automatic process. The duplicates do not disappear if the items are not created. So let it happen, which will surface the work that needs to be done, unless significantly many people are doing it another way (i.e. cleaning up unconnected pages manually).--GZWDer (talk) 18:47, 8 August 2020 (UTC)[reply]
- To be clear, you need to engage here with criticism. If your attitude is "all or nothing", then clearly at this time you get nothing. Charles Matthews (talk) 19:03, 8 August 2020 (UTC)[reply]
- @Charles Matthews: Do you agree to automatically create new items for articles older than a threshold (14 days by default, but possibly longer for wikis with users actively handling unconnected pages)? Duplicates will happen regardless (there is no automatic way to prevent them), but unconnected pages are likely abandoned (i.e. not actively handled) if they are not handled within a specific timeframe. In other words, we currently have two workflows (handle unconnected pages before any automatic import, or handle them after import), and this proposes a cut-over point beyond which the cost of creating items late (i.e. being unable to use tools to extend items, and possible duplication with recently created items) outweighs the gain (i.e. avoiding duplicates at creation, and pages not yet ripe for human handling).--GZWDer (talk) 13:42, 9 August 2020 (UTC)[reply]
- @GZWDer: I can agree to a two-phase system, in which (phase I) newly-created wikipedia items are left for a period, and then (phase II) automatic creation of a Wikidata item takes place. In all your suggestions, it seems to me, you make phase I too short. I agree that there is a kind of trade-off here, and that we can accept some duplicates caused in phase II. That doesn't mean that phase II of automated creation has to be blind to the duplication issue. I don't think it is a good idea to apply the same workflow to all wikipedias. (And I would say, as a Wikisource editor, there is much work to do there, also.) Charles Matthews (talk) 13:59, 9 August 2020 (UTC)[reply]
- @Charles Matthews: Originally I also wanted to cover the Wikisources; as more controversies are expected (and were met in the past), the schedule currently only includes the Chinese one. So for phase II there are some options:
- Create items for older articles en masse, as I originally proposed.
- Increase the interval between creations, e.g. create items once a year, which is what I did between 2014 and 2020 - this does not solve all issues.
- Not creating the items at all. This will result in an infinitely growing backlog, which I am strongly worried about (even for cebwiki). Also, in the future other users will create items covering the same topics without noticing the local articles.
- Manually checking each article - requires language skills and is not always scalable
- Create them subset by subset (I imported more than 40000 individual English Wikisource articles)
- And other ideas?--GZWDer (talk) 14:10, 9 August 2020 (UTC)[reply]
@GZWDer: In the big picture, this really is not a simple issue. Here is a table for what seems to be needed.
| Plan for: | Phase I | Phase II | Comments |
|---|---|---|---|
| A: smaller wikipedias | | | |
| B: larger wikipedias | | | |
| C: other wikis | | | |
There is a comments column because: firstly, there are points about scope (deciding about "smaller", "larger" and ceb); secondly, there are issues about item creation in Phase II. Charles Matthews (talk) 08:03, 10 August 2020 (UTC)[reply]
- Oppose Looks to me like you want permission to run a new bot which basically just does the same thing your old bot did. It's clear to me that you will not get a go for something like that. If you want permission for a new bot, you will need to build something that addresses our concerns substantially better than your old one.--Hjart (talk) 16:12, 8 August 2020 (UTC)[reply]
- Questions (please answer both): 1) Will your "Mass creation of items" import at least a few basic properties like instance of (P31), date of birth (P569) or located in the administrative territorial entity (P131) as well? Or will it create only a single-language label and sitelink? And 2) can you explain or link to a description of what you mean by "modified version of newitem.py"? -Animalparty (talk) 16:40, 8 August 2020 (UTC)[reply]
- @Animalparty: 1) This task only involves the creation of new items. For people, I (and some other users) will occasionally add instance of (P31)=human (Q5) in batches. Importing dates is not in the scope of this task (see Wikidata:Database_reports/Deaths_at_Wikipedia and the related recentdeaths/harvesttemplates tools, which only work if an item exists). 2) There are two modifications: 1. an extended template skip list (see phab:T257739 for the feature request); 2. the touch (of local pages) feature is disabled (in my experience it is not necessary).--GZWDer (talk) 18:35, 8 August 2020 (UTC)[reply]
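The template skip list mentioned in point 1 can be pictured as a check on the page's wikitext before item creation: anything transcluding a listed template (deletion nominations, merge proposals, etc.) is skipped. A rough sketch, with illustrative template names rather than the bot's actual configuration:

```python
import re

# Illustrative skip list; the real bot reads per-wiki configuration.
SKIP_TEMPLATES = ["Delete", "Afd", "Merge to"]

def should_skip(wikitext, skip_templates=SKIP_TEMPLATES):
    """Return True if the page transcludes any template on the skip
    list, so no item is created while the page is unstable."""
    for name in skip_templates:
        # Match {{Name}} or {{name|...}}; first letter is
        # case-insensitive, as in MediaWiki template names.
        first = "[" + name[0].upper() + name[0].lower() + "]"
        pattern = r"\{\{\s*" + first + re.escape(name[1:]) + r"\s*[|}]"
        if re.search(pattern, wikitext):
            return True
    return False

print(should_skip("{{Afd|not notable}}Some article text."))   # True
print(should_skip("Plain article with {{Infobox person}}."))  # False
```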
- Oppose Some time ago this bot made a lot of mistakes and duplicates. -- – The preceding unsigned comment was added by Voyagerim (talk • contribs) at 17:36, 8 August 2020 (UTC).[reply]
de facto practice in July 2020
This is what is being planned:

| Plan for: | Setting | Comments |
|---|---|---|
| Group 1: All Wikipedias other than those listed below | 14/0 (14/7 is also OK but not my preference) | Should we split this into larger and smaller wikis with different settings? |
| Group 2: Some Wikipedias such as dawiki | 21/0 (or 21/7, 30/7) | Wikis with active users handling Wikidata. Alternatively, each wiki may use a custom setting. |
| Group 3: zhwiki, zhwikisource, all Wikinews | 1/0 | If agreed; any Wikidata actions (i.e. improvement and merge) can happen after item creation. The client sitelink widget functions regardless of whether an item exists. |
| nlwiki, cswiki | Not to be done | |
| cebwiki | Planned mass import regardless of duplicates, then treated as Group 1 | Leaving pages unconnected indefinitely will result in more and more duplicates |
| arzwiki | Currently skipped, but eventually to be done | Currently there are many articles created based on Wikidata that are not connected to Wikidata; this is being fixed |
| Wikisource (other than zh): non-subpages in the main namespace, pages in the Author namespace | Treated as Group 1 (by default) or 2 | |
| Wikisource (other than zh): subpages | Manual batch import on a case-by-case basis | |

- Wikisource (other than zh) is not planned initially
- "Pages" includes articles and categories, but non-Wikipedia categories are not planned initially
--GZWDer (talk) 08:39, 10 August 2020 (UTC)[reply]
- Oppose until previous problems are fixed, e.g. User_talk:GZWDer#Mass_creation_of_items_without_labels. --- Jura 12:15, 11 August 2020 (UTC)[reply]
- More detailed comment at Wikidata:Requests for permissions/Bot/RegularBot. --- Jura 06:01, 12 August 2020 (UTC)[reply]
- Oppose if the behaviour of the flood bot isn't addressed. Could you please add some heuristic before creating the item, for example check for similarly named items and if there's already something with a 50% similarity leave it in a list to review manually? --Sabas88 (talk) 08:56, 14 August 2020 (UTC)[reply]
- @Sabas88: I am afraid that this would still result in a backlog. For some time, when the Wikidata item creator was functional (since deprecated in favor of PetScan), it checked and skipped any page whose title was the same as the label of an existing item. After several runs, the list of skipped pages became longer and longer. I do not think any human checking beforehand is scalable. Anyway, new pages are held for several days, during which users may create items or link pages to existing ones. It is unlikely for a page to be taken care of once a significant period has passed since its creation.--GZWDer (talk) 23:17, 14 August 2020 (UTC)[reply]
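Sabas88's 50% similarity heuristic above could be approximated with a standard string-similarity ratio, routing near matches to a manual review queue instead of auto-creating an item. A minimal sketch using Python's difflib (the threshold comes from the suggestion; the function name and workflow are illustrative, not any existing bot's code):

```python
from difflib import SequenceMatcher

def review_candidates(title, existing_labels, threshold=0.5):
    """Return existing-item labels at least `threshold` similar to the
    new page title; pages with any such match would be listed for
    manual review rather than getting an automatically created item."""
    return [label for label in existing_labels
            if SequenceMatcher(None, title.lower(), label.lower()).ratio() >= threshold]

print(review_candidates("Douglas Adams", ["Douglas N. Adams", "Qwrtv"]))
# ['Douglas N. Adams'] -> hold this page for manual review
```

As GZWDer notes, the catch is that the review queue itself can grow without bound if nobody works through it.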
- Comment Is there any way we can stem this problem at the source - when a user creates a new page on a language wiki, can we get the UI to immediately try to link it to an existing wikidata item, encourage the user to select the right one? Is there maybe a phabricator task for this? This sort of bot action really can't be the correct long-term solution for this problem. I have run across many, many, maybe over a hundred, such page creations that should have been linked to an obvious existing wikidata item, and required a later item merge on Wikidata. ArthurPSmith (talk) 17:53, 18 August 2020 (UTC)[reply]
- I like the idea by @ArthurPSmith: very much. @Lydia Pintscher (WMDE): what do you think? Would that be possible to implement? From my point of view, one problem is that a lot of creators of articles, categories, navigational items, templates, disambiguations, lists, commonscats, etc. are either not aware of the existence of Wikidata or forget to connect a newly created article to an already existing object, or to create a new one if none exists yet (which leads to a lot of duplicates if this creation or connection is done not manually but by a bot instead, and these then have to be merged manually). An additional step after saving a newly created article could present to the user a list of Wikidata objects that might match (e.g. a list of persons with the same name; this could be a similar algorithm to the duplicate check / suggestion list in PetScan, duplicity example 1 and duplicity example 2), or the option to create a new one if none matches. Thanks a lot! --M2k~dewiki (talk) 22:47, 28 August 2020 (UTC)[reply]
- also ping to @Lucas Werkmeister (WMDE), Mohammed Sadat (WMDE), Lea Lacroix (WMDE): for info --M2k~dewiki (talk) 22:50, 28 August 2020 (UTC)[reply]
- Also ping to @Lantus, MisterSynergy, Olaf Studt, Bahnmoeller: In addition, I think User:Pi bot, operated by @Mike Peel:, does a great job of connecting people-related pages to existing objects, or creating new ones where none exist (currently only for the English Wikipedia; until June 2020, and for about one year, also for the German Wikipedia - thanks a lot to Mike! In my opinion this should be reactivated for the German Wikipedia as well). Of course, the algorithm could be improved, for example by also considering various IDs (like GND, VIAF, LCCN, IMDb, ...). The algorithm is described here: User_talk:Mike_Peel/Archive_2#Matching_existing_wikidata_objects_with_unconnected_articles.
Since this very fundamental problem of connecting articles to existing objects, respectively creating new objects for unconnected pages (when, by whom, how to avoid duplicates, ...), for hundreds of newly created articles per day in different language versions has been discussed for years now, the above proposal by ArthurPSmith could be a solution to it. It might be combined with specialized bots like Mike's Pi bot for people (and maybe others for movies, geographic objects, lists, categories, ...).
Also see
- Topic:Vplpjjjpwfpmelfo
- Wikidata:Forum/Archiv/2019/08#Kann_man_hier_auch_Vandalen_in_die_Schranken_weisen?
- de:Benutzer:M2k~dewiki/FAQ#Warum_fehlen_die_von_mir_gewünschten_Informationen_im_neu_erstellten_Wikidata-Objekt?
- de:Benutzer_Diskussion:M2k~dewiki/Archiv/2019#Einträge_auf_Wikidata
- Wikidata:Forum/Archiv/2020/07#Wikidata-Objekte_für_noch_nicht_zugordnete_Artikel,_Kategorien,_Vorlagen,_Listen,_Begriffsklärungen,_mit_bestehenden_Objekten_verbinden_bzw._neu_anlegen
- Wikidata:Client editing prototype --M2k~dewiki (talk) 23:23, 28 August 2020 (UTC)[reply]
Also for info to @Derzno, Jean-Frédéric, Mfchris84, Giorgio Michele, Ordercrazy: another problem regarding item creation and duplicates is that there are many existing entries, e.g. for French or German monuments, churches, etc., which contain the monument ID. For example, for Bavarian monuments there are currently 160,000 Wikidata objects. But if a user connects a newly created article to an (unconnected) commonscat for such a monument (using "add other language" in the left navigation), an additional Wikidata object is created: one object containing the sitelinks to the article and the commonscat, and another one with the monument ID. Currently the only solution is to connect a newly created commonscat for a monument to the already existing Wikidata object with the monument ID as soon as possible; then, if a user connects an article to this commonscat, the existing Wikidata object will be used. Otherwise, a new object with only the two sitelinks will be created. For example, in 2020 so far about 1,000 new commonscats for Bavarian monuments have been created which have not been connected to the already existing Wikidata objects by the creators of the commonscats.
Also see:
- User_talk:Derzno#BLfD-Objekte_und_Commonscat (difflink)
- de:Benutzer:M2k~dewiki/FAQ#Wie_ist_das_mit_bestehenden_Wikidata-Objekten_und_unverbundenen_Commonscat_und_Artikel_(inbesondere_Baudenkmäler_in_Bayern)_? --M2k~dewiki (talk) 23:59, 28 August 2020 (UTC)[reply]
- Hello @Lantus, Olaf Studt, Bahnmoeller: I have now created these two pages:
The first one might help to find and connect unconnected articles, categories, templates, ... from de-WP to existing objects, respectively to create new Wikidata objects. The second one might help to enhance existing objects with IDs (using HarvestTemplates for GND, VIAF, LCCN/LCAuth, IMDb, ...) or other properties (e.g. using PetScan based on categories). Parts of the functionality of these two pages might sooner or later be implemented in (specialized) bots. --M2k~dewiki (talk) 01:23, 29 August 2020 (UTC)[reply]
- The problems with Bavaria are fairly complex and a mixture of many different issues. It started with the bot transfer in 2017, followed by certain other root causes. In the end the datasets are in very bad shape and it's a nightmare to clean up; day by day I find new corners of surprises. Currently I'm working hard to get a couple of these issues fixed. On top of these bot issues, we need to find a way to stop people doing the same thing as in the German Wikipedia: some folks keep pulling things together again and again to be in line with articles. So the P4244 issue list will be filled up again and again with duplicates, wrong usage and violations. I have no idea how we can stop this; personally I've given up discussing with people who have no mindset for database design and definitions. Most are living in their own world. Anyhow, complaining doesn't help, and I'm doing my best to work through the P4244 issues. To be honest this is a job for months, and many items need to be checked manually. So I'm not happy to be loaded up, through the back door, with new issues from a bot task. --Derzno (talk) 03:31, 29 August 2020 (UTC)[reply]
- As I have said many times: in most wikis, there are not enough people to take care of unconnected pages. If possible, I can postpone the item creation (the plan is 14 days after article creation), but the backlog must be cleared eventually. --GZWDer (talk) 04:11, 29 August 2020 (UTC)[reply]
- @Derzno: the problem is not only related to Bavarian monuments, but affects all cases in all languages, all language versions of Wikipedia, and all sorts of object types (e.g. movies with totally different names in different languages, chemical compounds, ...) where datasets have been imported before but not connected to articles, commonscats, etc. How would a user find the right object among the 90 million existing objects (Special:Statistics)? What if a user looks for "Burgstall Humpfldumpf", does not find it, and therefore creates a new object for this article/commonscat combination, while there might already exist an object "Bodendenkmal D-123-456-789" or an object for the Japanese or Russian translation? Duplicates might eventually be identified and merged by identical IDs (like GND, LCCN/LCAuth, VIAF, IAAF, IMDb, and monument IDs like Palissy-ID for French monuments, BLfD-ID for Bavarian monuments, DenkXWeb Objektnummer for Hesse state monuments, BLDAM for monuments from Brandenburg, LfDS for monuments from Saxony, P2951 for Austrian monuments, CAS numbers for chemical compounds, ...). How could this matching by ID (currently there are more than 8,000 properties, many of which are IDs) be handled at large scale, i.e. for the hundreds of new articles created every day across several language versions of Wikipedia which need to be connected to possibly existing objects? --M2k~dewiki (talk) 07:21, 29 August 2020 (UTC)[reply]
- So we create these items (including duplicates) first, and someone will improve them; duplicates are discovered and merged. Originally I expected this would become the primary workflow for unconnected pages, which is why I previously ran the bot with a 1/0 setting instead of the default 14/7. There are people who take care of new Wikipedia articles; my earlier expectation was to move the Wikidata handling entirely to after item creation. Using a delay is expected by many users, but the work (i.e. clearing unconnected pages) should eventually be done. In other words, I give people some time to make the Wikidata connection, and after the time limit, new items are created automatically. --GZWDer (talk) 09:05, 29 August 2020 (UTC)[reply]
- Also see Wikidata:Contact_the_development_team#Connecting_newly_created_articles_to_existing_objects_resp._creating_new_object_-_additional_step_when_creating_articles,_categories,_etc. (difflink). --M2k~dewiki (talk)
- Oppose. Moving a backlog from place A to place B makes sense only if place B has a more active community or better tools, but this does not seem to be the case. I often encounter duplicates created by bots of this kind many years ago, and at least on my home wiki some of them might have been spotted earlier if they had simply remained unconnected. --Silvonen (talk) 16:18, 12 September 2020 (UTC)[reply]
- Without a bot of this kind, pages can be left unconnected indefinitely, which is not optimal for users of PetScan, HarvestTemplate and projectmerge, or even for users trying to find an item about the topic (they will never know about a local unconnected page, and it is almost impossible to check each of 900 wikis to find whether a topic exists). If a specific wiki does not have enough people to handle all unconnected pages, we have a reason to mass create them (after a period). (Yes, for wikis with active users doing this, we can postpone creation; but even nlwiki requires a bot to clear the backlog.) --GZWDer (talk) 10:34, 13 September 2020 (UTC)[reply]
- Oppose. I'm sceptical in general, but actively hostile to anything run by this user ever since I found a group of hundreds of duplicate items that were so easy to connect to their originals, I managed to do it with Quickstatements. Bots currently have several orders of magnitude more capacity here than active manual editors. With that in mind, running a bot that carelessly adds wrong/duplicate items that require manual correction is wasting the more precious resource. A semi-automated workflow would be preferable, and might have an easier time attracting users if my impression is correct that creating new items is the subjectively more rewarding experience compared to correcting existing items. --Matthias Winkelmann (talk) 00:22, 13 September 2020 (UTC)[reply]
- "new items is the subjectively more rewarding" - yes for the reason I stated above. It requires some work to clean up duplicates, but bring them to Wikidata will allow more users noticing it, especially for wikis with few users handling them locally.--GZWDer (talk) 10:34, 13 September 2020 (UTC)[reply]
- @Charles Matthews: You have not commented on this plan yet.--GZWDer (talk) 10:35, 13 September 2020 (UTC)[reply]
- @GZWDer: I commented on 8 August that the urgency of item creation here for newly-created articles on wikipedias is not as great as you are assuming. My view remains the same. Certainly for enWP, which most concerns me, waiting longer and adding more value to items that are created is a good idea. So I will not support a plan of this kind. Charles Matthews (talk) 10:47, 13 September 2020 (UTC)[reply]
- @Charles Matthews: For wikis with active users handling unconnected pages, creation can wait a bit longer. But a page is less likely to be connected if it has not been connected for a while (so a trade-off must be chosen), and not creating the items also impedes the use of many tools (as I responded to Silvonen). --GZWDer (talk) 11:04, 13 September 2020 (UTC)[reply]
- @GZWDer: Clearly, there are a number of trade-offs to consider here. But since we don't agree about those trade-offs, we are not so likely to agree on a plan. I am arguing from my actual workflow, starting with PetScan (queries on User:Charles Matthews/Petscan). I become involved in article writing, such as w:Sir James Wright, 1st Baronet, through using queries. Using those queries is positive for my work on enWS and enWP. I think Wikidata is important in integrating Wikimedia projects, so I do not oppose the principle of automated creation of items here. But I do oppose doing it too quickly. Charles Matthews (talk) 11:15, 13 September 2020 (UTC)[reply]
- hmm Wikidata:Requests for permissions/Bot/JonHaraldSøbyWMNO-bot 2 - this is one of the reasons I proposed to mass import pages from Cebuano Wikipedia (and other wikis): others will import something similar, so importing them earlier will reduce the number of duplicates. --GZWDer (talk) 14:14, 28 September 2020 (UTC)[reply]
- We still need people to clean up mass imports of defective data from cebwiki: Wikidata:Bot_requests#elevation_above_sea_level_(P2044)_values_imported_from_ceb-Wiki. Does this concern your edits too? --- Jura 14:37, 28 September 2020 (UTC)[reply]
- If the values are defective, we simply do not import such statements. But at the very least we need the sitelinks. --GZWDer (talk) 02:02, 29 September 2020 (UTC)[reply]
- I didn't read the whole discussion, but shouldn't this be handled on the Wikipedia side? After a user saves their article, a window with a reminder to connect the article to a Wikidata item should pop up, or something similar. Eurohunter (talk) 16:15, 21 December 2020 (UTC)[reply]
- Oppose Frankly, I am getting a bit tired of all these one-sitelink item creations. From a Wikipedia point of view, statements should not be taken from the Wikipedias (and especially not with tools or bots that don't reuse the existing citations), and the length of the backlog does not matter at all. On a Wikipedia priority list, this backlog of unconnected pages is always going to be low down, as it should be. --Snaevar (talk) 18:19, 21 December 2020 (UTC)[reply]
- @Eurohunter: also see meta:Community Wishlist Survey 2021/Wikidata/Creation of new objects resp. connecting to existing objects while avoiding duplicates. --M2k~dewiki (talk) 18:28, 21 December 2020 (UTC)[reply]
- @M2k~dewiki: Just wanted to vote but it ended. Eurohunter (talk) 20:05, 21 December 2020 (UTC)[reply]
- In case it wasn't clear earlier, I Support this bot request. Duplicates are an issue (I frequently merge items created by this bot), so I think it is best if the bot waits for a few days before creating the item, but not running it creates a backlog of unconnected items that gets in the way of matching new items. Pi bot also now imports various statements (such as commons category links and descriptions, hopefully coordinates soon) for non-humans, but only if the item already exists - and again, not having the Wikidata item creates backlogs for those tasks. @GZWDer: I know you don't like it, but could you adopt the '14/7' rule please, and clear the backlog? Thanks. Mike Peel (talk) 19:11, 28 December 2020 (UTC)[reply]
- So:
| Plan for | Setting | Comments |
|---|---|---|
| Default: all Wikipedias, and Wikisource non-subpages | 14/0 or 14/7 | - |
| Some specific wikis (please comment below) | TBD | |
| All Wikinews | 1/0 | If approved, will succeed Wikidata:Requests for permissions/Bot/RegularBot 3 |
| nlwiki, cswiki | Not to be done | |
| cebwiki | Items will be created with at least one identifier (or source) other than GeoNames. The actual code is to be developed. | |
| arzwiki | Currently skipped | Will be re-evaluated if bot creation of articles is stopped |
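For readers unfamiliar with the "14/7"-style settings above (days since page creation / days since last edit), the eligibility rule could be sketched roughly as follows. This is a hypothetical illustration of the threshold logic only, not the actual code of newitem.py or RegularBot; the function name and parameters are invented for this sketch.

```python
from datetime import datetime, timedelta, timezone

def should_create_item(created, last_edited, now=None,
                       min_age_days=14, min_stable_days=7):
    """Return True if an unconnected page is old enough (min_age_days
    since creation) and stable enough (min_stable_days since last edit)
    for a new item to be created for it. A 1/0 setting creates items
    almost immediately; 14/7 gives local editors two weeks to connect
    the page themselves."""
    now = now or datetime.now(timezone.utc)
    return (now - created >= timedelta(days=min_age_days)
            and now - last_edited >= timedelta(days=min_stable_days))
```

Under a 14/7 setting, a page created 27 days ago and last edited 18 days ago would qualify, while a page created 8 days ago, or one edited yesterday, would not.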
- @Jheald, Edoderoo, Pintoch, Jc3s5h, Charles Matthews, Hjart: Please comment, if you want a different configuration, either in general, or in a specific wiki.--GZWDer (talk) 19:26, 28 December 2020 (UTC)[reply]
- @GZWDer: Being pragmatic (what has a chance to be approved?), I suggest that you just look at Wikipedias for this task, go with 14/7 with a list of excluded Wikipedias, and leave the rest for other bot tasks. Thanks. Mike Peel (talk) 19:33, 28 December 2020 (UTC)[reply]
- @GZWDer: I've said it before, but since you don't seem to understand it, I guess it needs to be said again: you need to actively ask every single Wikipedia for permission before running any bots on them. The Danish Wikipedia, for instance, has had people handling unconnected pages for years, and I guess many other Wikipedias have too. At the very least, don't touch dawiki. Thanks --Hjart (talk) 22:27, 28 December 2020 (UTC)[reply]
- @Hjart: Does your community run a bot that cleans up the very old backlog? If not, I will run mine on 30-day-old pages. P.S. You did not respond to my comment at Wikidata:Requests_for_permissions/Bot/RegularBot 3. --GZWDer (talk) 22:31, 28 December 2020 (UTC)[reply]
- @GZWDer: Yes, we do have such a bot. And from watching some German activity, I guess they do too. Again, please ask every single community before doing anything to their backlogs. And don't touch dawiki at all. --Hjart (talk) 22:38, 28 December 2020 (UTC)[reply]
- OK. --GZWDer (talk) 22:39, 28 December 2020 (UTC)[reply]
- I still oppose this, as I am not confident the operator can respect the views of the community on this. General lack of trust in them given the history in this area. If this task is important, someone else will step in to do it, no one is (or should be) irreplaceable. − Pintoch (talk) 21:43, 30 December 2020 (UTC)[reply]
- @Pintoch: Do you oppose a specific stripped-down version (Wikidata:Requests for permissions/Bot/RegularBot 3)? --GZWDer (talk) 18:25, 31 December 2020 (UTC)[reply]
- @Pintoch: This is an important task. While I would prefer this bot task to be accepted, I've started Wikidata:Requests for permissions/Bot/Pi bot 19 to do this for enwp at least. Thanks. Mike Peel (talk) 19:28, 3 January 2021 (UTC)[reply]
- There clearly isn't yet a meeting of minds here. Charles Matthews (talk) 11:10, 2 January 2021 (UTC)[reply]