Grants:Project/Rapid/Hjfocs/soweego 1.1
Project Goal
soweego[1] links Wikidata items to large external catalogs.
It is an artificial intelligence system built on multiple machine learning[2] algorithms, known as linkers.
Its vision is to make Wikidata the nucleus of the open data landscape.
The main goal of this proposal is to automatically get the highest-quality links by bringing soweego linkers together: unity is strength.
Problem
Pretty much like a human, soweego claims that a given Wikidata item links to a given catalog identifier with different levels of confidence.
Currently, it only considers the confidence yielded by a single linker (the best-performing one), and so does not leverage any relationship or information captured by the others. In other words, the system has only one pair of eyes, but it could benefit from extra viewpoints.
Therefore, we can improve the quality and quantity of links by letting soweego linkers join forces.
Solution
Machine learning algorithms capture information in heterogeneous ways, and they have been shown to perform better together than alone.[3][4][5]
We propose to build an ensemble system,[6] and to implement it as an enhancement of the soweego linker module.[7]
Furthermore, linkers can behave differently depending on the external catalog. Hence, it is important to automatically tune the weight of each linker in the ensemble. Finally, we will automatically set the optimal parameters of each linker through cross-validation[8] techniques.
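As an illustration of the idea, a weighted ensemble can be sketched as a weighted average of the confidence scores that each linker assigns to the same (Wikidata item, catalog identifier) pair, with a separate set of weights per catalog. This is a minimal sketch and not the soweego implementation: the linker names, catalog names, weights, and the `ensemble_confidence` function are all hypothetical placeholders.

```python
from typing import Dict

# Hypothetical per-catalog weights: one weight per linker, for each catalog.
# In the proposal these would be tuned automatically; here they are made up.
CATALOG_WEIGHTS: Dict[str, Dict[str, float]] = {
    "catalog_x": {"linker_a": 2.0, "linker_b": 1.0, "decision_tree": 1.0},
    "catalog_y": {"linker_a": 1.0, "linker_b": 1.0, "decision_tree": 3.0},
}


def ensemble_confidence(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-linker confidence scores into one weighted average.

    `scores` maps a linker name to its confidence in [0, 1] for one
    candidate link; `weights` maps the same names to non-negative weights.
    """
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("the weights of the involved linkers must sum to a positive value")
    return sum(weights[name] * score for name, score in scores.items()) / total


# Hypothetical scores given by three linkers to the same candidate link.
scores = {"linker_a": 0.9, "linker_b": 0.6, "decision_tree": 0.3}

confidence_x = ensemble_confidence(scores, CATALOG_WEIGHTS["catalog_x"])
confidence_y = ensemble_confidence(scores, CATALOG_WEIGHTS["catalog_y"])
```

With these made-up numbers, the same candidate link gets a higher combined confidence for `catalog_x`, where the most confident linker has the largest weight, than for `catalog_y`, where the least confident one dominates.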
Project Plan
Activities
- State of the art: explore best practices in ensemble learning and investigate related approaches applied to soweego's task, namely record linkage;[9]
- add decision trees[10] to the current pool of linkers;
- develop the ensemble system;
- implement automatic hyperparameter tuning of linkers;
- implement automatic weighting of each linker, for each supported catalog;
- evaluate performance and compare to previous results without ensemble;
- write reports and include them in an MSc thesis at the University of Trento (Q930528), supervised by Hjfocs.
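The hyperparameter tuning activity can be pictured with a small k-fold cross-validation loop: for each candidate value, the model is fitted on all folds but one, evaluated on the held-out fold, and the value with the best average accuracy wins. The sketch below is only an assumption-laden toy: it uses a one-dimensional nearest-neighbour classifier as a stand-in for a linker, and the function names, the sample data, and the grid of values are all invented for illustration.

```python
from typing import List, Sequence, Tuple

# One labelled sample: (similarity score, whether the pair is a true link).
Sample = Tuple[float, bool]


def knn_predict(train: Sequence[Sample], score: float, n_neighbors: int) -> bool:
    """Predict a link by majority vote among the nearest training scores."""
    nearest = sorted(train, key=lambda s: abs(s[0] - score))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return votes * 2 > len(nearest)  # a tie counts as "no link"


def cross_val_accuracy(samples: Sequence[Sample], n_neighbors: int, k: int) -> float:
    """Average accuracy over k folds, each held out once for evaluation."""
    correct = 0
    for fold in range(k):
        held_out = samples[fold::k]  # every k-th sample, starting at `fold`
        train = [s for i, s in enumerate(samples) if i % k != fold]
        correct += sum(
            knn_predict(train, score, n_neighbors) == label
            for score, label in held_out
        )
    return correct / len(samples)


def tune(samples: Sequence[Sample], grid: List[int], k: int = 3) -> int:
    """Return the hyperparameter value with the best cross-validated accuracy."""
    return max(grid, key=lambda n: cross_val_accuracy(samples, n, k))


# Made-up labelled pairs: high scores tend to be true links.
samples: List[Sample] = [
    (0.05, False), (0.15, False), (0.25, False), (0.35, False), (0.45, True),
    (0.55, False), (0.65, True), (0.75, True), (0.85, True), (0.95, True),
    (0.10, False), (0.90, True),
]
grid = [1, 3, 5]
best_n_neighbors = tune(samples, grid)
```

The same loop applies to any linker with tunable settings: only `knn_predict` and the grid would change.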
Outcomes
- Release of soweego unity is strength (version 1.1);
- delivery of ready-to-use documentation;
- engagement of developers through the standard social coding workflow: understand, fork, make a pull request.
Community notification
- Wikidata: https://lists.wikimedia.org/pipermail/wikidata/2019-July/013263.html
- Wiki research: https://lists.wikimedia.org/pipermail/wiki-research-l/2019-July/006864.html
- Wikimedia AI: https://lists.wikimedia.org/pipermail/ai/2019-July/000277.html
Impact
- 229k confident Wikidata identifier statements created or referenced;[11]
- 124k link candidates uploaded to the Mix'n'match tool[12] for curation;[11]
- 4 pull requests submitted to the soweego code repository, under the Wikidata GitHub organization.[13]
Resources
Hjfocs will work closely with Tupini07, and supervise his MSc thesis at the University of Trento (Q930528), together with Prof. Passerini.[14] We will not receive any additional support.
The whole budget is allocated to the implementation efforts.
References
- ↑ Grants:Project/Hjfocs/soweego
- ↑ en:Machine_learning
- ↑ https://jair.org/index.php/jair/article/view/10239/24370
- ↑ http://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/csm06.pdf
- ↑ https://www.researchgate.net/profile/Lior_Rokach/publication/220637823_Ensemble-based_classifiers/links/55bf427008aed621de122c52/Ensemble-based-classifiers.pdf
- ↑ en:Ensemble_learning
- ↑ https://soweego.readthedocs.io/en/latest/linker.html
- ↑ en:Cross-validation_(statistics)
- ↑ en:Record_linkage
- ↑ en:Decision_tree_learning
- ↑ a b This is an upper-bound estimate based on soweego version 1 output
- ↑ https://tools.wmflabs.org/mix-n-match/
- ↑ https://github.com/Wikidata/soweego
- ↑ https://disi.unitn.it/~passerini/
Endorsements
- Support Can't wait to see it in action! Sannita - not just another it.wiki sysop 18:22, 14 July 2019 (UTC)
- Strong support (disclaimer: I contributed to the development of soweego 1) I strongly endorse this proposal, because I see it as the natural next step for soweego. We implemented several algorithms, picked the one that performed best, but had to put the others aside. An ensemble would definitely smooth the cons of each algorithm, thus providing the strongest results. MaxFrax96 (talk) 12:15, 15 July 2019 (UTC)
- Sounds promising. Jonathan Groß (talk) 17:21, 16 July 2019 (UTC)
- Support --Jaqen (talk) 16:54, 18 July 2019 (UTC)
- Support Looking forward to it! - User:kippelboy
- This looks promising. StudiesWorld (talk) 13:21, 21 July 2019 (UTC)
- sounds promising Wikipeter-HH (talk) 07:40, 26 July 2019 (UTC)
- Support Jmmuguerza (talk) 14:11, 27 July 2019 (UTC)
- Support This will be a big helper for the Wikimedia 2030 goals. --Sebastian Wallroth (talk) 05:15, 28 July 2019 (UTC)
- Support --Floscher (talk) 10:15, 29 July 2019 (UTC)
- Support We need this tool to improve the quality of data in wikidata - let machines do the boring and time wasting things... A ka es (talk) 11:44, 31 July 2019 (UTC)
- Support StultuS (talk) 07:08, 2 August 2019 (UTC)
- Support Uomovariabile (talk to me) 07:43, 8 August 2019 (UTC)
- Sounds cool. Amitbjadhav (talk) 04:01, 9 August 2019 (UTC)
- This project looks interesting. I would like to see more ★ → Airon 90 07:32, 9 August 2019 (UTC)
- Weak oppose The bot is a nice idea but unfortunately I am reluctant to support it due to the irresponsible way it is being run on the wikidata. There has been several issues with its functionality in the past (see d:User_talk:Soweego_bot) and in this occasion the owner did not even revert the problematic edits. The bot is still adding the existing discogs IDs in the database to other items! -- Meisam (talk) 08:45, 9 August 2019 (UTC)
- Support The documentation of the previous project at Grants:Project/Hjfocs/soweego/Final seems great and I appreciate the intent to have a small project to sort thoughts and next steps instead of jumping into another big project. With all the documentation and conversation that happened around the previous proposal, and with this only being US$2000 to continue an ongoing institutional affiliation with a university partner in data science, and because this project is securing student labor and likely to produce results which interest a broad international audience, it is easy for me to support this. I understand that data import and sorting projects will have some bumps and controversy but the solution to that is more documentation and conversation, and not ending a highly transparent project which is producing a lot of documentation. I am not immediately aware of any major insurmountable problems with this project. I am not so familiar with this project, but on its face, it is presented in the way that I wish for any project. Blue Rasberry (talk) 14:24, 13 August 2019 (UTC)
- I want to support a better Mix'n'match Vidariv (talk) 08:08, 16 August 2019 (UTC)
- Support --Sabas88 (talk) 11:35, 17 August 2019 (UTC)
- Support -- Maxlath (talk) 09:20, 19 August 2019 (UTC)
- Support Adding more links to other projects makes Wikidata more important as a Linked Data hub, so it is great to see efforts in this direction - Sohmen (talk) 10:11, 20 August 2019 (UTC)
Participants
- Volunteer In however ways I can. Amitbjadhav (talk) 04:02, 9 August 2019 (UTC)