Wiki-research-l December 2014

39 participants
19 discussions

Wikipedia Research policy
by song＠cs.umn.edu 14 Jul '23

Pursuant to prior discussions about the need for a research policy on Wikipedia, WikiProject Research is drafting a policy regarding the recruitment of Wikipedia users to participate in studies.

At this time, we have a proposed policy, and an accompanying group that would facilitate recruitment of subjects in much the same way that the Bot Approvals Group approves bots.

The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research

The Subject Recruitment Approvals Group mentioned in the proposal is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group

Before we move forward with seeking approval from the Wikipedia community, we would like additional input about the proposal, and would welcome additional help improving it.

Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research

--
Bryan Song
GroupLens Research
University of Minnesota

Looking for reader's click log data for Wikipedia
by Ditty Mathew 29 Dec '14

Hi,

Is readers' click log data (containing user ID/IP, article title, and timestamp) available for Wikipedia?

with regards
Ditty

Re: [Wiki-research-l] Looking for reader's click log data for Wikipedia
by Finn Aarup Nielsen 29 Dec '14

How would you anonymize the data? This is very difficult. If a user is pseudonymized with a random identifier, it is not difficult to re-identify the user by triangulation. This is particularly the case if the user is a Wikipedian: the user will often read his/her own user talk page and the pages s/he edits.

Readings:
https://en.wikipedia.org/wiki/AOL_search_data_leak
https://en.wikipedia.org/wiki/Differential_privacy#Netflix_Prize

best regards
Finn Årup Nielsen

On 29-12-2014 at 04:53, Ditty Mathew wrote:
> The exact user information is not needed. The anonymized data is enough.
> What exactly we need is the navigation path of Wikipedia readers.
>
> with regards
>
> Ditty
>
> On Sun, Dec 28, 2014 at 9:46 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
>
>     Afraid not. First, we do not have some of those datapoints; we do
>     not currently have unique user IDs. And, second, it would be a
>     tremendous ethical violation for us to release the data that we
>     /do/ have (IP addresses, for example).
>
>     On 28 December 2014 at 21:00, Ditty Mathew <dittyvkm(a)gmail.com> wrote:
>
>         Hi,
>
>         Is readers' click log data (containing user ID/IP,
>         article title, and timestamp) available for Wikipedia?
>
>         with regards
>
>         Ditty
>
>     --
>     Oliver Keyes
>     Research Analyst
>     Wikimedia Foundation
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
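The re-identification concern raised above can be illustrated with a small sketch. It assumes a hypothetical click log of (pseudonymous ID, page title) pairs; the log, the IDs, the user name and the helper `guess_identities` are all invented for illustration and do not correspond to any real Wikimedia dataset or tool. The idea is only that a random identifier stops protecting a reader once that reader repeatedly visits a page that names them, such as their own user talk page.

```python
from collections import Counter, defaultdict


def guess_identities(click_log, min_visits=2):
    """Map pseudonymous IDs to a likely user name (toy heuristic).

    click_log is an iterable of (pseudo_id, page_title) pairs.
    A pseudo_id is linked to a user name when it visits the same
    "User_talk:<name>" page at least min_visits times.
    """
    talk_visits = defaultdict(Counter)
    for pseudo_id, title in click_log:
        if title.startswith("User_talk:"):
            user_name = title.split(":", 1)[1]
            talk_visits[pseudo_id][user_name] += 1

    guesses = {}
    for pseudo_id, counts in talk_visits.items():
        user_name, visits = counts.most_common(1)[0]
        if visits >= min_visits:
            guesses[pseudo_id] = user_name
    return guesses


# Entirely made-up log: IDs are random tokens, no real users involved.
toy_log = [
    ("a91f", "Differential_privacy"),
    ("a91f", "User_talk:ExampleEditor"),
    ("77c2", "Minnesota"),
    ("a91f", "User_talk:ExampleEditor"),
    ("a91f", "AOL_search_data_leak"),
    ("a91f", "User_talk:ExampleEditor"),
    ("77c2", "Wikipedia"),
]

guesses = guess_identities(toy_log)
print(guesses)
```

A real attack would combine several weak signals (edited pages, timing, rare topics) rather than one page prefix, which is why replacing IP addresses with random identifiers is unlikely to be enough on its own.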

Re: [Wiki-research-l] Looking for reader's click log data for Wikipedia (Ditty Mathew)
by Srijan Kumar 29 Dec '14

Dear Ditty,

If this is of any help, researchers at Stanford have studied the article navigation behavior of users when the article being searched for is known. The data and relevant publications can be found here:
http://snap.stanford.edu/data/wikispeedia.html

Cheers
Srijan

> The exact user information is not needed. The anonymized data is enough.
> What exactly we need is the navigation path of Wikipedia readers.
>
> with regards
>
> Ditty

Article feedback corpus released
by Dario Taraborelli 25 Dec '14

I’m glad to announce the release of an open-licensed corpus with 1.5M records from the Article Feedback v5 pilot.

http://dx.doi.org/10.6084/m9.figshare.1277784

Thanks to everyone who helped make this happen, Fabrice in particular for shepherding this through.

Dario

--

This dataset contains the entire corpus of feedback submitted on the English, French and German Wikipedia during the Article Feedback v.5 pilot (AFT). [1]

The Wikimedia Foundation ran the Article Feedback pilot for a year between March 2013 and March 2014. During the pilot, 1,549,842 feedback messages were collected across the three languages.

All feedback messages and their metadata (as described in this schema [2]) are available in this dataset, with the exception of messages that have been oversighted and/or deleted by the end of the pilot.

The corpus is released [3] under the following license:
• CC BY SA 3.0 for feedback messages
• CC0 for the associated metadata

Results from the pilot are discussed in: Halfaker, A., Keyes, O. and Taraborelli, D (2013). Making peripheral participation legitimate: Reader engagement experiments in Wikipedia. CSCW ’13 Proceedings of the 2013 Conference on Computer Supported Cooperative Work [4][5]

[1] https://www.mediawiki.org/wiki/Article_feedback/Version_5
[2] https://www.mediawiki.org/wiki/Article_feedback/Version_5/Technical_Design_…
[3] https://wikimediafoundation.org/wiki/Feedback_data#Article_Feedback
[4] http://dx.doi.org/10.1145/2441776.2441872
[5] http://nitens.org/docs/cscw13.pdf

Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote) (mjn)
by Mathieu ONeil 22 Dec '14

Hi,

On the question of the location of disputes, I wrote a blog post a few years ago:

"Auray et al. identify several factors which contribute to conflictuality, such as the number of participants, the location of disputes, and the identity choices of participants. The larger the number of contributors, the more likely discussion is; the threshold number seems to be eight. When there are more than ten participants, discussion increasingly moves to the talk pages of users, and is more likely to degenerate into insults. References to policy pages are a surefire indicator of fights. These can be statistically measured: research by Kriplean and Beschastnikh has shown that pages with more than 250 posts had 51% of the links towards policy pages. There are two main types of articles where conflicts erupt: first, the usual suspects are topics with burning current affairs value involving inter-ethnic or inter-faith conflicts; second, “scientific” categories with low academic legitimacy such as homeopathy and chiropractic are strong conflict zones. Suspected “sock-puppetry” (fake identity) is also a source of conflict; an attenuated version of this is the lack of regard for people who have not registered on the site and instead just use an IP address: more than half of the text inserted by “IPs” is deleted, and they are more likely to be present in semi-protected articles, which is where disputes and insults typically occur. IPs are also more likely to insult others, so there are suspicions that IPs are registered users who use “socks” to engage in insulting behaviour which they would not dare to engage in under their registered identities."
http://blog.p2pfoundation.net/wikipedia-and-conflict/2009/07/07

cheers
Mathieu

________________________________________
From: wiki-research-l-bounces(a)lists.wikimedia.org on behalf of wiki-research-l-request(a)lists.wikimedia.org
Sent: Tuesday, December 16, 2014 23:01
To: wiki-research-l(a)lists.wikimedia.org
Subject: Wiki-research-l Digest, Vol 112, Issue 24

Today's Topics:

1. Re: commentary on Wikipedia's community behaviour (Aaron gets a quote) (mjn)

----------------------------------------------------------------------

Message: 1
Date: Tue, 16 Dec 2014 05:28:30 +0100
From: mjn <mjn(a)anadrome.org>
To: Research into Wikimedia content and communities <wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote)
Message-ID: <87k31si55a.fsf(a)mjn.anadrome.org>
Content-Type: text/plain; charset=utf-8

Perhaps it depends on what part of the encyclopedia? Has anyone attempted to characterize how the editing environment varies with different subject matter?

I often run across descriptions that don't comport with either my experience, or that of people I've interviewed, but it's hard to tell precisely why. I've encountered quite different beliefs about what the en.wikipedia community is like, even among people who to me seem to otherwise have a similar background.

Entirely anecdotally, areas of interest seem to be one correlated factor. For example, writing an article on an archaeological site (one thing I've mentored new editors in doing) is by and large trouble-free and friendly, in my experience. But some other areas are not. I haven't attempted to characterize that factor in any detail.

-Mark

WereSpielChequers <werespielchequers(a)gmail.com> writes:

> We have problems, I don't dispute that. But "ugly and bitter as 4chan"? That has to be an exaggeration.
>
> Regards
>
> Jonathan Cardy
>
>> On 13 Dec 2014, at 01:03, Andrew Lih <andrew.lih(a)gmail.com> wrote:
>>
>> I certainly hope you're right Sydney. What a horrible mess.
>>
>>> On Fri, Dec 12, 2014 at 5:53 PM, Sydney Poore <sydney.poore(a)gmail.com> wrote:
>>> I think feminists, especially those who take an interest in STEM, will pass this article around.
>>>
>>> Sydney
>>>
>>>> On Dec 12, 2014 5:35 PM, "Andrew Lih" <andrew.lih(a)gmail.com> wrote:
>>>> It's a good piece, but honestly I think only the dedicated tech reader will make it through the entire story. There's a lot of jargon and insider intrigue such that I could imagine most people never making it past the typewriter barf of "BLP, AGF, NOR" :)
>>>>
>>>>> On Fri, Dec 12, 2014 at 5:26 PM, Dariusz Jemielniak <darekj(a)alk.edu.pl> wrote:
>>>>> While I agree that the article is overly negative (likely because of the individual experience), I think it still points to an important problem. I don't perceive this article as really problematic in terms of image. Maybe naively, I imagine that people will not stop donating because the community is not ideal.
>>>>>
>>>>> pundit
>>>>>
>>>>>> On Fri, Dec 12, 2014 at 11:16 PM, Kerry Raymond <kerry.raymond(a)gmail.com> wrote:
>>>>>> There’s a saying that everyone likes to eat sausages but nobody likes to know how they are made. It is not good to have negative publicity like that during the annual donation campaign (irrespective of the motivations of the journalist and/or the rights/wrongs of the issue being reported, neither of which I intend to debate here). As a donation-funded organisation, public perception matters a lot.
>>>>>>
>>>>>> Kerry
>>>>>>
>>>>>> From: Jonathan Morgan
>>>>>> Sent: Saturday, 13 December 2014 6:43 AM
>>>>>> To: Research into Wikimedia content and communities
>>>>>> Cc: Kerry Raymond
>>>>>> Subject: Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote)
>>>>>>
>>>>>> I mostly agree. On one hand, it's always nice to see a detailed description of how wiki-sausage gets made in a major venue. On the other, this journalist clearly has a personal axe to grind, and used his bully pulpit to grind it in public.
>>>>>>
>>>>>> - J
>>>>>>
>>>>>> On Fri, Dec 12, 2014 at 1:39 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
>>>>>>
>>>>>> 1000th addition to the inconsequential rant genre.
>>>>>>
>>>>>> Nemo
>>>>>>
>>>>>> --
>>>>>> Jonathan T. Morgan
>>>>>> Community Research Lead
>>>>>> Wikimedia Foundation
>>>>>> User:Jmorgan (WMF)
>>>>>> jmorgan(a)wikimedia.org
>>>>>
>>>>> --
>>>>> prof. dr hab. Dariusz Jemielniak
>>>>> head of the Department of International Management
>>>>> and of the CROW research center
>>>>> Akademia Leona Koźmińskiego
>>>>> http://www.crow.alk.edu.pl
>>>>>
>>>>> member of the Polish Young Academy, Polish Academy of Sciences
>>>>> member of the Committee for Science Policy, MNiSW
>>>>>
>>>>> The world's first ethnography of Wikipedia, "Common Knowledge? An Ethnography of Wikipedia" (2014, Stanford University Press), written by me, is out: http://www.sup.org/book.cgi?id=24010
>>>>>
>>>>> Reviews
>>>>> Forbes: http://www.forbes.com/fdc/welcome_mjx.shtml
>>>>> Pacific Standard: http://www.psmag.com/navigation/books-and-culture/killed-wikipedia-93777/
>>>>> Motherboard: http://motherboard.vice.com/read/an-ethnography-of-wikipedia
>>>>> The Wikipedian: http://thewikipedian.net/2014/10/10/dariusz-jemielniak-common-knowledge

-- Sent with my mu4e

Wikimedia Research Showcase -- Thurs. Dec 18th: mobile readership; disease monitoring with Wikipedia
by Dario Taraborelli 19 Dec '14

This month’s Research showcase will be held tomorrow, Thursday, Dec. 18th at 3PM PST (2300 UTC).

As usual, the event will be recorded and publicly streamed on YouTube (link <https://www.youtube.com/watch?v=xPO8XhmeUAU>). We’ll hold a discussion and take questions from the Wikimedia Research IRC channel (#wikimedia-research <http://webchat.freenode.net/?channels=wikimedia-research> on freenode).

Looking forward to seeing you there.

Dario

--

This month:

Mobile Madness: The Changing Face of Wikimedia Readers
By Oliver Keyes <https://www.mediawiki.org/wiki/User:Ironholds>

A dive into the data we have around readership that investigates the rising popularity of the mobile web, countries and projects that are racing ahead of the pack, and what changes in user behaviour we can expect to see as mobile grows.

Global Disease Monitoring and Forecasting with Wikipedia
By Reid Priedhorsky <http://www.lanl.gov/expertise/profiles/view/reid-priedhorsky> (Los Alamos National Laboratory)

Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness.

We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible.

Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
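To make the abstract's mention of linear models and forecasting concrete, here is a minimal sketch of a lagged one-predictor regression with its r². It is not the authors' code or model: the data are synthetic (a sine wave plus noise, built so that "views" lead "cases" by 7 days), and the helpers `fit_line` and `fit_with_lag` are names invented for this illustration.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns r^2 too."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def fit_with_lag(views, cases, lag):
    """Regress cases on day t + lag against article views on day t."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    x = views[: len(views) - lag]
    y = cases[lag:]
    return fit_line(x, y)


# Synthetic data only: views on day t track cases on day t + 7, plus noise.
rng = random.Random(0)
signal = [50 + 30 * math.sin(2 * math.pi * t / 60) for t in range(127)]
cases = signal[:120]
views = [2 * signal[t + 7] + rng.gauss(0, 3) for t in range(120)]

_, _, r2_lag0 = fit_with_lag(views, cases, lag=0)
_, _, r2_lag7 = fit_with_lag(views, cases, lag=7)
print(f"r^2 at lag 0: {r2_lag0:.2f}, r^2 at lag 7: {r2_lag7:.2f}")
```

On this toy series the fit at the 7-day lag is better than the same-day fit, which is the sense in which access logs could carry forecasting value. Real counts would need article selection and several predictors, as the abstract describes.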

commentary on Wikipedia's community behaviour (Aaron gets a quote)
by Kerry Raymond 16 Dec '14

http://www.slate.com/articles/technology/bitwise/2014/12/wikipedia_editing_disputes_the_crowdsourced_encyclopedia_has_become_a_rancorous.single.html

This is the predicted fallout of the recent ArbCom case in relation to civility (although there's a rather longer and more tortuous history to it).

Kerry

Editor sessions and related metrics
by Oliver Keyes 16 Dec '14

Hey all,

Not sure if this would be interesting to researchers or community members, but: you might remember a paper Stuart and Aaron did a while ago about measuring edit sessions - http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Mea…

To me it's really interesting, because it's (as much as anything else) a new metric for measuring participation, and a metric we can extract additional metrics from (e.g., session length).

As part of some related work on /reader/ sessions, I wrote a pile of code to handle session reconstruction. I've generalised it (it doesn't care if you've got reader timestamps, editor timestamps, or Best Buy receipt timestamps) and thrown it up at https://github.com/Ironholds/reconstructr . I figure it could be useful to any researchers or community members looking into sessions.

Thanks,

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
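For readers unfamiliar with session reconstruction, the general idea can be sketched in a few lines: sort one user's timestamps and start a new session whenever the gap to the previous event exceeds an inactivity cutoff. This is an independent illustration, not the reconstructr package's API; the function names and the one-hour cutoff are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def reconstruct_sessions(timestamps, cutoff=timedelta(hours=1)):
    """Group one user's event timestamps into sessions.

    A new session starts whenever the gap since the previous event
    is larger than `cutoff` (an assumed inactivity threshold).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def session_length(session):
    """Time between the first and last event of a session."""
    return session[-1] - session[0]


# Made-up timestamps for a single user (edits, page views, or anything else).
events = [
    datetime(2014, 12, 16, 10, 0),
    datetime(2014, 12, 16, 13, 5),
    datetime(2014, 12, 16, 10, 20),
    datetime(2014, 12, 16, 10, 50),
    datetime(2014, 12, 16, 13, 0),
]

sessions = reconstruct_sessions(events)
lengths = [session_length(s) for s in sessions]
print(len(sessions), lengths)
```

Session length is then one of the derived metrics mentioned above; a single-event session has length zero under this definition, so analyses often treat those separately.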

How to track all the diffs in real time?
by Maximilian Klein 16 Dec '14

Hello Researchers,

I've been playing with the Recent Changes Stream Interface <https://wikitech.wikimedia.org/wiki/RCStream> recently, and have started trying to use the API's "action=compare" to look at every diff of every wiki in real time. The goal is to produce real-time analytics on the content that's being added or deleted.

The only problem is that it will really hammer the API with lots of reads, since it doesn't have a batch interface. Can I spawn multiple network threads and do 10+ reads per second forever without the API complaining? Can I warn someone about this and get a special exemption for research purposes?

The other thing to do would be to use "action=query" to get the revisions in batches and do the diffing myself, but then I'm not guaranteed to be diffing in the same way that the site is.

What techniques would you recommend?

Make a great day,
Max Klein ‽ http://notconfusing.com/
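The two options described above can be sketched as follows. The snippet only builds request URLs and computes a local line diff, so it makes no network calls. The helper names are invented for this example, the endpoint is assumed to be English Wikipedia's api.php, and the parameters shown (`fromrev`/`torev` for `action=compare`; `revids`, `rvprop` and `rvslots` for `action=query`) reflect the current MediaWiki Action API rather than anything stated in the thread. The local diff uses Python's difflib, so, as noted above, its output will not necessarily match the site's own diff.

```python
import difflib
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example


def compare_url(from_rev, to_rev, api=API):
    """URL for one server-side diff between two revision IDs."""
    params = {
        "action": "compare",
        "fromrev": from_rev,
        "torev": to_rev,
        "format": "json",
    }
    return api + "?" + urlencode(params)


def batch_revisions_url(rev_ids, api=API):
    """URL fetching the content of several revisions in one request."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(r) for r in rev_ids),
        "rvprop": "ids|content",
        "rvslots": "main",
        "format": "json",
    }
    return api + "?" + urlencode(params)


def local_diff(old_text, new_text):
    """Return (added_lines, removed_lines) using a line-based diff."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


added, removed = local_diff("Intro\nOld sentence.", "Intro\nNew sentence.")
print(compare_url(100, 101))
print(batch_revisions_url([100, 101, 102]))
print(added, removed)
```

Batching trades fidelity for volume: one query request can replace many compare requests, at the cost of diffing locally with a different algorithm than the site uses.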
