New Pages Feed: copyvio addition
Closed, Resolved, Public

Description

Parent task for the New Pages Feed work to allow pages in the feed to be filtered by an automated prediction of their likelihood to contain copyright violations ("copyvio").

Allowing pages to be filtered by copyvio prediction is one of three major work items for T193782. Please see that epic or this task's subtasks for more information.

The following are acceptance criteria for this copyvio work:

  • All the criteria below apply equally to the NPP and AfC sides of the New Pages Feed.
  • All pages in the New Pages Feed that are in the Article and Draft spaces should be scanned according to the rules of CopyPatrol. User space pages do not need to be scanned. If CopyPatrol's rules exclude a page from scanning (for instance, because it is too small), that is acceptable.
  • When this feature is put into production, the several thousand legacy pages that are in the New Pages Feed at that time should also show the results of their scans. This is tracked in T203207.
  • When CopyPatrol identifies any diff as having over 50% copyvio (and the diff therefore appears in the CopyPatrol interface), the page of that diff should have an indicator in the New Pages Feed that says "Copyvio". This applies whether the diff is the initial revision of the page or a subsequent edit.
  • That "Copyvio" indicator should be a bold blue link alongside "Potential issues:" as shown in the image below. If there are other issues found by the ORES draftquality model, those should be listed first and separated with a dot. The implementation shown in the image below is the intended evolution of the original mockup on T202161.
  • That bold blue link should open a new tab with a permanent-link CopyPatrol page that lists all the CopyPatrol results for revisions of that page. The ticket for testing this is T203120.
  • "Copyvio" should be added as a checkbox option under a "Potential issues" heading in the "Set filters" menu of the New Pages Feed. In the checkbox list, "Copyvio" should appear second to last, before "None". It should also be listed in the same way as the other ORES issues in the "Showing:" component of the interface above the "Set filters" menu.
  • The ORES and copyvio "Potential issues" filters should be combined with OR logic. For instance, if a user selects "Vandalism", "Attack", and "Copyvio", the feed should list all pages that have any of those three potential issues.
  • Copyvio checking should populate the New Pages Feed in under a minute. If it takes longer than a minute, we will need to consider a way to indicate which pages have not been checked yet. This is being investigated in T202914.
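
The flagging and filtering criteria above can be sketched roughly as follows. This is only an illustration of the rules as written, not PageTriage or CopyPatrol code: `FeedPage`, `has_copyvio`, `potential_issues`, `matches_filters`, and `COPYVIO_THRESHOLD` are hypothetical names, and the sketch assumes each diff's copyvio score is available as a fraction between 0 and 1. The "None" filter option is not modeled.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set

# Assumption: a diff's copyvio score is a fraction in [0, 1];
# "over 50%" is read as strictly greater than 0.5.
COPYVIO_THRESHOLD = 0.5


@dataclass
class FeedPage:
    """Hypothetical stand-in for a page in the New Pages Feed."""
    title: str
    ores_issues: Set[str] = field(default_factory=set)      # e.g. {"Vandalism"}
    diff_scores: List[float] = field(default_factory=list)  # one score per scanned diff


def has_copyvio(page: FeedPage) -> bool:
    # Any diff over the threshold flags the page, whether it is the
    # initial revision or a later edit.
    return any(score > COPYVIO_THRESHOLD for score in page.diff_scores)


def potential_issues(page: FeedPage) -> Set[str]:
    issues = set(page.ores_issues)
    if has_copyvio(page):
        issues.add("Copyvio")
    return issues


def matches_filters(page: FeedPage, selected: Iterable[str]) -> bool:
    # OR semantics: the page is listed if it has any selected issue.
    return bool(potential_issues(page) & set(selected))


pages = [
    FeedPage("A", ores_issues={"Vandalism"}),
    FeedPage("B", diff_scores=[0.2, 0.8]),
    FeedPage("C", ores_issues={"Spam"}),
]
selected = {"Vandalism", "Attack", "Copyvio"}
print([p.title for p in pages if matches_filters(p, selected)])  # ['A', 'B']
```

Page "C" is left out because its only issue, "Spam", is not among the selected filters.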

Screen Shot 2018-08-29 at 18.56.58.png (130×741 px, 36 KB)
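
As a rough illustration of the indicator text described in the criteria (ORES issues first, then "Copyvio", separated with a dot), a small sketch follows. The function name `format_potential_issues` is hypothetical, the middle dot "·" is an assumed rendering of the separator, and returning an empty string when there are no issues is also an assumption; link styling is not modeled.

```python
from typing import Sequence


def format_potential_issues(ores_issues: Sequence[str], copyvio: bool) -> str:
    # ORES draftquality issues are listed first; "Copyvio" comes last.
    labels = list(ores_issues)
    if copyvio:
        labels.append("Copyvio")
    if not labels:
        return ""  # assumption: nothing is shown when no issue was found
    # Assumption: the "dot" separator is rendered as a middle dot.
    return "Potential issues: " + " · ".join(labels)


print(format_potential_issues(["Vandalism"], True))
```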

Related Objects

Status  Subtype  Assigned  Task
Resolved  MMiller_WMF
Resolved  MMiller_WMF
Resolved  MMiller_WMF
Resolved  nettrom_WMF
Resolved  Jul 17 2018  SBisson
Resolved  SBisson
Declined  None
Resolved  SBisson
Resolved  SBisson
Declined  None
Resolved  SBisson
Resolved  Etonkovidova
Resolved  SBisson
Resolved  eranroz
Resolved  SBisson
Declined  None
Declined  None
Declined  None
Resolved  SBisson
Resolved  kostajh
Resolved  MMiller_WMF
Declined  MMiller_WMF
Resolved  Catrope
Declined  None
Resolved  Catrope
Resolved  kostajh
Resolved  Catrope
Resolved  kaldari
Resolved  kostajh

Event Timeline

I've created a bunch of subtasks for the development work and tagged this one as an Epic.

I'm not sure what to do with it now so I'm putting it in @MMiller_WMF's column.

How is the "copyright" status of submissions being validated? From a much earlier review, it appeared that this was only looking for "unoriginal" text, not text that actually violates a copyright, as its labeling suggests.

@Xaosflux -- I tried to answer here on wiki. Hopefully this helps!

To follow up from project talks, I suggest the labeling for this be updated from "copyvio" to "copydetect" or the like, especially as we are relying on a third party. Our labeling should be neutral (this text appears to have been copied from somewhere else) as opposed to making a legal claim (this text has violated a copyright law).

Aklapper changed the edit policy from "Custom Policy" to "All Users". Sep 17 2018, 5:44 PM
Aklapper changed Risk Rating from N/A to default.

@Xaosflux - In most contexts it says "Potential issues: Copyvio" or "Potential copyright violation", but it looks like the Log page title should be changed, as it says "Copyvio event log". It should be changed to something like "Possible copyright violation log", otherwise people are going to be upset. Pinging @MMiller_WMF and @kostajh.

Or "Potential" or "Suspected" or something. Not sure what exact verbiage is best :P

Change 464032 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/PageTriage@master] Align copyvio log terminology

https://gerrit.wikimedia.org/r/464032

Change 464032 merged by jenkins-bot:
[mediawiki/extensions/PageTriage@master] Align copyvio log terminology

https://gerrit.wikimedia.org/r/464032

Change 464047 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/PageTriage@wmf/1.32.0-wmf.24] Align copyvio log terminology

https://gerrit.wikimedia.org/r/464047

I could be the only one hung up on this; I just think it shouldn't even claim to be determining a legal status (violation of a copyright), just an informational status (possibly copied text) or the like.

Change 464047 merged by jenkins-bot:
[mediawiki/extensions/PageTriage@wmf/1.32.0-wmf.24] Align copyvio log terminology

https://gerrit.wikimedia.org/r/464047

Mentioned in SAL (#wikimedia-operations) [2018-10-02T23:54:05Z] <ladsgroup@deploy1001> Synchronized php-1.32.0-wmf.24/extensions/PageTriage/i18n/en.json: SWAT: [[gerrit:464047|Align copyvio log terminology (T199359)]] (duration: 00m 56s)

@Xaosflux - That's an understandable concern. We discussed it a bit and feel like changing it to "copied text" (or something similar) would unfortunately be confusing. Often when people talk about "copied text" problems on Wikipedia they're talking about copy-paste moves or similar problems, while text plagiarized from outside sources is most often referred to as "copyvio" or "copyright violations" (e.g. the Copyright violations policy, the Copyvio-revdel template, etc.). I don't think there is much legal risk from flagging something as a "potential copyright violation" in good faith. Even outright incorrectly calling something a "copyright violation", if done in good faith, is legally OK in the US at least (IANAL). We're going to go ahead with the copyright violation wording, as we think it is the easiest to understand, but we'll be looking out for further feedback from the community on it. Hope that sounds reasonable.

MMiller_WMF claimed this task.

After a successful bot trial, we released this to production on 2018-10-29. See the project update here.