Wikipedia:Bots/Requests for approval/EranBot 3


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

EranBot 3
Operator:

Time filed: 16:07, Saturday, September 15, 2018 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python

Source code available: source on github

Function overview: This bot submits newly added text to the iThenticate API, which determines whether other sources are similar to it. Suspected copyvios (>50% similarity) can then be reviewed manually (copypatrol; top reviewers: Diannaa, Sphilbrick, L3X1). In this BRFA I would like to request that the bot be added to the copyviobot group, so it can access the pagetriagetagcopyvio API, which will be used by the PageCuration extension aka Special:NewPagesFeed (see phab tasks).
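For illustration, a minimal sketch of what the bot's new API call could look like. This is an assumption-laden outline, not the bot's actual code: the endpoint, the `pagetriagetagcopyvio` action name, and its `revid`/`token` parameters are taken from the PageTriage extension as described in this BRFA, and the helper function name is hypothetical.

```python
# Hypothetical sketch: build the POST payload the bot would send to the
# MediaWiki API to tag a revision as a suspected copyvio.
# Assumptions: action=pagetriagetagcopyvio with revid + CSRF token,
# per the PageTriage extension referenced in this BRFA.

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint

def build_tagcopyvio_request(revid: int, csrf_token: str) -> dict:
    """Build the POST payload for action=pagetriagetagcopyvio."""
    return {
        "action": "pagetriagetagcopyvio",
        "format": "json",
        "revid": revid,       # revision flagged by iThenticate as suspected copyvio
        "token": csrf_token,  # CSRF token obtained via meta=tokens
    }
```

The bot would POST this payload (with an authenticated session) after iThenticate reports a high-similarity match, which is what makes the flag visible in Special:NewPagesFeed.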

Links to relevant discussions (where appropriate): prev BRFA, copyviobot, Epic task for copyvio in new pages feed (and subtasks)

 Edit period(s): Continuous

Estimated number of pages affected: N/A. The bot will tag suspected edits via the API. These may be surfaced in the special page Special:NewPagesFeed.

Namespace(s): main namespace and drafts (the bot does not edit them, but may check them for copied text)

Exclusion compliant (Yes/No): N/A

Function details: Eran (talk) 16:07, 15 September 2018 (UTC)
 * any diff (except rollbacks) in the main and draft namespaces which adds a large chunk of text may be subject to a copyvio check
 * The copyvio check is done using the iThenticate service (WP:Turnitin, who kindly provided us access to their service)
 * Changes that are similar to existing text in an external source are reported (and can be reviewed at https://tools.wmflabs.org/copypatrol/en ) so users can further review them manually.
 * (new) By adding the bot to the copyviobot group, it will be possible to access suspected diffs more easily from Special:NewPagesFeed later.

Discussion
48% of the edits reported as suspected copyvios required additional follow-up ("page fixed"), per the data in tools.labsdb.

The full details of how this will be shown in Special:NewPagesFeed would probably need to be discussed with the community and with the Growth team (MMiller, Roan Kattouw) - however, it is already possible to see an example in the beta test wiki (search for "copyvio"). It is important to note that a tagged page just means an edit may contain copied text; such edits may be OK (CC-BY content from government institutions), copyright violations (copy & paste from a commercial news service), or promotional content (which may be legally OK sometimes, but violates WP:Promo). Eran (talk) 16:07, 15 September 2018 (UTC)
 * It isn't clear to me how this fits in with the CopyPatrol activities. I'd like to discuss this further. Please let me know if this is a good place to have that discussion or if I should open a discussion on your talk page or elsewhere.-- S Philbrick (Talk)  18:03, 15 September 2018 (UTC)
 * Sphilbrick: I think it is relevant to this discussion; can you please elaborate? thanks, Eran (talk) 19:30, 15 September 2018 (UTC)
 * I start with a bit of a handicap. While I understand the new pages feed in a very broad sense, I haven't actually worked with it in years and even then had little involvement.
 * It appears to me that the goal is to give editors who work in the new pages feed a heads-up that there might be a copyvio issue. I've taken a glance at the beta test wiki — I see a few examples related to copyvios. I see that those entries have a link to CopyPatrol. Does this mean that the new pages feed will not be directly testing for copyright issues but will be leaning on the CopyPatrol feed? I checked the links to CopyPatrol and found nothing in each case, which may make sense because those contrived examples aren't really in that report, but I would be interested to know exactly how it works when there is an entry.
 * The timing is coincidental. I was literally working on a draft of a proposal to consider whether the copy patrol tools should be directly making reports to the editors. That's not exactly what's going on here but it's definitely related.
 * What training, if any, is being given to the editors who work on the new pages feed? Many reports are quite straightforward, but there are a few subtleties, and I wonder what steps have been taken to respond to false positives.-- S Philbrick (Talk)  19:57, 15 September 2018 (UTC)
 * CopyPatrol is driven by EranBot with checks done by iThenticate/Turnitin. This BRFA is to send revision IDs with possible violations to the API, which will cause the CopyPatrol links to be shown in the new pages feed. — JJMC89 (T·C) 04:53, 16 September 2018 (UTC)
 * Sphilbrick: thank you for the good points.
 * Regarding training for handling the new pages feed and copyvios - I was about to suggest documenting it, but it is already explained quite well in New pages patrol (though we may want to update it later)
 * Directly making reports to the editors - This is a good idea; it was actually already suggested but never fully defined and implemented - T135301. You are more than welcome to suggest how it should work there (or on my talk page, and I will summarize the discussion on Phabricator).
 * Eran (talk) 18:40, 16 September 2018 (UTC)
 * Thanks for the link to the training material. I have clicked on the link to "school" thinking it would be there, but I now see the material in the tutorial link.


 * Regarding direct contacts, I'm in a discussion with Diannaa who has some good reasons why it may be a bad idea. I intend to follow up with that and see if some of the objections can be addressed. Discussion is here.-- S Philbrick (Talk)  18:54, 16 September 2018 (UTC)
 * thanks for the questions, and I'm sorry it's taken me a few days to respond. It looks like  has summarized the situation pretty well, but I'll also take a stab. One of the biggest challenges with both the NPP and AfC processes is that there are so many pages that need to be reviewed, and there aren't good ways to prioritize which ones to review first. Adding copyvio detection to the New Pages Feed is one of three parts of this project meant to make it easier to find both the best and worst pages to review soonest. Parts 1 and 2 are to add AfC drafts to the New Pages Feed (being deployed this week), and to add ORES scores on predicted issues and predicted class to the feed for both NPP and AfC (being deployed in two weeks). The third part will add an indicator next to any pages that have a revision that shows up in CopyPatrol, and those will say, "Potential issues: Copyvio". Reviewers will then be able to click through to the CopyPatrol page for those revisions, investigate, and address them. The idea is that this way, reviewers will be able to prioritize pages that may have copyvio issues. Here are the full details on this plan.  has brought up questions around using the specific term "copyvio", and I will discuss that with the NPP and AfC communities. Regarding training, yes, I think you are bringing up a good point. The two reviewing communities are good at assembling training material, and I expect that they will modify their material as the New Pages Feed changes. I'll also be continually reminding them about that. Does this help clear things up? -- MMiller (WMF) (talk) 20:32, 20 September 2018 (UTC)
 * Yes, it does, thanks.-- S Philbrick (Talk)  21:37, 20 September 2018 (UTC)


 * User:ערן how will your bot's on-wiki actions be recorded (e.g. will they appear as 'edits', as 'logged actions' (which log?), etc.)? Can you point to an example of where this gets recorded on a test system? —  xaosflux  Talk 00:22, 16 September 2018 (UTC)
 * Xaosflux: On the bot side it is logged to s51306__copyright_p on tools.labsdb, but this is clearly not an accessible place. It is not logged on wiki AFAIK - if we do want to log it, this should be done on the extension side. Eran (talk) 18:40, 16 September 2018 (UTC)
 * T204455 opened for lack of logging. — xaosflux  Talk 18:48, 16 September 2018 (UTC)
 * Thanks, . We're working on this now. -- MMiller (WMF) (talk) 20:33, 20 September 2018 (UTC)


 * I've never commented on a B/RFA before, but I think that another bot doing copyvio detection would be great, especially if it had fewer false positives than the current bot. Thanks,L3X1 ◊distænt write◊  01:12, 16 September 2018 (UTC)
 * L3X1: the Page Curation extension defines infrastructure for copyvio bots - so if there are other bots that can detect copyvios, they may be added to this group later. AFAIK the automated tools for copyvio detection are Earwig's copyvio detector and EranBot/CopyPatrol, and in the past there was also CorenSearchBot. They work in technically different ways (one is based on a general-purpose search using Google search, the other on the Turnitin copyvio service), and they complement each other, with various pros and cons for each. I think EranBot works pretty well (it can be compared to Suspected copyright violations/2016-06-07, for example)
 * As for the false positives - it is possible to define different thresholds, trading fewer false positives against missed true positives. I haven't done a full ROC analysis to tune all the parameters, but the current, somewhat arbitrary criteria actually work pretty well in the middle ground. Eran (talk) 18:40, 16 September 2018 (UTC)
 * Follow up from the BOTN discussion: from what has been reviewed so far, the vendor this bot will get results from can check for "copies" but not necessarily "violations of copyright" (though some copies certainly are also copyvios). As such, I think all labels should be limited to descriptive (e.g. "copy detected"), as opposed to accusatory (humans should determine whether a copyright violation has legally occurred). — xaosflux  Talk 01:30, 16 September 2018 (UTC)
 * That would be part of the new pages feed, which the bot doesn't control. Wikipedia talk:WikiProject Articles for creation/AfC Process Improvement May 2018 or Phabricator would be more appropriate venues for discussing the interface. — JJMC89 (T·C) 04:53, 16 September 2018 (UTC)
 * What I'm looking for is a log of what this bot does control. As this is editor-managed, it's not unreasonable to think another editor may want to run a similar or backup bot in the future. —  xaosflux  Talk 05:14, 16 September 2018 (UTC)
 * Would it be possible to assign a number of bytes to "large chunk of text"? SQL Query me!  02:25, 16 September 2018 (UTC)
 * 500 bytes. — JJMC89 (T·C) 04:53, 16 September 2018 (UTC)
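Putting the two thresholds from this discussion together (a diff must add at least 500 bytes before it is checked, and matches above 50% similarity are reported), the gating logic could be sketched as below. The function names are hypothetical, and reading "500 bytes" as the UTF-8 length of the added text is an assumption; this is an illustration of the criteria, not the bot's actual code.

```python
# Sketch of the two reporting gates mentioned in this BRFA.
# Assumptions: "500 bytes" means UTF-8 byte length of added text,
# and ">50% similarity" is a strict threshold on iThenticate's score.

SIZE_THRESHOLD_BYTES = 500   # minimum added text size before a check is worthwhile
SIMILARITY_THRESHOLD = 50.0  # percent; matches above this are reported

def should_check(added_text: str) -> bool:
    """Gate 1: only large additions are sent to iThenticate."""
    return len(added_text.encode("utf-8")) >= SIZE_THRESHOLD_BYTES

def is_suspected_copyvio(similarity_percent: float) -> bool:
    """Gate 2: only high-similarity matches are reported as suspected copyvios."""
    return similarity_percent > SIMILARITY_THRESHOLD
```

As the thread above notes, moving either threshold trades false positives against missed true positives.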


 * Procedural note: The components for reading changes, sending data to the third party, and making off-wiki reports alone do not require this BRFA; making changes on the English Wikipedia (i.e. submitting new data to our new pages feed, etc.) is all we really need to be reviewing here. Some of this may overlap (e.g. which namespaces, text size, etc.), however there is nothing here blocking the first 3 components alone. —  xaosflux  Talk 18:54, 16 September 2018 (UTC)


 * It looks like T204455 has been closed regarding logging, can you show an example of an action and it making use of this new logging? — xaosflux  Talk 11:02, 2 October 2018 (UTC)
 * xaosflux: I found this was already tested in betawiki. Eran (talk) 18:16, 2 October 2018 (UTC)
 * Thank you, it looks like it leverages MediaWiki:Logentry-pagetriage-copyvio-insert for some of the text if we need to adjust. — xaosflux  Talk 18:33, 2 October 2018 (UTC)


 * any update on verbiage related to T199359 ? — xaosflux  Talk 18:35, 2 October 2018 (UTC)
 * If the verbiage issue is resolved, I was wondering if we could move ahead with a trial for this BRFA. The way PageTriage works, it won't allow bots to post copyvio data to it unless the bot belongs to the "Copyright violation bots" group. So for the trial, you'll need to add EranBot to the group with whatever expiration time you like. It would be good to have at least a couple of days so that we can make sure everything is running properly on our end as well. Ryan Kaldari (WMF) (talk) 17:35, 4 October 2018 (UTC)
 * ping. Ryan Kaldari (WMF) (talk) 17:12, 10 October 2018 (UTC)
 * Thanks, I've solicited any other feedback at Village_pump_(proposals). In the meantime, I'd like to see this demonstrated on testwiki or test2wiki prior to production trials here, so that our human reviewers can submit data easily and see how it responds. Any impediments to this? —  xaosflux  Talk 17:38, 10 October 2018 (UTC)
 * Unfortunately, setting up EranBot to monitor a new wiki isn't trivial and might take a while. You can see what the new logs will look like on Beta Labs. And you can see what sort of stuff EranBot flags by looking at CopyPatrol. What do you think about just doing a one day trial on English Wikipedia and having folks take a look at the results? That way it will be tested against more realistic edits anyway. Ryan Kaldari (WMF) (talk) 00:15, 11 October 2018 (UTC)
 * T206731 created as we do not currently have community control over this access. — xaosflux  Talk 02:14, 11 October 2018 (UTC)
 * Sorry about that. Should be fixed now. Ryan Kaldari (WMF) (talk) 18:32, 11 October 2018 (UTC)
 * Thank you, pending community closure at WP:VPP. As far as a trial goes, any specific day you would like to do the live run? —  xaosflux  Talk 03:54, 14 October 2018 (UTC)
 * Closed the VPP thread as successful. ∯ WBG converse 06:23, 14 October 2018 (UTC)


 * In prepping for a live trial, what day(s) would you like to do this? I want to make sure we send notices to New pages patrol and perhaps a note at MediaWiki:Pagetriage-welcome. —  xaosflux  Talk 13:48, 15 October 2018 (UTC)
 * Xaosflux: Would it work to run with reports between 16 October ~16:00 UTC and 17 October ~16:00 UTC? Eran (talk) 15:23, 15 October 2018 (UTC)
 * That sounds good for us. What do you think Xaosflux? Ryan Kaldari (WMF) (talk) 17:24, 15 October 2018 (UTC)

Trial

 * I've added the cvb flag for the trial and let the NPP/Reviewers know. Do you have a good one line text that could be added to MediaWiki:Pagetriage-welcome to help explain things and point anyone with errors here? —  xaosflux  Talk 18:41, 15 October 2018 (UTC)


 * User:Ryan Kaldari (WMF), I don't actually see an option for using this filter in Special:NewPagesFeed - is it hidden because there are none currently? — xaosflux  Talk 19:30, 15 October 2018 (UTC)
 * I'm not seeing on betalabs either - how is anyone going to actually make use of this? — xaosflux  Talk 19:32, 15 October 2018 (UTC)
 * I was guessing it would show in the filters under "potential issues", but there's nothing there. FWIW, "attack" also has no articles, but is still shown there. I think I might be misunderstanding how this works altogether. Natureium (talk) 19:39, 15 October 2018 (UTC)
 * Just regarding the "attack" filter having no pages: that is behaving correctly. It is very rare that a page gets flagged as "attack", because whole pages meant as attack pages are rare. It's much more common for pages to be flagged as "spam", and you can see some of those in the feed. To see some flagged as attack, you can switch to "Articles for Creation" mode and filter to "All" and "attack". -- MMiller (WMF) (talk) 22:52, 15 October 2018 (UTC)
 * During the trial period, they will need to add "copyvio=1" to the Special:NewPagesFeed URL to see the interface changes. So https://en.wikipedia.org/wiki/Special:NewPagesFeed?copyvio=1. Nothing has been tagged as a potential copyvio yet, so not much to see at the moment. Ryan Kaldari (WMF) (talk) 20:16, 15 October 2018 (UTC)
 * I added a note at Wikipedia talk:New pages patrol/Reviewers with the above info. Ryan Kaldari (WMF) (talk) 20:20, 15 October 2018 (UTC)
 * thank you, I included a link to that in the header for Special:NewPagesFeed to help guide any testers. — xaosflux  Talk 20:53, 15 October 2018 (UTC)
 * thanks for your help here and for adding that. I edited it to link to the project page for this work so that people coming across it have some more context. -- MMiller (WMF) (talk) 22:52, 15 October 2018 (UTC)


 * Looks like testing has begun - please report status of the testing here. — xaosflux  Talk 11:17, 16 October 2018 (UTC)


 * Regarding the scope of this bot, User:ערן / User:Ryan Kaldari (WMF) the function overview calls for this to run against "newly added text", but the trials suggest it is only running against newly added pages - is this limited to new pages? — xaosflux  Talk 13:37, 17 October 2018 (UTC)
 * Xaosflux: the bot runs against any added text and reports suspected edits to CopyPatrol and to PageTriage. PageTriage accepts only edits to pages in the NewPagesFeed (e.g. new pages). Eran (talk) 14:43, 17 October 2018 (UTC)
 * Thanks, I'm not concerned with updates off-wiki (such as to CopyPatrol) for this BRFA, just trying to clarify when activity will actually be made on-wiki. For example with page  xaosflux  Talk 15:07, 17 October 2018 (UTC)
 * Xaosflux: No, it doesn't care about the page creation time (page creation time isn't that meaningful for drafts). However, this is viewable only in Special:NewPagesFeed, which is intended for new pages, but I'm not sure what the definition of a new page is for the feed (User:Ryan Kaldari (WMF), do you know?). Eran (talk) 16:32, 17 October 2018 (UTC)
 * Will these be usable for recent changes or elsewhere, to catch potential copyvios being introduced to 'old' pages? — xaosflux  Talk 16:36, 17 October 2018 (UTC)
 * For now, no, only new pages and drafts. CopyPatrol, however, handles copyvios introduced to old pages. Ryan Kaldari (WMF) (talk) 17:28, 14 November 2018 (UTC)

Handling Wikipedia mirrors
I know very little about the technical aspect of this, so if I need to just pipe down I will, but No it doesn't care for the page creation time (page creation time isn't that meaningful for drafts) touches on one of the main problems that exist with Turnitin-based CopyPatrol. Dozens upon dozens of revisions are flagged as potential CVs even though the diff in question did not add the text that is supposedly a CV; most of the time it seems that if anyone edits a page that has a Wikipedia mirror (or when someone else has wholesale lifted the article), no matter how small the edit, it will be flagged. Most of the 6000-some cases I've closed have been false positives along those lines, and I think it might be of some use to make the software exclude any cases where the editor has more than 50,000 edits. Thanks,L3X1 ◊distænt write◊  02:46, 18 October 2018 (UTC)
 * L3X1, thank you for the comment. I think this is a good comment that should be addressed and discussed in a separate subsection; I hope this is OK.
 * Historically, EranBot detects Wikipedia mirrors (see the example in User:EranBot/Copyright/rc/44; look for Mirror?), where the intention is to also handle copyvio within Wikipedia. That is, if a user copies content from another article, they should give credit in the edit summary (e.g. an example of sufficient credit in a summary: "Copied from Main page").
 * This is a common case, and somewhat different from copying from an external site/book. I think the CopyPatrol interface doesn't show this indication of a mirror (as with other indications of CC-BY). So how should we address it:
 * Does the community want reports on "internal copyvio" (copying within Wikipedia/Wikipedia mirrors) without credit? (If not, this can be disabled, and we will not get any more reports on such edits.)
 * If the community does want reports for "internal copyvio":
 * We can add the hints on the CopyPatrol side (Niharika and MusikAnimal) if this isn't done already. (I think it isn't.)
 * Whether we want distinct labels in NewPagesFeed is up to the community.
 * (based on community input here, this will be tracked technically in T207353)
 * Eran (talk) 05:29, 18 October 2018 (UTC)
 * Though I suspect this will be a minority viewpoint, I don't think detecting copying within Wikipedia is as important as catching copying from external sources. Thanks,L3X1 ◊distænt write◊  16:56, 18 October 2018 (UTC)
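The "credit in the edit summary" convention Eran describes above (e.g. "Copied from Main page") could be used to suppress internal-copy reports when attribution is already present. A minimal sketch follows; the function name and the exact summary phrasings matched are assumptions for illustration, since no fixed wording is mandated:

```python
import re

# Hypothetical sketch: decide whether an edit summary already credits the
# source article of an internal copy, e.g. "Copied from [[Main page]]" or
# "split content from Foo". The patterns here are assumptions, not a spec.
CREDIT_PATTERN = re.compile(r"\b(copied|split|merged)\s+(content\s+)?from\b",
                            re.IGNORECASE)

def has_internal_copy_credit(edit_summary: str) -> bool:
    """Return True if the summary appears to attribute another article."""
    return bool(CREDIT_PATTERN.search(edit_summary))
```

Under this sketch, a diff matching a Wikipedia mirror would only be reported as a potential "internal copyvio" when no such credit is found in the summary.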


 * If these are only page creations, I think this would be useful for finding pages that have been copy-pasted from other articles, because this also requires action. Natureium (talk) 16:59, 18 October 2018 (UTC)
 * Detecting copying within WP is important, because it is so easily fixed--and fixing it, especially if it can be fixed automatically, prevents people who don't understand the priorities from nominating them for deletion unnecessarily. Detecting and not reporting mirrors is pretty much essential, & it should be marked, because beginners tend not to realize.  DGG ( talk ) 05:53, 23 October 2018 (UTC)

Trial results
The trial time has expired, and along with it the access to inject data into the feed. Please summarize the results of your trial and whether you require further trialing for evaluation. — xaosflux  Talk 04:48, 22 October 2018 (UTC)
 * thanks for your help with the trial. We think the trial was successful and that we should be ready to go to production.  Below are the results of the trial (along with one caveat that we intend to work on):
 * Since October 15, the bot flagged 237 pages as potential copyright violations. That is here in the log.  One encouraging note is that many of those pages are now red links that have been deleted, showing the association between being flagged by the bot and being a revision with enough violations that deletion is warranted.
 * Spot checking by our team has shown that almost all pages flagged by the bot in the New Pages Feed are also in CopyPatrol, and vice versa -- as expected.
 * This test was announced at both NPP talk and AfC talk. There were a couple questions, but no negative feedback.  Mostly just confirmation that things were working, and a little bit of positive feedback.
 * The one issue we noticed is in this Phabricator task, in which we noticed from spot-checking that when a page is moved from Main namespace to Draft namespace (or vice versa), the New Pages Feed links to the CopyPatrol page of the new namespace, whereas the CopyPatrol results are still displayed at the original namespace. This, however, is an issue with the PageTriage integration and the way we are displaying results, not with the bot.  We'll address this, but we think the bot is behaving correctly.
 * -- MMiller (WMF) (talk) 20:11, 22 October 2018 (UTC)
 * thank you for the note, I've left notes that this appears ready to go live on those pages as well, if no issues in ~2days looks like we can call this done. — xaosflux  Talk 20:44, 22 October 2018 (UTC)
 * Hi -- it's about two days later now, and I don't see any notes about the bot from the affected communities.  We want to remove the feature flag for the New Pages Feed to incorporate the bot's results tomorrow, putting this feature in production as a whole.  Could you please add the bot back to the copyvio group?  Thank you. -- MMiller (WMF) (talk) 18:39, 24 October 2018 (UTC)

I'll ask another crat to add the flag since I'm approving as Bot Approver. Please let us know when this is a 'standard' option at Special:NewPagesFeed so we can adjust the directions at MediaWiki:Pagetriage-welcome. — xaosflux  Talk 18:45, 24 October 2018 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.