User:Novem Linguae/Essays/Copyvio detectors

This is a summary of enwiki's various copyright violation detector bots and tools.

Earwig copyvio detector

 * https://copyvios.toolforge.org/
 * maintainer: The Earwig, Chlod
 * source code: https://github.com/earwig/copyvios
 * last commit: 2 years ago
 * tech: Python
 * uses the Google search API and the WMF EranBot Turnitin API
 * WMF pays for the Google search API credits
 * no discount from Google (NPerry (WMF) used to work on Wikimedia's partnership with Google; maybe this is worth bringing up?)
 * the Google search API has a hard daily limit (the maximum for any user of this API) of 10,000 queries per day
 * costs US$50 per day
 * makes up to 8 queries per page
 * roughly 2,000 checks per day (not all checks use all 8 queries)
 * has issues with concurrent queries
 * uptime report: https://stats.uptimerobot.com/BN16RUOP5/784331770
 * false positive handling via a community-maintained exclusion list at User:EarwigBot/Copyvios/Exclusions
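
The quota figures above can be sanity-checked with a little arithmetic. The sketch below only restates the numbers from the list (10,000 queries per day, US$50 per day, up to 8 queries per page, roughly 2,000 checks per day). The function names are made up for illustration, and the last calculation assumes the whole daily quota is actually consumed, which the notes above do not confirm.

```python
# Figures copied from the list above.
DAILY_QUERY_LIMIT = 10_000      # hard daily limit on search queries
DAILY_COST_USD = 50             # cost of that quota per day
MAX_QUERIES_PER_PAGE = 8        # a single check uses at most this many queries
OBSERVED_CHECKS_PER_DAY = 2_000 # "roughly 2,000" checks per day


def cost_per_query(daily_cost_usd, daily_limit):
    """Price of one search query if the full quota is bought."""
    return daily_cost_usd / daily_limit


def worst_case_checks(daily_limit, max_queries_per_page):
    """Checks per day if every check used the maximum number of queries."""
    return daily_limit // max_queries_per_page


def implied_average_queries(daily_limit, checks_per_day):
    """Average queries per check, ASSUMING the whole quota is used up."""
    return daily_limit / checks_per_day


if __name__ == "__main__":
    print(cost_per_query(DAILY_COST_USD, DAILY_QUERY_LIMIT))
    print(worst_case_checks(DAILY_QUERY_LIMIT, MAX_QUERIES_PER_PAGE))
    print(implied_average_queries(DAILY_QUERY_LIMIT, OBSERVED_CHECKS_PER_DAY))
```

In other words, the quota covers at least 1,250 checks per day even if every check uses all 8 queries, and 2,000 checks per day fits within the quota only if checks average 5 queries or fewer.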

Detection via Turnitin

Frontend

 * https://copypatrol.wmcloud.org/en
 * maintainer: WMF Community Tech team (most active recent committer: MusikAnimal)
 * source code: https://github.com/wikimedia/CopyPatrol
 * last commit: 3 months ago
 * tech: Symfony (PHP)
 * replaced https://copypatrol.toolforge.org/en
 * is mostly a viewer for an SQL database that the copyright detection bot(s) below write to
 * users can mark pages/revisions as fixed or as requiring no action (however, this information is not reflected on enwiki)
 * there is a "compare" feature in the CopyPatrol interface; clicking on it sends an API query to the Earwig tool above
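
As a rough illustration of what a "compare" query to the Earwig tool might look like, the sketch below builds a request URL without sending it. The endpoint path (`api.json`) and the parameter names (`version`, `action`, `project`, `lang`, `title`, `url`) are assumptions that should be checked against the tool's own API documentation; they are not taken from this page, and CopyPatrol's real request may differ. `build_compare_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint path; verify against the tool's API documentation.
API_BASE = "https://copyvios.toolforge.org/api.json"


def build_compare_url(title, source_url, lang="en", project="wikipedia"):
    """Build (but do not send) a URL comparing one page with one source URL.

    All parameter names below are assumptions, not confirmed by this page.
    """
    params = {
        "version": 1,
        "action": "compare",
        "project": project,
        "lang": lang,
        "title": title,
        "url": source_url,
    }
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_compare_url("Example article", "https://example.org/source"))
```

Building the URL separately from sending it keeps the sketch testable offline and avoids spending any of the tool's limited search quota.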

Backend

 * bot name: CopyPatrolBot
 * BRFA: Bots/Requests for approval/CopyPatrolBot
 * maintainer: JJMC89
 * source code: https://github.com/JJMC89/copypatrol-backend
 * last commit: 2 months ago
 * tech: Python
 * rewrite of EranBot's copyright tasks
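
The split between the backend bot (which writes to an SQL database) and the frontend (which mostly reads it and lets users mark items as fixed or as needing no action) can be sketched with an in-memory SQLite database. The table name, columns, and status values are all hypothetical and are not the real CopyPatrol schema.

```python
import sqlite3

# Hypothetical status values for reviewed reports.
REVIEW_STATUSES = {"fixed", "no_action"}


def make_db():
    """Create an in-memory database with a hypothetical reports table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE reports (
               id INTEGER PRIMARY KEY,
               page_title TEXT NOT NULL,
               rev_id INTEGER NOT NULL,
               source_url TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'open'
           )"""
    )
    return conn


def bot_record(conn, page_title, rev_id, source_url):
    """Backend side: store a possible copyright violation."""
    cur = conn.execute(
        "INSERT INTO reports (page_title, rev_id, source_url) VALUES (?, ?, ?)",
        (page_title, rev_id, source_url),
    )
    conn.commit()
    return cur.lastrowid


def frontend_open_reports(conn):
    """Frontend side: list reports nobody has reviewed yet."""
    return conn.execute(
        "SELECT id, page_title, rev_id FROM reports "
        "WHERE status = 'open' ORDER BY id"
    ).fetchall()


def frontend_review(conn, report_id, status):
    """Frontend side: mark a report as fixed or as requiring no action."""
    if status not in REVIEW_STATUSES:
        raise ValueError(f"unknown status: {status}")
    conn.execute("UPDATE reports SET status = ? WHERE id = ?", (status, report_id))
    conn.commit()
```

Note that in this sketch the review status lives only in the database, which mirrors the point above that review decisions are not reflected on enwiki.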

Frontend (wikimedia-slimapp)

 * https://copypatrol.toolforge.org/en
 * maintainer: WMF Community Tech team (most active recent committer: MusikAnimal)
 * source code: https://github.com/wikimedia/CopyPatrol/tree/569f76e113da307d3810e1333531fcfc8449dbcf
 * last commit: 7 months ago
 * tech: PHP, Twig (wikimedia-slimapp)
 * is mostly a viewer for an SQL database that the copyright detection bot(s) below write to
 * users can mark pages/revisions as fixed or as requiring no action (however, this information is not reflected on enwiki)
 * there is a "compare" feature in the CopyPatrol interface; clicking on it sends an API query to the Earwig tool above

Backend (EranBot)

 * bot name: EranBot
 * BRFA: Bots/Requests for approval/EranBot 3
 * maintainer: ערן
 * also involved: Doc James, Ocaasi
 * source code: https://github.com/valhallasw/plagiabot
 * last commit: 2 years ago
 * tech: Python, Pywikibot
 * writes to an SQL database that CopyPatrol uses
 * uses the PageTriage API to mark pages/revisions as probable copyright violations
 * writes this to the pagetriage-copyvio log: https://en.wikipedia.org/wiki/Special:Log?type=pagetriage-copyvio
 * displays the tag in Special:NewPagesFeed and the Page Curation toolbar's info flyout
 * needs a dedicated user right to use this API (not assignable by admins; probably needs a bureaucrat to grant it)
 * uses Turnitin's iThenticate API
 * does WMF pay for it, or is it comped?
 * what are the daily limits?
 * do we hit these limits?
 * false positive handling via a community-maintained exclusion list at meta:User:EranBot/Copyright/Blacklist
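
Both tools handle false positives through an exclusion list. A minimal sketch of how matched source URLs could be filtered against such a list follows. It assumes each entry is a plain host or host-plus-path prefix; the real lists may use a different format (for example regular expressions), and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def _host_and_path(url):
    """Split a URL or bare 'host/path' entry into (host, path)."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host, parts.path.rstrip("/")


def is_excluded(url, exclusions):
    """True if url falls under any exclusion entry (host or path prefix)."""
    host, path = _host_and_path(url)
    for entry in exclusions:
        ex_host, ex_path = _host_and_path(entry)
        # Match the listed host itself or any of its subdomains.
        host_ok = host == ex_host or host.endswith("." + ex_host)
        # Match the listed path itself or anything beneath it.
        path_ok = path == ex_path or path.startswith(ex_path + "/")
        if host_ok and path_ok:
            return True
    return False


def filter_matches(urls, exclusions):
    """Keep only the source URLs that are not on the exclusion list."""
    return [u for u in urls if not is_excluded(u, exclusions)]
```

For example, with the hypothetical entries `example.org/mirror` and `wikimirror.example`, a match against `https://www.example.org/mirror/Foo` would be dropped, while `https://example.org/news/1` would still be reported.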