Wikipedia:Turnitin


 * Note: This page is mostly outdated: the proposal and planned RfC were in 2012.  Some parts of this page, including those about alternatives, have occasionally been updated.
 * Note: CopyPatrol (https://copypatrol.wmcloud.org/en/) was later introduced to provide this functionality.


 * Note: The above nutshell summary describes the status in 2012.  The planned RfC did not happen.

Turnitin is an Internet-based plagiarism-detection service run by iParadigms. Universities, schools, and professional researchers and writers submit documents to Turnitin's websites, which check the writing for originality against a comprehensive web index, a database of proprietary content, and prior submissions. Managing copyrighted content is a major focus and problem for Wikipedia. This page lays out the concept of a potential collaboration with Turnitin as a way to combine our strengths and improve or resolve a major issue for Wikipedia's content oversight.

Comments are welcome on the talk page.

Background: Turnitin

 * Turnitin checks and archives millions of papers and uses its database and algorithms to identify plagiarized material.
 * Submissions are compared to over 17 billion web pages, 200 million student papers, and over 100 million additional articles from content publishers, including library databases, textbooks, digital reference collections, subscription-based publications, homework helper sites and books. (Turnitin says its web index is now up to 25 billion pages.)
 * Globally, Turnitin evaluates about 40 million papers a year. During final exam periods, the site processes 400 new submissions per second.
 * As of 2012, Turnitin serves about 10,000 institutions in 13 languages in 126 countries.
 * More than 2,500 higher education institutions use Turnitin, including 69 percent of the top 100 colleges and universities (U.S. News and World Report Best Colleges list).
 * Almost 5,000 middle and high schools use Turnitin, including 56 percent of the top 100 high schools (U.S. News and World Report's America's Best High Schools).
 * In Colorado, Turnitin is used by 100 schools—both secondary and higher education—and more than 200,000 students.
 * More than 100 colleges use Turnitin to detect plagiarism in application essays.
 * Turnitin's parent company iParadigms employs almost 100 people. It is backed by the private equity firm Warburg Pincus. It has 8 international offices serving almost 130 countries.  It is headquartered in Oakland, California.

Background: Copyright investigations on Wikipedia
There are several ongoing efforts to deal with copyright on Wikipedia:
 * CorenSearchBot – This is the most sophisticated tool we currently use. It checks new Wikipedia articles against a web search, tags matches with an appropriate message, and alerts the relevant copyright forums.  The bot has limitations: it does not check existing Wikipedia content, it checks only webpages rather than a content database, and it has no corpus of prior submissions.  It is also possible that its matching algorithm is less developed than Turnitin's proprietary code.  The bot is limited to one check every 5 seconds, which would allow it to check over 6 million articles per year; that is enough to cover English Wikipedia almost twice, though it is not clear whether that level of operation is feasible.  The bot does not generate an itemized report that lets editors actually see and compare plagiarized sections or identify the various sources behind a match (for recent reports, see User:CorenSearchBot/manual).  In exploring a collaboration with Turnitin, a necessary question is whether Coren's bot is sufficient, or should be expanded rather than supplanted by Turnitin's system.  There are also possible areas of synergy where the two could complement each other: for example, Coren's bot could tag articles that score high in Turnitin's copyright detection.  Also of interest is the bot's 'excluded' sites list, which includes Wikipedia mirrors; this could be leveraged to help Turnitin optimize its algorithm for Wikipedia (see also: Mirrors and forks and the Mirror filter).  Note that Coren has not been active since 31 December 2011, and his bot has been mirrored and replaced by User:MadmanBot.
 * Duplication Detector – This Toolforge tool compares two web pages directly and identifies areas of overlap. It does not run automatically or query a database.
 * Contribution Surveyor – Created for Wikipedia:Contributor copyright investigations on the English Wikipedia, this tool analyzes the contributions of users with a history of copyright violations. It lists contributions by size rather than by likelihood of violation, so while it helps prioritize the largest potential offenses, it does not emphasize how likely each actually is to be a violation.
 * Copyvio Detector – Another Toolforge tool for detecting copyright violations.
 * WikiProject Contributor Copyright Investigations – This on-Wikipedia group investigates and fixes multiple and large-scale copyright violations. Their important work is largely manual and generally tedious.
 * Copyright problems – A help page for investigating single or small-scale copyright issues.
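The throughput estimate given for CorenSearchBot above is simple rate arithmetic, sketched here for clarity; the 5-second rate limit comes from the description above, and the article count used in the comment is an approximation of English Wikipedia's size around 2012.

```python
# Upper bound on yearly checks for a bot rate-limited to one query
# every 5 seconds, as described for CorenSearchBot above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def max_checks_per_year(seconds_per_check: float) -> int:
    """Return the maximum number of checks a rate-limited bot can run in a year."""
    return int(SECONDS_PER_YEAR / seconds_per_check)

checks = max_checks_per_year(5)
print(checks)  # 6307200 — over 6 million, roughly 1.5x a ~4-million-article Wikipedia
```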

Background: Corporate collaborations

 * The idea of informal relationships with corporations is not without precedent, although it is still relatively new. In 2010 and 2011 Credo Reference donated 400 free "Credo 250" accounts to Wikipedia editors (project page), and in 2012 HighBeam Research offered up to 1000 free 1-year accounts to editors (project page).
 * Wikipedia is an immense and precious global asset; doing anything perceived to compromise its neutrality is not to be undertaken lightly, if at all.  Wikipedia is not a commercial project; it is an explicitly, even fiercely, non-commercial project.  Thousands of companies would love to place their logo or brand association on Wikipedia, but Wikipedia's independence is a primary concern.  In many ways it is simply non-negotiable.
 * Although Wikipedia maintains such strict neutrality and independence in its operations, collaborations with corporations have the potential to enhance the core mission of the encyclopedia. If they are done right, they can be beneficial and pragmatic, addressing major areas of site operations without compromising Wikipedia's objectivity or giving undue privileges to any company.

Principles

 * Respecting copyright is required by law as well as being core policy on Wikipedia, as it aims to be a truly free work for all to use, modify, repurpose, or even sell.
 * Current tools for identifying copyright violations are limited, sometimes manual, not comprehensive, and inefficient.
 * Turnitin provides paid access to a comprehensive copyright and plagiarism database that Wikipedians would find useful in their regular content work as well as in their copyright violation investigations.
 * Turnitin is not inexpensive and would be unaffordable for the majority of volunteer editors who work on the encyclopedia.
 * A collaboration between Turnitin and Wikipedia would be mutually beneficial.

What's in it for Wikipedia?

 * Access to a leading service for plagiarism and copyright detection
 * Increased efficiency and scalability for dealing with copyright violations
 * Ability to prioritize and oversee copyright investigations and cleanup
 * An opportunity to analyze every Wikipedia article using a sophisticated algorithm, which could revolutionize the way we manage our content
 * Enhanced community relations with a provider of education resources
 * Another tool in the community's and editors' bag for monitoring and improving articles

What's in it for Turnitin?

 * High-profile collaboration which would solidify the software's status as the standard in its field
 * Tremendous amount of user feedback from a community which is known for giving feedback
 * Opportunity to improve the content on the largest encyclopedia in the world
 * Visibility within the community as having helped out with an essential aspect of site operations
 * In line with policies, promotion of this collaboration throughout the community
 * Pending discussion, attribution given on Turnitin's off-Wikipedia reports
 * Greater awareness among editors that Turnitin exists and provides a useful service
 * Potential for Turnitin to advertise that it is used to 'check Wikipedia'

What it's not

 * A formal partnership or contractual relationship
 * An endorsement of Turnitin over other similar and competing services
 * An agreement to continue using Turnitin's services if a free, competing, or open source version of comparable software becomes available

Working plan

 * Turnitin reports would be linked generically/anonymously on the talk pages of articles that meet a community-determined threshold of text matching (the name Turnitin would not be mentioned on talk pages)
 * Turnitin's report pages would be rebranded as something like "WikipediaCheck"
 * At the bottom of Turnitin's reports would be a small icon reading "Powered by iThenticate" (iThenticate is a sister service offered by Turnitin's parent company, iParadigms)
 * Turnitin's reports would be integrated with a new or existing bot that periodically queries the Turnitin database during their off-peak hours and writes a report to the article talk page or a subpage
 * A central project page, talk page, or possibly even the article page itself could be updated with results or appropriate tags
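As a rough illustration of the bot loop described above, here is a minimal sketch of the two pure pieces such a bot would need: deciding whether a match warrants a report, and formatting the generic talk-page notice. The 50% threshold, the report wording, and the report URL are all hypothetical placeholders, since no actual Turnitin integration or community-agreed threshold existed at the time; a real bot would also need a (hypothetical) Turnitin query step and a MediaWiki edit step, which are omitted here.

```python
def needs_report(similarity: float, threshold: float = 0.50) -> bool:
    """Decide whether a text-matching score warrants a talk-page report.

    The 50% default is a placeholder; per the working plan above, the
    actual cutoff would be determined by the community.
    """
    return similarity >= threshold

def format_report(title: str, similarity: float, report_url: str) -> str:
    """Build a generic wikitext notice for an article's talk page.

    The wording and the rebranded report link follow the working plan
    sketched above; both are illustrative, not agreed-on text.
    """
    return (
        "== Text-matching report ==\n"
        f"An automated check of [[{title}]] found a {similarity:.0%} match "
        f"with external sources. See the [{report_url} full report] for details."
    )

# A real bot would loop over recently changed articles during off-peak
# hours, call the (hypothetical) Turnitin query endpoint, and post the
# notice via the MediaWiki API; only the pure helpers are shown here.
if needs_report(0.62):
    print(format_report("Example", 0.62, "https://example.org/report/123"))
```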

Attribution
One of the key issues is whether, how, and when to give Turnitin attribution or credit for the services it provides. Here's one example, a notice/banner that could be placed on the Talk page of articles.


Other possibilities for attribution:
 * Promotion of the collaboration on community forums, various copyright projects, article content writing and article review centers
 * A Wikimedia Foundation Press Release
 * A Turnitin Press Release
 * Limited, pre-approved mention on Turnitin's website and promotional materials about this collaboration

Addressing major objections

 * Superiority: Wikipedia currently has a bot running, called MadmanBot, which checks all new pages against Google's database. This is the 'competition', so to speak, and it would be necessary to demonstrate that Turnitin has a more effective, more comprehensive, and more sophisticated approach.  (Note that MadmanBot could also be a crucial linchpin supplementing Turnitin, since it could be the vehicle for querying Turnitin's databases, tagging offending pages, and posting reports on Wikipedia.)  There are also a variety of free and competing paid services; why Turnitin should be used over these is a necessary question to answer.
 * Avoiding false positives: Turnitin will have to show that it can avoid matching pages that are mirrors or copies of Wikipedia. Implementing that exclusion reliably is key to a collaboration.
 * Attribution: The current, most approachable idea is that article Talk pages, which are linked directly at the top of every article, would have a banner linking to Turnitin and a Turnitin report.  Is this fair, excessive, sufficient?
 * Exclusivity: In the past, corporate collaborations have been explicitly non-exclusive, permitting Wikipedia to use competitors' services or discontinue a collaboration at any point for any reason. Wikipedia largely operates on good faith, and mutual benefit has made the non-exclusivity criterion largely moot.  Turnitin, however, may devote considerable resources, planning, time, and energy to this partnership, so we must consider whether some type of exclusivity agreement for some time period is desirable, necessary, or permissible.

Who is involved
Signed on:
 * Ocaasit | c 16:22, 25 March 2012 (UTC)
 * Andrew G. West, computer science PhD student at UPenn who studies wiki security
 * Doc James (talk · contribs · email) (if I write on your page reply on mine) 01:01, 8 May 2014 (UTC)
 * Fuhghettaboutit (talk) Admin with years of experience with copyright violations, detecting them, investigating backwards copying and related issues. Love the idea and glad to help if I can. I have no programming skills, however, and so will be useless on that end of matters.

Bot programmers:
 * User:ValHallASW
 * User:Eran

Consulted:
 * Maggie Dennis, community liaison for the Wikimedia Foundation
 * Philippe Beaudette, director of community advocacy for the Wikimedia Foundation
 * Derek Coetzee, administrator and computer science PhD student at UC Berkeley
 * Coren (Marc A. Pelletier), administrator and operator of CorenSearchBot
 * Madman, administrator and operator of MadmanBot (CorenSearchBot's replacement)

Alternative plagiarism tools

Free:
 * Copybot
 * Unicheck
 * Check-Plagiarism
 * Chimpsky
 * Plagium
 * Plagiarismcheck.org
 * SeeSources
 * The Plagiarism Checker
 * Plagiarism Checker X
 * Viper
 * Plagiarism Checker
 * Plagiarisma
 * Article Checker
 * Article Check
 * Plagiarism Check
 * Detectordeplagio
Pay:
 * Ephorus
 * Attributor
 * Plagiarismhunt
 * Copyscape
 * PlagScan
 * Veriguide