Wikipedia talk:Turnitin

Comments are welcome:

Comments from Madman

 * I've responded inline --Ocaasit | c 16:00, 27 March 2012 (UTC)

Here are my way-too-extensive and highly-caffeinated notes on this proposal thus far. I really appreciate the proposal itself and the effort you've put into it.
 * No, thank you so much for taking a close and extensive look. You're one of the most important people working on copyright issues and I want/need your input.

Background: Do we know how substantial Turnitin's Web corpus really is? I've only heard of Turnitin as a database of submitted papers and papers retrieved from academic databases. While that database is impressive, I find it more likely that infringing content on Wikipedia is going to be from the Web rather than from academia. I could certainly be wrong on that, and I think we'd want to have a considerable trial period during which we could determine how much additional infringement we'd actually find. We do know on the other hand that search engines' corpora are substantial; while it's very hard to determine indexes' size, estimates for Yahoo! have ranged between 2.5 and 18 billion Web pages, and estimates for Bing have ranged between 2.0 and 20 billion Web pages (Yahoo!'s search is apparently now powered by Microsoft, but I'm not clear on what that means).
 * We can find out much more about Turnitin's corpus. I'm sure they provide those details, as they do when they are promoting their product.  Turnitin draws on three databases: a web crawler, proprietary content from media and academic journals, etc., and prior submissions.  Turnitin has their own algorithm which they use for their web crawler, and they have suggested it is optimized to find likely plagiarism.  We have already discussed the need to exclude Wikipedia mirrors, and they appear willing to implement that on their end.  As for web content vs. academic content, the truth about which is the more common source of plagiarism is that we just don't know because we've never had a way to look.  That's one of the reasons a collaboration with Turnitin is so intriguing; it would reveal what the real state of the encyclopedia is.

CorenSearchBot: Some of the limitations of CorenSearchBot that you note such as the hard-coded edit rate no longer apply; others could be easily addressed if a feature were requested, such as itemized reports. I'm hoping to be able to rewrite MadmanBot's mirror of CorenSearchBot sometime soon in a language with which I'm familiar, as I don't see CorenSearchBot coming back soon, and this should result in many improvements in its feature set and maintainability.
 * Itemized reports seem like a critical piece. The ability to see specifically which pieces of a document are matches and what source they came from is essential to efficiently pursuing any copyright investigation.  Thank you for your work picking up after CorenSearchBot.  It's much needed, and our best hope at the moment.  A fair and important question here is whether we can do what Turnitin is offering better than they can, and if not, how much of a gap we are willing to tolerate in our abilities because we are retaining independence of operations.  Some will care very much about that, whereas others will take a more pragmatic view of a collaboration.

Principles: Your point regarding the Wikimedia Foundation's neutrality is well-taken, and I think it merits much discussion within the WMF. Speaking for myself, I'm less concerned by using proprietary databases as that's what I'm doing already, but mine is a service provided to Wikipedia, not by Wikipedia. I would have no objection to similarly making API requests of Turnitin's database; I think any bot operator would, however, object if any proprietary code were necessary to make these requests. I don't imagine it would be.
 * The WMF is currently reviewing the proposal, and we will need to hear their full feedback before further discussions with Turnitin, or a broader presentation to the community. Your bot could be the best link between Turnitin and Wikipedia, a lynchpin in making this collaboration work.

Concepts: I think the weak form would result in more work for Wikipedia's already over-worked contributors in the copyright domain, and I'm not sure it would therefore be feasible. I think whatever use we make of Turnitin's database should be automated, though I don't agree with the strong concept that every page should be reviewed nor that integration with MediaWiki's needed. I think a bot should be all that's needed, and if every page on Wikipedia were reviewed, there would be massive false positives. Can you conceive of how many papers submitted to Turnitin have been plagiarized from Wikipedia? This is why CorenSearchBot and MadmanBot check new pages only, because otherwise we've found the results are virtually meaningless.
 * I agree that the weak form is a paltry benefit compared to the strong form. Automation is the key.  Running every article on Wikipedia through Turnitin would be desirable if we can tweak the algorithm to exclude the majority of mirrors and self-references.  Also, Turnitin's prior submissions database may be of no use to us, because there is no proof that the writing of other students is copyrighted or not copied directly from Wikipedia.  These aspects will require careful design, trial, and testing to determine just how large a scope is feasible and beneficial.

Attribution: I think a talk page banner is worth discussing; a link/tab on every article is not, for the same reasons as above. You say of anonymous links: "This is unlikely to attract support from Turnitin"... my impression is that you've spoken with Turnitin already. Have they come up with any ideas or made any requests regarding attribution? Could the existing communication be published somewhere so we have more of an idea how feasible this entire proposal is and what's already been discussed?
 * The talk page banner seems to be the sweet spot. Turnitin has made a general request for attribution in some form but nothing specific.  We are waiting to hear from the WMF before continuing the conversation with Turnitin.  I am giving regular updates to WMF staff and will attempt to make all future communications public if possible and relevant.  At the moment, there's nothing new to report.  We're just waiting to hear from the WMF, and then hoping to set up a conference call with key people from Turnitin.  I would love for you to be involved in that process.

Thanks! -- madman 15:33, 27 March 2012 (UTC)
 * You're welcome! Ocaasit | c 16:00, 27 March 2012 (UTC)

Update after conversation with Turnitin's Marketing department
As requested by Madman, I'm going to continually update this talk page with any news from my preliminary discussion with Turnitin. The goal is to gradually involve more key editors and Foundation staff, and then eventually the entire community. Here's what we covered in our last conversation.

Turnitin expressed that it shares a core value with Wikipedia, a commitment to education. They understood our stance on what a truly free encyclopedia entails in terms of copyright and plagiarism, and believe they can help us achieve that goal. In all, Turnitin expressed that a collaboration with Wikipedia aligns with the company's mission.

They explained two common objections that Turnitin often faces in an educational context. The first is that it 'poisons' the student-teacher relationship by shifting the burden of proof onto the contributor. What this means in a Wikipedia context is probably less relevant, because we have a legal and policy obligation to police copyright violations and analyzing submitted text after it's been published does not seem unkind or bitey. We want to catch copyright violations. The second objection involves whether Turnitin are themselves committing a copyright violation by evaluating and storing submissions in their database. They said that they have been legally cleared of this concern under 'fair use' exceptions. This does not seem to be an issue at all for Wikipedia, since our content is free for others to use anyway. We want people to copy our content, provided they give attribution when doing so.

We discussed the four key objections raised on the project page:
 * Superiority. The discussion of whether Turnitin is superior to other copyright detection methods hit on three key areas.  One, Turnitin's web crawler is proprietary and extensive.  It indexes 20 billion pages.  Further, it uses a sophisticated pattern-matching algorithm which is substantially different from Google's or Bing's keyword-matching approach.  They are going to share some of the technical highlights of what that means, though the actual code is a company secret.  Two, Turnitin has great depth in their content database.  This is a significant difference between Turnitin and MadmanBot, or some of the other free copyright detection programs which only search web pages.  Three is Turnitin's ability to massively scale.  They reported processing 50 million pages in the past year, reaching peaks of 400 submissions per second.  That ability means that handling Wikipedia's 4 million pages would not be an obstacle for them.


 * False positives. Turnitin acknowledged that the significant technical challenge is adapting their algorithm to work on Wikipedia specifically, and particularly, to avoid mere mirrors of Wikipedia content.  This is a hard problem that will require development and testing.  CorenSearchBot tried testing existing articles for violations and found too many false positives.  Turnitin understands that in order for the collaboration to be effective for reviewing every Wikipedia article, they will have to make an investment in design and follow-through with a pilot program.  They are willing to do that, pending a positive reception from the broader community.


 * Attribution. This key political or economic issue about how to give credit to Turnitin and how much was actually resolved quite easily.  Turnitin understands that Wikipedia is a public good and an independent project, and that co-branding of any sort is not an option.  They were basically on board with the option of links to Turnitin reports which are posted on article Talk pages.  That seems like a good proposal to bring to the community for discussion.


 * Exclusivity. Turnitin understood that this collaboration would not be formal or exclusive, and that it would not involve payments or contracts.  So, no issue there.

In all, it was a very constructive conversation, seeing how the four major objections appear to be at least understood if not fully resolved, at least on Turnitin's side. The next step is to hear back from the WMF about legal or organizational issues. Then we need to have a conference call to get into technical details, and last, but certainly not least, propose it to the community at large. Timeframe for next steps is 1-3 weeks, and I'll post any updates in the interim to this page. I'll also move the project into the mainspace, since it looks to be advancing. Cheers Ocaasit | c 19:25, 30 March 2012 (UTC)


 * I think Turnitin are understating the problem posed by false positives, and overstating the possibility of a solution. I'd explored this before, and the problem is fundamental. Turnitin is designed to detect plagiarism in student work. Under those circumstances, the work has been newly submitted and is not publicly available, so any hits are either due to copying, accidentally similar wording, or acceptable quotations. This is fine. But that assumption doesn't work on Wikipedia. Wikipedia's material is both public and old, so the risk of reverse copying (which doesn't exist in a given assignment being checked) is very high. The issue isn't mirror sites, which can be filtered out, but that all other online or published content that has emerged since the content appeared in Wikipedia is potentially copied from here. For example, if a school site contains text identical to Wikipedia in their "about us", is that because we copied it from them, or because they copied it from us? To determine this we need to explore a dated archive, such as the Wayback Machine, to see if the material was on their site before ours, and explore the history of the page to see when the material was included. But that presupposes that the site both could be and has been archived, given the restrictions inherent to web archiving, and that the site was crawled in the critical time. And this is a tricky algorithm - I'm currently coding one, but for a general, WP-wide data set it would be a massive challenge even if it was possible.
 * I'm not sure if any plagiarism software designed for assignments can get over the false positives inherent in being one of the most visible content providers on the web. - Bilby (talk) 05:18, 19 April 2012 (UTC)
 * Just as a brief aside, as it is not a core issue, a collaboration won't reveal the true state of Wikipedia, but the state of WP as viewed by Turnitin's algorithm. The result would incorporate Turnitin's bias, which would be tricky to filter out. But this isn't a major point either way - just something of interest. :) - Bilby (talk) 05:21, 19 April 2012 (UTC)
 * Thanks for your comments Bilby. The issue of false positives is known and not trivial.  Turnitin has expressed a willingness to adapt their algorithm specifically for Wikipedia, to design and develop it, and to test it in a pilot program.  So, while it is a technical issue, it is not an abstract one.  We can actually try it out and see if it works.  If it doesn't, there would be less of a role Turnitin could play, mainly analyzing only new articles (as MadmanBot currently does).  That may be an improvement in sophistication, but not in scope, and it would impact the rest of the arrangement. Ocaasit | c 15:48, 19 April 2012 (UTC)
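 * As a rough illustration of the dating check discussed in this thread (comparing when a passage first entered the article history with when the external source is first known to have carried it), here is a minimal Python sketch. The function names, the `Revision` tuple and the result labels are all hypothetical. `source_first_seen` is assumed to come from something like the earliest archived capture of the source page, which is only an upper bound on its true publication date, so the result is a hint for a human reviewer and not a verdict.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

# Hypothetical representation of one article revision: (timestamp, full page text).
Revision = Tuple[datetime, str]


def first_seen_in_history(revisions: Sequence[Revision], passage: str) -> Optional[datetime]:
    """Return the timestamp of the earliest revision containing the passage, if any."""
    hits = [timestamp for timestamp, text in revisions if passage in text]
    return min(hits) if hits else None


def copy_direction(revisions: Sequence[Revision], passage: str,
                   source_first_seen: Optional[datetime]) -> str:
    """Guess which side had the passage first; labels are illustrative only."""
    wiki_first = first_seen_in_history(revisions, passage)
    if wiki_first is None:
        return "not in article history"
    if source_first_seen is None:
        # The source was never archived, or not crawled in the critical period.
        return "undetermined"
    if source_first_seen < wiki_first:
        return "wikipedia may have copied source"
    # An archive date later than ours does not prove reverse copying,
    # because the source may simply not have been crawled earlier.
    return "source may have copied wikipedia"
```

 * Exact substring matching is used only to keep the sketch short; lightly reworded passages would need fuzzier matching, and the "undetermined" branch is exactly the archiving gap raised above.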

false positives?
I read somewhere on the Internet that TurnItIn is designed not only to catch obvious plagiarism (copying without attribution), but also to catch plagiarism of ideas (paraphrasing without attribution); thus, if you write a class paper that contains an idea that is not original, then Turnitin can catch it. But not all ideas that others have come up with are plagiarism – somebody may have come up with the idea independently, without being aware that it was already published somewhere else! When you come up with a new idea, you are often not aware of whether it is really yours or not. Therefore, I actually think that Turnitin is much better suited for Wikipedia than it is for the classroom environment. A false-positive on Wikipedia will cause almost no harm to anybody, while a false-positive in a school environment can seriously harm a person's career. Bwrs (talk) 22:22, 30 August 2012 (UTC)
 * One of the strengths of Turnitin's system is that we can set the sensitivity levels to meet our needs. So, for example, we can have it only catch a 5% text match or require a 25% text match; it can catch full sentence matches, or only small word groups.  In short, Turnitin can be as sensitive as is useful to us, and we might well want to dial it back to avoid false-positives (not to avoid harm per se, but to make the workload of checking the reports scalable).  Also, it's quite important that we view Turnitin's reports, at least for the near future, as investigation tools rather than conclusions. Ocaasit | c 16:52, 8 February 2013 (UTC)

Other languages
Do they cover other languages or only English? What languages are covered and to what extent? Have you notified the other wikis? I'll notify it.wiki/it:Progetto:Cococo. --Nemo 16:02, 6 February 2013 (UTC)
 * Turnitin does have abilities and has expressed a general interest in working on other non-English versions of Wikipedia. Each community would have to make their own decision in that regard, but once we get past the English Wikipedia trial stage, I'd be happy to continue talks with Turnitin in regard to expanding usage to other projects. Ocaasit | c 16:47, 8 February 2013 (UTC)
 * What do you mean "abilities"? Of course the software is reusable, but the question is whether they have an extensive database in other languages too: I doubt they'd build one just for our benefit. Thanks for your answers here and below. --Nemo 21:16, 8 February 2013 (UTC)
 * The software is reusable, and their database is growing in other languages but is to my knowledge significantly larger in English. For most Wikipedia purposes the internet crawler is their primary tool as most copy-paste happens from other websites.  That said, Turnitin is a large and global organization, so I'm happy to ask them about their capabilities in specific other languages.  I take it you are curious about Italian?  The reason I haven't pursued this with other communities yet is because we are still in the pre-trial stage on English Wikipedia, having just recently received access to Turnitin's reports through an API mechanism.  So we haven't even confirmed that Turnitin's tools would work on English Wikipedia, which seems like a necessary first task before moving into other, less comprehensively-indexed languages.  If things go well here, I will do whatever I can to provide the same opportunities to other communities. Ocaasit | c 22:35, 8 February 2013 (UTC)
 * Yes, the Italian Wikipedia may be interested. I understand that you want to start with en.wiki (although maybe it would be easier to start from a "smaller" Wikipedia? who knows), so feel free to leave this for later; I asked just to know how closely I should follow this initiative. :-) --Nemo 18:00, 9 February 2013 (UTC)

Other tools
The WMF was once asked to help with some Yahoo! search API keys (BOSS?) which were needed to run some community bots which helped dozens or hundreds of wikis identify copyvios: do you know the status of this effort and would it be superseded? --Nemo 16:02, 6 February 2013 (UTC)
 * Here's how User:CorenSearchBot currently works according to my understanding (note, this is based on User:Madman's understanding of the code; I haven't been able to reach Coren):


 * 1) A feed of new articles is fed into the Yahoo Boss API
 * 2) The bot searches for the subject of the article and for the subject of the article + random snippets of the text. -- madman 02:40, 3 September 2012 (UTC)
 * 3) It pulls the top five to six results
 * 4) It converts each result page to text
 * 5) It compares the Wikipedia article and the search results with the Wagner–Fischer algorithm and computes a difference score
 * 6) If the score is high enough, the page is a likely copyvio
 * 7) The page is flagged, the creator is notified, and SCV is updated
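 * Steps 5 and 6 above can be sketched roughly as follows. This is a minimal Python illustration of the Wagner–Fischer dynamic program for edit distance, turned into a similarity score. It compares characters for simplicity; how the bot actually tokenizes text, how it scores, and what threshold it uses are not described here, so `similarity`, `is_likely_copyvio` and the `LIKELY_COPYVIO` value are assumptions made up for the example.

```python
def wagner_fischer(a: str, b: str) -> int:
    """Levenshtein edit distance computed with the Wagner-Fischer dynamic program."""
    # Only two rows of the DP table are kept, to bound memory use.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]; 1.0 means the two texts are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - wagner_fischer(a, b) / longest


# Assumed cut-off for illustration only, not the bot's actual value.
LIKELY_COPYVIO = 0.8


def is_likely_copyvio(article_text: str, source_text: str) -> bool:
    """Flag the pair when the texts are nearly identical."""
    return similarity(article_text, source_text) >= LIKELY_COPYVIO
```

 * The distance is quadratic in the text lengths, which is one reason to compare against only a handful of top search results rather than a whole index.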


 * So yes, we're still using the Yahoo Boss API, and last I heard we still have access to it through the WMF. At this point, while we are still setting up and testing Turnitin's code, it would be premature to talk about superseding that effort.  Once we have data to present to the community, hopefully in the next few months, they'll have the opportunity to weigh in on how best to use/leverage/integrate/replace these various tools.  Sorry that's a little vague, but it reflects where we are in the project. Ocaasit | c 16:44, 8 February 2013 (UTC)

Newspapers
Does this tool also check newspapers (which at least in Italy copy heavily from Wikipedia and from each other) and how good is it at reporting who came first? Does it have some nice visualization, e.g. to quickly show that 5 % of newspaper A is copied from Wikipedia, 1 % of newspaper B is and 30 % of their content is copied from each other? Thanks, Nemo 16:02, 6 February 2013 (UTC)
 * Turnitin runs an extensive webcrawler as well as a proprietary content database they've negotiated with a variety of publishers. While I believe it has the most extensive search capabilities of any tool, it does not have all newspapers, because some newspapers prevent their sites from being crawled.


 * The question of reporting who came first is a complex one and something we're working on in the post-processing efforts. So, after Turnitin's reports run, we might then have one of our bots run a check of article version history compared to the website's last-updated or created-on dates.  It's not clear at this point if we can automate those efforts.  In most cases, a human is required to make a judgement call and Turnitin merely helps bring our attention to the most likely suspects.


 * Turnitin has a very nice visualization to show % matches across multiple sources, using a data table as well as text highlighting. We would have access to their full reports to utilize those capabilities, as well as a data feed from their API to use that data in our post-processing efforts. Ocaasit | c 16:57, 8 February 2013 (UTC)

Addressing false positives
Having dealt with copyright a fair bit, I have found that so much work in textbooks, academic journals and around the web is copied word for word from Wikipedia without attribution. The only way to determine who has copied from whom in pre-existing text is to look back at Wikipedia's history and see if we contained the text before the publication date of the source in question. However, if this Turnitin project were to look at all new edits that added significant text (it would need to exclude text that is moved from one place to another), "copy and pasted" content could much more easily be detected and plagiarism would be prevented as a problem going forwards. This would be simpler than running it on entire Wikipedia pages. Doc James (talk · contribs · email) (if I write on your page reply on mine) 11:23, 4 May 2013 (UTC)
 * This is an interesting point that I don't see addressed on a quick scan of the project page. What is the plan for how this will be used, once all the policy and technical hurdles have been overcome?  Will it be run on existing Wikipedia pages, or only on newly added text?  Even if only the latter, it seems there is potential for false positives if an editor is restoring text that was previously deleted.  Also, what's the status of this effort?  I'd be very interested to know if it's likely we'll be able to use it in our education program in next fall's semester. Klortho (talk) 21:44, 1 June 2013 (UTC)
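 * To make the "only newly added text" idea in this section concrete, here is a minimal Python sketch that diffs two revisions word by word with the standard library's `difflib.SequenceMatcher` and keeps only insertions that did not already exist somewhere in the old revision, so that moved text is skipped. The function name `added_text` and the `min_words` cut-off are assumptions for illustration. It does not handle the restored-text case raised above, which would need a look further back in the page history.

```python
import difflib
from typing import List


def added_text(old: str, new: str, min_words: int = 5) -> List[str]:
    """Return chunks of text that are new in `new`, ignoring small or moved chunks."""
    old_words, new_words = old.split(), new.split()
    old_normalized = " ".join(old_words)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    chunks = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag not in ("insert", "replace"):
            continue  # unchanged or deleted text is not a copy-paste candidate
        if j2 - j1 < min_words:
            continue  # too small to be worth checking
        chunk = " ".join(new_words[j1:j2])
        if chunk in old_normalized:
            continue  # the text existed before, so it was moved rather than added
        chunks.append(chunk)
    return chunks
```

 * Only the chunks returned here would be sent on for a plagiarism check, which keeps the volume far below that of re-checking whole pages.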

Can we use this in an education course?
Hi. What's the status of this? I'm helping to run a Wikipedia education course, and some of us are wondering if we can use Turnitin to help us detect plagiarism in the articles that the students write. Klortho (talk) 20:13, 15 January 2014 (UTC)
 * Hey. I will look into this and see how we could get course instructors/admins/ambassadors access.  Please feel free to ping me as a reminder, best at jorlowitz@gmail.com.  Cheers! Ocaasit | c 17:07, 31 January 2014 (UTC)

Conflict of interest question
I posted a relevant question at the Conflict of Interest Noticeboard about a gift card received for a talk given at a Turnitin Webinar, about Wikipedia and Education: Conflict_of_interest/Noticeboard. Please comment if you have any thoughts, ideas, or concerns. Ocaasit | c 17:04, 31 January 2014 (UTC)

Posted here WT:WikiProject Medicine
We need to move this forwards. Doc James (talk · contribs · email) (if I write on your page reply on mine) 01:44, 21 July 2014 (UTC)

Some questions that aren't answered by this page:


 * 1) Has any testing been done on a set of edits to see what the results might look like? I'm a little unconvinced on the idea of comparing edits with tens of millions of term papers or other submissions. If testing hasn't begun, why not? What's lacking?
 * 2) The page says there will be no formal or contractual relationship between Turnitin and WMF, but I don't see how this can necessarily be true if it's assumed Turnitin will be able to use the "Wikipedia" name in marketing material. Thoughts?
 * 3) Is it actually technically feasible to run the filter on all new edits? What would the effect of this be on Wikimedia servers, how would Turnitin access this data, etc.? (In my post to wikimedia-l I asked what the value was of skimming edits instead of all pages; I see that Jim proposed this to address part of the false positive problem).
 * 4) What mechanism would be used to add the report link to the talkpages? A bot account operated by Turnitin? Would access to the Turnitin database be restricted / proprietary, or could other bot developers query it for various purposes?

It sounds like there's a desire to just skip to the end and agree to switch Turnitin on as a scan for all edits, but I think these questions and more will need to be answered before people will agree to anything like full scale implementation. Nathan  T 16:43, 21 July 2014 (UTC)
 * I agree that a bunch of planning would be needed, and the best start would be the development of an example showing a hand-made report. After that, there should be some estimates of what the WP:Turnitin would involve in terms of activity on Wikipedia—how many talk pages might be tagged in a month? how many updates might occur on a central page on Wikipedia in a month? Obviously extreme guesswork would be involved in arriving at estimates, but some planning along those lines would be needed. The problem of mirrors has been noted, but mirrors and reverse copyvios would be a big problem to overcome. Then there is the enormous scale of the task. On that point, while it might be a bit creepy, I was wondering if it would be more feasible for an automated system to track editors—for example, all accounts with more than 2000 edits might be considered whitelisted (their edits would not be checked), and an addition of more than, say, 100 bytes by a non-whitelisted editor would trigger a check of their other additions. This is not the place to plan details, but that idea might be considered. I don't think there is a problem regarding a contract—Wikipedia's legal department would write a letter outlining that certain statements of cooperation were agreed. Johnuniq (talk) 01:34, 22 July 2014 (UTC)
 * Has any testing been done? No, the tool is not built yet. Currently I check for copyright concerns using Google Scholar and Google Books manually.
 * This is only going to be run on edits over a certain size. The pilot is planned for only medical edits. In 2013 there were 400,000 edits to English Wikipedia medical articles. Of these less than 10% were of any significant size. That is 40,000 edits a year or about 100 edits a day.
 * Let's say copyright issues occur in 10% and half the triggers are false positives. That is 20 edits for WikiProject Med members to review a day. We can do that.
 * Turnitin is NOT building this tool. We at Wikipedia are. So yes there will be a Wikipedia-made bot that will take blocks of new text and plug this text into Turnitin. It will not be runnable on all edits. Depending on the results the concerns will be added to a central page at WPMED.
 * Whitelist editors with more than 2000 edits? I have recently caught an account with 30,000 edits of copyright violations and another with 3500 edits of copyright violations.
 * Will this be easy? No and that is why I am offering money to someone who wishes to lead the development. If no other topic area other than WPMED wants it, I have no concern with that. Am wanting development specifically for WPMED. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 04:47, 22 July 2014 (UTC)
 * In addition to the issue of mirrors and reverse copyvio there are the CC-BY-SA sources. I've copied and attributed a number of such sources from Plos-One, CNX textbooks, Biomed Central etc. etc. (even though most of the content has been images). This would also need to be whitelisted. -- CFCF  🍌 (email) 14:44, 22 July 2014 (UTC)
 * This is why we are assuming that half the number of edits picked up will be false positives. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 17:21, 22 July 2014 (UTC)
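 * For anyone wanting to replay the workload estimate given in this thread with different assumptions, here is a small Python sketch of the arithmetic. The input figures are the ones quoted above (400,000 edits a year, at most 10% of significant size, copyright issues in 10% of those, half of all flags being false positives); the function name and parameters are made up for illustration.

```python
def daily_review_load(edits_per_year: float, significant_share: float,
                      copyvio_share: float, false_positive_share: float):
    """Return (significant edits per day, flagged edits per day)."""
    significant_per_day = edits_per_year * significant_share / 365
    true_hits_per_day = significant_per_day * copyvio_share
    # If a share f of all flags are false positives, true hits make up (1 - f) of the flags.
    flags_per_day = true_hits_per_day / (1 - false_positive_share)
    return significant_per_day, flags_per_day


significant, flags = daily_review_load(400_000, 0.10, 0.10, 0.5)
print(round(significant), round(flags))
```

 * With the quoted figures this gives roughly 110 significant edits and about 22 flagged edits per day, in line with the rounded estimates of 100 and 20 above.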

This might have been an interesting Google Summer of Code project, or something the research folks could help on. Have you tried contacting anyone at the WMF about channels to interested people? You could try to offer your grant through their grant program, or speak to the tech folks to see if they have any developers interested. You might also post about it on wikitech-l; I think your last post went to wikimedia-l. Nathan  T 15:01, 28 July 2014 (UTC)
 * Have posted to the tech list. Have spoken with the folks at the WMF. Still trying to find someone to take this on. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 21:18, 28 July 2014 (UTC)

Copy and paste checking tool
Doc James (talk · contribs · email) (if I write on your page reply on mine) 21:01, 9 August 2014 (UTC)

Comparison to Earwig
The discussion of current ongoing efforts to deal with copyright does not make mention of Earwig's Copyvio Detector. Although I do not have much familiarity with that tool, I saw a recent comment which claimed it was the best tool we had. Even if that's a bit of hyperbole, I'd be interested in how that tool fits in with our other tools.-- S Philbrick (Talk)  19:23, 3 July 2015 (UTC)

Updated version
Doc James (talk · contribs · email) 02:59, 18 May 2016 (UTC)
 * on WP and the current workspace
 * in Labs and still under development

Talk at Wikimania
About this work taking place on June 24th at 2 pm. Currently being worked on by the Community Tech Team and User:Eran. Doc James (talk · contribs · email) 02:59, 18 May 2016 (UTC)