User:Tony1/Plagiarism and close paraphrasing: tips for reviewers

This is a work in progress, a sandbox as it were, intended to be the basis of a mainspace resource for reviewers at all content forums. Editors are encouraged to contribute to this draft right now. The task will be to add the community's collective wisdom to these lists, then to take the best tips, rationalise their text, remove signatures, and knit them into a cohesive whole. We need to sort out the technical advantages and disadvantages of the tools, so that a clear, short guide can be presented.

Lead
[Briefly set out the phenomena, the ethical, legal, and practical imperatives, the evolution of how WP has dealt with these issues.]

The fundamental policy at issue is WP:Plagiarism, plus the more recently expanded page on close paraphrasing.

The forums in which audits for plagiarism and close paraphrasing are becoming increasingly useful include Featured article candidates, Featured list candidates, Good articles, Did you know (main page), On this day (main page), In the news (main page), and the multitude of WikiProject assessment processes.


 * A 2009 Signpost article about plagiarism
 * A brief piece about spotchecking
 * A useful clarification by Moonriddengirl in a past discussion: "Something may be both a copyright problem and plagiarism. It may be only a copyright problem, if the content is fully attributed but still violates our copyright policies (as with overly extensive quotations). It may be only plagiarism if the content that isn't attributed is public domain." Moonriddengirl (talk) 12:20, 4 November 2010 (UTC)

Strategies for reviewing

 * One strategy from User:Demiurge:
 * 1) Ask if the editor still has access to the book in question.
 * 2) Pick a particular short but flowery sentence from the nominated article, and ask them to quote the sentences from the book that it's based on.
 * 3) Job done. If they're just a plagiariser that can't write flowery prose themselves so are merely copy pasting it, they're certainly not going to be able to fabricate a different section of flowery prose to explain where their text came from. And in my experience this technique quickly produces the required evidence that the suspicious-looking sentence is in fact a legitimate paraphrase of a substantially larger section of the offline source in question; and thus I can move on, reassured.


 * Another strategy from User:Orlady: "Could you please check paragraph 4 and make sure it isn't a copy or close paraphrase of the cited source (which I cannot see)? When I see such thoroughly polished prose, I want some reassurance that the words are those of a good Wikipedia contributor, without borrowings from a source." I know from recent experience (User talk:Gamaliel) how easy it can be to overlook copyvio issues that should be obvious. Methinks that many contributors would appreciate an alert about those types of issues.

Tools
From NortyNort et al. at DYK talk and elsewhere

For websites
A great how-to guide can also be found here.--NortyNort (Holla) 04:14, 25 July 2011 (UTC)
 * Earwig Copyright Violation Detector
 * Just be careful of mirrors.
 * Internet Archive's Wayback Machine
 * Very helpful in determining the age of a webpage. With this you can determine whether another site mirrored one of our articles.
 * Works best with websites based in the US.
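
The date comparison behind the Wayback Machine tip can be sketched in Python. This is a minimal sketch under stated assumptions: the endpoint `https://archive.org/wayback/available` and the `archived_snapshots` / `closest` / `timestamp` response layout reflect my understanding of the Wayback availability API and should be checked against its documentation; the function names and the sample response are hypothetical and written by hand for illustration. Asking for a very early timestamp is used here as a rough way to get the earliest capture.

```python
from datetime import date, datetime
from urllib.parse import urlencode

# Assumed endpoint of the Wayback availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp="19960101"):
    """Build a query asking for the capture closest to an early date."""
    return WAYBACK_API + "?" + urlencode({"url": page_url, "timestamp": timestamp})


def capture_date(response):
    """Extract the capture date from a decoded JSON response (assumed layout)."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest:
        return None
    # Timestamps are assumed to look like YYYYMMDDhhmmss.
    return datetime.strptime(closest["timestamp"][:8], "%Y%m%d").date()


def who_came_first(captured, added_to_wikipedia):
    """Compare a capture date with the date the text entered the article."""
    if captured is None:
        return "no capture found: inconclusive"
    if captured < added_to_wikipedia:
        return "external page archived first: check the capture for the matching text"
    return "capture is later than the article text: the site may be a mirror"


# Hand-written sample, not real API output.
sample = {"archived_snapshots": {"closest": {"timestamp": "20090314120000"}}}
print(availability_query("example.com/page"))
print(who_came_first(capture_date(sample), date(2011, 7, 25)))
```

A capture date only shows that the page existed by then; the page may be older than its first capture, and the capture itself still has to be read to confirm that the matching text was already there.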


 * Duplication Detector
 * Can compare two different URLs and/or files.
 * For the best accuracy, use the URL of the current article revision, not the default URL; i.e., http://en.wikipedia.org/w/index.php?title=Channel_(geography)&oldid=440965051, not http://en.wikipedia.org/wiki/Channel_(geography).
 * Always read the non-bolded words around the matched text in a comparison. Although only a few words may match, close paraphrasing may still be apparent.
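
For reviewers who build these links often, here is a minimal Python sketch of turning a page title and revision ID into the revision-specific URL form shown above. The function name `permalink` is hypothetical; the URL pattern is copied from the example in the list, and the revision ID has to be taken from the article history.

```python
from urllib.parse import quote


def permalink(title, oldid, site="http://en.wikipedia.org"):
    """Return the revision-specific URL for a page title and revision ID."""
    # Page titles use underscores in place of spaces.
    slug = quote(title.replace(" ", "_"), safe="():/")
    return f"{site}/w/index.php?title={slug}&oldid={oldid}"


print(permalink("Channel (geography)", 440965051))
```

Pinning the revision means the comparison can be reproduced later, even after the article has been edited.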

The Duplication Dictator is lots of fun. Hmmm, let's see... random Geography GAs... nope, nope... nope ..., here is one (Ein Avdat from here). Close enough to check same editor's other articles... yup: Al-Muallaq Mosque almost verbatim from here. And let's see... and, Rochdale Town Hall from here and here, not as bad but definitely in the "close paraphrase" territory. This too I think. So about 30 mins of searching yields 3 potential copyvio/close paraphrase GAs + 1 old DYK (and I notice lots of these GAs have a buttload of deadlinks). Volunteer Marek (talk) 05:43, 25 July 2011 (UTC)
 * Before we cast too many aspersions at the clearly-failed GA process, I've personally been the 'victim' of the reverse effect: a website copies the prose from Wikipedia verbatim without credit or reference. The team history on the Oklahoma Thunder official website seems darned familiar, for example, because I wrote it here and they copied it there. – Dravecky (talk) 20:26, 25 July 2011 (UTC)
 * That's an excellent point, which I've also encountered, hence Volunteer Marek's suggestion to use the Wayback Machine to check for that exact issue. cmadler (talk) 12:17, 26 July 2011 (UTC)
 * One thing that should be done, if using these checking tools, is not to state anything purely on the results obtained, but to take the time to open the sources and read them together with the article, and to make sure you can justify any conclusions you come to. Blithely stating that something is a potential copyvio/close paraphrase is a bit of a cop-out. It either is or isn't, or you are not sure. Saying it might be, based purely on an automated check, is not that helpful if not followed up. And if you conclude that it is problematic, you need to be able to justify that based on a reading of the sources and the article, not just a regurgitation of what an automatic checker has picked up. This is, though, time-consuming for longer articles. If an article is fairly long, it should be acceptable to say that spotchecks have been done (and there is a page around somewhere with tips on how to carry out spotchecks). Carcharoth (talk) 04:29, 28 July 2011 (UTC)

We have a great bot that searches new articles for copyvio. Why can't it, or a similar bot, be programmed to search articles that have suddenly undergone expansion as well? This is a greater issue than DYK; doing so would benefit the entire project since, obviously, copyvio or plagiarism isn't just introduced when an article is created. Daniel Case (talk) 04:10, 27 July 2011 (UTC)
 * CorenSearchBot
 * Good tool I forgot to mention. You can manually check articles as well. The bot has been down the past few days though.--NortyNort (Holla) 13:56, 27 July 2011 (UTC)
 * Well, that helps, but it is severely limited: you get false positives from sites that derive material from Wikipedia (particularly for expand noms, since those tend to have been around longer), and false negatives, for example where the source used is offline. Nikkimaria (talk) 04:15, 28 July 2011 (UTC)
 * No one said it was perfect ... those have always been issues with plagiarism. Perhaps the bot's programming could be altered to exclude known mirrors, or put them in a separate list. And there's really nothing we can do about plagiarism of offline sources, but not all plagiarists are that smart. Daniel Case (talk) 14:10, 28 July 2011 (UTC)
 * Sadly, this bot recently stopped working because the free Yahoo API it used got switched to a fee-for-service model. And, considering that I have never even once gotten a positive result from EarwigBot, I wonder if it might have a similar problem. Duplication Detector is very good, though. Sharktopus talk 01:09, 31 July 2011 (UTC)

For books
A good ol' search or side-by-side comparison is needed.
 * Google Books search

If you are inventive, you can use snippet view on Google Books to great advantage both for sourcing and source checking.--Wehwalt (talk) 14:17, 29 July 2011 (UTC)
 * A word of caution. Snippet views, which don't always bring up the relevant snippets of page scans, are more helpful with CopyPaste, as in here. Not so much with CloseParaphrase, as in here, unless you're willing to do more work. Fowler&fowler «Talk» 15:05, 29 July 2011 (UTC)
 * Snippet view on Google Books

How to spot-check
Dumped here from Wikipedia_talk:Did_you_know.
 * I've noted that some of my reviews are included in the ones TK looked at. I was wondering, would google searching random sentences (without quotation marks) count as spotchecking, or would one have to manually check each reference? I went the google route, 3 sentences per article for those I reviewed from scratch. I AGFed the ones that had already been reviewed. Crisco 1492 (talk) 13:06, 2 August 2011 (UTC)
 * You should manually check books and journals. Other sources can be found more easily in Google searches. Depending on the article size, I usually check about three or so 'blocks' of text from three different references. If an editor copied/close-paraphrased text from one, it is likely they did it for others. You can also compare references to the article in the Duplication Detector, which can also give you an idea if there is close paraphrasing.--NortyNort (Holla) 13:26, 2 August 2011 (UTC)
 * The way to spotcheck is to find a random sentence in the article and look at the language used. Then pull the source that cites the sentence and via a find command check to see if the same language exists in the source. In the two you checked I found them in the first sentences I looked at. I only checked the second one because you'd ticked off the Billy Hathorne one, and all of his I've looked at contain close paraphrasing. To answer your question: googling doesn't work. The sources are in the articles and we need to look at the sources. Truthkeeper88 (talk) 13:29, 2 August 2011 (UTC)
 * (ec) IMO, those methods don't work and are too prone to misses; here is an old thread from my talk page discussing ways to spotcheck. I think it essential to pull up at minimum the most-oft-cited online reference and read it-- when there is plagiarism/copyvio/close paraphrasing, that almost always pulls it up (a frequent plagiarism source on DYK is online obits). You could also go to the first few edits on the article, where you will often find that plagiarists chunk in the text directly from a source, and then on the next series of edits just move a few words around (we can't do that-- if it's in the article history, it's still copyvio). I found one of those yesterday from a recent reviewer that signed off using the new template, so I still hope we have some accountability at the prep or queue level in terms of who is putting the info on the mainpage, and that they are doublechecking the review, and not assuming that some *esteemed* reviewers caught the plagiarism. Also, as DYK gains a means of developing "institutional memory" on frequent nominators and reviewers, spotchecking will become easier (I also target nominators and reviewers who are known to have committed plagiarism or copyvio or missed it on prior reviews for deeper looks, and don't look at all on nominators whose work I know very well). As pointed out many times by MoonRiddenGirl in the October/November threads, DYK is in an optimal position to educate new and old editors alike about plagiarism/copyvio/close paraphrasing and to catch it before those editors create hundreds of copyvios. I still hope that DYK will develop a notification template that can go to the creator, the reviewer, the admin who put it on the mainpage, DYK talk and article talk, linking at minimum to the offending article, WP:COPYVIO, WP:PARAPHRASE, and the Plagiarism Dispatch.
The point of asking "who did this" (which we can now get from Rjanag's new template (with the exception of the admin who put it on the mainpage, which is still missing)) is to educate so it will stop happening. Sandy Georgia (Talk) 13:38, 2 August 2011 (UTC)
 * In addition to the most-often cited online reference, I recommend also looking at any online reference that's given as the sole source for a large chunk of text. That's how I've found close paraphrasing in several of Billy's articles. cmadler (talk) 13:56, 2 August 2011 (UTC)
 * Agree, and that can be particularly helpful on the very first, or first few, edits to a new article. Look at the first few edits, and if they are one source only, read the source and compare it to the text.  Copyvios should be scrubbed from history; the copyvio people know how that is done, which is why you should tag those articles with Template:Copyvio and let those who work in that area deal with it from there.   It is a mistake to try to clean up a copyvio without scrubbing the history, as TK can attest to from her experience at attempting to do that on a serial plagiarizer.  Sandy Georgia  (Talk) 14:01, 2 August 2011 (UTC)
 * It's also worth noting that when information is plagiarised then by definition its source won't be listed. Yomangani talk 14:40, 2 August 2011 (UTC)
 * That's completely untrue. Malleus Fatuorum 14:45, 2 August 2011 (UTC)
 * I suspect TAO (The Adorable One) of tongue-in-cheek. I think he's saying that my method will miss plagiarism/copyvio when the creator didn't indicate the source, whereas a google or some other check may turn them up.  Sandy Georgia  (Talk) 14:47, 2 August 2011 (UTC)
 * Only slightly tongue-in-cheek. Plagiarism is presenting someone else's ideas without crediting them sufficiently. If you cite them as a source you can make the argument that you have given them sufficient credit; you aren't passing the ideas off as your own. Copyvio and close paraphrasing might be caught by these spotchecks, and straight lifts can be caught by Googling most of the time, but if there is plagiarism and close paraphrasing or plagiarism of the ideas without the text, you'll only really be able to detect it if you know the subject or do some research. If you suspect plagiarism you can, of course, ask for a reference: if the editor can't provide one then we are looking at OR (which we hate more than Marmite). Yomangani talk 15:06, 2 August 2011 (UTC)
 * We are in violent agreement; have a Vegemite sandwich on me (don't I still owe you half a pretzel?) Sandy Georgia (Talk) 15:14, 2 August 2011 (UTC)
 * I thought it was a whole one; I hope you aren't defaulting on part of your payment: that would be a scandal. On the spotcheck matter: if I see an uncited section in a heavily cited article I become suspicious. Yomangani talk 15:28, 2 August 2011 (UTC)
 * Your co-competitor stole the other half right out of your mouth when he snatched victory from the jaws of defeat with a gracious response. I'm defaulting for other reasons, though; NYC is out.  Uncited sections-- yes, helpful to google suspicious phrases.  Sandy Georgia  (Talk) 15:37, 2 August 2011 (UTC)
 * (ec) Thanks for the quick replies. A couple replies:
 * @NortyNort: Yes, I knew about the tool, but it isn't useful for Google books. I figured the G-search would be useful because it also searches the books. Thanks.
 * @TK: As I noted on the edit summary, I wasn't sure (because of the history). That's also the reason I didn't add a tick. Thanks for letting me know the better way to check it.
 * @Sandy: Indeed, education is paramount right now. That history method sounds interesting. (Side note, would a copyvio revision be revision-deleted too if the article had already been expanded enough, with enough creative input, to make it non-copyvio? [i.e. multiple sources, better paraphrasing and amalgamation of information])
 * Thanks everyone. Crisco 1492 (talk) 14:04, 2 August 2011 (UTC)
 * Crisco, I can't really answer that-- MoonRiddenGirl is the expert on how Wikipedia deals with them, and my understanding may be incomplete. But my understanding on why it's important to notify CCI is that they need to scrub it ASAP, before further improvements hide the copyvio.  I say that because they indicated many times that on Grace Sherwood, because it had been through FAC and underwent many improvements, going back and scrubbing the original copyvio from history was much harder.  But if productive discussion is now underway here and these matters are being taken seriously now, you may want to ping in MRG and ask her input, since she is far more knowledgeable than I am on how Wikipedia handles copyvio.  Sandy Georgia  (Talk) 14:30, 2 August 2011 (UTC)
 * Indeed she is. I've gone to her multiple times when I come across headscratchers. I think your answer is good enough, and if I see anything like that I will let MRG know ASAP. Crisco 1492 (talk) 14:39, 2 August 2011 (UTC)
 * An RD1 is best suited to an article in which there is a likelihood that the copyvio may be restored or that has a short single-author page history. If a copyvio was inserted in an article 1000 edits back, rev del'ing would be unnecessary given the extensive history and other editors' or the same editor's non-infringing contributions in the period. In addition, if there is the possibility that other copyvios exist in the article, an RD1 would make it difficult to investigate that in the future.--NortyNort (Holla) 21:55, 2 August 2011 (UTC)
 * CopyVioSpeak: I have no idea what any of that means (which is why I leave it to the folks who work there).  Sandy Georgia  (Talk) 02:00, 3 August 2011 (UTC)
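
The spot-check routine described in this thread (pick a few sentences at random, then look for the same language in the cited source) can be roughed out in Python. This is only a sketch under stated assumptions: it presumes the article text and the source text are already available as plain strings, the function names are hypothetical, and matching runs of four words is an arbitrary threshold rather than an agreed standard. It flags passages worth reading side by side; it does not decide whether something is a close paraphrase.

```python
import random
import re


def sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    """Lower-case the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_runs(sentence, source, n=4):
    """Return the n-word runs from `sentence` that also occur in `source`."""
    src = words(source)
    src_runs = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    w = words(sentence)
    return [" ".join(w[i:i + n])
            for i in range(len(w) - n + 1)
            if tuple(w[i:i + n]) in src_runs]


def spot_check(article, source, samples=3, n=4, seed=None):
    """Pick random sentences from the article and list their shared runs."""
    rng = random.Random(seed)
    pool = sentences(article)
    picked = rng.sample(pool, min(samples, len(pool)))
    return {s: shared_runs(s, source, n) for s in picked}


article = "The hall was built in 1871 by a local architect. It is widely admired today."
source = "Records show the hall was built in 1871 by the council."
for sentence, runs in spot_check(article, source, samples=2, seed=1).items():
    print(sentence, "->", runs)
```

An empty result for a sentence does not clear it: as noted above, close paraphrasing can survive with few exact word matches, and offline sources cannot be checked this way at all.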