Wikipedia:Duplication detector

The duplication detector is a tool used to compare any two web pages to identify text which has been copied from one to the other. It can compare two Wikipedia pages to one another, two versions of a Wikipedia page to one another, a Wikipedia page (current or old revision) to an external page, or two external pages to one another. Duplication detector locates passages in which the text on the two pages is the same. The minimum number of consecutive words that counts as a match is adjustable and is set by default to 2.
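
As a rough illustration of what matching with a minimum word count means, the following Python sketch finds the word sequences that two texts have in common. It is not the tool's own code (the tool is written in PHP), and the names `words` and `shared_passages` are hypothetical; it only assumes that a match is a run of identical consecutive words, compared case-insensitively.

```python
import re

def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def shared_passages(text_a, text_b, min_words=2):
    """Return the maximal word runs of at least min_words found in both texts."""
    a, b = words(text_a), words(text_b)
    prev = [0] * (len(b) + 1)  # lengths of common runs ending at the previous word of a
    found = set()
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                # The run is complete when the next words differ or a text ends.
                at_end = i == len(a) or j == len(b) or a[i] != b[j]
                if at_end and cur[j] >= min_words:
                    found.add(" ".join(a[i - cur[j]:i]))
        prev = cur
    # Longest passages first, then alphabetical.
    return sorted(found, key=lambda p: (-len(p.split()), p))

page_a = "The quick brown fox jumps over the lazy dog"
page_b = "A quick brown fox sleeps near the lazy dog"
print(shared_passages(page_a, page_b))               # ['quick brown fox', 'the lazy dog']
print(shared_passages(page_a, page_b, min_words=4))  # []
```

In this sketch, raising `min_words` from 2 to 4 discards both three-word matches, which shows how a higher threshold narrows the output to longer passages.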

Usage
The tool is frequently used to check for copyright issues on Wikipedia, but it has other uses as well. For example, it can help locate quotes in a biography of a living person that were taken from a large PDF, so that they can be checked for accuracy.

The tool is used by supplying the URLs of the two pages to compare (or, if using the advanced version, by uploading either document from your computer). It supports text, HTML, and PDF documents. For other types of documents, check Google's cache for an HTML version by doing a Google search for "cache:URL". To make the tool run faster on very large documents, increase the minimum number of words to at least 3. For source documents containing scattered numerals, you may have to check "Remove numbers" to get the best matches. You also have the option of removing quotations from matches.
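
The effect of an option like "Remove numbers" can be sketched in Python. This is an assumption about the general idea, not the tool's actual implementation: the functions `tokenize` and `word_runs` are hypothetical, and the stray numeral is assumed to be a footnote marker left in the source text.

```python
import re

def tokenize(text, remove_numbers=False):
    """Split text into lowercase word tokens, optionally dropping pure numerals."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

def word_runs(tokens, n):
    """Return the set of all n-word sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

source = "The treaty was signed [12] in Paris by both parties."
article = "The treaty was signed in Paris by both parties in 1783."

for strip in (False, True):
    shared = (word_runs(tokenize(source, strip), 5)
              & word_runs(tokenize(article, strip), 5))
    print(f"remove_numbers={strip}: {len(shared)} shared five-word runs")
# remove_numbers=False: 1 shared five-word runs
# remove_numbers=True: 5 shared five-word runs
```

With the numeral left in, the footnote marker interrupts the copied sentence and only one five-word run survives; with numerals removed, the whole sentence lines up.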

Duplication detector can see article text hidden by templates like copyvio, since the text is still in the HTML page source, but it cannot see text that has been removed from the page. To check removed text, supply the URL of an old revision that still contains it.
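
An old revision is addressed by its revision ID through the `oldid` parameter of the wiki's `index.php`. The following Python sketch builds such a URL; the helper name `old_revision_url`, the page title, and the revision ID 123456 are placeholders for illustration only.

```python
from urllib.parse import urlencode

def old_revision_url(title, oldid, host="en.wikipedia.org"):
    """Build a link to one specific revision of a wiki page."""
    # Page titles use underscores in place of spaces in URLs.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": oldid})
    return f"https://{host}/w/index.php?{query}"

# Hypothetical title and revision ID.
print(old_revision_url("Example article", 123456))
# https://en.wikipedia.org/w/index.php?title=Example_article&oldid=123456
```

The same kind of URL can be copied from the page history by opening the dated link for the revision of interest.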

For evaluating copyright or plagiarism
Duplication detector is best at finding literal duplication; matches with larger word counts indicate extensive passages copied verbatim. It can also assist in detecting close paraphrasing, but human judgment is always required. If matches are highlighted, read and compare the passages with identical text to see whether the copied material is uncreative and whether the surrounding text has been sufficiently rewritten overall. Wikipedia:Close paraphrasing offers some guidance in determining when a rewrite is sufficient; along with Wikipedia:Plagiarism, it may help identify when content is uncreative.

Matched content may be handled in a number of ways. For instance, if the source is public domain or compatibly licensed, the content may be usable as is, provided attribution is handled in accordance with licensing requirements and Wikipedia:Plagiarism. If not, the page may need to be revised or at least flagged for close paraphrasing, or else handled in accordance with WP:CV101.

License
The PHP source for Duplication Detector is available under the Simplified BSD License.