Talk:Content similarity detection

Wiki Education Foundation-supported course assignment
This article is or was the subject of a Wiki Education Foundation-supported course assignment. Further details are available on the course page. Student editor(s): Khall4, Bappelman3, Bls15. Peer reviewers: Kwilliams14, ChloeGui, Adamlawson13, Mkizer99.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 18:23, 16 January 2022 (UTC)

Suggested improvements
There is a large body of academic research in this area.

This page needs a major rewrite to include references.

The structure is fine as an outline.

The following improvements should be made:

First, this section is related to plagiarism detection in natural language, e.g. English. It is not related to detection in other areas, e.g. computer source code, sheet music, diagrams.

Search engines - these are ineffective as they cannot find text in a private database, e.g. a protected forum or an electronic archive of research articles.

Detection software - there is terminology in the research literature to describe different categorisations of software, which should be used here.

Detection algorithms - there are many proposed algorithms and comparative reviews of them exist. There is no reason why one algorithm should be singled out for inclusion.

Source code plagiarism detection - different methods are used for detecting plagiarism in computer source code - and this is again a substantial research area that needs to be referenced. Source code detection engines were developed before those for natural language and informed the development of natural language engines.

There are detection methods, other than just looking for exact text, which can be used for plagiarism detection. For instance, the Glatt detection method, or software that looks for changes in writing style within a document.

There is also a popular research area on plagiarism prevention. This involves designing out opportunities for students to plagiarise by using new/improved methods of assessment (e.g. one, perhaps draconian, example is to replace all coursework with examinations). — Preceding unsigned comment added by 87.194.16.60 (talk) 16:12, 13 May 2007 (edit)

List of free and commercial alternatives is a total spam magnet, it must go!
The subject line of my post speaks for itself. This needs to go! I think I might be bold and just remove it myself now. If you restore it, please respond on talk page. --Jaysweet 17:17, 13 July 2007 (UTC)
 * I came to the plagiarism detection article because I was looking for the commercial systems (I could not remember the name of one I was looking for--it turned out to be "Turnitin".) I could not believe that an encyclopedia article would not mention applications designed to do the task the article is about. So I went back through the History and found that the article did indeed have this information, but that it had been deleted, apparently because listing them was somehow thought to be advertising. But this would be like the Wikipedia article on Word processor not mentioning Word or Open Office for the same reason (that article, of course, does mention these and other applications). Moreover, three of the plagiarism detection applications in the list already have their own Wikipedia articles! I've tried to restore the information about plagiarism detection applications in a way that does not appear to be advertising. Robert P. O'Shea (talk) 21:06, 9 October 2008 (UTC)

I agree, this whole section should be moved to external links (but either all of them or none). --jknabe (talk) 10:43, 29 October 2008 (UTC)


 * I disagree. The existence of the alternatives is supported with individual primary sources. There are other reliable independent sources which support tuples or groups of these alternatives which lack articles. I will endeavor to add those, as well. Readers will come to this article for both research and practical information. To omit information which is sourced a little better than the rest of the article goes counter to Wikipedia's purpose. And note that the bluelinked entries in the lists have already established their own separate notability. The local refs are only to make these items sticky. --Lexein (talk) 13:57, 20 August 2011 (UTC)


 * Looking forward to sourced "better". So far, it's not TEDickey (talk) 14:00, 20 August 2011 (UTC)
 * Working... done, pass2. Let's continue at the bottom. --Lexein (talk) 17:18, 20 August 2011 (UTC)

Pending edits
Better introductions to free-text and software plagiarism can be written, adding references to literature on relevant algorithms. Current lists of software can be cleaned, perhaps to turn them into tables to make them more amenable to comparison. Khondrion (talk) 20:02, 18 September 2008 (UTC)
 * Unless the links are internal (i.e. lead to other Wikipedia articles, such as Copyscape) or can be sourced to news coverage in a publication meeting WP:RS, they will probably be removed as non-notable. These articles attract a great deal of spam links. Flowanda | Talk 21:50, 25 September 2008 (UTC)
 * I created this page: Comparison_of_anti-plagiarism_software following the same model of Comparison_of_reference_management_software. Please help me edit and improve it! - Pepato (talk) 22:29, 21 February 2015 (UTC)

Notoriety of existing systems
Internal links to systems that are obscure outside academia should be avoided; only JPlag and MOSS stand any chance there. 'SIM' and 'AC' are short names that are already very ambiguous and, having little impact outside academia, are more confusing than useful. An alternative is to use things like 'SID (source code plagiarism)' so that the link becomes more specific. This is what I've done with 'Moss (program)'. Khondrion (talk) 11:34, 30 September 2008 (UTC)


 * In that case, the others probably should be moved down into the external-links section (since there's no Wikipedia topic dealing with them). Tedickey (talk) 11:53, 30 September 2008 (UTC)

Related pages
There seems to be no page on general similarity detection, either for text (natural language) or for program similarity. The classification of similarity detection for source code can be expanded into a full article. Additionally, I can't find any page on normalized compression distance (NCD), which is very popular in general similarity detection, especially in bioinformatics. Khondrion (talk) 11:34, 30 September 2008 (UTC)
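
For readers unfamiliar with the NCD measure mentioned above (usually expanded as normalized compression distance), here is a minimal sketch of the idea. It is only an illustration and is not taken from any of the systems discussed on this page. The choice of zlib as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a stand-in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical sample inputs, chosen only for illustration.
    a = b"plagiarism detection compares documents " * 50
    b = b"sheet music and diagrams are a different matter " * 50
    print(ncd(a, a), ncd(a, b))
```

Values near 0 suggest that the two inputs share most of their content, and values near 1 suggest little shared content. A real compressor only approximates the ideal measure, so the results are best used for ranking pairs rather than as absolute scores.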

Sorting lists of systems
I don't know if there is any explicit guideline on how to sort lists of systems, especially commercial systems. Popularity would be a good candidate, but there is no easy way to measure it. Alphabetical would be OK as a default, but less informative. Suggestions?

On the other hand, since all systems in the 'source code plagiarism' section are free (whether their sites are up or down is a different matter), I removed the reference to 'free' from their list-headings.--Khondrion (talk) 23:52, 10 December 2008 (UTC)

repeated links for CopyTracker
There appears to be no reason for repeated links for CopyTracker - the statements which Ofol makes about the page content are not supported by the actual page. If Ofol wants to advertise it, he should provide at least a separate link which gives the corresponding information. Tedickey (talk) 13:23, 20 February 2009 (UTC)

Source code plagiarism questions
As far as I understand, there is only academic source code plagiarism: when students copy-paste code from their peers or the internet rather than coding it themselves for a homework or project assignment. But copy-pasting significant chunks of code is a common practice in programming and designing new programs, unless a software patent is infringed (Software patent debate). Also, accessing and referencing libraries (compiled code) is not plagiarism. So, the questions:
 * Is there plagiarism in source code in any of the programming languages (which are in the group of constructed languages)?
 * If there is, where is the line for plagiarism in constructed languages as compared to the natural ones?
 * If not, why does plagiarism in the natural languages exist if it does not in the subset of the constructed ones (programming languages)?
 * How does this line of plagiarism of source code interact with software patents (because patents do not allow even paraphrasing; they patent ideas and their implementation)?
These questions popped up when I realized that free and open source software developers do not really cite where they took ideas from and from whom; they just implement or copy-paste an algorithm, loop, function, or object. Maybe somebody in the field could help to clarify these and add references and explanation to this article. Kazkaskazkasako (talk) 12:26, 16 August 2009 (UTC)


 * No - there is "professional source code plagiarism" (both from commercial and open source developers), e.g., directly copying thousands of lines of code, removing copyright notices. I'm familiar with some examples (which would not be topical), but haven't noticed anyone doing serious research (which would be topical) Tedickey (talk) 13:01, 16 August 2009 (UTC)

Removal of section on "software plagiarism"
(moved from my talk page, since it is topical here) Tedickey (talk) 10:47, 5 October 2009 (UTC)

Hi,

I'm going to ask you to put my edits back. All of the references were from peer-reviewed IEEE journals and law magazines. My work in this field is recognized as significant. I point out serious flaws in the research that is, in fact, promoted on this Wikipedia page as fact when it is actually questioned by legal and computer science professionals. This is a field of study to which I have devoted years of my life, and my work has been recognized by others in the field as well as being accepted in U.S. courts.

Wikipedia lists all of my competitors' products and there is no reason not to list mine. I do not say anything promoting the product or using superlatives -- I simply say it exists, that it measures source code correlation, and where to find it on the web. All of the competing products include exactly the same information.

Bzeidman (talk) 22:21, 3 October 2009 (UTC)


 * If your paper were topical and widely known, you wouldn't have to add it yourself Tedickey (talk) 22:28, 3 October 2009 (UTC)

From the Wikipedia guidelines for speedy deletion due to unambiguous advertising or promotion: "Note that simply having a company or product as its subject does not qualify an article for this criterion." An Internet search for "source code plagiarism" or related terms will show a significant number of references to my name, my research, my company, my product, and my articles. The fact that someone else has not yet entered this information into Wikipedia is not relevant. Bob Zeidman (talk) 02:32, 4 October 2009 (UTC)


 * There are several guidelines that deal with this - suggest you start with WP:COI Tedickey (talk) 10:48, 4 October 2009 (UTC)

From WP:COI regarding citing oneself, "Editing in an area in which you have professional or academic expertise is not, in itself, a conflict of interest. Using material you yourself have written or published is allowed within reason, but only if it is notable and conforms to the content policies. Excessive self-citation is strongly discouraged." At this point I'm going to request a dispute resolution. Bob Zeidman (talk) 16:19, 4 October 2009 (UTC)


 * More than that: (a) it appears that you're not well-known or authoritative to exploit that, and (b) your edits are self-promotional, rather than informative (disregarding the opinions expressed in the edit because of (a)) Tedickey (talk) 00:33, 5 October 2009 (UTC)


 * Googling, I haven't found any reference to your work which you did not write yourself. Perhaps you can provide a pointer to that. Tedickey (talk) 09:56, 5 October 2009 (UTC)


 * Characterizing a workshop paper as peer-reviewed is stretching the facts. Actually, having read your paper, it comes across as just another infomercial (no useful technical content, just an advertisement). Tedickey (talk) 09:56, 5 October 2009 (UTC)


 * Googling "source code plagiarism zeidman" gives 900 hits. Googling "source code plagiarism" gives 167,000 hits.  Stating in the third-party opinion that "will show more hits for my research" is nonfactual.  Which search engine do you use? Tedickey (talk) 09:59, 5 October 2009 (UTC)

Third opinion: I'm with Tedickey on this one. These edits are almost certainly self-promotional, which is not allowed here. Aside from the tone of the text being inappropriate and the obvious conflict of interest, they don't add to the article. It's mostly original research followed by several spam links. I see no need for it here. -- Hello Annyong (say whaaat?!) 17:28, 5 October 2009 (UTC)

OK, I'm confused. First, Googling "source code plagiarism" of course turns up pages with that phrase, not research in the field. Searching for those terms using Google Scholar shows 10 references to me and 154 references altogether. Only a few researchers' names show up more than mine, and many of those references are decades old.

Second, my papers listed below were peer-reviewed for conferences and journals according to strict policies. The IEEE is the largest organization of electrical engineers in the world:


 * http://www1.gi-ev.de/fachbereiche/sicherheit/fg/sidar/imf/imf2009/slides/19-RumpSession1-Zeidman_DUPE_IMF2009.pdf Zeidman, R., “DUPE: The Depository of Universal Plagiarism Examples,” 5th International Conference on IT Security Incident Management & IT Forensics, September 2009.
 * http://www.isca-hq.org/CATA-09-PROGRAM.pdf Baer, N. and Zeidman, B., “Measuring Software Evolution with Changing Lines of Code,” 24th International Conference on Computers and Their Applications (CATA-2009), April 10, 2009.
 * http://www2.computer.org/portal/web/csdl/doi/10.1109/SADFE.2008.9 Zeidman, R., “Multidimensional Correlation of Software Source Code,” The Third International Workshop on Systematic Approaches to Digital Forensic Engineering, May 22, 2008.
 * http://www.iiisci.org/Journal/CV$/sci/pdfs/S259RPB.pdf Zeidman, R., “Iterative Filtering of Retrieved Information to Increase Relevance,” Journal of Systemics, Cybernetics and Informatics, Vol. 5 No. 6, 2007, pp 91-96.
 * http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1652022 Zeidman, B., “Software Source Code Correlation,” 5th IEEE/ACIS International Conference on Computer and Information Science, July 12, 2006.

In addition, I have written numerous articles on the subject for magazines such as Dr. Dobb's Journal, Embedded Systems Design, Software Test & Performance, and Intellectual Property Today.

Here is a partial list of independent references to my work:


 * http://portal.acm.org/citation.cfm?id=1413815
 * http://proc.isecon.org/2008/3344/ISECON.2008.Abraham.pdf
 * http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/demo_codematch.html
 * http://www.freepatentsonline.com/7493596.html
 * http://pearl.cs.pusan.ac.kr/publication/JiJH2007ICDIM2.pdf
 * http://whoyouknow.co.uk/uni/msci/report.pdf
 * http://www.softpanorama.org/People/Torvalds/Finland_period/was_the_initial_linux_kernel_code_independent_of_minix.shtml
 * http://www.patents.com/Resolving-license-dependencies-aggregations-legally-protectable-content/US7552093/en-US/
 * http://www.patentlens.net/patentlens/patents.html?patnums=US_7552093&returnTo=patentnumber.html%3Fquery%3D%2528US_7552093%2Bin%2Bpublication_number%2529
 * http://etds.ncl.edu.tw/theabs/site/sh/detail_result.jsp?id=095CCIT0394014
 * http://www.ime.usp.br/~dhgoya/forense_paper.pdf
 * http://trochu.kvalitne.cz/diplomka/DP_xhaum07_v4_se_zadanim.pdf
 * http://www.icast.org.tw/project/project-401-nd-soc/project-401/projectmanager.2006-10-31.1747635236/Report_Final/8cc7ankejikua570bhezuo8a08756b-95niandu7d50an5831gao.pdf/download
 * http://www.ddj.com/architect/184406247
 * http://www.inflibnet.ac.in/caliber2009/CaliberPDF/55.pdf
 * http://galab-work.cs.pusan.ac.kr/TRBoard/datafile/041030932galabPRJ04026COPY.pdf
 * http://www.groklaw.net/articlebasic.php?story=20040917083356327

HelloAnnyong misunderstands Wikipedia's policy on original research, which states: "Wikipedia does not publish original research or original thought. This includes unpublished facts, arguments, speculation, and ideas; and any unpublished analysis or synthesis of published material that serves to advance a position. This means that Wikipedia is not the place to publish your own opinions, experiences, arguments, or conclusions." Everything I put into the page was previously published, peer-reviewed, and extensively referenced in other papers and articles.

Given this, I would like your help. How would you suggest that I include my peer-reviewed, well-quoted, well-respected research in this area into Wikipedia in a way that you would be comfortable with? --Bob Zeidman (talk) 17:01, 10 October 2009 (UTC)
 * Yes, you're right. Despite my 16,000+ edits and three years on this project, I misunderstand what original research is. The OR parts of your edits were the parts where you didn't cite any sources, like the second paragraph in the Controversy section that you added. Without any sourcing, I have no idea where that text came from, and it was all unverified text. What you have to understand is that this is not a place to extol yourself by adding paragraphs of text about your research. One person writing about a topic repeatedly does not make it a controversy. And this isn't the place to grind your axe about what you think the subject should be named.
 * Now that that's out of the way, we can discuss what should be included in this article. I do believe that we should have some section about the history of plagiarism detection and how it has evolved. And in that, I suppose we could put in parts about naming and discrepancies in systems. But they would have to be small, neutral and very well sourced. And the sections would have to use sources that are not just your own research. We would also have to be careful of synthesis of sources - basically, taking source A and source B and coming to some conclusion C that is not directly stated in either source.
 * So if you think that can be pulled off, then great. -- Hello Annyong (say whaaat?!) 19:34, 10 October 2009 (UTC)


 * That'll be difficult, since his papers essentially say that he's the only one doing research in the field, etc. Not much here to work with. Tedickey (talk) 20:08, 10 October 2009 (UTC)

Embedded list of tools
I plan to add independent reliable sources to support the embedded WP:List of free and commercial plagiarism detection services. Primary sources are reliable about themselves, but are only the start of establishing reliability and WP:Verifiability of claims. The bluelinked items already have their own articles, and their own WP:N notability established. The redlinked items should be allowed, with primary RS and secondary RS. I do see the points about spamminess. For the items listed, if I can't find an independent RS, I will delete them myself. A bit of time is requested for this process. As an aside: I came to this article specifically for plagiarism detection tools - this may be worthy of a poll, right at the bottom of the article (I wonder how that's done, officially). I noticed the unreferenced embedded list, and proceeded to improve it. --Lexein (talk) 14:10, 20 August 2011 (UTC)
 * Thanks for not reverting again, though I welcome discussion.
 * WP:External links. I didn't add external links - the (redlinked) list items had no inline refs to reliable sources. The refs have links, for WP:Verification, since the domain names are not the product names, in many cases.
 * WP:Embedded lists require independent RS too, so they are now added. Some tools are supported by only one RS list; their deletion/retention should be discussed.
 * Discussion point: The inclusion of a large embedded list would be disruptive to this article. This list doesn't seem large, for now. If it grows, perhaps it should move to List of plagiarism detection tools. --Lexein (talk) 17:15, 20 August 2011 (UTC)


 * WP:LIST encourages interested editors to create clear, easily followed inclusion criteria, so this is a proposal for list inclusion criteria, phrased so as not to refer to Wikipedia:
 * Tools shown in this non-exhaustive list are supported in multiple reliable academic and media sources. ( "supported", not "discussed" because some of the sources contain embedded lists.)
 * I would permit RS embedded lists as sources (here), because the article doesn't discuss list items, it just lists items.  Discuss?-Lexein (talk) 23:59, 20 August 2011 (UTC)


 * Went ahead and commented out 3 tools for which no RS exists yet. --Lexein (talk) 14:31, 21 August 2011 (UTC)


 * Half of the references are simply links to the vendor's website with the name of the vendor. Those might be sources for an WP:OR topic, but as it is, all they're achieving is a more visually attractive set of external links TEDickey (talk) 08:37, 8 September 2011 (UTC)


 * I'm not sure what your suggestion for article improvement is, here. If you would like the primary refs removed from the items with articles (bluelinked), state that clearly. For the rest, for items in an embedded list which do not have their own article, a primary source is always allowed, and we don't call it an external link, we call it by its proper name, a convenience link, part of a full citation when possible. AFAIK, WP:Primary sources are considered reliable about themselves (with certain caveats which do not apply here). --Lexein (talk) 09:20, 8 September 2011 (UTC)
 * Not really - those are ad hoc links to a list whose composition was determined solely by Wikipedia editors and the list doesn't correspond to any primary source, reliable or otherwise. TEDickey (talk) 09:31, 8 September 2011 (UTC)


 * "Not really" - what? What are you answering? Anyways, "those are ad hoc links to a list" - no, not ad hoc at all, and not to a list - all the refs contain links to reliable sources, both secondary (independent, describing items) and primary (describing themselves). The list's contents are determined (as is all Wikipedia content) by independent RS: several items appear in multiple RS, and different RS discuss or list several items, so the phrase "entirely by Wikipedia editors" is inaccurate. "The list doesn't correspond to any primary source" - no, several secondary sources describe the list items, as does each item's primary source, itself. Perhaps you're meaning "primary" in another way than I, and the primary sources guideline. If you mean "only one main source encompassing all content in an article or list", there's no requirement anywhere in any Wikipedia policy, guideline, or essay for such a single monolithic source to exist. Oh well, anyways, I'm not seeing suggestions for article section improvement, nor a reply to my statement about removing refs from bluelinked items, so I'm going to third opinion, in the hopes of moving forward, instead of in circles.  --Lexein (talk) 10:18, 8 September 2011 (UTC)

Response to third opinion request:
 * The use of primary sources is to be taken with care, and is always preferred to have access to secondary and tertiary sources, but primaries are perfectly suited for referencing a lot that can be said about a subject. If there was something contentious regarding the inclusion on this list, such as claiming a feature that maybe they don't have, then the primary would prove inadequate, but this case is not so detailed, and the only separation is regarding their commercial status, which is something you can normally believe a primary about — frankie (talk) 01:46, 15 September 2011 (UTC)


 * Thanks. However, your response does not focus on the area of disagreement. I'll recap. The dispute is based on the other editor's having recharacterized an existing abuse of WP:EL, beautifying the list to a small extent, and calling the result (not integrated into any discussion sourced other than as a simple list of external links) "WP:RS (primary sources)". There is no third-party source which presents a similar list, so no examples can be taken from that and used to support the list. TEDickey (talk) 10:39, 15 September 2011 (UTC)


 * I reiterate: "there's no requirement anywhere in any Wikipedia policy, guideline, or essay for such a single monolithic source to exist." If you look at the independent RS, they, severally, and in an overlapping way, mention or list the items in this list.  I think you may be unfamiliar with the nature of WP:Embedded lists, and trying to overzealously apply WP:EL. The independent RS support the existence, reliability, and verifiability, of the claims made in the list description, and each item in the list, and are directly relevant to the article's content elsewhere.  You're welcome to use those very same independent RS to support other portions of the article.  The entire previous section "Plagiarism detection systems" in my opinion opens the door to examples of plagiarism detection systems.  My addition of sources is an improvement to the article, since the entire previous two sections are unsourced, as I have now indicated, for your information.  --Lexein (talk) 11:17, 15 September 2011 (UTC)


 * I'm sorry if I'm missing the point, although actually I now see two different things being discussed. Regarding the external links, all I can see with that diff is the tag being moved to the position indicated by the documentation, and as far as I can see the links do support the article's topic. Still, the tag wasn't removed, so someone else might still come by it and judge in turn. Regarding the list, it is not necessary to have a third party source presenting these items in the same manner as it is done here. The list serves the purpose of indicating to the reader the pieces of software available, and the one thing to limit the addition of a program is whether it can be verified that it is indeed a system for plagiarism detection.
 * Of course, if you would like additional input I would recommend you to make a post at either No original research/Noticeboard or External links/Noticeboard, depending on how you wish to address it — frankie (talk) 14:14, 15 September 2011 (UTC)


 * The Link farm tag was placed, opposing the embedded list as if it were an External Link farm, which it was not, and is not. There were no "links", just sources.  All table entries are either bluelinks to articles, or are multiply reliably sourced.  That's why I moved it to EL: it was non-applicable where placed.  This editor seems to think all reliable sources equate to External Links.
 * I welcome further discussion at either noticeboard - this embedded list will pass with flying colors at the time of this writing. --Lexein (talk) 14:30, 15 September 2011 (UTC)


 * As you see from the discussion, Lexein categorically rejects any suggestion that it's a link farm. It's been percolating along for the past few years, with some reorganization. The difference from the previous edits is that Lexein wants to elevate the dispute in some fashion, e.g., to get other editors to make explicit concessions to Lexein's point of view. This is a hostile thread, by the way. TEDickey (talk) 23:49, 15 September 2011 (UTC)


 * Comments should be added after prior comments, per the instructions at the top of the Talk page, and convention, so far, on this page. Inserting comments out of date order at the same level of indent distorts the sense of, and order of discussion.
 * Yes, I do categorically reject that it's a link farm, because it's an WP:Embedded list of bluelinked articles, with items without articles being secondarily multiply sourced. But I don't think this is particularly hostile, I've applied policy and guideline-based reasoning at every turn, phrased a number of ways, making no headway with this sole editor. Since TEDickey asserts that the (discussion? dispute?) has been percolating along for the past few years, whose point of view is really being pushed here? Is WP:OWN going on?  Yes, I did indeed, with improving Wikipedia aforethought, elevate the section from unsourced to sourced, and from a long list of actual external links, to a tidier table, containing a multiply and reliably sourced WP:Embedded list of examples of what was being discussed in the preceding section.  Oh, and I deleted all the unsourceable items.  As already stated, take it to either noticeboard.  I absolutely, enthusiastically welcome it. --Lexein (talk) 00:21, 16 September 2011 (UTC)


 * I can't say much more but that I too disagree that this is a link farm, even if it may technically look like one, as the presence of the links serves to verify the inclusion of the products, which all are closely related to the article's topic. The links for those products that have their own articles have been removed now, as verification is done there. Best regards — frankie (talk) 16:32, 16 September 2011 (UTC)


 * Regarding "asserts", the topic history and change comments are where to look, not in an editor's offhand remarks. Characterizing my discussion here as "nonresponsive" is hostile, as are some other remarks by Lexein in this page.  Regarding WP:OWN, Lexein "owns" the current topic, and it's unlikely that having read this discussion, there'll be anyone likely to do much more than watch it.  TEDickey (talk) 00:22, 17 September 2011 (UTC)


 * Not responding, or taking 18 days to respond, or not answering questions, or making only negative remarks (without acknowledging, ever, that your concerns have been addressed), is definitely nonresponsive. "Own"ing the current topic?  Huh? The fact that nobody else has bothered to comment about the embedded list or edit the article section, should tell you something: I'm not engaged in any sort of WP:OWN behavior. In case you didn't know this, the edits I performed are unremarkable - they are the same sort of edits all editors do, on policy: addition of reliable sources, wikilinking, removal of unsourced content, removal of redundant refs, reformatting of tables to take up less space.  No big deal. --Lexein (talk) 01:21, 17 September 2011 (UTC)

Urkund and Ephorus
I've removed those additions from the embedded list of software products as lacking articles, and lacking independent reliable sources. --Lexein (talk) 13:56, 28 June 2012 (UTC)
 * Urkund is a Swedish system, Ephorus a Dutch one. I reported on the effectiveness of these systems in 2010. I have just released a test of collusion detection systems, including systems that find collusion in program source code. I don't want to edit in my own work, but it may serve as a source for others wanting to add material to this article. --WiseWoman (talk) 23:59, 10 November 2012 (UTC)
 * I don't know about their notability, but Urkund is used by the University of Amsterdam (it is university no. 62 in the world top, according to http://www.topuniversities.com/university-rankings/world-university-rankings/2012?page=2 ). This would make Urkund trustworthy, if not notable. I mean it is not just another software tool devised by a geek in his basement, and known only to his friends. Tgeorgescu (talk) 22:11, 12 November 2012 (UTC)
 * Thanks, both WiseWoman and Tgeorgescu. Just a reminder that we do want to locate one or two more independent RS discussing these two software packages, for them to be included in the list. --Lexein (talk) 23:17, 16 October 2013 (UTC)
 * I've just published my 2013 report on plagiarism detection systems at . None of the systems are good or very good, but Urkund, Turnitin, and Copyscape come in at average, Ephorus and a number of other systems were given at least a passing grade. I'd prefer someone else edit the page, though, don't want to be thought to be tooting my own horn. --WiseWoman (talk) 18:18, 18 October 2013 (UTC)
 * Thanks, and, duly noted. I hope you understand there will be some reluctance to immediately cite your work that you've directly advocated here. Independent reviews like yours are indeed part of measuring the relevance of other cited sources. Unfortunately, (a) the 2013 work doesn't appear to be published yet except on your own website (not yet peer reviewed), and (b) your advocacy of it puts us strangely in the same position as overciting Gipp et al. (below): the appearance of promotion or bias or both. So we slow down a bit; it's always better to wait for research to be peer reviewed for publication in a journal or conference proceedings, and cited by others in the field. Besides, WP is not a newspaper, so timeliness regarding research topics is less important than in-field relevance as determined by others in the field. I think, and correct me if I'm wrong, this is a new and relatively small area of research, only small due to a relatively small number of active researchers. It has periodic breakthroughs which are discussed in other research specialties and their journals, as well as popular media; growing coverage is inevitable, so we will eventually cite further work which is published. And thank you for declaring your conflict of interest, and respecting our guidelines. Mine is not the last word on the subject - TEDickey? --Lexein (talk) 08:34, 19 October 2013 (UTC)
 * (nothing to add here) TEDickey (talk) 13:09, 20 October 2013 (UTC)

IP edits
I've reverted a set of recent IP edits which were unnecessary. The citations used in the list of software are called WP:Named references, and they refer to WP:List-defined references which are listed in full at the bottom of the article. List-defined references can be inconvenient in one circumstance: when performing section edits where temporarily viewing a reflist is desired. However, keeping lengthy citation text out of the article prose is very beneficial for readability while editing Wikitext. --Lexein (talk) 17:41, 13 December 2012 (UTC)

Gipp citations
The repeated citations for Gipp papers have the appearance of (self)promotional editing, particularly noticeable because none of the authors of these papers appear to be notable. TEDickey (talk) 09:04, 15 October 2013 (UTC)
 * Worth a look, but let's AGF en route. If you're really referring to reliability, the plagiarism research field is not huge, but has active contributors who nontrivially cite Gipp et al., according to Google Scholar. Because the article minimally depends on Gipp, and other researchers have not attacked Gipp as irrelevant, but have instead referred to and built upon Gipp et al.'s work, I'd say there's very little actual promo going on, especially since I added one or two of those, and I don't know or have any association with Gipp. I felt Gipp13 was a valid citation to support CitePlag in the list of software. --Lexein (talk) 13:37, 15 October 2013 (UTC)


 * notwithstanding the noticeable amount of cites here (really, Google Scholar gives 14 items, with only the first 2 worth mentioning), there was the most recent edit which added a "first to implement" claim, which lacked an independent source. Not to recommend removing all of those, but pruning to present a balanced view of the available sources seems in order (unless you're going to argue that Gipp is the dominant author in the field) TEDickey (talk) 20:04, 15 October 2013 (UTC)
 * Nope, I'm not advocating Gipp et al. over any others. We may be seeing a flurry of pubs involving Gipp, but that's not a good reason to overcite here. Worth keeping an eye on, and awaiting further independent sources. --Lexein (talk) 22:22, 16 October 2013 (UTC)

These are from Gipp's dissertation (I was a member of the dissertation committee). The dissertation has a very good overview of the various methods for plagiarism detection; I am not aware of a better one at the moment. I'd also love to add a link to my book in English, "False Feathers: A Perspective on Academic Plagiarism", Springer, 2014, ISBN 3642399606, but that would be self-promotion. I'd be happy if someone would include it, as I have also written on the topic :) --WiseWoman (talk) 14:44, 9 September 2016 (UTC)

Download limit exceeded
The first reference is to http://citeseerx.ist.psu.edu/messages/downloadsexceeded.html — a page that simply reports that the site's download limit has been exceeded. This seems to have nothing to do with the subject. Should it be a link to a different page, and if not, what's its relevance to plagiarism detection? Wocky (talk) 18:27, 30 July 2021 (UTC)

Requested move 11 June 2022

 * The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review after discussing it on the closer's talk page. No further edits should be made to this discussion. 

The result of the move request was: not moved. Natural disambiguation kept. (closed by non-admin page mover) — Ceso femmuin mbolgaig mbung, mellohi! (投稿) 03:43, 25 June 2022 (UTC)

Content similarity detection → Similarity detection – The common name of the article's topic appears to be "similarity detection" instead of "content similarity detection." "Content similarity detection" is rarely mentioned in articles on the subject, while the phrase "similarity detection" is used much more frequently. Jarble (talk) 01:19, 11 June 2022 (UTC) — Relisting. — Ceso femmuin mbolgaig mbung, mellohi! (投稿) 03:25, 18 June 2022 (UTC)

 * Oppose – the title is already too general, relative to the article's topic; audio and video content similarity detection are other things it could cover. Making the title even more ambiguous would serve no purpose. Distinguishing from style similarity detection, even within text similarity detection, is key here, too. Here are plenty of papers that talk about "content similarity detection". Dicklyon (talk) 05:15, 18 June 2022 (UTC)
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Sequence alignment
This article doesn't mention sequence alignment. Is sequence alignment also a method of content similarity detection, since it consists of identifying similar substrings in two different sequences or strings? Jarble (talk) 19:46, 18 September 2023 (UTC)