User talk:EarwigBot/Copyvios/Exclusions

Protected edit request on 30 October 2014

 * 1) Please replace all  with  or  as is appropriate on a case-by-case basis.
 * 2) Please replace all of the http:// in the internal Wikipedia sites section with https://, since Wikipedia uses a secure server.
 * Thank you. — {{U|Technical 13}} (e • t • c) 17:02, 30 October 2014 (UTC)
 * Should probably ping here as well.  Thanks again. — {{U|Technical 13}} (e • t • c) 17:04, 30 October 2014 (UTC)


 * Re 1: Done. I just replaced all the tt tags with code tags... not sure what else you would have wanted?
 * Re 2: "Exceptions are protocol-insensitive (so the rule http://en.wikipedia.org will match https://en.wikipedia.org/wiki/Foo)", so this is not necessary and would break the rules since they are not regular expressions without the  prefix. —  Earwig   talk  18:53, 30 October 2014 (UTC)
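
For illustration, a minimal Python sketch of what "protocol-insensitive" matching of a plain (non-regex) rule could look like. This is an assumption about the behaviour described above, not the bot's actual code; the helper names `strip_scheme` and `plain_rule_matches` are hypothetical, and rules carrying the regex prefix mentioned above are not handled.

```python
import re

def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// (hypothetical helper)."""
    return re.sub(r"^https?://", "", url, flags=re.IGNORECASE)

def plain_rule_matches(rule: str, url: str) -> bool:
    """Treat the rule as a literal prefix and compare without the protocol."""
    return strip_scheme(url).lower().startswith(strip_scheme(rule).lower())

# The example quoted in the reply: an http:// rule against an https:// URL.
print(plain_rule_matches("http://en.wikipedia.org",
                         "https://en.wikipedia.org/wiki/Foo"))
```

A bare string prefix like this would also accept a host such as en.wikipedia.org.example.com, so a real implementation would probably compare the host name separately.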

common blacklist?
Hi! I noticed that you're maintaining a blacklist for EarwigBot and the copyvio tool on Labs, and that another one, User:EranBot/Copyright/Blacklist, is being maintained for User:EranBot, which User:ערן and User:Doc James have been working on lately. Would it be feasible to work from a common blacklist? I noticed a bunch of mirrors (covered by EranBot's blacklist) coming up when I tried out the copyvio tool.--ragesoss (talk) 00:20, 12 December 2014 (UTC)
 * So does the black list for EarwigBot use the same format and is it also a collection of mirrors of Wikipedia? Doc James  (talk · contribs · email) 01:10, 12 December 2014 (UTC)
 * User:Doc James: The format is a little different (the user page of this Talk page is it), but they are both essentially just lists of regexes, so it might be simple to modify EarwigBot to use the much larger list of mirrors we've compiled so far for EranBot. (I anticipate using the same mirror list for Wiki Ed's plagiarism prevention system.)--Sage (Wiki Ed) (talk) 01:14, 12 December 2014 (UTC)
 * This is a good idea; I wasn't aware there was another list of mirrors (and I should really watch this page more often!). I've created an issue for it which I'll handle soon. — Earwig   talk  18:02, 23 January 2015 (UTC)
 * Done now. The bot uses User:EranBot/Copyright/Blacklist too. — Earwig   talk  00:19, 27 January 2015 (UTC)

Add http://www.reference.com/
E.g. http://www.reference.com/browse/Ganesha -- Redtigerxyz Talk 10:52, 26 December 2014 (UTC)
 * Done. — Earwig   talk  18:02, 23 January 2015 (UTC)

+
Please modify quickiwiki.com/en to quickiwiki.com, as there is support for other language wikipedias. i.e. 1, 2

— Revi 06:49, 23 January 2015 (UTC)
 * Ping. — Revi 06:49, 23 January 2015 (UTC)
 * Done. — Earwig   talk  18:02, 23 January 2015 (UTC)

Add http://www.donehealth.com/
-- Redtigerxyz Talk 18:15, 7 February 2015 (UTC)

Add http://www.questpedia.org
Thanks! --AlessioMela (talk) 11:39, 2 March 2015 (UTC)

Edit request
Please add "url = http://*.gpo.gov". This is the United States Government Printing Office, which prints official versions of US documents; it's all freely licensed as US government publications. Thanks!  Kharkiv07 Talk  21:35, 3 April 2015 (UTC)
 * ✅ Rjd0060 (talk) 23:26, 25 April 2015 (UTC)
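
As a sketch of how a wildcard rule such as `http://*.gpo.gov` might be evaluated: the snippet below assumes that `*` stands for any run of characters in the host name, which is not confirmed anywhere on this page. The function names are hypothetical.

```python
import re
from urllib.parse import urlsplit

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Escape the pattern literally, then let each '*' match any characters."""
    return re.compile(re.escape(pattern).replace(r"\*", ".*"), re.IGNORECASE)

def host_matches(rule: str, url: str) -> bool:
    """Compare only the host part of the rule with the host of the URL."""
    rule_host = re.sub(r"^https?://", "", rule, flags=re.IGNORECASE).split("/")[0]
    host = urlsplit(url).hostname or ""
    return wildcard_to_regex(rule_host).fullmatch(host) is not None

print(host_matches("http://*.gpo.gov", "https://www.gpo.gov/"))  # subdomain
print(host_matches("http://*.gpo.gov", "https://gpo.gov/"))      # bare domain
```

Under this assumption the rule covers subdomains but not the bare domain, since the pattern requires a dot before `gpo.gov`; a list that needs both would carry two entries.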

Edit request
Please add "url = http://*.usgs.gov". As with gpo.gov, above, USGS is primarily works of the US government and public domain.

This article creation was flagged as a potential copyvio. TJRC (talk) 00:40, 26 June 2015 (UTC)
 * — Martin (MSGJ · talk) 10:14, 2 July 2015 (UTC)

New exclusions
Russian Wikipedia mirrors:
 * rfwiki.org
 * gruzdoff.ru
 * dic.academic.ru/dic.nsf/ruwiki/*
 * www.wikiwand.com

Thanks in advance! --Fastboy (talk) 11:10, 15 August 2015 (UTC)
 * ✅ Nakon  22:51, 15 August 2015 (UTC)

More exclusions
Freely licensed, available under CC 4.0 (https://creativecommons.org/licenses/by/4.0/deed.ru):
 * http://kremlin.ru/
 * http://premier.gov.ru/
 * http://mil.ru/

Clones: --Fastboy (talk) 10:23, 18 August 2015 (UTC)
 * http://википедия.орг.рф/
 * http://ru-wiki.org


 * ✅, thank you! — Earwig   talk  03:12, 19 August 2015 (UTC)

Excluding project Gutenberg
We should probably exclude gutenberg.org as a public domain source as well (as I've seen it show up in false positives). Kaldari (talk) 23:18, 24 August 2015 (UTC)
 * ✅; I added gutenberg.org itself only and no subdomains. — Earwig   talk  00:18, 25 August 2015 (UTC)

rfwiki.org
2 Exclusions. --Pessimist 10:00, 9 September 2015 (UTC)




 * Done. — Earwig   talk  19:26, 13 September 2015 (UTC)

Exclusion
http://wreferat.baza-referat.ru copies articles from Russian Wikipedia.--Arbnos (talk) 18:08, 15 September 2015 (UTC)


 * Done. — Earwig   talk  18:32, 15 September 2015 (UTC)

Add http://enzyklo.de
(Search engine) -- FriedhelmW (talk) 18:40, 1 October 2015 (UTC)
 * ✅ — Earwig   talk  20:25, 1 October 2015 (UTC)

Exclusion
http://ensiklopedia.ru/wiki/* copies articles from Russian Wikipedia.--Arbnos (talk) 00:02, 8 November 2015 (UTC)
 * Done. — Earwig   talk  00:28, 8 November 2015 (UTC)

Add http://encyclo.co.uk/
Thanks! -- FriedhelmW (talk) 19:01, 8 November 2015 (UTC)
 * Done. — Earwig   talk  21:36, 8 November 2015 (UTC)

Add http://research.omicsgroup.org/
This website seems to copy Wikipedia articles in the format hxxp://research.omicsgroup.org/index.php/ARTICLENAME. Example: http://research.omicsgroup.org/index.php/Statue_of_liberty for Statue of Liberty. epic genius (talk) 02:25, 12 November 2015 (UTC)
 * Done. — Earwig   talk  10:00, 12 November 2015 (UTC)

Add http://worldebooklibrary.net/articles/
Seems to be found a lot lately. Collect (talk) 23:24, 22 November 2015 (UTC)
 * Done. — Earwig   talk  10:23, 23 November 2015 (UTC)

Exclusion
http://wikigraff.ru/ copies articles from Russian Wikipedia.--Arbnos (talk) 11:08, 27 November 2015 (UTC)
 * Done. — Earwig   talk  14:53, 27 November 2015 (UTC)

Add http://nosmut.com/
This website seems to copy articles from Wikipedia... I think, although the home page is a bit weird. Anyway, http://nosmut.com/New_York_City_Subway.html, for example, duplicates New York City Subway. epic genius (talk) 04:25, 1 December 2015 (UTC)
 * Done. A strange site indeed. — Earwig   talk  04:29, 1 December 2015 (UTC)

richestcelebrities.org
Appears to use Wikipedia for some material - alas. Compare Death of Antonio Calvo to http://richestcelebrities.org/richest-actors/antonio-calco-net-worth/. Collect (talk) 16:18, 22 December 2015 (UTC)

Also add libreriauniversitaria.it for similar use of Wikipedia articles. Thanks. Collect (talk) 16:22, 22 December 2015 (UTC)


 * First is done. Can you give an example of the second copying Wikipedia content? I can't find any. —  Earwig   talk  21:52, 22 December 2015 (UTC)


 * http://www.libreriauniversitaria.it/tom-riall-betascript-publishing/book/9786133017016 (appears to use material from Betascript Publishing, a repackager of Wikipedia - perhaps the filter should simply look for "betascript"?)   should suffice. Collect (talk) 22:24, 22 December 2015 (UTC)


 * Done. — Earwig   talk  23:14, 22 December 2015 (UTC)

Excluding some sites
This is probably me not knowing how to use the Copyvio % tool. I am having difficulty with Our Lady of the Good Success; it contains material virtually identical to http://[add www.]fisheaters.com/forums/index.php?topic=3468895.0, but the latter site copies Wikipedia and puts, without comment, https://en.wikipedia.org/wiki/Nuestra_Se%C3%B1ora_del_Buen_Suceso_de_Para%C3%B1aque at the bottom. I tried to add the site to Mirrors and forks/Mno (section), but this was not accepted as fisheaters.com is blacklisted. So the question I ask, which is due either to me missing something obvious, a bit of documentation that needs adding, or even an actual limitation of the Copyvio tool: How can specific sites be excluded, based on user requirements?

I have recently started using this tool; a scenario I find is that a site which is copied from Wikipedia obviously comes up as the prime suspect, obviously shouldn't be considered, and masks true copyedits from other sites. I would like to be able to click a found site (without going through the procedure of adding it to a permanent list) and for it then to be ignored in subsequent searches. Is this possible somehow? Also, it might make sense to exclude from the search (perhaps controlled by an option) sites which are blacklisted by Wikipedia.

Apologies for taking your time with what is probably user inexperience of the tool, best wishes, and congratulations on a clever and much-needed tool, Pol098 (talk) 12:45, 24 December 2015 (UTC) P.S. I can't save this message with a valid link to fisheaters. com in it! Surely this should be OK on a Talk page?


 * This will happen frequently for pages that have an established history, where they are widely copied around the Internet. The tool works best on new pages. In this case, I don't usually like adding mirrors to this list that only mirror a single page, because then the list would become too large and unmaintainable. You should be able to simply ignore the mirror result and review any other matches the tool finds, using the direct compare option. — Earwig   talk  19:22, 24 December 2015 (UTC)
 * Thanks for the response. I take your point about new pages, which isn't what I've been looking at. I did add one site to your list; I got the impression that it had a lot of material from Wikipedia, not just a one-off (if I remember rightly, I was editing several articles); you may wish to remove it, in which case I apologise for the unwanted addition. As a comment (no need to respond): a page with a 95% match because it's a copy from Wikipedia is a nuisance, and can mask others. The comparison often reports to the effect that no more pages will be compared because there are already a lot of hits. What I would like to see, though it may not be of general use or sensible to implement, is a way to set up a one-off temporary list of sites not to be checked in this particular case, and/or a way to tag listed sites with matches so that they are excluded from a later run. I've found an awful lot of long-standing pages with many edits that are crammed with swathes of copied text. I haven't been seriously using the tool for long, and may be talking nonsense: if so, ignore. Best wishes, Pol098 (talk) 19:48, 24 December 2015 (UTC)

hearplanet.com
http://www.hearplanet.com/article/930747 uses Wikipedia as a source  Collect (talk) 09:39, 31 December 2015 (UTC)
 * Done. — Earwig   talk  09:47, 31 December 2015 (UTC)

Also note a bunch of users on youtube.com quote Wikipedia a lot - and as youtube is not a reliable source in any event - probably should be excluded. Collect (talk) 09:40, 31 December 2015 (UTC)
 * I don't think this is a good blanket exclusion. Being a reliable source is mostly irrelevant (ELPEREN aside); it's more about whether the site has the potential for people to copy from it, and I think the answer there is yes. You're probably right that the reverse is much more common, but I'd rather people need to wade through some false positives than get a false negative. — Earwig   talk  09:47, 31 December 2015 (UTC)

Wondering - could one simply add "wikipedia" to be a marker for not showing a result? Many of these do have "wikipedia" somewhere in their source code . Collect (talk) 09:43, 31 December 2015 (UTC)
 * Right now it automatically excludes pages that link back to Wikipedia—I skipped just a text search to avoid rare false negatives—but that might be worth looking into more... — Earwig   talk  09:47, 31 December 2015 (UTC)
 * Hold on, I noticed hearplanet.com does that already. That's a mistake... will take a look in the morning. — Earwig   talk  09:48, 31 December 2015 (UTC)
 * Long morning. Fixed now; auto-excluding pages that link directly back to the article being searched. Still a bit strict, but should help somewhat. — Earwig   talk  10:26, 15 January 2016 (UTC)
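
The "links directly back to the article being searched" check described above could be sketched roughly as follows. This is only an illustration using the standard-library HTML parser, not the tool's implementation; `LinkCollector`, `normalize`, and `links_back_to` are hypothetical names, and the normalization rules are assumptions.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

def normalize(url: str) -> str:
    """Ignore the protocol, percent-encoding, and a trailing slash."""
    url = re.sub(r"^https?://", "", url.strip(), flags=re.IGNORECASE)
    return unquote(url).rstrip("/")

def links_back_to(page_html: str, article_url: str) -> bool:
    """True if any link on the page points at the article being checked."""
    parser = LinkCollector()
    parser.feed(page_html)
    target = normalize(article_url)
    return any(normalize(href) == target for href in parser.hrefs)
```

A page that only links to the Wikipedia main page, or that mentions "Wikipedia" in plain text, would not be excluded by this check, which matches the "still a bit strict" caveat above.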

turnitin
Alas - gives all the false positives which this list tries to avoid - can its results be tweaked to avoid long lists of "99% of words copied"? In fact maybe the folks there should be given the suggestion that once "wikipedia" is referenced in the source, that it not be listed as a violation separately from the Wikipedia violation? Collect (talk) 16:47, 22 January 2016 (UTC)
 * I'll look into this; I really want to change the way the turnitin output is shown. Ideally we just use it as a source for URLs like the search engine, but the reason the WMF chose not to do that when submitting their patch is that many turnitin results are behind paywalls. — Earwig   talk  20:14, 22 January 2016 (UTC)

datab.us
http://datab.us/i/Wichita,%20Kansas  is very suspiciously like the Wikipedia article  (like about 100%)  - and gives no attribution. It does use commons images - meaning I have no doubt this is an unattributed copy of Wikipedia. Thanks. Collect (talk) 21:59, 4 February 2016 (UTC)
 * Added. — Earwig   talk  00:19, 5 February 2016 (UTC)

wikitree.com
In fact, can you do a general exclusion of all sites beginning with "wiki" at all? There appear to be a bunch of them, and it might save some time in the exclusion process. Thanks. Collect (talk) 13:06, 14 February 2016 (UTC)
 * I don't know. Would need to do further research on how often sites with "wiki" in their name are not mirroring Wikipedia, because it could reasonably happen, though because the tool reports what sites were excluded now it's less likely to be an issue. — Earwig   talk  20:34, 14 February 2016 (UTC)

my-definitions.com
Please add http://my-definitions.com/fr/definition/, which copies a lot of French WP articles (e.g. ans, analyse). Thank you --Framawiki (talk) 14:52, 1 April 2016 (UTC)
 * Done. — Ǝɐɹʍıƃ   ʇɐlʞ  21:15, 1 April 2016 (UTC)

Add http://fr.academic.ru/dic.nsf/frwiki/*
Can you add this website to the exclusion list: http://fr.academic.ru/dic.nsf/frwiki/* ? Thanks! --Bastenbas (talk) 13:39, 2 April 2016 (UTC)
 * Done. — Earwig   talk  15:14, 2 April 2016 (UTC)

Add http://lanimalchat.com/index.html
Hello, this website uses the content of French Wikipedia. --Bastenbas (talk) 12:41, 3 April 2016 (UTC)
 * Done. — Earwig   talk  19:46, 3 April 2016 (UTC)

Gutenberg.us
Uses "World Heritage Encyclopedia" which is Wikipedia as a source. Considered a "sham encyclopedia". Collect (talk) 22:54, 21 May 2016 (UTC)
 * Done. — Earwig   talk  04:22, 22 May 2016 (UTC)

livingnewdeal.org/projects/ariel-rios-federal-building-murals-washington-dc/
Uses Wikipedia - but as a URL which might not necessarily be caught in "source Notes". Is the new check for "Wikipedia" on a page going to pick these URLs up? Thanks! Collect (talk) 23:27, 21 May 2016 (UTC)
 * I prefer to leave these cases for manual review, but maybe I'll try something more greedy. — Earwig   talk  04:23, 22 May 2016 (UTC)

nekropole.info/en/Cary-Grant
Presumably uses a lot more - the phrase "Creative Commons" might actually be better than a simple look for "Wikipedia" on such sites, as this one only uses that phrase.

www.rusc.com/old-time-radio/Cary-Grant.aspx?t=256 actually credits Wikipedia. Collect (talk) 12:54, 2 June 2016 (UTC)

allstarpics.famousfix.com/pictures/chloe-madeley
Credits "en.wikipedia.org". Collect (talk) 14:45, 3 June 2016 (UTC)

Add youtube
Should youtube be added? It seems quite unlikely that an article would be copied from a youtube description or comment (but many Youtube videos, e.g. trailers and such, use Wikipedia articles in their descriptions) Intelligent  sium  13:26, 30 June 2016 (UTC)
 * I don't think so, per my comments above. Such a match generally warrants investigation. Exclusions are best suited for things that are always mirrors. — Earwig   talk  13:27, 30 June 2016 (UTC)

Special:Diff/802116377
Will you be able to resolve this false result in which the two links are copying the Wikipedia article? -- 1989 17:14, 24 September 2017 (UTC)

add wikimapia?
http://wikimapia.org/terms_reference.html — Preceding unsigned comment added by Sergkarman (talk • contribs) 16:46, 19 February 2018 (UTC)

https://kids.kiddle.co/Brooklyn_Navy_Yard
This website copied an earlier version of Brooklyn Navy Yard and actually credits Wikipedia. I'm not sure if there are other articles on the same site that copy from Wikipedia as well. epicgenius (talk) 18:30, 20 October 2018 (UTC)
 * Thanks, looks like a lot of WP-based articles, added. — Earwig   talk  18:39, 20 October 2018 (UTC)

Add https://www.govinfo.gov/
Can this be added to the exclusion list? — Preceding unsigned comment added by Pdxdoglover (talk • contribs) 00:03, 21 January 2019 (UTC)

Pdxdoglover (talk) 21:23, 21 January 2019 (UTC)

vk.com
It's a Russian Facebook; users freely copy information from Wikipedia, which skews the Copyvio rate. For example, when I wrote this article in August 2017 https://tools.wmflabs.org/copyvios/?lang=ru&project=wikipedia&title=Пудовкин,_Денис_Евгеньевич, the rate of confidence was 14.5%, but then it rose to 67% after some vk.com user quoted the article extensively in one of her posts in September 2018. Could you add vk.com to the exclusion list, please? Arbeite19 (talk) 12:07, 9 April 2019 (UTC)

https://www.toursandtravel.app/en/points-of-interests/new-york/brooklyn-bridge-park/85
I think the above page is almost entirely copy-pasted from an early version of Brooklyn Bridge Park without attribution. Compare the article version from 2015 and the above linked website. The only thing the other website did is to remove the "History" section. epicgenius (talk) 01:17, 12 July 2019 (UTC)
 * Yep, it's copying multiple articles. Added, thanks. — Earwig   talk  04:15, 12 July 2019 (UTC)

https://www.cruisebe.com/morningside-park-new-york-city-ny
This looks like it was sloppily copied from a previous version of Morningside Park (Manhattan). It even says at the bottom: Source: https://en.wikipedia.org/wiki/Morningside_Park_(New_York_City) epicgenius (talk) 20:09, 30 July 2019 (UTC)
 * Added, thanks! — Earwig   talk  00:48, 31 July 2019 (UTC)

http://www.nyc-architecture.com/HAR/HAR002.htm and related pages
This website has likely copied old versions of Wikipedia pages without attribution. For instance,
 * Cathedral of St. John the Divine - nyc-architecture, versus our article in 2007, comparison seen here. The nyc-architecture website copied the footnote number but removed any maintenance tags. There was a 98% match.
 * Grand Central Terminal - nyc-architecture versus our article in 2007, comparison seen here. The nyc-architecture website still has the reference numbers and "citation needed" tag. There was a 98% match (again).
 * St. Patrick's Cathedral (Manhattan) - nyc-architecture versus our article in 2007, comparison seen here. The nyc-architecture website still has the "citation needed" tag. There was a 98% match (again).
 * More examples of this sort can be found by Google search: https://www.google.com/search?q=%5Bcitation+needed+site%3Anyc-architecture.com&oq=%5Bcitation+needed+site%3Anyc-architecture.com&aqs=chrome..69i57.5797j0j1&sourceid=chrome&ie=UTF-8

It's very likely that the other website copied from Wikipedia, since very few other websites have a need to use the citation needed tag, and since there is such similarity between each of the pages from 2007. Granted, this website still has original content, but I am more concerned about the false positives from Wikipedia. epicgenius (talk) 05:40, 1 December 2019 (UTC)
 * Added; thanks for your investigation! — Earwig   talk  06:06, 1 December 2019 (UTC)
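
The heuristic used in this investigation (leftover "citation needed" tags as a sign that a page was copied from Wikipedia) can be sketched in a few lines of Python. The function name is hypothetical, and the check is only a hint for manual review, since it says nothing about which side copied first.

```python
import re

# Matches the rendered maintenance tag, tolerating extra whitespace.
CITATION_NEEDED = re.compile(r"\[\s*citation\s+needed\s*\]", re.IGNORECASE)

def count_citation_needed(page_text: str) -> int:
    """Count leftover '[citation needed]' tags in a page's visible text."""
    return len(CITATION_NEEDED.findall(page_text))

sample = "The terminal opened in 1913.[citation needed] It was later restored."
print(count_citation_needed(sample))
```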

http://worddisk.com/wiki/
This appears to be a mirror of Wikipedia without attribution: it even has our main page at http://worddisk.com/wiki/search Caeciliusinhorto-public (talk) 14:49, 16 January 2020 (UTC)
 * ✅ Darylgolden(talk) Ping when replying 04:21, 8 April 2020 (UTC)

Onlineradiobox
This site copied content from Udaya Geetham, and should therefore be excluded from EarwigBot. -- Kailash29792 (talk)  17:21, 9 February 2020 (UTC)
 * ❌ That site is copying from a copyrighted source, so neither should be excluded. CrowCaw 15:14, 30 April 2020 (UTC)

British and Irish Legal Information Institute (BAILII)
Earwig is flagging articles that quote from Irish Supreme Court case decisions hosted on BAILII. BAILII (here) and the Irish Courts Service (here) allow for direct quotation. British decisions can also be quoted. Could BAILII be removed from Earwig? AugusteBlanqui (talk) 13:42, 30 April 2020 (UTC)
 * Is there a good subdomain of the site for just the court decisions? In addition to allowing quotes of the court decisions, it also states: "The copyright in the text of legislation and judgments displayed on BAILII's website may belong to courts, other government bodies, judges, and/or to commercial publishers. BAILII cannot authorize any copying of such material." So if we whitelist the whole BAILII site we may miss catching some of those other cases. CrowCaw 14:10, 30 April 2020 (UTC)
 * This subdomain is safe to whitelist: https://www.bailii.org/ie/cases/IESC/  and this one:  https://www.bailii.org/ie/cases/IEHC/   Thanks!  AugusteBlanqui (talk) 14:21, 30 April 2020 (UTC)
 * Added. I note that their re-use policy just says "re-use", which is a little ambiguous. So to avoid any issues, please always quote the text (rather than incorporating it directly) and cite the web sites. But this should stop the Earwig matches. To other CopyPatrol users: this will not stop EranBot from flagging these, which is probably a good thing. CrowCaw 15:11, 30 April 2020 (UTC)
 * Thanks. We will cite/quote from BAILII.  A question, if you don't mind, on how subdomains work for Earwig.  So https://www.bailii.org/ie/cases/IESC/  is the landing page for Irish decisions.  All the Irish decisions have web addresses that start after the IESC, for example https://www.bailii.org/ie/cases/IESC/2007/S28.html .  Is that what it means to whitelist a subdomain?  The pages 'below' that IESC address are included (technology not my strongest domain).  AugusteBlanqui (talk) 16:43, 30 April 2020 (UTC)
 * Yes, that entry should whitelist everything after the trailing / in the URL. If it doesn't, let me know. Thanks! CrowCaw 16:47, 30 April 2020 (UTC)
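
To make the "everything after the trailing /" answer concrete, here is a small sketch that checks whether a URL sits at or below a whitelisted path. It is an assumption about the intended behaviour rather than the tool's code; `covered_by` is a hypothetical name, and the IEHC file name in the example is made up.

```python
from urllib.parse import urlsplit

def covered_by(rule: str, url: str) -> bool:
    """True if url is on the rule's host and at or below the rule's path."""
    r, u = urlsplit(rule), urlsplit(url)
    if (r.hostname or "").lower() != (u.hostname or "").lower():
        return False
    base = r.path.rstrip("/")
    # Compare on a path-segment boundary so /IESC does not cover /IESC-other.
    return u.path == base or u.path.startswith(base + "/")

rule = "https://www.bailii.org/ie/cases/IESC/"
print(covered_by(rule, "https://www.bailii.org/ie/cases/IESC/2007/S28.html"))
print(covered_by(rule, "https://www.bailii.org/ie/cases/IEHC/2010/H1.html"))
```

Read this way, each whitelisted path covers only its own subtree, which is why the IESC and IEHC addresses were listed as two separate entries.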

Historic American Engineering Record articles hosted on nycsubway.org
The following pages on nycsubway.org copy from the Historic American Engineering Record, a public domain source, and may bring up false positives.


 * https://www.nycsubway.org/wiki/The_Interborough_Subway_(Historic_American_Engineering_Record)
 * https://www.nycsubway.org/wiki/The_New_York_Rapid_Transit_Decision_of_1900_(Katz)
 * https://www.nycsubway.org/wiki/The_Impact_of_the_IRT_on_New_York_City_(Hood)
 * https://www.nycsubway.org/wiki/Design_and_Construction_of_the_IRT:_Civil_Engineering_(Scott)
 * https://www.nycsubway.org/wiki/Design_and_Construction_of_the_IRT:_Electrical_Engineering_(Kimmelman)
 * https://www.nycsubway.org/wiki/Architectural_Designs_for_New_York%27s_First_Subway_(Framberger)
 * https://www.nycsubway.org/wiki/Interborough_Rapid_Transit-Historic_American_Engineering_Record_Images

May I request that only these specific pages be added to the exclusion list? Epicgenius (talk) 18:24, 30 December 2020 (UTC)


 * ✅ — The Earwig   talk  19:15, 30 December 2020 (UTC)

Please add https://handwiki.org
It's a mirror site. Sudonet (talk) 09:13, 7 January 2021 (UTC)


 * ✅ —  The Earwig   talk 05:02, 8 January 2021 (UTC)

google-info.org
I thought I'd added google-info.org with this edit 12 April, but amp.en.google-info.org is still sullying the copyvio results (odd that the mighty corporate hand of Google hasn't yet come down to smite them). Should I have done something differently? BlackcurrantTea (talk) 16:19, 18 April 2021 (UTC)
 * I fixed this last week, but didn't notice you had brought it up here. Bug on my end. — The Earwig (talk) 02:24, 14 May 2021 (UTC)

Exclusion
http://wikiorg.ru/wiki/*, because it's a clone of Russian Wikipedia. 78.37.129.71 (talk) 19:43, 13 May 2021 (UTC)
 * ✅. — The Earwig (talk) 02:24, 14 May 2021 (UTC)

please add https://wordsimilarity.com
Could someone please add https://wordsimilarity.com/? It appears to be using Wikipedia directly, in, for example, https://wordsimilarity.com/en/avolition, messing with the Copyvio detector. Thanks!
 * EDIT: I also found https://eng.ichacha.net/zaoju/, which seems to be sourcing text from Wikipedia for at least some of its pages, as well as https://en.glosbe.com/, which seems to often source from something called "WikiMatrix" (I haven't really looked into it).

Yitz (talk) 18:33, 14 May 2021 (UTC)
 * ✅, added the three. — The Earwig alt (talk) 18:48, 14 May 2021 (UTC)

Add spellchecker.net
It's not a mirror, but it seems to scrape random chunks of text from articles. Cheers, Estheim (talk) 22:47, 19 September 2021 (UTC)


 * Added, thanks. — The Earwig (talk) 01:55, 20 September 2021 (UTC)

Mirrors and forks
Please add all of the WP mirrors listed under Mirrors and forks. Thank you Jamplevia (talk) 23:52, 2 November 2021 (UTC)


 * @Jamplevia, this is already done: read the first line of the exclusions page. If there's a particular mirror you're still getting results for, it may be getting parsed incorrectly by the tool; if so, please indicate which. — The Earwig (talk) 02:52, 3 November 2021 (UTC)

nina.az
Hi,

I've got a question about "url = http://wikipedia.*.nina.az/". The tool still includes URLs starting with wikipedia.de.nina.az (see for example Copyvios Gioia) but the regular expression is supposed to ensure that these URLs are excluded. I didn't add the regex, but can anybody figure out why it doesn't work? I've tried matching the regex and the URL with re.match in Python and there it worked. Thanks in advance. --CaroFraTyskland (talk) 09:30, 7 November 2021 (UTC)
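
The `re.match` test mentioned above can be reproduced as follows. The URL path is hypothetical, and nothing here establishes why the tool still lists these URLs; the snippet only shows one way a standalone test can differ from a live run, namely the URL scheme.

```python
import re

RULE = "http://wikipedia.*.nina.az/"          # rule text quoted in the question
HTTP_URL = "http://wikipedia.de.nina.az/Gioia.html"    # hypothetical path
HTTPS_URL = "https://wikipedia.de.nina.az/Gioia.html"  # same page over https

def strip_scheme(s: str) -> str:
    """Remove a leading http:// or https:// (hypothetical helper)."""
    return re.sub(r"^https?://", "", s, flags=re.IGNORECASE)

def rule_matches(rule: str, url: str) -> bool:
    """Apply the rule as a regex after dropping the scheme on both sides."""
    return re.match(strip_scheme(rule), strip_scheme(url)) is not None

print(re.match(RULE, HTTP_URL) is not None)   # raw regex against the http URL
print(re.match(RULE, HTTPS_URL) is not None)  # raw regex against the https URL
print(rule_matches(RULE, HTTPS_URL))          # scheme ignored on both sides
```

Note also that the dots in the rule are unescaped, so as a regex each one matches any character, not only a literal dot.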

Please add https://hmong.ru/
It's another mirror site. Thanks, SamWilson989 (talk) 23:12, 29 May 2022 (UTC)

And https://wiki2.net too 92.242.69.182 (talk) 19:01, 29 June 2022 (UTC)

Additions
If I spot sites that seem to be plagiarising Wikipedia without attribution should I just add them to the list and forget about them or should they be reported somewhere else for the Wikimedia Foundation to lean on?

For context, I was cleaning up Nick Weir and I found:
 * https://networthroll.com/blog2/nick-weir-net-worth-2/
 * https://celebsbirthdaytoday.com/nick-weir/

Both of these seem to be (badly) processed from old versions of our articles, possibly by an AI. Which category should those go in? They are not exactly mirrors. DanielRigal (talk) 15:21, 20 August 2022 (UTC)

Please add http://ikonysrebrnegoekranu.blogspot.com/
This blog (http://ikonysrebrnegoekranu.blogspot.com/2017/01/) contains a nearly plagiarized version of the article from Polish-language Wikipedia (https://pl.wikipedia.org/wiki/Popi%C3%B3%C5%82_i_diament_(film) ), which was expanded back in 2012/2013; meanwhile, Copyvio returns a false 96% plagiarism score on the Wikipedia side. Ironupiwada (talk) 12:10, 5 September 2022 (UTC)

Please add https://frwiki.wiki/
This is another mirror site of Wikipedia. Ironupiwada (talk) 12:21, 5 September 2022 (UTC)

Please add https://timenote.info/
Another fork of Wikipedia, it even mentions Wikipedia as a source, which leads to false positive Copyvio results (like here). Ironupiwada (talk) 12:58, 5 September 2022 (UTC)

Please add https://wiki.edu.vn/wiki25/
Fork, it mentions wikipedia as a source. Friniate talk 15:30, 12 October 2022 (UTC)

Please add latitude.to
Please add latitude.to to the exclusion list. It's not exactly a mirror, more of a Wikipedia link farm, but it still gave me a false positive. It's already on the link spam blacklist, as I found when I tried to add a link to this comment as an example :) Apocheir (talk) 03:13, 8 April 2023 (UTC)

Scrapes of wiki pages
It seems that many of the sites in the blacklist are there because they are scraping the wiki. As these appear and disappear frequently, I suspect this leads to a lot of maintenance workload. I'm wondering if there is not another solution to the problem that is based on back-testing?

The example that led me here is this one for toroidal solenoid:


 * Earwig's Copyvio Detector indicates a similarity to an article on Zeta (fusion reactor) in Hellenicaworld. Can we be certain that the Toroidal solenoid article predates the Hellenicaworld article? Nolabob (talk) 12:10, 28 July 2023 (UTC)

That Zeta article is a copy of the one I wrote here on the Wiki some years ago. My new article does indeed have bits in common with Zeta, and that is entirely deliberate, both are early UK fusion systems. The copyvio between the two pages here on the wiki is of course suppressed, but not the one with this 3rd party scrape.

It would seem that this could be avoided by testing to see if the external hit is a scrape. In this case, it would match to some very high degree, and thus be "likely a wikipedia scrape". This would require two matches on each possible hit, and I'm not sure what that would do to the performance, but I think it might avoid a lot of false positives? Maury Markowitz (talk) 17:58, 28 July 2023 (UTC)
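
A rough sketch of the back-test proposed above, assuming the caller already has the text of the external hit and of the Wikipedia articles it might have been scraped from (how those candidates are chosen is left open here). It uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever comparison the tool really performs; the function names and the 0.9 threshold are illustrative.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def likely_wikipedia_scrape(hit_text, candidate_articles, threshold=0.9):
    """Return the title of the first article the hit nearly duplicates, else None.

    candidate_articles maps article titles to their plain text.
    """
    for title, text in candidate_articles.items():
        if similarity(hit_text, text) >= threshold:
            return title
    return None
```

A hit flagged this way would be labelled "likely a Wikipedia scrape" instead of being counted against the article under review. The cost is one extra comparison per candidate article for every hit, which is the performance concern raised above.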


 * Good idea, perhaps at least having a leaderboard with the sorted list of domains by match would be a good start to discover easily new mirrors. Thanks, Framawiki (please notify me when you reply) 19:31, 28 July 2023 (UTC)
 * Oh, yes, that might be a great intermediate solution. Maury Markowitz (talk) 21:00, 28 July 2023 (UTC)
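
The leaderboard idea could start from something as simple as the sketch below, which groups past results by domain and sorts them by how often they matched. The input format (pairs of URL and confidence between 0 and 1), the function name, and the 0.5 cut-off are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def domain_leaderboard(results, min_confidence=0.5):
    """Rank domains by number of high-confidence matches.

    results is an iterable of (url, confidence) pairs.
    Returns a list of (domain, hits, mean_confidence), most hits first.
    """
    totals = defaultdict(lambda: [0, 0.0])  # domain -> [hits, summed confidence]
    for url, confidence in results:
        if confidence < min_confidence:
            continue
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        totals[host][0] += 1
        totals[host][1] += confidence
    board = [(d, hits, total / hits) for d, (hits, total) in totals.items()]
    board.sort(key=lambda row: (-row[1], -row[2]))
    return board
```

Domains near the top of such a list would still need manual review before being added, since a frequently matched site can also be a frequently copied source.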

Add https://www.populartimelines.com/
Says on the second search bar that it uses wikipedia as a source. I have also found this article https://medium.com/@populartimelines/timelines-of-famous-people-events-companies-and-more-726de9cb8950, but I'm unsure how reliable this website is. 2001:8F8:1123:D698:493A:CC2:EDDC:5AED (talk) 08:28, 5 November 2023 (UTC)


 * ✅. LittlePuppers (talk) 16:36, 5 November 2023 (UTC)

Add vintageisthenewold.com
Noticed here (https://copyvios.toolforge.org/?lang=en&project=wikipedia&title=Only_Up!) during a DYK nom. Seems to copy information verbatim from sites including Wikipedia. IceWelder [✉] 09:45, 14 November 2023 (UTC)


 * Some of their content is definitely copied from WP, but there's also a lot that is not; I'm a bit hesitant to put it on the list because I can't see a good way to isolate just the copied-from-Wikipedia pages. LittlePuppers (talk) 18:38, 14 November 2023 (UTC)
 * There is no original writing, so if it does contain something from a third-party source that happens to be infringed on, surely the tool would also find the original source? IceWelder [✉] 00:59, 15 November 2023 (UTC)