User:GreenLipstickLesbian/Spotting copyright violations

Ever feel like you've been adding too much content to Wikipedia? Has your doctor told you your blood pressure is looking a bit low? Want to cross paths with a long-standing contributor, but without a trip to WP:AE? Try finding copyvios!

But wait- you're telling me that that seems difficult, and complicated? And how would you even start? Fear not. Here is the unofficial, unprofessional, overly verbose, Green Lipstick Lesbian approved guide. Enjoy your newfound lack of free time.

This essay assumes you've read Text copyright violations 101 and have the common sense to figure out when a source is copying from Wikipedia.

Why are you looking for copyright violations, anyway?
No, but seriously- why?


 * You're a New Page Patroller, an AFC reviewer, or a DYK/GA/FA reviewer: Looking for copyright violations is a mandatory step in all of these processes. Or, at least, it should be.
 * You're seeing if you need to ask for a CCI to be opened on somebody: Typically, you've found that a somewhat established user has entered copyright-violating text into an article, and you're trying to figure out if it was an accident, a one-off mistake, or indicative of a long-term problem.
 * You're actually working through a CCI: at this stage, a pattern of copyright infringement will have been established, so it's much more about "finding where the text comes from" than anything else.
 * You're suspicious of an article: Sometimes you read an article, and something just feels "off". I tend to spot-check these and run Earwig on them (only checking the links in the page), and see if anything rears its ugly head. If you're lucky, you find a paragraph of blatant infringement added by a helpful IP or <10 edits user from a few years ago, and you remove the text and everybody is happy again. If you're less lucky, you've found a contributor with 10k+ edits copying material out of books.

General purpose tips
No matter which type of copyvio you're looking at, these tips tend to work.


 * Quality of prose that doesn't line up with the contributor's writing: Please don't confuse this with switching registers. Lots of editors, myself included, use an informal register on our own userpages or on talk pages, but switch to a formal register for writing. That being said, somebody producing nearly-incomprehensible text on their talk page while contributing detailed analysis of War and Peace in article space? That's something worth investigating.
 * Please be kind to people you find doing this. While they may, ultimately, need to be blocked from Wikipedia (or at least, the article space), they're genuinely trying to help. These contributors often don't speak English as a first language, and may be focused much more on getting their sentences "right" than their actual "writing." These contributors can often be persuaded into helping out on the more technical side of Wikipedia or Wikidata, at least until they're more confident in their own writing skills.
 * Alternatively, they genuinely don't understand that you can't copy and paste from external sources into Wikipedia. These contributors are often teenagers or children who haven't learnt about plagiarism yet, or they're adults who can't understand. Oftentimes, once blocked, the second group will sock extensively. It's frustrating- but they're just trying to help build the encyclopaedia. Insults and extensive commentary about their lack of competence or mental state are unhelpful. They won't understand why you're being mean to them- and likely never will.
 * Straight up un-encyclopaedic text: Text which violates our policies on original research or text that's overtly promotional can't be kept anyway, but you'd be surprised by how often this stuff is straight-up copied. Lazy PR people don't tend to be creative when it comes to filling out their client's Wikipedia page. Who'd've thunk it?
 * Spinner Word Salad: Back in my day, before we had all this ChatGPT nonsense, plagiarists just had to paste their work into a spinner and hope the computer changed enough nouns to evade detection. If somebody shows up talking about Dr. Condo, investigate.

Unattributed copying within Wikipedia
Probably the most common- people don't understand the licensing requirements, and even when called out, a lot of people don't take it that seriously. Solving these is easy- you just follow the steps at RIA, slap a tag on the talk page and place a carefully-worded note on the contributor's talk. Here are the tells I've found which make spotting these easier; they work for finding unattributed translations from other Wikipedias as well:

 * Maintenance tags older than the article/material: Yep. Was the article created in May, 2024? Does the maintenance tag stating "failed verification" also date to May 2024, or does it date to December, 2007?
 * Source access dates older than the article/material: If the source access date says it was retrieved in June 2015, then why was it only added to the article in April 2019?
 * Undefined refnames: Why is the material cited to <ref name="Smith2013" /> when there's no Smith2013 reference defined? This happens a lot when somebody splits an article incorrectly. Politely let them know that they need to follow the instructions at WP:PROPERSPLIT.
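
If you'd rather let a script do the squinting, the three tells above can be roughly sketched in Python. This is a minimal sketch, not a real tool: it assumes you have pasted the wikitext into a string yourself, that dates are written as "Month YYYY" or "D Month YYYY", and that refs look like plain <ref name="..."> tags. The function names and the short list of maintenance templates are made up for illustration.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_RE = "|".join(MONTHS)

# A handful of maintenance templates; illustrative only, nowhere near complete.
TAG_DATE = re.compile(
    r"\{\{\s*(?:citation needed|failed verification|clarify|dead link)"
    r"[^{}]*?\|\s*date\s*=\s*(" + MONTH_RE + r")\s+(\d{4})",
    re.IGNORECASE)
ACCESS_DATE = re.compile(
    r"\|\s*access-?date\s*=\s*(?:\d{1,2}\s+)?(" + MONTH_RE + r")\s+(\d{4})",
    re.IGNORECASE)

# Self-closing uses (<ref name="x" />) versus definitions (<ref name="x">...).
REF_USE = re.compile(r'<ref\s+name\s*=\s*"?([^"/>]+?)"?\s*/>')
REF_DEF = re.compile(r'<ref\s+name\s*=\s*"?([^"/>]+?)"?\s*>')


def stale_dates(wikitext, added_year, added_month):
    """List tag dates and access dates older than the month the text was added."""
    added = (added_year, added_month)
    hits = []
    for kind, pattern in (("maintenance tag", TAG_DATE),
                          ("access date", ACCESS_DATE)):
        for m in pattern.finditer(wikitext):
            found = (int(m.group(2)), MONTHS.index(m.group(1).capitalize()) + 1)
            if found < added:
                hits.append((kind, found[0], found[1]))
    return hits


def undefined_refnames(wikitext):
    """List refnames that are used but never defined in this wikitext."""
    used = set(REF_USE.findall(wikitext))
    defined = set(REF_DEF.findall(wikitext))
    return sorted(used - defined)


sample = (
    'Mars has two moons.{{failed verification|date=December 2007}} '
    'It is red.<ref name="Smith2013" /> '
    'It is cold.<ref name="Jones2010">'
    '{{cite web|title=Cold|access-date=12 June 2015}}</ref>'
)
print(stale_dates(sample, 2019, 4))   # text was added in April 2019
print(undefined_refnames(sample))
```

A hit is only a reason to go and look at the page history, not proof of anything. Refs with extra attributes (such as a group) and other date formats will slip past these patterns.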

Finding the source of these is a different matter. Here are a few tips:


 * Check the user's contributions around the time they inserted suspect text: If they added 10,000 bytes to a new article called "Cuisine of Mars" five minutes before removing 10,000 bytes from "Martian culture", then you've probably found your source.
 * Click on the images and see what else links to them: Also works for transvios from other Wikis
 * Look at old Wikipedia mirrors: Wikipedia mirrors don't update as quickly as Wikipedia. Oftentimes, googling a few sentences will bring you to a mirror of the old article, complete with the name.
 * Search the source names: Especially try rarer URLs, authors, and titles. Even if the material has been removed from the source article, the reference might have been left behind.
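
The contributions tip above boils down to "look for a removal that mirrors the addition". Here is a rough sketch of that comparison, assuming you have already copied the relevant edits into a list by hand. The Edit record, the likely_sources function, the 30-minute window and the 20% size tolerance are all hypothetical choices of mine, not anything Wikipedia provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Edit:
    title: str
    timestamp: datetime
    size_delta: int  # bytes added (positive) or removed (negative)


def likely_sources(edits, suspect_title,
                   window=timedelta(minutes=30), tolerance=0.2):
    """Find removals elsewhere that roughly mirror additions to suspect_title."""
    additions = [e for e in edits
                 if e.title == suspect_title and e.size_delta > 0]
    matches = []
    for add in additions:
        for other in edits:
            # Only removals from a different page are interesting here.
            if other.title == suspect_title or other.size_delta >= 0:
                continue
            close_in_time = abs(other.timestamp - add.timestamp) <= window
            similar_size = (abs(add.size_delta + other.size_delta)
                            <= tolerance * add.size_delta)
            if close_in_time and similar_size:
                matches.append((other.title, other.timestamp, other.size_delta))
    return matches


edits = [
    Edit("Cuisine of Mars", datetime(2024, 5, 1, 12, 0), 10_000),
    Edit("Martian culture", datetime(2024, 5, 1, 12, 5), -10_000),
    Edit("Phobos", datetime(2024, 5, 1, 12, 10), -40),
]
print(likely_sources(edits, "Cuisine of Mars"))
```

This only catches splits and moves, where the text was taken out of the old page. Somebody who copies a section and leaves the original in place won't show up, so the other tips still apply.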

Webvios- aka, blatant copy and paste jobs
Does Earwig return a large section of red linked to a cited source? Upon clicking on a PDF, do your eyes tell you that the material has been copied? Well then.

Sometimes, editors who have copy-pasted material don't cite a source. If you copy a sentence or two and search for it, you can often find the source pretty easily. If they've added a whole paragraph, take a sentence from right in the middle or the end. I always pick sentences from the middle or end, whether I'm trying to clear the material or prove it's a copyvio. People tend to modify the first sentences, but leave the later parts of the text unmodified.
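
For the lazy, here is a minimal sketch of the "pick from the middle or end" habit: it splits a paragraph into sentences, skips the first one, and wraps a middle sentence and the last sentence in quotation marks so they can be pasted into a search engine as exact phrases. The function name and the six-word minimum are my own assumptions, and the sentence splitting is deliberately naive.

```python
import re


def search_phrases(paragraph, min_words=6):
    """Return a middle and a final sentence, quoted for exact-phrase searching."""
    # Naive split on ., ! or ? followed by whitespace; abbreviations such as
    # "Dr." will be split in the wrong place.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", paragraph.strip())
                 if s.strip()]
    # Skip the opening sentence when there is anything else to choose from,
    # since that is the one most likely to have been reworded.
    tail = sentences[1:] if len(sentences) > 1 else sentences
    # Very short sentences match too many unrelated pages to be useful.
    usable = [s for s in tail if len(s.split()) >= min_words]
    if not usable:
        return []
    picks = [usable[len(usable) // 2], usable[-1]]
    # dict.fromkeys drops a duplicate pick while keeping the order.
    return [f'"{s}"' for s in dict.fromkeys(picks)]


paragraph = (
    "The harbour was founded in 1820 by local merchants. "
    "It grew quickly after the railway arrived in 1851. "
    "By the end of the century it handled most of the region's grain exports. "
    "The original warehouses were demolished in 1968 to make way for housing."
)
for phrase in search_phrases(paragraph):
    print(phrase)
```

If the exact phrase returns nothing, try a shorter fragment built around a proper noun or a date before concluding the text is clean.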

If an editor cites a source but you're still suspicious (most often because the source doesn't support the material they added), the technique described above works well. It's very rare, but sometimes somebody will copy-paste from a source, then cite a different one. Whether it's a good-faith attempt to find a more reliable source, or a bad-faith attempt to hide what they're really doing, it requires further investigation.

Bookvios
Ahh, bookvios. Ultimately, the hardest group to find and clear up. They're generally only picked up by accident by somebody who is reading the book for some other reason. The good news is that contributors adding bookvios tend to be the most meticulous about citing their sources- so once you get your hands on the book, you can confirm or deny an issue with reasonable certainty. There are a few tells, however.

 * High-quality text that reads like original research: Typically, contributors who are capable of writing really well know enough about Wikipedia's content policies to know that original research isn't allowed. Authors, however, are paid to write high-quality original research.
 * Weird formatting: Ever seen a book where the line break falls mid-word, so, to save paper, the publisher chops the word in two and slaps a hyphen at the end of the first half? I don't know the official name for this practice. I do know that if somebody produces a line of text saying "The octopus is an example of a mammal. It lives in the mid-dle of the ocean, and is very smart" then the sentence was copy-and-pasted from a PDF copy of a book. Ditto any other special line-break marks- when uploading a document, a lot of pirate and archive sites use a line break in place of the hyphen, or alongside it, and people sometimes forget to remove these when copy-pasting into a Wikipedia article.
 * In-text attribution to a source not cited: Sometimes contributors may cite something via another source if they can't access the original for whatever reason. Sometimes they're copying from somebody who cited the original source.

But once our suspicions have been raised, how do we confirm whether or not there's an issue?

 * Look through the Internet Archive: Does Internet Archive have the book? If you can, check it out and look at the page number the material is cited to. If you can't, search for proper nouns or unusual verbs/adjectives, and see if you find anything similar in the book's previews.
 * Look through Google Books: Google Books often locks you out of most of a page, but by searching individual sentences or sentence fragments, you can sometimes see enough to confirm your suspicions. Again, centre your searches on proper nouns, dates, numbers, and unusual adjectives/verbs.
 * Ask somebody with the book: You can find these people on REX, by searching the title of the book with results filtered to "site:wikipedia.org" and seeing who else has cited it, or by looking at your local library.
 * Do not search the book title + PDF: Do not. Absolutely, do not do this. Especially don't filter your results to "site:reddit.com". This is totally a bad idea.
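
The "weird formatting" tell from the list of bookvio tells is one of the few that a script can look for. A minimal sketch, assuming you supply your own word list: it flags a hyphenated token when gluing the halves together makes a known word but the halves are not both words themselves. The function name and the tiny vocabulary here are placeholders; a real check would need a proper dictionary file.

```python
import re

# Letters, a hyphen, optional whitespace (a leftover line break), more letters.
HYPHENATED = re.compile(r"\b([A-Za-z]+)-\s*([a-z]+)\b")


def broken_words(text, vocabulary):
    """Return hyphenated tokens that look like one word split at a line break."""
    hits = []
    for m in HYPHENATED.finditer(text):
        left, right = m.group(1).lower(), m.group(2).lower()
        joined = left + right
        halves_are_words = left in vocabulary and right in vocabulary
        # "mid-dle" joins into "middle", and "dle" is not a word: suspicious.
        # "well-known" does not join into a word, so it is left alone.
        if joined in vocabulary and not halves_are_words:
            hits.append(m.group(0))
    return hits


# Placeholder vocabulary, just big enough for the example.
vocabulary = {"middle", "mid", "well", "known", "ocean", "octopus", "smart"}
text = ("The octopus is an example of a mammal. It lives in the mid-dle "
        "of the ocean, and is very smart. It is a well-known animal.")
print(broken_words(text, vocabulary))
```

Splits where both halves happen to be real words (for example "with-out") are skipped on purpose to avoid flagging ordinary compounds, so treat an empty result as "nothing obvious" and not as a clean bill of health.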

From other Wikipedias
The good news is that most transvios tend to be from other Wikipedias, meaning clean-up is relatively painless and you have your source document linked in the side-bar. The bad news? Other Wikipedias tend to have different standards than enWiki, especially when it comes to notability, formatting, referencing, etc. You're going to have to make some tough calls here. To spot these, see the CWW section above. Most of the rules apply. That being said, here are some foreign Wiki quirks I've noticed. When I see a new enWiki article falling into one of the below patterns, I tend to double-check.


 * deWiki: The German Wikipedia doesn't like citation needed tags, and their solution has been to deprecate in-line references. Okay, not really, but deWiki articles tend to just put a reference list at the end of an article. The community expects that your information will be backed up by those sources. You can read more about deWiki quirks at the essay Translating German Wikipedia.
 * esWiki: Does the article contain a lot of run-on sentences? Does it make me look like a concise speaker? Is the language a little bit dramatic? The first two are hallmarks of the Spanish language, the last one is a quirk unique to the more forgotten esWiki articles.
 * jaWiki: Modern-day celebrity articles tend to end up as a bullet-point list of the celebrity's likes, dislikes, and other trivia that enWiki labels as "cruft". This isn't true for non-celebrity articles. That being said, sometimes jaWiki takes a more "public service" attitude towards businesses, going more in depth about location, openings, exhibits, etc. Often with bullet points. jaWiki likes their bullet points.
 * ruWiki: ruWiki doesn't believe in splitting pages for readability. Very long, well written pages that make your computer die inside? Check if the ruWiki article was translated.

From other websites
The other most common type is somebody copying a website into Google Translate, then pasting the garbled result into a Wikipedia article. Earwig won't see these. CopyPatrol won't see these. An individual editor will see these while manually double-checking the sources themselves, often as part of a CCI pre-investigation, a DYK check, or a GA review. The editor will then say a few choice swear words, tag the article for a revdel or G12, then have the tag removed by some well-meaning other editor with the rationale "the other source isn't even in English?"


 * Nonsense words: Google Translate isn't perfect. It has a really bad habit of translating the Spanish espesor as "diameter thickness", for example. If there's a random word or series of words that seems like Google gibberish, that's a bad sign.
 * Some users will write original prose in their native language, then machine translate it to English. That's still problematic, especially since they didn't catch the fact it produced gibberish, but it's not a copyright violation and should not be treated as such.
 * To find the original language, look at the article's subject and make an educated guess. Is it about a Spanish church? Then the original text was most likely in Spanish, Catalan, or Portuguese. Is it about a K-pop idol? Maybe start looking at Korean or Japanese language sources, or throwing a few sentences into Google Translate and seeing what websites it returns.
 * Literally translated names and noun phrases: In modern-day English, we tend to just transliterate foreign names. For example, we don't translate Bahía Concepción literally to "Bay of Conception". However, Google Translate will. Now, somebody could have just put the noun through Google Translate while trying to come up with an English name. But, again, it's not a good sign.

From foreign-language books
The rarest type is somebody reading a foreign language book, translating it to English, and copying or closely paraphrasing the results in Wikipedia. These are almost never found, and if they are it's purely by accident. When the community establishes that a contributor has been doing this, it's very bad. Don't tackle this type of copyvio on your own- bring it up on the Copyright problems or the Contributor copyright investigations board. Try and make sure the pained screams you're now making don't wake up your neighbors.