User:MLauba/Signpost definitions

Copyright and plagiarism: Intellectual Property issues on Wikipedia
On 31 October the featured article on the main page had to be replaced because it was found to contain content in violation of our copyright policies. This sparked a series of discussions and led to the resignation and retirement of an arbcom member involved (see The Signpost of 1 November 2010 for more details).

One thing that became clear from the event and following discussion is that many editors do not know the distinction between plagiarism and copyright violations on Wikipedia. Even worse, some editors are unaware of the policies altogether. As another recent case has shown, these editors can go on for years introducing plagiarism or copyright violations into Wikipedia completely undetected.

This dispatch aims to provide a clear overview of what plagiarism and copyright violations are and why the distinction between the two is important.

What is Copyright?
Copyright is, literally, the right to copy. If something is copyrighted, you are not allowed to copy it or modify it (such a modification is technically called a derivative work of the material) unless the copyright holder gives you permission. There is a long list of types of media subject to copyright, but for Wikipedia the most important ones are text, drawings/photographs, sound and video recordings. In most countries material is automatically copyrighted, unless explicitly stated otherwise. A requirement is that the work has enough "creative content", which is explained in more detail in the following section.

Facts and Figures vs Intellectual property
Most readers will be familiar with the expression "facts and figures cannot be copyrighted". While absolutely correct, this statement is also often misunderstood. A number, a place of birth, a date, and generic turns of phrase are indeed free for reuse at will. However, once these are put down in prose, there are rapidly many possible variations, not only in the wording but also in the choice of relevant events and their sequence: once put in shape by an author, the creative expression of facts and figures rapidly becomes unique and is very much subject to the protection of intellectual property laws and ethics.

Of course, there are certain turns of phrase that restrict creativity, and when that is clearly the case ("there are only so many ways to state that a person was born in a place at a specific time"), such protection does not apply. Typically, Wikipedia has strict style rules for many things, for instance how the first sentence of any article should read, and these limit our choices in the matter. But beyond that, what facts we present, how we present them and how we write them up is required to be original, unique and our own, or clearly attributed if they are not.

And what about ideas?
Ideas, like facts and figures, cannot be copyrighted either - that's why so many novels, dramas and stories have plots with similar structures. Ideas can be patented, in certain cases trademarked, but patents and trademarks are other aspects of intellectual property laws that don't intersect with Wikipedia too often.

Does this sound counter-intuitive? Read the idea for a story outlined below, and try to imagine what movie or book it applies to:

''A young anonymous person suddenly gets dragged into rapidly unfolding events. Under the guidance of a wise mentor, he joins others in a quest that will eventually change him and the world around him forever''.

No doubt, you have at least one story in mind, but this idea is the backbone of things as different as the myth of Theseus, The Lord of the Rings, The Matrix, The Patriot, Avatar, both Star Wars trilogies, Wanted, The Devil Wears Prada, Gremlins, and much, much more. All are built on the same idea with only small variations.

Copyright infringement
Copyright infringement is a matter of law: the reproduction of content generated by a third party without their permission, regardless of whether it is attributed or not. As Wikipedia's servers are located in the US, we are subject to US copyright laws.

Copyright case law is in constant flux, and the frontier between acceptable proportions of quoted material and infringement is decided by courts on a case-by-case basis. Recently, eight sentences out of a 30-sentence document were deemed fair use, but, conversely, courts have also found three sentences quoted out of a 300-page work to constitute infringement.

And copyright infringement isn't merely limited to verbatim copying - following the structure of a source too closely may also be deemed infringement. We can't solve the problem with a thesaurus or by adding or subtracting a few words.

There are many myths surrounding the use of third party material that lead our editors astray time and time again. Here are some of the more common ones:


 * 1) If there's no copyright notice, it's free / public domain
 * In the US, copyright notices have not been required since 1989. As a consequence, we must treat everything as copyrighted material unless it is explicitly marked as free or public domain.
 * 2) If it's on the web, it's free to reuse
 * Same as above. Unless it's explicitly marked as free or public domain, it is protected by copyright.
 * 3) We're giving publicity to the material; surely the author must find that desirable
 * Unless the author has authorized reuse, we can't make that assumption. We would need to write to them to verify permission.
 * 4) Wikipedia is non-commercial; we don't do any harm
 * We are re-licensing content for reuse, including commercial usage, without permission.
 * 5) I work for the copyright owner, I can use the material
 * Unless you have explicit permission / instructions from your employer, you may not be in a position to decide, and should follow WP:PERMISSION to establish that you are indeed authorized to post the material here.

All of these boil down to the same thing: we cannot use or give away content for which we have no permission.

Copyright violation (Wikipedia)
Wikipedia's copyright policy is a particularly narrow interpretation of current copyright law and practices. What we determine to be a copyright violation in Wikipedia (or copyvio) can often be seen as something trivial that no court of law would determine to be at issue - today. So why have such a restrictive policy? There are three reasons for that:
 * First, to future-proof the encyclopedia. Over the past decades, copyright law, in particular in the US, has become increasingly strict in favour of copyright holders and against re-users. By holding a hardline policy, we hope to avoid a situation where a few years from now, a new law passed by Congress would suddenly force us to take down a large portion of Wikipedia in order to remain compliant.
 * Second, as mentioned above, the "line" of copyright changes from case to case. We might dismiss a situation as so trivial that no court of law would determine it to be an issue only to find that we are wrong, and that a court of law has judged it to be infringement. By the time we find that out, the content could already have been spread far and wide through our reusers.
 * Finally, our requirements are particularly strict because our very mission is to create a free body of reference. Content that is not ours, even in only sparse quantities, goes against our mission.

Often, editors challenged over copyrighted material quoted in their contributions, even when it is properly attributed, will state that they cannot be experts in copyright law. And they are correct. Judging what constitutes a real copyright infringement is a matter for courts to decide. Fortunately, in order to contribute to Wikipedia, nobody needs that kind of expertise. As long as we adhere to our copyright policy, we remain well within what courts currently consider acceptable use.

Our copyright policy is a long body of text with lots of mumbo-jumbo, but it boils down to this: don't copy / paste content from elsewhere unless you can demonstrate that you have explicit permission to do so (see WP:PERMISSION for how to demonstrate it). If you are the owner of content, you can donate material you had previously published elsewhere (see WP:DCM on how to proceed). If neither applies, don't use the copy / paste function in your browser, but describe the facts and figures you collected in your own words. You can use quotations if you need to, but keep them brief! (See the next section for more.)

What about Fair Use?
Fair use is a limited exception granted by law that allows restricted re-use of copyrighted material without having to obtain permission. It mostly covers use of copyrighted material for the purpose of commentary or criticism of the original work, research and education, reporting and archiving. In a similar fashion to our copyright policy, Wikipedia defines and restricts fair use in the Non-Free Content policy, which is again more restrictive than current legal practice, and for the same reasons.

Under WP:NFC, brief verbatim excerpts of text are permitted, provided they are properly attributed and clearly identified as quotes. But the policy mostly concerns itself with non-text media where clearly defined rules are in place.

There are no hard and fast rules about how much quoted text can be safely accepted on Wikipedia. Suffice it to say: as little as possible. For instance, it is probably acceptable to quote two or three sentences from the conclusion of a movie review in an article's reception section. Conversely, it is not acceptable to build an article that consists of nothing but quotes from other sources, even if all of them are properly marked as such (it makes for a lousy article anyway). And similarly, if a text is only a few paragraphs long, quoting the whole of it will probably be marked as a copyvio and removed (as it's no longer a "brief excerpt").

Editors sometimes note that WP:NFC is more lax towards images than text. There is again a good reason for that. Images are extremely easy to remove or replace, when compared to text, in particular if the text has been paraphrased a little. Replacing a picture used under Fair Use, or more accurately, under our non-free content criteria, by a free image is a trivial operation. Replacing non-free text often requires time and editorial skills. And in some ways, policy is more liberal. Non-free images are not allowed outside of article space, but quotes may be used anywhere in Wikipedia so long as their use meets the guideline.

What is Close Paraphrasing?
Close paraphrasing is what happens when an attempt at summarizing a source produces text that remains highly similar to the original: changing a few words here and there, while retaining the overall structure of the original will result in a close paraphrase, a text that only differs from the source in a superficial manner.

A frequent cause of close paraphrase is the practice of copying some third party text into an article and then attempting to paraphrase from there, replacing some words with synonyms, shortening some sentences, shuffling some others around a bit. The volunteers reviewing entries on the copyright problems board will encounter such cases quite frequently. It is particularly telling that very often, editors who have generated a close paraphrase in that manner will be absolutely convinced that they did a good paraphrase, when reviewers will recognize the source clear as day.

The risk of close paraphrasing when starting from copied and pasted text is high, which is one of the reasons why admins will routinely delete any copy / pasted text even when the editor states that they "only copied it here for a little bit and plan to rewrite it". (Another reason is that such a rewrite would form an unauthorized derivative work, which is also a copyright issue, but a primary reason is that Wikipedia can't host copyrighted content without permission for any length of time. Every time we hit save, we are certifying that our writing is free for reuse under our licenses.)

In order to minimize the risk of close paraphrasing or avoid it altogether, editors are advised to work with multiple sources, read them, put them away, do something different for a while, and then come back and summarize the sources in their own words.

Of course, this isn't always fool-proof - recently a reviewer noticed an instance where a close paraphrase was built out of three independent sources. Upon closer inspection, it turned out that each of these was an obituary or short biographical sketch written by the same author, who, while writing three distinct pieces, nevertheless highlighted the same key moments of the subject's life.

Plagiarism
In the simplest terms, plagiarism is passing off someone else's work as one's own. It is an ethical issue: Wikipedia's own licenses for content, the GFDL and CC-BY-SA, both define liberal conditions for re-using the material we publish, but both rest on one fundamental condition: to give credit where credit is due. Any content that is reproduced from elsewhere without attribution constitutes plagiarism.

But plagiarism is quite simple to remedy: properly attributing the author of the content solves it.

As a free reference work that prohibits original research, Wikipedia stands on the shoulders of giants: the thinkers, scholars, academics, writers and reporters whose work we leverage to build our articles. We have an obligation to attribute those to whom we owe our information, not only for ethical reasons, but also because doing so satisfies our verifiability requirements and demonstrates that the ideas expressed aren't our own original thought.

Plagiarism is a black or white issue: either third party material is properly attributed or it is plagiarized. There are no shades of grey.

Wikipedia standards vs. Academia and other environments
Plagiarism is perceived quite differently in academia than on Wikipedia. In academia, plagiarism is a cardinal sin, and being accused of it is a serious charge. If the charge is verified, the perpetrator's academic reputation is often severely damaged as a consequence. On Wikipedia, while the charge gets leveled often (and is frequently confused with copyvio), it is easier to correct and need not result in a ruined reputation.

In academia, the outrage when plagiarism occurs is not just about the moral aspect of passing off someone else's work as one's own; it also arises because academia is focused on original research. What is expected of students and scholars is a demonstration of their own work: their research, their deduction, analysis, sometimes leaps of faith, their discoveries. Plagiarism is evidence of failure; the plagiarist is deemed too incompetent for their field of expertise and therefore incapable or unworthy of adding to the body of human knowledge.

Wikipedia is fundamentally different in the sense that contrary to academia, we absolutely prohibit original research. Being able to verify where information in an article comes from is crucial in order to establish credibility of our content, and as such, we compile the expertise of others, those who have been recognized in their field.

Plagiarism, that is, the verbatim copy or close paraphrase of other material without attribution (provided it is not also a copyvio at the same time), is fixed through attribution. Only the ethical aspect is really considered on Wikipedia - claiming the words, the wisdom and the wits of someone else as our own work is clearly not acceptable. But the added stigma found in academia does not exist. If you find content that is plagiarized, attribute it, unless the source is copyrighted. If that's the case, remove it or tag it.

Copyvios, plagiarism and WP:SYNTH / WP:NOR
When alerted to issues, many editors feel that the obligation to avoid plagiarism and copyright violations and the prohibition on original research and synthesis place them between a rock and a hard place. The apparent tension between those requirements is, however, only superficial. We prohibit original research, not original wording. And the synthesis prohibited by WP:SYNTH is the use of two sources to generate new, original conclusions from them, not summarizing what both say in different terms. The English Wikipedia is probably fortunate in the sense that the English language has a particularly rich vocabulary. Making a choice on what key facts to highlight out of several sources is the mark of a good encyclopedist, and making use of the richness of the English language while maintaining the original meaning is the mark of a solid writer. Attributing ideas, sourcing every quote, and finding the right balance between paraphrase and citation is the mark of a great editor.

As stated above, facts and figures cannot be copyrighted, but the selection of what facts to present and how we talk about them are unique, and it is that selection and that unique presentation that define, in the end, our purpose as an encyclopedia, a useful and free reference work, rather than a ragtag collection of material lifted from others. And while we slowly work our way toward identifying and fixing all the copyright problems and instances of plagiarism still dormant in our articles, we can nonetheless all strive to produce new content that fits that goal. And in case of doubt... there's always WT:CP to ask for a review.