User:Sphilbrick/Guide to copyright investigations

Intro
General discussion of topic (I had originally thought we needed a general discussion, but now that I see other relevant pages, such as Copyright problems/Advice for clerks, Copyright problems/Advice for admins, and Mirrors and forks, I'm leaning toward making this more of a pure case study list, which can be linked from those pages.)

Still should have an intro; this is a placeholder for the intro.

Overview
At first glance, this suspected copyvio looks clear-cut. An excerpt from the Wikipedia article is shown on the left, and the closest counterpart from the Official web page is shown on the right. Bold indicates identical content.

However, there are some red flags, indicators that are often not present in typical copyvio situations. None of these are proof one way or the other, but they suggest caution, and indicate that more investigation is warranted.


 * 1) The first edit to the Wikipedia article is in 2006. Many copyvios are detected within minutes of creation. When an article is only a few minutes or even hours old, it is highly unlikely that another internet site with identical wording has been created using the Wikipedia article as a source. An article dating from 2006 has been available long enough for another site to have copied it.
 * 2) The editor creating the article has over 1000 edits. Many copyvios are created by people who are simply new to Wikipedia, and either don't know the Wikipedia rules about copyright, or don't know the general rules about copyright, or both. While experienced editors can make mistakes, it is unusual to find an editor with this many edits involved in a copyright violation. (Caution: exception in next point.)
 * 3) The editor has no examples of copyright violations noted on the editor's talk page. Some editors don't understand the copyright rules (or don't care) and can amass a fair number of edits before being blocked indefinitely. These editors may have an edit count into the hundreds, but they usually will have a number of examples of prior problems on their talk page.
 * 4) The copyright date on the Official Web Page is 2007. One does not have to add a copyright notice to assert copyright, but many sites add a copyright date to help ensure that there is no confusion. It is not uncommon to include the year the material was first created, then add a new year as each year passes. The existence of an entry of 2007 does not prove that copyright was first asserted in 2007, but it is worth checking, and it is interesting that 2007 is later than the date of the first Wikipedia entry.
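
The four indicators above can be written down as a rough checklist. The sketch below is only an illustration: the function name, the parameter names, and both thresholds (30 days, 1000 edits) are assumptions chosen for this example, not Wikipedia policy, and the article age used at the end is a made-up value. A non-empty result means "investigate further", not "no violation".

```python
def red_flags(article_age_days, creator_edit_count, prior_copyvio_warnings,
              first_edit_year, site_copyright_year,
              old_article_days=30, experienced_edits=1000):
    """Return the caution indicators that apply to a suspected copyvio.

    The two threshold defaults are illustrative assumptions.
    """
    flags = []
    if article_age_days >= old_article_days:
        flags.append("article is old enough for another site to have copied it")
    if creator_edit_count >= experienced_edits:
        flags.append("article creator is an experienced editor")
    if prior_copyvio_warnings == 0:
        flags.append("no copyright problems noted on the creator's talk page")
    if site_copyright_year is not None and site_copyright_year > first_edit_year:
        flags.append("site copyright year is later than the first Wikipedia edit")
    return flags


# Values from the case study, except article_age_days, which is hypothetical.
case_flags = red_flags(article_age_days=400, creator_edit_count=1001,
                       prior_copyvio_warnings=0,
                       first_edit_year=2006, site_copyright_year=2007)
for flag in case_flags:
    print("-", flag)
```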

Calling on the Wayback Machine
We enter the url of the current Official web site biography into the Internet Archive tool.

Result: http://wayback.archive.org/web/*/http://www.singingcookes.com/biography.html

The graphic at the top of the page indicates the dates for which there is an archive of that page. Clicking on a year brings up a calendar indicating the days in that year for which there is an archive. We first notice that, despite the 2007 date on the bio, there are archives back to 2001. We select one of the early ones and confirm that there was a bio of the group in 2001, but interestingly, the wording is quite different from the current version, and hence different from the Wikipedia version.
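
The same per-year overview can be tallied with a script instead of read off the calendar. This is a minimal sketch, assuming the Internet Archive's CDX search endpoint and its JSON output, in which the first row is a header naming the columns. The function names are made up for this example, no request is actually sent, and the sample rows are illustrative: only the 2003 and 2004 dates come from this case study, while the 2001 and 2007 days are invented.

```python
from collections import Counter
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX search service.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url, from_year, to_year):
    """Build a CDX query listing captures of page_url between two years."""
    params = {
        "url": page_url,
        "output": "json",
        "fl": "timestamp,statuscode",  # only the columns we need
        "from": str(from_year),
        "to": str(to_year),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def captures_per_year(rows):
    """Count captures per year from CDX JSON rows (first row is the header)."""
    if not rows:
        return Counter()
    header, data = rows[0], rows[1:]
    ts = header.index("timestamp")
    # Timestamps look like YYYYMMDDhhmmss; the first four characters are the year.
    return Counter(row[ts][:4] for row in data)


query = cdx_query_url("http://www.singingcookes.com/biography.html", 2001, 2007)

# Illustrative rows in the shape the JSON output is assumed to have.
sample_rows = [
    ["timestamp", "statuscode"],
    ["20010405000000", "200"],  # invented day
    ["20030525000000", "200"],
    ["20040423000000", "200"],
    ["20070102000000", "200"],  # invented day
]
per_year = captures_per_year(sample_rows)
```

A year that is missing from `per_year` (a `Counter` returns 0 for it) corresponds to an empty year in the calendar graphic.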

We would like to find a date close to, but prior to, the Wikipedia version. Unfortunately, there seems to be a gap, with no entries in 2005 or 2006. The latest entry prior to 2006 is on 23 April 2004. However, that entry brings up a black screen. Working backwards in time until we find a valid entry, we get to 25 May 2003. That entry is different from the original, but does not closely resemble the 2007 version. We can be sure that the words were materially edited after May 2003, but we do not have a clear indication whether the web site version or the Wikipedia version came first.
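
The "work backwards until we find a valid entry" step can be sketched as a small helper. The function name and parameters are hypothetical, and the cutoff of 1 January 2006 is an assumption standing in for "before the 2006 Wikipedia article". Whether a capture is usable (rather than a black screen) still has to be judged by opening it, so the unusable dates are passed in by hand.

```python
from datetime import date


def latest_usable_before(captures, cutoff, unusable=frozenset()):
    """Return the most recent capture date strictly before cutoff,
    skipping captures already found to be unusable; None if there is none."""
    candidates = [d for d in captures if d < cutoff and d not in unusable]
    return max(candidates) if candidates else None


# The two pre-2006 capture dates mentioned in this case study.
captures = [date(2003, 5, 25), date(2004, 4, 23)]
# The 23 April 2004 capture shows only a black screen.
blank = {date(2004, 4, 23)}

best = latest_usable_before(captures, date(2006, 1, 1), unusable=blank)
print(best)  # 2003-05-25
```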

A gap of that length is unusual for the Archive. One possibility is that the subpage had a different name in the intervening period. We use the Wayback Machine again, but this time we start with the home page of the website: http://www.singingcookes.com/index.php. That gives us different results, but nothing prior to 2007. However, the "php" extension is a hint. Many websites have converted to PHP over time, so perhaps this one was not using that format in the earlier years. Trying http://www.singingcookes.com/index.com doesn't help, but the Archive helpfully suggests looking for all pages under http://www.singingcookes.com/. Searching that list for June 2006 shows that the home page was http://www.singingcookes.com/home.htm at the time, so let's change the extension in our search for the bio from html to htm. That still fails, but if we tinker around, we might end up trying http://www.singingcookes.com/bio.htm. That search hits pay dirt: there are eight captures of this page in 2005 and 2006, the very period in question. We look at the page as captured on 12 June 2006 and see wording virtually identical to that in the Wikipedia page, but preceding it by a month.
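
The tinkering described above amounts to trying combinations of plausible page names and extensions, or asking the Archive for everything under the site root. This is a rough sketch of both ideas; the function names and the lists of stems and extensions are assumptions for illustration, and the prefix query assumes the CDX search endpoint accepts a `matchType=prefix` parameter. No request is sent here.

```python
from itertools import product
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX search service.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def candidate_urls(site_root, stems, extensions):
    """List every stem.extension combination under the site root."""
    root = site_root.rstrip("/")
    return [f"{root}/{stem}.{ext}" for stem, ext in product(stems, extensions)]


def prefix_query_url(site_root, from_year, to_year):
    """Build a CDX query for all captured URLs that start with site_root."""
    params = {
        "url": site_root,
        "matchType": "prefix",
        "output": "json",
        "fl": "original,timestamp",
        "from": str(from_year),
        "to": str(to_year),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


# Guesses to try by hand in the Wayback Machine, one after another.
guesses = candidate_urls("http://www.singingcookes.com/",
                         stems=["biography", "bio"],
                         extensions=["html", "htm"])
for url in guesses:
    print(url)

everything_2005_2006 = prefix_query_url("http://www.singingcookes.com/", 2005, 2006)
```

The prefix listing is usually the faster route, since it shows the file names the site actually used in the period rather than relying on guesses.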

Our original call was right, but it took some detective work to confirm that the Wikipedia page followed the web site. We want to be ultra-cautious about copyright so that we do not violate it, but we also do not want to mistakenly accuse a Wikipedia editor of violating the rules if that hasn't happened. There are many examples where websites copy Wikipedia material and call it their own. This is not one of those cases, so we can remove the copyrighted material and decide whether the remaining article can be saved.