Wikipedia talk:Contributor copyright investigations/Darius Dhlomo

Presumptive removal?
Should any prose that smells of copyvio be presumptively removed? I have already found one definite and three possibles in a fairly small sample size and I think that with the potential scale of the problem presumptive removal would speed things up a little bit. Boissière (talk) 21:56, 4 September 2010 (UTC)
 * Yes, they should be presumptively removed. With the massive scale of this one there's really no other way to handle it, particularly since all of the articles currently listed are the ones they actually created. VernoWhitney (talk) 23:25, 4 September 2010 (UTC)
 * Can we look at Darius's edits by size of the edit instead? As I stated in the opening, the shorter articles (below 2.5KB creation size) he's created are practically a green light for original work. From a legal perspective, no one will bother contesting a couple of sentences describing basic, key information on a subject. Also, I would guess that the copyright problems will lie solely in biographies and not the likes of X at Games...etc. Sillyfolkboy (talk) (edits) Join WikiProject Athletics!  00:47, 5 September 2010 (UTC)
 * Yeah, I got it in my head that it would be easier to split out created articles from the other articles they've edited but not created, which is why it ended up like this. I'm running it through my bot right now so tomorrow I should be able to update the pages with created articles sorted by edit size and then other edited articles also sorted by edit size. VernoWhitney (talk) 03:11, 5 September 2010 (UTC)

Need help?
I just saw this report on ANI and thought I'd see if you'd like some help. I've never gotten involved here so I'm unsure as to how this works, procedurally speaking. Should I claim an article in the list somehow? I'm guessing the x graphic means no copyright issues found. What happens if I do find something plagiarized? How does it get tagged, and is there somewhere else that would be reported? Sorry for so many questions, but I want to make sure I'm going about it properly before I jump right in, so I don't end up creating even more work for someone. -- e. ripley\talk 04:36, 5 September 2010 (UTC)
 * Yes, n means no copyvio found. y means there's a problem or at least a likely problem. If you find something that looks to be a problem, whether or not you can find a source, you can a) remove the copyvio yourself on the spot or b) replace the page with {{subst:copyvio}} and follow the instructions on the generated page that tell you how to list it on the Copyright problems daily subpage for others to follow up on. VernoWhitney (talk) 12:40, 5 September 2010 (UTC)
 * And what does the red X that some editors have been using indicate?  DGG ( talk ) 00:18, 9 September 2010 (UTC)
 * n generates a red X, so it means no copyvio found. y generates a green check mark, which means there's a problem. VernoWhitney (talk) 00:27, 9 September 2010 (UTC)
 * A red X means there is no problem, but a green check mark means that there is? That is a very confusing convention. Tim Pierce (talk) 15:48, 14 September 2010 (UTC)
 * Sorry. :) For us, it's always been more of a "y" means problem, "n" means no problem. The images may be confusing, but the letters are pretty intuitive. Wish we could use a similar one-letter scheme that's more visually connected. :/ --Moonriddengirl (talk) 16:11, 20 September 2010 (UTC)
 * The convention always made sense to me because unlike most other places where a green check means "good" or the like, here we're actually looking for expected problems and a check means something like "yes, I found something". At least that's how I see it. VernoWhitney (talk) 16:24, 20 September 2010 (UTC)

In Cleanup instructions you note that "All contributors with no history of copyright problems are welcome to contribute to clean up." I had an issue in the past related to copyright problems, mainly due to misunderstandings, which was finally cleared. Would I be allowed to help here, or not? Rentzepopoulos (talk) 13:01, 20 September 2010 (UTC)
 * Hi. I remember that situation. It's understandable to be confused about our ability to include non-commercial material, but there were some issues with close paraphrasing in your initial rewrites. The request that only contributors with no history of copyright problems help is designed to protect both the project and the contributors, since a misunderstanding of this can make them contributory to infringement if they restore copyrighted material. Does it remain only the one instance? If so, I should think you'd be very welcome to mark those uncomplicated situations where the contributor did not add text, but only uncreative information. These are particularly likely to turn up in those articles to which he contributed, but did not create. --Moonriddengirl (talk) 16:11, 20 September 2010 (UTC)
 * Rentzepopoulos, besides actually reviewing articles to check for copying, there may be other stuff you can help with. There is going to be a lot of maintenance and list-making connected with this operation, I expect.  69.111.195.229 (talk) 06:51, 21 September 2010 (UTC)
 * Thank you both for your answers. I understand the reasoning explained by Moonriddengirl and I respect it. I will stay around and try to help differently, if I can. Rentzepopoulos (talk) 07:52, 21 September 2010 (UTC)

Refining approach
This evening I have been trying to develop an API program which would take the wikitext of a suspect article and try to count up the amount of prose in it. It does this by dividing the article into sections and counting the words in each section. A section is principally either a normal section between two headings or a cell in a table. The program then reports the largest section. This way an article consisting mainly of tables should return a low value. Here is what it produces for Articles 61 through 80 (I chose this because this has a reported but not yet cleaned copyvio in Athletics at the 1980 Summer Olympics – Men's 3000 metre steeplechase).


 * Cycling at the 1972 Summer Olympics – Men's individual road race - Max words in a section = 190
 * National champions Javelin (men) - Max words in a section = 115
 * Athletics at the 1992 Summer Olympics – Men's 800 metres - Max words in a section = 34
 * Estonia national football team 1996 - Max words in a section = 59
 * 1999–2000 in Dutch football - Max words in a section = 102
 * 2009 Vuelta a Colombia - Max words in a section = 589
 * 1987 Race Walking Year Ranking - Max words in a section = 47
 * 2008 Women's Pan-American Volleyball Cup Squads - Max words in a section = 27
 * 2004 UCI Road World Championships – Men's road race - Max words in a section = 40
 * European Sprint Swimming Championships 1994 - Max words in a section = 46
 * National Marathon champions (men) - Max words in a section = 103
 * Athletics at the 1980 Summer Olympics – Men's 3000 metre steeplechase - Max words in a section = 212
 * European Sprint Swimming Championships 1992 - Max words in a section = 49
 * Water polo at the 1988 Summer Olympics - Max words in a section = 112
 * Cycling at the 1992 Summer Olympics – Men's individual road race - Max words in a section = 152
 * Hockey at the 1999 Pan American Games - Max words in a section = 119
 * Squash at the 2007 Pan American Games - Max words in a section = 54
 * Athletics at the 1992 Summer Olympics – Men's 1500 metres - Max words in a section = 41
 * European Sprint Swimming Championships 1993 - Max words in a section = 104
 * Swimming at the 1995 Pan American Games - Max words in a section = 33

The program needs refinement - in 2009 Vuelta a Colombia it is being fooled by the list of teams near the end - I need to work out how to spot that. You can see that the copyvio article mentioned has a word count of 212. Is this an approach worth pursuing further? Boissière (talk) 22:51, 5 September 2010 (UTC)
 * Definitely worth doing as I imagine the large amount of biographies will be the difficult task to tackle. This will narrow them down immensely because so many of Darius's created biographies are just one or two sentences followed by tables. SFB/talk 20:59, 8 September 2010 (UTC)
 * Thanks for the feedback. I have held off from this for a bit due to all the hoo-ha related to this CCI as well as the problems with the program mentioned above. However I have tweaked the program a bit to separately give the size of the lead and the maximum size of the other sections of an article. The results of scanning the first 333 articles are given here. I am spurred on by the probability that the articles are going to be blanked which will cause me a few problems as the program simply reads the latest version. Boissière (talk) 11:36, 10 September 2010 (UTC)
 * Is anything going on with this? I've had the same idea, so if someone else is doing it, that's great.  I was thinking of just spotting anything with over 15 or so consecutive words.  Reading the latest revision isn't so good.  It's preferable to read all the revisions and figure out which words were added by DD's edits. 67.119.14.196 (talk) 06:29, 17 September 2010 (UTC)

Copyright question
I know that results and statistics themselves aren't copyrightable, but is there anything copyrightable in the specific format and wording in which they are presented? I ask in relation to this comment I made on the user's Talk page. -- Boing! said Zebedee (talk) 17:35, 11 September 2010 (UTC)
 * Tables are copyrightable to the extent that they are creative. If the content itself is not creative, a table is only copyrightable in the United States if it is creative in presentation or in the selection of facts. This is the reason why in Feist v. Rural copying a phone book was not found to be copyright infringement, because the information was presented alphabetically and included the same details that others would include. --Moonriddengirl (talk) 18:09, 11 September 2010 (UTC)
 * Oh, that particular source for tables has been discussed at the ANI thread, and we think it's okay. --Moonriddengirl (talk) 18:10, 11 September 2010 (UTC)
 * Ah, that's great, thanks -- Boing! said Zebedee (talk) 18:18, 11 September 2010 (UTC)

Translations
A certain amount of this copyvio may have spilled over to other language versions through translations. Will a list of confirmed copyvio be kept here, and are there any ideas about how this particular problem could be checked and handled? --Sir48 (talk) 13:58, 14 September 2010 (UTC)
 * The pages here listing the user's contributions and any action taken to remove them will be kept (and archived when the case is closed). Hut 8.5 14:41, 14 September 2010 (UTC)

Copyvio?
This revision has been marked by a user as copyvio on here. Need explanation on why this is so. 121.120.214.122 (talk) 15:42, 14 September 2010 (UTC)

I think I'm onto something
I've been focusing my efforts on Darius's biographies and I've come to believe that DD was using a bot of some sort to create articles. I think this for several reasons. One is that many of them have this "fill in the blank" quality. They almost always include the athlete's gender and the phrase his [or her] native country. The articles' spelling tends to be consistent with that of their sources. If the source uses the British way of spelling things (e.g. metres instead of meters) then so does Darius. If it's an American source then he uses the American spelling.--*Kat* (talk) 01:26, 21 September 2010 (UTC)
 * You're not the first to have observed the former. (There's a lot of discussion of this on the main cleanup page (q.v.).)  The point about the spelling is interesting, though.  Does this apply to the 1-paragraph stubs that people are opining cannot be copyright violations?  Please point to an example. Uncle G (talk) 01:38, 21 September 2010 (UTC)
 * Next time I see it, I'll link you to it. Meanwhile check this out.
 * Source: http://www.fifa.com/worldfootball/statisticsandrecords/players/player=194514/index.html
 * DD's Article (original version): http://en.wikipedia.org/w/index.php?title=Hjalmar_Zambrano&oldid=302245402
 * Both the source and the article use the same style of date. A style that I haven't seen him use in his other original articles.  Coincidence?  I think not.--*Kat* (talk) 05:13, 21 September 2010 (UTC)
 * BTW, that article doesn't even make sense. It says the player debuted in 1992 after playing in a FIFA match in 1987.  WTF? --*Kat* (talk) 05:15, 21 September 2010 (UTC)
 * Are you sure that your date preferences aren't misleading you? I'm seeing "23/04/1971" in the source and "1971-04-23" in the wikitext.  Uncle G (talk) 10:38, 21 September 2010 (UTC)

Diffs don't always identify every edit
On Contributor copyright investigations/Darius Dhlomo 24, I went to the first article listed, which was Vasily Rudenkov. It lists one edit, but checking the Revision history of Vasily Rudenkov shows two. Neither was an issue, but I wanted to make sure people know they have to examine the article history to make sure... dm (talk) 02:01, 21 September 2010 (UTC)
 * The listings are designed not to include reversions or minor edits of less than 100b. That may be the case there. :) --Moonriddengirl (talk) 11:13, 21 September 2010 (UTC)
 * That is the case, since to be precise edits aren't shown when they are likely reversions or when they add less than 100 bytes, and this edit actually removes 22 bytes. These edits don't get diffs because they should be largely clean. There is a chance that they contain copyvio (e.g., a large chunk of clean text replaced by shorter copyvio text) but this is pretty unlikely and listing them all would make the job endless instead of just nearly so. VernoWhitney (talk) 11:47, 21 September 2010 (UTC)
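
The listing rule described above (skip likely reversions and anything that adds less than 100 bytes) could be sketched like this. It is only an illustration, not the actual tool: the `Revision` record and `listed_edits` function are hypothetical, and treating "same content hash as an earlier revision" as a likely reversion is an assumption, since the thread does not say how reversions are detected.

```python
from dataclasses import dataclass

MIN_ADDED_BYTES = 100  # threshold stated in the thread


@dataclass
class Revision:
    rev_id: int
    user: str
    size: int   # page size in bytes after this edit
    sha1: str   # content hash (assumed available for revert detection)


def listed_edits(history, user):
    """Return (rev_id, bytes_added) for the user's edits that would get a diff.

    `history` must be ordered oldest first.
    """
    seen_hashes = set()
    prev_size = 0
    listed = []
    for rev in history:
        delta = rev.size - prev_size
        is_likely_revert = rev.sha1 in seen_hashes  # assumption: hash match = reversion
        if rev.user == user and not is_likely_revert and delta >= MIN_ADDED_BYTES:
            listed.append((rev.rev_id, delta))
        seen_hashes.add(rev.sha1)
        prev_size = rev.size
    return listed
```

Under this rule an edit that removes 22 bytes, like the one on Vasily Rudenkov, never appears in the listing, which is why the article history can show more edits than the subpage does.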

Should we continue to update the subpages?
Once the bot has blanked an article, and the article is subsequently sorted out, is it still helpful to update pages such as Contributor copyright investigations/Darius Dhlomo 8? My guess is that once articles have been tagged it's no longer necessary, but I just wanted to be sure in case not doing so creates extra work for someone else down the line. Regards --WFC-- 10:00, 23 September 2010 (UTC)
 * Do carry on updating the subpages, that way editors know which blanked articles still need to be checked. Hut 8.5 10:21, 23 September 2010 (UTC)
 * Yes, please continue to update the subpages. It's entirely likely that some of the blankings will be removed by editors who aren't actually looking for copyvio, and it would be bad if those articles are overlooked. VernoWhitney (talk) 11:03, 23 September 2010 (UTC)

Current statistics
Current statistics. Just counting the number of y and t templates, so exact numbers will be slightly off. Based on this, I would say 20% of the checked articles are copyright violations. Probably in the whole set the number is lower.
 * Articles 1–1000: 317 checked, 60 copyvios.
 * Articles 1001–2000: 313 checked, 55 copyvios.
 * Articles 2001–3000: 417 checked, 96 copyvios.
 * Articles 3001–4000: 122 checked, 26 copyvios.

I've read calculations like this before, but I wanted to know the current status, so I calculated it and decided to share it.--EdgeNavidad (Talk · Contribs) 20:52, 13 November 2010 (UTC)
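
For anyone wanting to recheck the 20% figure, the arithmetic on the four rows above can be reproduced with a few lines. The variable names are just for this example; the counts are copied from the list.

```python
# (range label, articles checked, copyvios found), copied from the list above
counts = [
    ("1-1000", 317, 60),
    ("1001-2000", 313, 55),
    ("2001-3000", 417, 96),
    ("3001-4000", 122, 26),
]

total_checked = sum(checked for _, checked, _ in counts)
total_copyvio = sum(found for _, _, found in counts)
overall_rate = total_copyvio / total_checked

# Per-range rate, then the overall rate
for label, checked, found in counts:
    print(f"Articles {label}: {found / checked:.1%}")
print(f"Overall: {total_copyvio} of {total_checked} = {overall_rate:.1%}")
# Overall line prints: Overall: 237 of 1169 = 20.3%
```

This only describes the articles already checked. As noted above, the rate across the whole set may well be lower, since reviewers may have looked at the more suspicious articles first.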

Ongoing?
I just stumbled across this. If the cleanup is ongoing (and it looks like it is), I'd be happy to help out. I don't have any experience in a WP project like this, so I'll need someone to show me where the (metaphorical) mops are kept and how to use one. -Ornithopter (talk) 07:06, 4 August 2012 (UTC)
 * Oh, look at that. There's a "how to help" link right there in the side bar. -Ornithopter (talk) 07:09, 4 August 2012 (UTC)

Roberta Bonanomi article
This article was created by DD. In the infobox, it is stated (info entered by original creator DD) that she is 6 feet 7 inches tall, and weighs 200 pounds. This is absurd and wrong. If you don't believe me, Google her name and check under images. You will find photos of her on podiums and she is obviously not 6' 7" tall, etc. How much other misinformation of this type is there in articles created/edited by DD? — Preceding unsigned comment added by 71.227.178.154 (talk) 03:06, 20 October 2012 (UTC)