Wikipedia talk:Contributor copyright investigations/20230303

Some questions
Hi @MER-C. Thanks for opening the CCI. New here, so I've got a few questions —Femke 🐦 (talk) 03:26, 4 March 2023 (UTC)
 * 1) Most of @EMsmile's edits are correctly attributed copying within Wikipedia. I've not found any problems there. I was wondering if we could allow EMsmile to mark these herself, so that uninvolved editors can look at the problem areas (close paraphrasing, copying from non-compatible 'open-access' sources + incorrectly labeling them).
 * 2) EMsmile has a tendency to quote significantly to avoid violations. Individually, those quotes are okay (almost always below 50 words), but on some pages the overall amount of quotes from a single source can be significant. Is that fine?
 * 3) The instructions say "The  template may be used for articles where you were unable to determine whether or not a violation occurred, but are prepared to remove the article from consideration – either because the material is no longer present in the article, or it is adequately paraphrased so as to no longer be a violation (please specify which).". I'm a bit confused about 'unable to determine'. Most material that is no longer present is still easy to check. Is the expectation that one does try to determine if a violation takes place?


 * @Femke I'm around right now so I'll try and answer:
 * 1. It's good that the text copied in between is attributed... but this adds a bit of an extra issue, since EMsmile has copied between articles she's contributed to. An example is this edit to Climate change in Africa, where she copies content from effects of climate change on agriculture... which has 22 edits listed at this CCI. So we'll have to check the content that gets copied over if it was written by her... yeah, I know, it's a pain. At least EMsmile properly attributed it-- there's other CCI subjects who would do this and wouldn't ever attribute, creating likely unfixable attribution/copyright nightmares.
 * 2. When it comes to the subject of excessive quotations/ "quote farms" there's no exact policy/agreement on what constitutes "too much". I generally base it off of the context; what is editorially "too much" or otherwise unneeded? What can be said without quoting directly from the source?
 * 3. Generally, if the text is no longer in the article and wasn't moved somewhere, then it's probably not worth the time figuring out if the text was a violation and then when it was removed. Now, 97% of the stuff that ends up at CCI is lower quality content that gets rewritten/removed over time for a variety of reasons, and is unlikely to be restored to the article. In this case, the content is of a better quality and if you think there's a chance it could get restored there may be merit in checking.
 * I hope this answers your questions. Moneytrees🏝️(Talk) 20:58, 4 March 2023 (UTC)
 * Thanks, that makes sense.
 * I still hope we can involve @EMsmile significantly. We can ask her to go over the high risk articles herself first (all of the articles listed at User:EMsmile), and then leave a note at the CCI page when she has removed or paraphrased copyright violations on a certain article. That way uninvolved editors only need to double check and probably do some additional tweaking to ensure it's fully paraphrased.
 * Me and EMsmile have had frequent editorial disagreements about quoting too much over the years. It may be good if these are evaluated by independent editors.
 * Clear
 * —Femke 🐦 (talk) 11:22, 5 March 2023 (UTC)
 * @Femke, if EMsmile is willing to rewrite/remove issues and offer the rewrites for review and also be available for source that would be welcome. I'll add a note about overquotes to the CCI background. Moneytrees🏝️(Talk) 18:02, 6 March 2023 (UTC)

Using Copyvio detection tool?
Could we get a bot to run the copyvio detection tool over all the articles that I ever touched in order to help us with finding those where there are still copyvio problems now? (although in some cases, they might be new copyvios, not those that I had caused myself) For example, I just ran it for SDG 7 and this is the result: https://copyvios.toolforge.org/?lang=en&project=wikipedia&title=Sustainable+Development+Goal+7&oldid=&action=search&use_engine=1&use_links=1&turnitin=0 For me it looks OK. It says Violation Possible, 67.5% similarity. From which percentage value onwards do the alarm bells go on? Note for this article it's a bit unusual because it lists a lot of indicators which are sentences which are also on all sorts of other websites. It's not copyvio because those indicators are quoted and have to be in the same wording that the UN used. But in general, can the copyvio detection tool help us here with this process? EMsmile (talk) 18:08, 6 March 2023 (UTC)


 * @EMsmile, see User:Moneytrees/CCI_guide. Earwig unfortunately won't be too much help at this CCI-- from my understanding most of the violations come from journals, which Earwig cannot scan. Most comparisons will probably have to be manual. The percentage doesn't really matter, it's more what gets highlighted. Now, if we had the option of running TurnitIn, which can read journals, on all the articles listed here, that might be useful, but I'm not sure how that would be accomplished. Moneytrees🏝️(Talk) 18:34, 6 March 2023 (UTC)
 * No, not just journal publications. When I first started out, in 2014 and 2015 I used for example publications by GIZ where I knew they were "meant to be" open access (as I used to work there) but they have no compatible licence listed (back then nobody thought about CC BY...). So the pdf files of those publications are on the web, so therefore the copyvio tool should bring up copyright infringement, right? E.g. I just made some improvements at urine diversion now to content that I had added there in 2014 (as a newbie). When I run the copyvio tool for the article now it says 35.5% similarity (https://copyvios.toolforge.org/?lang=en&project=wikipedia&title=Urine+diversion&oldid=&action=search&use_engine=1&use_links=1&turnitin=0). So in any case, having the % similarity available for all of the articles that I ever worked on could therefore perhaps be useful? If I click "use Turnitin" I get "no matching results found". - By the way, I also used a lot of grey literature for my Wikipedia work on sanitation which is all available online. I didn't often use publications that are behind paywalls as I don't have access to those either but would need to ask the author to send them to me (which I did sometimes but not very often). EMsmile (talk) 18:45, 6 March 2023 (UTC)
 * I usually check each website with above 20-25%. For SDG7, there is an entire paragraph copied from a non-free source, and that's 41.7%. —Femke 🐦 (talk) 19:42, 6 March 2023 (UTC)
 * Do you mean the para starting with "Finance for energy access remains far below the investment needed to achieve SDG 7 by 2030:"? If so, it wasn't added by me but by someone else here (so I hadn't checked that one when I did the recent review; I had only checked my own additions.) If that's the paragraph you mean then I suggest it's condensed right down. EMsmile (talk) 20:14, 6 March 2023 (UTC)
 * OK, interesting about the percentage values. I see now here: User:Moneytrees/CCI_guide it says "View the results. Ignore the percentage, go off of highlighted text. At least check everything above ten percent. " Good to know. It also talks about detecting mirrors here: https://en.wikipedia.org/wiki/User:Moneytrees/CCI_guide#Detecting_mirrors . I also found quite often that people copied from us later, rather than us copying from them. EMsmile (talk) 20:22, 6 March 2023 (UTC)
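As a rough illustration of the request above to run the copyvio tool over many articles: the two links quoted in this thread share the same query string, so the per-article links could be generated with a small script. This is only a sketch, not an existing bot. The function name `copyvio_url` and the example title list are made up for illustration, and the parameters are simply the ones visible in the URLs posted above. The script only builds links; as the CCI guide says, the highlighted text still has to be checked by hand.

```python
from urllib.parse import urlencode

# Base address taken from the links posted in this thread.
BASE_URL = "https://copyvios.toolforge.org/"


def copyvio_url(title, lang="en", project="wikipedia", use_turnitin=False):
    """Build a copyvio-detector link for one article title.

    Hypothetical helper: the parameter names mirror the query string
    of the URLs quoted in the discussion, nothing more.
    """
    params = {
        "lang": lang,
        "project": project,
        "title": title,
        "oldid": "",            # empty means the current revision, as in the quoted links
        "action": "search",
        "use_engine": 1,
        "use_links": 1,
        "turnitin": 1 if use_turnitin else 0,
    }
    # urlencode() encodes spaces as "+", matching the quoted links.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Example titles only; a real run would use the article list from the CCI page.
    for article in ["Sustainable Development Goal 7", "Urine diversion"]:
        print(copyvio_url(article))
```

Each printed link opens the same report that was run by hand for SDG 7 and urine diversion above.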

Individual SDG articles

 * Note that earwig hasn't picked up on the copyvio in the Indicator 7.4.1 paragraph from . —Femke 🐦 (talk) 21:00, 6 March 2023 (UTC)
 * I've fixed that now as well. The copyright status of those UN publications is not completely clear (I mean the one called "Progress towards the Sustainable Development Goals Report of the Secretary-General"). I had a very long discussion about copyright of the UN declaration where the SDG targets and indicators were announced. The vague consensus in the end was that it's compatibly licenced although it was very hard to get to the bottom of it. That UN progress report might or might not fall into the same category. Here is the old discussion from 2020: https://en.wikipedia.org/wiki/Talk:Sustainable_Development_Goals#Should_SDG_5_have_its_own_article_and_is_there_copyright_violation_when_listing_the_targets? which links to this: https://commons.wikimedia.org/wiki/Commons:Copyright_rules_by_territory/United_Nations. The doc in question might fall under ? Not sure. EMsmile (talk) 23:22, 6 March 2023 (UTC)
 * I don't see how a progress report can be seen as either treaty or a convention. I don't see any evidence that other UN documents are under public domain. The page you link explicitly says that the UN seeks copyrights for reports. —Femke 🐦 (talk) 19:20, 8 March 2023 (UTC)
 * It says on that page: "Public information material designed primarily to inform the public about United Nations activities (not including material that is offered for sale)" (here). I think some of the individual SDG articles used info from that progress report. I don't know whether the info is already paraphrased or not (would require checking). If not and if you are sure that this is not compatibly licenced (I am not sure, as per that sentence that I just provided) then it would need corrections for the 17 individual SDG articles. EMsmile (talk) 09:43, 9 March 2023 (UTC)
 * In any case, I'll go through the 17 SDG articles again. Have just fixed up some things at SDG 2 which had been copied from here (a publication based on that UN progress report but not compatibly licenced). I think the same is probably the case for the other individual SDG articles too. I worked on those back in mid 2020 and had some collaborators who prepared work in their (or my) sandboxes and I then copied it across. I had told them that they had to paraphrase but I didn't carefully check everything. No excuse, just an explanation how this happened. This is what happened e.g. here. I'll work on that for the remaining 15 individual articles (I think SDG 7 and SDG 2 should be OK now), give me around 2 weeks. EMsmile (talk) 10:40, 9 March 2023 (UTC)
 * I've been working through the individual SDG articles and have now done SDG 1 to 8. At first I only looked at the copy vios that I had introduced when I first set up the article (mainly by taking from that UN progress report). More recently, I have also been removing other people's added content fairly radically: There was a lot that was copied from various UN websites that was in UN speech and likely copy vios. I've been quite ruthless now, culling down those articles a fair bit. Even though I am not checking sentence by sentence anymore, it's quite time consuming. Interestingly, a lot of the content was added by "one off editors" who only made that one edit, often in 2020 or 2021, and did not stick around with Wikipedia editing. EMsmile (talk) 14:26, 13 March 2023 (UTC)

@EMsmile: when you clean articles up (again, thanks!), can you make sure you add the webarchive versions you check? Somebody else will be going over to double check all remaining text by you, and it's a bit of a waste of time to both find these webarchive links. —Femke 🐦 (talk) 17:07, 17 March 2023 (UTC)
 * Hi Femke, it's not really clear to me what you want me to add where and how? Could you please do an example in the SDG 11 article and I can then follow that example for the remaining SDGs articles that I still have to review (numbers 11 to 17)? Overall, most of the individual SDG articles had about 90% "UN speech" (whether copyright violated or with quotes) and only very little from other sources like journal papers. So one has to wonder what the added value of such Wikipedia articles is if they just repeat what's on UN websites... Probably after 2030 they will sink into oblivion anyhow. The only very important article is the main SDG article. Still, the copyright vios have to be fixed up of course and I am quite ruthless now with taking out those UN speech type sentences (just leaving a quoted sentence here and there) and don't think it's worth sinking too much additional time into these individual SDG articles. If someone else wants in future to add info from secondary sources to them, they're free to do that. But I think most papers tend to look at all of the SDGs together, not really individual SDGs (except maybe papers on WASH which will mention SDG 6 in their background/justification). EMsmile (talk) 11:40, 18 March 2023 (UTC)
 * For SDG 9, imagine you had written SDG 9 recognizes that humanity's ability to connect and communicate effectively, move people and things efficiently, and develop new skills, industries and technology, is crucial in overcoming the many interlinked economic, social and environmental challenges in the 21st century, and I wanted to check this is not copyvio, I would have to spend 5 minutes finding the right webarchive version of that article. The original link is dead.
 * You are now going over your additions line by line (I think), so will likely do the same exercise. I'm asking you to simply add the archive url link when you check these pages. —Femke 🐦 (talk) 11:46, 18 March 2023 (UTC)
 * OK, I can do that, going forward. However, I might disappoint you if I say that I am not going through the SDG articles line by line. Instead, I am focusing on checking all the content that I had added. If I see other people's copyright violations along the way, I'll also remove those but I am not checking all their content systematically. For the example in the SDG 9 article "SDG 9 recognizes that humanity's ability to connect and communicate effectively etc." this was not added by me but by User:2030_SDGs so I haven't investigated it. EMsmile (talk) 17:26, 19 March 2023 (UTC)
 * Thanks! Just going line by line over your contributions is fine :) —Femke 🐦 (talk) 17:30, 19 March 2023 (UTC)

Are synonyms the answer?
I have a question about the use of synonyms. Femke said synonyms are useful. I just wonder if they can be regarded as "the answer" or if the approach can be flawed in itself. Let's say we have a very simple sentence in the original source that says "the costs of wastewater treatment is rising rapidly in developing countries". So if I use a synonym for each major word, am I then doing the right thing? Seems a bit artificial. Or do I have to in addition change the ordering of the words? So like this:
 * "The cost of sewage treatment is increasing fast in low income countries."
 * Or is this much better: "In low income countries, the cost of sewage treatment is increasing fast"?
I get it that synonyms are good when the original sentence uses difficult complex words for which simpler words exist. But what if the original sentence is already short and clear, using simple words? Then wouldn't my changing it over to synonyms be quite artificial and actually a bit of a waste of time? Often those kinds of sentences are actually well accepted facts and the source that says it is just repeating a well known fact. Which then comes down to WP:BLUESKY. But I guess that's a different topic again. EMsmile (talk) 10:54, 7 March 2023 (UTC)


 * You're right that you also need to change the structure of the sentence. And yes that is time-consuming. This is one of the reasons that people usually do not paraphrase sentences but instead summarise text. When writing, you should not have the original text in front of your eyes ideally, but really start from scratch. For some sentences, there are only limited ways to say it, and then it can be okay to have a sentence closely resembling the source. This is very rare, most of your examples are relatively easy to summarise or paraphrase —Femke 🐦 (talk) 18:32, 7 March 2023 (UTC)
 * I sometimes have a certain "key statement" that I think is worth adding to a Wikipedia article, i.e. just one particular sentence. Therefore, it's not easy to "summarise" one sentence, especially if it's already short, like in my example: "the costs of wastewater treatment is rising rapidly in developing countries". Same applies to some key statements from IPCC reports. - But OK, I get the "theory" about this. EMsmile (talk) 19:26, 7 March 2023 (UTC)

When is it a matter of too many quotes?
Another thing is that I tend to use quotes a fair bit and Femke generally says it's too often. So I am trying to reduce how often I use quotes. However, for the article on SDG I think I've pretty much converted as many of the quoted sentences to "own text" as needed. Could you, Moneytrees, take a look and tell me if it's still too many quotes? I've left quoted sentences in particular for those sentences where scholars have voiced specific and well-formulated critiques of the SDGs. I feel that changing those into "my own words" wouldn't do it justice and it's better to stick to their words. Just to re-iterate: I get that too many quotes would make it non encyclopedic to read. But would you say the number of remaining quotes in this SDG article is OK as it is now? EMsmile (talk) 11:02, 7 March 2023 (UTC)
 * For that article a peculiarity was that Biermann et al. published a good big report on the SDGs (not compatibly licenced) and then later they also published a shorter summary (compatibly licenced). I found that many of the key statements from the big report that I wanted to include (and had to use quotes for) were now also in the shorter summary - from where I could take directly as it was compatibly licenced and written in fairly simple language. EMsmile (talk) 11:02, 7 March 2023 (UTC)
 * The article has 46 quotes I believe (even though many are misrepresentations of the source / misquotes. There is not "means of achieving" or "means of achievement" in the source for instance). There are a few clearly appropriate quotes ("smokescreen of hectic political activity" should not be said in Wikivoice). Others are unnecessary: Sustainable_Development_Goals largely consists of quotes and doesn't need to, and the same with Sustainable_Development_Goals. I think about 10 quotes can clearly be paraphrased, and many would become suitably easy to understand only then. @Moneytrees: is that a copyright issue, or is that mainly an issue of prose quality? —Femke 🐦 (talk) 18:14, 14 March 2023 (UTC)
 * I'll take another look at those examples that you provided (thanks for pointing them out). The "means of achieving implementation" name of some targets is a naming convention that the UN documents use. It is explained well in a compatibly licenced publication which I have cited in this section: "Notably they contain both ‘Outcome’ (circumstances to be attained) and ‘Means of Implementation' (MoI) targets." and "The first 16 SDGs each include number-designated outcome targets (e.g., 6.1, 6.2) and two to four letter-designated MoI targets (e.g., 6a, 6b), while Goal 17 'Strengthen the means of implementation and revitalize the Global Partnership for Sustainable Development' is wholly about how the SDGs will be achieved." So I am not sure what you want me to do about this. Cite this same publication each time I mention the term "outcome target" for those targets of an SDG that are about outcomes? EMsmile (talk) 17:56, 15 March 2023 (UTC)
 * I've now added that ref (Batram et al.) to the "means of implementation" target for the first 8 individual SDG articles (replacing "achievement" with "implementation"). Will continue with the others later. But is this what you meant with "even though many are misrepresentations of the source / misquotes"? I've also deleted the quote in the Americas section (was not all that relevant). And converted some of the quotes in the Sustainable_Development_Goals into own words. Still two quotes remaining there but at least not as many as before (if someone wants to get rid of those as well they are free to do so; this is not a copyright issue but an issue of prose quality). EMsmile (talk) 18:35, 15 March 2023 (UTC)

Expert contributions and copyright
I just asked on Discord how copyright works with expert contributions, and they weren't quite sure. For context: EMsmile works with experts who send her Word documents with suggestions to add to Wikipedia, which she then adds. The text is specifically written for Wikipedia and not previously published. @EMsmile: have you asked Volunteer Response Team how to deal with the permissions the authors give? Do you have to forward the emails? I don't quite know how to repair attribution in cases you've not mentioned the provenance of the text in your edit summaries. —Femke 🐦 (talk) 19:42, 9 April 2023 (UTC)
 * Hi Femke: No, I haven't worked with the Volunteer Response Team before. I am not sure what you want me to do there? I would be happy to forward any such e-mails but there would be dozens per article plus several marked up Word documents per article because there is often quite a lengthy ping pong between me and the content experts. Would all this have to be archived somewhere? What for? I normally always mention in the edit summaries and on talk pages when I include content that was sent to me by content experts. What might happen though is this: if I make a dozen quick edits in the space of an hour then I might only mention it in the first edit summary of that hour that the sentences were sent to me by a content expert. If needed I can ensure I add it to every single edit summary. - What came out of the discussion on Discord? Is there a link where I can read it? I haven't used Discord so far. EMsmile (talk) 08:46, 18 April 2023 (UTC)
 * The Discord discussion was inconclusive, hence the request to send the question to the VRT ("Describing the method, and asking what the appropriate way is to document the consent you got about licensing from the experts"). You can join the Discord at WP:DISCORD. I asked in the cci channel. —Femke 🐦 (talk) 16:24, 18 April 2023 (UTC)
 * I've looked at the Discord discussion now. Seems inconclusive, as you say. Sorry to be daft but I still don't know what exactly you want me to ask the VRT and why? The process that I follow is like this: I ask an expert "how would you improve this Wikipedia article?"; they then send me an e-mail with a marked up Word doc which shows how they would improve it. I check if their new sentences are "their own words" plus a ref or if they are copyright vios. If they are "own words" plus a ref, then I enter it into the Wikipedia article, mentioning in the edit summary that it was sent to me by the expert. (if it was copyright vio then I tell them we can't add it like that but need to paraphrase). I always explain to the experts our process and ask them to check the Wikipedia article a second time, after the changes have been made (and sometimes a third, fourth etc. time). To me this is all rather clear and uncontroversial so I am not sure what the question is that I need to ask the VRT now and why?
 * Oh and I also always offer to the experts that they edit the Wikipedia articles themselves but they always decline this option and choose the marked-up Word document route instead. EMsmile (talk) 13:41, 19 April 2023 (UTC)
 * The way you describe it, the experts do not explicitly agree to publish their words under the license Wikipedia uses. I expect this is required legally, and this consent needs to be documented somewhere accessible. —Femke 🐦 (talk) 16:23, 19 April 2023 (UTC)
 * Often times, we modify their words anyhow to make the sentences more readable because they often send us long complicated sentences (being academics). Either way, they have agreed to helping with these Wikipedia editing efforts which is documented in the e-mails that we exchange. Do you want me to ask the Volunteer Response Team if such e-mails need to be stored somewhere? But then what about privacy issues, i.e. their actual e-mail address would then be stored somewhere, doesn't that cause privacy issues in itself? - How do you do it in your Exeter Wikimedian in Residence project? Do all the experts that you contacted edit themselves under their own Wikipedia logins? EMsmile (talk) 10:36, 20 April 2023 (UTC)
 * When you have to paraphrase them completely, it's just advice and I can see no reason for any copyright problems.
 * When you do use their words directly, there are two things I wonder
 * Are they given a message such as "By writing text for Wikipedia, you agree to Wikipedia's Terms of Use and agree to irrevocably release your text under the CC BY-SA 3.0 License and GFDL. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license."?
 * Does that message need to be documented at VRT?
 * I don't know the answer. We don't directly copy-paste from Word documents for this reason. So far, most experts have given advice or edited directly, rather than suggesting text. —Femke 🐦 (talk) 16:00, 20 April 2023 (UTC)
 * I looked at the volunteer response team website and am not sure it's the right place to ask this. The only page/e-mail address where this would vaguely fit is this: https://en.wikipedia.org/wiki/Wikipedia:Contact_us/Licensing . But why not rather keep this on-wiki rather than going down this e-mail route? Wouldn't this fit better at Village Pump or WP:HELPDESK or here: https://en.wikipedia.org/wiki/Wikipedia_talk:Copyright_problems ? - We haven't given our content experts an explicit message such as "By writing text for Wikipedia, you agree to Wikipedia's Terms of Use and agree to irrevocably release your text under the CC BY-SA 3.0 License and GFDL. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license." I think something like this might actually scare them off as it sounds more complicated than it is... In most cases we end up changing/adjusting their proposed wording anyhow, so their inputs are more advisory, not word for word copied. I could send them such a message in our e-mail correspondence, if it's deemed necessary by Wikipedia admins I guess. Then the question is still does Wikimedia need to store that somewhere? How, where, for how long, with full name and e-mail address? I am just wondering if you are making this more complicated than it needs to be? Let me ping User:sadads, maybe he can advise. Also pinging User:ASRASR as this relates to the way our project is set up. EMsmile (talk) 22:43, 20 April 2023 (UTC)
 * And that's really cool that you managed to convince experts to edit directly. Of course that would be ideal. In our case, almost all of them have said "no, thank you" to direct editing and I think it's understandable from their perspective. Learning all the Wikipedia guidelines is daunting for them. These are often high flying, very busy and well known academics. I am just happy if they do provide reviews, comments and suggestions for the Wikipedia articles that we send them, let alone expecting them to make edits themselves. EMsmile (talk) 22:43, 20 April 2023 (UTC)
 * Other places are fine too. It's maybe too difficult a question for HD or VP, so CP may be the best location.
 * In terms of message: you may be able to make it less legalese, as long as they know the license?
 * In terms of storage: I don't know. That's why I want you to ask.
 * In terms of experts editing directly; it helps that we're physically in the same room as them on occasion, or they are close colleagues of mine. I'm not sure we've managed to have them edit directly via email. —Femke 🐦 (talk) 06:48, 22 April 2023 (UTC)
 * OK, I think I've understood it now and have posted the question on the talk page of Copyright problems here. EMsmile (talk) 09:44, 25 April 2023 (UTC)
 * Acknowledging this message: I am not exactly a copyright expert. But my understanding is that suggested edits by someone off wiki in a private channel wouldn't constitute "copyright" -- and by having a good faith, clear and transparent interaction with the experts with the full knowledge it would be published on Wikipedia -- I don't think anyone would fault the current exchange. Anyway, hope that helps, Sadads (talk) 00:43, 27 April 2023 (UTC)