User talk:Doc James/Rebut

They looked at only 10 Wikipedia articles
I think the fact that they looked at only 10 Wikipedia articles should be added to the first (or second at the latest) paragraph. --Hordaland (talk) 03:14, 5 June 2014 (UTC)
 * I have added the Daily Mail reference. Would it be okay for me to edit the rebuttal myself? Axl  ¤  [Talk]  09:14, 5 June 2014 (UTC)
 * Yes please. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:48, 5 June 2014 (UTC)
 * While they did indeed look at "only" ten articles, that is incidental. Of the ten chosen, "Lung cancer" and "Major depressive disorder" are featured articles. "Concussion", "Chronic obstructive pulmonary disease" and "Hypertension" are good articles. They assessed a large number of assertions. If they had assessed more articles, they would not have reached different conclusions—partly because their method was flawed, and partly because the articles chosen are representative of a mixture of our highest quality material and the more common lower quality material. Axl  ¤  [Talk]  10:47, 6 June 2014 (UTC)
 * Thanks. --Hordaland (talk) 15:16, 6 June 2014 (UTC)

Some comments
I think this is good. A few comments and suggestions:
 * The popular press has summarized the conclusions as 90% of Wikipedia's medical content being wrong. Suggest (something like): The popular press has taken this to mean that up to 90% of Wikipedia's medical content is wrong.
 * We contend that the paper itself contains many errors. Its methodology is very weak, and its data analysis is entirely wrong. True, but I feel that somewhere we should be saying that it's "fatally flawed"—that's a key point of the letter, imo; "errors" can range from inconsequential to fundamental. Suggest something like: We contend that the paper itself contains fundamental errors. Its methodology appears to be fatally flawed, and its data analysis is entirely wrong.
 * The article was co-authored by 18 Doctors of Osteopathy from various locations; it is difficult to see how all of them could have contributed significantly to it. Do we really want to go there? My impression is that at least 10 of them probably had to work their socks off performing thankless, vague data assessment tasks. Since the journal adheres to ICMJE recommendations, the question would technically be whether they all individually fulfilled the four ICMJE authorship criteria. This is actually quite a serious ethical allegation that I think we should almost certainly drop.
 * Ten reviewers, described as medical students or rotating interns... I believe this information comes from a personal communication; ie, something like: Ten reviewers, described (in a personal communication) as medical students or rotating interns...
 * It states in the paper that they are students or residents. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:51, 5 June 2014 (UTC)


 * ...even for experts the task [the reviewers] were assigned would have taken many hours to carry out properly. It's normal for studies to be time-consuming; the problem is (apart from the assumed lack of expertise) that the instructions they appear to have been given were insufficient and inappropriate for the task.
 * direct data. This is a key phrase. Aren't they "essential data" / "key data" or, depending on the context, "original data" that are being withheld? (A small note re When Dr. Heilman spoke with the lead author...: as James is an individual author, this should perhaps read: When one of us (J.H.) spoke with the lead author...)
 * It is impossible to assess the validity of their claims independently, because no specific errors in Wikipedia are mentioned directly in the original journal article. I feel we should also mention somewhere the sleight of hand by which "discordances" are interpreted in the Discussion and Conclusions as "errors". As it stands, the paper does not actually provide any direct evidence of error.
 * Is Wikipedia a perfect source? No, but it is just as good as many and better than most other sources out there. Hasty's work did not have a comparison group. Basically he invented a new method to test the quality of medical content and then only applied this new method to one source, Wikipedia. Suggest: Is Wikipedia a perfect source? No, but other studies suggest that its quality is similar to or better than many other prominent sources (refs). The study did not have a comparison group. Basically, Hasty et al. invented a new method to test the quality of medical content on the internet and then applied this unvalidated method to a single source, Wikipedia. 86.181.64.67 (talk) 09:29, 5 June 2014 (UTC)
 * I think we have addressed these? Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:53, 5 June 2014 (UTC)
 * I think so. A couple of queries though: Have we made adequately clear that the paper does not show any direct evidence of error on Wikipedia? (I think so, but I'd need to look with a fresh pair of eyes now.) Also, have we made clear that they're withholding data that is essential for proper peer review of their work? 86.181.64.67 (talk) 21:18, 5 June 2014 (UTC)

More comments

 * "This is not simply carelessness in wording; it is quite clear that the researchers had no understanding of statistical hypothesis testing." - I'd soften this, perhaps to 'it would appear that the researchers have made a significant error in their statistical design'
 * Even if we were to accept that 2 students reviewing a topic is an appropriate gold standard (which we do not), without a comparator this single data point is meaningless. It would be interesting to know what he would have found if he had applied this methodology to a NICE guideline or to eMedicine.
 * Where this study was an assessment by 2 individual students, Wikipedia is built by a consensus of people, many of whom are experts. We recently surveyed our top contributors to Wikipedia's medical pages and asked about their backgrounds. What we found was that 52% have a master's degree, PhD, or MD. Another 33% have a BSc. Ian Furst (talk) 15:08, 5 June 2014 (UTC)
 * Yes have adjusted. Good suggestions. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:56, 5 June 2014 (UTC)
 * Word count is still at 750ish, I'm going to add some strikeouts to stuff I think might be removed without changing the meaning. OK?  Also, I think the opening could be more compelling, working on an alternate version. Ian Furst (talk) 23:26, 5 June 2014 (UTC)

On May 1st, 2014, Hasty et al. published a claim that nine out of ten medical articles in Wikipedia “contain many errors” compared to the peer-reviewed literature. The popular press, fueled by frequent interviews with Hasty, has taken this to mean that as much as 90% of Wikipedia’s medical content is wrong. Wikipedia is built on collaborative editing, and we thought Hasty’s work might help improve medical content. Instead, we found that Hasty had used an unvalidated test of content quality, applied it to Wikipedia alone, and made significant errors in study design and data analysis. We believe the authors’ conclusions are not supported by the results.
 * Alternate opening,
 * With respect to "compared to the peer-reviewed literature": they did not compare Wikipedia to the peer-reviewed literature. They compared single facts on Wikipedia to a single piece of peer-reviewed literature. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 05:40, 6 June 2014 (UTC)


 * This sentence, "Hasty's work had no comparison group; basically, he invented a new unvalidated tool for testing the quality of medical text and then only applied this new method to one source: Wikipedia." consider changing to "Hasty has created a diagnostic test to judge the quality of medical literature, without any attempt at validation: no comparison to a gold standard, no measurement of accuracy. He then applied this test to a single source: Wikipedia." Ian Furst (talk) 23:44, 5 June 2014 (UTC)
 * It is not really a diagnostic test. It is just a method. IMO this makes it out to be more than it is. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 05:40, 6 June 2014 (UTC)


 * This sentence is fantastic, "This example shows that differences are not necessarily errors." I don't know if you can highlight it somehow. Ian Furst (talk) 00:06, 6 June 2014 (UTC)
 * Finished adding my changes except for opening, see above Ian Furst (talk) 00:18, 6 June 2014 (UTC)

Word count
This should be 300-500. We are currently over 1000. Doc James (talk · contribs · email) (if I write on your page reply on mine) 15:33, 5 June 2014 (UTC)
 * Where did you find that word count limitation? I couldn't see any figure specified in their Information for Authors. 86.181.64.67 (talk) 16:19, 5 June 2014 (UTC)
 * They emailed it to me. They said we can go a bit over. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:07, 5 June 2014 (UTC)
 * Ok :) I think they should grant us that. 86.181.64.67 (talk) 20:42, 5 June 2014 (UTC)
 * Have trimmed it back to 700 words. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:48, 5 June 2014 (UTC)


 * I wouldn't expect the editors would want to make too much of an issue over a particular word count. Given all the negative publicity Wikipedia has received in the wake of this peer-review publication, I believe the editors of the journal will recognize they have something of an ethical duty to allow us adequate space to respond concisely to the claims made in the paper—especially given the sweeping character of the conclusions. Providing a forum for post-publication review is a relevant aspect of editorial responsibility. 86.181.64.67 (talk) 16:50, 6 June 2014 (UTC)

English
I am of the opinion that we need to use simpler English. Terms like "discordances" will confuse matters. Will work on this. Doc James (talk · contribs · email) (if I write on your page reply on mine) 15:51, 5 June 2014 (UTC)
 * Regarding style, I agree it's good to be simple, but it's also important in a scientific publication to use the appropriate terminology. For example, interobserver reliability is a real issue here, and I think we should call it by name. It's also important to emphasize that the method is completely unvalidated (ie how can they know that they're actually measuring what they aim to measure?). Imo, "discordances" is a case apart: I see what you mean, but they have their own definition of the word, and they make it central to their (flawed) reasoning. Personally, I can't see a way of appraising their work without adopting it in scare quotes. Maybe others can though. 86.181.64.67 (talk) 16:14, 5 June 2014 (UTC)

Evaluator versus reviewer
There was a suggestion that we should call them evaluators rather than reviewers or researchers. Thoughts? Doc James (talk · contribs · email) (if I write on your page reply on mine) 20:58, 5 June 2014 (UTC)
 * I suggested "evaluators", but I'm not at all wedded to that. 86.181.64.67 (talk) 21:19, 5 June 2014 (UTC)
 * I also like evaluators. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 22:07, 5 June 2014 (UTC)
 * Evaluators. They weren't really researchers, and "observers" (I think the most accurate term) is too scientific. Ian Furst (talk) 23:13, 5 June 2014 (UTC)
 * I don't mind: either evaluators or reviewers. (But not researchers.) Axl  ¤  [Talk]  10:31, 6 June 2014 (UTC)

Wrong conclusions
We believe the author’s conclusions are not supported by the results. Query: Why the subjective "We believe"? The conclusions are unsupported. Assuming for a moment that the methods for identifying differences (a much better word than "discordances", btw) were completely valid, they would still only have been a measure of differences, not errors as their conclusions claim. 86.181.64.67 (talk) 08:00, 6 June 2014 (UTC)
 * Agree Doc James  (talk · contribs · email) (if I write on your page reply on mine) 08:12, 6 June 2014 (UTC)

A wording concern
 Even if we were to accept that 2 students reviewing a topic is an appropriate gold standard... I'm concerned about the wording here. The tool wouldn't need to provide a "gold standard" to be reasonably valid. Also, it wasn't really a case of "2 students reviewing a topic"; rather 2 students implementing a flawed study protocol. Not sure how to go about rephrasing this. —86.181.64.67 (talk) 08:23, 6 June 2014 (UTC)
 * A diagnostic tool doesn't need to be a gold standard, but it should have been compared to one. In this sentence I'm taking the position that, even if we accepted that the new tool is the gold standard, the study is still just a data point. Ian Furst (talk) 10:48, 6 June 2014 (UTC)
 * ...or a "reference standard" perhaps (use of the term "gold standard" is somewhat discouraged nowadays). I think their method aims to be a quality assessment tool. Since they ultimately purport to identify (ie 'diagnose') errors, I think there are certain parallels with a diagnostic test. They do need to be able to define some sort of a reference standard, imo. Something they've sidestepped by appealing to UpToDate, and then (amazingly) to whatever pertinent article happens to be retrieved from any search engine. Validation of interobserver reliability would also be essential. 86.181.64.67 (talk) 11:12, 6 June 2014 (UTC)
 * reference standard is fine by me - that's what happens when an 'old guy' writes. Agree there are many other errors in creating the test, but we've only got 500 words Ian Furst (talk) 11:40, 6 June 2014 (UTC)
 * Ouch, I didn't word that well. Not altogether sure now about the advisability of getting into questions of "standards". Another consideration is their free usage of the term. They talk about errors when checked against standard peer-reviewed sources. What do they mean by "standard" there? "Typical" perhaps... 86.181.64.67 (talk) 12:10, 6 June 2014 (UTC)


 * From paragraph 3: "Given the official role of NICE in setting health policy in England and Wales, it is somewhat ironic that The Daily Telegraph, a UK paper, repeated this incorrect statement and the BBC covered the story so uncritically." I don't think that this sentence is necessary. Axl  ¤  [Talk]  10:30, 6 June 2014 (UTC)
 * Agree it could be lost, but I think it's a powerful statement and adds weight to the rebuttal Ian Furst (talk) 10:48, 6 June 2014 (UTC)
 * My thoughts too. Imo, it interrupts the focus and flow. I'm also not sure whether this level of journalistic fact checking could reasonably be expected. 86.181.64.67 (talk) 10:49, 6 June 2014 (UTC)
 * Seems like very basic fact checking to me. Trimmed half of it. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 12:10, 6 June 2014 (UTC)
 * Okay, thanks. Axl  ¤  [Talk]  12:29, 6 June 2014 (UTC)


 * Is there a source for the survey of Wikipedia's medical editors? Axl  ¤  [Talk]  10:52, 6 June 2014 (UTC)
 * Currently no. I could add it on Wikipedia. It is pending publication. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 12:11, 6 June 2014 (UTC)
 * Okay, never mind. Axl  ¤  [Talk]  12:29, 6 June 2014 (UTC)

I like it
I think the rebuttal looks pretty good. Ian Furst (talk) 10:58, 6 June 2014 (UTC)
 * I'm also fairly comfortable now. 86.181.64.67 (talk) 12:14, 6 June 2014 (UTC) 22:46, 7 June 2014 (UTC)
 * Yes, it looks very good. Axl  ¤  [Talk]  12:32, 6 June 2014 (UTC)

"students"?
A qualm... Are we perhaps placing too much emphasis on the fact that some of the evaluators were technically still students? After all, these guys do have quite a lot of medical education under their belts. As a reviewer, I'd be especially concerned about 1) what sort of training they were given in order to be able to identify the most reliable peer-reviewed publication(s) for each particular statement, and moreover 2) that the instructions they were given didn't really seem to encourage them to do that anyway. 86.181.64.67 (talk) 16:28, 6 June 2014 (UTC)
 * Interns are 3rd- and 4th-year medical students. They are not doctors. And they have very little medical training. They are just starting out and have done a year and a half of medical sciences. Residents are physicians but also students. Better, but still exceedingly early in their careers. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 22:18, 6 June 2014 (UTC)
 * Ok James. Systems vary and I know next to nothing about Canada. I noticed that writing in The Guardian, Luisa Dillner characterized them (presumably in UK terms, and perhaps somewhat rapidly) as "middle-grade doctors". Fwiw, I've worked alongside certain dedicated people who did excellent research work while still technically classifiable as "students". So, personally I think we should be wary of placing too much emphasis on this status. If they'd been provided with appropriate bibliographic training on identifying reliable sources, and if they'd had a better study protocol to work with, they might have been in a position to handle the task appropriately. 86.181.64.67 (talk) 09:23, 7 June 2014 (UTC)
 * Yes, the Guardian has made an error. Interns are not doctors; they are medical students in North America. Yes, some of them are excellent. They are still just starting out, and some of them are not excellent. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:00, 7 June 2014 (UTC)

Disagreement between evaluators
Is it worth mentioning that the evaluators themselves often disagreed with each other? Also, that the statistical analysis was inappropriate? Axl ¤  [Talk]  21:05, 6 June 2014 (UTC)
 * Yes. What wording do you recommend? Doc James  (talk · contribs · email) (if I write on your page reply on mine) 22:18, 6 June 2014 (UTC)
 * How about "In 37% of assertions, the two evaluators disagreed with each other." I derived this from the number of dissimilar assertions divided by the total. Axl  ¤  [Talk]  22:57, 6 June 2014 (UTC)
 * How about "In 37% of assertions, the two evaluators disagreed whether or not it was an "error"". Doc James  (talk · contribs · email) (if I write on your page reply on mine) 23:42, 6 June 2014 (UTC)
 * "Or not" is not necessary. How about: "In 37% of assertions, the two evaluators disagreed as to whether it was an error." Axl  ¤  [Talk]  08:17, 7 June 2014 (UTC)
 * They were asked to identify differences ("discordances") not "errors". 86.181.64.67 (talk) 09:33, 7 June 2014 (UTC)
 * Yes, technically you are right. It was the compiling researcher who classified the "discordances" as errors. However I think that the 37% disagreement rate is significant. How about: "In 37% of assertions, the two evaluators disagreed over the verification of the assertion." Axl  ¤  [Talk]  09:52, 7 June 2014 (UTC)
 * Ok :) but I'd also like to stick with the statement proposed below - key concepts here, imo. 86.181.64.67 (talk) 10:07, 7 June 2014 (UTC)
 * How about "In 37% of cases, the two evaluators did not agree on the verification of the assertion." so we do not use assertion twice in the same sentence. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:04, 7 June 2014 (UTC)
 * Yes, that's fine, James. Axl  ¤  [Talk]  12:48, 9 June 2014 (UTC)
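As an aside on the arithmetic in this thread, here is a minimal sketch of how a two-rater disagreement rate (and its chance-corrected cousin, Cohen's kappa) is computed from a 2x2 agreement table. The function names and all counts are illustrative assumptions, not figures taken from Hasty et al.

```python
# Illustrative inter-evaluator agreement arithmetic.
# The counts are MADE UP for the example: suppose two evaluators each
# judged the same 100 assertions as "verified" or "not verified".

def disagreement_rate(both_yes, both_no, only_a, only_b):
    """Fraction of items on which the two evaluators disagreed."""
    total = both_yes + both_no + only_a + only_b
    return (only_a + only_b) / total

def cohens_kappa(both_yes, both_no, only_a, only_b):
    """Chance-corrected agreement (Cohen's kappa) for two raters."""
    total = both_yes + both_no + only_a + only_b
    p_observed = (both_yes + both_no) / total
    # Marginal probability of each rater saying "verified"
    p_a_yes = (both_yes + only_a) / total
    p_b_yes = (both_yes + only_b) / total
    p_chance = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    return (p_observed - p_chance) / (1 - p_chance)

# Illustrative 2x2 table: 50 both-verified, 13 both-unverified,
# 20 verified only by rater A, 17 only by rater B -> 37/100 disagreements.
print(disagreement_rate(50, 13, 20, 17))        # 0.37
print(round(cohens_kappa(50, 13, 20, 17), 3))   # ≈ 0.144
```

Note that even with 63% raw agreement, kappa is low once chance agreement is accounted for, which is why "interobserver reliability" is worth naming explicitly in the letter.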

Responding to Axl's original points:
 * Suggest: "The evaluators seem to have received no specific bibliographic training, and the instructions given to them appear to have been insufficient to permit interobserver reliability." These are two key points that need saying, imo. Fwiw, I don't think it's our role to provide actual statistics of their concordance.
 * Agree on saying "the statistical analysis was inappropriate". That's different from saying there were errors. 86.181.64.67 (talk) 09:07, 7 June 2014 (UTC)


 * Actually I would like to see more emphasis on the statistical method. McNemar's test requires paired data—which this is not (at least not in the way that they compiled the test). Indeed because of the nature of the methodology, there is no statistical test that can demonstrate a "significant" difference ("discordance") from the evaluators' findings. Either the evaluators found that the assertions were verified in all cases (100%), or they were not verified (some value lower than 100%).


 * As an aside, I suspect that Hasty originally used a Chi-squared test, but the manuscript was rejected from serious journals because of the obvious flaws. By changing to the less well-known McNemar test, he managed to obfuscate the weakness to the "peer reviewers" of the Journal of the American Osteopathic Association. Axl  ¤  [Talk]  10:02, 7 June 2014 (UTC)


 * Surely the key thing to say is that the analysis was inappropriate? (With the clear implication that it doesn't provide any meaningful information.) 86.181.64.67 (talk) 10:12, 7 June 2014 (UTC)
 * We state "made significant errors in study design and data analysis". I am okay with providing more details on the analysis used. We need to be careful about speculation though. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:06, 7 June 2014 (UTC)
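To illustrate Axl's point that McNemar's test presupposes paired binary data: the statistic is computed only from the two discordant cells of a paired 2x2 table (the same items classified twice). The numbers below are illustrative, not a reanalysis of the paper's data.

```python
# Sketch of McNemar's test (without continuity correction).
# It applies ONLY to paired binary outcomes, e.g. the same assertions
# judged by two raters, tabulated in a 2x2 table whose discordant cells
# are b ("yes" by rater 1 only) and c ("yes" by rater 2 only).
from math import erf, sqrt

def mcnemar_statistic(b, c):
    """Chi-squared statistic (1 df) from the discordant cells b and c."""
    return (b - c) ** 2 / (b + c)

def chi2_sf_1df(x):
    """Survival function of chi-squared with 1 df: P(X > x).
    For 1 df, X = Z^2 with Z standard normal, so P(X > x) = 2*P(Z > sqrt(x))."""
    return 2 * (1 - 0.5 * (1 + erf(sqrt(x) / sqrt(2))))

# Illustrative paired counts: b = 20, c = 10.
stat = mcnemar_statistic(20, 10)
print(round(stat, 2))            # 3.33
print(round(chi2_sf_1df(stat), 3))
```

If the data are not genuinely paired in this way, as argued above, the statistic has no valid interpretation, however it is computed.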

Not sure about this
" The evaluators seem to have received no specific bibliographic training, and the instructions given to them for verification in the literature appear to have been insufficient to permit interobserver reliability." as it is speculation. Doc James (talk · contribs · email) (if I write on your page reply on mine) 20:45, 7 June 2014 (UTC)
 * I am hesitant to write about "appearances". We should be more concrete. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:47, 7 June 2014 (UTC)
 * "Two sets" means 4? Thus adjusted. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 20:50, 7 June 2014 (UTC)
 * Perhaps I have been hedging too much. Interobserver reliability looks poor (we mention the 37% overall discrepancy between observers). If specific training had been given, one would expect the authors to state that - though I agree we shouldn't speculate. If they'd been in a position to provide training though, they would scarcely have suggested just using any old search engine. 86.181.64.67 (talk) 20:54, 7 June 2014 (UTC)
 * Yes agree which we state and our readers will pick up on. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 21:52, 7 June 2014 (UTC)

"On average, in 37% of cases, the two evaluators did not agree on the verifiability of the assertion." Overall (not on average), if I've understood Axl right (above). Perhaps we should provide absolute numbers alongside the percentage figure? 86.181.64.67 (talk) 21:00, 7 June 2014 (UTC)
 * Okay yes. How about "The average disagreement between the two evaluators on the verifiability of assertions was 37%" Doc James  (talk · contribs · email) (if I write on your page reply on mine) 21:51, 7 June 2014 (UTC)
 * What you changed it to works as well. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 21:52, 7 June 2014 (UTC)
 * It's not "on average". It is the total, but there's no need to mention "overall" either. Axl  ¤  [Talk]  12:59, 9 June 2014 (UTC)


 * One remaining doubt: If they could not find the fact in the chosen source, or it was in conflict with a fact found, the authors went on to assume it was an "error". On the subject of speculation, are we quite sure about the underlined passage here? I found their description a bit ambiguous (though personally I wouldn't make too much of this issue, which they could doubtless explain). Based on the text and table I felt I couldn't be altogether sure exactly what they did. Would it perhaps be better to state "It seems that if they could not find the fact in the chosen source, or it was in conflict with a fact found, the authors went on to assume it was an 'error'"? Not sure about this. 86.181.64.67 (talk) 22:08, 7 June 2014 (UTC)
 * Yes agree. Doc James  (talk · contribs · email) (if I write on your page reply on mine) 22:17, 7 June 2014 (UTC)
 * The original statement is indeed my understanding of the method used. However, as you find their description ambiguous, I don't mind adding "It seems that". Axl  ¤  [Talk]  13:02, 9 June 2014 (UTC)
 * I don't think it's clear to the point that one can be completely sure of what they did and didn't search and how the classification was done. Others have said they find it unclear too. As a novel method, it really needed to be described in more detail imo. From our point of view, my reasoning is that we've genuinely done our best to understand what they did; by admitting we're not altogether sure, we can't be accused of simply having misinterpreted the explanation. That's my perception, but if others are convinced it's actually all crystal clear, I'll willingly defer to their judgement. 86.181.64.67 (talk) 08:51, 10 June 2014 (UTC)


 * Okay, I have made a few minor adjustments. Hopefully, we are close to the final version. Axl  ¤  [Talk]  17:42, 10 June 2014 (UTC)
 * This copyedit is of course appropriate on strictly grammatical grounds. What I was trying to get at was that the Discussion section, which does not address study limitations, sidesteps all questions of interpretation. 86.181.64.67 (talk) 09:01, 11 June 2014 (UTC)
 * Well, I don't have a strong opinion about it. If you would like to change it back, please do so. Axl  ¤  [Talk]  10:17, 11 June 2014 (UTC)
 * Ah, you already changed it. That's fine. Axl  ¤  [Talk]  10:18, 11 June 2014 (UTC)
 * (sorry about that) I've readjusted the wording. 86.181.64.67 (talk) 12:47, 11 June 2014 (UTC)
 * Okay. Axl  ¤  [Talk]  10:03, 12 June 2014 (UTC)