Talk:Replication crisis

Open science collaboration recent science publication
In August 2015 the Open Science Collaboration (based in the Center for Open Science) published a paper in Science (journal) (the paper appears to be open access), in which they report the outcomes of 100 replications of different experiments from top cognitive and social psychology journals. Depending on how they assessed replicability (e.g. independent p-values, aggregate (meta-analytic) data, or subjective assessment), they report replicability of social psychology studies between 23% (JPSP, p-values) and 58% (PsychSci, meta-analytic), and between 48% (JEP, p-values) and 92% (PsychSci, meta-analytic) for cognitive studies. The paper is (to my judgement) very carefully constructed and very thorough. It is not easy to interpret these percentages, by the way, as there is hardly any data from other fields about replication success rates. The only indications come from cell biology (see the Science paper), where they are talking about percentages as low as 11% to 25% (probably based on p-values alone). If this is indicative of all sciences (though I would not hazard that guess), it appears that psychology is neither much worse nor much better than most. But that would be my own original interpretation and hence not useful for Wikipedia.
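As an aside, the gap between the p-value-based and meta-analytic replication rates above can be illustrated with a toy simulation (my own sketch, not taken from the paper): even when an effect is real, a replication judged purely by p < 0.05 succeeds only about as often as the replication's statistical power allows. The effect size (d = 0.3) and sample size (n = 40) below are arbitrary assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def p_two_sided(sample_mean, n, sigma=1.0):
    """Two-sided p-value for H0: true mean = 0, known sigma (z-test)."""
    z = abs(sample_mean) / (sigma / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))

def replication_significant(d=0.3, n=40):
    """One replication of a genuinely true effect d with n subjects;
    'success' is defined as p < .05, as in the p-value criterion."""
    sample = [random.gauss(d, 1.0) for _ in range(n)]
    return p_two_sided(sum(sample) / n, n) < 0.05

trials = 5000
rate = sum(replication_significant() for _ in range(trials)) / trials
print(rate)  # roughly the test's power here, well below 100%
```

So a "low" replication percentage by the p-value criterion partly reflects the power of the replications, which is one reason the meta-analytic numbers come out higher.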

I think we should construct a brief section on the outcomes of this programme/paper for this article. I will think about it, but it may take some time (busy) and it should be done with due attention to nuance; anyone else is welcome to start it. Arnoutf (talk) 14:27, 30 August 2015 (UTC)
 * See reproducibility project. Andries (talk) 21:59, 1 September 2015 (UTC)

Working on Updating Replication crisis page
I'm working on updating this page and working edits can be found in my sandbox. Pucla (talk) 17:41, 13 November 2015 (UTC)--pucla

QRPs
We define QRPs as "while not intentionally fraudulent, involve capitalizing on the gray area of acceptable scientific practices or exploiting flexibility in data collection, analysis, and reporting" and then we list a bunch of gray areas...and "falsifying data".

I would argue that many researchers don't/didn't even realize that, e.g., selective stopping brought in significant bias, and I'm assuming it hasn't traditionally been prohibited. Falsifying data seems like a different category altogether. Unlike the others, it's not a gray area. This is noted in the discussion section of the cited article (currently ref 5 in the main article) from which we draw our conclusion: "Although falsifying data (Item 10 in our study) is never justified, the same cannot be said for all of the items on our survey."
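The selective-stopping point can be made concrete with a small simulation (my own hedged sketch, not drawn from the cited survey): repeatedly peeking at the data and stopping as soon as p < 0.05 inflates the false positive rate well beyond the nominal 5%, even though no data are falsified. The starting sample size, batch size, and maximum sample size below are arbitrary assumptions.

```python
import math
import random

random.seed(1)

def p_two_sided(sample, sigma=1.0):
    """Two-sided p-value for H0: true mean = 0, known sigma (z-test)."""
    n = len(sample)
    z = abs(sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))

def one_null_study(optional_stopping, n_start=20, n_max=100, step=10):
    """Simulate a study of a truly null effect; with optional stopping,
    peek after every batch of subjects and stop as soon as p < .05."""
    data = [random.gauss(0.0, 1.0) for _ in range(n_start)]
    while True:
        significant = p_two_sided(data) < 0.05
        if significant or not optional_stopping or len(data) >= n_max:
            return significant
        data.extend(random.gauss(0.0, 1.0) for _ in range(step))

trials = 2000
fixed_rate = sum(one_null_study(False) for _ in range(trials)) / trials
peeking_rate = sum(one_null_study(True) for _ in range(trials)) / trials
print(fixed_rate, peeking_rate)  # peeking pushes the rate well above 0.05
```

Which is exactly why this practice is a "gray area" bias rather than outright fraud: each individual test looks legitimate.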

It's also worth noting that the statement that "A survey of over 2,000 psychologists indicated that nearly all respondents admitted to using at least one QRP" does not appear to be in the study cited. The highest rate for any individual category is 66.5% ("In a paper, failing to report all of a study’s dependent measures"). It certainly seems likely, but I don't see any such statement.

We can safely say "a majority" while leaving off the "falsifying data" category, as both the "In a paper, failing to report all of a study’s dependent measures" and "Deciding whether to collect more data after looking to see whether the results were significant" categories exceed 50%. (Table/fig 1 in the article).

As such, I'm going to be bold and remove the falsifying data category and change the claim to "a majority". The above is documentation of why and provides a starting point for refutation if someone disagrees with my assessment.

General Wesc (talk) 16:10, 13 February 2016 (UTC)

"Quotes" subsection removed
I've removed the "quotes" subsection from the page, as it doesn't really fit within the article as-is: see below for the text I removed. -- Markshale (talk) 15:15, 6 March 2016 (UTC)

Begin removed text:


 * By Diederik Stapel. From the authorized English translation by Nicholas J.L. Brown, available as a free download in PDF format
 * Clearly, there was something in the recipe for the X effect that I was missing. But what? I decided to ask the experts, the people who’d found the X effect and published lots of articles about it [..] My colleagues from around the world sent me piles of instructions, questionnaires, papers, and software [..] In most of the packages there was a letter, or sometimes a yellow Post-It note stuck to the bundle of documents, with extra instructions: “Don’t do this test on a computer. We tried that and it doesn’t work. It only works if you use pencil-and-paper forms.” “This experiment only works if you use ‘friendly’ or ‘nice’. It doesn’t work with ‘cool’ or ‘pleasant’ or ‘fine’. I don’t know why.” “After they’ve read the newspaper article, give the participants something else to do for three minutes. No more, no less. Three minutes, otherwise it doesn’t work.” “This questionnaire only works if you administer it to groups of three to five people. No more than that.” I certainly hadn’t encountered these kinds of instructions and warnings in the articles and research reports that I’d been reading. This advice was informal, almost under-the-counter, but it seemed to be a necessary part of developing a successful experiment. Had all the effect X researchers deliberately omitted this sort of detail when they wrote up their work for publication? I don’t know.
 * From his memoirs: "Ontsporing" (English, "Derailment") Nov. 2012

End of removed text.

General Section Makes No Sense
We've got a statement about the prevalence of the problem, followed by a bulleted list of disciplines each followed by a percentage, and then another percentage in brackets, all of which appear to have been either rounded to the nearest 10% or drawn from a conveniently sized sample. How am I supposed to interpret this information? Am I being told that 90% of chemistry papers are not reproducible or that 90% of them are? Is 60% some sort of confidence interval? The whole thing is just nonsensical. --81.151.18.242 (talk) 08:21, 1 August 2016 (UTC)


 * Mentions I've seen are that a high percentage of studies are not reproducible when tried, and that replication is hard and seldom even attempted. It's also characterized that peer review or major publication or major study doesn't matter -- it's that the complexity of the topics and the methods and such simply leads to different results.  Markbassett (talk) 13:47, 4 August 2016 (UTC)


 * p.s. There seem to be the usual kinds of debate or diversity of viewpoints about the reality or definition or significance of this item, over the causes, and so forth. e.g.:
 * The Reproducibility Crisis is Good for Science, Slate
 * Science is in a reproducibility crisis: How do we resolve it?, phys.org
 * Psychology Is in Crisis Over Whether It’s in Crisis, WIRED
 * Science Caught in Reproducibility Crisis, Principia on difficulties
 * Is there a reproducibility “crisis” in biomedical science? No, but there is a reproducibility problem, Science-Based Medicine
 * Cheers, Markbassett (talk) 14:27, 4 August 2016 (UTC)


 * The Overall section may or may not make sense, but it doesn't shed any light on the scope of any replication crisis; the numbers that matter are the proportion of experiments that are not replicated or suffer replication failures, not the proportion of researchers that are involved. If a researcher performs 100 experiments over his career, and 1 fails to replicate, that's not really a crisis. The numbers quoted aren't informative to people outside the field, and I'd wonder how informative they are to persons within the fields. I'd be tempted to remove those numbers altogether. (The question is whether the problems of weak effects, small samples, publication bias, and p-hacking are confined to psychology and clinical medicine, or are more widespread.) Ob:XKCD Lavateraguy (talk) 21:39, 31 December 2017 (UTC)

Public policy
Please explain how the inability to reproduce the same result in a study comparing subjects wearing body cameras to subjects not wearing body cameras doesn't relate to the research replication crisis? The article cited even explains why they may have been unable to get a result that supports the earlier findings when they did the same experiments (police officers wearing body cameras compared to officers in the same department not wearing cameras). Did you read the articles? Natureium (talk) 18:04, 21 October 2017 (UTC)


 * (1) The sources don't tie this new study to the Replication Crisis. (2) There is no study that was replicated. The earlier studies may very well be right for the localities and individuals that they studied. (3) That two small-N studies come to a different conclusion on a new phenomenon than a large-N study is not what the Replication Crisis is about. Snooganssnoogans (talk) 18:25, 21 October 2017 (UTC)


 * I agree with Snoogans - Natureium, do the articles you propose to cite mention "replication crisis" or "failure to replicate" directly? Unless they do, this is WP:SYNTH. Neutralitytalk 21:11, 21 October 2017 (UTC)


 * I understand that thinking that led this content to be added but yes given the sources, it is SYN. I wonder if there is a source out there, that actually mentions this is an example... Jytdog (talk) 00:43, 22 October 2017 (UTC)
 * Thanks for explaining. I understand. I'll look for a source that mentions it with those specific words. Natureium (talk) 14:45, 23 October 2017 (UTC)

Failure to reproduce figures in 'Outline' section misleading
These figures are not suggestive of a reproducibility crisis and appear to have been misunderstood so should not be highlighted in this page.

The usual significance cut-offs (p-values), e.g. 0.001 or 0.05, mean that it is completely normal not to be able to reproduce an experiment, which is what the figures refer to.

The p-value threshold is usually set at 0.05, which means that a researcher would only have needed to repeat 20 studies in their career to expect an irreproducible result.
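For what it's worth, the arithmetic behind the "20 studies" figure is just the expected number of false positives under the null, plus the chance of seeing at least one (my own illustration, assuming independent tests):

```python
alpha = 0.05          # conventional significance threshold
n_studies = 20        # hypothetical number of true-null studies attempted

# Expected number of spurious "significant" results among the 20
expected_false_positives = alpha * n_studies          # 1.0

# Probability that at least one of the 20 comes out "significant"
p_at_least_one = 1 - (1 - alpha) ** n_studies         # about 0.64

print(expected_false_positives, round(p_at_least_one, 2))
```

So even a flawless researcher testing 20 true-null hypotheses should expect about one irreproducible "finding", which is why a raw "have you ever failed to reproduce something" percentage says so little.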

The percentage of scientists who fail to reproduce an experiment, without knowing how many experiments they have tried to reproduce, has no meaning. It may just indicate a tendency to re-run studies in different fields.

Since the figures are meaningless without context, they give a misleading impression of the failure to reproduce, e.g. that 87% of chemistry experiments are irreproducible.

The source article is fine, and elaborates on the above figures to highlight other issues, for instance how often these failures to reproduce are published. If I had time I would add the entirety of the findings to this page, but the data currently chosen should be removed as they are not indicative of the page topic.

reference
Here's an additional reference from behavioral neuroscience: --Randykitty (talk) 15:38, 23 January 2019 (UTC)
 * PS: the talks presented at the meeting from which this report was the result are available online on the Tel Aviv University website (here). --Randykitty (talk) 16:42, 10 January 2020 (UTC)

Reproducibility crisis in machine learning
The article misses a section about reproducibility crisis in machine learning. -- JakobVoss (talk) 17:24, 13 September 2019 (UTC)

Paragraph on "Causes"
The paragraph, "Glenn Begley and John Ioannidis proposed these causes for the increase in the chase for significance:
 * Generation of new data/publications at an unprecedented rate.
 * Majority of these discoveries will not stand the test of time.
 * Failure to adhere to good scientific practice and the desperation to publish or perish.
 * Multiple varied stakeholders.
They conclude that no party is solely responsible, and no single solution will suffice."

has these issues:

It lacks a precise reference. I have found it: "Reproducibility in Science: Improving the Standard for Basic and Preclinical Research", C. Glenn Begley and John P.A. Ioannidis, Circulation Research. But I have only found the third point in the abstract.

The second point is not a cause but an effect. It remains unclear why the fourth point should be a cause.

Therefore, I suggest dropping this paragraph, and I will do so unless somebody asks for it to be kept. Werner A. Stahel en (talk) 13:04, 14 October 2020 (UTC)

Shift to a Complex Systems Paradigm
This section is difficult to understand and may need further clarification. It is not clear why experimental studies on more complex, nonlinear systems should be less reproducible than studies on simple, linear systems. Appropriately large sample sizes and correct statistical analysis should yield reproducible findings when these actually exist. Biological systems are typically complex and nonlinear, yet these are successfully studied. The two references cited (107 and 108) discuss the complexities and instabilities in psychological research but do not convincingly make the case that such research should not be reproducible. — Preceding unsigned comment added by TailHook (talk • contribs) 06:56, 16 December 2020 (UTC)

Sources for most impacted fields
The article currently states that the "replication crisis most severely affects the social sciences and medicine". To make this claim, it is not sufficient to show that replicability is low in the mentioned fields. Instead, a source would have to state that this problem is more prevalent in the social sciences and medicine than in other fields. While two sources are provided, in my opinion neither of them supports this claim. I would therefore argue that this claim should be removed unless more substantial sources can be given.

Jochen Harung (talk) 18:24, 25 April 2021 (UTC)


 * The citations mention:
 * In disciplines such as medicine, psychology, genetics and biology, researchers have been confronted with results that are not as robust as they originally seemed.
 * Some have blamed the reliance on p-values for the replication crises now afflicting many scientific fields. In psychology, in medicine, and in some fields of economics, large and systematic searches are discovering that many findings in the literature are spurious.


 * Medicine is supported and we could mention psychology, but social sciences, the broader field that includes psychology, is not specifically supported. However, many of the other social sciences are mentioned in the body of the article, and although WP:LEADCITE mentions the need for information in the lead to be cited, we should avoid duplication of citations in both the lead and the body. It seems a fair generalization to say "social sciences" rather than listing out the sciences mentioned in the article. Richard-of-Earth (talk) 00:55, 29 April 2021 (UTC)


 * Thank you for your comment. Maybe I am missing something, but it is still unclear to me where any of the sources explicitly state that the social sciences are more severely affected than other disciplines. They only state that these fields are affected, without an explicit comparison in severity to other fields. Furthermore, your first citation mentions genetics and biology, which belong to neither the social sciences nor medicine. Therefore, even if we ignore the point that no explicit comparison of different fields is apparent in the cited sources, biology would also have to be included in the above statement. Jochen Harung (talk) 17:52, 29 April 2021 (UTC)


 * I felt it was implied since other subjects are not mentioned, but perhaps that is a sort of WP:OR. Sure, just remove it. The readers can decide for themselves. Richard-of-Earth (talk) 19:36, 1 May 2021 (UTC)

Regarding the hat note
...that I recently added. Please compare the hat note to that of Reproducibility. Thanks CapnZapp (talk) 11:27, 26 October 2021 (UTC)

Small typo?
In the section: Historical and philosophical roots; sentence in the last paragraph: "This theory holds that each "system" such as economy, science, religion or media on communicates using its own code: true/false for science, profit/loss for the economy, news/no-news for the media, and so on.". (new->news?)--S.POROY (talk) 12:53, 30 January 2022 (UTC)

Misuse of Prevalence section
I've noticed that some people are adding content to the Prevalence section that is not about quantitative measures of replicability and QRPs, likely because other sections are not subdivided by field. –LaundryPizza03 ( d c̄ ) 02:17, 24 March 2022 (UTC)

Theory-Crisis
In the field of metascience, a growing number of recent publications are concerned with how theoretical as opposed to methodological or statistical shortcomings might be the cause of low replication rates, at least in psychological science. Some of these even talk about a "Theory-crisis" in psychology. I was wondering if it would make sense to create a separate subsection under "Causes" to report on the considerations that have been made in this area of study concerning the replication crisis. Examples of these publications are:

Fiedler, K. (2017). What constitutes strong psychological science? The (neglected) role of diagnosticity and a priori theorizing. Perspectives on Psychological Science, 12(1), 46-61. https://doi.org/10.1177/1745691616654458

Oberauer, K., & Lewandowsky, S. (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review, 26(5), 1596-1618. https://doi.org/10.3758/s13423-019-01645-2

Oude Maatman, F. (preprint). Psychology's Theory Crisis, and Why Formal Modelling Cannot Solve It. PsyArxiv. https://doi.org/10.31234/osf.io/puqvs

Szollosi, A., & Donkin, C. (2021). Arrested theory development: The misguided distinction between exploratory and confirmatory research. Perspectives on Psychological Science, 16(4), 717-724. https://doi.org/10.1177/1745691620966796

ProgressiveProblemshift (talk) 16:43, 15 May 2023 (UTC)

Missing reference + unclear sentence
In the section "Background", the explanation of how NHST works ends by saying "Although p-value testing is the most commonly used method, it is not the only method.". This sentence is missing a reference, but on top of that I would argue it is not very clear. It raises the question: "The only method for what?". Given the content of that paragraph one could say it most likely means "not the only method to test significance", but since the page is on replication, I'd say that it would make more sense if it referred to methods for establishing whether findings were successfully replicated in general. In such a case, I have a good reference in Nosek et al. (2022), where a small section at the beginning is dedicated to defining when we can say that original findings were replicated (i.e. "How do we decide whether the same occurred again?", p. 722), in which the authors describe different methodologies and criteria by which replications are defined as successful. I'd love to do it myself, but I would need confirmation that this edit makes sense!

Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., Fidler, F., Hilgard, J., Struhl, M. K., Nuijten, M. B., Rohrer, J. M., Romero, F., Scheel, A. M., Scherer, L. D., Schönbrodt, F. D., & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719-748. https://doi.org/10.1146/annurev-psych-020821-114157 ProgressiveProblemshift (talk) 13:04, 22 May 2023 (UTC)

Wrong statement
In the subsection "Result-blind peer review", it is reported that "more than 140 psychology journals have adopted result-blind peer review". I believe this statement is wrong. If one checks the website article that's cited, the author says that 140 journals at the time were using registered reports (which imply result-blind peer review). She says "journals" without mentioning specific disciplines. If one checks the source she cites, it's the COS web page on registered reports, so I assume that to come up with the 140 number she probably checked the COS's page on TOP scores for journals. By consulting that page, presently only 46 journals adopt some form of registered reports. The statement should be changed or just deleted since, I believe, it's misleading and incorrect. Here one can see the stats of psychology journals when it comes to adopting registered reports: https://topfactor.org/journals?factor=Registered+Reports+%26+Publication+Bias&disciplines=Psychology&page=3 ProgressiveProblemshift (talk) 14:14, 14 July 2023 (UTC)