Talk:Selection bias

Removing Selection Bias?
Might be handy if this page included some information about removing selection bias - but I don't have the knowledge. Cached 11:42, 5 December 2005 (UTC)
 * I just started a new section on this. Tobacman 20:08, 11 January 2006 (UTC)
 * What about using techniques like difference-in-differences and PSM to mitigate selection bias? --27.104.144.248 (talk) 02:58, 22 October 2020 (UTC)

"Selection bias" vs "sampling bias"
I believe that much of this article as written so far pertains to sampling bias, not selection bias, and hence the article clamors for substantial revision. Properly, selection bias refers to bias in the estimation of the causal effect of a treatment because of heterogeneous selection into the group that receives the treatment. Sampling bias refers to biases in the collection of data. (See also the article on bias (statistics).) Tobacman 20:08, 11 January 2006 (UTC) The stub on censored regression models also helps make this distinction and, given that this is one strategy used to overcome these kinds of biases, should probably be linked into this article. Albertivan 13:31, 30 March 2007 (UTC)
 * I agree that the difference between sample and selection bias is not clearly stated. A trusted reference when describing this difference is greatly appreciated. -- steffen (09/09/08) —Preceding unsigned comment added by 85.115.15.50 (talk) 08:49, 9 September 2008 (UTC)
 * I made a distinction, but it's rather the answers from lecturers I've asked than any definite definition, since I've found none so far. However, the following text below which was there before I came needs some clarification - can not unconscious manipulation also be accidental? Mikael Häggström (talk) 17:23, 30 October 2009 (UTC)
 * Malmquist bias is a sample bias, not a selection bias! You are not selecting your data, you get the bias from the sampling! Generally, you cannot avoid Malmquist bias. You could only do so by being able to observe well fainter than the population's faintest object. SkZ (talk) 02:26, 12 April 2023 (UTC)

Another possible distinction is that sampling bias is rather produced by an accidental or instrumental bias in the sampling technique, as against a deliberate or unconscious manipulation.

Self-selection Bias in Online Polling
I am interested in testing the validity of online polling, under which the assumption is made that the set of participants in a poll is defined by the persons who view the website for other reasons, and who participate in the poll based upon their discovery of its presence.

I am especially interested in determining the validity of such a sample when affected by a self-selection bias, particularly when a subset of participants with a predetermined answer to a poll, is selected by self-recruitment - that is, by other means than discovery only upon visitation of the website, such as mutual e-mail notification.

To what degree can such participants bias the results of the poll, in comparison to their relative participation?

Any thoughts on approach?
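
One rough way to frame the question is as a mixture. Suppose a fraction f of all respondents were recruited and all give the same answer, while visitors who found the poll on their own answer "yes" with probability p. The observed "yes" share is then (1 - f) * p + f. The sketch below assumes exactly that simplified model; the function names and the numbers are illustrative assumptions, not measurements.

```python
def observed_yes_share(p_organic: float, recruited_fraction: float) -> float:
    """Share of 'yes' votes when every recruited respondent votes 'yes'.

    p_organic: 'yes' rate among visitors who found the poll on their own.
    recruited_fraction: fraction of all respondents who were recruited.
    """
    if not (0.0 <= p_organic <= 1.0 and 0.0 <= recruited_fraction <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    return (1.0 - recruited_fraction) * p_organic + recruited_fraction


def bias(p_organic: float, recruited_fraction: float) -> float:
    """Shift in the 'yes' share caused by the recruited respondents."""
    return observed_yes_share(p_organic, recruited_fraction) - p_organic


# Hypothetical poll: 40% organic support, varying recruited share.
for f in (0.0, 0.1, 0.25, 0.5):
    print(f, round(observed_yes_share(0.4, f), 3), round(bias(0.4, f), 3))
```

Under these assumptions the shift equals f * (1 - p), so the distortion grows in proportion to the recruited share and is largest when organic support for the recruited position is low.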

Nothing apart from appreciating

Heckman is smart, yes. Why doesn't he make his model clear for public consumption? It's too mathematical.

Stephen mpumwire, Fortfortal

From "invalid" to "wrong".
I am changing one word in the first paragraph again. Conclusions drawn from statistical analysis are inductive, not deductive. As such, inductive arguments exhibit a valence of strength. The language of logic dictates that deductive arguments are either valid or invalid, and either sound or unsound. Since only deductive arguments have validity, it does not make sense to refer to statistical inductive arguments as valid. It makes sense to refer to statistical arguments that suffer from a selection bias as weak. Weak arguments are inductive in nature and are not likely to preserve truth. Kanodin 09:35, 9 July 2007 (UTC)


 * I put invalid back. I'll explain later, when I have time.--Boffob 11:14, 9 July 2007 (UTC)
 * OK, I'm back. The issue here is not the inductive nature of statistical arguments, it's the validity of the underlying assumptions of the statistical analysis. Statistical arguments are inductive in nature but they should tend to the truth as the sample size increases (particularly, if one could have the entire population as a sample and were able to observe all quantities of interest, then the truth would be known with probability 1). Ignoring selection bias means relying on plainly wrong assumptions, namely that the data are a random sample from the target population (when they are not). The statistics and probabilities computed under such invalid assumptions will hence not merely be weak, they will not represent reality, in the sense that you will not obtain consistent estimators for the quantities wanted (those that relate to the target population). The big issue with selection bias, like with any sampling bias, is that merely increasing sample size without correcting the sampling technique will not alleviate inferential problems.--Boffob 14:45, 9 July 2007 (UTC)
 * I see your point that you want to show that statistics that fail to account for selection bias are likely to be wrong, but it is not enough to call an argument invalid simply because it relies on plainly wrong assumptions. My central point is that using 'validity' to describe an inductive argument is wrong. "Weak" captures the spirit of what we are saying. It is only appropriate to use validity to describe deductive arguments that meet the criteria of validity. If we were to use validity to describe an argument, we would be saying that it is impossible for the premises to be true and the conclusion false. Statistical arguments, by definition, make no such assertion--often times statisticians provide a 'p' value, which describes the likelihood that the results are a statistical artifact. In fact, all statistical arguments that sample less than the entire population are invalid, not merely the defective ones. Most statistical arguments are inductive. The inductive analogue for validity is strength. A strong argument is such that if its premises are true, then the conclusion is probably true. Weak arguments are the negation: If an argument is weak, it is not the case that if the premises are true, then the conclusion is probably true. In the verbiage of logic, I see no problem with describing defective inductive arguments as weak, but I can understand why you would want more strength behind the evaluation in the article. Perhaps it would be acceptable to word it like this: "Statistical arguments that do not take selection bias into account are inductively weak, and their conclusions are likely to be wrong because the tainted sample does not represent the population." I think it would be fine to use stronger language, but using 'invalid' in a place where it does not belong hurts the credibility of the article. Kanodin 19:18, 11 July 2007 (UTC)
 * It seems our quarrel is about whose technical lingo we should use. I put "invalid" instead of "weak", under a layman's definition (after all, some of the premises are known to be false in the presence of selection bias), because it is concise and conveys the importance of the possible distortions caused by selection bias better than "may be weak". Your argument is that, strictly using the logic definition of "validity", this is not appropriate because statistical arguments boil down to the expression of probability of a conclusion given a set of premises in one form or another (you give the example of p-values which are the probability of observing the data, or more extreme data, given that the null hypothesis is true). I have to say that reading the article in terms of logic verbiage appears to be déformation professionnelle (my own déformation would read "may be weak" as "may converge weakly"). While I appreciate the compromise you offer, I don't see why one should burden the article through strict adherence to the vocabulary of formal logic.--Boffob 23:13, 11 July 2007 (UTC)
 * There is a good possibility that someone may read this article and assume that invalid is used from a logician's standpoint. While I am averse to using invalid in the way the article prints it, it is also not essential that the article uses the word 'weak'. So, maybe another compromise would be to use neither 'weak' nor 'invalid'. The sentence could then be simplified as: "Statistical arguments that do not take selection bias into account often produce wrong conclusions." Kanodin 23:37, 12 July 2007 (UTC)
 * Now we are getting somewhere. How about just replacing "invalid" with "wrong"? Here, the expression "statistical arguments" is again, sort of a logician's vocabulary. More often, in natural and social sciences, we're talking about the statistical analysis of the results of an experiment or data collected from a study; the term "argument" is not in common use as far as I know. And then the qualifier "often" may not be entirely correct either, as the amount of bias induced may be relatively small in many cases (possibly more often than not, who knows), but we still want to express the fact that selection bias could completely skew the conclusions to the opposite of what they should be (which "may be weak" does not convey, unless one is a logician).--Boffob 01:18, 13 July 2007 (UTC)
 * ETA: I made the change. I think it reads OK, though that doesn't leave out the possibility of a more elegant rephrase.--Boffob 01:23, 13 July 2007 (UTC)
 * Looks good. Kanodin 08:38, 13 July 2007 (UTC)
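
The point made in this thread, that increasing the sample size does not repair a biased selection mechanism, can be illustrated with a small simulation. This is only a sketch: the population (standard normal, true mean 0) and the inclusion probabilities (0.9 for positive values, 0.3 otherwise) are assumptions chosen for illustration.

```python
import random


def biased_sample_mean(n: int, rng: random.Random) -> float:
    """Mean of n accepted draws under a value-dependent selection rule.

    Population values are standard normal (true mean 0). A unit with
    value x is kept with probability 0.9 if x > 0 and 0.3 otherwise,
    so high values are overrepresented in the sample.
    """
    total = 0.0
    kept = 0
    while kept < n:
        x = rng.gauss(0.0, 1.0)
        keep_prob = 0.9 if x > 0 else 0.3
        if rng.random() < keep_prob:
            total += x
            kept += 1
    return total / n


rng = random.Random(0)
for n in (100, 10_000, 100_000):
    # The estimate stabilises as n grows, but not around the true mean of 0.
    print(n, round(biased_sample_mean(n, rng), 3))
```

Under these assumed probabilities the sample mean settles near 0.5 * sqrt(2 / pi), roughly 0.4, rather than near 0. More data makes the estimate more precise without making it any less biased, which is what it means for the estimator to be inconsistent.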

Truman vs. Dewey
Sherry Seethaler gives the case of the Chicago Tribune headline, "Dewey Defeats Truman" which was based in part on a telephone survey. At the time, telephones were expensive items whose owners tended to be in the elite - who favored Dewey much more than the average voter. Where should this go in the article? In the Participants section? --Uncle Ed (talk) 20:10, 20 May 2009 (UTC)

Unclear relation
I moved this line below to here from after Partitioning data with knowledge of the contents of the partitions, and then analyzing them with tests designed for blindly chosen partitions, because I think the examples need more description here in order to be self-explanatory in the article, without the reader having to read all those target articles to understand why they even should be read. Mikael Häggström (talk) 15:37, 24 September 2009 (UTC)

(see stratified sampling, cluster sampling, Texas sharpshooter fallacy)

Circular Analysis and Double Dipping
I started an article specifically on circular analysis/double dipping. In my field of fMRI this might occur where a researcher adjusts parameters in a classifier to improve its accuracy, or stops adding pre-processing steps when the result 'reaches significance'. I think it definitely needs a section or something on this page. Also a redirect from circular analysis and double dipping. Here's the initial version of the page:

Circular Analysis is the selection of parameters of an analysis using the data to be analysed. It is often referred to as double dipping, as one uses the same data twice. Circular analysis inflates the statistical strength of results and, at the most extreme, can result in a strongly significant result from noise.

Examples
At its simplest, it can include the decision to remove outliers, after noticing this might help improve the analysis of an experiment. The effect can be more subtle. In fMRI data, for example, a considerable amount of pre-processing is often needed. These steps might be applied incrementally until the analysis 'works'. Similarly, the classifiers used in a multivoxel analysis of fMRI data require parameters, which could be tuned to maximise the classification accuracy.

Solutions
Careful design of the analysis one plans to perform, prior to collecting the data, means the analysis choice is not affected by the data collected. Alternatively, one might decide to perfect the classification on one or two participants, and then use the analysis on the remaining participant data. Regarding the selection of classification parameters, a common method is to divide the data into two sets, and find the optimum parameter using one set and then test using this parameter value on the second set. This is a standard technique used (for example) by the Princeton MVPA classification library.
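
A toy illustration of the split described above, using pure noise so that any apparent accuracy is an artifact. This is a hedged sketch, not the method of any particular library: the "classifier" simply thresholds one feature at zero, and the function names, sample counts and feature counts are all assumptions made up for the example.

```python
import random


def accuracy(feature, labels, sign):
    """Fraction of samples where sign * feature has the same sign as the label."""
    hits = sum(1 for x, y in zip(feature, labels) if (sign * x > 0) == (y > 0))
    return hits / len(labels)


def best_feature(features, labels):
    """Pick the (index, sign) pair with the highest accuracy on the given data."""
    best = (0, 1, -1.0)
    for j, feat in enumerate(features):
        for sign in (1, -1):
            acc = accuracy(feat, labels, sign)
            if acc > best[2]:
                best = (j, sign, acc)
    return best[0], best[1]


def run(n_samples=40, n_features=200, seed=0):
    """Return (circular accuracy, held-out accuracy) on pure-noise data."""
    rng = random.Random(seed)
    labels = [rng.choice((-1, 1)) for _ in range(n_samples)]
    # Pure noise: no feature carries any information about the labels.
    features = [[rng.gauss(0, 1) for _ in range(n_samples)]
                for _ in range(n_features)]

    # Circular analysis: select and evaluate on the same samples.
    j, s = best_feature(features, labels)
    circular = accuracy(features[j], labels, s)

    # Split: select on the first half, evaluate on the second half.
    half = n_samples // 2
    first_half = [f[:half] for f in features]
    j, s = best_feature(first_half, labels[:half])
    held_out = accuracy(features[j][half:], labels[half:], s)
    return circular, held_out


results = [run(seed=s) for s in range(20)]
print("mean circular accuracy:", sum(c for c, _ in results) / len(results))
print("mean held-out accuracy:", sum(h for _, h in results) / len(results))
```

With these settings the circular estimate comes out well above chance even though the data contain no signal, while the held-out estimate averages close to 0.5. The cost of the split is that only half of the samples are available for each step.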

Dr. Nosenzo's comment on this article
Dr. Nosenzo has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

"I am not sure there is an agreement across disciplines on the labels and definitions of "selection" and "sampling" biases. To my understanding, "selection bias" occurs when a rule other than simple random sampling is used to sample the underlying population that is the object of interest, resulting in a distorted representation of the true population. [source: http://www.dictionaryofeconomics.com/article?id=pde2008_S000084]. Sampling bias occurs when the selected sample is non-representative of the underlying population, which may hinder generalizability of findings. A source that discusses this distinction is: BRS Behavioral Science, Fifth Edition. This contradicts what is presented here in the Wiki article.

I also disagree with the classification of "types" of selection bias; some of the types listed in the article do not seem to relate, strictly speaking, to the issue of non-random sampling (eg the discussion of which studies are included in a meta-analysis; the discussion of data reporting and disclosure). I think these are distinct issues that have more to do with scientific malpractice than with sampling or selection bias."

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Nosenzo has expertise on the topic of this article, since he has published relevant scholarly research:


 * Reference : Anderson, Jon E. & Burks, Stephen V. & Carpenter, Jeffrey P. & Gotte, Lorenz & Maurer, Karsten & Nosenzo, Daniele & Potter, Ruth & Rocha, Kim & Rustichini, Aldo, 2010. "Self Selection Does Not Increase Other-Regarding Preferences among Adult Laboratory Subjects, but Student Subjects May Be More Self-Regarding than Adults," IZA Discussion Papers 5389, Institute for the Study of Labor (IZA).

ExpertIdeasBot (talk) 16:34, 2 August 2016 (UTC)

Dr. Turon's comment on this article
Dr. Turon has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

"Sampling bias is systematic error due to a non-random sample of a population,[2] causing some members of the population to be less likely to be included than others, resulting in a biased sample, defined as a statistical sample of a population (or non-human factors) in which all participants are not equally balanced or objectively represented.[3] It is mostly classified as a subtype of selection bias,[4] sometimes specifically termed sample selection bias,[5][6][7] but some classify it as a separate type of bias.[8]

My suggestion: Sampling bias is systematic error due to a non-random sample of a population,[2] causing some members of the population to be less likely to be included than others, resulting in a biased sample. When the sample differs systematically from the population from which it was drawn, any statistical analysis of the sample will reflect features of the population in a biased manner. Suggestion to add in Types paragraph: "Incidental Truncation. Here, we do not observe [the outcome of interest] because of the outcome of another variable. The leading example is estimating the so-called wage offer function from labor economics. Interest lies in how various factors, such as education, affect the wage an individual could earn in the labor force. For people who are in the workforce, we observe the wage offer as the current wage. But, for those currently out of the workforce, we do not observe the wage offer. Because working may be systematically correlated with unobservables that affect the wage offer, using only working people (..) might produce biased estimators of the parameters in the wage offer equation." (quoted from "Introductory Econometrics" by J.M Wooldridge, 2nd ed, p. 585.)

Suggestion to add in the Mitigation paragraph: The Heckman sample selection correction consists in modeling the selection of individuals into the sample as well as the equation of interest. A crucial and often problematic ingredient of this method is that one must have access to data on variables which affect the selection process but do not affect the outcome of interest. This is called the exclusion restriction."
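
The incidental truncation described in the quoted wage offer example can be sketched with simulated data. This shows only the bias that a selection correction is meant to address; it does not implement the Heckman correction itself. Every number here (a true education slope of 0.1, a correlation of 0.8 between the two unobservables, the participation rule) is an assumption invented for the illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=50_000, true_slope=0.1, rho=0.8, seed=0):
    """Return (slope using everyone, slope using workers only)."""
    rng = random.Random(seed)
    educ_all, wage_all, educ_obs, wage_obs = [], [], [], []
    for _ in range(n):
        educ = rng.uniform(8, 20)
        v = rng.gauss(0, 1)  # unobservable driving the decision to work
        # u is correlated with v: unobservables that raise the wage
        # offer also raise the chance of being in the workforce.
        u = rho * v + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        wage_offer = 1.0 + true_slope * educ + u
        works = 0.3 * (educ - 12) + v > 0
        educ_all.append(educ)
        wage_all.append(wage_offer)
        if works:  # in real data the wage offer is seen only for workers
            educ_obs.append(educ)
            wage_obs.append(wage_offer)
    return ols_slope(educ_all, wage_all), ols_slope(educ_obs, wage_obs)


full, workers_only = simulate()
print(f"all individuals: {full:.3f}, workers only: {workers_only:.3f}")
```

In this setup the regression on everyone recovers a slope close to the assumed 0.1, which is possible only because the simulation can see wage offers for non-workers. The workers-only regression comes out noticeably lower, because less-educated people who work tend to have unusually favourable unobservables.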

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Turon has expertise on the topic of this article, since he has published relevant scholarly research:


 * Reference : Dickson, Matt & Postel-Vinay, Fabien & Turon, Helene, 2014. "The Lifetime Earnings Premium in the Public Sector: The View from Europe," IZA Discussion Papers 8159, Institute for the Study of Labor (IZA).

ExpertIdeasBot (talk) 02:35, 6 September 2016 (UTC)

Cognitive bias, statistical bias, Metaphysical bias
As others have noted above, this page confounds multiple concepts. I'm particularly concerned with the section on "observer bias" and reference to "anthropic reasoning". I'm commenting on the talk page of Anthropic Principle too. I haven't yet read the Tegmark or Bostrom books cited there, but I don't think such WP:FRINGE metaphysical arguments should be presented without caveat on this page. DolyaIskrina (talk) 16:06, 2 July 2019 (UTC)
 * I attributed the statement to philosopher Nick Bostrom, which at least gives the reader some heads up that we are suddenly not talking about selection or sampling bias, but something philosophical (also some would argue cosmological). I propose moving the "observer bias" down to ==See also==. I would also cut the paragraph about meteors impacting the earth, which I don't find very informative and which, as described, is prima facie wrong. DolyaIskrina (talk) 00:46, 11 July 2019 (UTC)

Better example?
Vaccines causing autism might be a better example of Types/susceptibility bias. — Preceding unsigned comment added by 2606:A000:825A:500:1D9:45B5:6018:DA31 (talk) 04:12, 10 November 2019 (UTC)

Merging article Cherry picking to here
Both are the same subject. Crashed greek (talk) 06:04, 16 November 2022 (UTC)
 * Strong oppose – I don't think these are the same subject. And if they were you're gonna need a lot more than one sentence to justify it. Consider an upcoming vote on a controversial topic, proposition T, for instance, as an example.
 * Suppose a survey company, Big Survey, conducts 10 opinion polls on the street, and in 9 of those, proposition T is defeated. But then in a public vote, it passes by a large margin. We might then conclude there was a selection bias in the opinion polls: people who supported proposition T were less likely to answer the survey, and where they did, were more likely to be dishonest (because the topic is controversial). Or perhaps, proposition T was more strongly opposed by the elderly, who were more likely to answer because they are retired, and have more time to do so, and so they were overrepresented in the opinion polls. This is a selection bias.
 * Then, in the aftermath, under fire from whoever paid for the opinion polls, Big Survey claims they did correctly control for selection bias factors, and the proof is one of their published surveys did predict that proposition T would pass, which is in fact true. This is cherry picking, because it is technically true, but only supports their claim of accuracy if we're willing to ignore all the data that didn't support their narrative. - 79.141.43.10 (talk) 14:39, 23 November 2022 (UTC)
 * Oppose: There is an explanatory line on the page that explicitly differentiates the two: "Cherry picking, which actually is not selection bias, but confirmation bias...". Selection bias is a function of flaws in the methodological approach to data sampling and handling. Confirmation bias is the conscious or unconscious manipulation of already extant data sets. Iskandar323 (talk) 05:55, 24 November 2022 (UTC)
 * Oppose - if that one sentence is your only reason then you are wrong per the Selection bias article itself FuzzyMagma (talk) 14:06, 24 April 2023 (UTC)