Experimental benchmarking

Experimental benchmarking allows researchers to learn about the accuracy of non-experimental research designs. Specifically, one can compare observational results to experimental findings to calibrate bias. Under ordinary conditions, carrying out an experiment gives the researchers an unbiased estimate of their parameter of interest. This estimate can then be compared to the findings of observational research. Note that benchmarking is an attempt to calibrate non-statistical uncertainty (flaws in underlying assumptions). When combined with meta-analysis this method can be used to understand the scope of bias associated with a specific area of research.
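The comparison described above can be illustrated with a small simulation. The following is a minimal sketch rather than a procedure taken from any of the studies cited in this article: the data-generating process, the effect size, and the names simulate_benchmark and difference_in_means are all assumptions made for illustration. The same outcome is estimated once under random assignment and once under self-selection driven by an unobserved attribute, and the gap between the two estimates is the quantity a benchmarking study tries to measure.

```python
import numpy as np


def difference_in_means(y, d):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    return y[d == 1].mean() - y[d == 0].mean()


def simulate_benchmark(n=20000, true_effect=2.0, seed=0):
    """Return (experimental estimate, observational estimate, gap).

    All numbers here are made up for illustration.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)       # background attribute the analyst does not observe
    noise = rng.normal(size=n)

    # Experimental design: treatment is assigned by a fair coin flip.
    d_exp = rng.integers(0, 2, size=n)
    y_exp = true_effect * d_exp + 1.5 * u + noise

    # Observational design: units with higher u are more likely to take treatment.
    take_up = 1.0 / (1.0 + np.exp(-u))
    d_obs = (rng.random(n) < take_up).astype(int)
    y_obs = true_effect * d_obs + 1.5 * u + noise

    experimental = difference_in_means(y_exp, d_exp)
    observational = difference_in_means(y_obs, d_obs)
    return experimental, observational, observational - experimental


experimental, observational, gap = simulate_benchmark()
print(f"experimental estimate:  {experimental:.2f}")
print(f"observational estimate: {observational:.2f}")
print(f"benchmark gap (bias):   {gap:.2f}")
```

In this simulated setting the true effect is known, so the gap can be read directly as bias. In a real benchmarking study the experimental estimate stands in for the truth and carries its own sampling error.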

History
The start of experimental benchmarking in social science is often attributed to Robert LaLonde. In 1986 he found that econometric procedures assessing the effect of an employment program on trainee earnings did not recover the experimental findings.

Experimental benchmarking is also often conducted in medical research; examples include Schnell‐Inderst et al. (2017) and Burden et al. (2017).

Procedural Considerations
The most instructive experimental benchmarking designs are done on a large scale. They also compare experimental and non-experimental work that looks at the same outcome and the same population.

Observational Designs That Can Be Assessed with Benchmarking
Non-experimental, or observational, research designs compare treated to untreated subjects while controlling for background attributes (called covariates). This estimation approach can also be called covariate adjustment. Covariates are attributes that exist prior to experimentation and therefore do not change based on treatment. Examples include age, gender, weight, and hair color. For example, if researchers are interested in the effect of smoking cessation classes on the number of cigarettes smoked a day, they may carry out covariate adjustment to control for ethnicity, income and the number of years the smoker has been smoking.
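As a rough illustration of covariate adjustment, the sketch below applies regression adjustment to simulated data loosely modelled on the smoking cessation example. Everything in it is an assumption made for illustration: the variable names, the effect of four fewer cigarettes a day, and the rule by which long-term smokers are more likely to attend classes. The function regression_adjusted_effect is a hypothetical helper, not part of any library.

```python
import numpy as np


def regression_adjusted_effect(y, d, covariates):
    """Coefficient on treatment d from an OLS fit of y on d and the covariates."""
    X = np.column_stack([np.ones(len(y)), d, covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]  # column order: intercept, treatment, covariates


# Simulated data; all numbers are invented for illustration.
rng = np.random.default_rng(1)
n = 20000
years_smoking = rng.uniform(1, 30, size=n)

# Long-term smokers are assumed to be more likely to attend classes.
p_attend = 1.0 / (1.0 + np.exp(-(years_smoking - 15) / 5))
attended = (rng.random(n) < p_attend).astype(int)

# Assumed true effect of attending: 4 fewer cigarettes per day.
cigarettes = 10 + 0.5 * years_smoking - 4 * attended + rng.normal(0, 3, size=n)

naive = cigarettes[attended == 1].mean() - cigarettes[attended == 0].mean()
adjusted = regression_adjusted_effect(cigarettes, attended, years_smoking)
print(f"unadjusted difference in means: {naive:.2f}")
print(f"regression-adjusted estimate:   {adjusted:.2f}")
```

The adjustment works here only because the single confounder is observed and enters the outcome linearly, both of which are built into the simulation. Benchmarking exists to check whether assumptions of this kind hold in real data.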

Covariate adjustment can be carried out in a variety of ways. Gordon et al. (2018) illustrate many of these methods, such as propensity score matching, stratification, regression adjustment, and inverse probability weighted regression adjustment, using online advertising data. They find that despite great variation in variables within their data, observational methods cannot recover the causal effects of online advertising. The study ultimately provides evidence that without a randomized controlled trial, it is impossible to detect symptoms of bias. Bias is not always in the same direction or of the same magnitude.
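Two of the propensity-score approaches named above, stratification and inverse probability weighting, can be sketched as follows. This is not the implementation used by Gordon et al. (2018). It is a simplified illustration on simulated data, with a logistic propensity model fitted by Newton's method, and the function names fit_logistic, ipw_effect and stratified_effect are hypothetical.

```python
import numpy as np


def fit_logistic(X, d, n_iter=25):
    """Fit a logistic regression of d on X by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (d - p)
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def ipw_effect(y, d, propensity):
    """Inverse probability weighted difference in means (normalized weights)."""
    w1 = d / propensity
    w0 = (1 - d) / (1 - propensity)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)


def stratified_effect(y, d, propensity, n_strata=5):
    """Size-weighted average of within-stratum differences in means."""
    edges = np.quantile(propensity, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, propensity, side="right") - 1,
                     0, n_strata - 1)
    total, weight = 0.0, 0
    for s in range(n_strata):
        m = strata == s
        if d[m].sum() == 0 or (1 - d[m]).sum() == 0:
            continue  # skip strata without both treated and untreated units
        diff = y[m][d[m] == 1].mean() - y[m][d[m] == 0].mean()
        total += diff * m.sum()
        weight += m.sum()
    return total / weight


# Simulated data with one observed confounder x; the true effect is set to 1.0.
rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
d = (rng.random(n) < 1.0 / (1.0 + np.exp(-0.8 * x))).astype(int)
y = 1.0 * d + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
propensity = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, d)))

naive = y[d == 1].mean() - y[d == 0].mean()
ipw = ipw_effect(y, d, propensity)
stratified = stratified_effect(y, d, propensity)
print(f"naive: {naive:.2f}  IPW: {ipw:.2f}  stratified: {stratified:.2f}")
```

Both estimators move toward the simulated effect because the only confounder is observed and included in the propensity model. When relevant attributes are unobserved the same code would return a biased estimate with no internal warning, which is the situation benchmarking against an experiment is meant to reveal.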

Selected Examples of Experimental Benchmarking
Bloom et al. (2002) look at the impact of mandatory welfare-to-work programs to ask which non-experimental methods get closest to recovering the experimentally estimated effects of such programs. They also ask whether the most accurate non-experimental methods are accurate enough to take the place of experimental work. They ultimately argue that none of the methods approach the accuracy of experimental methods for recovering the parameter of interest.

Dehejia and Wahba (1999) examine LaLonde's (1986) data with additional non-experimental findings. They argue that when there is enough subject pool overlap and unobservable covariates do not impact outcomes, non-experimental methods can indeed estimate treatment impact accurately.

Glazerman, Levy and Myers (2003) perform experimental benchmarking in the context of employment services, welfare and job training. They determine that non-experimental methods may approximate experimental estimates; however, these estimates can be biased enough to affect policy analysis and implementation.

Gordon et al. (2018) use data from Facebook to see whether the variation in data collected by the advertising industry allows observational methods to recover the causal effects of online advertising. Specifically, the study analyzes the effectiveness of Facebook ads on three outcomes: checkout, registration and page view. They find that despite the great variation made possible by the nature of social media, it is not possible to accurately recover the causal effects.

Medicine
Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 159–183.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1-21.

Social Sciences
Smith, Jeffrey, and Petra Todd. 2005. "Does Matching Overcome LaLonde's Critique of Nonexperimental Methods?" Journal of Econometrics 125(1-2):305-353.