User:Dandinpantelimon/sandbox

Expert Judgment (EJ) denotes a wide variety of techniques, ranging from a single undocumented opinion, through preference surveys, to formal elicitation with external validation of expert probability assessments. In the nuclear safety area, Rasmussen et al (1975) formalized EJ by documenting all steps in the expert elicitation process for scientific review. This made visible wide spreads in expert assessments and raised questions regarding the validation and synthesis of expert judgments. The nuclear safety community later adopted expert judgment techniques underpinned by external validation (for a review see Cooke 2012). Empirical validation is the hallmark of science, and forms the centerpiece of the classical model of probabilistic forecasting (Cooke 1991). A European Network coordinates workshops. Application areas include nuclear safety, investment banking, volcanology, public health, ecology, engineering, climate change and aeronautics/aerospace. For a survey of applications through 2006 see Cooke and Goossens (2008). Aspinall (2010) and Sutherland and Burgman (2015) give exhortatory overviews. A recent large-scale implementation by the World Health Organization is described in (Aspinall et al 2016; Hald et al 2015). A long-running application at the Montserrat Volcano Observatory is described in (Aspinall et al 2002; Aspinall 2006; Wadge and Aspinall 2014).

The classical model scores expert performance in terms of statistical accuracy (sometimes called calibration) and informativeness (Cooke et al 1988). These terms should not be confused with “accuracy and precision”. Accuracy “is a description of systematic errors” while precision “is a description of random errors”. In the classical model, statistical accuracy is measured as the p-value, that is, the probability with which one would falsely reject the hypothesis that an expert’s probability assessments are statistically accurate. A low value (near zero) means it is very unlikely that the discrepancy between an expert’s probability statements and observed outcomes arose by chance. Informativeness is measured as Shannon relative information (or Kullback-Leibler divergence) with respect to an analyst-supplied background measure. Shannon relative information is used because it is scale invariant, tail insensitive, slow, and familiar. Parenthetically, measures with physical dimensions, such as the standard deviation or the width of prediction intervals, raise serious problems, as a change of units (meters to kilometers) would affect some variables but not others. The product of statistical accuracy and informativeness for each expert is that expert’s combined score. With an optimal choice of a statistical accuracy threshold beneath which experts are unweighted, the combined score is a long-run “strictly proper scoring rule”: an expert achieves their long-run maximal expected score by, and only by, stating their true beliefs. The classical model derives Performance Weighted (PW) combinations. These are compared with Equally Weighted (EW) combinations, and recently with Harmonically Weighted (HW) combinations, as well as with individual expert assessments.
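The scoring described above can be illustrated with a minimal sketch. Several details here are assumptions rather than statements from this text: experts are taken to give 5%, 50% and 95% quantiles for each calibration variable, the p-value is taken from the chi-square approximation to the likelihood-ratio statistic 2N·I(s, p), the cutoff of 0.05 is a placeholder rather than an optimized threshold, and the informativeness score is supplied by the caller instead of being computed from a background measure. All function names are hypothetical.

```python
import math

# Assumed elicitation format: 5%, 50% and 95% quantiles, so each realization
# falls into one of four inter-quantile bins with these probabilities.
P = (0.05, 0.45, 0.45, 0.05)


def relative_information(s, p):
    """Shannon relative information I(s, p) = sum of s_i * ln(s_i / p_i)."""
    return sum(si * math.log(si / pi) for si, pi in zip(s, p) if si > 0)


def chi2_sf_3dof(x):
    """Survival function of the chi-square distribution with 3 degrees of
    freedom (closed form; matches the four bins assumed above)."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def statistical_accuracy(bin_counts, p=P):
    """Approximate p-value of the hypothesis that realizations fall into the
    expert's inter-quantile bins with probabilities p."""
    n = sum(bin_counts)
    s = [c / n for c in bin_counts]
    return chi2_sf_3dof(2 * n * relative_information(s, p))


def combined_score(accuracy, informativeness, cutoff=0.05):
    """Product of the two scores; experts below the cutoff are unweighted.
    The cutoff value is an illustrative assumption."""
    return accuracy * informativeness if accuracy >= cutoff else 0.0


def performance_weights(scores):
    """Normalize combined scores into weights that sum to one."""
    total = sum(scores)
    return [sc / total for sc in scores] if total > 0 else [0.0] * len(scores)
```

For example, an expert whose 20 realizations land in the four bins as `[1, 9, 9, 1]` matches the theoretical proportions exactly and receives a statistical accuracy of 1, whereas `[8, 2, 2, 8]` (too many realizations outside the 90% interval) receives a value very close to zero.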

While some mathematicians and decision analysts regard combining expert judgments as a mathematical problem, the classical model regards expert combination as more akin to an engineering problem. A bicycle obeys Newton's Laws but does not follow from them. It is designed to optimize performance under constraints. Similarly, expert judgment combination is viewed as a tool for enabling rational consensus by optimizing performance measures under mathematical and decision theoretic constraints. The theory of rational consensus (Cooke 1991) is summarized in (Wittmann et al 2014; Cooke 2015).

Real expert judgment studies differ in many ways from research or academic exercises. The experts are typically recruited in a traceable peer nomination process based on their knowledge of and engagement with the subject of the study; they may receive remuneration. In all cases, experts’ reasoning is documented, and their names and affiliations are part of the reporting. However, to encourage candid judgments, individuals’ responses are not exchanged within the group and association of names with assessments is not reported in the open literature, but is preserved to enable peer review by the problem owner.

Elicitations typically last several hours; the elicitation protocol is formalized and is part of the public reporting. Elicitation styles differ among practitioners, including face-to-face interviews, with or without plenary briefing and training, and "supervised plenary". Remote elicitation is rarely used, but some recent studies use online face-to-face tools.

Why validate?
Since experts are invoked when quantities of interest are uncertain, the goal of structured expert judgment is a defensible quantification of uncertainty. Confronted with uncertainty, society at large will always harken to prophets, oracles, pundits, blue ribbon panels, or crowd wisdom reputed to have performed well in the past. Scientists and engineers, in contrast, are typically averse to any methodology that eschews empirical validation. Most invocations of expert judgment do not attempt any form of validation, as if the predicate “expert” were validation enough. The classical model’s emphasis on validation is its distinguishing feature. Virtually all validation data with real experts and real applications (as opposed to academic exercises) have been generated by practitioners with the classical model.

One of the first studies with experienced and inexperienced experts (Cooke et al 1988) showed that expert performance on questions from their field of expertise was not predicted by their performance on “almanac questions”. Experienced and inexperienced experts performed similarly on questions outside their field, but the experienced experts were much better on questions from their field. Hence, validation must be based on assessments of uncertain quantities from the experts’ field, for which we know, or will know, the true values within the time frame of the study. Such quantities are called “calibration” or “seed” variables.

Finding good calibration variables is difficult, and requires a deep dive into the subject matter at hand. The quality of the calibration, and the performance on calibration variables, buttresses the credibility of the whole study. At the end of the day, the problem owner will ask: “if expert A has very good performance on the calibration variables, whereas expert B has very poor performance, am I going to ignore that difference?” If the owner’s answer is “yes” then the calibration variables have failed in their purpose and the effort has been for naught.





Validation data


The best argument for validation of expert judgment is the expert judgment data itself. Whereas the pre-2006 data contains wide variations in numbers of experts and numbers of calibration variables, the 33 independent professionally contracted post-2006 elicitations are more uniform in design, better resourced, better documented and better lend themselves to aggregate presentation. The data comprise in total 320 experts. Figure 1 shows the distribution of experts over the number of assessed calibration variables.

The p-values are sensitive to the power of the statistical test, and hence to the number of calibration variables. These numbers are roughly comparable for experts in the post-2006 data. Figure 2 shows the p-values of all post-2006 experts, arranged from best to worst.

In this summary, 227 of the 320 experts have a statistical accuracy score less than 0.05, which is the traditional rejection threshold for simple hypothesis testing. Half of the experts score below 0.005, and roughly one third fall into the abysmal range below 0.0001. These numbers challenge the assumption that the predicate “expert” is a guarantee of quality with regard to uncertainty quantification.

There is, however, a bright side: 93 of the 320 experts would not be rejected as statistical hypotheses at the 5% level. Figure 3 shows the statistical accuracy of the best expert and second best expert for each of the 33 studies. “Best” is defined in terms of the individual’s combined score, which accounts for both statistical accuracy and informativeness, but is driven by statistical accuracy. The studies are arranged from best to worst according to their best expert.

Twenty-five of the 33 studies have at least one, and usually two or more, experts whose statistical accuracy is acceptable. Simply identifying those experts and relying on them would be a big improvement over unvalidated expert judgment (spotting good performers without measuring performance is a fool’s errand; see Flandoli et al 2011).

An oft-heard suggestion is “Why not ask the experts to weight each other?” They often know each other or each other’s work, and they may well concur on whose opinion should weigh heaviest. A moment’s reflection counsels caution, however. Could an expert with poor performance identify those other experts who perform well? Data on this question are sparse, but in a few studies experts were asked to rate each other, and the “group mutual rankings” were negatively correlated with the rankings in terms of performance (Aspinall and Cooke, 2013). Cooke et al (2008) compared the performance of various weighting schemes, including “citation weights” based on experts’ citation numbers, and found performance comparable to equal weighting.

In-sample validation
Data used to gauge expert performance can also be used to measure the performance of combination schemes. With regard to PW these are in-sample comparisons, as the validation data is also used to initialize the combination model. Were PW not superior in-sample, there would be little point in conducting out-of-sample validation. Lichtendahl et al (2013) suggest that averaging experts' quantiles might be superior to EW. Averaging quantiles is easier to compute than averaging distributions, and is frequently employed by unwary practitioners. Averaging quantiles is mathematically equivalent to Harmonically Weighted (HW) combinations of distributions. Figure 4 shows the p-values and combined scores of PW, EW, and HW for the 33 post-2006 studies, arranged according to PW scores. EW has the best combined score in 3 cases, HW in 4, and PW in 26. In 18 cases (55%) the hypothesis that HW is statistically accurate would be rejected at the 0.05 level. In 8 cases rejection would be at the 0.001 level.
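The difference between averaging quantiles and averaging distributions can be seen in a small sketch. The two experts below are hypothetical and are given normal distributions purely for illustration (real elicitations work with a few elicited quantiles, not parametric forms). The sketch compares the width of the central 90% interval under quantile averaging with that of the equally weighted mixture of the two distributions.

```python
from statistics import NormalDist

# Two hypothetical experts who disagree strongly (illustrative values only).
experts = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=10.0, sigma=1.0)]


def quantile_average(p):
    """Average the experts' p-quantiles with equal weights."""
    return sum(e.inv_cdf(p) for e in experts) / len(experts)


def mixture_cdf(x):
    """Equally weighted (EW) combination: average the experts' CDFs."""
    return sum(e.cdf(x) for e in experts) / len(experts)


def mixture_quantile(p, lo=-50.0, hi=50.0, tol=1e-9):
    """Invert the mixture CDF by bisection on a bracketing interval."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


qa_width = quantile_average(0.95) - quantile_average(0.05)
ew_width = mixture_quantile(0.95) - mixture_quantile(0.05)
print(f"90% interval width, quantile averaging: {qa_width:.2f}")
print(f"90% interval width, EW mixture:         {ew_width:.2f}")
```

In this toy case the quantile-averaged interval is roughly 3.3 units wide and sits between the two experts, while the EW mixture interval is roughly 12.6 units wide and covers both. Quantile averaging thus produces a much narrower combined distribution when experts disagree, which is one plausible reading of why its statistical accuracy can suffer.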

Out-of-sample validation


Since the variables of interest are rarely observed within the time frame of the studies, out-of-sample validation mostly reduces to cross-validation, whereby the model is initialized on a subset of the calibration variables (training set) and scored on the complementary set (test set). The difficulty is in choosing the training/test set split. If the training set is small, then the ability to resolve expert performance is small and the PW of each training set poorly resembles the PW of the real study. If the test set is small, then the ability to resolve differences in combination schemes is small. That said, Eggstaff et al (2014) considered all splits of training/test sets, and showed that PW outperformed EW out-of-sample.
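The bookkeeping behind such splits can be sketched as follows. This is not Eggstaff's code; the function names, the choice of 10 calibration variables and the training fraction of 0.8 are illustrative assumptions. The sketch counts all non-trivial training sets and enumerates the splits for one fixed training-set size.

```python
from itertools import combinations


def nontrivial_split_count(n):
    """Number of possible training sets among n calibration variables,
    excluding the empty set and the full set."""
    return 2 ** n - 2


def training_test_splits(n, train_fraction=0.8):
    """Yield (training, test) index tuples for a fixed training-set size.
    The training fraction is an illustrative parameter."""
    k = round(train_fraction * n)
    all_indices = set(range(n))
    for train in combinations(range(n), k):
        test = tuple(sorted(all_indices - set(train)))
        yield train, test


n = 10  # hypothetical number of calibration variables
splits = list(training_test_splits(n))
print(nontrivial_split_count(n), len(splits))
```

With 10 variables there are 1022 non-trivial splits in total, but only 45 with 8 variables in the training set, which shows how fixing the training-set size keeps the enumeration manageable. In an actual cross-validation, each training set would be used to compute the weights and each test set to score the resulting combination.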

There is an out-of-sample penalty for PW’s statistical accuracy score. Figure 5 (left) shows out-of-sample statistical accuracy of PW and EW as a function of training set size. Both scores increase due to loss of statistical power in the test set, but PW increases faster. As out-of-sample PW is better able to resolve expert performance, it approaches the in-sample PW. The combined score (right) shows that out-of-sample dominance of PW grows with training set size.

With $n$ calibration variables the total number of splits (excluding the empty set and the entire set) is $2^{n}-2$, which quickly becomes unmanageable as $n$ grows. Recent research suggests that using 80% of the calibration variables for the training set is a good compromise between these competing interests. Using Eggstaff’s code, Figure 6 shows the ratios of combined scores for PW / EW per study, aggregated over all splits with 80% of the calibration variables in the training set. Performance weighting (PW) is demonstrably superior to simple combination schemes (EW, HW) that do not use performance data. However, PW places greater demands on the analyst, both with regard to generating meaningful calibration variables and explaining the methods and results. Finding competent and experienced analysts is the greatest bottleneck for applications. Refinement of performance measures, and improvements in cross-validation methods and software, would also be welcome.

Websites
http://rogermcooke.net All mathematical details, an overview of post-2006 applications, recent publications and links to data are available online

Software

 * EXCALIBUR (website) software for processing expert judgment data with the classical model. Freely available.