Forecasting presents decision makers with actionable information that they can use to prevent (or prepare for) economic (Shin et al., 2013; Huang et al., 2016; Mak et al., 1996), engineering (Guangliang, 1996; Zio, 1996b; Neves and Frangopol, 2008), ecological (Borsuk, 2004; Failing et al., 2004; Morales-Nápoles et al., 2017; Johnson et al., 2018), social (Cabello et al., 2012; Kläs et al., 2010; Craig et al., 2001), and public health burdens (Evans et al., 1994a; Alho, 1992).
Research combining expert opinion to produce an aggregate forecast has grown rapidly, and a diverse group of disciplines apply combination forecasting techniques. Cross-communication between different applied areas of combination forecasting is minimal, and as a result, different scientific fields are working in parallel rather than together. The same mathematical ideas in combination forecasting are given different labels depending on application. For example, the literature refers to taking an equally-weighted average of expert forecasts as: equal-weighting, unweighted, and 50-50 weighting.
This scoping review focuses on methods for aggregating expert judgments. The aim is to survey the current state of the expert combination forecasting literature, propose a single set of labels for frequently used mathematical concepts, look critically at how to improve expert combination forecasting research, and suggest future directions for the field.
We map key terminology used in combining expert judgemental forecasts and consolidate related definitions. A textual analysis of scoped articles highlights how combination forecasting techniques have evolved. A prespecified list of questions was asked of every in-scope manuscript: whether point predictions or predictive densities were elicited from experts, methods of aggregating expert predictions, experimental design for evaluating combination forecasts and how forecasts were scored (evaluated). We tabulated techniques for evaluating forecasts and condensed terms referring to the same evaluative metric.
Section 2 gives a brief historical background of combination forecasting and current challenges. Section 3 describes our literature search, how articles were defined as in-scope, and our analysis. Section 4 reports results, and Section 5 discusses common themes and terminology, advocates for key areas that need improvement, and recommends future directions for aggregating expert predictions.
2.1 Human judgmental forecasting
Judgmental forecasting models (predictions elicited from experts or non-expert crowds and combined into a single aggregate forecast) have a long history of making well-calibrated and accurate predictions (Edmundson, 1990; Bunn and Wright, 1991; Lawrence and O’Connor, 1992; O’Connor et al., 1993). Advances in judgmental forecasting take two paths: building sophisticated schemes for combining predictions (Clemen, 1989; Clemen and Winkler, 1999a; Clemen, 2008) and eliciting better quality predictions (Ayyub, 2001; Helmer, 1967).
Initial combination schemes showed an equally-weighted average of human-generated point predictions can accurately forecast events of interest (Galton, 1907). More advanced methods take into account covariate information about the forecasting problem and about the forecasters themselves (for example weighting experts on their past performance). Compared to an equally-weighted model, advanced methods show marginal improvements in forecasting performance (Fischer and Harvey, 1999; McLaughlin, 1973; Armstrong and Forecasting, 1985; Winkler, 1971; Clemen, 1989).
In this work we will study combinations of expert predictions. Combining non-expert predictions often falls into the domain of crowdsourcing, and crowdsourcing methods tend to focus on building a system for collecting human-generated input rather than on the aggregation method.
Past literature suggests experts make more accurate forecasts than novices (Armstrong, 2001a, 1983; Lawrence et al., 2006; Spence and Brucks, 1997; Alexander Jr, 1995; French, 2011; Clemen and Winkler, 1999a). Several reasons could contribute to this increased accuracy: domain knowledge, the ability to react to and adjust for changes in data, and the potential to make context-specific predictions in the absence of data (Armstrong, 1983; Lawrence et al., 2006; Spence and Brucks, 1997; Alexander Jr, 1995). The increased accuracy of expert opinion led some researchers to exclusively study expert forecasts (Armstrong, 2001a; French, 2011; Genre et al., 2013). However, crowdsourcing (asking large numbers of novices to make predictions and using a simple aggregation scheme) rivals expert-generated combination forecasts (Howe, 2006; Lintott et al., 2008; Prill et al., 2011). Whether expert or non-expert predictions are solicited, the judgmental forecasting literature agrees that human judgment can play an important role in forecasting.
Judgmental forecasts can have advantages over statistical forecasting models. Human intuition can overcome sparse or incomplete data issues. Given a forecasting task with little available data, people can draw on similar experiences and unstructured data to make predictions, whereas statistical models need direct examples and structured data to make predictions. When data are plentiful and structured, statistical models typically outperform human intuition (Meehl, 1954; Kleinmuntz, 1990; Yaniv and Hogarth, 1993). Whether a statistical or judgmental forecast is best therefore depends on the circumstances.
An understanding of the type of forecasts that models can produce and a mathematical description of a combination forecast can clarify how judgmental data, number of forecasters, and the combination scheme interact.
2.2 A framework for combination forecasting
Forecasting models can be statistical, mechanistic, or judgmental. We define a forecasting model $f$ as a set of probability distributions over all possible events. Each probability distribution is typically assigned a vector $\theta$, called the model’s parameters, that is used to differentiate one probability distribution from another: $f = \{F_{\theta} \mid \theta \in \Theta\}$, where $F_{\theta}$ is the probability distribution for a specific choice of $\theta$, and $\Theta$ is the set of all possible choices of model parameters.
Models can produce two types of forecasts: point predictions or predictive densities. Point forecasts produce a single estimate of a future value (Bates and Granger, 1969; Granger and Ramanathan, 1984). They are frequently used because they are easier to elicit from experts and because early work was dedicated specifically to combining point forecasts (Granger and Ramanathan, 1984; Bates and Granger, 1969; Galton, 1907). Probabilistic forecasts are more detailed. They provide the decision maker an estimate of uncertainty (a probability distribution) over all possible future scenarios (Clemen and Winkler, 1999a; Stone, 1961; Winkler, 1981; Genest et al., 1986; Winkler, 1968; Dawid et al., 1995; Ranjan and Gneiting, 2010; Gneiting et al., 2013; Hora and Kardeş, 2015). Predictive densities can be thought of as more general than point forecasts. A point forecast can be derived from a probabilistic forecast by taking, for example, the mean, median, or maximum a posteriori value. Conversely, a predictive density assigning all probability mass to a single value can be considered a point forecast.
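As a minimal sketch of how a point forecast can be read off a probabilistic forecast, the Python snippet below summarizes a discrete predictive distribution by its mean, median, and most probable value. The function name `point_forecasts` and the example numbers are illustrative assumptions, not taken from any reviewed article.

```python
def point_forecasts(values, probs):
    """Summarize a discrete predictive distribution with three point forecasts.

    values: possible outcomes; probs: probability assigned to each outcome.
    Returns (mean, median, mode). Hypothetical helper for illustration only.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    pairs = sorted(zip(values, probs))  # order outcomes from smallest to largest

    # Mean: probability-weighted average of the outcomes.
    mean = sum(v * p for v, p in pairs)

    # Median: smallest outcome whose cumulative probability reaches 0.5.
    cumulative = 0.0
    median = pairs[-1][0]
    for v, p in pairs:
        cumulative += p
        if cumulative >= 0.5 - 1e-12:
            median = v
            break

    # Mode: the single most probable outcome.
    mode = max(pairs, key=lambda vp: vp[1])[0]
    return mean, median, mode


mean, median, mode = point_forecasts([10, 20, 30], [0.2, 0.5, 0.3])
```

Which summary is appropriate depends on the loss the decision maker cares about; the sketch only shows that each is a deterministic function of the full distribution.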
A combination forecast aggregates predictions, either point or probabilistic, from a set of models and produces a single aggregate forecast (Clemen and Winkler, 1999a; Winkler, 1981; Genest et al., 1986). Given a set of models $f_1, f_2, \ldots, f_N$, a combination model $G: f_1 \times f_2 \times \cdots \times f_N \to \mathcal{F}$ maps the Cartesian product of all models onto a single class of suitable probability distributions $\mathcal{F}$ (Gneiting et al., 2013). The goal of combination forecasting is to find an optimal aggregation function $G \in \mathcal{G}$, where $\mathcal{G}$ is the class of aggregation functions under consideration. Typically the model $G$ is parameterized as $G_{\upsilon}$, such that finding an optimal $G$ amounts to finding the parameter vector $\upsilon$ that produces an optimal forecast.
There are several ways to improve a combination model’s forecasting ability. Combination models can improve forecast accuracy by considering a more flexible class of aggregation functions $\mathcal{G}$. Soliciting expert opinion (versus novices) can be thought of as improving individual forecasts used as input into the combination model. Crowdsourcing takes a different approach to improve forecast accuracy (Howe, 2006; Brabham, 2013; Abernethy and Frongillo, 2011; Forlines et al., 2014; Moran et al., 2016). These methods consider a simple class of aggregation functions and collect a large number $N$ of human-generated forecasts $f_1, f_2, \ldots, f_N$. By accumulating a large set of human-generated predictions, a crowdsourcing approach can create flexible models with a simple aggregation function.
This framework makes clear the goals of any combination forecasting model. Some focus on improving the individual models $f_1, \ldots, f_N$; others focus on more flexible aggregation functions ($G$). In this work we consider combination forecasting models that take expert-elicited forecasts as their raw material and pursue more flexible aggregation models.
2.3 A brief timeline of existing work
Francis Galton was one of the first to formally introduce the idea of combination forecasting. In the early 20th century, he showed aggregating point estimates from a crowd via an unweighted average was more accurate compared to individual crowd estimates (Galton, 1907). Galton’s work was empirical, but laid the foundation for exploring how a group of individual conjectures could be combined to produce a better forecast.
Since Galton, combination forecasting has been mathematically cast as an opinion pool. Work in opinion pools began with Stone (1961) in the early 1960s. He assumed a set of experts had an agreed-upon utility function related to decision making, and that experts could each generate a unique probability distribution to describe their perceived future “state of nature”. To build a single combined forecast, Stone proposed a convex combination of each expert’s probability distribution over the future: an opinion pool. Equally weighting individual predictions would reproduce Galton’s model, and so the opinion pool was a more flexible way to combine expert opinions.
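The convex combination described above can be sketched in a few lines of Python. The sketch assumes each expert reports probabilities over the same finite set of outcomes; the function name `linear_opinion_pool` and the example numbers are illustrative assumptions.

```python
def linear_opinion_pool(expert_probs, weights=None):
    """Combine expert probability vectors with a convex combination.

    expert_probs: list of probability vectors, one per expert, all defined
    over the same outcomes. weights: non-negative numbers summing to one;
    if omitted, every expert is weighted equally.
    """
    n_experts = len(expert_probs)
    if weights is None:
        weights = [1.0 / n_experts] * n_experts  # equal weighting
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")

    n_outcomes = len(expert_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(weights, expert_probs))
        for j in range(n_outcomes)
    ]


experts = [[0.2, 0.8], [0.6, 0.4]]
equal_pool = linear_opinion_pool(experts)                   # equal weights
weighted_pool = linear_opinion_pool(experts, [0.75, 0.25])  # favor expert 1
```

Because the weights are non-negative and sum to one, the pooled vector is itself a valid probability distribution.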
In the late 1960s, Bates and Granger formalized the concept of an optimal combination forecast. In their seminal work (Bates and Granger, 1969), several methods were proposed for combining point predictions to reduce, as much as possible, the combined forecast’s variance. Methods for combining forecasts were further advanced by Granger and Ramanathan, who framed the task as a regression problem (Granger and Ramanathan, 1984). Work by Bates, Granger, and later Ramanathan inspired several novel methods for combining point forecasts (Gneiting et al., 2013; Hora and Kardeş, 2015; Cooke et al., 1991; Wallis, 2011). Combination forecasts often produce better predictions of the future than single models.
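To make the variance-minimizing idea concrete, the sketch below uses the standard textbook result for two unbiased forecasts whose errors have known variances and correlation. It is an illustration under those assumptions, not a reproduction of the estimators proposed in the original article, and the function names are hypothetical.

```python
def minimum_variance_weight(var1, var2, rho=0.0):
    """Weight on forecast 1 that minimizes the variance of
    k * forecast1 + (1 - k) * forecast2, assuming both forecasts are unbiased
    with error variances var1 and var2 and error correlation rho."""
    cov = rho * (var1 * var2) ** 0.5
    denom = var1 + var2 - 2.0 * cov
    if denom == 0:
        raise ValueError("forecast errors are identical; weight is undefined")
    return (var2 - cov) / denom


def combined_variance(var1, var2, rho, k):
    """Error variance of the combination that puts weight k on forecast 1."""
    cov = rho * (var1 * var2) ** 0.5
    return k * k * var1 + (1 - k) ** 2 * var2 + 2 * k * (1 - k) * cov


# Forecast 1 is four times more precise than forecast 2, errors uncorrelated.
k = minimum_variance_weight(1.0, 4.0, rho=0.0)
v = combined_variance(1.0, 4.0, 0.0, k)
```

With uncorrelated errors the weights reduce to inverse-variance weighting, and the combined variance is no larger than that of the better individual forecast.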
It was not until the 1990s that Cooke generalized the work of Stone and others, and developed an algorithm coined Cooke’s method, or the Classical Model (Cooke et al., 1988, 1991), for combining expert judgment. Every expert was asked to provide a probability distribution over a set of possible outcomes. To assign weights to experts, a calibration score statistic compared the expert’s probability distribution to an empirical distribution of observations. Experts were assigned higher weights if their predictions closely matched the empirical distribution. The calibration score was studied by Cooke and asymptotic properties were summarized based on Frequentist procedures (Cooke et al., 1988; Cooke, 2015). Cooke’s model also assigned experts a weight of $0$ for poor predictive performance: if an expert’s performance was under some user-set threshold, they were excluded from the opinion pool. Cooke’s model garnered much attention and has influenced numerous applications of combining expert opinion for forecasting (Cooke, 2014; Clemen, 2008; Cooke, 2015).
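The thresholding step can be illustrated with a simplified sketch. It assumes each expert already has a non-negative performance score and does not compute Cooke's calibration statistic itself; the function name `performance_weights`, the scores, and the cutoff are all hypothetical.

```python
def performance_weights(scores, cutoff):
    """Turn per-expert performance scores into opinion-pool weights.

    Experts scoring below `cutoff` are excluded (weight zero); the remaining
    scores are normalized to sum to one. Simplified illustration only: the
    scores are assumed to be given, not derived from a calibration statistic.
    """
    kept = [s if s >= cutoff else 0.0 for s in scores]
    total = sum(kept)
    if total == 0:
        raise ValueError("no expert meets the cutoff")
    return [s / total for s in kept]


# Three experts; the third falls below the cutoff and is dropped.
weights = performance_weights([0.6, 0.3, 0.01], cutoff=0.05)
```

The resulting weights can be passed to any linear opinion pool, so that better-performing experts contribute more to the combined distribution.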
Alongside Frequentist approaches to combination forecasting, Bayesian approaches began to gain popularity in the 1970s (Morris, 1974). In the Bayesian paradigm, a decision maker (called a supra Bayesian), real or fictitious, is asked to evaluate expert forecasts and combine their information into a single probability distribution (Hogarth, 1975; Keeney, 1976). The supra Bayesian starts with a prior over possible future observations and updates their state of knowledge with expert-generated predictive densities. Combination formulas can be specified via a likelihood function meant to align expert-generated predictive densities with observed data. The difficulties introduced by a Bayesian paradigm are familiar. The choice of likelihood function and prior will affect how expert opinions are pooled. Past work proposed many different likelihood functions, and interested readers will find a plethora of examples in Genest and Zidek (Genest et al., 1986) and in Clemen and Winkler (Clemen and Winkler, 1986, 1999a; Clemen, 1989).
2.4 Recent work in combination forecasting
Recent work has shifted from combining point estimates to combining predictive densities. Rigorous mathematical theory was developed and framed the problem of combining predictive densities (Gneiting et al., 2013). Work combining predictive densities showed results similar in spirit to earlier work on combining point predictions (Bates and Granger, 1969; Granger and Ramanathan, 1984). Ranjan and Gneiting (Ranjan and Gneiting, 2010; Gneiting et al., 2013) showed a set of calibrated predictive distributions, when combined using a linear pool, necessarily leads to an overdispersed and therefore miscalibrated combined distribution. This mirrors the result of Bates and Granger (1969), who showed that combining unbiased point predictions can lead to a combination method that makes biased point estimates.
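One way to build intuition for why a linear pool spreads out is the law of total variance: the variance of a mixture equals the weighted average of the component variances plus the weighted spread of the component means. The sketch below computes both terms; it illustrates the extra spread only and is not a proof of the calibration result cited above. The function name is a hypothetical choice.

```python
def mixture_mean_and_variance(means, variances, weights):
    """Mean and variance of a linear pool (mixture) of distributions.

    means, variances: mean and variance of each component distribution.
    weights: non-negative pool weights that sum to one.
    """
    pooled_mean = sum(w * m for w, m in zip(weights, means))
    # Law of total variance: within-component variance plus
    # between-component spread of the means around the pooled mean.
    pooled_var = sum(
        w * (v + (m - pooled_mean) ** 2)
        for w, m, v in zip(weights, means, variances)
    )
    return pooled_mean, pooled_var


# Two unit-variance forecasts that disagree about the center.
mean, var = mixture_mean_and_variance([-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

In this example each component has variance 1, yet the pool has variance 2, because disagreement between the component means adds to the spread.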
This work on miscalibrated linear pools inspired new methods for recalibrating forecasts made from a combination of predictive densities. To recalibrate, authors recommend transforming the aggregated forecast distribution. The Spread-adjusted Linear Pool (SLP) (Berrocal et al., 2007; Glahn et al., 2009; Kleiber et al., 2011) transforms each individual distribution before combining, whereas the Beta Linear Pool (BLP) applies a beta transform to the final combined distribution (Gneiting et al., 2013; Ranjan and Gneiting, 2010). A more flexible infinite mixture version of the BLP (Bassetti et al., 2018), a mixture of Normal densities (Baran and Lerch, 2018), and an empirical cumulative distribution function (Garratt et al., 2019) also aim to recalibrate forecasts made from a combination of predictive densities.
Machine learning approaches assume a broader definition of a model as any mapping that inputs a training set and outputs predictions. This allows for more general approaches to combining forecasts, called ensemble learning, meta-learning, or hypothesis-boosting in the machine learning literature. Stacking and the super-learner approach are two active areas of machine learning research to combine models. Stacked generalization (stacking) (Wolpert, 1992) proposes a mapping from out-of-sample predictions made by models (called base-learners) to a single combination forecast. The function that combines these models is called a generalizer and can take the form of any regression model, so long as it maps model predictions into a final ensemble prediction. The super-learner ensemble takes a similar approach to stacking (Van der Laan et al., 2007; Polley and Van Der Laan, 2010). Like stacking, the super-learner takes as input out-of-sample predictions from a set of models. Different from stacking, the super-learner algorithm imposes a specific form for aggregating predictions, a convex combination of models, such that the weights assigned to each model minimize an arbitrary loss function that includes the super-learner predictions and true outcomes of interest. By restricting how predictions are aggregated, the super-learner is guaranteed better performance under certain conditions (Van der Laan et al., 2007; Polley and Van Der Laan, 2010). Stacked and super-learner models often perform better than any individual forecast, and their success has led to applying them to many different problems (Syarif et al., 2012; Sakkis et al., 2001; Che et al., 2011; Wang et al., 2011). However, the machine learning community is debating issues with stacked models (Ting and Witten, 1999) and how they can be improved (Džeroski and Ženko, 2004).
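The convex-combination constraint can be illustrated with a deliberately small sketch: two sets of out-of-sample predictions, squared-error loss, and a grid search over the single free weight. The function name, the grid search, and the data are assumptions for illustration; published super-learner implementations use cross-validation and general-purpose optimizers.

```python
def best_convex_weight(pred_a, pred_b, truth, steps=100):
    """Find the weight w in [0, 1] minimizing the mean squared error of
    w * pred_a + (1 - w) * pred_b against the observed outcomes.

    pred_a, pred_b: out-of-sample predictions from two base learners
    (assumed to be produced elsewhere, e.g. by cross-validation).
    """
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        loss = sum(
            (w * a + (1 - w) * b - y) ** 2
            for a, b, y in zip(pred_a, pred_b, truth)
        ) / len(truth)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# One learner predicts too low, the other too high, by the same amount.
w, loss = best_convex_weight([1, 2, 3], [3, 4, 5], [2, 3, 4])
```

Because the endpoints w = 0 and w = 1 are included in the search, the selected combination never has a larger in-sample loss than either base learner alone.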
2.5 Open challenges in combination forecasting
Combination forecasting has three distinct challenges: data collection, choice of combination method, and how to evaluate combination forecasts.
Crowdsourcing (Howe, 2006; Brabham, 2013; Abernethy and Frongillo, 2011; Forlines et al., 2014; Moran et al., 2016) and expert elicitation (Amara and Lipinski, 1971; Yousuf, 2007; O’Hagan et al., 2006) are two approaches to collecting judgmental forecasts that attempt to balance competing interests: the quantity and quality of judgmental predictions. Crowdsourcing trades expertise for a large number of contributors. Expert judgmental forecasting takes the opposite approach and focuses on a small number of independent high-quality forecasts. Both methods try to enlarge the space of potential predictions so that a combination method can create a more diverse set of predictive densities over future events (Dietterich et al., 2002; Bates and Granger, 1969).
Combination methods are faced with developing a set of distributions over events of interest that take predictions as input and produce an aggregated prediction aimed at optimizing a loss function. Major challenges are how to account for missing predictions (Capistrán and Timmermann, 2009), how to handle correlated experts (Armstrong, 1985; Bunn, 1985, 1979), and how to ensure the combination forecast remains calibrated (Ranjan and Gneiting, 2010; Gneiting et al., 2013; Berrocal et al., 2007; Glahn et al., 2009; Kleiber et al., 2011; Garratt et al., 2019).
No normative theory for how to combine expert opinions into a single consensus distribution has been established, and a lack of theory makes comparing the theoretical merits of one method versus another difficult. Instead, authors compare combination methods using metrics that measure predictive accuracy, such as calibration and sharpness (Jolliffe and Stephenson, 2012; Gneiting and Raftery, 2007; Gneiting and Ranjan, 2011; Dawid, 2007; Hora and Kardeş, 2015). Combination methods that output point forecasts are compared by measuring the distance between a forecasted point estimate and the empirical observation. Probabilistic outputs are expected to be calibrated and attempt to optimize sharpness, or the concentration of probability mass over the empirical observations (Gneiting and Raftery, 2007; Gneiting and Ranjan, 2011; Hora and Kardeş, 2015; Jolliffe and Stephenson, 2012).
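Two common scoring choices can be sketched as follows: mean absolute error as a distance for point forecasts, and the Brier score for probabilistic forecasts of a binary event. These two metrics are picked here for illustration; they are not singled out by the paragraph above, and the function names are hypothetical.

```python
def mean_absolute_error(forecasts, observations):
    """Average distance between point forecasts and what was observed."""
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)


def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities of an event
    and binary outcomes (1 if the event occurred, 0 otherwise).
    Lower is better; 0 is a perfect score."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


mae = mean_absolute_error([2.0, 4.0], [1.0, 6.0])
brier = brier_score([0.9, 0.2], [1, 0])
```

The Brier score rewards forecasts that are both confident and correct, so it reflects calibration and sharpness together rather than either one alone.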
2.6 Past reviews on combination forecasting
Our review underlines the digital age’s impact on combination forecasting. Collecting expert opinion in the past required one-on-one meetings with experts in person, by phone, or by mailed survey. The internet decreased the burden of eliciting expert opinion by allowing online platforms to ask experts for their opinion (Howe, 2006). Past work focused on using statistical models to combine forecasts, but increases in computing power broadened the focus from statistical models to machine-learning techniques. Our review explores how the digital age transformed combination forecasting and is an updated look at methods used to aggregate expert forecasts.
Many excellent past reviews of combination methods exist. Genest and Zidek give a broad overview of the field and pay close attention to the axiomatic development of combination methods (Genest et al., 1986). Clemen and Winkler wrote four reviews of aggregating judgmental forecasts (Clemen and Winkler, 1986; Clemen, 1989; Clemen and Winkler, 1999a, b). The most cited manuscript overviews behavioral and mathematical approaches to aggregating probability distributions, reviews major contributions from psychology and management science, and briefly reviews applications. These comprehensive reviews center around the theoretical developments of combination forecasting and potential future directions of the science. Our work is an updated, and more applied, look at methods for aggregating expert predictions.
3.1 Search algorithm
The Web of Science database was used to collect articles relevant to combining expert prediction. The search string entered into Web of Science on 2019-03-06 was (expert* or human* or crowd*) NEAR judgement AND (forecast* or predict*) AND (combin* or assimilat*) and articles were restricted to the English language. All articles from this search were entered into a database. Information in this article database included: the author list, title of article, year published, publishing journal, keywords, and abstract (full database can be found at https://github.com/tomcm39/AggregatingExpertElicitedDataForPrediction).
To decide if an article was related to combining expert judgement, two randomly assigned reviewers (co-authors) read the abstract and were asked if the article was in or out of scope. We defined an article as in-scope if it elicited expert judgments and combined them to make a prediction about natural phenomena or a future event. An article moved to the next stage if both reviewers agreed the article was in-scope. If the two reviewers disagreed, the article was sent to a randomly assigned third reviewer to act as a tie breaker and was considered in scope if this third reviewer determined the article was in-scope.
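The screening rule above amounts to a small decision procedure, sketched here in Python. The function name `is_in_scope` and the boolean encoding of reviewer decisions are illustrative assumptions.

```python
def is_in_scope(reviewer_a, reviewer_b, tie_breaker=None):
    """Apply the two-reviewer screening rule with a third-reviewer tie break.

    reviewer_a, reviewer_b: True if that reviewer judged the abstract in-scope.
    tie_breaker: decision of a third reviewer, consulted only on disagreement.
    """
    if reviewer_a == reviewer_b:
        return reviewer_a  # agreement settles the decision either way
    if tie_breaker is None:
        raise ValueError("reviewers disagree; a third review is required")
    return tie_breaker


decision = is_in_scope(True, False, tie_breaker=True)
```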
Full texts were collected for all in-scope articles. In-scope full texts were divided at random among all reviewers for a detailed reading. Reviewers were asked to read the article and fill out a prespecified questionnaire (Table 4). The questionnaire asked reviewers to summarize the type of target for prediction, the methodology used, the experimental setup, and terminology associated with aggregating expert opinion. If, after a detailed review, an article was determined to be out of scope, it was excluded from analysis. The final list of articles is called the analysis set.
3.2 Analysis of full text articles
From all analysis-set articles, abstract text was split into individual words. We removed English stop words (a set of common words that have little lexical meaning) that matched the Natural Language Toolkit (NLTK)’s stop word repository (Loper and Bird, 2002), and the final set of non-stop words was stemmed (Willett, 2006).
A univariate analysis counted (i) the number of times a word $w$ appeared in abstract text in year $t$, $N_w(t)$; (ii) the total number of words among all abstracts in that year, $N(t)$; and (iii) the number of times the word appeared across all years, $N_w = \sum_t N_w(t)$. If a word did not appear in a given year it received a count of zero ($N_w(t) = 0$).
Words were sorted by $N_w$, and a histogram was plotted of the top 5% most frequently occurring words in abstract text. Among the most frequently occurring words, we plotted the proportion ($N_w(t)/N(t)$) of each word over time.
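The counting steps above can be sketched with the Python standard library. The sketch makes several assumptions: a tiny hard-coded stop-word set stands in for the NLTK list, stemming is omitted, the yearly total is taken after stop-word removal, and the function names and example abstracts are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list; a stand-in for a full stop-word repository.
STOP_WORDS = {"the", "of", "and", "a", "to", "in"}


def word_statistics(abstracts_by_year):
    """Count words per year, total words per year, and words overall.

    abstracts_by_year: dict mapping a year to a list of abstract strings.
    Returns (counts_by_year, totals_by_year, overall_counts).
    """
    counts_by_year, totals_by_year = {}, {}
    for year, abstracts in abstracts_by_year.items():
        counter = Counter()
        for text in abstracts:
            words = [w for w in text.lower().split() if w not in STOP_WORDS]
            counter.update(words)
        counts_by_year[year] = counter
        totals_by_year[year] = sum(counter.values())

    overall = Counter()
    for counter in counts_by_year.values():
        overall.update(counter)
    return counts_by_year, totals_by_year, overall


def proportion(word, year, counts_by_year, totals_by_year):
    """Share of a year's words accounted for by `word` (zero if absent)."""
    total = totals_by_year[year]
    return counts_by_year[year][word] / total if total else 0.0


data = {1990: ["the expert forecast"], 2000: ["expert expert model"]}
counts, totals, overall = word_statistics(data)
```

A `Counter` returns zero for words it has never seen, which matches the convention that a word absent in a given year receives a count of zero.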
Full text articles were scanned for key terms related to aggregating expert judgments. Evaluation metrics, a preferred abbreviation, related names, whether the metric evaluated a binary or continuous target, and formula to compute the metric was included in a table (Table 3). Terms specific to aggregating judgmental data were grouped by meaning and listed in a table (Table 1) along with a single definition. If multiple terms mapped to the same concept, our preferred label was placed at the top.
4.1 Search results
The initial Web of Science search returned articles for review. After random assignment to two reviewers, articles were agreed to be out of scope. The most frequent reasons for exclusion were the lack of experts used for prediction or the use of experts to revise, rather than directly participate in generating, forecasts. The in-scope articles come from articles two reviewers agreed to be in-scope, and out of articles a randomly assigned third reviewer considered in-scope. Full text analysis determined another articles out of scope, and the final number of analysis-set articles was (Fig. 1).
Analysis set articles were published from to . Publications steadily increase in frequency from until . After , publication rates rapidly increase until (Fig. 2).
Analysis-set articles were published in journals, and the top publishing journals are: the International Journal of Forecasting ( articles), Reliability Engineering & System Safety ( articles), and Risk Analysis and Decision Analysis ( articles each). Combination forecasting articles often emphasize the role of decision makers in forecasting, and these top-publishing journals sit at the intersection of forecasting and decision sciences.
The most frequent words found in articles’ abstracts are related to our initial search: “expert”, “judgment”, “forecast”, “combin”, and “predict”. Words related to modeling and methodology are also frequent: “model”, “method”, “approach”, and “predict”. The word “assess” appears less frequently in abstracts, and the word “accuracy” even less frequently (Fig. 3).
The proportion of words: “expert”, “forecast”, “model”, “method”, and “data” appear intermittently in the s and appear more consistently in the s (Fig. 4). The words “probabili*” and “predict” occur in abstract text almost exclusively after the year . The rise of “forecasts”, “model”, and “data” suggests data-driven combination forecasting schemes may be on the rise, and the uptick of “probabil*” and “predict” could be caused by an increase in aggregating expert probability distributions (rather than point forecasts).
4.2 Forecasting terminology
Forecasting terminology centered around six distinct categories (Table 1): frameworks for translating data and judgment into decisions (forecasting support system, probabilistic safety assessment), broad approaches to aggregating forecasts (behavioral aggregation, mathematical combination, integrative judgment), specific ways experts can provide predictions (integrative judgment, judgmental adjustment), terms related to weighting experts (equal weighted linear pool, nominal weights), different names for classical models (Cooke’s method, mixed estimation), and philosophical jargon related to combination forecasting (Laplacian principle of indifference, Brunswik lens model).
Only a few concepts in the literature are assigned a single label; the majority are given multiple labels. Some concepts’ labels are similar enough that one label can be swapped for another. For example, equal-weighted, 50-50, and unweighted all refer to assigning equal weights to expert predictive densities in a linear opinion pool. Other concepts are assigned labels that differ enough, for example forecasting support system and adaptive management, that it may be difficult to recognize that both terms refer to the same concept.
4.3 Forecasting targets
Forecasting research focused on predicting categorical variables (articles, 34%) and time-series ( articles, 40%), but the majority of articles attempted to predict a continuous target ( articles, %) (Table 2).
The type of forecasting target depended on the application. Ecological and meteorological articles (Johnson et al., 2018; Cooke et al., 2014; Li et al., 2012; Tartakovsky, 2007; Morales-Nápoles et al., 2017; Borsuk, 2004; Abramson et al., 1996; Mantyka-Pringle et al., 2014; Kurowicka et al., 2010; Wang and Zhang, 2018) focused on continuous targets such as the prevalence of animal and microbial populations, deforestation, and climate change. Economics and managerial articles focused on targets like the number of tourist arrivals, defects in programming code, and monthly demand of products (Song et al., 2013; Kabak and Ülengin, 2008; Huang et al., 2016; Failing et al., 2004; Shin et al., 2013). Political articles focused on predicting presidential outcomes, a categorical target (Hurley and Lior, 2002; Graefe et al., 2014a; Morgan, 2014; Graefe, 2015, 2018; Graefe et al., 2014b). Risk-related targets were continuous and categorical: the probability of structural damage, nuclear fallout, occupational hazards, and balancing power load (Kläs et al., 2010; Zio and Apostolakis, 1997; Cabello et al., 2012; Adams et al., 2009; Neves and Frangopol, 2008; Jana et al., 2019; Hathout et al., 2016; Wang et al., 2008; Ren-jun and Xian-zhong, 2002; Zio, 1996b; Baecke et al., 2017; Brito and Griffiths, 2016b; Craig et al., 2001; Mu and Xianming, 1999; Brito et al., 2012). Public health papers predicted continuous targets over time, like forecasting carcinogenic risk (Evans et al., 1994a) and US mortality rates (Alho, 1992).
Targets were often either too far in the future to assess, for example predicting precipitation changes in the next million years (Zio and Apostolakis, 1997), or related to a difficult-to-measure quantity, such as populations of animals with little or no monitoring (Johnson et al., 2018; Borsuk, 2004; Mantyka-Pringle et al., 2014). The majority of analysis-set articles placed more importance on the act of building a consensus distribution than on studying the accuracy of the combined forecast (Johnson et al., 2018; Cooke et al., 2014; Li et al., 2012; Kläs et al., 2010; Zio and Apostolakis, 1997; Song et al., 2013; Clemen and Winkler, 2007; Tartakovsky, 2007; Morgan, 2014; Borsuk, 2004; Kabak and Ülengin, 2008; Cabello et al., 2012; Adams et al., 2009; Neves and Frangopol, 2008; Failing et al., 2004; Evans et al., 1994a; Hora and Kardeş, 2015; Abramson et al., 1996; Hathout et al., 2016; Wang et al., 2008; Mantyka-Pringle et al., 2014; Kurowicka et al., 2010; Zio, 1996b; Brito and Griffiths, 2016b; Gu et al., 2016; Mu and Xianming, 1999; Wang and Zhang, 2018; Shin et al., 2013; Brito et al., 2012; Baron et al., 2014).
All articles defined a small number of specific forecasting targets. The majority of targets related to safety. Public health, ecology, and engineering applications focused on forecasting targets that, if left unchecked, could negatively impact human lives or the surrounding environment. What differed between articles was whether the forecasting target could be assessed, and if ground truth data was collected on targets.
4.4 Forecasting methodology
Articles taking a Bayesian approach accounted for % of analysis-set articles and emphasized how priors can complement sparse data (Zio and Apostolakis, 1997; Bolger and Houlding, 2017; Clemen and Winkler, 2007; Tartakovsky, 2007; Huang et al., 2016; Neves and Frangopol, 2008; Abramson et al., 1996; Ren-jun and Xian-zhong, 2002; Mantyka-Pringle et al., 2014; Brito and Griffiths, 2016b; Wang and Zhang, 2018; Brito et al., 2012). Many papers focused on assessing risk (Zio and Apostolakis, 1997; Brito and Griffiths, 2016b; Brito et al., 2012; Tartakovsky, 2007). For example, the risk of losing autonomous underwater vehicles was modeled using a Bayesian approach that incorporated objective environmental data and subjective probabilities of loss solicited from experts (Brito and Griffiths, 2016b; Brito et al., 2012). Other papers assessed the impact of subsurface hydrology on water contamination (Tartakovsky, 2007), the risk of structural deterioration (Neves and Frangopol, 2008), and the economic risk associated with government expenditures (Wang and Zhang, 2018).
Bayesian methods involved beta-binomial models, decision trees, mixture distributions, or Bayesian belief networks. Often Bayesian models involved complicated posterior computations, requiring numerical integration to compute forecast probabilities. Past work suggested a Bayesian framework could better model subjective probabilities elicited from experts (Clemen and Winkler, 2007); however, Frequentist techniques were used in almost 50% of articles.
Frequentist models for combining forecasts (Cooke et al., 2014; Kläs et al., 2010; Mak et al., 1996; Hurley and Lior, 2002; Morales-Nápoles et al., 2017; Borsuk, 2004; Hanea et al., 2018; Cabello et al., 2012; Adams et al., 2009; Alho, 1992; Evans et al., 1994a; Jana et al., 2019; Hora and Kardeş, 2015; Hathout et al., 2016; Wang et al., 2008; Ren-jun and Xian-zhong, 2002; Kurowicka et al., 2010; Baldwin, 2015; Baecke et al., 2017; Seifert and Hadida, 2013; Gu et al., 2016; Mu and Xianming, 1999; Graefe et al., 2014b; Alvarado-Valencia et al., 2017; Shin et al., 2013; Franses, 2011) were typically convex combinations of expert judgment or linear regression models that included expert judgment as a covariate. Including expert judgment as a covariate in a linear regression model is related to judgemental bootstrapping (Armstrong, 2001b) and the Brunswik lens model (Hammond and Stewart, 2001). Both techniques are mentioned in analysis-set articles and rely on a Frequentist regression that divides human judgment into predictions inferred from data and expert intuition,

$$ y \mid x \sim \mathcal{N}\left(x^{\top}\beta, \sigma^{2}\right), $$

where $y$ represents the expert’s forecast, $\mathcal{N}$ is a Normal distribution, $x$ is a vector of objective information about the target of interest, $\beta$ are estimated parameters, and $\sigma^{2}$ is argued to contain expert intuition. This model can then infer what covariates ($x$) are important to expert decision making and to what extent expert intuition ($\sigma^{2}$) is involved in prediction.
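As a rough illustration of the regression idea behind judgemental bootstrapping, the sketch below regresses an expert's forecasts on a single objective cue and reports the residual variance, the part of the forecasts not explained by the cue. The data, the single-cue restriction, and the helper name `fit_simple_ols` are assumptions made for this example and do not come from any analysis-set article.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x for a single cue x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: one objective cue and the expert's forecast for each case.
cue = [1.0, 2.0, 3.0, 4.0, 5.0]
expert_forecast = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_simple_ols(cue, expert_forecast)

# Part of each forecast that the cue explains, and the part it does not.
fitted = [b0 + b1 * xi for xi in cue]
residuals = [yi - fi for yi, fi in zip(expert_forecast, fitted)]

# Unbiased residual variance (n - 2 degrees of freedom for two parameters).
resid_var = sum(r ** 2 for r in residuals) / (len(cue) - 2)

print(f"intercept={b0:.3f}, slope={b1:.3f}, residual variance={resid_var:.4f}")
```

A large slope relative to the residual variance would suggest the expert leans heavily on the cue; a large residual variance would suggest the forecasts draw on something the cue does not capture.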
Articles that did not use classic regression combined statistical predictions (called ‘crisp’) with qualitative estimates made by experts using fuzzy logic. Cooke’s method inspired articles to take a mixture model approach and to weight experts based on how well they performed on a set of ground-truth questions.
Articles using neither Bayesian nor Frequentist models (Johnson et al., 2018; Li et al., 2012; Petrovic et al., 2006; Song et al., 2013; Graefe et al., 2014a; Morgan, 2014; Cai et al., 2016; Kabak and Ülengin, 2008; Graefe, 2015, 2018; Failing et al., 2004; Ren-jun and Xian-zhong, 2002; Hora et al., 2013; Baron et al., 2014) relied on dynamical systems, simple averages of point estimates and quantiles from experts, and tree-based regression models.
The majority of models were parametric. Non-parametric models included averaging quantiles, equally weighting expert predictions, and weighting experts via decision trees. These models allowed the parameter space to grow with increasing numbers of judgmental forecasts. Parametric models included linear regression, ARIMA, state space models, belief networks, the beta-binomial model, and neural networks. Expert judgments, when combined and used to forecast, showed positive results in both non-parametric and parametric models. Parametric Bayesian models and non-parametric models could better cope with sparse data than a parametric Frequentist model. Bayesian models used a prior to lower model variance when data was sparse, and non-parametric models could combine expert judgments without relying on a specific form for the aggregated predictive distribution.
Authors more often proposed combining expert-generated point estimates than predictive distributions. A diverse set of models was proposed to combine point estimates: regression models (linear regression, logistic regression, ARIMA, exponential smoothing), simple averaging, neural networks (Cabello et al., 2012; Adams et al., 2009; Mak et al., 1996; Graefe et al., 2014b; Baron et al., 2014), and fuzzy logic (Petrovic et al., 2006; Kabak and Ülengin, 2008; Jana et al., 2019; Ren-jun and Xian-zhong, 2002). Authors that combined predictive densities focused on simpler combination models.
Most predictive distributions were built by asking experts to provide a list of values corresponding to percentiles. For example, a predictive density would be built by asking each expert to provide values corresponding to the 5th, 50th (median), and 95th percentiles. Combination methods either directly combined these percentiles by assigning weights to each expert density (Sarin, 2013; Hanea et al., 2018; Morales-Nápoles et al., 2017; Cai et al., 2016; Bolger and Houlding, 2017; Kabak and Ülengin, 2008; Zio and Apostolakis, 1997; Brito and Griffiths, 2016a), or built a continuous predictive distribution that fit these discrete points (Brito et al., 2012; Abramson et al., 1996; Neves and Frangopol, 2008; Failing et al., 2004; Wang et al., 2008; Kurowicka et al., 2010).
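The two routes described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited article: the elicited values are made up, the weights default to equal weighting, and the continuous fit assumes a symmetric Normal shape (median taken as the mean, 5th to 95th percentile range taken as 2 × 1.645 standard deviations). The function names are hypothetical.

```python
from statistics import NormalDist

# Standard Normal 95th percentile, roughly 1.645.
Z95 = NormalDist().inv_cdf(0.95)


def average_quantiles(experts, weights=None):
    """Weighted average of each elicited percentile across experts."""
    n = len(experts)
    if weights is None:
        weights = [1.0 / n] * n  # equal weighting by default
    return tuple(
        sum(w * expert[k] for w, expert in zip(weights, experts))
        for k in range(3)
    )


def normal_from_quantiles(q05, q50, q95):
    """Fit a Normal to three percentiles, assuming symmetry around the median."""
    sigma = (q95 - q05) / (2 * Z95)
    return NormalDist(mu=q50, sigma=sigma)


# Hypothetical elicitation: (5th, 50th, 95th) percentiles from three experts.
experts = [(10.0, 20.0, 30.0), (14.0, 22.0, 30.0), (12.0, 24.0, 36.0)]

combined = average_quantiles(experts)       # combine the percentiles directly
fitted = normal_from_quantiles(*combined)   # then fit a continuous density

print("combined percentiles:", combined)
print("fitted mean and sd:", fitted.mean, fitted.stdev)
```

A skewed elicitation (for example, a median much closer to the 5th than the 95th percentile) would call for a different family than the Normal used here.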
4.5 Forecasting evaluation metrics
Only 42% (22/53) of articles evaluated forecast performance using a formal metric. Formal metrics used in analysis-set articles are summarized in Table 3. The articles that did not include a metric to compare forecast performance either did not compare combination forecasts to ground truth, evaluated forecasts by visual inspection, or measured success as the ability to combine expert-generated forecasts. Among articles that did evaluate forecasts, most articles focused on point estimates (68%, 15/22) versus probabilistic forecasts (23%, 5/22), and two articles did not focus on point or probabilistic forecasts from experts.
The most commonly used metrics to evaluate point forecasts were: the Brier score, mean absolute (and percentage) error, and root mean square error. Even when predictive densities were combined, the majority of articles output and evaluated point estimates.
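For reference, the metrics named above can be written out directly. The sketch below uses their standard textbook definitions with made-up forecasts; it is not taken from any analysis-set article, and the function names are chosen for this example.

```python
from math import sqrt


def mae(forecasts, truths):
    """Mean absolute error."""
    return sum(abs(f - y) for f, y in zip(forecasts, truths)) / len(truths)


def mape(forecasts, truths):
    """Mean absolute percentage error (undefined when a true value is 0)."""
    return 100 * sum(abs((f - y) / y) for f, y in zip(forecasts, truths)) / len(truths)


def rmse(forecasts, truths):
    """Root mean square error."""
    return sqrt(sum((f - y) ** 2 for f, y in zip(forecasts, truths)) / len(truths))


def brier(probabilities, outcomes):
    """Brier score for binary events: mean squared gap between the
    forecast probability and the 0/1 outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical point forecasts and the values that were later observed.
point_forecasts = [102.0, 98.0, 110.0]
observed = [100.0, 100.0, 100.0]

# Hypothetical probabilities for three binary events and their outcomes.
event_probs = [0.9, 0.2, 0.6]
event_outcomes = [1, 0, 1]

print("MAE  :", mae(point_forecasts, observed))
print("MAPE :", mape(point_forecasts, observed))
print("RMSE :", rmse(point_forecasts, observed))
print("Brier:", brier(event_probs, event_outcomes))
```

RMSE penalizes the single large miss (110 against 100) more heavily than MAE does, which is one reason the two can rank combination methods differently.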
A small number of articles combining probability distributions used metrics that evaluated aggregated forecasts based on density, not point forecasts. Expert forecasts were evaluated using relative entropy and a related metric, the calibration score (see Table 3 for details). These metrics were first introduced by Cooke (Cooke et al., 1988, 1991).
The logscore is one of the most cited metrics for assessing calibration and sharpness (Gneiting and Raftery, 2007; Gneiting and Ranjan, 2011; Hora and Kardeş, 2015) for predictive densities, but was not used in any of the analysis-set articles. Instead, analysis-set articles emphasized point estimates and used metrics to evaluate point forecasts.
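To make the log score concrete, the sketch below evaluates three hypothetical Normal predictive densities against one observed value. It assumes the positively oriented convention (the log of the predictive density at the observation, so higher is better); the densities and the observation are invented for illustration.

```python
from math import log
from statistics import NormalDist


def log_score(predictive, observed):
    """Log of the predictive density at the observed value (higher is better)."""
    return log(predictive.pdf(observed))


observed = 10.5

# Three hypothetical aggregated forecasts of the same target.
sharp_and_centered = NormalDist(mu=10.0, sigma=1.0)  # narrow, near the truth
diffuse = NormalDist(mu=10.0, sigma=5.0)             # centered but very wide
sharp_but_off = NormalDist(mu=15.0, sigma=1.0)       # narrow, far from the truth

scores = {
    "sharp_and_centered": log_score(sharp_and_centered, observed),
    "diffuse": log_score(diffuse, observed),
    "sharp_but_off": log_score(sharp_but_off, observed),
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The narrow, well-centered density scores best, the wide one is penalized for lack of sharpness, and the confident but misplaced one scores worst, which is the behavior that makes the log score sensitive to both calibration and sharpness.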
Three articles conducted an experiment but did not use any formal metrics to compare the results. Two articles used no evaluation and one article visually inspected forecasts.
4.6 Experimental design
Among all analysis-set articles, 22/53 (42%) conducted a comparative experiment. Most articles did not evaluate their forecasting methods because no ground truth data existed. For example, articles would ask experts to give predictions for events hundreds of years in the future (Zio and Apostolakis, 1997; Zio, 1996b). Articles that did not evaluate their combined forecast but did have ground truth data concluded that the predictive distribution they created was “close” to a true distribution. Still other articles concluded their method was successful if it could be implemented at all.
4.7 Training data
Rapidly changing training data (data that could change with time) appeared in 41% of articles. Data came from finance, business, economics, and management, and the predicted targets included monthly demand of products, tourist behavior, and pharmaceutical sales (Baecke et al., 2017; Petrovic et al., 2006; Wang et al., 2008; Kläs et al., 2010; Franses, 2011). In these articles, authors stress that experts can add predictive power by introducing knowledge not used by statistical models, when the quality of data is suspect, and where decisions can have a major impact on outcomes. The rapidly changing environment in these articles is caused by consumer/human behavior.
Articles applied to politics stress that experts have poor accuracy when forecasting complex (and rapidly changing) systems unless they receive feedback about their forecast accuracy and have contextual information about the forecasting task (Graefe et al., 2014a; Graefe, 2015, 2018; Satopää et al., 2014). Political experts, it is argued, receive feedback by observing the outcome of elections and often have strong contextual knowledge about both candidates.
Weather and climate systems were also considered datasets that rapidly change. The Hailfinder system relied on expert knowledge to predict severe local storms in eastern Colorado (Abramson et al., 1996). Weather systems are rapidly changing environments, and this mathematical model of severe weather needed training examples of severe weather. Rather than wait, the Hailfinder system trained using expert input. Expert knowledge was important in saving time and money, and building a severe weather forecasting system that worked.
Ecology articles solicited expert opinion because of sparse training data, a lack of sufficient monitoring of wildlife populations, or to assign subjective risk to potential emerging biological threats (Li et al., 2012; Mantyka-Pringle et al., 2014; Kurowicka et al., 2010).
4.8 Number of elicited experts and number of forecasts made
Over 50% of articles combined forecasts from fewer than 10 experts (Fig. 5). Several articles describe the meticulous book-keeping and prolonged time and effort it takes to collect expert judgments. The costs needed to collect expert opinion may explain the small number of expert forecasters.
Two distinct expert elicitation projects produced articles that analyzed over forecasters. The first project (Seifert and Hadida, 2013) asked experts from music record labels to predict the success (rank) of pop singles. Record label experts were incentivized with a summary of their predictive accuracy, and an online platform collected predictions over a period of weeks.
One of the most successful expert opinion forecasting systems enrolled approximately participants and was called the Good Judgment Project (GJP) (Mellers et al., 2014; Ungar et al., 2012; Satopää et al., 2014). Over a period of years, an online platform was used to ask people political questions with a binary answer (typically yes or no) and to self-assess their level of expertise on the matter. Participants were given feedback on their performance and how to improve with no additional incentives. Both projects that collected a large number of forecasters have common features. An online platform was used to facilitate data collection, and the questions asked were simple: either binary (yes/no) questions or requests to rank pop singles. Both projects incentivized participants with feedback on their forecasting performance.
Close to 80% of articles reported fewer than 100 total forecasts (Fig. 6), and studies reporting more than forecasts were simulation based (except the GJP). Recruiting a small number of experts did not always result in a small number of forecasts. Authors assessing the performance of the PollyVote system collected 452 forecasts from 17 experts (Graefe et al., 2014a; Graefe, 2015, 2018), and a project assessing the demand for products produced forecasts from forecasters (Alvarado-Valencia et al., 2017).
The time and energy required to collect expert opinion is reflected in the low number of forecasters. Some studies did succeed in producing many more forecasts than recruited forecasters, and they did so by using an online platform, asking simpler questions, and giving forecasters feedback about their forecast accuracy.
Combining expert predictions for forecasting continues to show promise; however, rigorous experiments that compare expert to non-expert and statistical forecasts are still needed to confirm the added value of expert judgement. The most useful application in the literature appeals to a mixture of statistical models and expert prediction when data is sparse and evolving. Despite the time and effort it takes to elicit expert-generated data, the wide range of applications and new methods show the field is growing. Authors also recognize the need to include human intuition in models that inform decision makers.
In any combination forecast, built from expert or statistical predictions, there is no consensus on how to best combine individual forecasts or how to compare one forecast to another (Table 3). In addition to methodological disagreements familiar to any combination algorithm, expert judgemental forecasts have the additional burden of collecting predictions made by experts. The literature has not settled on how to define expertise, and an entire field is devoted to understanding how experts differ from non-experts (Dawid et al., 1995; Farrington-Darby and Wilson, 2006; Ericsson and Ward, 2007; Rikers and Paas, 2005; De Groot, 2014). How to collect data from experts in an unbiased and minimally time-consuming manner is also an area of open inquiry. An investigator must spend time designing a strategy to collect data from experts, and experts themselves must make time to complete this prediction task. There is a vast literature on proper techniques for collecting expert-generated data (Ayyub, 2001; Yousuf, 2007; Powell, 2003; Normand et al., 1998; Leal et al., 2007; Martin et al., 2012). Expert elicitation adds a burden to combination forecasting that is not present when aggregating purely statistical models.
Combination forecasting literature reiterated a few key themes: (i) the use of human intuition to aid statistical forecasts when data is sparse and rapidly changing, (ii) including experts because of their role as decision makers, (iii) using simpler aggregation models to combine predictive densities and more complicated models to combine point predictions, and (iv) the lack of experimental design and comparative metrics in many manuscripts.
Many articles introduced expert judgment into their models because the data needed to train a statistical model was unavailable, sparse, or because past data was not a strong indicator of future behavior. When training data was available, researchers typically used expert forecasts to supplement statistical models. Authors argued that experts have a broader picture of the forecasting environment than is present in empirical data. If experts produced forecasts based on uncollected data, then combining their predictions with statistical models was a way of enlarging the training data. Expert-only models were used when data on the forecasting target was unavailable. Authors argued context-specific information available to experts and routine feedback about their past forecasting accuracy meant expert-only models could make accurate forecasts. Though we feel this may not be enough to assume expert-only models can make accurate forecasts, without any training data these attributes allow experts to make forecasts when statistical models cannot.
Applications varied, but each field stressed that the reason for aggregating forecasts from experts was decision-making under uncertainty. Examples include deciding how a company can improve its marketing strategy, which choices and actions can affect wildlife populations and our environment, and how to assess the structural integrity of buildings and nuclear power plants. Numerous articles emphasized the role of decision making in these systems by naming the final aggregated forecast a decision maker.
A longer history of combining point forecasts (Galton, 1907; Bates and Granger, 1969; Granger and Ramanathan, 1984) has prompted advanced methods for building aggregated forecasts from point estimates. Simpler aggregation techniques, like linear pools, averaging quantiles, and rank statistics, were used when combining predictive densities. Despite the shorter history, simple aggregation models for predictive densities show comparable, and often better, results than more complicated techniques (Clemen, 1989; Rantilla and Budescu, 1999). The explanation for why simple methods work so well for combining predictive densities is mostly empirical at this time (Makridakis and Winkler, 1983; Clemen, 1989; Rantilla and Budescu, 1999), but under certain scenarios, a simple average was shown to be optimal (Wallsten et al., 1997a, b).
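One standard piece of intuition for why a simple average is hard to beat, at least for point forecasts scored by squared error, is the identity that the average individual squared error equals the squared error of the averaged forecast plus the spread of the individual forecasts around that average. The sketch below checks this numerically with invented numbers. It is offered as background only; it is not the argument made in the works cited above and it does not address predictive densities.

```python
from statistics import mean


def squared_error_decomposition(forecasts, truth):
    """Return (average individual squared error,
               squared error of the simple average,
               spread of forecasts around the simple average)."""
    combined = mean(forecasts)
    avg_individual = mean([(f - truth) ** 2 for f in forecasts])
    combined_error = (combined - truth) ** 2
    spread = mean([(f - combined) ** 2 for f in forecasts])
    return avg_individual, combined_error, spread


# Hypothetical point forecasts from three experts and the realized value.
forecasts = [8.0, 11.0, 14.0]
truth = 10.0

avg_individual, combined_error, spread = squared_error_decomposition(forecasts, truth)

print("average individual squared error:", avg_individual)
print("squared error of the average    :", combined_error)
print("spread around the average       :", spread)
```

Because the spread term is never negative, the simple average can do no worse than the typical individual expert under squared error, and it does strictly better whenever the experts disagree.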
A small percentage of research took time to set up an experiment that could rigorously compare combination forecasting models. Most articles measured success on whether or not the combination scheme could produce a forecast and visually inspected the results. In some cases visual inspection was used because ground truth data was not present, but in this case, a simulation study could offer insight into the forecasting performance of a novel combination method. No manuscripts compared predictions between forecasts generated by experts only, a combination of experts and statistical models, and statistical models only. Past research is still unclear on the added value experts provide statistical forecasts, and whether expert-only models provide accurate results.
To support research invested in aggregating expert predictions and improve their rigorous evaluation, we recommend the following: (i) future work spend more time on combining probabilistic densities and understanding the theoretical reasons simple aggregation techniques outperform more complicated models, and (ii) authors define an appropriate metric to measure forecast accuracy and develop rigorous experiments to compare novel combination algorithms to existing methods. If such experiments are not feasible, we suggest a simulation study that enrolls a small, medium, and large number of experts to compare aggregation models.
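A simulation of the kind suggested here could start from something as small as the sketch below. Every modeling choice is an assumption made for illustration: experts are simulated as the truth plus independent, unbiased Gaussian noise, panel sizes of 5, 20, and 100 stand in for "small, medium, and large", and only two aggregation rules (mean and median) are compared against a single expert using RMSE.

```python
import random
from math import sqrt
from statistics import mean, median


def simulate_rmse(n_experts, n_targets=500, noise_sd=1.0, seed=0):
    """RMSE of the mean, the median, and one single expert over simulated targets."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    squared_error = {"mean": 0.0, "median": 0.0, "single": 0.0}
    for _ in range(n_targets):
        truth = rng.gauss(0.0, 1.0)
        # Assumed expert model: truth plus independent, unbiased noise.
        forecasts = [truth + rng.gauss(0.0, noise_sd) for _ in range(n_experts)]
        squared_error["mean"] += (mean(forecasts) - truth) ** 2
        squared_error["median"] += (median(forecasts) - truth) ** 2
        squared_error["single"] += (forecasts[0] - truth) ** 2
    return {rule: sqrt(total / n_targets) for rule, total in squared_error.items()}


# Small, medium, and large expert panels (sizes chosen for illustration).
results = {n: simulate_rmse(n) for n in (5, 20, 100)}

for n, rmse_by_rule in results.items():
    summary = ", ".join(f"{rule}={value:.3f}" for rule, value in rmse_by_rule.items())
    print(f"{n:>3} experts: {summary}")
```

Replacing the noise model with correlated or biased experts, or swapping in a novel combination rule alongside the mean and median, would turn this into the comparison the recommendation has in mind.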
Aggregating expert predictions can outperform statistical ensembles when data is sparse or rapidly evolving. By making predictions, experts can gain insight into how forecasts are made, the assumptions implicit in forecasts, and ultimately how to best use the information forecasts provide to make critical decisions about the future.
This work was funded by the National Institute of General Medical Sciences (NIGMS) Grant R35GM119582. The findings and conclusions in this manuscript are those of the authors and do not necessarily represent the views of the NIH or the NIGMS. The funders had no role in study design, data collection and analysis, decision to present, or preparation of the presentation.
- Abernethy and Frongillo  Jacob D Abernethy and Rafael M Frongillo. A collaborative mechanism for crowdsourcing prediction problems. In Advances in Neural Information Processing Systems, pages 2600–2608, 2011.
- Abramson et al.  Bruce Abramson, John Brown, Ward Edwards, Allan Murphy, and Robert L Winkler. Hailfinder: A bayesian system for forecasting severe weather. International Journal of Forecasting, 12(1):57–71, 1996.
- Adams et al.  Ray Adams, Anthony White, and Efe Ceylan. An acceptability predictor for websites. In International Conference on Universal Access in Human-Computer Interaction, pages 628–634. Springer, 2009.
- Al-Jarrah et al.  Omar Y Al-Jarrah, Paul D Yoo, Sami Muhaidat, George K Karagiannidis, and Kamal Taha. Efficient machine learning for big data: A review. Big Data Research, 2(3):87–93, 2015.
- Alexander Jr  John C Alexander Jr. Refining the degree of earnings surprise: A comparison of statistical and analysts’ forecasts. Financial Review, 30(3):469–506, 1995.
- Alho  Juha M Alho. Estimating the strength of expert judgement: the case of us mortality forecasts. Journal of Forecasting, 11(2):157–167, 1992.
- Alvarado-Valencia et al.  Jorge Alvarado-Valencia, Lope H Barrero, Dilek Önkal, and Jack T Dennerlein. Expertise, credibility of system forecasts and integration methods in judgmental demand forecasting. International Journal of Forecasting, 33(1):298–313, 2017.
- Amara and Lipinski  Roy C Amara and Andrew J Lipinski. Some views on the use of expert judgment. Technological Forecasting and Social Change, 3:279–289, 1971.
- Armstrong  J Scott Armstrong. Relative accuracy of judgemental and extrapolative methods in forecasting annual earnings. Journal of Forecasting, 2(4):437–447, 1983.
- Armstrong [2001a] J Scott Armstrong. Combining forecasts. In Principles of forecasting, pages 417–439. Springer, 2001a.
- Armstrong [2001b] J Scott Armstrong. Judgmental bootstrapping: Inferring experts’ rules for forecasting. In Principles of forecasting, pages 171–192. Springer, 2001b.
- Armstrong  Jon Scott Armstrong. Long-range forecasting: From crystal ball to computer. Wiley, New York, 1985.
- Ayyub  Bilal M Ayyub. Elicitation of expert opinions for uncertainty and risks. CRC press, 2001.
- Baecke et al.  Philippe Baecke, Shari De Baets, and Karlien Vanderheyden. Investigating the added value of integrating human judgement into statistical demand forecasting systems. International Journal of Production Economics, 191:85–96, 2017.
- Baldwin  Peter Baldwin. Weighting components of a composite score using naïve expert judgments about their relative importance. Applied psychological measurement, 39(7):539–550, 2015.
- Baran and Lerch  Sándor Baran and Sebastian Lerch. Combining predictive distributions for the statistical post-processing of ensemble forecasts. International Journal of Forecasting, 34(3):477–496, 2018.
- Baron et al.  Jonathan Baron, Barbara A Mellers, Philip E Tetlock, Eric Stone, and Lyle H Ungar. Two reasons to make aggregated probability forecasts more extreme. Decision Analysis, 11(2):133–145, 2014.
- Bassetti et al.  Federico Bassetti, Roberto Casarin, and Francesco Ravazzolo. Bayesian nonparametric calibration and combination of predictive distributions. Journal of the American Statistical Association, 113(522):675–685, 2018.
- Bates and Granger  John M Bates and Clive WJ Granger. The combination of forecasts. Journal of the Operational Research Society, 20(4):451–468, 1969.
- Berrocal et al.  Veronica J Berrocal, Adrian E Raftery, and Tilmann Gneiting. Combining spatial statistical and ensemble information in probabilistic weather forecasts. Monthly Weather Review, 135(4):1386–1402, 2007.
- Bolger and Houlding  Donnacha Bolger and Brett Houlding. Deriving the probability of a linear opinion pooling method being superior to a set of alternatives. Reliability Engineering & System Safety, 158:41–49, 2017. ISSN 0951-8320. doi: 10.1016/j.ress.2016.10.008.
- Borsuk  Mark E Borsuk. Predictive assessment of fish health and fish kills in the neuse river estuary using elicited expert judgment. Human and Ecological Risk Assessment, 10(2):415–434, 2004.
- Brabham  Daren C Brabham. Crowdsourcing. Mit Press, 2013.
- Brito and Griffiths [2016a] Mario Brito and Gwyn Griffiths. A bayesian approach for predicting risk of autonomous underwater vehicle loss during their missions. Reliability Engineering & System Safety, 146:55–67, 2016a. ISSN 0951-8320. doi: https://doi.org/10.1016/j.ress.2015.10.004. URL http://www.sciencedirect.com/science/article/pii/S0951832015002860.
- Brito and Griffiths [2016b] Mario Brito and Gwyn Griffiths. A bayesian approach for predicting risk of autonomous underwater vehicle loss during their missions. Reliability Engineering & System Safety, 146:55–67, 2016b.
- Brito et al.  Mario Brito, Gwyn Griffiths, James Ferguson, David Hopkin, Richard Mills, Richard Pederson, and Erin MacNeil. A behavioral probabilistic risk assessment framework for managing autonomous underwater vehicle deployments. Journal of Atmospheric and Oceanic Technology, 29(11):1689–1703, 2012.
- Bunn and Wright  Derek Bunn and George Wright. Interaction of judgemental and statistical forecasting methods: issues & analysis. Management science, 37(5):501–518, 1991.
- Bunn  Derek W Bunn. The synthesis of predictive models in marketing research, 1979.
- Bunn  Derek W Bunn. Statistical efficiency in the linear combination of forecasts. International Journal of Forecasting, 1(2):151–163, 1985.
- Cabello et al.  Enrique Cabello, Cristina Conde, Isaac Diego, Javier Moguerza, and Andrés Redchuk. Combination and selection of traffic safety expert judgments for the prevention of driving risks. Sensors, 12(11):14711–14729, 2012.
- Cai et al.  Mengya Cai, Yingzi Lin, Bin Han, Changjun Liu, and Wenjun Zhang. On a simple and efficient approach to probability distribution function aggregation. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(9):2444–2453, 2016.
- Capistrán and Timmermann  Carlos Capistrán and Allan Timmermann. Forecast combination with entry and exit of experts. Journal of Business & Economic Statistics, 27(4):428–440, 2009.
- Che et al.  Dongsheng Che, Qi Liu, Khaled Rasheed, and Xiuping Tao. Decision tree and ensemble learning algorithms with their applications in bioinformatics. In Software tools and algorithms for biological systems, pages 191–199. Springer, 2011.
- Clemen  Robert T Clemen. Combining forecasts: A review and annotated bibliography. International journal of forecasting, 5(4):559–583, 1989.
- Clemen  Robert T Clemen. Comment on cooke’s classical method. Reliability Engineering & System Safety, 93(5):760–765, 2008.
- Clemen and Winkler  Robert T Clemen and Robert L Winkler. Combining economic forecasts. Journal of Business & Economic Statistics, 4(1):39–46, 1986.
- Clemen and Winkler [1999a] Robert T Clemen and Robert L Winkler. Combining probability distributions from experts in risk analysis. Risk analysis, 19(2):187–203, 1999a.
- Clemen and Winkler  Robert T Clemen and Robert L Winkler. Aggregating probability distributions. In W Edwards, RF Miles, and D von Winterfeldt, editors, Advances in Decision Analysis: From Foundations to Applications, pages 154–176. 2007. ISBN 978-0-52186-368-1. doi: 10.1017/CBO9780511611308.010.
- Clemen and Winkler [1999b] RT Clemen and RL Winkler. Combining probability distributions from experts in risk analysis. Risk Analysis, 19(2):187–203, 1999b. ISSN 0272-4332. doi: 10.1111/j.1539-6924.1999.tb00399.x.
- Cooke et al.  Roger Cooke, Max Mendel, and Wim Thijs. Calibration and information in expert resolution; a classical approach. Automatica, 24(1):87–93, 1988.
- Cooke et al.  Roger Cooke et al. Experts in uncertainty: opinion and subjective probability in science. Oxford University Press on Demand, 1991.
- Cooke  Roger M Cooke. Validating expert judgment with the classical model. In Experts and Consensus in Social Science, pages 191–212. Springer, 2014.
- Cooke  Roger M Cooke. The aggregation of expert judgment: do good things come to those who weight? Risk Analysis, 35(1):12–15, 2015.
- Cooke et al.  Roger M Cooke, Marion E Wittmann, David M Lodge, John D Rothlisberger, Edward S Rutherford, Hongyan Zhang, and Doran M Mason. Out-of-sample validation for structured expert judgment of asian carp establishment in lake erie. Integrated Environmental Assessment and Management, 10(4):522–528, 2014.
- Craig et al.  Peter S Craig, Michael Goldstein, Jonathan C Rougier, and Allan H Seheult. Bayesian forecasting for complex systems using computer simulators. Journal of the American Statistical Association, 96(454):717–729, 2001.
- Dawid  A Philip Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77–93, 2007.
- Dawid et al.  AP Dawid, MH DeGroot, J Mortera, Roger Cooke, S French, C Genest, MJ Schervish, DV Lindley, KJ McConway, and RL Winkler. Coherent combination of experts’ opinions. Test, 4(2):263–313, 1995.
- De Groot  Adriaan D De Groot. Thought and choice in chess, volume 4. Walter de Gruyter GmbH & Co KG, 2014.
- Dietterich et al.  Thomas G Dietterich et al. Ensemble learning. The handbook of brain theory and neural networks, 2:110–125, 2002.
- Džeroski and Ženko  Saso Džeroski and Bernard Ženko. Is combining classifiers with stacking better than selecting the best one? Machine learning, 54(3):255–273, 2004.
- Edmundson  RH Edmundson. Decomposition; a strategy for judgemental forecasting. Journal of Forecasting, 9(4):305–314, 1990.
- Ericsson and Ward  K Anders Ericsson and Paul Ward. Capturing the naturally occurring superior performance of experts in the laboratory: Toward a science of expert and exceptional performance. Current Directions in Psychological Science, 16(6):346–350, 2007.
- Evans et al. [1994a] John S Evans, George M Gray, Robert L Sielken, Andrew E Smith, Ciriaco Valdezflores, and John D Graham. Use of probabilistic expert judgment in uncertainty analysis of carcinogenic potency. Regulatory Toxicology and Pharmacology, 20(1):15–36, 1994a.
- Evans et al. [1994b] J.S. Evans, G.M. Gray, R.L. Sielken, A.E. Smith, C. Valdezflores, and J.D. Graham. Use of probabilistic expert judgment in uncertainty analysis of carcinogenic potency. Regulatory Toxicology and Pharmacology, 20(1):15–36, 1994b. ISSN 0273-2300. doi: https://doi.org/10.1006/rtph.1994.1034. URL http://www.sciencedirect.com/science/article/pii/S0273230084710348.
- Failing et al.  Lee Failing, Graham Horn, and Paul Higgins. Using expert judgment and stakeholder values to evaluate adaptive management options. Ecology and Society, 9(1), 2004.
- Farrington-Darby and Wilson  Trudi Farrington-Darby and John R Wilson. The nature of expertise: A review. Applied ergonomics, 37(1):17–32, 2006.
- Fischer and Harvey  Ilan Fischer and Nigel Harvey. Combining forecasts: What information do judges need to outperform the simple average? International journal of forecasting, 15(3):227–246, 1999.
- Forlines et al.  Clifton Forlines, Sarah Miller, Leslie Guelcher, and Robert Bruzzi. Crowdsourcing the future: predictions made with a social network. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3655–3664. ACM, 2014.
- Franses  Philip Hans Franses. Averaging model forecasts and expert forecasts: Why does it work? Interfaces, 41(2):177–181, 2011.
- French  Simon French. Aggregating expert judgement. Revista de la Real Academia de Ciencias Exactas, Fisicas y Naturales. Serie A. Matematicas, 105(1):181–206, 2011.
- Galton  Francis Galton. Vox populi (the wisdom of crowds). Nature, 75(7):450–451, 1907.
- Garratt et al.  Anthony Garratt, Timo Henckel, and Shaun P Vahey. Empirically-transformed linear opinion pools. 2019.
- Genest et al.  Christian Genest, James V Zidek, et al. Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1(1):114–135, 1986.
- Genre et al.  Véronique Genre, Geoff Kenny, Aidan Meyler, and Allan Timmermann. Combining expert forecasts: Can anything beat the simple average? International Journal of Forecasting, 29(1):108–121, 2013.
- Glahn et al.  Bob Glahn, Matthew Peroutka, Jerry Wiedenfeld, John Wagner, Greg Zylstra, Bryan Schuknecht, and Bryan Jackson. Mos uncertainty estimates in an ensemble framework. Monthly Weather Review, 137(1):246–268, 2009.
- Gneiting and Raftery  Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- Gneiting and Ranjan  Tilmann Gneiting and Roopesh Ranjan. Comparing density forecasts using threshold-and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3):411–422, 2011.
- Gneiting et al.  Tilmann Gneiting, Roopesh Ranjan, et al. Combining predictive distributions. Electronic Journal of Statistics, 7:1747–1782, 2013.
- Graefe  Andreas Graefe. Accuracy gains of adding vote expectation surveys to a combined forecast of us presidential election outcomes. Research & Politics, 2(1):2053168015570416, 2015.
- Graefe  Andreas Graefe. Predicting elections: Experts, polls, and fundamentals. Judgment and Decision Making, 13(4):334, 2018.
- Graefe et al. [2014a] Andreas Graefe, J Scott Armstrong, Randall J Jones, and Alfred G Cuzán. Accuracy of combined forecasts for the 2012 presidential election: The pollyvote. PS: Political Science & Politics, 47(2):427–431, 2014a.
- Graefe et al. [2014b] Andreas Graefe, J Scott Armstrong, Randall J Jones Jr, and Alfred G Cuzán. Combining forecasts: An application to elections. International Journal of Forecasting, 30(1):43–54, 2014b.
- Granger and Ramanathan  Clive WJ Granger and Ramu Ramanathan. Improved methods of combining forecasts. Journal of forecasting, 3(2):197–204, 1984.
- Gu et al.  Wei Gu, Thomas L Saaty, and Rozann Whitaker. Expert system for ice hockey game prediction: Data mining with human judgment. International Journal of Information Technology & Decision Making, 15(04):763–789, 2016.
- Guangliang  Sun Guangliang. A multi-hierarchical comprehensive evaluation model and its application [j]. Systems Engineering, 2, 1996.
- Hammond and Stewart  Kenneth R Hammond and Thomas R Stewart. The essential Brunswik: Beginnings, explications, applications. Oxford University Press, 2001.
- Hanea et al.  Anca M Hanea, Marissa F McBride, Mark A Burgman, and Bonnie C Wintle. The value of performance weights and discussion in aggregated expert judgments. Risk Analysis, 38(9):1781–1794, 2018.
- Hathout et al.  Michel Hathout, Marc Vuillet, Laurent Peyras, Claudio Carvajal, and Youssef Diab. Uncertainty and expert assessment for supporting evaluation of levees safety. In 3rd European Conference on Flood Risk Management FLOODrisk 2016, pages 6–p, 2016.
- Helmer  Olaf Helmer. Analysis of the future: The Delphi method. Technical report, RAND Corporation, Santa Monica, CA, 1967.
- Hogarth  Robin M Hogarth. Cognitive processes and the assessment of subjective probability distributions. Journal of the American Statistical Association, 70(350):271–289, 1975.
- Hora and Kardeş  Stephen C Hora and Erim Kardeş. Calibration, sharpness and the weighting of experts in a linear opinion pool. Annals of Operations Research, 229(1):429–450, 2015.
- Hora et al.  Stephen C Hora, Benjamin R Fransen, Natasha Hawkins, and Irving Susel. Median aggregation of distribution functions. Decision Analysis, 10(4):279–291, 2013.
- Howe  Jeff Howe. The rise of crowdsourcing. Wired magazine, 14(6):1–4, 2006.
- Huang et al.  Anqiang Huang, Han Qiao, Shouyang Wang, and John Liu. Improving forecasting performance by exploiting expert knowledge: Evidence from Guangzhou port. International Journal of Information Technology & Decision Making, 15(02):387–401, 2016.
- Hurley and Lior  WJ Hurley and DU Lior. Combining expert judgment: On the performance of trimmed mean vote aggregation procedures in the presence of strategic voting. European Journal of Operational Research, 140(1):142–147, 2002.
- Jana et al.  Dipak Kumar Jana, Sutapa Pramanik, Palash Sahoo, and Anupam Mukherjee. Interval type-2 fuzzy logic and its application to occupational safety risk performance in industries. Soft Computing, 23(2):557–567, 2019.
- Jin et al.  Weiliang Jin, Qingfang Lu, and Weizhong Gan. Research progress on the durability design and life prediction of concrete structures. Jianzhu Jiegou Xuebao/Journal of Building Structures, 28(1):7–13, 2007.
- Johnson et al.  Fred A Johnson, Mikko Alhainen, Anthony D Fox, Jesper Madsen, and Matthieu Guillemain. Making do with less: must sparse data preclude informed harvest strategies for European waterbirds? Ecological Applications, 28(2):427–441, 2018.
- Jolliffe and Stephenson  Ian T Jolliffe and David B Stephenson. Forecast verification: a practitioner’s guide in atmospheric science. John Wiley & Sons, 2012.
- Kabak and Ülengin  Özgür Kabak and Füsun Ülengin. Aggregating forecasts to obtain fuzzy demands. In Computational Intelligence In Decision And Control, pages 73–78. World Scientific, 2008.
- Keeney  Ralph L Keeney. A group preference axiomatization with cardinal utility. Management Science, 23(2):140–145, 1976.
- Kläs et al.  Michael Kläs, Haruka Nakao, Frank Elberzhager, and Jürgen Münch. Support planning and controlling of early quality assurance by combining expert judgment and defect data—a case study. Empirical Software Engineering, 15(4):423–454, 2010.
- Kleiber et al.  William Kleiber, Adrian E Raftery, Jeffrey Baars, Tilmann Gneiting, Clifford F Mass, and Eric Grimit. Locally calibrated probabilistic temperature forecasting using geostatistical model averaging and local Bayesian model averaging. Monthly Weather Review, 139(8):2630–2649, 2011.
- Kleinmuntz  Benjamin Kleinmuntz. Why we still use our heads instead of formulas: Toward an integrative approach. Psychological bulletin, 107(3):296, 1990.
- Kune et al.  Raghavendra Kune, Pramod Kumar Konugurthi, Arun Agarwal, Raghavendra Rao Chillarige, and Rajkumar Buyya. The anatomy of big data computing. Software: Practice and Experience, 46(1):79–105, 2016.
- Kurowicka et al.  Dorota Kurowicka, Catalin Bucura, Roger Cooke, and Arie Havelaar. Probabilistic inversion in priority setting of emerging zoonoses. Risk Analysis: An International Journal, 30(5):715–723, 2010.
- Lawrence and O’Connor  Michael Lawrence and Marcus O’Connor. Exploring judgemental forecasting. International Journal of Forecasting, 8(1):15–26, 1992.
- Lawrence et al.  Michael Lawrence, Paul Goodwin, Marcus O’Connor, and Dilek Önkal. Judgmental forecasting: A review of progress over the last 25 years. International Journal of forecasting, 22(3):493–518, 2006.
- Leal et al.  José Leal, Sarah Wordsworth, Rosa Legood, and Edward Blair. Eliciting expert opinion for economic models: an applied example. Value in Health, 10(3):195–203, 2007.
- Li et al.  Wei Li, Yan-ju Liu, and Zhifeng Yang. Preliminary strategic environmental assessment of the Great Western Development Strategy: safeguarding ecological security for a new western China. Environmental Management, 49(2):483–501, 2012.
- Lintott et al.  Chris J Lintott, Kevin Schawinski, Anže Slosar, Kate Land, Steven Bamford, Daniel Thomas, M Jordan Raddick, Robert C Nichol, Alex Szalay, Dan Andreescu, et al. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society, 389(3):1179–1189, 2008.
- Loper and Bird  Edward Loper and Steven Bird. NLTK: the natural language toolkit. arXiv preprint cs/0205028, 2002.
- Mak et al.  Brenda Mak, Tung Bui, and Robert Blanning. Aggregating and updating experts’ knowledge: an experimental evaluation of five classification techniques. Expert systems with Applications, 10(2):233–241, 1996.
- Makridakis and Winkler  Spyros Makridakis and Robert L Winkler. Averages of forecasts: Some empirical results. Management Science, 29(9):987–996, 1983.
- Mantyka-Pringle et al.  Chrystal S Mantyka-Pringle, Tara G Martin, David B Moffatt, Simon Linke, and Jonathan R Rhodes. Understanding and predicting the combined effects of climate change and land-use change on freshwater macroinvertebrates and fish. Journal of Applied Ecology, 51(3):572–581, 2014.
- Martin et al.  Tara G Martin, Mark A Burgman, Fiona Fidler, Petra M Kuhnert, Samantha Low-Choy, Marissa McBride, and Kerrie Mengersen. Eliciting expert knowledge in conservation science. Conservation Biology, 26(1):29–38, 2012.
- McLaughlin  Robert L McLaughlin. The forecasters’ batting averages. Business Economics, pages 58–59, 1973.
- Meehl  Paul E Meehl. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. 1954.
- Mellers et al.  Barbara Mellers, Lyle Ungar, Jonathan Baron, Jaime Ramos, Burcu Gurcay, Katrina Fincher, Sydney E Scott, Don Moore, Pavel Atanasov, Samuel A Swift, et al. Psychological strategies for winning a geopolitical forecasting tournament. Psychological science, 25(5):1106–1115, 2014.
- Morales-Nápoles et al.  Oswaldo Morales-Nápoles, Dominik Paprotny, Daniël Worm, Linda Abspoel-Bukman, and Wim Courage. Characterization of precipitation through copulas and expert judgement for risk assessment of infrastructure. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part A: Civil Engineering, 3(4):04017012, 2017.
- Moran et al.  Kelly R Moran, Geoffrey Fairchild, Nicholas Generous, Kyle Hickmann, Dave Osthus, Reid Priedhorsky, James Hyman, and Sara Y Del Valle. Epidemic forecasting is messier than weather forecasting: The role of human behavior and internet data streams in epidemic forecast. The Journal of infectious diseases, 214(suppl_4):S404–S408, 2016.
- Morgan  M Granger Morgan. Use (and abuse) of expert elicitation in support of decision making for public policy. Proceedings of the National Academy of Sciences, 111(20):7176–7184, 2014.
- Morris  Peter A Morris. Decision analysis expert use. Management Science, 20(9):1233–1241, 1974.
- Mu and Xianming  L Mu and W Xianming. Multi-hierarchical durability assessment of existing reinforced-concrete structures. In Proceedings of the 8th International Conference on Durability of Building Materials and Components, pages 49–69, 1999.
- Neves and Frangopol  LC Neves and D Frangopol. Life-cycle performance of structures: combining expert judgment and results of inspection. In Proceedings of the 1st International Symposium on Life-Cycle Civil Engineering (Biondini F and Frangopol D (eds)). CRC Press, Boca Raton, FL, USA, pages 409–414, 2008.
- Normand et al.  Sharon-Lise T Normand, Barbara J McNeil, Laura E Peterson, and R Heather Palmer. Eliciting expert opinion using the Delphi technique: identifying performance indicators for cardiovascular disease. International Journal for Quality in Health Care, 10(3):247–260, 1998.
- O’Connor et al.  Marcus O’Connor, William Remus, and Ken Griggs. Judgemental forecasting in times of change. International Journal of Forecasting, 9(2):163–172, 1993.
- O’Hagan et al.  Anthony O’Hagan, Caitlin E Buck, Alireza Daneshkhah, J Richard Eiser, Paul H Garthwaite, David J Jenkinson, Jeremy E Oakley, and Tim Rakow. Uncertain judgements: eliciting experts’ probabilities. John Wiley & Sons, 2006.
- Petrovic et al.  Dobrila Petrovic, Ying Xie, and Keith Burnham. Fuzzy decision support system for demand forecasting with a learning mechanism. Fuzzy Sets and Systems, 157(12):1713–1725, 2006.
- Polley and Van Der Laan  Eric C Polley and Mark J Van Der Laan. Super learner in prediction. 2010.
- Powell  Catherine Powell. The Delphi technique: myths and realities. Journal of Advanced Nursing, 41(4):376–382, 2003.
- Prill et al.  Robert J Prill, Julio Saez-Rodriguez, Leonidas G Alexopoulos, Peter K Sorger, and Gustavo Stolovitzky. Crowdsourcing network inference: the DREAM predictive signaling network challenge, 2011.
- Ranjan and Gneiting  Roopesh Ranjan and Tilmann Gneiting. Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):71–91, 2010.
- Rantilla and Budescu  Adrian K Rantilla and David V Budescu. Aggregation of expert opinions. In Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers, pages 11–pp. IEEE, 1999.
- Ren-jun and Xian-zhong  Zhou Ren-jun and Duan Xian-zhong. Optimal combined load forecast based on the improved analytic hierarchy process. In Proceedings. International Conference on Power System Technology, volume 2, pages 1096–1100. IEEE, 2002.
- Rikers and Paas  Remy MJP Rikers and Fred Paas. Recent advances in expertise research. Applied Cognitive Psychology, 19(2):145–149, 2005.
- Sakkis et al.  Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D Spyropoulos, and Panagiotis Stamatopoulos. Stacking classifiers for anti-spam filtering of e-mail. arXiv preprint cs/0106040, 2001.
- Sarin  Rakesh K Sarin. Median aggregation, scoring rules, expert forecasts, choices with binary attributes, portfolio with dependent projects, and information security. Decision Analysis, 10(4):277–278, 2013. doi: 10.1287/deca.2013.0284.
- Satopää et al.  Ville A Satopää, Shane T Jensen, Barbara A Mellers, Philip E Tetlock, Lyle H Ungar, et al. Probability aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs. The Annals of Applied Statistics, 8(2):1256–1280, 2014.
- Seifert and Hadida  Matthias Seifert and Allègre L Hadida. On the relative importance of linear model and human judge(s) in combined forecasting. Organizational Behavior and Human Decision Processes, 120(1):24–36, 2013.
- Shin et al.  Juneseuk Shin, Byoung-Youl Coh, and Changyong Lee. Robust future-oriented technology portfolios: Black–Litterman approach. R&D Management, 43(5):409–419, 2013.
- Song et al.  Haiyan Song, Bastian Z Gao, and Vera S Lin. Combining statistical and judgmental forecasts via a web-based tourism demand forecasting system. International Journal of Forecasting, 29(2):295–310, 2013.
- Spence and Brucks  Mark T Spence and Merrie Brucks. The moderating effects of problem characteristics on experts’ and novices’ judgments. Journal of marketing Research, 34(2):233–247, 1997.
- Stone  Mervyn Stone. The opinion pool. The Annals of Mathematical Statistics, pages 1339–1342, 1961.
- Syarif et al.  Iwan Syarif, Ed Zaluska, Adam Prugel-Bennett, and Gary Wills. Application of bagging, boosting and stacking to intrusion detection. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 593–602. Springer, 2012.
- Tartakovsky  Daniel M Tartakovsky. Probabilistic risk analysis in subsurface hydrology. Geophysical research letters, 34(5), 2007.
- Ting and Witten  Kai Ming Ting and Ian H Witten. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271–289, 1999.
- Ungar et al.  Lyle Ungar, Barbara Mellers, Ville Satopää, Philip Tetlock, and Jon Baron. The good judgment project: A large scale test of different methods of combining expert predictions. In 2012 AAAI Fall Symposium Series, 2012.
- Van der Laan et al.  Mark J Van der Laan, Eric C Polley, and Alan E Hubbard. Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007.
- Wallis  Kenneth F Wallis. Combining forecasts–forty years later. Applied Financial Economics, 21(1-2):33–41, 2011.
- Wallsten et al. [1997a] Thomas S Wallsten, David V Budescu, Ido Erev, and Adele Diederich. Evaluating and combining subjective probability estimates. Journal of Behavioral Decision Making, 10(3):243–268, 1997a.
- Wallsten et al. [1997b] Thomas S Wallsten, David V Budescu, and Chen Jung Tsao. Combining linguistic probabilities. Psychologische Beitrage, 1997b.
- Wang et al.  Chun Wang, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, and Jun Yan. Statistical methods and computing for big data. Statistics and its interface, 9(4):399, 2016.
- Wang et al.  Gang Wang, Jinxing Hao, Jian Ma, and Hongbing Jiang. A comparative assessment of ensemble learning for credit scoring. Expert systems with applications, 38(1):223–230, 2011.
- Wang and Zhang  Liguang Wang and Xueqing Zhang. Bayesian analytics for estimating risk probability in PPP waste-to-energy projects. Journal of Management in Engineering, 34(6):04018047, 2018.
- Wang et al.  Xiaofeng Wang, Chao Du, and Zuoliang Cao. Probabilistic inversion techniques in quantitative risk assessment for power system load forecasting. In 2008 International Conference on Information and Automation, pages 718–723. IEEE, 2008.
- Willett  Peter Willett. The Porter stemming algorithm: then and now. Program, 40(3):219–223, 2006.
- Winkler  Robert L Winkler. The consensus of subjective probability distributions. Management Science, 15(2):B–61, 1968.
- Winkler  Robert L Winkler. Probabilistic prediction: Some experimental results. Journal of the American Statistical Association, 66(336):675–685, 1971.
- Winkler  Robert L Winkler. Combining probability distributions from dependent information sources. Management Science, 27(4):479–488, 1981.
- Wolpert  David H Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.
- Yaniv and Hogarth  Ilan Yaniv and Robin M Hogarth. Judgmental versus statistical prediction: Information asymmetry and combination rules. Psychological Science, 4(1):58–62, 1993.
- Yousuf  Muhammad Imran Yousuf. Using experts’ opinions through Delphi technique. Practical Assessment, Research & Evaluation, 12(4):1–8, 2007.
- Zio [1996a] E Zio. On the use of the analytic hierarchy process in the aggregation of expert judgments. Reliability Engineering & System Safety, 53(2):127–138, 1996a. doi: 10.1016/0951-8320(96)00060-9.
- Zio [1996b] E Zio. On the use of the analytic hierarchy process in the aggregation of expert judgments. Reliability Engineering & System Safety, 53(2):127–138, 1996b.
- Zio and Apostolakis  E Zio and GE Apostolakis. Accounting for expert-to-expert variability: a potential source of bias in performance assessments of high-level radioactive waste repositories. Annals of Nuclear Energy, 24(10):751–762, 1997.