1 Introduction and background
Accurate forecasting helps decision making, especially when the future is uncertain. For example, forecasting the future demand of stock keeping units (SKUs) helps in managing a supply chain, and forecasting tourist arrivals helps in capacity planning.
Frequently, the time series to be forecast are naturally organized in hierarchical structures. For instance, although the demand for an SKU could be recorded on a storebystore level, it could also be aggregated to give the demand on a regional or national level. At the same time, the demand of similar SKUs could be included in the demand of larger categories of products. These structures led to the development of hierarchical forecasting (HF) approaches. Such approaches are proposed in the crosssectional [athanasopoulos2009hierarchical, hyndman2011optimal, wickramasuriya2019optimal], temporal [athanasopoulos2017forecasting], and crosstemporal domains [KOURENTZES2019393, Spiliotis2020hj].
The observed demands at each level will always add up to the observed demand at higher levels. It is usually desirable that the same holds true for forecasts — that is, that the aggregate of the forecasts at a lower level is equal to the forecast of the aggregates at a higher level. This property is known as forecasting “coherence” [Athanasopoulos2020cx]. If forecasting at the different levels is done independently, we usually have forecast incoherence — the forecasts do not add up.
Until the late 2010s, the problem of forecast incoherence was bypassed by modelling and producing forecasts on a single hierarchical level:

Some researchers [dangerfield1992top, Zellner2000vi] have argued for generating forecasts only on the lowest, most granular level of a hierarchy. If forecasts are needed at higher levels, these are not produced directly using the aggregated information; instead, the lower level forecasts are summed up. This approach is known as “bottomup” (BU). The BU approach can be more suitable for shortterm operational decisions, such as logistics and production planning [kahn1998revisiting]. A downside of the BU approach is the difficulty to model each bottom level series due to the high level of noise and computational concerns in the case of large hierarchies [Gross1990bl, athanasopoulos2009hierarchical].

Other researchers [Gross1990bl, Fliedner1999] have suggested that only the top level of a hierarchy be directly forecasted, and then the forecasts are disaggregated to the lower levels using historical or forecasted [athanasopoulos2009hierarchical] proportions. This approach is known as “topdown” (TD). TD is more appropriate when strategic plans and decisions such as budgeting are made. TD generally requires fewer resources and modeling decisions, with forecasts being made on a single (top) series. However, the accuracy of the forecasts drops at lower levels of the hierarchy, due to the information loss incurred while aggregating the lowerlevel data to the higher aggregation levels.

A solution between BU and TD is offered by the “middleout” (MO) approach. In MO, forecasts are produced on an intermediate level of the hierarchy. Lower and higher level forecasts are derived by disaggregation and aggregation of the MO forecasts respectively.
The BU, MO, and TD approaches are myopic in the sense that they focus on a particular aggregation level to produce forecasts, thus ignoring some useful information [pennings2017integrated] available at other levels. In the last 12 years, hierarchical forecasting approaches have significantly evolved to include combination (COM) approaches that directly tackle the challenge of coherence. COM approaches have the advantage of using the information from all hierarchical levels to produce forecasts. These forecasts are consequently combined, using weights that are obtained either statistically (see [athanasopoulos2009hierarchical, hyndman2011optimal, wickramasuriya2019optimal]) or empirically (see the crossvalidation approach in [Jeon2019xo]). Simpler combinations based on equal weights have also been shown to be useful under some settings [Abouarghoub2018wz]. The application of hierarchical combination approaches has one direct advantage: it renders forecasts across hierarchical levels coherent, a property that is desirable in aligning decision making across the different functions of an organization. Apart from its direct benefits, more often than not, COM also results in superior forecasting performance compared to simpler HF approaches.
The hierarchical combination approaches that have been explored so far in the literature are linear in nature. The only existing nonlinear approach in HF, proposed by [abolghasemi2019machine], uses ML models under the MO approach to dynamically forecast the proportions of the child nodes from their parent. However, this approach exploits information only from the parent node, ignoring the rest of the nodes that could be useful for obtaining more accurate results.
In all four aforementioned approaches (BU, TD, MO, and COM), the base forecasts can be generated using any statistical or judgmental forecasting method. Indeed, the method of choice might differ depending on the aggregation level of focus and the data availability. Popular choices include univariate forecasting models, such as exponential smoothing (ETS) and AutoRegressive Integrated Moving Average (ARIMA) models. However the baseline models could also allow for exogenous information, which may be crucial in, for example, retail settings where promotions occur often. Moreover, [Spiliotis2019hg] showed that combining the forecasts across methods in order to obtain more accurate base forecasts will also increase the performance of the final, reconciled hierarchical forecasts. The efficacy of the different HF approaches depends on the time series features, the level of forecasting, the forecasting horizon, the structure of the hierarchy, and the relationships of the series. We may consider these variables when choosing the most appropriate HF approach [Fliedner1999, Fliedner2001, gross1990disaggregation, nenova2016determining, abolghasemi2019machine].
In this paper we offer a nonlinear perspective to the problem of hierarchical reconciliation and forecast coherence. We propose the use of Machine Learning (ML) techniques to derive the combination weights for the forecasts across the various aggregation levels. We focus on two ML algorithms that have been shown to perform well in time series contexts and crosslearning: Random Forests (RF) and XGBoost (XGB). Such decision tree algorithms allow the exploitation of nonlinear relationships across a number of series. This is particularly useful in hierarchical structures, especially when exogenous variables are available only on some of the hierarchical levels, the series are not all characterized by the same patterns, or the relationships of the series change through time. The contributions of this paper are threefold:

We propose nonlinear approaches to the problem of hierarchical forecast reconciliation. These approaches are more general compared to their linear counterparts and are expected to enhance the forecasting performance across all hierarchical levels.

The majority of the existing HF reconciliation approaches are, strictly speaking, designed to result in coherence under particular assumptions, with improvements in terms of forecasting performance being an allwelcome sideeffect. In contrast, our proposed approaches structurally combine the objectives of postsample empirical forecast accuracy and coherence in the training phase of the ML algorithm. The only other approach in the literature that has this property is HF via crossvalidation [Jeon2019xo]. Other methods that optimize forecast accuracy under the constraint of linear coherence, like the one proposed by [wickramasuriya2019optimal], do so using the onestepahead insample errors of the baseline forecasting methods, which may not be representative of postsample accuracy.

Unlike existing HF approaches, our proposed approaches selectively combine the forecasts across the different nodes of the hierarchy in a direct and automated way, without requiring that all forecasts need to be used.
We benchmark the performance of the proposed ML HF approach against various stateoftheart methods on two datasets coming from the tourism and retail industries, using ARIMAlike approaches to estimate the base forecasts. Our results suggest that ML reconciliation approaches are superior to existing, linear ones, both in terms of accuracy and bias.
The remainder of the paper is organized as follows. Section 2 describes the most popular HF methods found in the literature, while Section 3 presents the proposed ML reconciliation approach. Section 4 presents the two datasets used for the empirical evaluation of the proposed method, describes the experimental setup, and discusses our results and findings. Section 5 concludes the paper.
2 Hierarchical forecasting models
In this section, we discuss the TD, BU, and COM methods as three wellestablished HF approaches. The following indices, notations, and parameters are used throughout this paper:
=  total number of series in the hierarchy; 
=  total number of the series for level ; 
=  total number of the levels in hierarchy; 
=  number of the observations in each series; 
=  the observation of series ; 
=  stepahead independent base forecast of series based on observations; 
=  the vector of all observations at level ; 
=  stepahead forecast at level ; 
=  a column vector including all observations; 
=  stepahead independent base forecast of all series based on observations; 
=  the final reconciled forecasts of all series. 
The hierarchical time series can be expressed as , where is a summing matrix of order that aggregates the bottom level series. Consider the hierarchy shown in Figure 1 that shows a three level hierarchy.
The hierarchy shown in Figure 1 can be expressed as:
The various HF approaches can then be expressed with a unified structure , where is a matrix of order which elements depend on the type of the HF method used [fpp2].
2.1 Bottomup
The BU approach considers just the base forecasts produced on the bottom level of the hierarchy and sums them appropriately to obtain forecasts at higher levels. In this approach, , where is an null matrix. Thus, extracts the bottom level forecasts and combines them with the summing matrix to generate the final forecasts of the hierarchy.
2.2 Topdown
In the TD approach, base forecasts are produced just at the top level of the hierarchy and are then disaggregated to the lower levels with an appropriate factor. Gross and Sohl [gross1990disaggregation] investigated 21 different disaggregation methods for the TD approach. They concluded that Equations (1) and (2) indicate two disaggregation methods that give reasonable forecasts at the bottom level.
(1)  
(2) 
In Equation (1), each proportion reflects the average of the historical proportions of the bottom level series , while in Equation (2), each proportion reflects the average of the historical value of the bottom level series relative to the average value of the total aggregate . These proportions can be used to form the vector so that . In this regard, disaggregates the forecast at the top level to the lower levels.
Athanasopoulos et al. [athanasopoulos2009hierarchical] proposed the TD forecasted proportions (TDFP) approach that disaggregates the top level forecasts based on the forecasted proportions of lower level series rather than the historical proportions. According to this method,
for , where is the step ahead forecast of the series that corresponds to the node which is levels above , and is the sum of the step ahead forecasts below node that corresponds directly to the node . These will form the vector so that . Similarly to the rest of the TD methods, TDFP approach will generate biased forecasts even if the base forecasts are unbiased [athanasopoulos2009hierarchical].
We use the td function in hts package to implement the TD method [htspackage] that utilizes the proportions of Equation (1).
2.3 Linear combination
The COM method uses a completely different approach for HF. This approach was developed over a series of papers by Hyndman et al. [hyndman2011optimal, hyndman2016fast] and Wickramasuriya et al [wickramasuriya2019optimal]. Let the step reconciled forecasts be given by
They showed that the covariance matrix of the stepahead reconciled forecast errors is given by
where
is the variancecovariance matrix of the
step ahead base forecast errors. Moreover, they demonstrate that if the base forecasts are unbiased, these reconciled forecasts will also be unbiased if and only if . Finally, they showed that the matrix that minimizes the trace of such that is given bywhere is the generalized inverse of . Hence, the optimal unbiased forecasts from a linear reconciliation are given by
This is known as MinT (minimum trace) reconciliation . Note that it can be considered a generalized least squares estimator for a corresponding regression problem.
The challenge is the estimation of , especially for very large hierarchies, and different approximate estimates have been proposed.

Set . This is known as the OLS estimator [hyndman2011optimal]. This ignores the scale of each series and the relationships between the series.

Set where is the sample covariance matrix of the onestep ahead base forecast errors given by . This is known as the WLS estimator [hyndman2016fast]. It ignores the relationships between the series, but takes account of the scale of each series.

Set where is a unit vector. This method ignores the relationships between the series and assumes that the bottom level series have errors with equal variances [athanasopoulos2017forecasting]. Because this method only depends on the structure of the hierarchy, it is known as structural scaling. It is particularly useful when residuals are not available.

Set . This is a shrinkage estimator with diagonal target , and shrinkage parameter
where is the element of the one step ahead insample correlation matrix [schafer2005shrinkage]. The main advantage of this method is that it considers the relationships between the series.
In this paper, we consider the latter two methods: structural scaling (COMSS) and shrinkage (COMSHR). We use the MinT function in hts package in R to implement the COMSHR and COMSS methods [htspackage].
3 ML hierarchical forecasting approach
In this section we present a ML reconciliation approach that exploits the potential of decision treebased algorithms. It is designed to deal with the limitations of the existing HF methods, highlighted in Section 1, and allow for the base forecasts produced for the complete hierarchy to be effectively combined in a nonlinear fashion to yield coherent forecasts. We consider the Random Forest (RF) and the XGBoost (XGB) algorithms as they are intuitively easy to understand and have shown promising results in time series forecasting, especially in applications where information is extracted from large time series data sets in order for the relationships of the series to be learned and the overall forecasting accuracy to be enhanced [MONTEROMANSO202086].
3.1 Proposed algorithm
The proposed ML reconciliation algorithm uses time series crossvalidation [fpp2] to measure the outofsample forecast accuracy, which is then used in an optimization procedure to tune the ML method.
Assume a time series hierarchy that consists of levels and series, with each series of length . We summarize our approach as follows.

The series are split into a series of training sets and test sets, with each training set comprising the first observations (for ) and the corresponding test set comprising only the observations at time .

A forecasting model is fitted to each series in each training set and onestepahead forecasts are produced for each test set.

A separate ML model (either a RF or XGB) is built for predicting each of the bottom series of the hierarchy. The training set of each model consists of observations and variables. The first variables (used as predictors or inputs) are the onestep ahead forecasts produced during the rolling origin process for the
series of the hierarchy, and the last variable (the response or target) is the actual value of the bottomlevel series at the corresponding times. The loss function of the models is the sum of squared errors, and the hyperparameters of the ML models are determined either arbitrarily by the user or through an optimization procedure.

The complete sample of the series (all observations) is used to produce stepahead base forecasts for the series of the hierarchy, where is the forecasting horizon of interest.

The models that were built in Step 3 are used to provide forecasts for the series of the bottom level of the hierarchy, using the base forecasts produced in Step 4 as input. This process is repeated times, each time for a different forecasting horizon.

The forecasts produced by the ML models in step 5 are aggregated (summed) so that reconciled forecasts are produced for the rest of the hierarchical levels.
The proposed approach is demonstrated in Figure 2 for the case of a simple, twolevel hierarchy with one parent and two child nodes.
As seen, the proposed ML HF approach provides coherent forecasts by exploiting the information available at all hierarchical levels, following the approach used by the COM methods. The main difference between the COM methods and the proposed framework is that the base forecasts are not all necessarily used for deriving the reconciled ones, being selectively handled by the ML models built for this purpose. Moreover, even if all base forecasts are to be used by the ML models, the combination of the base forecasts will be done in a nonlinear fashion with the weights not being directly related to the structure of the hierarchy or the residuals reported for the individual series/levels. Most importantly, since the ML models are trained with the explicit objective of minimizing the forecasting error for each series of the bottom level of the hierarchy, the reconciliation performed may lead to more accurate forecasts when compared to standard HF methods. Finally, note that each bottomlevel series is predicted by a separate ML model, meaning that the reconciliation performed is highly specialized and, therefore, able to adapt to different patterns in each series.
Observe that the proposed approach is easy to generalize and is model independent. For example, a Neural Network (NN) or a Support Vector Machine (SVM) could be used to replace RF and XGB. Similarly, any model of choice could be used for producing the base forecasts being reconciled. Moreover, the onestepahead forecasts produced for constructing the training sets of the ML models, could be easily expanded to
stepahead ones to better simulate the forecasting task under examination. Our proposal of using onestepahead forecasts is mainly based on the fact that by increasing the forecasting horizon of the base models, the observations of the training set, i.e., , are accordingly decreased. Thus, when dealing with low frequency data (e.g., monthly or quarterly) or relatively short time series, such an approach could significantly reduce the potential of the developed ML models.The following subsections present the ML models used in this study for reconciliation. This includes information about the way the models were trained, optimized, and implemented.
3.2 XGBoost
XGB is an ensembling method based on decision trees that uses a gradient boosting approach to generate unbiased and robust forecasts
[chen2016xgboost]. This algorithm has been applied to various forecasting and classification problems with promising results [nielsen2016tree, chatzis2018forecasting, demolli2019wind].XGB uses a number of hyperparameters that play a critical role in generating the final forecasts. There are various techniques for optimizing these hyperparameters, including grid search, sequential model based algorithm configuration, and Bayesian optimization. Since grid search is computationally expensive, we tuned the hyperparameters using a Bayesian optimization approach with 10fold crossvalidation [snoek2012practical]. The Bayesian approach starts with a priori values for the hyperparameters and then iteratively updates to identify the best values for the investigated problem. We considered intervals with different lower and upper bounds for each hyperparameter. We set the prior values of the learning rate, eta, between (0.01, 0.05), sub sample size prior values between (0.3, 1), colsamplebytree prior values between (0.3, 1), minchild weight between (0, 10), maxdepth between (2, 10), and gamma
between (0, 5). The values for the maximum number of boosting iterations rolled over the range of 50 and 200. We used a linear regression model as the objective function and chose the best results by minimizing the root mean squared error (RMSE). We tuned the hyperparameters using the
rBAyesianOptimization package for R [rBayesianopt]. Due to the notable variations present in the “Sales” dataset (see subsection 4.1), the optimization of the hyperparameters was performed for each hierarchical time series and set of childparents separately, while for the “Tourism” dataset we optimized the hyperparameters for the hierarchy as a whole in order to accelerate the whole process.3.3 Random Forest
RF is an ensembling method that combines a large number of decision trees and takes an average of the trees to generate the final forecast [breiman2002manual]. Each tree of the RF is based on a random draw from the training dataset. The trees are built using the bootstrapping method and splitting criteria in nodes. We consider the weighted variance as the splitting criteria which minimizes the sum of squared errors. This method has been successfully applied to numerous forecasting problems such as energy [smyl2019machine, cheng2012random] and sales [jimenez2017multi] forecasting.
RF is fast to run and it only has a few hyperparameters: the number of trees (ntree), node size (nodesize), and number of variables sampled at each split (mtry). Of these, the number of constructed trees is the most important feature to be tuned. The problem of optimally selecting the number of trees has been intensively discussed in the literature [probst2019hyperparameters, breiman2002manual, barman2014prediction, breiman1984classification]. The main problem is that although creating more trees is computationally more demanding, it does not guarantee a better forecast. This is because each tree is trained individually and so by adding more trees, overfitting may occur [breiman2002manual]
. On the other hand, since the individual trees constructed do not have the learning capacity of XGB, RF is typically more robust to outliers and overfitting, especially for limited samples of data
[FRIEDMAN2002367]. The hyperparameter mtry denotes the number of variables sampled at each split and controls the randomness of the model. The nodesize hyperparameter determines the minimal number of observations in a terminal node to be split.We used grid search, an automated method that explores a set of different hyperparameters values and computes the error on the validation set, with 10fold crossvalidation to find the optimal number of trees by minimizing the RMSE. We ran ntree on a sequence of intervals of width 5 ranging from 50 to 150 and fitted the best model using the randomForest package for R [randomForest]. We tuned the other two hyperparameters, mtry and nodesize, using the mlr package in R [mlrpackage]. The lower and upper bound values for mtry were set between 2 and 6, respectively. The lower and upper bound values for nodesize were set on 10 and 50, respectively. Once again, the optimization of the hyperparameters for the case of the “Sales” dataset was performed for each hierarchical time series and set of childparents separately, while for the “Tourism” dataset it was done for the hierarchy as a whole.
4 Empirical results
4.1 Data
In order to empirically evaluate the performance of the proposed ML HF methods, we consider two different datasets, to be named the “Tourism” and the “Sales” dataset.
Hierarchical level  Number of series 

Level 0  1 
Level 1  7 
Level 2  27 
Level 3  76 
Total  111 
The Tourism dataset involves a fourlevel hierarchy with the domestic visitor nights of Australia, measured in millions, across 76 regions (level 3). The regions can be grouped into 27 zones (level 2), which can be further aggregated into 7 states and territories (level 1), as well as into the total domestic visitor nights (level 0). Thus, based on these geographic divisions, the Tourism dataset comprises 111 time series. The series have a duration of 240 months (20 years) and span from January 1998 to December 2017.
Table 1 summarizes the number of series present per hierarchical level, while Figure 3 visualizes some indicative series from each level. Observe that the trend and seasonal patterns differ among the series, especially for different states and territories. Moreover, the trend of some series (e.g., A, C, and E) changes through the years, in contrast to others (e.g. G) that remain quite constant. This indicates that considering a dynamic, nonlinear HF method instead of a linear one, could prove beneficial for predicting these series. For more information about the dataset, please see [KOURENTZES2019393].
The Sales dataset involves 55 threelevel hierarchies that present the sales of the cereal and breakfast products sold by a company in various locations of Australia, along with the corresponding prices. Each hierarchy refers to a different product, with the first level (level 0) representing the total sales of the manufacturer, the second level (level 1) the way these sales are disaggregated into two retailers, and the third level (level 2) the sales reported for each of the six distribution centers (DCs) used by each retailer. Thus, the Sales dataset includes 55 hierarchies, each consisting of 15 time series. The series have a duration of 120 weeks and span from September 2016 to December 2018.
Table 2 summarizes the number of series present per hierarchical level, while Figure 4 visualizes the series of each level for one indicative product of the dataset. Note that, although the retailers display different demand patterns, DCs have a similar pattern to their retailers in terms of promotions. Moreover, different entities of the hierarchy may experience different levels of uplifts in sales. Thus, a ML HF method, which effectively captures sales variations, could be more effective for reconciling the base forecasts of these series than a traditional, linear one.
Hierarchical level  Number of series 

Level 0  1 
Level 1  2 
Level 2  12 
Total  15 
4.2 Experimental setup
Given that each dataset displays its own particular characteristics, we consider a different forecasting model in each case for producing the base forecasts. More specifically, for the Tourism dataset we consider ARIMA models, as implemented in the forecast package for R [hyndmanforpack], while for the Sales dataset we use regression models with ARIMA errors (RegARIMA) using price as an regressor variable. In this respect, the effect of the promotions, which typically increase sales and drive major changes in the underlying demand behaviour [Nikolopoulos2015], is effectively taken into account. We should note that ETS [Hyndman2002b] and Theta [assimakopoulos2000theta] were also tested for the case of the Tourism dataset, providing similar insights to the ones reported for ARIMA. Thus, for reasons of brevity, and in order for the baseline models used in both cases to be similar in nature, we proceed by reporting the results only for the ARIMA models.
We use the ARIMA models for producing 12stepahead forecasts for the monthly series of the Tourism dataset and the RegARIMA models for producing 8stepahead forecasts for the weekly series of the Sales dataset. Then, we utilize the BU, TD (using the disaggregation method depicted in Equation 1), COMSS, and COMSHR methods for reconciling the base forecasts of these models, as well as the MLRF and MLXGB HF methods proposed in this study. The first four methods are used as benchmarks as they involve both standard and stateoftheart HF approaches.
We evaluate the forecasting performance of the HF methods both in terms of accuracy (absolute deviation of the forecasts around the true values) and bias (consistent distance observed between the forecasts and the true values), using the mean absolute scaled error (MASE) [Hyndman2006a], as well as the root mean squared scaled error (RMSSE) and absolute mean scaled error (AMSE). The measures can be calculated as
MASE  
RMSSE  
AMSE 
where and are the observation and the forecast for period , is the sample size (observations used for training the forecasting model), is the length of the seasonal period, and is the forecasting horizon. In all cases, lower values indicate better forecasts.
Note that all measures are scaleindependent, meaning that averaging across series is possible. Moreover, given that the median minimizes the sum of the absolute errors [Neil199044138], while the mean minimizes the sum of the squares of these errors [KOLASSA2016788], MASE and RMSSE are appropriate for evaluating the accuracy of the examined HF method in approximating the median and the mean of the future values, respectively. Accordingly, AMSE is appropriate for measuring the bias of the reconciled forecasts.
In order for our results to represent reality as close as possible and approximate the actual performance of the examined HF methods in a longterm run, we consider the rollingorigin evaluation approach [TASHMAN2000437]. According to this approach, the first observations of each series are used for producing stepahead forecasts, with the following observations used for evaluating them. Then, the forecasting origin is increased by one and new forecasts are produced from the updated origin, this time using observations for training the forecasting model and the following ones for testing. This process is repeated times, until there are no observations left for evaluating the forecasts, i.e., while .
Given that the length and the frequency of the series of the two datasets differ, we consider a different, yet indicative implementation of the rollingorigin evaluation approach per case. Specifically, in the Tourism dataset we begin to produce forecasts at the end of the year of the dataset ( months) and use the remaining 6 years for testing, thus performing a total of evaluations. Accordingly, in the Sales dataset, we start producing forecasts at the end of the year of the dataset ( weeks) and use the remaining 60 weeks of each sales time series for testing, thus performing a total of evaluations. The overall performance of the HF methods in each dataset is computed by averaging the scores reported across all and evaluation periods.
Note that in order for the ML HF methods to be effectively trained to derive accurate reconciled forecasts when provided with a set of base forecasts, we require a dataset that includes an adequate sample of past, actual time series values (target variables) and the corresponding base forecasts produced for these periods by the forecasting model (regressor variables). In order to obtain such a dataset, we produce multiple onestepahead forecasts in a rollingorigin fashion, starting from an initial point, , and finishing at the forecast origin considered in each repetition of the rollingorigin evaluation approach, as described in Section 3 (steps 1–4). We set equal to and for the Tourism and Sales dataset, respectively, so that a reasonable amount of full seasonal periods is available for producing the base forecasts to be used for training the ML HF methods. In this regard, in the first evaluation performed for the Tourism dataset, a sample of records will be available for training the ML HF methods, with the records becoming in the last evaluation. Accordingly, a sample of records will be available in the first evaluation of the Sales dataset for each of the 55 hierarchies, with their length reaching records in the last evaluation.
4.3 Results
Tables 3 and 4 summarize the performance of the HF methods considered in this study in terms of accuracy (MASE and RMSSE) and bias (AMSE) for the Tourism and the Sales dataset, respectively. The first column of each table indicates the HF methods considered, while the rest of the columns present the performance of the method for each aggregation level separately, as well as across all levels (average of measure values reported for Level 0 to Level ). All levels are weighted equally since we do not focus on a particular decisionmaking problem, aimed at a specific hierarchical level.
Method  Level 0  Level 1  Level 2  Level 3  Average 

MASE  
BU  1.184  1.050  0.923  0.857  1.003 
TD  1.048  1.297  1.124  0.978  1.112 
COMSS  1.094  0.968  0.887  0.843  0.948 
COMSHR  1.047  0.956  0.872  0.824  0.925 
MLRF  1.045  0.964  0.859  0.812  0.920 
MLXGB  1.043  0.965  0.859  0.812  0.920 
RMSSE  
BU  1.439  1.314  1.186  1.124  1.266 
TD  1.238  1.630  1.460  1.297  1.406 
COMSS  1.308  1.225  1.137  1.109  1.195 
COMSHR  1.265  1.214  1.120  1.086  1.171 
MLRF  1.261  1.208  1.104  1.066  1.159 
MLXGB  1.255  1.208  1.101  1.064  1.157 
AMSE  
BU  1.066  0.639  0.443  0.350  0.624 
TD  0.845  0.594  0.404  0.341  0.546 
COMSS  0.988  0.611  0.426  0.349  0.593 
COMSHR  0.935  0.599  0.417  0.337  0.572 
MLRF  0.780  0.526  0.366  0.319  0.498 
MLXGB  0.779  0.526  0.365  0.317  0.497 
Method  Level 0  Level 1  Level 2  Average 

MASE  
BU  0.491  0.516  0.540  0.516 
TD  0.522  0.785  0.971  0.759 
COMSS  0.497  0.529  0.629  0.552 
COMSHR  0.495  0.520  0.542  0.519 
MLRF  0.433  0.449  0.465  0.449 
MLXGB  0.447  0.447  0.473  0.455 
RMSSE  
BU  0.653  0.710  0.741  0.701 
TD  0.684  1.118  1.314  1.039 
COMSS  0.655  0.720  0.844  0.739 
COMSHR  0.654  0.713  0.742  0.703 
MLRF  0.625  0.675  0.703  0.668 
MLXGB  0.654  0.719  0.759  0.711 
AMSE  
BU  0.300  0.323  0.330  0.318 
TD  0.320  0.423  0.627  0.456 
COMSS  0.301  0.327  0.372  0.334 
COMSHR  0.305  0.327  0.331  0.321 
MLRF  0.290  0.312  0.308  0.303 
MLXGB  0.308  0.301  0.299  0.303 
Before proceeding with the evaluation of the results, we highlight that two of the benchmarks considered, namely COMSS and COMSHR, are considered stateoftheart in the field of hierarchical time series forecasting as they have been proven to significantly improve the base forecasts provided to them as input. Moreover, although much more simplistic in nature, the BU and TD methods are highly competitive and, in some applications, difficult benchmarks to beat. Thus, further improving the performance of HF based on ML approaches becomes a promising, yet challenging task.
The results for the Tourism data presented in Table 3 show that, on average, MLXGB is the most accurate HF method in terms of MASE, doing slightly better than MLRF. Specifically, MLXGB is 17% and 8% more accurate on average when compared to the TD and the BU method, respectively, being also 3% and 0.6% more precise than the COMSS and COMSHR methods. The same stands in general for the individual hierarchical levels, with the exceptions of the TD method for which results are comparable to the ones of the ML methods at the top level, as well as the COMSHR that displays the best performance at Level 1. This can be partially explained by reviewing the particularities of these two methods: TD builds on the base forecasts produced for the top level of the hierarchy, thus omitting any information provided from the rest of the series, while COMSHR combines the forecasts from all the series of the hierarchy in a linear way. As a result, if the fully aggregated series is predictable enough, the TD method will provide accurate results at Level 0. Accordingly, if the information required for accurately predicting a level in the middle of the hierarchy, like Level 1, is not complicated and sufficiently provided by the neighboring levels (Levels 0 and 2), COMSS and COMSHR will result in improved forecasting accuracy. Note however that COMSS is always outperformed by COMSHR due to the latter incorporating information about the correlation structure of the series.
The results are similar in terms of RMSSE, with just two differences worth reporting. First, in this case, the performance of the TD method at the top level is not only comparable to the one of the ML methods, but actually better by about 2%. However, TD continues to produce significantly less accurate results for the rest of the hierarchical levels. Second, at Level 1, COMSHR is no longer the best performing method, being outperformed by both ML approach to a similar extent. Thus, we conclude that ML approaches are generally better in approximating the mean of the future values of the series than their median, a phenomenon which can be possibly attributed to the way these methods learn: Both RF and XGB are optimized by minimizing the sum of squared errors produced. Thus, these models learn how to properly approximate the mean and not necessarily the median of the series being predicted.
This last argument may also explain the bias reported for each method, as measured by AMSE. Given that mean squared error can be decomposed into a bias and an accuracy term [Taieb7064712], both MLRF and MLXGB are indirectly trained so that they minimize the bias of the reconciled forecasts. In this regard, in contrast to MASE and RMSSE, the ML HF methods always provide significantly less biased forecasts than the benchmarks, especially for the higher levels of the hierarchy. In particular, MLXGB, the best performing method in terms of AMSE, is on average 15% better than the benchmark methods, being also less biased by 8% for the bottom level, 18% for the top level, and 14% for the two levels in the middle. Observe also that the worst performing method in terms of AMSE is the BU, with the TD doing also much better than the COMSS and COMSHR methods. This indicates that when the base forecasts produced at the bottom level of the hierarchy are biased, reconciliation methods should put more emphasis on the top level where forecasts are more likely to be robust and, therefore, less biased. On the other hand, the fact that ML methods, which exploit the base forecasts produced at all hierarchical levels in a similar fashion to COMSS and COMSHR, are still able to provide unbiased results, highlights the potential of dynamic, nonlinear reconciliation approaches.
The results are even more encouraging for the case of the Sales dataset. According to MASE, the MLRF method is considered the most accurate approach on average, being also the best HF method for all levels apart from Level 1. However, even at Level 1, MLRF is outperformed only to a small extent by MLXGB, which is also a ML approach. Moreover, in this dataset, the differences reported between the ML methods and the benchmarks are always significant, with the improvements being around 14% at the top level, 21% at the middle, and 26% at the bottom. In other words, not only the improvements reported for the Sales dataset are greater than those of the Tourism dataset, but can be also observed across all levels, becoming more significant for the lower levels of the hierarchy. This could be due to the major differences reported in the Sales dataset between the retailers, meaning that combining the base forecasts from the complete hierarchy to produce forecasts for a particular series is inappropriate when the series do not share the same patterns, at least to some extent. On the contrary, the results highlight that when a ML HF method is utilized for this purpose, being able to selectively combine the base forecasts, the information from the complete hierarchy could still be relevant. This conclusion is also supported by observing that COMSS and COMSHR do similarly in terms of MASE to the relatively much simpler BU method.
The results of MASE are in a general agreement to those reported for the case of the RMSSE. Again, MLRF, the best performing ML HF method, outperforms all of the benchmarks, with the improvements reported being higher for the lower levels of the hierarchy (6%, 10%, and 20% on average for Levels 0,1 and 2, respectively). However, MLXGB manages to provide slightly less biased results than MLRF at all levels apart from the top one. Again, the differences between the two ML approaches are small, with their performance being also much better than that of the benchmarks. For example, according to AMSE, MLRF is on average 6% less biased than the benchmarks at the top level, 14% at the middle level, and 18% at the bottom level.
Figure 5 provides further insight about the relative accuracy of the HF methods for 55 hierarchical sales data at different levels in terms of MASE, AMSE, and RMSSE. It demonstrates that both ML HF methods generate more accurate forecasts than their counterparts with MLRF being the top performing method in terms of MASE. While MLXGB has performed more consistently across different series at Levels 0 and 1, the MLRF method has generated more consistent forecasts at Level 2. This notion also holds for RMSSE. This might be due to different features of time series, such as seasonality, entropy, and the trend of the base time series [abolghasemi2019machine, abolghasemi2020demand]. Finally, it is apparent that the RF and XGB methods performed quite similarly in terms of AMSE.
By summarizing the results of both datasets, the following conclusions can be drawn:

ML HF methods, combining the base forecasts for the complete hierarchy in a nonlinear way, provide on average significantly better forecasts than existing methods, both in terms of accuracy and bias. Whether these results can be generalized to other data sets remains to be seen.

The information of other series at different levels of the hierarchy (crosssectional information) can be useful in forecasting the future values of a series regardless of the reconciliation methodology used.

The expected improvements from using a ML HF method instead of the existing linear methods are higher when the series in the hierarchy are characterized by different patterns. The greater the differences between the series, at all levels, the higher the potential of using a selective, nonlinear reconciliation approach.

When a particular ML framework is considered for reconciling the base forecasts produced for an hierarchy, the algorithm selected for determining the combination weights of these forecasts does not greatly affect the final results. Note however that this conclusion is drawn based on an experiment where two decision tree algorithms are used, both utilizing the same framework for performing the reconciliation. Thus, further investigation is required to confirm that this is also the case (i) when different types of ML algorithms (e.g., NNs and SVMs) are used for combining the base forecasts and (ii) different reconciliation frameworks are utilized.
5 Conclusions
The challenge of hierarchical forecast reconciliation, to produce coherent forecasts across the various hierarchical levels, has so far been tackled with various linear approaches. Early solutions focused on producing forecasts at a single aggregation level with the forecasts of the other levels being derived by aggregation/disaggregation, thus essentially avoiding the incoherence issue. Current stateoftheart solutions linearly combine the forecasts across all levels. In this study, we have proposed the use of nonlinear combination approaches to achieve reconciliation using ML methods.
Our results suggest that, on average, the proposed hierarchical reconciliation approaches based on ML perform well in practice, both in terms of forecast accuracy and bias. Not only can they outperform simple hierarchical approaches, such as BU and TD, but they also show improvements over robust stateoftheart linear combination approaches. The good performance of HF ML is more evident on the Sales dataset compared to the Tourism dataset, possibly due to the importance of the bottomlevel information where our algorithm primarily focuses. The promising empirical results are driven from the design of our approach. HF ML not only results in consistent forecasts across aggregation levels, as is the case with more traditional hierarchical approaches, but also explicitly takes into account the outofsample forecast accuracy. The derived combination weights of the HF ML approach provide a selective pooling of the forecasts across the various aggregation levels.
It would be interesting to explore if our insights stand for other ML methods and other data sets. In the following, we discuss additional, alternative paths for future investigation.

In this study, we focused on the case of crosssectional hierarchical structures. However, forecasting with hierarchies has been extended to the temporal and the crosstemporal dimensions [athanasopoulos2017forecasting, KOURENTZES2019393, Spiliotis2020hj]. Future work could apply our approach to these dimensions as well and benchmark against standard, linear reconciliation approaches. One challenge, though, has to do with the size of the task, especially in the crosstemporal domain, and the ability to apply the ML approaches described here when the time series are not long enough.

Our approach focused on optimizing the performance of the bottomlevel series, building models in total. Further research could generalize this optimization objective to other (or multiple) levels of aggregation.

We showed that HF ML approaches perform better in the case of point forecasting. Future research could extend our results to include evaluation on the forecast uncertainty [Jeon2019xo].

Our empirical study included two datasets, sampled in monthly and weekly frequencies. We expect that the performance improvements observed by applying nonlinear approaches to hierarchical forecast reconciliation would be amplified for higher data frequencies (e.g., daily or hourly).

Despite the improved forecasting performance, the computational complexity should be also examined. It is important to tradeoff any gains on the forecast accuracy against additional computational cost/resources [Gilliland2019xl, Nikolopoulos2018wa].
Comments
There are no comments yet.