A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc

06/16/2020 ∙ by Sajjad Taghiyeh, et al. ∙ NC State University

Hierarchical time series demands exist in many industries and are often associated with product, time-frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second-phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.




1 Introduction

The efficient movement of goods in a supply chain depends on the ability to accurately forecast product demands. Oftentimes, these forecasts must be produced within a hierarchical structure, which may represent geographic regions, product families (Hyndman et al., 2011), or time periods (Athanasopoulos et al., 2017). The value of hierarchical forecasting is that it can provide decision support information to different stakeholders across various organizational functions and managerial levels (Fliedner and Mabert, 1992). For instance, hierarchical forecasts can be used to improve market positioning, inventory planning, facility layouts, or the efficiency of operational logistics and transportation networks, leading to increased customer satisfaction and lower costs. Muir (1979) explained how hierarchical forecasting can increase overall forecast accuracy, noting that combining data from two or more homogeneous items can produce a stabilizing effect.

Two dominant approaches exist in the hierarchical forecasting literature: top-down and bottom-up. In the top-down approach, a forecast is initially created at an aggregated level, then disaggregated to lower levels of the hierarchy (Boylan, 2010). A common disaggregation approach involves proration (Fliedner, 1999; Strijbosch et al., 2007) in which the aggregate demand forecast is multiplied by the ratio of corresponding demand to aggregate demand, resulting in an estimate for the next lower-level in the hierarchy. In the bottom-up approach, the steps are reversed. The lowest level of the hierarchy is forecasted first (i.e. SKU level), then aggregated to estimate higher levels in the hierarchy (Hyndman et al., 2011). A third approach called middle-out combines aspects of top-down and bottom-up. In middle-out, the forecast is performed at a middle level of the hierarchy, then aggregated up and disaggregated down to estimate the forecasts for other levels of the hierarchy.

With respect to top-down forecasting, Gross and Sohl (1990) argued that two simple disaggregation techniques can be effective: “average historical proportions” and “proportions of the historical averages” (Athanasopoulos et al., 2009). In the “average historical proportions” approach, the share of each lower-level time series in the aggregated series is calculated in every period and then averaged across all periods, i.e. a simple unweighted average share is used. In the “proportions of the historical averages” approach, a volume-weighted share across all time periods is employed. The authors also mention that for the “average historical proportions” approach, one is not required to use only the historical proportions, but can utilize forecasted proportions instead. Promising results were derived using this approach, and it is offered in some forecasting software (Boylan, 2010).
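The two disaggregation rules can be illustrated with a small sketch. The function names and the two-SKU history below are hypothetical, and the code assumes the parent series is never zero; it is meant only to show how the two shares differ, not to reproduce any cited implementation.

```python
def average_historical_proportions(child, parent):
    """Unweighted mean of the per-period shares child[t] / parent[t]."""
    return sum(c / p for c, p in zip(child, parent)) / len(parent)


def proportion_of_historical_averages(child, parent):
    """Volume-weighted share: mean(child) / mean(parent) == sum(child) / sum(parent)."""
    return sum(child) / sum(parent)


# Hypothetical history for two SKUs; the parent (brand) series is their sum.
sku_a = [10.0, 30.0, 20.0]
sku_b = [10.0, 10.0, 60.0]
brand = [a + b for a, b in zip(sku_a, sku_b)]

share_ahp = average_historical_proportions(sku_a, brand)
share_pha = proportion_of_historical_averages(sku_a, brand)

# Top-down proration: multiply a (made-up) brand-level forecast by each share.
brand_forecast = 100.0
sku_a_forecast_ahp = brand_forecast * share_ahp
sku_a_forecast_pha = brand_forecast * share_pha
```

In this toy history the unweighted share of the first SKU is larger than its volume-weighted share, because the period with the highest brand volume is also the period in which that SKU's share is lowest.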

In practice, there may be multiple features in the input data (e.g. date, time, holidays, seasonal discounts, etc.) that can be leveraged to improve forecast accuracy within supply chains. To the best of our knowledge, most of the research in the supply chain hierarchical forecasting literature is univariate. We found no documented multivariate hierarchical forecasting models that employ lower level forecasts as features in parent level modeling. In this research, we employ multiple features (in contrast to univariate time series data) and child level (SKUs) and parent level (brand) forecasts in a hierarchical supply chain model to improve forecast accuracy at the parent level in the hierarchy. We utilize Machine Learning (ML) techniques including Multi-Layer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB), and Extreme Gradient Boosting (XGB) to build competing forecasting models. The rest of the paper is organized as follows: In section 2, we briefly review the various existing hierarchical forecasting methods and the aggregation approaches in use. In section 3, we present the details of our proposed Multi-Phase Hierarchical forecasting approach (MPH). We then describe numerical experiments that demonstrate the performance of MPH using sales data from MonarchFx Inc. (a logistics solutions provider) that is representative of a mid-tier supply chain customer in section 4. We summarize our conclusions and discuss practical aspects of our work in section 5.

2 Literature Review

The evidence in the literature on the relative performance of top-down and bottom-up forecasting approaches is mixed (Syntetos et al., 2016). Some authors found top-down approaches to be superior (Barnea and Lakonishok, 1980; Gross and Sohl, 1990; Fliedner, 1999), while others found bottom-up methods to be more accurate (Dangerfield and Morris, 1992; Gordon et al., 1997). These conflicting results occur because the performance of each approach depends on the nature of demand for the products involved. To illustrate, consider a three-level product hierarchy, with product sales at the lowest level, group sales at the middle level, and category sales at the top level. Since group sales are determined by the sum of product sales (given the additive nature of the hierarchy), and the sum of group sales determines category sales, the underlying demand process is transformed at different levels of the hierarchy. When aggregating, a significant loss of information can occur, which tends to render bottom-up forecasting more favorable. Conversely, in the top-down approach, benefits can occur due to random noise cancellation (Fliedner, 1999). Because the performance of each approach depends on the demand generation process within the data, a wide range of conflicting results appears in the literature. Thus, depending on the demand process and parameter settings, one approach may perform better than the other in different contexts (Widiarta et al., 2007, 2009).

An early study comparing top-down and bottom-up approaches was conducted by Grunfeld and Griliches (1960), who found the top-down approach more accurate, with the explanation that disaggregated data is more susceptible to error. Fogarty and Hoffmann (1983) and Narasimhan et al. (1995) derived similar conclusions in their work. Conversely, Orcutt et al. (1968) considered the loss of information in a top-down approach to be substantial, leading to the conclusion that the bottom-up approach is superior. Shlifer and Wolff (1979) identified conditions on the hierarchy’s structure and forecast horizon under which they concluded that the bottom-up approach is favorable. The robustness and bias of both approaches were investigated in Schwarzkopf et al. (1988). The authors concluded that the bottom-up approach is more favorable unless the data at the bottom of the hierarchy is unreliable or missing.

A significant characteristic of the underlying demand process involves the dependencies between demand produced at each level, which can be a reason for the performance differences between top-down and bottom-up approaches (Chen and Boylan, 2007).

Sbrana and Silvestrini (2013) summarize the arguments that are often made against top-down approaches. First, they state that a high (or low) variance at one level in a hierarchy may be indicative of high (or low) variance at other levels. In such cases, allocating measures of variance from higher levels to lower levels in a hierarchy may yield better results. Second, since different products may be classified in different segments, the aggregation of data will lead to a loss of information, making the top-down approach less appealing.

On the other hand, there are examples in the supply chain forecasting literature where the authors favor the top-down approach. In Boylan (2010), the author found that aggregated data can lead to more accurate sales forecasts than individual-level forecasts when dealing with policy changes (e.g. a change in pack sizes). In such cases, common disaggregation techniques (“average historical proportions” and “proportions of the historical averages”) may not be useful, and judgmental estimates are required in disaggregation methods to handle such changes in policy.

One method to overcome these drawbacks involves analysis of the conditions in which each approach produces superior forecasting accuracy. In Widiarta et al. (2008), the top-down and bottom-up approaches are compared in the context of production planning. Their goal was to estimate requirements at the SKU level. The aggregate demand series was assumed to have correlated sub-aggregate components, each of which was assumed to follow a first-order univariate moving average (stationary) process correlated over time. They concluded that both methods have nearly identical performance. Later, Widiarta et al. (2009) investigated the relative effectiveness of bottom-up and top-down approaches to forecast demand at the aggregate level rather than the SKU level. They concluded that when all sub-aggregate components of the time series follow a first-order univariate moving average process with identical coefficients of the serial correlation term, the relative performance of the top-down and bottom-up approaches is similar. Additionally, different coefficients of the serial correlation term among sub-aggregate components were examined in a simulation study. The result was that the differences in performance are relatively insignificant when there are small or moderate correlations between the sub-aggregate components. Sbrana and Silvestrini (2013) found that when moving average parameters are not identical, the performance of top-down and bottom-up approaches is similar.

More recently, Rostami-Tabar et al. (2015) analyzed, theoretically and by means of simulation (using theoretically generated data), the relative performance of top-down and bottom-up forecasting methods for both aggregate and SKU level demand. The latter was assumed to follow a non-stationary ARIMA(0,1,1) demand process, with exponential smoothing (which is optimal for this demand process) as the forecasting method. An important finding was that the forecast accuracy improvements achieved by bottom-up and top-down methods for non-stationary demands are higher than those associated with stationary cases. The theoretical findings were validated through empirical analysis on data from a European superstore.

A limitation observed in this work is that the generation of forecasts is dominated by the time series at a single level of aggregation (the point at which forecasts are created). To overcome this issue, a regression-based approach was introduced by Hyndman et al. (2011). In their approach, they estimated the time series at multiple hierarchy levels and then optimized this combination using linear regression. This approach sought to derive the benefits of an ensemble of bottom-up and top-down approaches, employing a linear combination of both. Their method demonstrated a significant improvement in forecast accuracy compared to the traditional approaches. This improvement was believed to be a function of employing a combination of forecasts that reduced the variance of forecast error (Timmermann, 2006; Barrow and Kourentzes, 2016). Hyndman et al. (2011) conclude that their proposed combination method is “optimal”, and compared to all combination forecasts, leads to the least variance. Their work is inspired by earlier research in economics focusing on revising measurements of macro-economic indicators (Zellner and Tobias, 2000; Espasa et al., 2002; Hubrich, 2005). Other research focuses on using different sources to combine forecasts, e.g. utilizing different available information provided by human experts (Budescu and Chen, 2014; Lamberson and Page, 2012). Additionally, the combination of forecasts may reduce model specification and estimation uncertainty (Kourentzes et al., 2014). In a later work, Hyndman et al. (2014) demonstrate the extendibility of their combination approach for hierarchical forecasting to non-hierarchical time series, and time series with partial hierarchical structure. They also proposed a solution to the scalability problem that existed in their previous paper (Hyndman et al., 2011), using a linear model structure for more efficient coefficient estimation.
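To make the regression-based combination idea concrete, the following is a minimal sketch of its ordinary least squares form: independent forecasts for every series are regressed on a summing matrix, and the fitted values are forecasts that add up across the hierarchy. The hierarchy (one parent, three children), the forecast values, and the variable names are all made up for illustration, and NumPy is assumed to be available; the published method may differ in details such as the weighting of errors.

```python
import numpy as np

# Summing matrix for a hypothetical hierarchy: one parent over three children.
# Rows are ordered [parent, child 1, child 2, child 3]; columns are the children.
S = np.vstack([np.ones((1, 3)), np.eye(3)])

# Independent ("base") forecasts for every series, in the same row order as S.
# The numbers are invented and deliberately incoherent: the children sum to 95,
# while the parent forecast is 100.
base = np.array([100.0, 40.0, 30.0, 25.0])

# Ordinary least squares regression of the base forecasts on S.
bottom, *_ = np.linalg.lstsq(S, base, rcond=None)

# The fitted values are coherent: the parent equals the sum of the children.
reconciled = S @ bottom
```

In this example the gap of 5 units between the parent forecast and the sum of the child forecasts is spread evenly over the four series: each child forecast rises by 1.25 and the parent forecast falls by 1.25.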

In Pennings and van Dalen (2017), the authors utilize all the series in a hierarchy, in contrast to a top-down or bottom-up approach. They incorporate a Kalman filter and state space model to capture the dependencies between products (e.g. product substitution, product complementarity). Using a multivariate state space model, one is able to estimate the hierarchical time series efficiently using a Kalman filter as a prediction error decomposition tool (Durbin and Koopman, 2012). In this manner, multiple methods for forecasting hierarchical time series exist (Hyndman and Khandakar, 2008; Snyder et al., 2012). In their approach, forecasts for the aggregate level are derived by summing the forecasts of product sales at the base level. The Kalman filter is then used to track the forecast error of individual series at each level of the hierarchy back to the associated states. In this manner, the forecast leverages the information from all series. The authors conclude that their approach is superior to the traditional top-down and bottom-up approaches since it incorporates information from all levels of the hierarchy.

Our work builds on the research by Hyndman et al. (2011), and Pennings and van Dalen (2017) (discussed previously), in which they combine information at all levels of hierarchy to improve forecasting accuracy. However, these authors only employ univariate data as their input, and do not leverage multiple features. Our main contribution in this paper is to propose a novel approach which 1) utilizes forecasts at lower levels to improve forecasts at higher levels, 2) uses multivariate data at each level of the hierarchy instead of univariate data, which is more commonly seen in the literature, and 3) leverages machine learning models. The latter component is, to the best of our knowledge, a novel application in the supply chain forecasting literature. To achieve our goal, we propose an MPH approach which is discussed in the following section.

3 Multi-Phase Hierarchical Forecasting Approach

Our goal is to find a small loss value, $\mathcal{L}$, at the parent level of the hierarchy, by optimizing:

$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(y_i, f(x_i; W)\big)$$

where $W$ is the matrix of the weights, $x_i$ is the vector of the inputs from the $i$-th instance, $y_i$ is the dependent variable, e.g., demand (sales) values, and $f$ is the output function defined by the forecasting model. The well-known Mean Absolute Error (MAE) is used as our loss function, $\mathcal{L}$, in which the average of the absolute differences between the actual and estimated demands is calculated.

To achieve a higher level of accuracy at the parent level of the hierarchy, we utilize an MPH approach. In the first phase, we forecast both child level (SKU) and parent level (brand) demands using several machine learning approaches. Then, for each individual time series, we select the most promising forecast method in terms of MAE, where MAE is calculated using cross-validation. In the second phase, we combine the forecasts at the child level and parent level and use them as inputs for the multi-feature forecasting approach.

3.1 Overview of Forecasting Methods

Conventional parametric forecasting techniques include ARIMA, GARCH, and transfer function models (Box et al., 2015; Shumway and Stoffer, 2011). Moreover, Taylor (2000) forecasts the demand for a given number of time steps ahead using a normal distribution. However, in situations where demand values are volatile and correlated over time, this model does not perform well. One way to overcome this issue is to use a class of algorithms called “universal approximators”. This class of algorithms is based on machine learning techniques and is able to approximate any function to an arbitrary accuracy. These approximators can learn any function of past and future data, and therefore other forecasting models can be considered as a subset of the functions which they are able to learn. Machine Learning (ML) techniques, such as Multi-Layer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB), and eXtreme Gradient Boosting (XGB) are some of these universal approximators, which can be used to learn any function and have many applications in practice (Belgiu and Drăguţ, 2016; Mei et al., 2014; Rahmati et al., 2016; De’Ath, 2007; Moisen et al., 2006; Chen and Guestrin, 2016; Cigizoglu, 2004; Hippert et al., 2001; Deo et al., 2018).

Supply chain forecasting is a field which generally involves very “noisy” data, so it is important to control for noise and learn the true underlying demand patterns which are likely to be repeated in the future. The universal approximators discussed earlier have two desirable features which make them suitable for the supply chain forecasting problem while dealing with noise. The first is their capability to learn any arbitrary function, and the second is the capability to control the learning process.

Since we want to exploit additional information provided by multiple input features, we are faced with a multi-dimensional input data vector. The traditional parametric forecasting models such as ARIMA are not able to integrate multi-dimensional inputs, thus we exploit the ability of universal approximators to take multi-dimensional inputs and utilize them in our forecasting model. The details of the MPH approach are explained in the following subsection.

3.2 MPH Algorithm

Phase I:

  • Step 0: Choose forecasting model types which support multi-feature inputs, e.g. MLP, RF, GB, XGB, etc. Suppose we have chosen $m$ forecasting approaches. Set $i = 1$.

  • Step 1: Use the $i$-th forecasting approach to forecast parent level demand (brand demand) and child level demands (SKU demand).

  • Step 2: Optimize the hyperparameters of the $i$-th forecasting method using a search approach, e.g. Bayesian optimization method, grid search, successive halving.

  • Step 3: Set $i = i + 1$. If $i > m$, go to step 4. Otherwise, go to step 1.

  • Step 4: Using the outputs of the previous steps, record the best forecasting approach and the associated outputs for demands at all levels.

Phase II:

  • Step 5: Append the recorded outputs of step 4 to input data of the parent level, as additional features.

  • Step 6: Repeat steps 1 and 2 once more, using the new input data with additional features. The only modification is to only forecast the parent level.

  • Step 7: Choose the best forecasting output among the forecasting methods used in step 6.
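The two phases can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn and NumPy are available, uses only two candidate models with fixed settings (the hyperparameter search of Step 2 is omitted), and runs on synthetic data. The helper names `best_cv_forecast` and `mph` are hypothetical, and appending out-of-fold forecasts in Phase II is one possible reading of Step 5.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def best_cv_forecast(X, y, models, cv):
    """Return (name, MAE, out-of-fold forecasts) for the lowest-MAE candidate."""
    best = None
    for name, model in models.items():
        pred = cross_val_predict(model, X, y, cv=cv)
        mae = float(np.mean(np.abs(y - pred)))
        if best is None or mae < best[1]:
            best = (name, mae, pred)
    return best


def mph(X, children, parent, models, n_splits=5, seed=0):
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Phase I (Steps 1-4): pick the best candidate for each child and the parent.
    phase1 = [best_cv_forecast(X, y, models, cv) for y in children + [parent]]
    # Phase II (Step 5): append the recorded forecasts as extra parent features.
    X2 = np.column_stack([X] + [pred for _, _, pred in phase1])
    # Phase II (Steps 6-7): re-estimate the parent models and keep the best one.
    name, mae, _ = best_cv_forecast(X2, parent, models, cv)
    return {
        "phase1_parent_mae": phase1[-1][1],
        "phase2_model": name,
        "phase2_mae": mae,
        "n_features_phase2": X2.shape[1],
    }


# Synthetic data: promotion flag, holiday flag, day of week; three SKUs.
rng = np.random.default_rng(0)
n_days = 120
X = np.column_stack([
    rng.integers(0, 2, n_days),
    rng.integers(0, 2, n_days),
    np.arange(n_days) % 7,
]).astype(float)
children = [50 + 20 * X[:, 0] + 5 * X[:, 2] + rng.normal(0, 3, n_days)
            for _ in range(3)]
parent = np.sum(children, axis=0)

models = {
    "RF": RandomForestRegressor(n_estimators=30, random_state=0),
    "GB": GradientBoostingRegressor(n_estimators=30, random_state=0),
}
result = mph(X, children, parent, models)
```

The returned dictionary reports the Phase I and Phase II parent-level MAE, so the effect of the added forecast features can be inspected on whatever data is supplied.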

Figure 1: Phase I of MPH forecasting model

Figure 1 provides an overview of Phase I for this procedure, in which two forecasting models were chosen as base forecasting approaches. Model A can be a tree-based forecasting model, e.g. RF, GB, or XGB, and model B can represent an exploration-capable model, e.g. artificial neural network models such as MLP. As depicted in figure 1, we have a two-level hierarchical structure with 1 parent and n children. In Phase I of the model, we use the selected models (i.e. models A and B) to forecast demands at both parent and child levels. Since we are dealing with universal approximators, they have several hyperparameters to which model accuracy is very sensitive. Hence, one needs an approach to optimize the hyperparameters of the forecasting models, which is described in the following subsection.

3.3 Hyperparameter Optimization

There are several approaches in the literature that address hyperparameter optimization in machine learning (Maclaurin et al., 2015; Feurer et al., 2015; Li et al., 2017; Bergstra et al., 2013). In this paper, we use the HyperOpt algorithm proposed by Bergstra et al. (2013), and combine it with the successive halving approach (Jamieson and Talwalkar, 2016) to obtain a more efficient search. In the following, a summary of the HyperOpt algorithm is provided and the details of the proposed hyperparameter optimization algorithm are explained.

3.3.1 HyperOpt:

HyperOpt is a module proposed by Bergstra et al. (2013), and is focused on intelligently searching through the hyperparameter space. One approach is to use the Tree-structured Parzen Estimator (TPE) algorithm (Bergstra et al., 2011), in which the search space is explored in an intelligent way, while the parameter values are narrowed down to the best estimated parameters. In contrast to grid search, in which the hyperparameter values must be pre-determined and the increment steps fixed, HyperOpt is a guided random search and has been shown to work efficiently (Bergstra et al., 2013). Hence, it serves as a good candidate to tune and optimize hyperparameters for universal approximators, as adopted in this paper.

3.3.2 Proposed Hyperparameter Optimization Algorithm:

  • Initializing the sample space by HyperOpt. Suppose that we choose to start with $n$ parameter settings to search more rigorously among them. In our modified approach, we use HyperOpt to search intelligently through the search space, and we store the first $n$ parameter settings that are used by HyperOpt. Note that each HyperOpt iteration is only performed on one set of train/test data. Now we use the generated parameter settings as an input for a more rigorous search by successive halving (Jamieson and Talwalkar, 2016).

  • We follow the idea of successive halving proposed by Jamieson and Talwalkar (2016). Using the $n$ parameter settings generated by HyperOpt, the well-known k-fold algorithm (Kohavi and others, 1995) is used to evaluate each parameter setting for a fixed amount of time/budget, e.g. $T$. Then, we select the top-performing half of the parameter settings, and again, we evaluate them via k-fold for time $2T$. This procedure is repeated until the search space is reduced to a single setting or the designated budget is exhausted.
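The halving loop described in the second bullet can be sketched as below. This is an illustrative sketch only: plain random sampling stands in for the parameter settings that HyperOpt would propose, and the hypothetical `evaluate(config, budget)` callback stands in for a k-fold evaluation run under a time budget. The toy loss function is invented so that the example runs without any forecasting model.

```python
import random


def successive_halving(configs, evaluate, initial_budget=1,
                       max_total_budget=float("inf")):
    """Keep the better half of the configurations, doubling the budget each round.

    evaluate(config, budget) must return a loss (lower is better).
    Returns (surviving configuration, total budget spent).
    """
    survivors = list(configs)
    budget = initial_budget
    spent = 0
    while len(survivors) > 1:
        cost = budget * len(survivors)
        if spent + cost > max_total_budget:
            break  # designated budget exhausted
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        spent += cost
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    # After at least one round, survivors are ordered from best to worst.
    return survivors[0], spent


# Stand-in for the initial screening: eight randomly sampled settings
# (for example, a learning rate between 0 and 1).
sampler = random.Random(0)
configs = [sampler.uniform(0.0, 1.0) for _ in range(8)]


def toy_loss(config, budget):
    # Invented loss: best near 0.3, with an error term that shrinks
    # as more budget is spent on the evaluation.
    return abs(config - 0.3) + 1.0 / budget


best, spent = successive_halving(configs, toy_loss)
```

With eight starting settings and an initial budget of 1, the loop runs three rounds (8, 4, and 2 settings at budgets 1, 2, and 4), so each round costs the same total budget while the surviving settings are evaluated more thoroughly.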

Figure 2: Phase II of MPH forecasting model

The idea behind the above algorithm is quite intuitive. Initially, HyperOpt is used as a screening procedure on the search region, expending only a small budget of processing time. After the initial candidates (parameter settings) are selected, successive halving is utilized for a more rigorous evaluation. This procedure spends the computational budget more efficiently by focusing on the parts of the search region which have more potential.

In the second phase we add the best performing forecasts as additional features to the input data of the parent (See Figure 2). Next, a parent level forecasting model is re-estimated using the new input and then the hyperparameter optimization process is conducted. After identifying the best parameter settings, we select the best performing forecast as our final forecasting model.

4 Numerical Experiment

The MPH forecasting algorithm was implemented on sales data provided by MonarchFx Inc., which consists of 935 days of data for ten Stock Keeping Units (SKUs) and aggregated data which represents total brand sales. This data is representative of one of MonarchFx’s mid-tier supply chain customers. In addition to the historical sales data, the input also contained additional features including:

  • Promotion: a binary variable indicating if a promotion was present.

  • Holiday: a binary variable indicating holiday periods.

  • Day of the week: seven dummy variables corresponding to day of the week.

  • Date: in the format day/month/year

Each of these factors may increase the predictive power of forecasting models, both independently as well as in combinations.

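To measure model accuracy, the Mean Absolute Error (MAE) is used:

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|$$

where $y_t$ corresponds to the actual sales and $\hat{y}_t$ to the forecasted sales on day $t$, and $n$ is the number of days evaluated.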

The well-known k-fold cross-validation method (Kohavi and others, 1995) with k=5 is used to test each forecasting model and measure the forecast accuracy. The parameter k refers to the number of groups into which the input data is divided. We chose this method because it provides a less biased estimate of model performance than a single train/test split of the data. In this procedure, the data is first randomly shuffled and divided into k different groups. Each group in turn is then selected as the test dataset, and the remaining data is used as the train set. The forecasting model is trained on the train set and the accuracy is measured on the test set. This procedure is repeated for every group, and the average MAE across the k train/test splits is reported as the final MAE.
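The evaluation procedure can be written out in a few lines of plain Python. The sketch below is illustrative only: the function names are hypothetical, the group-mean "model" is a deliberately trivial stand-in for the machine learning models used in the paper, and the data is invented.

```python
import random


def mae(actual, forecast):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def k_fold_mae(xs, ys, fit, k=5, seed=0):
    """Average test MAE over k shuffled train/test splits.

    fit(train_x, train_y) must return a function mapping one input to a forecast.
    """
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint groups
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(mae([ys[i] for i in test_idx],
                          [predict(xs[i]) for i in test_idx]))
    return sum(scores) / k


def fit_group_mean(train_x, train_y):
    """Toy model: forecast the mean of the training targets sharing the same input."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(x, []).append(y)
    means = {x: sum(v) / len(v) for x, v in groups.items()}
    overall = sum(train_y) / len(train_y)
    return lambda x: means.get(x, overall)


# Invented data: a promotion flag that lifts sales from 100 to 150.
xs = [i % 2 for i in range(20)]
ys = [100.0 + 50.0 * x for x in xs]
cv_mae = k_fold_mae(xs, ys, fit_group_mean, k=5)
```

Because the invented sales depend only on the promotion flag, the toy model forecasts every held-out day exactly and the cross-validated MAE is zero; with real data the same loop would return the average error over the five test groups.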

Multi-Layer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB), and Extreme Gradient Boosting (XGB) are the forecasting models selected for the experiments, due to their popularity in the machine learning forecasting literature (Mei et al., 2014; De’Ath, 2007; Cigizoglu, 2004; Deo et al., 2018; Zieba et al., 2016). Using these four forecasting models, the MPH algorithm, in conjunction with the hyperparameter optimization method (explained in section 3.3), is implemented on the data, and the forecasting error at the parent level is compared to the top-down and bottom-up approaches.

Tables 1 and 2 show the results of Phase I of the algorithm for the lower level and parent level of the hierarchy, respectively. Table 1 contains 10 rows, one for each SKU. MAE is reported for each of the four forecasting methods (after performing k-fold cross-validation), and the forecasting method with the minimum MAE is selected for phase II.

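| Series | MLP | RF | GB | XGB | Best | Range | Min MAE |
|---|---|---|---|---|---|---|---|
| 1 | 370 | 339 | 366 | 350 | RF | 31 | 339 |
| 2 | 404 | 381 | 405 | 388 | RF | 24 | 381 |
| 3 | 607 | 557 | 609 | 588 | RF | 52 | 557 |
| 4 | 681 | 684 | 725 | 708 | MLP | 44 | 681 |
| 5 | 364 | 343 | 389 | 360 | RF | 46 | 343 |
| 6 | 408 | 385 | 397 | 405 | RF | 23 | 385 |
| 7 | 676 | 691 | 732 | 709 | MLP | 56 | 676 |
| 8 | 446 | 421 | 449 | 451 | RF | 30 | 421 |
| 9 | 537 | 537 | 550 | 537 | MLP | 13 | 537 |
| 10 | 395 | 363 | 385 | 375 | RF | 32 | 363 |

Table 1: Phase I child-level results (MAE)

| MLP | RF | GB | XGB | Best | Range | Min MAE |
|---|---|---|---|---|---|---|
| 3972 | 3182 | 3068 | 3118 | GB | 904 | 3068 |

Table 2: Phase I parent-level results (MAE)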

The results of phase I are added as additional features to the input data at the parent level. As phase II of the algorithm specifies, the MLP, RF, GB, and XGB models are estimated again using the new input data. Table 3 reports the MAE (after performing k-fold cross-validation) at the end of phase II for each of the forecasting models. The minimum MAE is selected as the final MAE at the parent level, which is 303.

| MLP | RF | GB | XGB | Best | Range | Min MAE |
|---|---|---|---|---|---|---|
| 445 | 610 | 528 | 303 | XGB | 307 | 303 |

Table 3: Phase II results (MAE)

The final MAE of MPH algorithm is compared to the MAE of top-down and bottom-up approach, in tables 4 and 5, respectively.

| Top-down MAE | MPH MAE | Improvement |
|---|---|---|
| 3068 | 303 | 90% |

Table 4: Top-down vs MPH MAE

| Bottom-up MAE | MPH MAE | Improvement |
|---|---|---|
| 1672 | 303 | 82% |

Table 5: Bottom-up vs MPH MAE

As the final MAE results suggest (tables 4 and 5), the MPH algorithm achieves improvements of 90% and 82% over the top-down and bottom-up approaches, respectively. These outcomes demonstrate the advantages of MPH in substantially improving forecasting accuracy. The reason lies in the fact that the information at the child level is leveraged to improve forecasting accuracy at the parent level, which was previously ignored in both top-down and bottom-up approaches.
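The reported percentages are consistent with the relative reduction in MAE. The short check below assumes the improvement is computed as (baseline MAE − MPH MAE) / baseline MAE; the paper does not state the formula explicitly, so this definition and the function name are assumptions.

```python
def improvement(baseline_mae, new_mae):
    """Relative MAE reduction of the new method over the baseline."""
    return (baseline_mae - new_mae) / baseline_mae


top_down = improvement(3068, 303)   # about 0.901
bottom_up = improvement(1672, 303)  # about 0.819
```

Rounded to whole percentages, these values match the 90% and 82% figures in tables 4 and 5.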

To show the accuracy improvement obtained by using the MPH algorithm, we compare our results to the output of the machine learning models that we used as the basis of MPH. The results are shown in table 6. As the results suggest, MPH yields at least a 90% improvement in forecast accuracy over each of these popular machine learning models applied on its own.

| Machine learning model | MAE | MAE from MPH | Improvement |
|---|---|---|---|
| MLP | 3972 | 303 | 92% |
| RF | 3182 | 303 | 90% |
| GB | 3068 | 303 | 90% |
| XGB | 3118 | 303 | 90% |

Table 6: Comparing results of MPH to forecasts from machine learning models

For the sake of completeness, we also compare the results of our algorithm to traditional time series forecasting methods, namely naïve forecasting, moving average, simple exponential smoothing, Holt’s linear trend, Holt-Winters’ additive method, ARIMA, theta, and ARIMAX. Tables 7 and 8 compare the results of these time series forecasting methods with the phase I and phase II output of the MPH algorithm, respectively.

| Forecasting method | MAE | MAE from MPH (Phase I) | Improvement |
|---|---|---|---|
| Naïve forecasting | 24974 | 3068 | 88% |
| Moving average | 20647 | 3068 | 85% |
| Simple exponential smoothing | 10120 | 3068 | 70% |
| Holt’s linear trend | 18681 | 3068 | 84% |
| Holt-Winters’ additive method | 12076 | 3068 | 75% |
| ARIMA | 3979 | 3068 | 23% |
| Theta | 19743 | 3068 | 84% |
| ARIMAX | 3364 | 3068 | 9% |

Table 7: Comparing phase I results of MPH to traditional time series forecasting methods

| Forecasting method | MAE | MAE from MPH (Phase II) | Improvement |
|---|---|---|---|
| Naïve forecasting | 24974 | 303 | 99% |
| Moving average | 20647 | 303 | 99% |
| Simple exponential smoothing | 10120 | 303 | 97% |
| Holt’s linear trend | 18681 | 303 | 98% |
| Holt-Winters’ additive method | 12076 | 303 | 97% |
| ARIMA | 3979 | 303 | 92% |
| Theta | 19743 | 303 | 98% |
| ARIMAX | 3364 | 303 | 91% |

Table 8: Comparing phase II results of MPH to traditional time series forecasting methods

As we can see from the results in tables 7 and 8, the MPH algorithm performs significantly better than traditional time series forecasting methods. The reason behind this improvement is twofold. First, in contrast to traditional forecasting methods, which mostly use univariate time series, we use multiple features as input variables in the MPH algorithm. Second, the MPH algorithm uses information at both levels of the hierarchy (SKU level and brand level), which helps the algorithm to provide significantly more accurate forecasts.

5 Conclusions

In this paper, we develop a novel multi-phase hierarchical approach (MPH) for supply chain forecasting using machine learning techniques supporting multi-feature input data (e.g. MLP, RF, GB, and XGB). In the proposed two-phase model, the information at the child level is leveraged to improve forecasting accuracy at the parent level, by adding the results of the best forecasting model for each child as additional features at the parent level. The MPH algorithm is implemented on sales data provided by MonarchFx Inc., and the results are compared to top-down and bottom-up approaches. The results demonstrate that a considerable improvement can be achieved by utilizing the MPH algorithm (a 90% improvement compared to the top-down approach, and an 82% improvement compared to the bottom-up approach). This improvement is possible because the MPH algorithm leverages information at both the child level and the parent level.

Based on the experience of one of the co-authors who leads the supply chain analytics function at MonarchFx Inc., there are multiple applications possible for our approach. Indeed, the majority of companies employing supply chain forecasting solutions generally apply top-down and bottom-up approaches and use traditional models that only support single feature input data. However, in practice, multiple factors can impact future sales and can be controlled for in this manner to improve forecast accuracy. Using the machine learning forecasting models developed in this paper, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.

There are several possible extensions to this work. One is to apply the MPH algorithm to hierarchies with more than two levels. Another is to incorporate the reconciliation techniques used by Hyndman et al. (2011). A third is to characterize the situations in which each forecasting model performs best at the child level and at the parent level. Researchers may also use the forecasting model selection method developed by Taghiyeh et al. (2020) to select the best forecasting model among existing machine learning models. The noisy optimization method in Taghiyeh and Xu (2016) may also be used to find the parameters for optimal reconciliation across the levels of the hierarchy. To improve the speed of the model, the parallelization method proposed in Rosen et al. (2016) can be utilized. We believe these approaches hold great promise for future work.



  • G. Athanasopoulos, R. A. Ahmed, and R. J. Hyndman (2009) Hierarchical forecasts for Australian domestic tourism. International Journal of Forecasting 25 (1), pp. 146–166. Cited by: §1.
  • G. Athanasopoulos, R. J. Hyndman, N. Kourentzes, and F. Petropoulos (2017) Forecasting with temporal hierarchies. European Journal of Operational Research 262 (1), pp. 60–74. Cited by: §1.
  • A. Barnea and J. Lakonishok (1980) An analysis of the usefulness of disaggregated accounting data for forecasts of corporate performance. Decision Sciences 11 (1), pp. 17–26. Cited by: §2.
  • D. K. Barrow and N. Kourentzes (2016) Distributions of forecasting errors of forecast combinations: implications for inventory management. International Journal of Production Economics 177, pp. 24–33. Cited by: §2.
  • M. Belgiu and L. Drăguţ (2016) Random forest in remote sensing: a review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing 114, pp. 24–31. Cited by: §3.1.
  • J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pp. 2546–2554. Cited by: §3.3.1.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Hyperopt: a Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pp. 13–20. Cited by: §3.3.1, §3.3.
  • G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung (2015) Time series analysis: forecasting and control. John Wiley & Sons. Cited by: §3.1.
  • J. Boylan (2010) Choosing levels of aggregation for supply chain forecasts. Foresight: The International Journal of Applied Forecasting (18), pp. 9–13. Cited by: §1, §1, §2.
  • D. V. Budescu and E. Chen (2014) Identifying expertise to extract the wisdom of crowds. Management Science 61 (2), pp. 267–280. Cited by: §2.
  • H. Chen and J. E. Boylan (2007) Use of individual and group seasonal indices in subaggregate demand forecasting. Journal of the Operational Research Society 58 (12), pp. 1660–1671. Cited by: §2.
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §3.1.
  • H. K. Cigizoglu (2004) Estimation and forecasting of daily suspended sediment data by multi-layer perceptrons. Advances in Water Resources 27 (2), pp. 185–195. Cited by: §3.1, §4.
  • B. J. Dangerfield and J. S. Morris (1992) Top-down or bottom-up: aggregate versus disaggregate extrapolations. International Journal of Forecasting 8 (2), pp. 233–241. Cited by: §2.
  • G. De’Ath (2007) Boosted trees for ecological modeling and prediction. Ecology 88 (1), pp. 243–251. Cited by: §3.1, §4.
  • R. C. Deo, M. A. Ghorbani, S. Samadianfard, T. Maraseni, M. Bilgili, and M. Biazar (2018) Multi-layer perceptron hybrid model integrated with the firefly optimizer algorithm for windspeed prediction of target site using a limited set of neighboring reference station data. Renewable energy 116, pp. 309–323. Cited by: §3.1, §4.
  • J. Durbin and S. J. Koopman (2012) Time series analysis by state space methods. Vol. 38, Oxford University Press. Cited by: §2.
  • A. Espasa, E. Senra, and R. Albacete (2002) Forecasting inflation in the European Monetary Union: a disaggregated approach by countries and by sectors. The European Journal of Finance 8 (4), pp. 402–421. Cited by: §2.
  • M. Feurer, J. T. Springenberg, and F. Hutter (2015) Initializing Bayesian hyperparameter optimization via meta-learning. In AAAI, pp. 1128–1135. Cited by: §3.3.
  • E. B. Fliedner and V. A. Mabert (1992) Constrained forecasting: some implementation guidelines. Decision Sciences 23 (5), pp. 1143–1161. Cited by: §1.
  • G. Fliedner (1999) An investigation of aggregate variable time series forecast strategies with specific subaggregate time series statistical correlation. Computers & Operations Research 26 (10-11), pp. 1133–1149. Cited by: §1, §2.
  • D. W. Fogarty and T. R. Hoffmann (1983) Production and inventory management. Thomson South-Western. Cited by: §2.
  • T. P. Gordon, J. S. Morris, and B. J. Dangerfield (1997) Top-down or bottom-up: which is the best approach to forecasting?. The Journal of Business Forecasting 16 (3), pp. 13. Cited by: §2.
  • C. W. Gross and J. E. Sohl (1990) Disaggregation methods to expedite product line forecasting. Journal of Forecasting 9 (3), pp. 233–254. Cited by: §1, §2.
  • Y. Grunfeld and Z. Griliches (1960) Is aggregation necessarily bad?. The Review of Economics and Statistics, pp. 1–13. Cited by: §2.
  • H. S. Hippert, C. E. Pedreira, and R. C. Souza (2001) Neural networks for short-term load forecasting: a review and evaluation. IEEE Transactions on power systems 16 (1), pp. 44–55. Cited by: §3.1.
  • K. Hubrich (2005) Forecasting euro area inflation: does aggregating forecasts by HICP component improve forecast accuracy?. International Journal of Forecasting 21 (1), pp. 119–136. Cited by: §2.
  • R. Hyndman and Y. Khandakar (2008) Automatic time series forecasting: the forecast package for R. Journal of Statistical Software 26 (3), pp. 1–22. Cited by: §2.
  • R. J. Hyndman, R. A. Ahmed, G. Athanasopoulos, and H. L. Shang (2011) Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis 55 (9), pp. 2579–2589. Cited by: §1, §1, §2, §2, §5.
  • R. J. Hyndman, G. Athanasopoulos, et al. (2014) Optimally reconciling forecasts in a hierarchy. Foresight: The International Journal of Applied Forecasting (35), pp. 42–48. Cited by: §2.
  • K. Jamieson and A. Talwalkar (2016) Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pp. 240–248. Cited by: item 1, item 2, §3.3.
  • R. Kohavi et al. (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, Vol. 14, pp. 1137–1145. Cited by: item 2, §4.
  • N. Kourentzes, D. K. Barrow, and S. F. Crone (2014) Neural network ensemble operators for time series forecasting. Expert Systems with Applications 41 (9), pp. 4235–4244. Cited by: §2.
  • P. Lamberson and S. E. Page (2012) Optimal forecasting groups. Management Science 58 (4), pp. 805–810. Cited by: §2.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18 (1), pp. 6765–6816. Cited by: §3.3.
  • D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. Cited by: §3.3.
  • J. Mei, D. He, R. Harley, T. Habetler, and G. Qu (2014) A random forest method for real-time price forecasting in New York electricity market. In PES General Meeting, Conference & Exposition, 2014 IEEE, pp. 1–5. Cited by: §3.1, §4.
  • G. G. Moisen, E. A. Freeman, J. A. Blackard, T. S. Frescino, N. E. Zimmermann, and T. C. Edwards Jr (2006) Predicting tree species presence and basal area in Utah: a comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecological Modelling 199 (2), pp. 176–187. Cited by: §3.1.
  • J. W. Muir (1979) The pyramid principle. In Proceedings of 22nd Annual Conference, American Production and Inventory Control Society, pp. 105–7. Cited by: §1.
  • S. L. Narasimhan, D. W. McLeavey, and P. Billington (1995) Production planning and inventory control. Prentice Hall Englewood Cliffs. Cited by: §2.
  • G. H. Orcutt, H. W. Watts, and J. B. Edwards (1968) Data aggregation and information loss. The American Economic Review, pp. 773–787. Cited by: §2.
  • C. L. Pennings and J. van Dalen (2017) Integrated hierarchical forecasting. European Journal of Operational Research 263 (2), pp. 412–418. Cited by: §2, §2.
  • O. Rahmati, H. R. Pourghasemi, and A. M. Melesse (2016) Application of GIS-based data driven random forest and maximum entropy models for groundwater potential mapping: a case study at Mehran region, Iran. Catena 137, pp. 360–372. Cited by: §3.1.
  • S. Rosen, P. Salemi, B. Wickham, A. Williams, C. Harvey, E. Catlett, S. Taghiyeh, and J. Xu (2016) Parallel empirical stochastic branch and bound for large-scale discrete optimization via simulation. In 2016 Winter Simulation Conference (WSC), pp. 626–637. Cited by: §5.
  • B. Rostami-Tabar, M. Z. Babai, Y. Ducq, and A. Syntetos (2015) Non-stationary demand forecasting by cross-sectional aggregation. International Journal of Production Economics 170, pp. 297–309. Cited by: §2.
  • G. Sbrana and A. Silvestrini (2013) Forecasting aggregate demand: analytical comparison of top-down and bottom-up approaches in a multivariate exponential smoothing framework. International Journal of Production Economics 146 (1), pp. 185–198. Cited by: §2, §2.
  • A. B. Schwarzkopf, R. J. Tersine, and J. S. Morris (1988) Top-down versus bottom-up forecasting strategies. The International Journal Of Production Research 26 (11), pp. 1833–1843. Cited by: §2.
  • E. Shlifer and R. Wolff (1979) Aggregation and proration in forecasting. Management Science 25 (6), pp. 594–603. Cited by: §2.
  • R. H. Shumway and D. S. Stoffer (2011) Time series regression and exploratory data analysis. In Time series analysis and its applications, pp. 47–82. Cited by: §3.1.
  • R. D. Snyder, J. K. Ord, and A. Beaumont (2012) Forecasting the intermittent demand for slow-moving inventories: a modelling approach. International Journal of Forecasting 28 (2), pp. 485–496. Cited by: §2.
  • L. Strijbosch, R. Heuts, and J. Moors (2007) Hierarchical estimation as a basis for hierarchical forecasting. IMA Journal of Management Mathematics 19 (2), pp. 193–205. Cited by: §1.
  • A. A. Syntetos, Z. Babai, J. E. Boylan, S. Kolassa, and K. Nikolopoulos (2016) Supply chain forecasting: theory, practice, their gap and the future. European Journal of Operational Research 252 (1), pp. 1–26. Cited by: §2.
  • S. Taghiyeh, D. C. Lengacher, and R. B. Handfield (2020) Forecasting model selection using intermediate classification: application to MonarchFx Corporation. Expert Systems with Applications, pp. 113371. Cited by: §5.
  • S. Taghiyeh and J. Xu (2016) A new particle swarm optimization algorithm for noisy optimization problems. Swarm Intelligence 10 (3), pp. 161–192. Cited by: §5.
  • J. W. Taylor (2000) A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting 19 (4), pp. 299–311. Cited by: §3.1.
  • A. Timmermann (2006) Forecast combinations. Handbook of economic forecasting 1, pp. 135–196. Cited by: §2.
  • H. Widiarta, S. Viswanathan, and R. Piplani (2007) On the effectiveness of top-down strategy for forecasting autoregressive demands. Naval Research Logistics (NRL) 54 (2), pp. 176–188. Cited by: §2.
  • H. Widiarta, S. Viswanathan, and R. Piplani (2008) Forecasting item-level demands: an analytical evaluation of top–down versus bottom–up forecasting in a production-planning framework. IMA Journal of Management Mathematics 19 (2), pp. 207–218. Cited by: §2.
  • H. Widiarta, S. Viswanathan, and R. Piplani (2009) Forecasting aggregate demand: an analytical evaluation of top-down versus bottom-up forecasting in a production planning framework. International Journal of Production Economics 118 (1), pp. 87–94. Cited by: §2, §2.
  • A. Zellner and J. Tobias (2000) A note on aggregation, disaggregation and forecasting performance. Journal of Forecasting 19 (5), pp. 457–465. Cited by: §2.
  • M. Zieba, S. K. Tomczak, and J. M. Tomczak (2016) Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications 58, pp. 93–101. Cited by: §4.