1 Introduction
Demand forecasting is an important use case in supply chain, inventory management, and retail. Predicting the sales of the products in a portfolio is crucial for business decisions and for downstream tasks such as inventory optimization and target calculation. One of the major challenges in making such forecasts is accounting for product cannibalization. Product cannibalization occurs when demand for one product in the portfolio rises, for example because a new product has been launched, and this in turn reduces the sales of older products. Because of this interaction between data samples, the total demand across all products remains roughly stable while the demand of individual products within the portfolio varies widely.
Compared with traditional statistical models, machine learning lets us model complex dynamics and use a large number of input variables. Generally, machine learning models optimize a cost function by using the input features to update the model parameters. However, in product cannibalization the demand of a given product is affected by the demand of a different product that is not part of the input feature set. Consequently, off-the-shelf machine learning models such as XGBoost or neural networks cannot capture the interactions between different data samples during training. In this work, we propose a framework that accurately forecasts the sales of old products that are cannibalized by the launch of newer products.
The work presented in [Zara] focuses on product demand forecasting for the distribution team and therefore only tackles one-week-ahead forecasts. Another important work, [Remy], assumes a short sale cycle for a given product and is hence useful for short-term forecasts. We tackle both short-term and long-term cannibalization, with a focus on long-term forecasting for 8 or more weeks. [Zara] also uses a multinomial model, so its output is discrete, i.e. categorical. Our use case involves a continuous variable, since we forecast product sales, which is a real number.
The second major hurdle in this use case is making long-term predictions with limited training data. A standard technique is the Recursive Multistep Forecast [ML_mastery], where the prediction at a given time step is used as an input feature to forecast the next time step. The problem with this approach is that errors propagate when making long-term forecasts.
On the other hand, [Hossein] provides multiple approaches to modelling long-term forecasts. However, these are standard approaches that do not fully cover our use case: they do not take cannibalization into account, and their performance deteriorates severely as a result.
The paper [Carlos] deals with cannibalization due to promotional impact, whereas in our use case we are concerned with cannibalization due to new product launches. In both cases, however, the underlying assumption is a causal relationship between the cannibalized and the cannibalizing product.
Our work improves the forecast accuracy of old products whose sales have been affected by the addition of newer products to the portfolio. In this study, any product more than 4 weeks past its launch is treated as an old product, and any product within 4 weeks of its launch is treated as a new product introduction (NPI) product. The machine learning algorithm we introduce in this work handles the limitations described above.
2 Data Preparation
The dataset is tabular, with n rows and d+1 columns. The rows are divided into a train set, rows 1 to m, and a test set, rows m+1 to n. The input matrix to the model is X, which lies in d-dimensional space. The target variable is Y, a continuous real value corresponding to the number of units sold each week. We train and predict with our three-stage framework for each product category separately and independently. Thus, for our experiments we have 3 datasets corresponding to 3 separate product categories. For each product category, count denotes the number of products sold in a given week; it can vary week over week. A general overview of the dataset is shown in Table 1.











Table 1: General overview of the dataset.
Category    Date    Product  Promotion  Seasonality     sale(t-3)  sale(t-2)  sale(t-1)  sale(t)
Category_A  Date_1  A_1      Promo_1    Season_1        1000       1110       1150       1200
Category_A  Date_2  A_1      Promo_1    No_Seasonality  1110       1150       1200       1500
Category_A  Date_3  A_1      Promo_2    No_Seasonality  1150       1200       1500       1300
Category_A  Date_4  A_1      Promo_4    Season_2        1200       1500       1300       2000
Category_A  Date_5  A_1      No_Promo   Season_3        1500       1300       2000       1650
Category_A  Date_1  A_2      Promo_4    Season_1        2000       4200       4000       5000
Category_A  Date_2  A_2      Promo_1    No_Seasonality  4200       4000       5000       3600
Category_A  Date_3  A_2      Promo_6    No_Seasonality  4000       5000       3600       3900
Category_A  Date_4  A_2      No_Promo   Season_2        5000       3600       3900       4200
Category_A  Date_5  A_2      No_Promo   Season_3        3600       3900       4200       5500
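The three lag columns in the table are a sliding window over each product's weekly sales. As a minimal sketch (the function name `make_lag_rows` and the dictionary layout are illustrative assumptions, not part of the original pipeline), the sales rows for product A_1 can be regenerated from its weekly series as follows:

```python
from typing import Dict, List


def make_lag_rows(sales: List[float], n_lags: int = 3) -> List[Dict[str, float]]:
    """Turn one product's weekly sales into rows of lag features plus the target."""
    rows = []
    for t in range(n_lags, len(sales)):
        # Oldest lag first, matching the column order sale(t-3), sale(t-2), sale(t-1).
        row = {f"sale(t-{k})": sales[t - k] for k in range(n_lags, 0, -1)}
        row["sale(t)"] = sales[t]  # target variable Y
        rows.append(row)
    return rows


# Weekly sales of product A_1, read off the lag columns of the table above.
a1_sales = [1000, 1110, 1150, 1200, 1500, 1300, 2000, 1650]
rows = make_lag_rows(a1_sales)
for row in rows:
    print(row)
```

Each printed row corresponds to one Date_1 to Date_5 line of product A_1; the categorical columns (promotion, seasonality) would be joined on separately.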
Each row gives the number of units of a product sold in the corresponding week, in the column sale(t). Hence, sale(t) is the target variable Y. Every other column in the table is an input feature in X that affects sale(t). Our objective is to accurately predict sale(t) for a given product on any given date. Since this is a forecasting model, the sales of a product in a given week, sale(t), are affected by the sales of the same product in previous weeks. Hence, we include the sales of the previous three weeks of the same product in the input feature set X. For historical data, where sales are available, sale(t-1), sale(t-2) and sale(t-3) can be filled directly. However, when making long-term predictions on future dates, the previous weeks' sales are not available. For example, if historical sales are available up to August 31, 2020, then when making a prediction for September 28 the sales of September 07, September 14 and September 21 are not available. Hence, sale(t-3), sale(t-2) and sale(t-1) will be unfilled in the input matrix X. To overcome this, we borrow an idea from natural language prediction: while predicting on future dates for testing, we use the predictions made on previous dates and append them to the columns in X. In the example above, we would first make predictions for September 07, 14 and 21 and append them to the columns sale(t-3), sale(t-2) and sale(t-1), respectively. This use of the previous weeks' predicted sales as inputs for the next week's prediction is referred to as backpadding. Later, we discuss how this widely used concept in time-series forecasting leads to error propagation over longer prediction horizons.
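Backpadding can be sketched as a loop that slides the lag window forward and fills the newest slot with the model's own prediction. In the sketch below, `recursive_forecast` is an illustrative name and `mean_of_lags` is a hypothetical stand-in for a trained model; neither comes from the original work.

```python
from typing import Callable, List, Sequence


def recursive_forecast(
    history: Sequence[float],
    predict_one: Callable[[List[float]], float],
    horizon: int,
    n_lags: int = 3,
) -> List[float]:
    """Forecast `horizon` weeks ahead, feeding each prediction back in as a lag."""
    if len(history) < n_lags:
        raise ValueError("need at least n_lags observed weeks")
    window = list(history[-n_lags:])  # [sale(t-3), sale(t-2), sale(t-1)]
    forecasts = []
    for _ in range(horizon):
        y_hat = predict_one(window)
        forecasts.append(y_hat)
        # Backpadding: drop the oldest lag and append the new prediction.
        window = window[1:] + [y_hat]
    return forecasts


def mean_of_lags(lags: List[float]) -> float:
    """Hypothetical placeholder model: predicts the average of the lag window."""
    return sum(lags) / len(lags)


history = [1200.0, 1500.0, 1300.0]  # last three observed weeks
forecasts = recursive_forecast(history, mean_of_lags, horizon=4)
print(forecasts)
```

From the second step onward, at least one input to the model is itself a prediction, which is where forecast errors start to feed back into later forecasts.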
3 Motivation
The standard XGBoost model for regression uses the squared error (SE) loss function, given by
$$L_{SE} = \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2 \tag{1}$$
where m is the number of training samples and $\hat{y}_i = f(x_i)$ is the model prediction for row i. X is the m x d input matrix to XGBoost, and Y is the target variable, which takes continuous real values and has dimension m x 1. The XGBoost model is trained on X to approximate the target variable Y by a function f. The usual assumption is that the behaviour of Y can be completely modelled using X. However, in product cannibalization we observe that $y_i$ also depends on the values of $y_j$, where $i \neq j$. Here, i and j are row numbers. For a given target data point $y_i$, $y_j$ cannot be made part of the independent features, for the following reason.
For testing, all the target variables $y_{m+1}$ to $y_n$ are unknown. Hence, for the test dataset between rows m+1 and n, the independent variables would have certain columns empty. That is, for a test target data point $y_i$ on which we want to predict, we cannot have $y_j$ as part of the input matrix, since $y_i$ and $y_j$ are both unknown. Consequently, an XGBoost model based on SE loses important information about cannibalization, and we need an alternative way to incorporate that information into the model. In time series forecasting over a long horizon, we depend on the predictions made for previous time periods; hence, the models are autoregressive. Unfortunately, this leads to error propagation, which becomes more severe as the prediction horizon increases. This holds in our study as well, since we forecast sales 8 to 14 weeks into the future.
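The growth of error with horizon can be illustrated with a toy autoregressive process. This is only a sketch under stated assumptions: the true process and the slightly mis-estimated model are both hypothetical first-order recursions, chosen so the effect is easy to see, and are unrelated to the data in this study.

```python
from typing import List


def recursive_errors(
    a_true: float, a_model: float, c: float, y0: float, horizon: int
) -> List[float]:
    """Absolute forecast error per week when a model's own outputs are fed back.

    The actual series follows y[t+1] = a_true * y[t] + c, while the model
    forecasts with f[t+1] = a_model * f[t] + c, using its previous forecast.
    """
    actual, forecast = y0, y0
    errors = []
    for _ in range(horizon):
        actual = a_true * actual + c
        forecast = a_model * forecast + c  # input is the previous forecast
        errors.append(abs(forecast - actual))
    return errors


errors = recursive_errors(a_true=0.90, a_model=0.95, c=100.0, y0=1000.0, horizon=14)
for week, err in enumerate(errors, start=1):
    print(f"week {week:2d}: abs error = {err:.1f}")
```

A small mismatch in one coefficient produces an error that keeps growing over the 14 weeks, because every step starts from an already biased input.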
To overcome these two challenges, we introduce a sum-constraint-based modelling technique that predicts the target data points $y_{m+1}$ to $y_n$ while simultaneously solving a constraint equation that incorporates the cannibalization information. The dataset available for this work is quite small, with each category containing approximately 3000 records and a large number of independent variables in X. Hence, feed-forward networks and other deep learning frameworks have not been used to construct the framework. XGBoost is an obvious choice for implementing the three-stage framework, mainly because it has superior performance on small datasets and implementing custom objective functions is relatively straightforward.
4 Algorithm
We make the fundamental assumption that the sum of sales over a week for a given category is easily obtainable from domain knowledge and is already known. What remains is to make week-over-week predictions for the individual products in that category. Mathematically, this can be defined as
$$S = y_1 + y_2 + \dots + y_{count} \tag{2}$$
where S is the aggregate sum of sales over all products in that week for the category, which can be easily obtained from domain knowledge; $y_1$ is the weekly sale of product 1, $y_2$ is the weekly sale of product 2, and so on. Here, count is the number of products in the given week in our dataset and can vary week over week.
In a given week, a high-demand product suppresses the weekly sales of the other products, because S is approximately constant in that week. The idea is to exploit this information to make predictions for the cannibalized products. Even though the cannibalization information is not part of the input matrix X, we construct three XGBoost models, each trained on its own objective function. According to their role in our framework they are named as follows: XGBoost1 is the baseline model, XGBoost2 is the constraint model, and XGBoost3 is the fine-tune model, as shown in the flow chart in Section 4.1.
In stage 1, we train XGBoost on samples 1 to m with the SE objective function. We refer to this model as XGBoost1. The SE objective function is given by
$$L_{SE} = \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2 \tag{3}$$
where m is the number of training samples with actual sales, $y_i$ is the weekly sale of an individual product, and $\hat{y}_i$ is the weekly sale predicted by the model for that product. The trained XGBoost1 model is then used to predict the test samples m+1 to n. These predictions are used to update the columns sale(t-3), sale(t-2), sale(t-1) and sale(t). Here, sale(t-3), sale(t-2) and sale(t-1) are part of the input feature set X; this process of using previous weeks' predicted sales to make the current prediction is called backpadding. The column sale(t) is the output label Y that we are predicting.
The predictions made at stage 1 are of poor accuracy, as they do not take into account the cannibalization of sales due to the launch of new products. These predicted values are appended to the label column Y, which initially contained only the m training samples. This updated train dataset of n samples is used as the input to stage 2 of the algorithm, explained in the next paragraph. XGBoost1 also acts as a baseline model that gives the later stages a general guideline for making better predictions.
In stage 2, we train XGBoost on the entire dataset of n samples, using the updated dataset obtained from stage 1. We refer to this model as XGBoost2. Here, data points 1 to m contain actual sales and data points m+1 to n contain the weekly sales predicted by XGBoost1. XGBoost2 is trained on the objective function below,
(4) 
where $y_i$ is the actual weekly sales of each product for i in [1, 2, …, m] and the sales predicted on the test set by the XGBoost1 model for i in [m+1, m+2, …, n], and $\hat{y}_i$ is the weekly sales of each product predicted by the XGBoost2 model. The objective also involves the aggregate sum of sales of all products for a given week. The sum constraint term in equation 4 is given by,
(5) 
where $S_i$ is the aggregate sum of sales for a given category for week i and is assumed to be known. The per-week sum over products in equation 4 is given by,
(6) 
where count is the number of products on sale in the given week i. The XGBoost2 model makes predictions and updates the columns sale(t-3), sale(t-2) and sale(t-1) for records m+1 to n. Here, we do not update the label column Y, i.e. sale(t); we only update the input feature columns of X.
As evident from equation 4, the model in stage 2 is trained on an objective function consisting of two parts: the SE term and the categorical sum constraint. Because the stage 1 predictions are appended to Y, the new prediction tries to stay close to them as per the SE term. However, the categorical sum constraint in equation 4, expanded in equation 5, forces the predictions of all products within the category for the week to increase or decrease together. This ensures that if the products are over-forecast or under-forecast, the predictions self-adjust through the sum constraint term in the objective function.
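To make the interplay of the two terms concrete, the sketch below computes the gradient and diagonal Hessian that a gradient-boosting library would need for one plausible form of such an objective. The exact form is an assumption: a squared penalty on the gap between the known category total $S_w$ and the sum of predictions in week w, weighted by a hypothetical factor `lam`, i.e. $\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_w (S_w - \sum_{i \in w} \hat{y}_i)^2$. The function and argument names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def sum_constraint_grad_hess(
    preds: Sequence[float],
    labels: Sequence[float],
    weeks: Sequence[Hashable],
    category_totals: Dict[Hashable, float],
    lam: float = 1.0,  # hypothetical weight on the sum constraint term
) -> Tuple[List[float], List[float]]:
    """Gradient and diagonal Hessian of the ASSUMED loss
    sum_i (y_i - p_i)^2 + lam * sum_w (S_w - sum_{i in w} p_i)^2."""
    # Sum of the current predictions of all products in each week.
    week_pred_sum: Dict[Hashable, float] = defaultdict(float)
    for p, w in zip(preds, weeks):
        week_pred_sum[w] += p

    grad, hess = [], []
    for p, y, w in zip(preds, labels, weeks):
        gap = week_pred_sum[w] - category_totals[w]  # > 0 means over-forecast
        grad.append(2.0 * (p - y) + 2.0 * lam * gap)
        hess.append(2.0 + 2.0 * lam)
    return grad, hess


# Two products in one week whose labels sum to 1100 while the known total is 1000.
grad, hess = sum_constraint_grad_hess(
    preds=[600.0, 500.0],
    labels=[600.0, 500.0],
    weeks=["w1", "w1"],
    category_totals={"w1": 1000.0},
)
print(grad, hess)
```

In this example the SE part of the gradient is zero, yet both gradients are positive, so a boosting step would lower both predictions toward the category total. In XGBoost, a custom objective is a callable that receives the predictions and the training `DMatrix` and returns the gradient and Hessian arrays, passed to `xgboost.train` through the `obj` argument; wiring this sketch in would additionally require NumPy arrays and a way to carry the week index alongside the data.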
In stage 3, we train XGBoost on the entire dataset of samples 1 to n with the objective function given below, and refer to this model as XGBoost3.
(7) 
where the device_prediction_ratio is obtained from the output of stage 1. It is given by,
(8) 
4.1 Flow Chart
The algorithm consists of three stages; each stage uses the information from the previous stages and is trained on a different objective function. In stage 2, the model increases or decreases the predictions of all products based on the categorical sum. Stage 3 guides the prediction of each product within the category based on equation 7. To do this we use the feature device_prediction_ratio, which indicates how much of the category sum S is contributed by each individual device. The objective function in stage 3, given by equation 7, consists of two terms. The first term is the same as the sum constraint in equation 4. It ensures that the sum of predictions of all devices in the category still adheres to the category actuals obtained from domain knowledge. The second term makes sure that each device's sale prediction within the category increases or decreases based on the information obtained from stage 1. Stage 3 is the final stage of our framework: it ensures that the final predictions for the devices sum up to the categorical sum obtained from domain knowledge, while adjusting the individual device-level predictions according to their input features.
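A device-level share feature of this kind can be sketched as follows. The definition used here is an assumption: each device's stage 1 prediction divided by the sum of stage 1 predictions for the same week. The helper `share_targets`, which scales the known category total by that ratio, is likewise hypothetical and only illustrates how the ratio could steer device-level predictions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def device_prediction_ratio(
    stage1_preds: Sequence[float], weeks: Sequence[Hashable]
) -> List[float]:
    """Share of each device in its week's total stage 1 prediction (assumed form)."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for p, w in zip(stage1_preds, weeks):
        totals[w] += p
    return [p / totals[w] if totals[w] else 0.0 for p, w in zip(stage1_preds, weeks)]


def share_targets(
    ratios: Sequence[float],
    weeks: Sequence[Hashable],
    category_totals: Dict[Hashable, float],
) -> List[float]:
    """Hypothetical per-device target: the device's ratio times the known total."""
    return [r * category_totals[w] for r, w in zip(ratios, weeks)]


stage1_preds = [300.0, 100.0, 200.0, 200.0]
weeks = ["w1", "w1", "w2", "w2"]
ratios = device_prediction_ratio(stage1_preds, weeks)
targets = share_targets(ratios, weeks, {"w1": 1000.0, "w2": 800.0})
print(ratios)
print(targets)
```

By construction the ratios within a week sum to 1, so the per-device targets in each week add up to that week's category total.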
5 Experimentation and Results
Forecasting experiments have been carried out on categories A, B and C. Each category has a set of devices for which forecasting has been done up to 14 weeks ahead.
The experiments have been performed on devices of categories A, B and C once NPI devices have been launched in their respective category.
Table 2 lists the details of the devices and the experiments performed.
The Category column lists the category to which the devices belong, based on their pricing and device features. As discussed for equation 2, the total sales volume of each category is assumed to be known. The objective is to predict the sales of the existing devices in each category.
The Date column gives the starting date of the forecasts made by the models.
The Regular XGBoost Accuracy column is the weighted accuracy over the entire forecasting horizon of 8 or 14 weeks, depending on the category, for predictions made by XGBoost with the SE objective function in equation 3.
Similarly, the ThreeStage XGBoost column is the weighted accuracy over the entire forecasting horizon for the given category, for predictions made by our three-stage XGBoost framework, whose final objective is given by equation 7.
Lags is the number of weeks from the starting date for which forecasting has been carried out.
Existing Devices are the devices for which sales forecasting has been carried out. Sales of these devices are impacted by the launch of new devices in the same category.
New Devices are the devices that cause cannibalization of existing devices. Forecasting for new devices is out of scope for this study and is a good direction for future research.
The train data consists of historical data of about 4 years depending on the experimental setup. We have used weighted accuracy to calculate the performance of the models over different experiments. The weighted accuracy is given by,
(9) 
The important input features to the model are indicators of holiday seasonality, promotional features, device specifications, and other features such as weeks since device launch and pricing.
For the figures in this section, the horizontal axis represents the time horizon over which sales and forecasts are plotted. The vertical axis represents the corresponding actual sales and the forecasts generated by the models; both have been normalized. Figure 1 shows the weekly normalized sales for category A. We see that the three-stage XGBoost forecast is much closer to actual sales than the SE-based XGBoost forecast. We also observe that after week 8 there is a reduction in actual sales as well as in the forecast produced by the three-stage XGBoost model. The SE XGBoost model, however, is unable to capture this drop in sales correctly, so the long-term forecast is captured accurately only by the three-stage model.
We can observe that prediction with our framework yields better results than the existing machine learning model, because it bears a much closer resemblance to the actual trend. We have identified week 3 and week 4 as the Thanksgiving dates, where we expect to see a spike in sales. Similarly, the newer model is able to capture the spike in week 8, which is the Christmas week, as seen in figure 1.
Figures 2 and 3 show the results for category_B. Actual sales of B1 gradually decrease over the 14-week horizon, because the NPI devices B11, B12 and B13 were launched on Aug 17, 2020. However, the XGBoost model based on the SE objective function is unable to capture this, as its forecast increases instead of decreasing. The three-stage XGBoost model captures the dynamics very well, especially over the longer time horizon. More importantly, the three-stage framework outperforms SE XGBoost after the 5th week.
Actual sales and forecasts for devices C1 and C2 belonging to category_C are shown in figures 4 and 5, respectively. We see that overall accuracy is greater with the three-stage algorithm than with the standard XGBoost based on the SE objective function.
We observe that for product C1 both the SE and the three-stage XGBoost frameworks perform poorly in capturing the market dynamics, especially after week 7. Both models significantly under-forecast the sales after week 7 compared to actuals.
For product C2 belonging to category C, we see that the three-stage XGBoost model performs much better than the SE XGBoost model. The three-stage XGBoost forecasts for the C2 product are very close to actuals, whereas the SE XGBoost performs quite poorly.
Overall, the three-stage XGBoost framework achieves higher weighted accuracy than SE-based XGBoost in all of our experiments.
Our model exploits the categorical sales, which are an input to the three-stage framework. The three-stage framework uses the total categorical sale to adjust the sales of individual devices within the category. This helps to overcome the error propagation problem when making long-term forecasts.
Hence, in all the above cases we observe that the proposed three-stage framework consistently outperforms the standard XGBoost model.
The performance of the models has been calculated by using the weighted accuracy given by equation 9.








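Table 2: Details of the experiments and the weighted accuracy of each model.
Category    Date          Regular XGBoost Accuracy  ThreeStage XGBoost Accuracy  Lags
Category_A  02 Nov, 2020  38.65%                    67.09%                       8
Category_B  17 Aug, 2020  44.60%                    51.50%                       14
Category_C  17 Aug, 2020  6.70%                     52.90%                       14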




6 Conclusion
In this work, we have developed an algorithm to improve the sales forecasting accuracy of older devices that are impacted by cannibalization due to the launch of new devices. The second problem we addressed is improving the week-over-week long-term forecasting accuracy for the old devices. To address these two issues we developed a three-stage framework based on the XGBoost algorithm. We compared the three-stage XGBoost-based framework with the regular XGBoost model that uses SE as the objective function. Our experiments show that the proposed three-stage framework performs consistently better for long-term forecasts. We used weighted accuracy as the metric to quantify the performance comparison. From our experiments, we observe a significant increase in overall prediction accuracy for old products, for example from 38.65% with the baseline model to 67.09% with the proposed framework for category A. A good direction for future work is to extend the framework to make accurate forecasts for the newly launched devices that cause the cannibalization of sales of old devices.