1 Introduction
In many practical applications one is concerned with predicting a quantity (target or label) from a set of features
, i.e. one needs to estimate the conditional
of the joint probability density distribution
when the values of the feature variablesare observed. In supervised machine learning, one uses e.g. neural networks like NeuroBayes
[NeuroBayes] to build a model for the conditional, i.e. , whereis a set of parameters learned from the training data and represents all observed knowledge of the joint probability distribution
.Although the predictions obtained from the machine learning model are in general very accurate, the exact path how an individual prediction was calculated is typically not observable in complex ensemble or deep learning models. Being able to explain how individual predictions and decisions were made can be a mandatory legal requirement in many sectors such as finance and insurance and is highly desirable in others such as medicine or retail to build trust in the machine learning model.
In addition, most machine learning algorithms struggle to learn rare events, as they are not representative of the bulk of the data and are often overregularized, even though in practical applications these effects may play a major role.
In order to address both the explainability and handling of rare effects, a novel machine learning algorithm called ”Cyclic Boosting” is proposed, which is able to learn a model efficiently and accurately, while allowing to precisely follow the path how individual predictions were made.
As will become apparent below, Cyclic Boosting can be categorized as a generalized additive model [GAM], where the target
belongs to the family of exponential distributions such as the Poisson, Gaussian, or Bernoulli distribution, together with a suitable link function. As such, the Cyclic Boosting algorithm can be used in the following three scenarios:

Multiplicative regression mode:

Additive regression mode:

Classification mode:
In the following, the multiplicative regression mode is described in detail first and for the other two modes only modifications are highlighted afterwards.
2 Literature Review
Reaching humanlevel performance using machine learning techniques in specific applications has also increased the interest in interpretability of algorithmdriven decisions in recent years. However, due to the opaque nature of most of the complex models it remains largely unexplained how these decisions are reached.
Several approaches exist to address this situation, e.g. Molnar [molnar2019] gives a good overview. In general, one can either use a complex black box model and subsequently apply modelagnostic interpretation tools or build an explainable model.
For black box models, the importance of individual input features can be determined using e.g. Shapley’s gametheoretic approach [Shapeley1953, SHAP] or permutation importance [Breiman2001, eli5]. Google’s WhatIf [GoogleWhatIf] allows to probe the behavior of a machine learning model if certain aspects or inputs are changed. Partial dependency plots [friedman2001] can be used to visualize the relationship between the target and selected features and illustrate if e.g. the dependency is nonlinear or monotonous. Multiple visualization techniques have been developed in particular to understand the way deep neural networks process images, such as partial occlusion [Zeiler2013] or saliency maps [Simonyan2013]. These approaches are undoubtedly invaluable tools to understand the inner workings of black box models as well as the correlation between input features and predictions. However, they cannot address the black box character on a fundamental level and therefore don’t allow for fully explainable models.
Surrogate models can be used to build explainable models out of black box models. For instance, LIME [lime] uses linear models to derive an explainable model which is faithful locally around each prediction. While these predictions are locally explainable, the model itself remains a black box model.
The simplest fully explainable model is the linear or logistic regression where the target is represented by a sum of linear features and a Gaussian noise term
: . Once all coefficientsare determined from data, each prediction can be evaluated in terms of the features and their coefficients. However, this approach suffers from many shortcomings, e.g. all features are assumed to be linear with constant variance, as well as independent from each other. Consequently, this simple approach is rarely sufficient in practical applications.
General Linear Models (GLM) [GLM] extend this approach and are useful for a wider range of models. However, they still retain their linear character. More generally, General Additive Models (GAM) [GAM] replace the linear term with a function which allows for the modeling of nonlinear effects. In general, GAMs are described by , where is called the link function and is some function which operates on the features . In case of GLMs,
is constrained to be linear. While GAMs are not as easily interpretable as a simple linear regression, they retain most of the benefits while allowing to model complex relationships found in concrete application scenarios. Recently, Microsoft released
Interprete [Interprete], which includes a GAM with pairwise interactions for each feature variable [GAM_Microsoft], i.e. .Considering other supervised learning approaches, single decision trees
[Breiman2001]are fully explainable, but are often not sufficiently performant in practical applications. While amending them to ensemble methods by means of e.g. bagging or boosting techniques can improve the performance of treebased methods significantly, it greatly reduces the explainability of the models as well. The same holds for support vector machines (SVM)
[Boser1992]: For a linear kernel, the SVM weights define the hyperplane which separates two classes. In case of low dimensional feature space, this can be used to gain insights into the relative importance of the input variables. However, in case of highdimensional spaces this becomes more difficult and more general nonlinear kernels do not allow for easy interpretation of the SVM decision boundaries. Artificial neural networks are typically considered as black box models due to their nonlinear transfer function modifying the output of individual neurons.
3 The Cyclic Boosting algorithm
3.1 General approach
The main idea behind Cyclic Boosting is that each individual feature from contributes in a specific way to the prediction of the target . If all contributions can be calculated on a granular level, each prediction for a given observation can be transparently interpreted by analyzing how much each feature for the observed values contributes to the prediction.
To achieve the required granularity, each feature is first binned appropriately: Categorical features retain their original categories, whereas continuous features are discretized such that each bin has the same width (equidistant binning) or contains the approximately same number of observations. In the following, bins are denoted by , i.e. bin for feature . During the training of the supervised machine learning model, each feature, containing its various bins, is considered in turn and an appropriate modification to the prediction of the target is calculated. This process is repeated iteratively until a stopping criterion is met, e.g. the maximum number of iterations or no further improvement of an error metric such as the mean absolute deviation (MAD) or mean squared error (MSE).
3.2 Multiplicative regression mode
In the multiplicative regression mode of Cyclic Boosting, the target variable is in the range . The predicted values of the target variable, denoted by , are calculated from given observations of a set of feature variables in the following way.
(1) 
Here, are the model parameters for each feature and bin . For any concrete observation , the index of the bin is determined by the observation of and the subsequent lookup into which bin this observation falls. The global average is calculated from all observed target values taken across the entire training data.
If one assumes that the target variable
is generated as the mean of a Poisson (or more general, negative binomial) distribution and the logarithm
is used as the link function, eqn. 1 can be inferred from the structure of a generalized additive model by applying the inverse link function.The model parameters are determined from the training data according to the following metaalgorithm:

Calculate the global average from all observed across all bins and features .

Initialize the factors

Cyclically iterate through features and calculate in turn for each bin the partial factors and corresponding aggregated factors , where indices (current iteration) and (current or preceding iteration) refer to iterations of full feature cycles as the training of the algorithm progresses:
(2) This means is the factor that is multiplied to in each iteration. Here, is calculated according to eqn. 1 with the current values of the aggregated factors :
(3) To be precise, the determination of for a specific feature employs in the calculation of . For the factors of all other features, the newest available values are used, i.e., depending on the sequence of features in the algorithm, either from the current () or the preceding iteration ().

Quit when stopping criteria are met at the end of a full feature cycle.
This iterative cyclic optimization corresponds to a coordinate descent algorithm [Wright2015] with a boostinglike update of the factors and intrinsically supports the modeling of hierarchical causal dependencies in the data by means of choosing an appropriate feature sequence.
In order to increase robustness of the optimization and, if desired, reduce dependency on the sequence of features, a learning rate can be added to the calculation of the factors in eqn. 2 (with as link function):
(4) 
Here, is chosen as a small value at the beginning of the training () and is then increased after each full feature cycle according to a linear or logistic function until it reaches for the maximal number of iterations, hence as the algorithm converges.
If the values
follow a Poisson distribution, the Cyclic Boosting algorithm corresponds to optimizing
, i.e. , with for all observations in each bin of feature . Since the bins of each feature variable are considered independently of each other, the optimization is performed locally in each bin . This has the benefit that rare events can be learned effectively by the algorithm. While most machine learning algorithms tend to overregularize these effects, especially when they are far away from the bulk of the respective distribution of observed feature variables , choosing a suitable binning allows to treat rare observations separately from the bulk of the distribution of observed feature variables and hence allow accurate predictions even in this case. However, the potentially low numbers of observations in such bins increase the need for regularization methods in order to avoid learning wrong or spurious relationships from data, i.e. reduce the risk of overfitting.Although the cyclic consideration of all variables already accounts for correlations between the different features, the learning of correlations between specific features can be further improved by adding composed features with multidimensional binning, e.g. built out of two or three of the original features. An example for this is described later on in figure 2.
The binned featurewise optimization of the Cyclic Boosting method also enables a natural access to the introduction of sample weights. As an example, this can be used to put more emphasis on the most recent past when predicting a target available as times series data. Such a procedure can help to improve the forecast quality in case of trends or other temporal changes in the data.
Owing to its straightforward structure based on fundamental arithmetic operations, Cyclic Boosting can be trained efficiently on a large amount of data and parallelization of the algorithm is possible without major obstacles.
3.3 Regularization
The factors are iteratively updated according to eqn. 2, where the update rule has the form
. As the Gamma distribution is the maximum entropy probability distribution for a random variable
for which is fixed and greater than zero, the Gamma distribution is assumed as a prior for the distribution of the factors in each bin of feature . Furthermore, the numerator and denominator of eqn. 2 have the form of the maximum likelihood estimator for an i.i.d. random variable following a Poisson (or more general negative binomial) distribution. These considerations motivate the description of the individual contributions, i.e. the factors, to the prediction of a target variableas conjugate distributions, the Gamma distribution being the conjugate prior to the Poisson (or more general negative binomial) likelihood. Eqn.
2 can hence be written as:(5) 
with
(6) 
The numerical values of the parameters of the prior Gamma distribution are chosen such that the median of the Gamma distribution is 1, i.e. , .
3.4 Smoothing
In most realistic applications, the observed data will be noisy and subject to statistical fluctuations, assuming that missing, incomplete or wrong data have already been corrected and common best practices for improving data quality have been observed. Regularizing the factors across bins for each feature will therefore improve the numerical stability of the algorithm training. For categorical features, the factors in each category can be regularized by determining appropriate Bayesian a priori probabilities for each occurrence of the specific category of feature variable . For continuous features, smoothing functions such as splines or a suitable base of orthogonal polynomials can be applied, which is equivalent to applying a lowpass filter to remove highfrequency noise.
It should be noted that the range of the factors need to be transformed from to before these smoothing approaches can be applied. This can be achieved by taking the logarithm of the factors, i.e. . In order to be able to fit a smoothing function to the factors, the uncertainties of each factor in each bin for feature
can be estimated from moment matching of the Gamma distribution to the lognormal distribution, i.e. assuming that the uncertainties follow a Gaussian distribution after the logarithmic transformation has been applied. This means the variance of the Gamma distribution is set equal to the variance of the lognormal distribution:
(7) 
The mean of the lognormal distribution is then substituted by the mean of the Gamma distribution: .
And finally, this leads to the following formula for the uncertainties:
(8) 
After the smoothing of the factors has been performed, the factors are transformed back to the original range (i.e. ) by applying the exponential function as the inverse of the natural logarithm.
3.5 Additive regression mode
In the additive regression mode, with the range of the target variable being , the formulae are modified such that:
(9) 
(10) 
The conjugate distributions for the individual contributions to the prediction, in this case the summands, follow a Gaussian function. Therefore, no transformation is needed before smoothing.
3.6 Classification mode
In the case of (binary) classification, one aims to identify whether a given observation belongs to a certain class or not. Hence the range of the target variable is in , which can be interpreted as the probability that this observation belongs to the class () or doesn’t belong to the class (). In practical applications, a suitable cutoff has to be defined which separates the two cases.
Noting that the odds, i.e. the ratio
, has the range , the same approach as the multiplicative regression mode can be used:(11) 
Instead of a Gamma function, the conjugate prior for the factors is now a Beta function, due to the binary nature of the setting, and the corresponding likelihood is a Bernoulli distribution. Choosing and
results in a uniform Beta distribution for the prior that drops sharply to zero at either end of the interval
, which is helpful to avoid overconfidence with extreme predictions. The parameters of the posterior Beta distribution are then calculated as:(12) 
The factors and their uncertainties are in turn estimated from the mean (or median) and variance of this Beta distribution, similar to the approach taken for the multiplicative regression mode.
The performance of the algorithm can be improved by the inclusion of sample weights according to the following scheme:
(13) 
Similar to the approach taken in boosting, i.e. the combination of several weak learners into a strong one, this definition enforces the training process to put more emphasis on observations that have been misclassified in the current state of the algorithm. Eqn. 12 then reads:
(14) 
(15) 
Like in the multiplicative regression mode, the logarithm is then used to transform the range to and in turn the same approach to regularization and smoothing can be taken.
4 Example: Demand Forecasting
A very useful application of Cyclic Boosting’s multiplicative regression mode is to forecast future demand of individual products sold in a retail location. Hereby, demand is influenced by promotions, price changes, rebates, coupons, and even cannibalization effects within the assortment range. Furthermore, customer behavior is not uniform but varies throughout the week and is influenced by seasonal effects and the local weather, as well as many other contributing factors. Hence, even though demand generally follows a negative binomial distribution [Ehrenberg1959], the exact values of the parameters are specific to a single product to be sold on a specific day in a specific location or sales channel and depend on the wide range of frequently changing influencing factors mentioned above.
Cyclic Boosting allows to efficiently calculate all relevant parameters to model the demand of individual products, taking a wide range of influencing factors into account, while at the same time allowing the operational business to track and understand how each individual prediction was made.
4.1 Data and algorithm training
We use data from a Kaggle online competition [kaggle_data] to demonstrate demand forecasting with Cyclic Boosting and showcase its properties. The data set consists of the fields date, store, item, and sales, the latter being the target to predict. There are five years of historical data, from beginning of 2013 until end of 2017, for 10 different stores and 50 different items.
Besides store and item, we include several features describing trend and seasonality, namely days since beginning of 2013 as linear trend as well as day of week, day of year, month, and week of month. A list of all used features (one and twodimensional) can be found in the legend of fig. 3. For example, twodimensional features including the variable ”item” allow to learn characteristics of time series of individual products.
Exemplarily, fig. 1 shows a detailed analysis of the factors for the feature variable ”item”. Shown are the mean values of the prediction after completion of the training as well as the observed, true values in each bin divided by the global mean. This visualization directly indicates possible deviations from the optimal fit results in the different bins. Here, no significant deviations are present across the whole range of values. Furthermore, the smoothed values of the factors, i.e. the actual fitted parameters of the model, are shown. These differ from the mean values of the target and prediction in the different bins divided by the global mean due to correlations with other features.
An example for a twodimensional feature combination, namely ”store” and trend ”td”, is shown in fig. 2
. The upper lefthand plot shows a binned, twodimensional, colorcoded visualization of the deviations between final predictions and truth. The lower lefthand plot shows the smoothed values of the twodimensional factors, again visualized by means of colorcoding. Here, one of the features is categorical (”store”) and the other one continuous (”td”), and the twodimensional smoothing is performed by means of grouping by the categorical feature dimension and smoothing the continuous one. An alternative for a twodimensional smoothing in case of two continuous features consists in performing a truncated singularvalue decomposition. The two righthand plots show the two corresponding marginal smoothed factor distributions for the mean of the respective other dimension (solid red) and its individual categories (transparent blue) as well as the marginal distributions for final predictions and observed (true) values.
These examples show how Cyclic Boosting supports model development in terms of feature engineering by means of analysis and individual preprocessing.
4.2 Explanation of the predictions
As stated above, one of the advantages of the Cyclic Boosting algorithm is that each individual prediction can be interpreted and related to the feature variables used as input, as shown in fig. 3 for three different predictions . A value of implies that the importance of this particular feature is neutral compared to the others, the strength of the deviation indicates how important a given feature is for the individual prediction. As the figure illustrates, the importance of the individual features, from which the final prediction is calculated, can vary significantly from one observation to the next.
4.3 Results
The published results of the competition correspond to a test period from beginning of January to end of March 2018 [kaggle_data]. However, the true observed values in this test period are not publicly available. Since the main aim of this example is to demonstrate that the Cyclic Boosting algorithm achieves at least comparable performance to other machine learning approaches while retaining the benefit of fully explainable predictions, the data until the end of 2016 were used for training the model and the first three months of 2017 were taken as an independent test sample. This reproduces the conditions of the competition as closely as possible and allows for a likeforlike comparison of the published results and scores with the results obtained from this model. Using the observed sales in the first three months of 2017 and comparing these to the predicted values, the approach using Cyclic Boosting results in a symmetric mean absolute percentage error of . The same approach with a training period until end of 2015 and prediction of the first three months of 2016 yields . The winning models of the competition have scores of and for and of the data set for the first three months in 2018. Given the upward trend of sales and the larger training data set, we expect the of our model in 2018 to be at least comparable to the winning method. One can therefore conclude that Cyclic Boosting can compete with other available algorithms in terms of forecast quality while retaining full explainability of the individual predictions.
In order to demonstrate the robustness of Cyclic Boosting against variations of the number of feature bins, which could be considered as hyperparameters, we both halved and doubled the number of bins (from default value of 100) for the two utilized continuous features, what resulted in changes in fourth place after the comma for
values.As an additional remark, this simulated data set includes no information on prices, promotions, or product hierarchy and also shows no dependency on events, like holidays, weather, or other exogenous variables, for which the full potential of Cyclic Boosting would come to fruition.
5 Conclusion
A new machine learning algorithm, called Cyclic Boosting, was presented, which can be categorized as a generalized additive model with a cyclic coordinate descent optimization featuring a boostinglike update of parameters.
Cyclic Boosting addresses the challenge of prediction explainability on a fundamental level: Rather than relying on black box approaches, individual predictions for single observations should not only be accurate but also explainable. Each prediction calculated using the Cyclic Boosting algorithm can be explained in terms of the strength of each feature variable contributing to the prediction.
Furthermore, Cyclic Boosting facilitates (multidimensional) feature engineering and enables the modeling of hierarchical causal dependencies and the prediction of rare effects in the data. Thereby, overfitting is effectively avoided by means of regularization and smoothing extensions.
Accepted at ICMLA 2019.
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Comments
There are no comments yet.