The economic importance of the small and medium enterprise (SME) sector is well recognized [Biggs2002], and business loans are essential to the operation of SMEs. However, it is also acknowledged that SMEs may be under-served, especially in terms of finance [Lloyd-Reason and Mughan2006]. This has led to significant debate on the best methods to serve this sector. A substantial portion of the SME sector may have neither the security required for conventional collateral-based bank lending nor returns high enough to attract formal venture capitalists and other risk investors. The effective management of lending to SMEs can contribute significantly to the overall growth and profitability of banks [Abbott and others2011].
Banks have traditionally relied on a combination of documentary sources of information, interviews and visits, and the personal knowledge and expertise of managers in assessing the risk of business loans. Today, financial institutions have also begun to use big data and machine learning to manage credit risk for loans [Khandani et al.2010]. Revenue is a key indicator for credit limit management, so an effective revenue forecasting model is very beneficial for it.
Forecasting the revenue of SMEs is a very challenging task. Traditional machine learning methods for financial regression tasks such as revenue forecasting, for example Gradient Boosting Machines (GBMs) [Friedman2001], use the nonlinear transformations of decision trees to obtain more robust predictions. However, for regression tasks, popular models such as GBMs provide only point estimates (forecast expectations or medians) and cannot quantify the predictive uncertainty. In financial tasks, it is crucial to estimate the uncertainty of a forecast. The real revenue of SMEs is heteroscedastic: small enterprises with relatively unstable operating conditions have larger variance than medium enterprises with relatively stable operating conditions. A proper credit limit cannot be granted if the uncertainty of an enterprise's revenue forecast cannot be estimated. This is especially the case when the predictions feed directly into automated decision making, because probabilistic uncertainty estimates are important in determining manual fall-back alternatives in the workflow [Kruchten2016].
To quantify the uncertainty, we need to move from point estimation models to probabilistic prediction models. Probabilistic prediction, in which the model outputs a full probability distribution over the entire outcome space, is a natural way to quantify this uncertainty.
Bayesian and non-Bayesian methods are the two main families of approaches to probabilistic uncertainty estimation. Bayesian methods naturally generate predictive uncertainty by integrating predictions over the posterior, but we are only interested in the predictive uncertainty itself and not in the concrete procedure that generates it when predicting the revenue of SMEs. In practice, Bayesian methods, such as Bayesian neural networks and Bayesian Additive Regression Trees (BART) [Chipman et al.2010], are often harder to implement and slower to train than non-Bayesian methods. Moreover, sampling-based Bayesian methods generally require considerable statistical expertise and are therefore harder to use. Natural Gradient Boosting (NGBoost) [Duan et al.2019], the state-of-the-art non-Bayesian algorithm, uses the natural gradient to address the challenge of simultaneously boosting multiple distribution parameters with the base learners. Its authors demonstrate empirically that NGBoost performs competitively relative to other models both in its predictive uncertainty estimates and on traditional metrics. They use the decision tree from scikit-learn [Pedregosa et al.2011] as the base learner, which is a single-machine algorithm that uses exact greedy splitting. NGBoost therefore only works on relatively small data sets because of the single-machine limit.
In this paper, we further derive the natural gradient to make it suitable for large-scale financial scenarios, and we call the resulting method Scalable Natural Gradient Boosting Machines (SN-GBM). We study the Fisher information of the normal distribution and find that the updating procedure of its natural gradient can be further optimized. For the normal distribution, we propose a more efficient updating method for the natural gradient, which can dramatically improve computational efficiency. The base learner of SN-GBM is the classification and regression tree (CART) [Breiman2017], which is the most popular algorithm for tree induction. Compared with NGBoost, SN-GBM adopts a more efficient distributed decision tree based on an approximate splitting algorithm as the tree-based learner, which improves computational efficiency and robustness. We derive an uncertainty quantification function to distinguish between samples whose predictions are accurate and those whose predictions are inaccurate. In financial scenarios, interpretability is always demanded because of transparency requirements, so we provide two kinds of interpretability, including interpretability of the uncertainty. Through the uncertainty interpretability, we can identify the factors that cause the predictive uncertainty. In addition, we use the uncertainty output to improve steps of the regression workflow, such as feature selection.
We summarize our contributions as follows:
- We propose SN-GBM for large-scale uncertainty estimation in real industry and provide interpretability of the model.
- We apply an uncertainty estimation algorithm to revenue forecasting of SMEs for the first time.
- We explore a range of uses of uncertainty estimation in regression tasks, which can bring a new modeling perspective.
2 Related Work
Sales Forecast. Because few published works address revenue forecasting, we refer to research on sales forecasting, since sales often determine revenue. Sales forecasting plays a prominent role in business strategy for generating revenue. The previous month's sales are found to be among the most prominent parameters influencing the sales forecast in [Sharma and Sinha2012]; previous revenue is likewise an important factor in our revenue forecast. The most commonly used techniques for sales forecasting include statistical approaches such as time series and regression, and computational intelligence methods such as the fuzzy back-propagation network (FBPN). [Chang and Wang2006] and [Sharma and Sinha2012] both use FBPN for sales forecasting. FBPN is more robust than traditional multiple linear regression in [Sharma and Sinha2012], which indicates that nonlinear models are more appropriate for nonlinear regression tasks such as sales forecasting.
Gradient Boosting Machines. The Gradient Boosting Machine [Friedman2001] is a widely used machine learning algorithm, owing to its efficiency, accuracy, and interpretability. It has been shown to give state-of-the-art results on structured data (for example, in Kaggle competitions). Popular scalable implementations of tree-boosting methods include XGBoost [Chen and Guestrin2016] and LightGBM [Ke et al.2017]. We are motivated in part by the empirical success of tree-based methods, although they only provide homoscedastic regression. One of the key problems in tree boosting is finding the best split value for a feature. [Chen and Guestrin2016] efficiently supports exact greedy splitting in the single-machine version, as well as an approximate splitting algorithm. We also draw on some of the engineering optimizations in XGBoost and LightGBM.
Uncertainty Estimation. Approaches to probabilistic forecasting can broadly be distinguished as Bayesian or non-Bayesian. Bayesian approaches (which include a prior and a likelihood) that leverage decision trees for structured input data include [Chipman et al.2010], [Lakshminarayanan et al.2016] and [He et al.2019]. Bayesian NNs learn a distribution over weights to estimate predictive uncertainty [Lakshminarayanan et al.2017]. Bayesian approaches are computationally expensive and are not easy to turn into distributed algorithms. We are only interested in the predictive uncertainty and not in the concrete process that generates it, so Bayesian approaches are not considered here. A non-Bayesian approach similar to our work is [Duan et al.2019], which uses the natural gradient to solve the problem of multi-parameter boosting. Such a heteroscedastic approach to capturing uncertainty has also been called aleatoric uncertainty estimation [Kendall and Gal2017]. As with NGBoost, uncertainty that arises from dataset shift or out-of-distribution inputs [Shift] is outside the scope of our work.
For the theory of the algorithm, we mainly refer to the work on NGBoost. First, we clarify how NGBoost uses the natural gradient to implement probabilistic prediction. Then we present the improvements we have made on the basis of NGBoost, including a more efficient updating method for the natural gradient. Traditional models can only provide interpretability for the expectation, whereas SN-GBM provides two kinds of interpretability, including interpretability of the uncertainty. Finally, we implement a robust and interpretable Scalable Natural Gradient Boosting Machine based on the decision tree from Spark, which is significantly faster than NGBoost.
The target of traditional regression prediction methods is to estimate $\mathbb{E}[y \mid x]$, while the target of probabilistic forecasting is to estimate $P_\theta(y \mid x)$, where $x$ is a vector of observed features, $y$ is the prediction target, and $\theta$ are the parameters of the target distribution. Take the normal distribution as an example: $\theta = (\mu, \sigma)$. More specifically, different $x$ have different parameters $\theta$, that is, $\theta(x) = (\mu(x), \sigma(x))$.
3.1.1 Proper Scoring Rules
Fitting different targets requires different loss functions. Probabilistic estimation requires a "proper scoring rule" as its optimization objective. A proper scoring rule $S$ takes as input a forecasted probability distribution $P$ and one observation $y$, and the true distribution of the outcomes gets the best score in expectation [Gneiting and Raftery2007]. In mathematical notation, a scoring rule $S$ is a proper scoring rule if and only if it satisfies

$$\mathbb{E}_{y \sim Q}\left[S(Q, y)\right] \le \mathbb{E}_{y \sim Q}\left[S(P, y)\right] \quad \text{for all } P, Q,$$

where $Q$ represents the true distribution of outcomes $y$, and $P$ is any distribution. When a proper scoring rule is used as the loss function during model training, the model converges toward outputting calibrated probabilities. The negative log-likelihood used in maximum likelihood estimation (MLE), $\mathcal{L}(\theta, y) = -\log P_\theta(y)$, satisfies the above property. The corresponding divergence from one distribution to another is the familiar Kullback-Leibler (KL) divergence:

$$D_{\mathrm{KL}}(Q \,\|\, P) = \mathbb{E}_{y \sim Q}\left[\mathcal{L}(P, y)\right] - \mathbb{E}_{y \sim Q}\left[\mathcal{L}(Q, y)\right] = \mathbb{E}_{y \sim Q}\left[\log \frac{Q(y)}{P(y)}\right].$$

The KL divergence has the useful property of being invariant to the choice of parametrization [Dawid and Musio2014]. We discuss the importance of this property in later sections.
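As a small numerical illustration of the propriety property, the following sketch evaluates the expected negative log-likelihood of a few candidate normal forecasts when the outcomes truly follow a fixed normal distribution, using the closed-form expectation. The function names, the chosen true parameters, and the candidate list are illustrative assumptions and are not part of the method itself.

```python
import math

def expected_nll(mu_true, sigma_true, mu_pred, sigma_pred):
    """Expected negative log-likelihood of the forecast N(mu_pred, sigma_pred^2)
    when outcomes are drawn from N(mu_true, sigma_true^2) (closed form)."""
    return (math.log(sigma_pred) + 0.5 * math.log(2.0 * math.pi)
            + (sigma_true ** 2 + (mu_true - mu_pred) ** 2) / (2.0 * sigma_pred ** 2))

def kl_divergence(mu_true, sigma_true, mu_pred, sigma_pred):
    """KL(Q || P), computed as the excess expected score of P over the true Q."""
    return (expected_nll(mu_true, sigma_true, mu_pred, sigma_pred)
            - expected_nll(mu_true, sigma_true, mu_true, sigma_true))

# Hypothetical true distribution and candidate forecasts (mu, sigma).
truth = (1.0, 0.5)
candidates = [(1.0, 0.5), (1.2, 0.5), (1.0, 0.3), (1.0, 1.0), (0.0, 2.0)]

scores = {c: expected_nll(*truth, *c) for c in candidates}
best = min(scores, key=scores.get)  # forecast with the lowest expected score
divergences = {c: kl_divergence(*truth, *c) for c in candidates}
```

Under these assumptions the true distribution attains the lowest expected score, and the KL divergence of every other candidate is positive.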
3.1.2 Natural Gradient
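Gradient descent is the most commonly used method for optimizing an objective function. The ordinary gradient of a scoring rule $S$ with respect to the parameters $\theta$ is the direction of steepest ascent (fastest increase in infinitesimally small steps). That is,

$$\nabla S(\theta, y) \propto \lim_{\epsilon \to 0} \; \underset{d \,:\, \|d\| = \epsilon}{\arg\max} \; S(\theta + d, y).$$

However, the ordinary gradient is not invariant to reparametrization. More specifically, if we transform $\theta$ into $\psi = z(\theta)$, an infinitesimal gradient step taken in $\theta$ and one taken in $\psi$ generally lead to different distributions, $P_{\theta + d\theta} \neq P_{\psi + d\psi}$. Therefore, the choice of parametrization affects the updating path of the parameters. We explain later why invariance to reparametrization is needed.
The generalized natural gradient is the direction of steepest ascent in Riemannian space, which is invariant to parametrization, and is defined as:

$$\tilde{\nabla} S(\theta, y) \propto \lim_{\epsilon \to 0} \; \underset{d \,:\, D_S(P_\theta \,\|\, P_{\theta + d}) = \epsilon}{\arg\max} \; S(\theta + d, y).$$

When MLE is chosen as the proper scoring rule, we get:

$$\tilde{\nabla} \mathcal{L}(\theta, y) \propto \mathcal{I}_{\mathcal{L}}(\theta)^{-1} \, \nabla \mathcal{L}(\theta, y),$$

where $\mathcal{I}_{\mathcal{L}}(\theta)$ is the Fisher information carried by an observation about $\theta$. Note that a Fisher information matrix is calculated for each sample.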
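The invariance argument can be checked numerically. The sketch below assumes a normal distribution and compares two parametrizations, $(\mu, \sigma)$ and $(\mu, \log\sigma)$, using the standard Fisher information matrices $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ and $\mathrm{diag}(1/\sigma^2, 2)$. A step computed in $(\mu, \log\sigma)$ is mapped back to $(\mu, \sigma)$ through $d\sigma = \sigma \, d\log\sigma$. All function names and the sample values are illustrative assumptions.

```python
def grad_mu_sigma(mu, sigma, y):
    """Ordinary gradient of the normal NLL with respect to (mu, sigma)."""
    r = y - mu
    return [-r / sigma ** 2, 1.0 / sigma - r ** 2 / sigma ** 3]

def grad_mu_logsigma(mu, sigma, y):
    """Ordinary gradient of the normal NLL with respect to (mu, log sigma)."""
    r = y - mu
    return [-r / sigma ** 2, 1.0 - r ** 2 / sigma ** 2]

def natgrad_mu_sigma(mu, sigma, y):
    # Fisher information for (mu, sigma) is diag(1/sigma^2, 2/sigma^2).
    g = grad_mu_sigma(mu, sigma, y)
    return [g[0] * sigma ** 2, g[1] * sigma ** 2 / 2.0]

def natgrad_mu_logsigma(mu, sigma, y):
    # Fisher information for (mu, log sigma) is diag(1/sigma^2, 2).
    g = grad_mu_logsigma(mu, sigma, y)
    return [g[0] * sigma ** 2, g[1] / 2.0]

def to_sigma_space(step, sigma):
    """Map an infinitesimal step in (mu, log sigma) to (mu, sigma)."""
    return [step[0], sigma * step[1]]

mu, sigma, y = 0.0, 2.0, 3.0  # hypothetical sample
ordinary_direct = grad_mu_sigma(mu, sigma, y)
ordinary_mapped = to_sigma_space(grad_mu_logsigma(mu, sigma, y), sigma)
natural_direct = natgrad_mu_sigma(mu, sigma, y)
natural_mapped = to_sigma_space(natgrad_mu_logsigma(mu, sigma, y), sigma)
```

For this sample the two ordinary-gradient directions disagree in the $\sigma$ component, while the two natural-gradient directions coincide, which is the invariance property described above.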
3.2 Scalable Natural Gradient Boosting
In this section, we take the normal distribution as an example to demonstrate how to implement efficient large-scale distribution estimation.
3.2.1 Simplify Computation
The key step of NGBoost is to calculate the natural gradient $\tilde{\nabla} \mathcal{L}(\theta, y)$, which amounts to calculating $\mathcal{I}_{\mathcal{L}}(\theta)^{-1} \nabla \mathcal{L}(\theta, y)$. NGBoost obtains it by solving a system of linear equations for every sample, which for a dense system costs $O(k^3)$ operations, where $k$ is the number of distribution parameters. This cost is relatively high for a single-machine algorithm. Moreover, solving a system of linear equations is not conducive to implementing distributed parallel algorithms. We find that, under the assumption of a normal distribution, a more direct method for calculating the natural gradient can be derived.
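The normal distribution is the most commonly used probability distribution. Many forecasting targets follow the normal distribution or can be transformed into one (for example, log-normally distributed targets). We therefore optimize the natural gradient calculation for the normal distribution. For the normal distribution, the distribution parameters are $\theta = (\mu, \log\sigma)$, where $\mu \in (-\infty, +\infty)$ and $\sigma \in (0, +\infty)$. By further derivation, we get:

$$\nabla \mathcal{L}(\theta, y) = \begin{pmatrix} \dfrac{\mu - y}{\sigma^2} \\[2mm] 1 - \dfrac{(y - \mu)^2}{\sigma^2} \end{pmatrix}.$$

The inverse of the Fisher information matrix can also be derived directly. The Fisher information of the normal distribution is as follows:

$$\mathcal{I}_{\mathcal{L}}(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & 2 \end{pmatrix}.$$

Then, we can get:

$$\mathcal{I}_{\mathcal{L}}(\theta)^{-1} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \dfrac{1}{2} \end{pmatrix}.$$

Finally, we derive the result of the natural gradient:

$$\tilde{\nabla} \mathcal{L}(\theta, y) = \mathcal{I}_{\mathcal{L}}(\theta)^{-1} \nabla \mathcal{L}(\theta, y) = \begin{pmatrix} \mu - y \\[1mm] 0.5 \left( 1 - \dfrac{(y - \mu)^2}{\sigma^2} \right) \end{pmatrix}.$$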
The second term is written as a multiplication because a CPU executes multiplication much faster than division. As the first term shows, NGBoost fits the expectation in the same way as an ordinary gradient boosting machine that targets the mean squared error (MSE).
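To make the simplification concrete, the sketch below compares a per-sample linear solve (in the spirit of the NGBoost computation, here written with Cramer's rule for a 2x2 system) with a closed-form natural gradient. It assumes the parametrization $\theta = (\mu, \log\sigma)$ used by NGBoost, for which the Fisher information of the normal distribution is $\mathrm{diag}(1/\sigma^2, 2)$. The function names and sample values are hypothetical.

```python
def nll_gradient(mu, sigma, y):
    """Ordinary gradient of the normal NLL with respect to (mu, log sigma)."""
    r = y - mu
    return (-r / sigma ** 2, 1.0 - r ** 2 / sigma ** 2)

def fisher_information(sigma):
    """Fisher information of N(mu, sigma^2) in the (mu, log sigma) parametrization."""
    return ((1.0 / sigma ** 2, 0.0), (0.0, 2.0))

def natural_gradient_by_solve(mu, sigma, y):
    """Solve I(theta) x = grad for x with Cramer's rule (generic 2x2 solve)."""
    (a, b), (c, d) = fisher_information(sigma)
    g0, g1 = nll_gradient(mu, sigma, y)
    det = a * d - b * c
    return ((g0 * d - b * g1) / det, (a * g1 - c * g0) / det)

def natural_gradient_closed_form(mu, sigma, y):
    """Closed form: no linear system, and a single division per sample."""
    r = y - mu
    inv_var = 1.0 / (sigma * sigma)
    return (mu - y, 0.5 * (1.0 - r * r * inv_var))

# Hypothetical (mu, sigma, y) samples.
samples = [(0.0, 2.0, 3.0), (1.5, 0.7, 1.0), (-2.0, 1.0, 0.5)]
pairs = [(natural_gradient_by_solve(*s), natural_gradient_closed_form(*s))
         for s in samples]
```

Both routes give the same vector for every sample. The closed form is also an element-wise expression over samples, which is easier to evaluate in a distributed setting than one linear solve per sample.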
3.2.2 Scalable Natural Gradient Boosting
Gradient boosting is effectively a functional gradient descent algorithm. To fit multiple parameters of the distribution, we need multiple sets of trees, with each set of trees fitting one parameter. Taking the normal distribution as an example, we use two sets of trees to fit $\mu$ and $\log\sigma$. The range of the GBM output is $(-\infty, +\infty)$, but the range of $\sigma$ is $(0, +\infty)$, so reparameterizing $\sigma$ to $\log\sigma$ makes it consistent with the GBM output. This is one of the important reasons why the natural gradient is needed: it has the desirable property of being invariant to reparameterization. Another reason to use the natural gradient is that it allows the same update step size to be used for the two new trees when the two sets of trees are updated at each stage. This is because, through the adjustment by $\mathcal{I}_{\mathcal{L}}(\theta)^{-1}$, the gradient is brought to the same scale both across samples and across parameters ("optimally pre-scaled").
Apache Spark is a popular open-source platform for large-scale data processing, which is especially well suited for iterative machine learning tasks [Zaharia et al.2010]. MLlib [Meng et al.2016] provides ensembles of decision trees for classification and regression problems. Its decision trees use many state-of-the-art techniques from the PLANET project [Panda et al.2009], such as data-dependent feature discretization to reduce communication costs. Based on the decision trees from Spark ML, we implement scalable natural gradient boosting machines as a tree-parallel and feature-parallel system. Since there is no dependency between the two base learners at each iteration, the two trees for the two parameters can be constructed in parallel.
The overall training procedure is summarized in Algorithm 1. For the normal distribution, $g_\mu$ and $g_{\log\sigma}$ are the natural gradients of $\mu$ and $\log\sigma$, respectively. In each iteration $m$, two tree learners $f_\mu^{(m)}$ and $f_{\log\sigma}^{(m)}$ are constructed in parallel. The scaling factor $\rho^{(m)}$ is chosen by a line search to minimize the negative log-likelihood. We multiply it by the global update step $\eta$, then update the parameters to $\mu^{(m)}$ and $\log\sigma^{(m)}$.
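The training loop can be sketched on a single machine as follows. This is only an illustration under several assumptions that differ from the system described above: a one-feature regression stump stands in for the distributed Spark decision tree, a simple backtracking search on the step $\eta\rho$ stands in for the line search, the parametrization is $(\mu, \log\sigma)$, and the synthetic heteroscedastic data set is invented for the example. All names are hypothetical.

```python
import math
import random

def mean_nll(ys, mus, log_sigmas):
    """Average normal negative log-likelihood with parameters (mu, log sigma)."""
    total = 0.0
    for y, m, s in zip(ys, mus, log_sigmas):
        total += s + 0.5 * math.log(2.0 * math.pi) + 0.5 * (y - m) ** 2 * math.exp(-2.0 * s)
    return total / len(ys)

def fit_stump(xs, targets):
    """Least-squares regression stump on one feature (stand-in base learner)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(targets)
    best_score, thr, left, right = -1.0, None, total / n, total / n
    left_sum = 0.0
    for k in range(1, n):
        left_sum += targets[order[k - 1]]
        if xs[order[k]] == xs[order[k - 1]]:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        # Maximising this score is equivalent to minimising the squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if score > best_score:
            best_score = score
            thr = 0.5 * (xs[order[k]] + xs[order[k - 1]])
            left, right = left_sum / k, right_sum / (n - k)
    return lambda x: left if (thr is None or x <= thr) else right

def train(xs, ys, n_rounds=50, eta=0.1):
    n = len(ys)
    mean = sum(ys) / n
    std = math.sqrt(sum((y - mean) ** 2 for y in ys) / n)
    mus, log_sigmas = [mean] * n, [math.log(std)] * n
    initial_nll = mean_nll(ys, mus, log_sigmas)
    ensemble = []
    for _ in range(n_rounds):
        # Closed-form natural gradients for the normal distribution.
        g_mu = [m - y for y, m in zip(ys, mus)]
        g_ls = [0.5 * (1.0 - (y - m) ** 2 * math.exp(-2.0 * s))
                for y, m, s in zip(ys, mus, log_sigmas)]
        # The two base learners are independent and could be fitted in parallel.
        f_mu, f_ls = fit_stump(xs, g_mu), fit_stump(xs, g_ls)
        p_mu, p_ls = [f_mu(x) for x in xs], [f_ls(x) for x in xs]
        # Backtracking search: halve rho until the training NLL does not increase.
        current, rho = mean_nll(ys, mus, log_sigmas), 1.0
        while rho > 1e-6:
            cand_mu = [m - eta * rho * p for m, p in zip(mus, p_mu)]
            cand_ls = [s - eta * rho * p for s, p in zip(log_sigmas, p_ls)]
            if mean_nll(ys, cand_mu, cand_ls) <= current:
                break
            rho *= 0.5
        else:
            break  # no acceptable step was found, so stop boosting
        mus, log_sigmas = cand_mu, cand_ls
        ensemble.append((rho, f_mu, f_ls))
    model = (mean, math.log(std), ensemble, eta)
    return model, initial_nll, mean_nll(ys, mus, log_sigmas)

def predict(model, x):
    """Return the predicted (mu, sigma) for a single feature value."""
    mu, log_sigma, ensemble, eta = model
    for rho, f_mu, f_ls in ensemble:
        mu -= eta * rho * f_mu(x)
        log_sigma -= eta * rho * f_ls(x)
    return mu, math.exp(log_sigma)

# Synthetic heteroscedastic data: the noise is larger for x >= 0.5.
random.seed(0)
xs = [random.random() for _ in range(400)]
ys = [2.0 * x + random.gauss(0.0, 0.1 if x < 0.5 else 1.0) for x in xs]
model, initial_nll, final_nll = train(xs, ys)
```

Calling `predict(model, 0.1)` and `predict(model, 0.9)` returns the fitted mean and standard deviation in the low-noise and high-noise regions of this synthetic data set.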
3.2.3 Interpretability of Uncertainty
For each sample, SN-GBM outputs two prediction results: the forecast expectation $\mu$ and the variance $\sigma^2$. Theoretically, the smaller the variance, the narrower the distribution and the more accurate the prediction. Heteroscedasticity in the data is a common source of uncertainty, and it often occurs when the sizes of the observations differ greatly. We therefore use the variance to estimate the uncertainty of the prediction results. In tree-based models, feature importance is often used when interpreting the model. SN-GBM is composed of two sets of trees: the expectation set and the variance set. We provide two approaches to obtaining the feature importance of the variance:
- Weight: The number of times a feature is used to split the data across variance trees.
- Gain: The average gain of the feature when it is used in variance trees.
From the feature importance of the variance, we can tell which features affect the uncertainty of the prediction and how strongly.
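The two importance measures can be sketched as follows, assuming each variance tree is available as a list of split records `(feature_name, gain)`. This representation, the function name, and the feature names in the example are hypothetical and do not reflect the Spark tree data structures.

```python
from collections import defaultdict

def variance_feature_importance(variance_trees):
    """Compute Weight and Gain importance over the variance trees.

    variance_trees: list of trees, each a list of (feature_name, gain) splits.
    """
    weight = defaultdict(int)        # number of splits that use the feature
    total_gain = defaultdict(float)  # summed gain of those splits
    for tree in variance_trees:
        for feature, gain in tree:
            weight[feature] += 1
            total_gain[feature] += gain
    avg_gain = {f: total_gain[f] / weight[f] for f in weight}
    return dict(weight), avg_gain

# Hypothetical split records from two variance trees.
trees = [
    [("revenue_var_3m", 4.0), ("revenue_last_month", 1.0)],
    [("revenue_var_3m", 2.0), ("trade_count", 0.5), ("revenue_var_3m", 3.0)],
]
weight, gain = variance_feature_importance(trees)
```

In this toy input, `revenue_var_3m` is used three times with an average gain of 3.0, so it ranks first under both measures.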
4 Application in Revenue Forecast
We propose an approach for quantifying the uncertainty of a forecasting target whose original distribution is not normal. To provide reliable and accurate predictions, we derive an uncertainty quantification function for revenue forecasting. Through the uncertainty quantification function, we can estimate the approximate probability that the prediction for each sample is accurate. In addition, we propose a new feature selection approach based on the feature importance of the variance, which can improve the precision of uncertainty quantification.
4.1 Uncertainty Quantification
The normal distribution is the most commonly used probability distribution. According to the central limit theorem, if a quantity is the combined result of many independent factors, then, whatever the distribution of each factor, their average is approximately normally distributed. The normal distribution is symmetric, but many real-world distributions are asymmetric. If the effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal. A Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape, and the log transformation is a special case of it. The real revenue distribution is close to a log-normal distribution. After log transformation, the revenue distribution becomes approximately normal, as shown in Figure 1.
In regression tasks, not only the error between the prediction and the observation but also the ratio between the error and the observation needs to be considered. In revenue forecasting, we train the SN-GBM model to fit $\ln y$, where $\ln y \sim \mathcal{N}(\mu, \sigma^2)$, so the $\mu$ and $\sigma^2$ output by the model are the expectation and variance of $\ln y$. What we actually need is the uncertainty of the original revenue forecast, measured by the relative standard deviation, that is, the ratio of the standard deviation of $y$ to the expectation of $y$. If the random variable $\ln y$ has a normal distribution, then $y = \exp(\ln y)$ has a log-normal distribution. Let $\mathrm{RSD}$ denote the relative standard deviation:

$$\mathrm{RSD}(y) = \frac{\sqrt{\mathrm{Var}(y)}}{\mathbb{E}[y]} = \frac{\sqrt{\left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}}}{e^{\mu + \sigma^2 / 2}} = \sqrt{e^{\sigma^2} - 1}.$$

Because $\mathrm{RSD}(y)$ is a monotonically increasing function of $\sigma$, we can also use $\sigma$ to measure the relative standard deviation of $y$.
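The relationship between the log-scale standard deviation and the relative standard deviation on the original scale can be checked with the standard moments of the log-normal distribution, $\mathbb{E}[y] = e^{\mu + \sigma^2/2}$ and $\mathrm{Var}(y) = (e^{\sigma^2} - 1) e^{2\mu + \sigma^2}$. The sketch below is illustrative, and the function names and parameter values are assumptions.

```python
import math

def lognormal_mean(mu, sigma):
    """Mean of y when ln(y) ~ N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

def lognormal_std(mu, sigma):
    """Standard deviation of y when ln(y) ~ N(mu, sigma^2)."""
    return math.sqrt(math.expm1(sigma ** 2) * math.exp(2.0 * mu + sigma ** 2))

def relative_std(sigma):
    """Relative standard deviation of y; it depends on sigma only."""
    return math.sqrt(math.expm1(sigma ** 2))

# The ratio std / mean does not depend on mu (hypothetical values).
ratio_small_mu = lognormal_std(2.0, 0.3) / lognormal_mean(2.0, 0.3)
ratio_large_mu = lognormal_std(10.0, 0.3) / lognormal_mean(10.0, 0.3)

# The relative standard deviation grows monotonically with sigma.
rsd_values = [relative_std(s) for s in (0.1, 0.2, 0.5, 1.0)]
```

Because the ratio is independent of $\mu$, ranking samples by $\sigma$ gives the same order as ranking them by the relative standard deviation of the original revenue.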
4.2 Feature Selection
Data from many real-world applications can be high dimensional, and the features of such data are usually highly redundant. Identifying informative features has become an important step in data mining, not only to circumvent the curse of dimensionality but also to reduce the amount of data to be processed. Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction target. One commonly used approach is to filter features by the feature importance output by a tree-based model. Traditional tree-based models select the features that contribute most to the forecast expectation, based on the feature importance of the expectation. This method often ignores the correlation between features and uncertainty (here, the variance). Based on the feature importance of the variance in SN-GBM, we can select features that are highly correlated with predictive uncertainty. Some features may have low expectation importance but high variance importance. Such features may not improve the point estimation performance but may improve the accuracy of the distribution estimation. In future work, the feature importance of the expectation and of the variance can therefore be combined to select features.
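One possible way to combine the two importance rankings is to take the union of the top-ranked features from each. The sketch below illustrates this rule only as an assumption, since the text does not prescribe a specific combination; the function name, the importance scores, and the feature names are hypothetical.

```python
def select_features(expectation_importance, variance_importance, top_k):
    """Union of the top_k features by expectation and by variance importance."""
    def top(scores):
        # Sort by descending score, breaking ties by feature name.
        ranked = sorted(scores, key=lambda f: (-scores[f], f))
        return ranked[:top_k]

    selected = top(expectation_importance)
    for feature in top(variance_importance):
        if feature not in selected:
            selected.append(feature)
    return selected

# Hypothetical importance scores.
expectation_importance = {"revenue_last_month": 0.6, "trade_count": 0.3, "revenue_var_6m": 0.1}
variance_importance = {"revenue_last_month": 0.2, "trade_count": 0.1, "revenue_var_6m": 0.7}
selected = select_features(expectation_importance, variance_importance, top_k=1)
```

Here `revenue_var_6m` would be dropped by a filter that looks at expectation importance alone, but it is retained because it ranks first for the variance.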
Our experiments use datasets from a large fintech services group that has served tens of millions of SMEs. One of its most significant scenarios is credit limit management. Our goal is to forecast the revenue of SMEs over the next six months. This is a time series regression task, so we mainly use the historical revenue and trade data of SMEs as features for constructing the model. We extract twelve sub-datasets from January to December 2018 and five sub-datasets from January to May 2019. The twelve sub-datasets from 2018 form the training set and the five sub-datasets from 2019 form the test set. We extract 215 revenue-related features for each enterprise. The training set contains 10 million samples.
5.2 Evaluation Criteria
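The traditional evaluation metric is the mean absolute percentage error (MAPE) of the forecasted expectations (i.e., $\mu$), which expresses accuracy as a percentage. Because the $\mu$ output by SN-GBM is the expectation of $\ln y$, our MAPE is defined on the original scale as:

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - e^{\mu_i}}{y_i} \right|.$$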
However, MAPE does not capture predictive uncertainty. The quality of the predictive uncertainty is captured by the average negative log-likelihood (NLL) measured on the test set. NLL is calculated as follows:

$$\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p\left(\ln y_i \mid \mu_i, \sigma_i^2\right),$$

where $p$ is the probability density function of the normal distribution.
In addition, to apply the results of predictive uncertainty to credit limit management more intuitively, we add an evaluation metric, ACCURACY, which is the proportion of samples whose prediction error is within 30%. In a later section, we briefly introduce how to use this metric for more refined credit limit management.
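The three metrics can be sketched as follows. The sketch assumes that the model outputs $\mu$ and $\sigma$ on the log scale, that the point forecast on the original scale is obtained as $e^{\mu}$, and that the prediction error in ACCURACY is the relative error $|y - \hat{y}| / y$. These choices, the function names, and the toy values are assumptions made for illustration.

```python
import math

def mape(y_true, mu_log):
    """Mean absolute percentage error (as a fraction) on the original scale."""
    return sum(abs(y - math.exp(m)) / y for y, m in zip(y_true, mu_log)) / len(y_true)

def average_nll(y_true, mu_log, sigma_log):
    """Average negative log-likelihood of ln(y) under N(mu, sigma^2)."""
    total = 0.0
    for y, m, s in zip(y_true, mu_log, sigma_log):
        z = (math.log(y) - m) / s
        total += math.log(s) + 0.5 * math.log(2.0 * math.pi) + 0.5 * z * z
    return total / len(y_true)

def accuracy_within(y_true, mu_log, tolerance=0.30):
    """Proportion of samples whose relative error is within the tolerance."""
    hits = sum(abs(y - math.exp(m)) / y <= tolerance for y, m in zip(y_true, mu_log))
    return hits / len(y_true)

# Hypothetical revenues and log-scale forecasts.
y_true = [100.0, 200.0, 400.0, 1000.0]
mu_log = [math.log(110.0), math.log(150.0), math.log(400.0), math.log(500.0)]
sigma_log = [0.2, 0.3, 0.1, 0.6]

mape_value = mape(y_true, mu_log)
nll_value = average_nll(y_true, mu_log, sigma_log)
accuracy_value = accuracy_within(y_true, mu_log)
```

In this toy example the relative errors are 10%, 25%, 0% and 50%, so three of the four samples count as accurate under the 30% threshold.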
5.3 Empirical Results
5.3.1 Results of Uncertainty Quantification
We compare SN-GBM with several regression models commonly used in financial scenarios, namely XGBoost [Chen and Guestrin2016] and GBDT [Friedman2001]. For a fair comparison, we set learning_rate = 0.3, the number of iterations = 300, and the depth of trees = 6 for all algorithms. Our experimental results show that SN-GBM is comparable to state-of-the-art tree-based models in point estimation performance, as shown in Table 1.
We sort the prediction results by the value of the uncertainty quantification function and then divide them into 10 buckets with equal numbers of samples. The relationship between uncertainty and accuracy is shown in Figure 2: accuracy decreases monotonically as predictive uncertainty increases. If the application demands an accuracy of x%, we can trust the model only in cases where the uncertainty is below the corresponding threshold. For example, the ACCURACY of the top 50% of samples (uncertainty levels 1 to 5) is above 90%. Because we have great confidence in the prediction results for these samples, for these enterprises we can directly use the predicted revenue as a reference factor for their credit limit. For other enterprises, we need to multiply the predicted revenue by a discount factor before using it.
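The bucketing step can be sketched as follows: samples are sorted by their uncertainty score, split into buckets of equal size, and the share of samples with relative error within 30% is computed per bucket. The function name and the toy inputs are hypothetical, and the sketch assumes there are at least as many samples as buckets.

```python
def accuracy_by_uncertainty(uncertainty, rel_error, n_buckets=10, tolerance=0.30):
    """Per-bucket accuracy after sorting samples by increasing uncertainty."""
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    size = len(order) // n_buckets
    result = []
    for b in range(n_buckets):
        # The last bucket also takes any remainder samples.
        if b < n_buckets - 1:
            idx = order[b * size:(b + 1) * size]
        else:
            idx = order[b * size:]
        hits = sum(rel_error[i] <= tolerance for i in idx)
        result.append(hits / len(idx))
    return result

# Hypothetical inputs: 20 samples with increasing uncertainty scores.
uncertainty = [float(i) for i in range(20)]
rel_error = ([0.1] * 8 + [0.5, 0.1]
             + [0.5, 0.1, 0.5, 0.5, 0.1]
             + [0.5] * 5)
bucket_accuracy = accuracy_by_uncertainty(uncertainty, rel_error, n_buckets=4)
```

With these toy inputs the per-bucket accuracy falls from 1.0 in the most certain bucket to 0.0 in the least certain one, so a required accuracy level can be turned into an uncertainty threshold by reading off the last bucket that still meets it.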
The interpretability of SN-GBM includes expectation feature importance and variance feature importance. An example of the top 20 most important features for the expectation in the revenue scenario is shown in Figure 3. From this figure, we observe that features with high expectation feature importance do not necessarily have high variance feature importance.
5.3.3 Feature Selection
For time series regression, variance-type features are often better able to describe the predictive uncertainty. We append three time-series variance features to the 215 original features: the revenue variance over the past 3 months, the past 6 months, and the past 12 months. We compare the point estimation and distribution estimation performance of the models with 215 features and with 218 features (after appending the three revenue variance features). The results of point estimation are shown in Figure 4. Appending the revenue variance features does not bring a significant improvement in the accuracy of the point predictions. However, Figure 5 shows that the accuracy of the distribution estimation is improved significantly by comparison.
In this paper, we propose a large-scale uncertainty estimation approach named SN-GBM to predict the revenue of SMEs. The revenue distribution of SMEs is close to log-normal, so after log transformation it is close to a normal distribution. For the normal distribution, we further derive the natural gradient to make it suitable for large-scale financial scenarios. We derive an uncertainty quantification function for the original log-normal distribution. In particular, we provide interpretability for the predictive uncertainty, through which we can identify the factors that cause it. Experimental results show that we can effectively distinguish between accurate and inaccurate samples on a large-scale real-world dataset, which is significantly beneficial for refined credit limit management. Variance-type features can improve the accuracy of the distribution estimation. In the future, it is worth considering retaining variance-type features when constructing regression models.
- [Abbott and others2011] Lewis F Abbott et al. The Management of Business Lending: A Survey, volume 2. Industrial Systems Research, 2011.
- [Biggs2002] Tyler Biggs. Is small beautiful and worthy of subsidy? literature review. 2002.
- [Breiman2017] Leo Breiman. Classification and regression trees. Routledge, 2017.
- [Chang and Wang2006] Pei-Chann Chang and Yen-Wen Wang. Fuzzy delphi and back-propagation model for sales forecasting in pcb industry. Expert systems with applications, 30(4):715–726, 2006.
- [Chen and Guestrin2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
- [Chipman et al.2010] Hugh A Chipman, Edward I George, Robert E McCulloch, et al. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
- [Dawid and Musio2014] Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules. Metron, 72(2):169–183, 2014.
- [Duan et al.2019] Tony Duan, Anand Avati, Daisy Yi Ding, Sanjay Basu, Andrew Y Ng, and Alejandro Schuler. Ngboost: Natural gradient boosting for probabilistic prediction. arXiv preprint arXiv:1910.03225, 2019.
- [Friedman2001] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
- [Gneiting and Raftery2007] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- [He et al.2019] Jingyu He, Saar Yalov, and P Richard Hahn. Xbart: Accelerated bayesian additive regression trees. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1130–1138, 2019.
- [Ke et al.2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.
- [Kendall and Gal2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
- [Khandani et al.2010] Amir E Khandani, Adlar J Kim, and Andrew W Lo. Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance, 34(11):2767–2787, 2010.
- [Kruchten2016] N Kruchten. Machine learning meets economics. 2016.
- [Lakshminarayanan et al.2016] Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. Mondrian forests for large-scale regression when uncertainty matters. In Artificial Intelligence and Statistics, pages 1478–1487, 2016.
- [Lakshminarayanan et al.2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
- [Lloyd-Reason and Mughan2006] Lester Lloyd-Reason and Terry Mughan. Removing barriers to sme access to international markets. 2006.
- [Meng et al.2016] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.
- [Panda et al.2009] Biswanath Panda, Joshua S Herbach, Sugato Basu, and Roberto J Bayardo. Planet: massively parallel learning of tree ensembles with mapreduce. Proceedings of the VLDB Endowment, 2(2):1426–1437, 2009.
- [Pedregosa et al.2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
- [Sharma and Sinha2012] Rashmi Sharma and Ashok K Sinha. Sales forecast of an automobile industry. International Journal of Computer Applications, 53(12), 2012.
- [Shift] Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift.
- [Zaharia et al.2010] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.