Task-wise Split Gradient Boosting Trees for Multi-center Diabetes Prediction

by   Mingcheng Chen, et al.
Shanghai Jiao Tong University

Diabetes prediction is an important data science application in the social healthcare domain. There exist two main challenges in the diabetes prediction task: data heterogeneity since demographic and metabolic data are of different types, data insufficiency since the number of diabetes cases in a single medical center is usually limited. To tackle the above challenges, we employ gradient boosting decision trees (GBDT) to handle data heterogeneity and introduce multi-task learning (MTL) to solve data insufficiency. To this end, Task-wise Split Gradient Boosting Trees (TSGB) is proposed for the multi-center diabetes prediction task. Specifically, we firstly introduce task gain to evaluate each task separately during tree construction, with a theoretical analysis of GBDT's learning objective. Secondly, we reveal a problem when directly applying GBDT in MTL, i.e., the negative task gain problem. Finally, we propose a novel split method for GBDT in MTL based on the task gain statistics, named task-wise split, as an alternative to standard feature-wise split to overcome the mentioned negative task gain problem. Extensive experiments on a large-scale real-world diabetes dataset and a commonly used benchmark dataset demonstrate TSGB achieves superior performance against several state-of-the-art methods. Detailed case studies further support our analysis of negative task gain problems and provide insightful findings. The proposed TSGB method has been deployed as an online diabetes risk assessment software for early diagnosis.




1. Introduction

Co-first authors; corresponding authors.

As one of the deadliest diseases with many complications in the world, diabetes (in this paper, we use "diabetes" to refer to type 2 diabetes) is becoming one major factor that influences people’s health in modern society (Breault et al., 2002). To help prevent diabetes, predicting it at an early stage from demographic and metabolic data has become an important task in the healthcare domain (Koh et al., 2011). In this paper, we study the problem of diabetes prediction; specifically, we predict whether a person will be diagnosed with diabetes within three years based on the collected data, which can be regarded as a binary classification problem. However, there exist two main challenges in this diabetes prediction task: data heterogeneity and data insufficiency.

Data heterogeneity means the features contained in the collected data are of different types (e.g., “glucose in urine” is numerical while “marriage” is categorical) and distributions (e.g., “gender” tends to be evenly distributed while “occupation” normally follows a long-tail distribution). Although deep neural networks (DNNs) have shown promising performance in vision and language domains (Goodfellow et al., 2016), it is much harder to train DNNs with mixed types of input (Qu et al., 2018), especially when the input distribution is unstable (Ioffe and Szegedy, 2015). In contrast to DNNs, decision trees (Breiman, 2017) are insensitive to data types and distributions, and thus it is appealing to deal with heterogeneous data using decision trees (Breiman, 2017; Qin et al., 2009). More importantly, tree-based methods can implicitly handle missing features, which are common in medical follow-up data submitted by users, as we will mention in Section 6. Beyond that, decision trees are also easy to visualize and interpret (Goodman et al., 2016; Walker et al., 2017), as we mention in Appendix B, which is another advantage over neural networks. Therefore, in this paper, to handle the data heterogeneity challenge in diabetes prediction, we construct our method based on gradient boosting decision trees (GBDT) (Friedman, 2002), which is one of the most popular and powerful tree algorithms.

Data insufficiency is another core challenge in the healthcare domain. Since data collection is costly in medical centers, the volume of data used to train models is usually limited, making it challenging to achieve adequate performance. Moreover, the data distribution of different medical centers can vary greatly, so training a single model over the pooled multi-center dataset does not lead to satisfactory average or per-center prediction results. Based on this observation, it is necessary to decouple multi-center diabetes prediction and consider the prediction task of each center separately. Unfortunately, due to data insufficiency, it is still hard to train a high-performance model for every single center. Multi-task learning (MTL) (Zhang and Yang, 2017) aggregates knowledge from different tasks to train a high-performance model. Based on this consideration, we treat the prediction for each center as a separate task and build the model by leveraging the knowledge shared among the tasks to improve prediction results.

However, it is non-trivial to directly apply MTL on GBDT since most of the existing MTL methods are either feature-based or parameter-based (Ji and Ye, 2009; Zhou et al., 2011), but GBDT does not perform feature extraction and is a non-parametric model. One existing solution is Multi-Task Boosting (MT-B) (Chapelle et al., 2010), which simultaneously trains task-specific boosted trees with samples from each task, and task-common boosted trees with samples from all the tasks. The final prediction of one task is determined by combining the predictions of both task-specific boosted trees and task-common boosted trees. Although MT-B is easy to train and deploy, one significant drawback of MT-B is its high computational complexity, since independent trees for each task need to be learned besides the global one.

To avoid introducing additional computational complexity, it is meaningful to seek a more elegant way of addressing both challenges by combining MTL with GBDT. To begin with, our analysis shows that directly training one GBDT for all the tasks may have a negative impact on specific tasks after certain splits, since the task-wise difference is ignored. For better understanding, we decompose the gain into a summation over task gains for each task and adopt the task gain to measure how good the split condition is for each task at a node. We demonstrate that an overall positive gain does not guarantee that the task gains of all tasks are also positive, which means that the greedy node split strategy directly based on gain might be harmful to some tasks.

Figure 1. Workflow of TSGB for diabetes prediction.

To tackle this issue, inspired by MT-ET (Multi-Task ExtraTrees) (Simm et al., 2014), we propose TSGB (Task-wise Split Gradient Boosting Trees). TSGB introduces task-wise split according to task gain instead of traditional feature-wise split (Chen and Guestrin, 2016) to mitigate the negative task gain problem while still keeping the same order of computational complexity as GBDT (all the tasks share trees). Specifically, task-wise split separates tasks into two groups (see Fig. 3), i.e., tasks with positive and negative gains. In this way, some branches of the trees are only dedicated to a subset of tasks, which preserves the similarity between related tasks while alleviating the deficiency of sharing the knowledge between unrelated tasks.

The general workflow of applying our TSGB in diabetes prediction is illustrated in Fig. 1. Experiments on a multi-center diabetes prediction dataset and a multi-domain sentiment classification dataset show the effectiveness of the proposed TSGB, compared with not only the tree-based MTL models (Chapelle et al., 2010; Simm et al., 2014) but also several other state-of-the-art MTL algorithms (Ji and Ye, 2009; Zhou et al., 2011).

The predictive model has been deployed as an online diabetes risk assessment software to offer the patients key risk factors analysis and corresponding personalized health plan, helping early prevention and daily health management for healthy users.

To sum up, our contributions are mainly threefold:


  • We analyze GBDT in the MTL scenario and introduce task gain to measure how good the tree structure is for each task. To solve the negative task gain problem, we propose a novel algorithm TSGB that effectively extends GBDT to multi-task settings.

  • We obtain 0.42% to 3.20% average AUC performance improvement on the 21 tasks in our diabetes prediction dataset, compared with the state-of-the-art MTL algorithms. We also show that the proposed TSGB can be used in a wide range of non-medical MTL scenarios.

  • We deploy TSGB on a professional assessment software, Rui-Ning Diabetes Risk Assessment, for fast and convenient diabetes risk prediction. The software already has around users from different organizations, such as physical examination centers, human resource departments, and insurance institutes.

2. Preliminaries

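We provide a brief introduction to gradient boosting decision trees (Chen and Guestrin, 2016). Suppose we have a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ of $n$ samples with $m$-dimensional features $x_i \in \mathbb{R}^m$. The predicted label of GBDT given by the function $\phi$ is the sum of all the $K$ additive trees:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \tag{1}$$

where

$$\mathcal{F} = \{ f(x) = w_{q(x)} \}, \quad q: \mathbb{R}^m \to \{1, \dots, T\}, \quad w \in \mathbb{R}^T, \tag{2}$$

is the space of regression trees (CART (Breiman, 2017)), $q$ is the tree structure which maps a sample to the corresponding leaf index in the tree with $T$ leaves, and $w$ is the leaf weight. Each $f_k$ is an independent tree with its own structure $q$ and leaf weight $w$. The $K$ functions (trees) will be learned by minimizing the regularized objective (Chen and Guestrin, 2016):

$$\mathcal{L}(\phi) = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k), \tag{3}$$

where $l$ is the loss function (e.g., MSE, logloss), and $\Omega$ is a regularization term that penalizes the complexity of $f_k$ to alleviate the over-fitting problem. Specifically, $\Omega$ penalizes the number of leaves as well as the weight values (Johnson and Zhang, 2014; Chen and Guestrin, 2016):

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2. \tag{4}$$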
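As a concrete reading of the additive model and the regularization term, the minimal Python sketch below sums the outputs of a few trees and evaluates the penalty as $\gamma T + \frac{1}{2}\lambda\lVert w\rVert^2$, the form used by Chen and Guestrin (2016). The `Stump` class (a tree with a single split) and all function names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Stump:
    """Hypothetical one-split regression tree (illustration only)."""
    feature: int
    threshold: float
    weights: List[float]  # weights[0]: left leaf, weights[1]: right leaf

    def leaf_index(self, x: Sequence[float]) -> int:
        # Tree structure q: maps a sample to a leaf index.
        return 0 if x[self.feature] < self.threshold else 1

    def predict(self, x: Sequence[float]) -> float:
        # f(x) = w_{q(x)}
        return self.weights[self.leaf_index(x)]


def ensemble_predict(trees: List[Stump], x: Sequence[float]) -> float:
    """Raw ensemble score: the sum of the outputs of all trees."""
    return sum(tree.predict(x) for tree in trees)


def omega(tree: Stump, gamma: float, lam: float) -> float:
    """Complexity penalty: gamma * (number of leaves) + 0.5 * lam * ||w||^2."""
    return gamma * len(tree.weights) + 0.5 * lam * sum(w * w for w in tree.weights)
```

For a classification loss such as logloss, the raw score would still have to be passed through a sigmoid to obtain a probability.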


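Following the GBM (Friedman, 2001) framework, the $K$ functions in Eq. (1) are learned additively to minimize the objective in Eq. (3). With some transformation and simplification (see details in (Chen and Guestrin, 2016)), the $t$-th tree $f_t$ is learned by minimizing the following objective $\tilde{\mathcal{L}}^{(t)}$:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the first-order and second-order gradients of the loss function $l(\hat{y}_i^{(t-1)}, y_i)$ with respect to the previous prediction $\hat{y}_i^{(t-1)}$. Note that each sample will be mapped to a leaf via $f_t$, thus we define $I_j = \{ i \mid q(x_i) = j \}$ as the index set of training samples at leaf $j$, where $q$ is the corresponding tree structure. Recalling the definition of $\Omega$ in Eq. (4), we have:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T. \tag{5}$$

Denoting $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the optimal $w_j$ for leaf $j$ is easy to calculate since $\tilde{\mathcal{L}}^{(t)}$ is a single-variable quadratic function of $w_j$; thus the optimal $w_j^*$ is

$$w_j^* = -\frac{G_j}{H_j + \lambda}. \tag{6}$$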
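The optimal leaf weight of Eq. (6) can be illustrated with a small Python sketch. It assumes the log loss on a raw score, for which the first- and second-order gradients are $p - y$ and $p(1 - p)$ with $p$ the sigmoid of the score, and the usual closed form $-G / (H + \lambda)$; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def logloss_grad_hess(y: int, score: float) -> Tuple[float, float]:
    """Gradient and Hessian of the log loss with respect to the raw score."""
    p = 1.0 / (1.0 + math.exp(-score))  # predicted probability
    return p - y, p * (1.0 - p)


def optimal_leaf_weight(grads: List[float], hess: List[float], lam: float) -> float:
    """Minimizer of G*w + 0.5*(H + lam)*w^2, i.e. w* = -G / (H + lam)."""
    G, H = sum(grads), sum(hess)
    return -G / (H + lam)
```

With two positive samples at score 0 and `lam = 1.0`, the gradients are both -0.5 and the Hessians both 0.25, so the leaf receives a positive weight that pushes the prediction toward the positive class.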


Although the optimal value of the objective for a given tree structure can be calculated, enumerating all possible tree structures is too expensive. To make a trade-off between computational complexity and model performance, a greedy strategy is commonly used that constructs a tree starting from a single leaf (root) and iteratively splits a leaf into two child leaves (Breiman, 2017; Johnson and Zhang, 2014; Chen and Guestrin, 2016). The samples at a leaf are separated by a split condition defined as a threshold value of one feature, which is the so-called feature-wise split. Such a greedy search algorithm is included in most GBDT implementations (Ridgeway, 2007; Pedregosa et al., 2011; Chen and Guestrin, 2016); it selects the best split node by node and finally constructs a decision tree.

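Formally, to find the best split condition on an arbitrary leaf $p$, let $I_p$ be the sample set at leaf $p$, and let $I_L$ and $I_R$ be the samples of the left and right child leaves ($L$ and $R$) after a split. The corresponding negative loss change after the split, denoted as gain $\mathcal{G}$, is

$$\mathcal{G} = \frac{1}{2} \left( G_p w_p^* - G_L w_L^* - G_R w_R^* \right) - \gamma = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G_p^2}{H_p + \lambda} \right] - \gamma, \tag{7}$$

where $G$ and $H$ are the sums of $g_i$ and $h_i$ over the samples of the corresponding leaf, and $w_p^*$, $w_L^*$, and $w_R^*$ are the optimal weights (defined in Eq. (6)) for leaves $p$, $L$, and $R$, respectively. For each feature, an optimal split is found by enumerating all the possible candidate feature values and picking the one with the highest gain.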
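The feature-wise split search described above can be sketched as follows. This is a minimal illustration that assumes the standard XGBoost-style gain with regularization parameters `lam` and `gamma`, and exact enumeration over one feature; the helper names are hypothetical.

```python
from typing import List, Optional, Tuple


def leaf_score(G: float, H: float, lam: float) -> float:
    """G^2 / (H + lam), the structure score of a leaf at its optimal weight."""
    return G * G / (H + lam)


def split_gain(GL: float, HL: float, GR: float, HR: float,
               lam: float, gamma: float) -> float:
    """Loss reduction obtained by splitting a leaf into a left and a right child."""
    parent = leaf_score(GL + GR, HL + HR, lam)
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam) - parent) - gamma


def best_split(values: List[float], grads: List[float], hess: List[float],
               lam: float = 1.0, gamma: float = 0.0) -> Optional[Tuple[float, float]]:
    """Scan the sorted values of one feature; return (threshold, gain) or None."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)
    GL = HL = 0.0
    best: Optional[Tuple[float, float]] = None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += grads[i]  # accumulate statistics of the left child
        HL += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold can separate equal feature values
        gain = split_gain(GL, HL, G - GL, H - HL, lam, gamma)
        if best is None or gain > best[1]:
            best = (0.5 * (values[i] + values[nxt]), gain)
    return best
```

Running the scan on every feature and keeping the overall maximum gives the greedy split for the leaf.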

3. Negative Task Gain Problem

We find that the tree structure learned by GBDT can be harmful to a subset of tasks when the MTL technique is directly applied. When training vanilla GBDT on multi-task data, where samples from different tasks may be far from identically distributed (e.g., multi-center diabetes dataset), the objective is to improve its average performance over all the tasks against individual learning. To be specific, since the objective is defined on all the training instances, GBDT will pick features that are generally “good” for all the tasks in the feature-wise splitting process of growing a single tree.

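To illustrate this finding, we need to analyze the tree structure measurement at the task level. Assume there are $N$ learning tasks in the MTL scenario, with the whole dataset $\mathcal{D}$ divided into $N$ parts ($\mathcal{D} = \mathcal{D}^1 \cup \dots \cup \mathcal{D}^N$). For each task $\tau$, denote the samples belonging to it as $I^\tau$, thus we have $I = \bigcup_{\tau=1}^{N} I^\tau$. We now introduce a new metric, task gain ($\mathcal{G}^\tau$), to measure how good a feature-wise split is for each task.

Considering all the tasks explicitly at each leaf, the learning objective at the $t$-th step in Eqs. (3) and (5) can be rewritten as

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \sum_{\tau=1}^{N} \Big( G_j^\tau w_j + \frac{1}{2} H_j^\tau w_j^2 \Big) + \frac{1}{2} \lambda w_j^2 \right] + \gamma T, \tag{8}$$

where $G_j^\tau = \sum_{i \in I_j^\tau} g_i$ and $H_j^\tau = \sum_{i \in I_j^\tau} h_i$. Then, according to the objective above, we can decompose $\mathcal{G}$ in Eq. (7) by task as

$$\mathcal{G} = \sum_{\tau=1}^{N} \mathcal{G}^\tau + \frac{\lambda}{2} \left( w_p^{*2} - w_L^{*2} - w_R^{*2} \right) - \gamma, \qquad \mathcal{G}^\tau = \ell_p^\tau(w_p^*) - \ell_L^\tau(w_L^*) - \ell_R^\tau(w_R^*), \qquad \ell_j^\tau(w) = \sum_{i \in I_j^\tau} \Big( g_i w + \frac{1}{2} h_i w^2 \Big), \tag{9}$$

where $I_p^\tau$ denotes the set of samples from task $\tau$ in leaf $p$, as well as $I_L^\tau$ and $I_R^\tau$ for the child leaves.

With the above decomposition of the original gain $\mathcal{G}$ of a feature-wise split at a leaf in GBDT, we obtain the task gain $\mathcal{G}^\tau$ for each task. The task gain $\mathcal{G}^\tau$ represents how good the specific feature-wise split at this leaf is for task $\tau$. The larger $\mathcal{G}^\tau$ is, the better the split at this leaf is for task $\tau$. When $\mathcal{G}^\tau$ is negative, the feature-wise split at this leaf actually increases the part of the objective loss contributed by the samples of task $\tau$ at this leaf, $\ell_p^\tau$, which is the opposite of the optimization objective.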
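A per-task view of a split can be sketched in Python as below. The sketch rests on an assumption that should be stated clearly: the task gain of a task is taken to be the reduction of the part of the second-order objective contributed by that task's samples, evaluated at the leaf weights shared by all tasks, and the regularization terms are not attributed to any task. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float, float, bool]  # (task id, g, h, goes to the left child)


def _weight(samples: List[Sample], lam: float) -> float:
    """Optimal weight shared by all tasks at a leaf: -G / (H + lam)."""
    G = sum(g for _, g, _, _ in samples)
    H = sum(h for _, _, h, _ in samples)
    return -G / (H + lam)


def _task_loss(samples: List[Sample], w: float) -> Dict[int, float]:
    """Per-task part of the leaf objective: sum of g*w + 0.5*h*w^2 (assumed form)."""
    loss: Dict[int, float] = {}
    for task, g, h, _ in samples:
        loss[task] = loss.get(task, 0.0) + g * w + 0.5 * h * w * w
    return loss


def task_gains(samples: List[Sample], lam: float) -> Dict[int, float]:
    """Loss reduction of each task caused by one feature-wise split of a leaf."""
    left = [s for s in samples if s[3]]
    right = [s for s in samples if not s[3]]
    before = _task_loss(samples, _weight(samples, lam))
    after_left = _task_loss(left, _weight(left, lam))
    after_right = _task_loss(right, _weight(right, lam))
    return {t: before[t] - after_left.get(t, 0.0) - after_right.get(t, 0.0)
            for t in before}
```

In a toy node where task 0 is split cleanly into positives (left) and negatives (right) while most positives of task 1 are dragged to the right child, task 0 obtains a positive task gain and task 1 a negative one, even though the overall gain of the split is positive.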

Figure 2. Distribution of the negative task gain ratio of non-leaf nodes, with logarithm, when training traditional GBDT on the multi-center diabetes dataset. A spot in darker blue has more nodes.

In GBDT, we search over all the feature-wise split conditions and select the one with the highest gain at a leaf. Considering the decomposition in Eq. (9), we can conclude that there is no guarantee that the optimal feature-wise split is a good split for all the tasks. Formally, according to the greedy algorithm for split finding in GBDT, we have

$$\mathcal{G} > 0$$

at a leaf, but unfortunately,

$$\mathcal{G}^\tau < 0 \quad \text{may still hold for some tasks } \tau.$$

We dub this observation the negative task gain problem. For the tasks with task gain $\mathcal{G}^\tau < 0$, although the feature-wise split is good in general ($\mathcal{G} > 0$), the newly constructed tree structure is even worse. Empirically, we find that about 96.47% of the nodes in GBDT have negative task gains when trained on our diabetes dataset.

To get a better measurement for “how good is a feature-wise split in multi-task settings”, we introduce the negative task gain ratio

$$R_{neg} = \frac{\sum_{\tau:\, \mathcal{G}^\tau < 0} |I_p^\tau|}{|I_p|} \tag{10}$$

to indicate the severity of the negative task gain problem, where the numerator is the number of samples at node $p$ that belong to tasks with negative task gains and $|I_p|$ is the total number of samples at node $p$. We plot the distribution of $R_{neg}$ in Fig. 2 and find the following. (i) A large number of nodes have a non-negligible $R_{neg}$, which means the greedy search algorithm in GBDT is far from optimal in multi-task settings. (ii) Nodes with more samples are more likely to have a larger $R_{neg}$, which means that in the early stage of training, nodes closer to the root are more likely to find a harmful feature-wise split. Different tasks sharing the same harmful tree structure will, of course, suffer a performance decline. (iii) In 11.24% of the nodes, $R_{neg}$ is so high that a minority of the tasks dominates the feature-wise split, and the other tasks would obtain better results if the split were not performed.
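The negative task gain ratio can be computed directly from per-task statistics of a node. The following Python sketch assumes the task gains and the per-task sample counts at the node are already available as dictionaries; the function and argument names are hypothetical.

```python
from typing import Dict


def negative_task_gain_ratio(task_gain: Dict[str, float],
                             task_size: Dict[str, int]) -> float:
    """Fraction of the samples at a node that belong to tasks with negative task gain."""
    total = sum(task_size.values())
    if total == 0:
        raise ValueError("the node contains no samples")
    negative = sum(n for task, n in task_size.items() if task_gain[task] < 0)
    return negative / total
```

For example, with task gains `{"a": 2.2, "b": -0.76, "c": -0.1}` and sample counts `{"a": 8, "b": 4, "c": 4}`, half of the samples at the node belong to tasks that are hurt by the split, so the ratio is 0.5.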

Figure 3. (a) Illustration of negative task gain problem caused by a traditional feature-wise split, (b) while proposed TSGB can handle such problem with a task-wise split.

To better illustrate this problem, we show a simple but common case found in GBDT in Fig. 3 (a). At a node, the samples of some tasks are already pure (all positive), while the positive and negative samples of one remaining task are still mixed. The optimal split condition found here successfully divides the samples of the mixed task into two branches: the right branch has most of its negative samples, while the left branch has most of its positive ones. Unfortunately, some samples of the pure tasks are also divided into the right branch, although they are positive samples. In such a case, the overall gain and the task gain of the mixed task are positive, but the rest of the tasks are left with negative task gains.

Input: $\mathcal{D}$, training data from $N$ tasks
Input: $K$, number of boosted trees
Input: $R$, maximum ratio of samples with negative task gain
initialize the model
for $k = 1$ to $K$ do
    calculate the current predictions $\hat{y}_i$ by Eq. (1)
    while the split stop criterion is not met do
        find the best feature-wise split rule greedily at a leaf
        calculate the corresponding task gains defined in Eq. (9)
        if $R_{neg} > R$ then
            split the samples of tasks with negative task gain to the left branch
            split the samples of the remaining tasks to the right branch
        else
            split the samples following the feature-wise split rule
        end if
    end while
end for
Algorithm 1 Task-wise Split Gradient Boosting Trees

4. Task-wise Split Gradient Boosting Trees

The ultimate objective of MTL is to improve the model’s performance on all the tasks, while the aforementioned negative task gain problem makes the traditional GBDT unsuitable for MTL. To make full use of the data of all tasks through MTL and extend GBDT to multi-task settings, we propose Task-wise Split Gradient Boosting Trees (TSGB). The key idea of TSGB is to avoid the severe negative task gain problem by conducting a task-wise split instead of a feature-wise split at nodes with a high negative task gain ratio $R_{neg}$.

We follow the main procedure of GBDT. However, when the best feature-wise split condition is found at an arbitrary leaf, the task gain of each task is calculated. Since most of the nodes in GBDT have the negative task gain problem (as discussed with Fig. 2), we handle this problem by introducing the task-wise split.

If the negative task gain ratio $R_{neg}$ of a node, as defined in Eq. (10), is higher than a threshold ratio $R$ (i.e., $R_{neg} > R$ holds), then instead of splitting the leaf feature-wise using the found split condition, TSGB performs a task-wise split of the samples: it sends the samples of tasks with negative task gain to the left branch and those with positive task gain to the right branch. Alg. 1 gives pseudo-code for TSGB, and Fig. 3 (b) provides an illustration of the proposed task-wise split.

The threshold ratio $R$ is treated as a hyperparameter in practice and is set to different values for different MTL datasets.
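The decision between the two kinds of split can be sketched as follows. This is a minimal Python illustration of the rule described above, assuming that the task gains and per-task sample counts at the node are already computed; `choose_split` and its arguments are hypothetical names, and sending tasks with zero task gain to the right branch is an assumption of the sketch.

```python
from typing import Dict, Hashable, List, Tuple


def choose_split(task_gain: Dict[Hashable, float],
                 task_size: Dict[Hashable, int],
                 threshold: float) -> Tuple[str, List[Hashable], List[Hashable]]:
    """Return the split type and, for a task-wise split, the (left, right) task groups."""
    total = sum(task_size.values())
    negative = [t for t in task_size if task_gain[t] < 0]
    ratio = sum(task_size[t] for t in negative) / total  # negative task gain ratio
    if ratio > threshold:
        # Task-wise split: tasks hurt by the feature-wise split go left, the rest go right.
        rest = [t for t in task_size if task_gain[t] >= 0]
        return "task-wise", negative, rest
    # Otherwise keep the best feature-wise split that was found.
    return "feature-wise", [], []
```

With task gains `{"a": 2.2, "b": -0.76}` and sizes `{"a": 8, "b": 4}`, one third of the samples have a negative task gain, so a threshold of 0.2 triggers a task-wise split while a threshold of 0.5 keeps the feature-wise split.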

A key characteristic of TSGB is that it is oriented toward task-level objectives while training all the tasks in the same trees in a homogeneous MTL setting, which makes TSGB easy to train and elegant in MTL. The empirical results also show the effectiveness of TSGB. Previous works either ignore the task-specific objective by simply splitting the tasks with pre-defined task features on randomly selected leaf nodes (Simm et al., 2014) or train both task-common trees and task-specific trees at the same time (Chapelle et al., 2010). The former cannot make full use of the correlation between the data of different tasks, while the latter always yields a large, redundant model, since a separate forest is needed for every task in addition to the task-common one.

What if we replaced the task-wise split by selecting a sub-optimal feature-wise split condition at a node with a lower negative task gain ratio, so that more of the tasks have positive task gain? We argue against this for two reasons. (i) The primary cause of the negative task gain problem is the difference between the feature distributions of different tasks. This problem cannot be solved by the traditional greedy feature-wise split search, since its underlying assumption is an identical data distribution; there is an irreparable gap between the assumptions of the GBDT feature-wise split and MTL. (ii) The computational complexity of calculating task gains under sub-optimal feature-wise split conditions is much higher. The per-task statistics needed for this are not provided by the original GBDT, so calculating the task gain under a sub-optimal feature-wise split requires additional computation. The problem becomes even worse given that searching for a sub-optimal feature-wise split brings extra complexity of its own.

Accordingly, we only perform a task-wise split when calculated task gain under the optimal feature-wise split condition indicates that the negative task gain problem meets a certain condition.

5. Experiments

In this section, we empirically study the performance of the proposed TSGB. We compare TSGB with several state-of-the-art tree-based and other MTL algorithms on our diabetes dataset and the multi-domain sentiment dataset. To get deeper insights into how task-wise split helps improve the performance, we also discuss a case study on a specific task.

5.1. Dataset

We first evaluate TSGB on a multi-center diabetes dataset provided by the China Cardiometabolic Disease and Cancer Cohort (4C) Study, which was approved by the Medical Ethics Committee of Ruijin Hospital, Shanghai Jiao Tong University. The dataset is collected from the general population recruited from 21 medical centers in different geographical regions of China, including the baseline data from 2011 to 2012 and the follow-up data from 2014 to 2016 of 170,240 participants. Each center contributes the data with the size ranging from 2,299 to 7,871. At baseline and follow-up visits, standard questionnaires were used to collect demographic characteristics, lifestyle, dietary factors, and medical history. We finally obtained 100,000 samples from the 21 different medical centers for TSGB evaluation with data cleaning and pre-processing. Each of the samples retains the most important numerical and categorical features of 50 dimensions.

To further validate the effectiveness of TSGB in a non-medical scenario, we also conduct an empirical study on a commonly used MTL benchmark dataset, the multi-domain sentiment dataset (Blitzer et al., 2007), available at https://www.cs.jhu.edu/~mdredze/datasets/sentiment/. This dataset is a multi-domain sentiment classification dataset, containing positive and negative reviews from four different (product) domains from Amazon.com. The four product domains are books, DVDs, electronics, and kitchen appliances. Following (Chen et al., 2012), we use the 5000 most frequent terms of unigrams and bigrams as the input.

5.2. Baselines

All the compared models are listed as follows.


  • ST-GB (Single Task GBDT) trains a GBDT model for each task separately.

  • GBDT (Sec. 2) trains a GBDT model on the whole dataset of all tasks.

  • MT-ET (Multi-Task ExtraTrees) (Simm et al., 2014) is a tree-based ensemble multi-task learning method based on Extremely Randomized Trees.

  • MT-TNR (Ji and Ye, 2009) is a linear MTL model with Trace Norm Regularization.

  • MT-B (Multi-Task Boosting) (Chapelle et al., 2010) is an MTL algorithm with boosted trees. It trains a task-common forest on all tasks and a task-specific boosted forest on each task separately. The final output for a sample of a given task combines the predictions of the task-common forest and of the forest specific to that task.

  • CMTL (Zhou et al., 2011) is a clustered MTL method that assumes the tasks may exhibit a more sophisticated group structure.

  • The random variant of TSGB, also proposed by us, does not use a threshold ratio; instead, it decides whether to conduct a task-wise split with a fixed probability. It picks a node with this probability, sorts the tasks, and then splits the samples task-wise in the same way as TSGB.

  • TSGB is the novel method proposed in this paper. It decides whether to perform a task-wise split by comparing a threshold ratio with the negative task gain ratio of the current node. Then it separates samples task-wise according to the sign of their task gains.

For a fair comparison, all the boosting tree models used in our experiments are implemented based on XGBoost (Chen and Guestrin, 2016), which is one of the most efficient and widely used implementations of GBDT with high performance. We make TSGB publicly available (reproducible code: https://github.com/felixwzh/TSGB) to encourage further research in tree-based MTL.

5.3. Evaluation Results

We randomly generate training, validation, and testing sets at a ratio of 3:1:1. The proportion of positive and negative samples can be very different across tasks, so accuracy, recall, and precision are not suitable indicators of model performance. AUC (Area Under the ROC Curve), in contrast, can be directly compared between tasks with different positive ratios. We therefore take it as the primary indicator and report the average AUC over 10 random seeds in the experiment. For each algorithm, the best hyperparameters adopted are provided in Appendix A.
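The evaluation metric can be illustrated with a short, dependency-free Python sketch. It computes the AUC as the probability that a positive sample is scored above a negative one (ties count one half) and then takes an unweighted mean over tasks; the unweighted averaging and all names are assumptions of this sketch, and the quadratic pairwise loop is only meant for small examples.

```python
from typing import Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_task_auc(per_task: Dict[str, Tuple[List[int], List[float]]]) -> float:
    """Unweighted mean of the per-task AUC values."""
    return sum(auc(y, s) for y, s in per_task.values()) / len(per_task)
```

Because the AUC depends only on the ranking of positives against negatives, it stays comparable between a task with few positive cases and a task with many.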

5.3.1. Multi-center Diabetes Prediction

The experimental results are presented in Tab. 1.

AVG 77.98
Table 1. AUC Scores Under Multi-center Diabetes Dataset.

The main conclusions can be summarized as follows. (i) ST-GB achieves competitive performance compared to the tree-based MTL models and performs much better than the linear MTL models CMTL and MT-TNR. (ii) GBDT, which trains on samples from all the tasks together, boosts the performance of 15 tasks compared with ST-GB, but ST-GB still outperforms GBDT on 5 tasks. This phenomenon is called negative transfer (Ge et al., 2014) in MTL, and we attribute it mainly to the negative task gain problem analyzed in Sec. 3. (iii) Although task-wise split was first proposed in MT-ET (Simm et al., 2014), MT-ET does not achieve satisfactory performance on the multi-center diabetes data. The task-wise split criterion in MT-ET is an alternative to one-hot encoding of the task feature, which means it separates the samples into two random sets of tasks instead of two specific sets. It is not designed for the negative task gain problem. However, the competitive performance of TSGB indicates that splitting the samples task-wise is promising. (iv) TSGB outperforms the baseline models on almost all the tasks. Specifically, it boosts the performance on 17 of 21 tasks compared to all the other models. TSGB is outperformed by ST-GB on only 1 task, with a smaller gap than those between GBDT, MT-ET, the random variant of TSGB, and ST-GB. This indicates that our analysis of negative task gain is reasonable and our task-wise split mechanism is effective. In conclusion, the results show that TSGB is effective in handling data heterogeneity and insufficiency.

5.3.2. Multi-domain Sentiment Classification

The experimental procedures follow the same setting, and we show the main results in Tab. 2.

Books 94.37
DVDs 94.39
Electr. 96.06
K. App. 97.20
AVG 95.51
Table 2. AUC Scores Under Multi-domain Sentiment dataset.

From the AUC scores obtained on the whole sentiment dataset, TSGB outperforms all the baseline models. Interestingly, the random variant of TSGB reaches a good performance, second only to TSGB, and outperforms MT-ET significantly. We believe the reason is as follows. Although the original task-wise split used in MT-ET boosts performance by introducing additional randomness for bagging, this kind of split is an improved realization of encoding the task as an additional feature dimension: it separates samples into two sets of tasks by a randomly selected task-related value, which cannot ensure a significant reduction of the negative task gain ratio. Different from the random many-vs-many split used in MT-ET, we propose a “ones-vs-rest” task-wise split in TSGB, which targets the negative task gain problem directly and reduces the negative task gain ratio more effectively. The ones-vs-rest task-wise split separates tasks with negative task gains from those with positive ones; it is therefore much more reasonable than the original version according to our theoretical analysis and leads to better performance in the MTL setting. The analysis also explains why TSGB performs better than its random variant: TSGB uses the negative task gain ratio as the criterion for a task-wise split instead of a constant probability that controls whether a task-wise split is performed at a given decision node.

5.3.3. Robustness to Data Sparsity

We further study the impact of training data sparsity. We compare TSGB with the best two baselines, the random variant of TSGB and GBDT, as well as the original MT-ET, on our multi-center diabetes dataset with different training data volumes. We subsample 10%, 25%, and 50% of the training data of each task and conduct the experiments with the same procedure as before. In Fig. 4, we plot the average AUC of these models on the testing set. It shows that the average AUC that TSGB reaches with only 25% of the training data is approached, but still not matched, by GBDT and the random variant of TSGB using 100% of the training data. MT-ET is the most sensitive to data volume, with its performance fluctuating over a large interval.

The observations can be summarized as follows: (i) TSGB reaches a higher average AUC with less training data, which shows that TSGB is robust to the data sparsity issue. (ii) The performances of TSGB and its random variant are far better than that of the original MT-ET. (iii) TSGB outperforms its random variant and GBDT on most tasks (16 to 18 out of 21 tasks) at all the training data volumes. With these three observations, we conclude that task-wise split is helpful in our MTL scenario and that conducting task-wise split with consideration of the proposed task gain further improves the tree-based model’s performance.
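The per-task subsampling used in this study can be sketched as follows. The sketch assumes that the same fraction is drawn independently from every task, without replacement, and that at least one sample per task is kept; the function name, the seed handling, and the rounding rule are hypothetical choices.

```python
import random
from typing import Dict, List, Sequence


def subsample_per_task(task_of: Sequence[int], fraction: float, seed: int = 0) -> List[int]:
    """Return sorted sample indices that keep `fraction` of the samples of every task."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    rng = random.Random(seed)
    by_task: Dict[int, List[int]] = {}
    for idx, task in enumerate(task_of):
        by_task.setdefault(task, []).append(idx)
    kept: List[int] = []
    for indices in by_task.values():
        k = max(1, round(fraction * len(indices)))  # keep at least one sample per task
        kept.extend(rng.sample(indices, k))
    return sorted(kept)
```

Sampling inside each task, rather than over the pooled data, keeps the relative sizes of the tasks unchanged at every data volume.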

Figure 4. The performance of TSGB with different training data volumes; TSGB consistently outperforms MT-ET, GBDT, and the random variant of TSGB.
Figure 5. The performance of TSGB with different threshold ratios, compared with MT-ET, GBDT, and the random variant of TSGB on different training data volumes.

5.3.4. Hyperparameter Study

We introduce a hyperparameter $R$ as the threshold ratio to determine when to split a node task-wise. We vary $R$ and plot the corresponding average AUC over all the tasks in Fig. 5 to see the influence of the threshold ratio on TSGB’s performance. The experimental results at and data volume are very similar to the result at data volume. When the training data is sparse (10%), the performance difference between TSGB and the other two baselines is small. When there is more training data (25%, 50%, and 100%), TSGB outperforms GBDT and our random variant consistently. We also find that TSGB performs best when $R$ is set low but not zero, as in Fig. 5(b). If $R$ is too high, we conduct task-wise splits only in a few nodes where the negative task gain problem is severe and fail to handle the problem in many other nodes. On the contrary, if we set $R$ too low, nearly all the nodes will be split task-wise (96.47% of the nodes, as mentioned in Sec. 3), and only a few nodes can be used to optimize the learning objective. Thus, a relatively low threshold ratio leads to the best performance.

5.4. Case Study

To get deeper insights into how the negative task gain problem influences GBDT in MTL, we study one specific task, task-21, with imbalanced down-sampled training data. More specifically, we randomly choose 10% of the samples (0.5% positive and 99.5% negative) from task-21, while for the other 20 tasks, we randomly select 10% of the samples (50% negative and 50% positive) as the training data. One additional reason we build task-21’s training data with very sparse positive samples is that the positive cases of some diseases might, for many reasons, be relatively rare in practice. We want to see whether TSGB can handle this condition and outperform its random variant, GBDT, and ST-GB. In addition, we introduce TSGB-4 to see whether a task-wise split is effective. TSGB-4 means that we use GBDT to train the first three decision trees and switch to TSGB from the fourth tree on.

Figure 6. AUC on task-21’s validation set in discussion.

From Fig. 6 we can see that (i) when the training data is extremely imbalanced, MTL helps boost model performance: all the MTL models have a large AUC lift compared to ST-GB. (ii) TSGB obtains the highest AUC, which indicates that the proposed TSGB better leverages the training data of all the tasks. (iii) Although directly using GBDT brings about 10% AUC improvement, the abnormal AUC curve of GBDT over the first 6 trees (below 0.5, which is worse than a random classifier) reveals a problem with GBDT. (iv) TSGB-4 has exactly the same performance as GBDT for the first three trees, but once trees are constructed by TSGB from the fourth tree onward, its performance improves significantly and eventually surpasses GBDT.

We also compare the behavior of the fourth decision tree of GBDT and TSGB-4 to see what happens during training, and have the following findings. Since we set threshold ratio , TSGB-4 will conduct a task-wise split instead of the best feature-wise split found when constructing the fourth tree if . Therefore, having observed the negative task gain problem, TSGB converts some feature-wise splits in GBDT into task-wise splits and lets task-21 benefit from other tasks’ training samples. As a result, TSGB-4 boosts the AUC on task-21 at the fourth tree and achieves better performance than GBDT when it converges (Fig. 6).

6. Application: online diabetes risk assessment software

Rui-Ning Diabetes Risk Assessment is a professional diabetes prediction platform developed by 4Paradigm Inc. Based on the proposed TSGB, it predicts the risk that currently healthy people will develop type-2 diabetes within the coming 3 years. We normalize the model output probability to a risk score of 1-100. To make the score easier for users to understand, we sort it into 4 intervals: 1-30 for good, 31-60 for risk, 61-90 for high-risk, and 91-100 for dangerous. Beyond that, we also provide an analysis of key risk factors and corresponding personalized health tips to help with the early prevention of type-2 diabetes and to guide daily health management.
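The score intervals above can be illustrated with a small sketch. The interval boundaries come from the text; the linear scaling with clamping in `risk_score` is an assumption, since the exact normalization is not specified here, and both function names are hypothetical.

```python
def risk_score(probability):
    """Map a probability in [0, 1] to an integer score in 1..100.

    Assumption: linear scaling with clamping. The text only states that
    the output probability is normalized to 1-100.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return max(1, min(100, round(probability * 100)))


def risk_group(score):
    """Return the risk group for a score, using the four stated intervals."""
    if not 1 <= score <= 100:
        raise ValueError("score must be in 1..100")
    if score <= 30:
        return "good"
    if score <= 60:
        return "risk"
    if score <= 90:
        return "high-risk"
    return "dangerous"


group = risk_group(risk_score(0.42))
```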

Figure 7. Demonstration of Rui-Ning Diabetes Risk Assessment software workflow.

To start a test, users fill in a questionnaire about their living habits and several medical indicators, which can be done within 1 minute (Fig. 7). We would like to emphasize that, for a rapid prediction, it is impractical to ask users to provide all 50 feature dimensions used in the training set. Therefore, we select 13 of the 50 dimensions, which are the most informative and the easiest to obtain in medical testing, as the content of the questionnaire (details in Appendix C). Because tree models handle missing features natively, the model still performs well with this reduced feature set.

To evaluate and analyze the performance of Rui-Ning Diabetes Risk Assessment, we recruited another 880 healthy volunteers from different regions of China to complete the assessment. We then followed up with the volunteers three years later to record whether they had developed diabetes, which formed the testing data.

As is well known, the classification threshold is important when deploying a binary classification model in a real-world scenario. In the healthcare domain, the threshold is usually set according to specific needs. For example, tumor screening aims to catch all suspected positive cases, so a tumor screening model focuses on high sensitivity, which leads to high screening costs and relatively low precision. In the field of diabetes, however, this is not the case. For a large-scale population, we need to consider the actual economic cost, so we must improve specificity at a given sensitivity to reduce that cost. To determine the best threshold for Rui-Ning Diabetes Risk Assessment, we plot the P-R curve in Figure 8(a). We can see that, when 42 is taken as the threshold (a risk score greater than 42 is predicted as a positive sample), the model performs well. Based on this threshold, we evaluate the software with multiple metrics, as shown in Table 3.

Accuracy  Precision  Recall  F1-score  AUC
0.7508    0.6040     0.6354  0.6193    0.7830
Table 3. Evaluation of deployed software.
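As a sanity check on how the threshold-dependent metrics in Table 3 relate to each other, the sketch below applies the stated decision rule (score greater than 42 is positive) and computes the standard metrics from confusion-matrix counts. The function names and the example counts are hypothetical; only the threshold and the reported precision and recall come from the text.

```python
def predict_positive(score, threshold=42):
    """Decision rule from the text: scores above the threshold are positive."""
    return score > threshold


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# F1 implied by the precision and recall reported in Table 3.
implied_f1 = f1_score(0.6040, 0.6354)

# Made-up counts, only to show the call.
example = metrics(tp=3, fp=1, tn=4, fn=2)
```

With the reported precision and recall, `implied_f1` rounds to the F1-score listed in Table 3, so the three values are mutually consistent. AUC does not depend on the threshold and is therefore not covered by this check.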

We then compared the performance of our deployed software with an existing diabetes risk prediction system. To the best of our knowledge, there is no other open-source software that provides diabetes risk assessment in industry, so we use CDS (Zhou et al., 2013), a traditional rule-based diabetes risk scoring method that is a regionally authoritative method recommended by the Chinese Medical Association, as the main object of comparison. We plot the ROC curves of our deployment and CDS in Figure 8(b). Rui-Ning significantly improves the AUC from 0.6180 to 0.7963. The slight improvement in sensitivity ensures the detection rate of diabetes, while the large improvement in specificity can significantly reduce the cost of screening, which provides a practical and effective prevention and control program for China, a developing country with tight average medical expenses.

Figure 8. (a) P-R curve for threshold determination, (b) ROC comparison between Rui-Ning and CDS.
Figure 9. Statistical analysis of user data. (a) Gender distribution, (b) Age distribution, (c) Prediction risk score, (d) Positive rates in different risk groups.

Rui-Ning Diabetes Risk Assessment aims to provide an efficient and low-cost scheme for large-scale diabetes screening. It has been used in different organizations such as physical examination centers, human resource departments, and insurance institutes. The software has been used by more than people since its deployment. The distribution of all users and their risk scores is illustrated in Figure 9(a,b,c). We also examine the positive rates in different risk groups; the result is shown in Figure 9(d). of the people were diagnosed with diabetes three years after they were predicted to be in the “Good” risk group, for “Risk”, for “High-risk”, and for “Dangerous”, which further verifies the effectiveness of the deployed software.

7. Related Works

Multi-Task Learning. Multi-task learning (MTL) (Zhang and Yang, 2017) aims to make use of data obtained from multiple related tasks to improve the model performance on all the tasks. MTL helps in many domains under plenty of situations, especially when the amount of data for one learning task is limited, or the cost of collecting data is high.

Considering the types of learning tasks, MTL can be categorized into two settings. Homogeneous MTL deals with learning tasks that share the same data space but have different distributions (Kumar and Daumé III, 2012), which is the case discussed in this paper, while heterogeneous MTL handles learning tasks of various types (e.g., classification, regression) (Jin et al., 2015).

MTL can also be categorized by the form of knowledge sharing among the tasks. Feature-based MTL learns common features among tasks (Argyriou et al., 2007; Maurer et al., 2013), while parameter-based MTL usually leverages the model parameters of one task to benefit the models of other tasks (Ando and Zhang, 2005; Evgeniou and Pontil, 2004).

Tree-based Model in MTL. A few previous works have focused on tree-based models related to MTL. In (Goussies et al., 2014), a mixed information gain is defined to leverage the knowledge from multiple tasks to train a better learner for a target domain; this is therefore not an MTL algorithm but a transfer learning algorithm. Such multiple-source domain adaptation algorithms are not suitable for our multi-task diabetes prediction task because transfer learning aims at using knowledge from multiple source domains to improve the performance on a target domain, whereas we aim at improving performance on every task rather than on a single target task. We could of course train several multiple-source transfer learning models, one per task, to improve the performance of every single task, but such an approach is computationally expensive and inelegant. An MTL algorithm with boosted decision trees is proposed in (Faddoul et al., 2012), but it is designed for heterogeneous MTL, i.e., the tasks share the same data but have a different label set for each task.

The two previous works that address the same MTL problem studied in this paper are Multi-task boosting (MT-B) (Chapelle et al., 2010) and Multi-Task ExtraTrees (MT-ET) (Simm et al., 2014). The main drawback of MT-B is its computational complexity since it trains forests, and despite being simple and direct, it yields low empirical performance. In MT-ET, the authors propose to perform a random task split with some predefined probability. The task split is conducted by sorting the tasks by their ratio of positive samples at the node and finding the split threshold with the highest information gain. The disadvantage of MT-ET is that the decision to perform a task split is made randomly and is thus not well founded, since it fails to leverage any task-related information; we have verified this through the experiments with TSGB.
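The MT-ET style task split described above (sort tasks by positive ratio, then pick the cut with the highest information gain) can be sketched as follows. This is an illustration rather than the MT-ET implementation: the entropy-based definition of information gain, the function names, and the example counts are assumptions.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a node with pos/neg sample counts."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_task_split(task_counts):
    """Find the task split with the highest information gain.

    task_counts: dict mapping task id -> (n_pos, n_neg) at the current node;
    every task is assumed to have at least one sample.
    Returns (left_tasks, right_tasks, gain), or None with fewer than 2 tasks.
    """
    if len(task_counts) < 2:
        return None
    # Sort tasks by their ratio of positive samples at this node.
    ordered = sorted(
        task_counts, key=lambda t: task_counts[t][0] / sum(task_counts[t])
    )
    total_pos = sum(p for p, _ in task_counts.values())
    total_neg = sum(n for _, n in task_counts.values())
    n_all = total_pos + total_neg
    parent = entropy(total_pos, total_neg)

    best = None
    left_pos = left_neg = 0
    for cut in range(1, len(ordered)):
        p, n = task_counts[ordered[cut - 1]]
        left_pos += p
        left_neg += n
        n_left = left_pos + left_neg
        children = (n_left / n_all) * entropy(left_pos, left_neg) + (
            (n_all - n_left) / n_all
        ) * entropy(total_pos - left_pos, total_neg - left_neg)
        gain = parent - children
        if best is None or gain > best[2]:
            best = (ordered[:cut], ordered[cut:], gain)
    return best


# Made-up per-task (positive, negative) counts at one node.
counts = {"A": (1, 9), "B": (2, 8), "C": (8, 2), "D": (9, 1)}
left, right, gain = best_task_split(counts)
```

Note that this only covers how the split is chosen once MT-ET decides to split by task; the random decision of whether to do so is the part criticized in the text.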

8. Conclusion

In this paper, we proposed the novel Task-wise Split Gradient Boosting Trees (TSGB) model, which extends GBDT to multi-task settings to better leverage data collected from different medical centers. TSGB outperforms several strong baseline models and achieves the best performance in our diabetes prediction task. Moreover, experiments on a multi-domain sentiment dataset also show the effectiveness of TSGB in general MTL tasks. The discussion further supports our analysis of task gain and the negative task gain problem and provides insights into tree-based models in MTL. We deployed and productized our online diabetes prediction software, Rui-Ning Diabetes Risk Assessment, based on the proposed TSGB. The online software has been widely used by a considerable number of people from different organizations. We have also published our code of TSGB to support reproducibility. A current limitation of TSGB is that when to split task-wise is controlled by a predefined threshold ratio hyperparameter, which needs proper setting. For future work, we will further study the negative transfer problem and hope to explore the possibility of deciding when to perform a task-wise split via reinforcement learning.

The project is supported by the National Key Research and Development Program of the Ministry of Science and Technology of the People’s Republic of China (2016YFC1305600, 2018YFC1311800), the National Natural Science Foundation of China (82070880, 81771937, 61772333), and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).


  • R. K. Ando and T. Zhang (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR. Cited by: §7.
  • A. Argyriou, T. Evgeniou, and M. Pontil (2007) Multi-task feature learning. In NIPS, Cited by: §7.
  • J. Blitzer, M. Dredze, and F. Pereira (2007) Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In ACL, Cited by: §5.1.
  • J. L. Breault, C. R. Goodall, and P. J. Fos (2002) Data mining a diabetic data warehouse. Artificial intelligence in medicine. Cited by: §1.
  • L. Breiman (2017) Classification and regression trees. Routledge. Cited by: §1, §2, §2.
  • O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, and B. Tseng (2010) Multi-task learning for boosting with application to web search ranking. In KDD, Cited by: §1, §1, §4, 5th item, §7.
  • M. Chen, Z. Xu, K. Weinberger, and F. Sha (2012) Marginalized denoising autoencoders for domain adaptation. arXiv. Cited by: §5.1.
  • T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In KDD, Cited by: §1, §2, §2, §2, §5.2.
  • T. Evgeniou and M. Pontil (2004) Regularized multi–task learning. In KDD, Cited by: §7.
  • J. B. Faddoul, B. Chidlovskii, R. Gilleron, and F. Torre (2012) Learning multiple tasks with boosted decision trees. In ECML PKDD, Cited by: §7.
  • J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics. Cited by: §2.
  • J. H. Friedman (2002) Stochastic gradient boosting. CSDA. Cited by: §1.
  • L. Ge, J. Gao, H. Ngo, K. Li, and A. Zhang (2014) On handling negative transfer and imbalanced distributions in multiple source transfer learning. SADM. Cited by: §5.3.1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. Vol. 1, MIT press Cambridge. Cited by: §1.
  • K. E. Goodman, J. Lessler, S. E. Cosgrove, A. D. Harris, E. Lautenbach, J. H. Han, A. M. Milstone, C. J. Massey, and P. D. Tamma (2016) A clinical decision tree to predict whether a bacteremic patient is infected with an extended-spectrum β-lactamase–producing organism. CID. Cited by: §1.
  • N. A. Goussies, S. Ubalde, and M. Mejail (2014) Transfer learning decision forests for gesture recognition. JMLR. Cited by: §7.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv. Cited by: §1.
  • S. Ji and J. Ye (2009) An accelerated gradient method for trace norm minimization. In ICML, Cited by: §1, §1, 4th item.
  • X. Jin, F. Zhuang, S. J. Pan, C. Du, P. Luo, and Q. He (2015) Heterogeneous multi-task semantic feature learning for classification. In CIKM, Cited by: §7.
  • R. Johnson and T. Zhang (2014) Learning nonlinear functions using regularized greedy forest. TPAMI. Cited by: §2, §2.
  • H. C. Koh, G. Tan, et al. (2011) Data mining applications in healthcare. JHIM. Cited by: §1.
  • A. Kumar and H. Daumé III (2012) Learning task grouping and overlap in multi-task learning. In ICML, Cited by: §7.
  • A. Maurer, M. Pontil, and B. Romera-Paredes (2013) Sparse coding for multitask and transfer learning. In ICML, Cited by: §7.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. JMLR. Cited by: §2.
  • B. Qin, Y. Xia, and F. Li (2009) DTU: a decision tree for uncertain data. In PAKDD, Cited by: §1.
  • Y. Qu, B. Fang, W. Zhang, R. Tang, M. Niu, H. Guo, Y. Yu, and X. He (2018) Product-based neural networks for user response prediction over multi-field categorical data. arXiv. Cited by: §1.
  • G. Ridgeway (2007) Generalized boosted models: a guide to the gbm package. Update. Cited by: §2.
  • J. Simm, I. M. de Abril, and M. Sugiyama (2014) Tree-based ensemble multi-task learning method for classification and regression. IEICE TRANSACTIONS on Information and Systems. Cited by: §1, §1, §4, 3rd item, §5.3.1, §7.
  • P. B. Walker, M. L. Mehalick, A. C. Glueck, A. E. Tschiffely, C. A. Cunningham, J. N. Norris, and I. N. Davidson (2017) A decision tree framework for understanding blast-induced mild traumatic brain injury in a military medical database. JDMS. Cited by: §1.
  • Y. Zhang and Q. Yang (2017) A survey on multi-task learning. arXiv. Cited by: §1, §7.
  • J. Zhou, J. Chen, and J. Ye (2011) Clustered multi-task learning via alternating structure optimization. In NIPS, Cited by: §1, §1, 6th item.
  • X. Zhou, Q. Qiao, L. Ji, F. Ning, W. Yang, J. Weng, Z. Shan, H. Tian, Q. Ji, L. Lin, et al. (2013) Nonlaboratory-based risk assessment algorithm for undiagnosed type 2 diabetes developed on a nation-wide diabetes survey. Diabetes care 36 (12), pp. 3944–3952. Cited by: §6.

Appendix A Hyperparameters Settings

In this section, we present the hyperparameter search spaces used in our experiments, as well as the hyperparameter settings we finally used, to aid reproducibility.

In our experiments, we first fix the training hyperparameters and consider different combinations of tree booster hyperparameters. Specifically, we search maximum tree depth from , minimum leaf node sample weight sum from , sample rates from , and from . After effective tree booster hyperparameters are found, we further search learning rate from and regularization weight from . We finally fine-tune the hyperparameters around the current optimum and determine the threshold ratio following Sec. 5.3.4. The best hyperparameters for the diabetes dataset and the sentiment dataset are listed in Tab. 4.

Hyperparameter        Diabetes  Sentiment
max_depth             5         9
min_child_weight      5         1
colsample_bytree      0.7       1.0
colsample_bylevel     0.8       0.8
subsample             0.8       1.0
gamma                 0.2       0.45
learning_rate         0.1       0.3
reg_alpha             0.1       0.0005
reg_lambda            12        12
max_neg_sample_ratio  0.4       0.5
Table 4. The TSGB hyperparameters on different datasets

We use the optimal parameters of each model and 10 randomly selected initial seeds to run each training-evaluation process, and report the average AUC score as the final result. The 95% confidence interval is given by , where , , is standard deviation. Under these experimental settings, the more detailed results are shown in Tab. 6 and Tab. 7.
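The exact confidence-interval formula did not survive in this copy of the text, so the sketch below shows only one common way to compute a 95% interval over repeated runs. The normal approximation with z = 1.96, the use of the sample standard deviation, and the function name are all assumptions; the authors may have used a different constant or estimator.

```python
import math
import statistics


def confidence_interval_95(values, z=1.96):
    """Return (low, high) for the mean of `values`.

    Assumption: normal approximation, mean +/- z * sd / sqrt(n), with the
    sample standard deviation. This is not confirmed to be the paper's formula.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(len(values))
    return mean - half_width, mean + half_width


# Made-up AUC scores from 10 seeds, only to show the call.
aucs = [0.781, 0.779, 0.784, 0.780, 0.783, 0.782, 0.778, 0.785, 0.781, 0.780]
low, high = confidence_interval_95(aucs)
```

With only 10 runs, a Student-t critical value would give a somewhat wider interval than z = 1.96; which of the two matches the reported numbers cannot be determined from this section.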

Appendix B Experimental Settings

In this section, we provide the detailed experimental settings of Sec. 5.4 so that this kind of case study can be reproduced.

We set threshold ratio , and then train GBDT and TSGB-4 on the task-21 dataset respectively. How each task’s samples go through the tree is then as shown in Tab. 5. As a concrete example, once we plot a certain decision tree (the fourth here) of GBDT and TSGB-4, we get a view similar to Fig. 10. The (positive sample number negative sample number) pairs of task-21 in each node and leaf show how task-21’s samples go through the decision tree. The defined in Eq. (10) is provided with the corresponding split condition at each node.

Table 5. An example of how each task’s samples go through the tree shown in Fig. 10.

In GBDT (Fig. 10(a)), when 2 positive and 163 negative samples from task-21 are assigned to node-2, the best split condition in GBDT divides the samples based on whether a sample is from task-21. This is because most samples of task-21 in node-2 are negative, and such a split condition reduces the overall objective more than the alternatives. But this condition is not good for task-21: since only task-21’s samples go through the sub-tree (), this sub-tree is actually constructed in a single-task manner, and task-21 cannot benefit from other tasks’ data. This sub-tree structure is also related to the AUC below 0.5 on task-21, because the model is biased by task-21’s imbalanced label distribution, and almost all of the samples on the right branch from node-2 are given a negative predicted value.

In TSGB-4 (Fig. 10(b)), the first split at root node-0 is the same as in GBDT, as the optimal split condition “f2<6.935?” only has and is thus good for most of the tasks. But at node-2 on the right branch, the optimal split condition “Is from task 21?” would actually increase, rather than reduce, the objective for nearly half of the tasks, i.e., they would have the negative task gain problem. In this case, TSGB instead splits the samples task-wise, sending tasks with negative task gain to the left branch and those with positive task gain to the right. Although the same 2 positive and 163 negative samples of task-21 as in GBDT are assigned to node-6 in Fig. 10(b), node-6 also contains 2431 samples from 11 other tasks, which means task-21 can benefit from other tasks by sharing more of the tree structure. At node-6, the optimal split condition again leads to the negative task gain problem for about half of the samples, which means that more than half of the samples are better off not being split by “f5<0.172?”. Therefore, TSGB conducts a task-wise split at node-6 and assigns task-21’s samples to leaf-14 together with 1088 other samples from 5 tasks. One may question the use of a task-wise split at the last level of a decision tree. The rationale lies in the dichotomous nature of decision trees: a task-wise split is an alternative way to find a generally “good” division when a feature-wise split cannot.
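The per-node decision walked through above can be summarized in a short sketch. It is not the authors' implementation: Eq. (10) is not reproduced in this section, so the ratio used here (the share of the node's samples that belong to tasks with negative task gain) and the strict "greater than" comparison are assumptions inferred from the description, and all names and numbers are hypothetical.

```python
def choose_split(task_gain, task_size, threshold_ratio):
    """Decide between the best feature-wise split and a task-wise split.

    task_gain: dict task id -> gain of the best feature-wise split measured
        on that task's samples at this node (may be negative).
    task_size: dict task id -> number of that task's samples at this node.
    threshold_ratio: hyperparameter controlling when to split task-wise.
    Returns (kind, left_tasks, right_tasks); the task lists are None when
    the feature-wise split is kept.
    """
    total = sum(task_size.values())
    negative = sorted(t for t in task_gain if task_gain[t] < 0)
    # Assumed ratio: fraction of node samples in negative-gain tasks.
    negative_ratio = sum(task_size[t] for t in negative) / total
    if negative_ratio > threshold_ratio:
        non_negative = sorted(t for t in task_gain if task_gain[t] >= 0)
        # Negative-gain tasks go left, the rest go right, as in the text.
        return "task-wise", negative, non_negative
    return "feature-wise", None, None


# Made-up statistics for one node.
gains = {21: -0.3, 1: 0.5, 2: -0.1}
sizes = {21: 165, 1: 1000, 2: 800}
decision = choose_split(gains, sizes, threshold_ratio=0.4)
```

In this example roughly 49% of the node's samples belong to tasks with negative gain, so a threshold of 0.4 triggers a task-wise split, while a threshold of 0.6 would keep the feature-wise split.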

Appendix C Application Settings

The deployment environment is based on servers with 2 regular nodes and 2 test nodes. Each node runs CentOS Linux release 7.6.1810 with a 4-core 2394 MHz CPU, 32 GB RAM, and a 50 GB SSD.

Users are required to provide basic personal information and living habits, including gender, year of birth, educational background, height, weight, family history of diabetes, hypertension, fatty liver, smoking, and drinking. As optional items, users can also provide province, fasting blood glucose, and systolic blood pressure for a more accurate diabetes risk assessment.

Task 25% 50% 100%
Table 6. Complete AUC results on diabetes dataset for hyperparameters listed in Tab. 4.
Task 25% 50% 100%