Building Decision Forest via Deep Reinforcement Learning

by   Guixuan Wen, et al.
Chongqing University

Ensemble learning methods whose base classifier is a decision tree usually belong to either bagging or boosting. To the best of our knowledge, however, no previous work has built an ensemble classifier by maximizing a long-term return. This paper proposes MA-H-SAC-DF, a decision forest building method for binary classification based on deep reinforcement learning. First, the building process is modeled as a decentralized partially observable Markov decision process, and a set of cooperative agents jointly constructs all base classifiers. Second, the global state and local observations are defined from the information of the parent node and the current location. Last, the state-of-the-art deep reinforcement learning method Hybrid SAC is extended to a multi-agent system under the CTDE architecture to find an optimal decision forest building policy. Experiments indicate that MA-H-SAC-DF matches random forest, Adaboost, and GBDT on balanced datasets and outperforms them on imbalanced datasets.





1 Introduction

Decision tree is one of the classical machine learning algorithms; its goal is to generate a tree structure that represents a set of test rules by recursively partitioning the feature space. Unlike black-box models that provide only predictions, decision tree models are not only simple and easy to use but also naturally interpretable. Users can obtain detailed information about the decision process from a decision tree model, which is why it is widely used in industries such as medical aid diagnosis[1, 2], financial risk control[3, 4], and marketing[5, 6].

However, an individual decision tree model is very prone to overfitting and often requires additional pruning to trade off accuracy against robustness. Ensemble learning[7] is one of the common means of improving the robustness and accuracy of decision tree models. For example, in a random forest[8], multiple subsets are sampled from the dataset in parallel to train different decision tree models, and the final decision is made by voting. In addition, each tree in a random forest is fully grown without pruning. Random forests tend to perform very well, especially on datasets that contain a large number of attributes.

Another important issue is that data imbalance also poses a challenge for decision trees[9]. Data imbalance, also known as data skew, is one of the difficulties that machine learning algorithms must address in real-world applications. For example, in disease diagnosis the number of healthy individuals is often 1,000 or even 10,000 times that of diseased individuals, yet accurately predicting the diseased individuals matters far more than accurately predicting the healthy ones. Some classical decision tree algorithms were proposed without considering the distribution of the data; although they achieve good performance on balanced data, they fail on imbalanced data because the algorithm is biased towards the majority class. The reason is that the split criterion introduces the prior probability of the sample distribution when calculating node impurity.
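The majority-class bias of impurity-based split criteria is easy to see numerically. The sketch below (illustrative, not from the paper) computes the Gini impurity for a balanced node and a heavily imbalanced one:

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Balanced node: impurity is maximal (0.5), so splits are rewarded.
balanced = gini([50, 50])

# Heavily imbalanced node: impurity is already near zero, so a split
# that isolates a few minority samples barely changes the criterion
# and the tree has little incentive to separate them.
imbalanced = gini([990, 10])
```

Because the criterion is computed from the class priors at the node, the imbalanced node already looks "almost pure" before any minority-aware split is considered.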

This paper proposes a deep reinforcement learning (DRL) based decision forest induction method called MA-H-SAC-DF for binary classification that maximizes a long-term return, where a set of cooperative agents jointly constructs all base classifiers. The idea builds on our previous work[10], which represented decision tree induction as a Markov decision process (MDP) and solved for the optimal induction strategy with a deep reinforcement learning algorithm over a hybrid action space. In MA-H-SAC-DF, the decision forest building process is viewed as a multi-step game that can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP). The state space during decision forest induction is redefined from the information of the parent node and the current position, while the action space and reward function remain consistent with our previous work. Finally, we extend Hybrid SAC[11] to the multi-agent setting under the centralized training decentralized execution (CTDE) architecture[12] and propose MA-H-SAC, which is applied to solve for the optimal decision forest induction policy. Experimental results show that MA-H-SAC-DF not only matches random forest, Adaboost, and GBDT on balanced datasets but also outperforms them on imbalanced datasets.

The rest of this paper is organized as follows. Section 2 first introduces ensemble learning related to decision trees, then summarizes the Hybrid SAC algorithm and existing multi-agent architectures. Section 3 describes the details of MA-H-SAC and MA-H-SAC-DF. Section 4 presents the experimental results and compares the performance of MA-H-SAC-DF with other methods. Section 5 discusses conclusions and future work.

2 Related work

2.1 Decision tree based ensemble learning

The basic idea of ensemble learning is to combine multiple weak classifiers into a strong classifier. The Hoeffding inequality[13] shows that when the accuracy of each base classifier exceeds 50%, the error rate of the ensemble classifier decreases exponentially as the number of base classifiers grows and eventually approaches 0. The decision tree is a common base classifier in ensemble learning, and decision tree ensembles can be classified into bagging, boosting, and stacking according to the ensemble strategy.

One of the earliest ensemble algorithms is bagging[15], where each base classifier is trained on a subset of the initial training set and new samples are usually predicted by voting. Random forest, whose base classifier is a decision tree, is a variant of bagging. Like bagging, random forest uses bootstrap sampling and aggregates the predictions of all base classifiers by an unweighted vote. However, the induction of a base classifier in a random forest differs from that of a single decision tree: a subset of $k$ attributes is first randomly selected from the $d$ available attributes, and each base classifier is then trained on its sampled data subset using only that attribute subset. A commonly recommended value is $k = \log_2 d$. In addition, every tree in the random forest is fully grown without pruning. Random forests tend to perform very well, especially for datasets that contain many attributes.
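The two sampling steps above (bootstrap rows, then a random attribute subset per tree) can be sketched as follows. The helper names `bootstrap_sample` and `random_attribute_subset` are illustrative, not from the paper, and the $\lfloor \log_2 d \rfloor + 1$ subset size is one common heuristic:

```python
import math
import random

def bootstrap_sample(X, y, rng):
    """Draw n samples with replacement (the bagging step)."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_attribute_subset(d, rng):
    """Pick k = floor(log2 d) + 1 attribute indices out of d."""
    k = int(math.log2(d)) + 1
    return rng.sample(range(d), k)

rng = random.Random(0)
X = [[float(i), float(i % 3), float(i % 5)] for i in range(20)]
y = [i % 2 for i in range(20)]

# Each tree in the forest would get its own bootstrap sample and
# its own attribute subset before induction.
Xb, yb = bootstrap_sample(X, y, rng)
attrs = random_attribute_subset(len(X[0]), rng)
```

A full forest repeats this per tree and combines the trained trees by an unweighted vote.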

Unlike bagging, the base classifiers in boosting[15] are generated sequentially. First a base classifier is trained from the initial training set, and then the next base classifier is trained based on the performance of the previous one. This process repeats until the number of base classifiers reaches a pre-specified value, and the final set of base classifiers is combined by weighting. Gradient Boosting Decision Tree (GBDT)[14] is a classical algorithm in the boosting family. For the binary classification problem, given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{0, 1\}$, the loss of a single sample can be expressed as the cross-entropy

$$L(y, \hat{p}) = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right],$$

where $\hat{p}$ denotes the predicted probability. Assuming that the $m$-th base classifier is $h_m(x)$, the ensemble classifier obtained after the first $m$ iterations is

$$F_m(x) = F_{m-1}(x) + h_m(x).$$

According to the log-odds function $\hat{p} = 1 / (1 + e^{-F(x)})$, substitution gives

$$L(y, F(x)) = y \log\left(1 + e^{-F(x)}\right) + (1 - y)\left[F(x) + \log\left(1 + e^{-F(x)}\right)\right].$$

It is then easy to obtain the negative gradient of the loss $L$ with respect to the current ensemble classifier,

$$r_{i,m} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - \hat{p}_i.$$

The samples $(x_i, r_{i,m})$ are then used as the training set for the $m$-th base classifier. This is repeated until the number of base classifiers reaches a pre-specified value.
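As a minimal illustration of this residual-fitting loop, the sketch below keeps only the negative-gradient computation $r_i = y_i - \hat{p}_i$ and replaces the regression tree with a trivial per-sample update (a stand-in, not GBDT's actual base learner):

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

# Toy labels and the current ensemble scores F_{m-1}(x_i).
y = [1, 1, 0, 0]
F = [0.0, 0.0, 0.0, 0.0]
lr = 0.5

for _ in range(50):
    # Negative gradient of the cross-entropy loss: r_i = y_i - p_i.
    residuals = [yi - sigmoid(fi) for yi, fi in zip(y, F)]
    # Stand-in "base learner": a per-sample constant step toward the
    # residual, where real GBDT would fit a regression tree to it.
    F = [fi + lr * ri for fi, ri in zip(F, residuals)]

probs = [sigmoid(fi) for fi in F]
```

After enough iterations the predicted probabilities move toward the labels, which is exactly what fitting successive learners to the negative gradient accomplishes.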


Stacking[15] differs from bagging and boosting in two main points. First, stacking considers heterogeneous base classifiers, while bagging and boosting consider homogeneous ones. Second, stacking uses a meta model to combine the base classifiers. For example, KNN, logistic regression, and SVM can be selected as base classifiers with a neural network as the meta model; the neural network takes the outputs of the base classifiers as input and produces the final prediction.
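The stacking wiring can be sketched with hypothetical base classifiers (simple threshold rules standing in for KNN/SVM) and a fixed linear meta model standing in for a trained neural network; all names and weights below are illustrative:

```python
# Hypothetical base classifiers: each maps a feature vector to a score.
def base_a(x):  # a threshold rule standing in for, e.g., KNN
    return 1.0 if x[0] > 0.5 else 0.0

def base_b(x):  # a different rule standing in for, e.g., an SVM
    return 1.0 if x[1] > 0.5 else 0.0

def meta(z):
    """Stand-in meta model: a fixed linear layer over the base outputs,
    where real stacking would train this model (e.g. a neural network)."""
    w, b = [0.6, 0.6], -0.5
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0

x = [0.9, 0.2]
prediction = meta([base_a(x), base_b(x)])
```

The key structural point is that the meta model never sees the raw features, only the base classifiers' outputs.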

2.2 Hybrid SAC

Many deep reinforcement learning methods consider only discrete or only continuous actions, but hybrid action spaces are also essential. Hybrid SAC[11], an extension of SAC, can directly handle a hybrid action space without any approximation or relaxation. Assuming that there are $K$ discrete actions and each discrete action $k$ corresponds to an $n_k$-dimensional continuous parameter $x_k$, the hybrid action space can be expressed as

$$\mathcal{A} = \left\{(k, x_k) \mid k \in \{1, \dots, K\},\; x_k \in \mathbb{R}^{n_k}\right\}.$$

As depicted in FIGURE 1, Hybrid SAC follows the actor-critic architecture: the actor network takes the state as input and produces a discrete distribution together with the mean and standard deviation of the continuous parameters, while the critic network takes the state combined with the continuous parameters as input and produces Q-values for all discrete actions.

Figure 1: The framework of Hybrid SAC
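A toy sketch of this actor interface and of sampling a hybrid action from its outputs (a random projection stands in for the trained network; shapes are assumed from the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_actor(state, n_discrete=3, param_dim=1):
    """Stand-in actor head: from a state, produce a discrete distribution
    plus a mean and std for every action's continuous parameter. Shapes
    only; Hybrid SAC computes these with a neural network."""
    logits = np.tanh(state @ rng.standard_normal((state.size, n_discrete)))
    probs = np.exp(logits) / np.exp(logits).sum()
    mu = np.zeros((n_discrete, param_dim))
    sigma = np.ones((n_discrete, param_dim))
    return probs, mu, sigma

def sample_hybrid_action(probs, mu, sigma):
    k = rng.choice(len(probs), p=probs)                         # discrete part
    x_k = mu[k] + sigma[k] * rng.standard_normal(mu[k].shape)   # its parameter
    return k, x_k

probs, mu, sigma = hybrid_actor(np.ones(4))
k, x_k = sample_hybrid_action(probs, mu, sigma)
```

The sampled pair `(k, x_k)` is exactly one element of the hybrid action space: a discrete choice plus its continuous parameter.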

2.3 Multi-agent architecture

In a multi-agent system (MAS), different agents interact in a common environment and make decisions autonomously to achieve their own goals or a common goal. According to the task, the relationship between agents can be divided into cooperative, competitive, and mixed. Compared with single-agent systems, applying reinforcement learning in a MAS faces the following challenges: the instability of the environment, the limited information available to each agent, the consistency of individual goals, and scalability[16].

Distributed training decentralized execution (DTDE) is an early multi-agent deep reinforcement learning architecture. In DTDE, each agent is trained independently with no information shared between agents, and each agent makes independent decisions during execution based on its local observations. The disadvantage of DTDE is that the environment is non-stationary from any single agent's perspective[17]: while one agent is making decisions, the other agents are also taking actions, so the environment state changes according to the joint actions of all agents.

Centralized training centralized execution (CTCE), the opposite of DTDE, assumes that all agents can communicate with each other. It can therefore learn a joint policy for all agents by directly applying single-agent methods to the MAS[17]. Although CTCE-based methods can improve scalability to some extent by sharing parameters, they still face the curse of dimensionality, since the joint action space grows exponentially with the number of agents.

Centralized training decentralized execution (CTDE) is the most popular multi-agent training architecture: each agent trains with global information but makes independent decisions based on local observations during execution. Jakob Foerster et al.[18] first applied the centralized-critic, distributed-actor training scheme based on the actor-critic architecture and proposed the counterfactual multi-agent policy gradients (COMA) algorithm. A series of landmark multi-agent deep reinforcement learning methods such as MADDPG[12], VDN[19], QMIX[20], and QTRAN[21] are also built on the CTDE architecture. The MA-H-SAC proposed in this paper is therefore also based on CTDE.

3 Methods

Ensemble learning is one of the standard methods for improving the generalization and robustness of algorithms. So far, ensemble algorithms whose base classifier is a decision tree usually belong to either bagging or boosting; there has been no comparable work that builds an ensemble classifier by maximizing a long-term return. This paper first proposes a multi-agent deep reinforcement learning method called MA-H-SAC for hybrid action spaces, and then proposes a new decision forest induction method based on it, called MA-H-SAC-DF, in which a group of cooperative agents constructs all base classifiers.

3.1 MA-H-SAC

Usually, a multi-agent task can be described as a decentralized partially observable Markov decision process (Dec-POMDP), represented by the tuple $\langle N, S, \{A_i\}, P, r, \{O_i\}, \gamma \rangle$, where $S$ represents the global state of the environment. Since each agent observes the environment only partially, at each time step $t$ agent $i$ obtains its observation $o_i \in O_i$ and selects an action $a_i \in A_i$; together these form the joint action $a = (a_1, \dots, a_N)$. When the joint action is performed, the environment transitions according to the state transition function $P(s' \mid s, a)$. All agents share the same reward function $r(s, a)$ and discount rate $\gamma$.

MA-H-SAC is an extension of Hybrid SAC to the MAS setting under the CTDE architecture. The global state is used during training to keep the environment stationary from each critic's perspective, while each agent needs only its local observation to make decisions during execution. Suppose there are $N$ agents with actor networks $\pi_{\phi_i}$ and critic networks $Q_{\theta_i}$. As shown in FIGURE 2, like MADDPG, each agent in MA-H-SAC uses an independent critic network that takes the global state and the continuous actions as input and produces the Q-values of all discrete actions during the centralized training stage. In the decentralized execution stage, the actor network outputs the probability distribution over discrete actions together with the mean and standard deviation of all continuous parameters, taking only the local observation as input. The agent then obtains a discrete action by sampling from the probability distribution and computes the final continuous parameters with the reparameterization trick $x = \mu + \sigma \odot \epsilon$, where $\epsilon$ is Gaussian noise.

Figure 2: The framework of MA-H-SAC

According to the CTDE architecture and the Hybrid SAC algorithm, the loss function of the $i$-th agent for updating the actor network is

$$J_{\pi}(\phi_i) = \mathbb{E}_{(s, o_i) \sim D,\; a_i \sim \pi_{\phi_i}}\left[\alpha_d \log \pi_{\phi_i}(a_i^d \mid o_i) + \alpha_c \log \pi_{\phi_i}(a_i^c \mid o_i) - Q_{\theta_i}(s, a)\right],$$

where $\alpha_d$ and $\alpha_c$ are the temperature parameters for the discrete and continuous parts, and the replay buffer $D$ stores the agents' historical experience $(s, o, a, r, s', o')$. The parameters of the critic network can be updated by minimizing the Bellman error

$$J_{Q}(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\tfrac{1}{2}\left(Q_{\theta_i}(s, a) - y\right)^2\right], \qquad y = r + \gamma\left(\min_{j=1,2} Q_{\bar{\theta}_{i,j}}(s', a') - \alpha \log \pi_{\phi_i}(a' \mid o')\right),$$

where $\bar{\theta}_{i,j}$ are the target critic network parameters. It is worth noting that, to make training more stable, MA-H-SAC follows the suggestion in TD3[22] and uses double critic networks to avoid the overestimation of the value function during training. The details of MA-H-SAC are shown in Algorithm 1.
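The clipped double-Q target used above can be sketched in one line; the numeric inputs below are arbitrary:

```python
def td_target(reward, gamma, q1_next, q2_next, log_prob_next, alpha):
    """Clipped double-Q soft target as in TD3/SAC: take the minimum of
    the two target critics to curb overestimation, then subtract the
    entropy (log-probability) term scaled by the temperature alpha."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)

# min(5.2, 4.8) = 4.8; entropy bonus = -0.2 * (-1.0) = +0.2
y = td_target(reward=1.0, gamma=0.99, q1_next=5.2, q2_next=4.8,
              log_prob_next=-1.0, alpha=0.2)
```

Taking the minimum of two independently trained critics is what prevents the positive bias that a single bootstrapped critic accumulates.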

1:  Input: learning rate $\lambda$, max episode $E$, max timestep $T$, Gaussian distribution $\mathcal{N}(0, 1)$, temperature factor $\alpha$, number of agents $N$
2:  Initialize actor networks $\phi_i$ and critic networks $\theta_i$
3:  Initialize target actor networks $\bar{\phi}_i$ and target critic networks $\bar{\theta}_i$
4:  for $e = 1$ to $E$ do
5:     obtain initial global state $s_0$ and local observations $o_{i,0}$
6:     for $t = 1$ to $T$ do
7:        for $i = 1$ to $N$ do
8:           compute $\pi_{\phi_i}(\cdot \mid o_{i,t})$, $\mu_i$, $\sigma_i$
9:           select action $a_{i,t} = (a_{i,t}^d, a_{i,t}^c)$, where $a_{i,t}^d \sim \pi_{\phi_i}$ and $a_{i,t}^c = \mu_i + \sigma_i \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, 1)$
10:        end for
11:        take the joint action $a_t$, obtain reward $r_t$, next global state $s_{t+1}$ and next local observations $o_{i,t+1}$
12:        save tuple $(s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1})$ to replay buffer $D$
13:        for $i = 1$ to $N$ do
14:           get a batch of samples from $D$
15:           compute the target according to Equation (11)
16:           compute the loss and update the critic network according to Equation (10)
17:           compute the loss and update the actor network according to Equation (6)
18:           update the target networks
19:        end for
20:     end for
21:  end for
Algorithm 1 MA-H-SAC

3.2 MA-H-SAC-DF

MA-H-SAC-DF first transforms the induction process of the decision forest into a Dec-POMDP. As shown in FIGURE 3 (a), it is assumed that $N$ agents respectively build $N$ trees. When an agent generates a node, the local observation includes the parent node's attribute, threshold value, and node type. The local observation also contains the node's local and global position: the local position indicates whether the current node is a left or right child, and the global position is the current node's index when the decision tree is traversed in level order. All of this information is one-hot encoded and concatenated into a one-dimensional vector to form the local observation vector $o_i$ of agent $i$. MA-H-SAC adopts the CTDE architecture, so the global state of the environment is required as input to the critic network during centralized training. A simple approach is to concatenate all agents' local observations into the global state, but this produces redundant information: because all agents generate nodes in level order at the same time, the node's local and global positions are identical across agents, so only one copy of this part needs to be kept. The simplified global state is shown in FIGURE 3 (b).
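The observation layout described above can be sketched as a concatenation of one-hot fields. The field sizes and the three-way node-type encoding below are assumptions for illustration, not the paper's exact dimensions:

```python
def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def local_observation(parent_attr, n_attrs, parent_threshold,
                      node_type, is_left_child, global_pos, max_nodes):
    """Sketch of the paper's observation layout: one-hot parent attribute,
    raw threshold, one-hot node type, local position (left/right child),
    and level-order global position, concatenated into one flat vector."""
    obs = []
    obs += one_hot(parent_attr, n_attrs)
    obs += [parent_threshold]
    obs += one_hot(node_type, 3)             # assumed: internal / leaf-0 / leaf-1
    obs += one_hot(0 if is_left_child else 1, 2)
    obs += one_hot(global_pos, max_nodes)
    return obs

obs = local_observation(parent_attr=2, n_attrs=4, parent_threshold=0.37,
                        node_type=0, is_left_child=True,
                        global_pos=5, max_nodes=15)
```

Under the simplification in the text, the shared position fields would appear only once in the concatenated global state rather than once per agent.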

Figure 3: Representation of partial observation and global state of decision forest

The action space and reward function are defined in the same way as when building a single decision tree model in our previous work. We consider continuous attributes in binary classification, so an attribute can be represented as a discrete action $k$ and its threshold value as the corresponding continuous action parameter $x_k$. Each node in a decision tree thus corresponds to an action $a = (k, x_k)$.


Intuitively, at step $t$ we can combine the tree structures generated by the agents into an ensemble classifier and classify the training set. The predicted results and the ground truth are then used to calculate a score $s_t$ based on an arbitrary evaluation metric, such as accuracy or G-Mean. Finally, the reward $r_t$ is easily obtained according to equation (13):

$$r_t = s_t - s_{t-1}.$$

Figure 4: The framework of MA-H-SAC-DF

That is to say, a positive reward means the action improves the performance of the ensemble, while a negative reward means it decreases it. Note that the initial score $s_0$ is usually set to 0 or 0.5. If $s_0$ is set to zero, the total reward is equivalent to the evaluation score of the final classifier; setting $s_0$ to 0.5 is equivalent to adding a baseline.
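The telescoping property of this reward is easy to check: the per-step rewards are score differences, so the return collapses to the final score minus $s_0$. The score sequence below is made up for illustration:

```python
def rewards_from_scores(scores, s0=0.0):
    """Per-step reward is the change in the ensemble's evaluation score,
    so the undiscounted return telescopes to (final score - s0)."""
    rewards, prev = [], s0
    for s in scores:
        rewards.append(s - prev)
        prev = s
    return rewards

# e.g. G-Mean of the ensemble after each round of node construction
scores = [0.55, 0.62, 0.60, 0.71]
r = rewards_from_scores(scores, s0=0.5)
total = sum(r)   # equals 0.71 - 0.5
```

With $s_0 = 0.5$ the return measures improvement over a chance-level baseline, which is exactly the "adding a baseline" interpretation in the text.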


The framework of MA-H-SAC-DF is shown in FIGURE 4. Each agent is responsible for generating one base classifier. At time $t$, each agent selects an attribute and its corresponding threshold value according to its actor network to form an action, and all agents' actions form the joint action. The environment generates a node for each base classifier according to the attribute and partition threshold in the joint action. Next, the environment combines all base classifiers into an ensemble classifier by unweighted voting, evaluates it on the dataset, and returns the reward $r_t$. Finally, each agent randomly samples a batch from the experience pool to update its network parameters. It should be noted that MA-H-SAC-DF adopts a delayed update strategy for the actor network, similar to the MATD3 method.

4 Experiment

4.1 Datasets

In this section, random forest, GBDT, and Adaboost are used as comparison baselines to verify the performance of MA-H-SAC-DF on both balanced and imbalanced data.

Data Sets #I #F
appendicitis 106 7
bupa 345 6
coil2000 9822 85
heart 220 13
magic 19020 10
pima 768 8
sonar1 208 60
spectfheart 267 44
breast-cancer 683 10
bands 365 19
australian 690 14
mammographi 830 5
saheart 462 9
liver-disorders 345 5
diabetes 759 8
Figure 5: Description of the balanced datasets. #I and #F denote the number of instances and attributes, respectively
Data Sets #I #F IR
ecoli-0-1-vs-2-3-5 244 7 9.17
ecoli-0-1-4-6-vs-5 187 6 13.0
ecoli-0-1-4-7-vs-2-3-5-6 336 7 10.59
ecoli-0-6-7-vs-5 220 6 10.0
ecoli2 336 7 5.46
haberman 306 3 2.78
new-thyroid1 215 5 5.14
new-thyroid2 215 5 5.14
vehicle3 846 18 3.0
winequality-red-4 1599 11 29.17
wisconsin 683 9 1.86
yeast-0-2-5-6-vs-3-7-8-9 1004 8 9.14
yeast1 1484 8 2.46
glass0 214 9 2.06
glass6 214 9 6.38
pima 768 8 1.87
africa recession 486 53 11.9
insurance 382154 10 5.1
Figure 6: Description of the imbalanced datasets. #I and #F denote the number of instances and attributes, respectively; IR is the imbalance ratio

The 15 balanced datasets and 18 imbalanced datasets, all drawn from real-world applications, are described in FIGURE 5 and FIGURE 6, which list the number of instances and attributes. For the imbalanced datasets, the imbalance ratio (IR), which measures the degree of imbalance between the majority and minority classes, is also provided. These datasets are collected from two well-known public sources: the UCI Machine Learning Repository and the KEEL Imbalanced Data Sets.

4.2 Evaluation Metrics

We use accuracy to evaluate model performance on the balanced datasets. However, accuracy is an unreasonable metric for evaluating classifier performance in imbalanced classification. The commonly used imbalanced classification evaluation metrics include the G-Mean and the area under the ROC curve (AUC), both of which are applied in our experiments.
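The G-Mean is the geometric mean of sensitivity and specificity, which is why it punishes majority-class-only classifiers that accuracy rewards. A minimal sketch with a made-up confusion matrix:

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return math.sqrt(sens * spec)

# A classifier that predicts everything as the majority class:
# 990 correct negatives, all 10 positives missed.
acc = (0 + 990) / 1000        # accuracy looks excellent
gm = g_mean(tp=0, fn=10, tn=990, fp=0)   # G-Mean collapses to 0
```

Because either factor being zero zeroes the whole metric, the G-Mean cannot be gamed by ignoring the minority class.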

To make the experiment more convincing, we not only use 10-fold cross-validation but also conduct a Friedman test and a Nemenyi test. The null hypothesis of the Friedman test is that all methods are equivalent. Precisely, assuming there are $k$ methods, $N$ datasets, and an average rank $r_i$ for each method $i$, we compute two statistics,

$$\tau_{\chi^2} = \frac{12N}{k(k+1)}\left(\sum_{i=1}^{k} r_i^2 - \frac{k(k+1)^2}{4}\right) \quad (14)$$

and

$$\tau_F = \frac{(N-1)\,\tau_{\chi^2}}{N(k-1) - \tau_{\chi^2}}, \quad (15)$$

and compare the value of $\tau_F$ with the critical value at a given significance level $\alpha$. If the null hypothesis is rejected, we take a Nemenyi test for further comparison. For a given significance level $\alpha$, the critical difference on the Nemenyi test is calculated as

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}. \quad (16)$$

If the average rank difference between two methods is greater than $CD$, the two methods are believed to have different performance.
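Equations (14)-(16) can be computed directly. The sketch below uses the average ranks of the four methods on the fifteen balanced datasets from Table 1; $q_\alpha = 2.291$ is the standard Studentized-range value for $k = 4$ at $\alpha = 0.1$:

```python
import math

def friedman_stats(avg_ranks, n_datasets):
    """Friedman chi-square statistic (Eq. 14) and F statistic (Eq. 15)."""
    k = len(avg_ranks)
    chi2 = (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f

def nemenyi_cd(q_alpha, k, n_datasets):
    """Nemenyi critical difference (Eq. 16)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Average ranks of the four methods on the 15 balanced datasets (Table 1).
chi2, f = friedman_stats([2.800, 2.467, 2.600, 2.133], n_datasets=15)
cd = nemenyi_cd(q_alpha=2.291, k=4, n_datasets=15)
```

Here $\tau_F$ comes out well below typical critical values, consistent with accepting the null hypothesis on the balanced datasets in Section 4.3.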


4.3 Result

The results of the Friedman test (Fr.T) against the "Base" classifier are shown in the last line of each table. A "✓" under a classifier indicates that the "Base" classifier significantly outperforms that classifier at the 90% confidence level.

Datasets Random Forest Adaboost GBDT MA-H-SAC-DF
appendicitis 0.844 0.833 0.844 0.846
bupa 0.670 0.615 0.670 0.583
coil2000 0.922 0.940 0.940 0.940
heart 0.784 0.828 0.799 0.798
magic 0.858 0.774 0.786 0.798
pima 0.734 0.744 0.735 0.777
sonar 0.747 0.750 0.729 0.772
spectfheart 0.791 0.783 0.797 0.793
breast-cancer 0.963 0.949 0.943 0.973
australian 0.963 0.949 0.943 0.973
bands 0.687 0.661 0.632 0.661
mammographi 0.787 0.838 0.834 0.833
saheart 0.648 0.693 0.688 0.726
liver-disorders 0.580 0.550 0.588 0.580
diabetes 0.708 0.729 0.753 0.657
Avg.Accuracy 0.772 0.770 0.773 0.779
Avg.Rank 2.800 2.467 2.600 2.133
Win/Tie/Loss 10/1/4 9/2/4 9/1/5 Base
Fr.T - - - Base
Table 1: Accuracy of four methods on fifteen balanced datasets

TABLE 1 compares the four methods on the balanced datasets according to accuracy. From the average accuracy and ranking, the performance of MA-H-SAC-DF on balanced data does not differ from random forest, Adaboost, and GBDT. The statistics $\tau_{\chi^2}$ and $\tau_F$ can be calculated according to equations (14) and (15) for the case of 15 datasets and four methods and compared with the critical value of the Friedman test at the given significance level. Because $\tau_F$ is smaller than the critical value, the Friedman null hypothesis is accepted and the performance of the four methods is considered the same, which means that MA-H-SAC-DF has the same classification performance as random forest, Adaboost, and GBDT on balanced data.

TABLE 2 and TABLE 3 compare the performance of MA-H-SAC-DF with random forest, Adaboost, and GBDT on the imbalanced datasets according to G-Mean and AUC, respectively. By the mean and average ranking of both G-Mean and AUC, MA-H-SAC-DF has better classification ability. To further verify this conclusion, the Friedman test was performed separately on the G-Mean and AUC results. For G-Mean, $\tau_{\chi^2}$ and $\tau_F$ can be calculated according to equations (14) and (15) and compared with the critical value of the Friedman test for four methods on 18 datasets. Because $\tau_F$ exceeds the critical value, the Friedman null hypothesis is rejected and the performance of the four methods is not the same. For AUC, the null hypothesis is likewise rejected. According to equation (16), the Nemenyi test critical difference $CD$ shows that the performance of MA-H-SAC-DF is significantly better than the other three ensemble learning methods.

Datasets Random Forest Adaboost GBDT MA-H-SAC-DF
ecoli-0-1-vs-2-3-5 0.878 0.854 0.608 0.938
ecoli-0-1-4-6-vs-5 0.735 0.750 0.430 0.971
ecoli-0-1-4-7-vs-2-3-5-6 0.750 0.793 0.664 0.891
ecoli-0-6-7-vs-5 0.772 0.756 0.656 0.929
ecoli2 0.847 0.795 0.657 0.886
haberman 0.475 0.467 0.060 0.635
new-thyroid1 0.961 0.941 0.942 0.984
new-thyroid2 0.937 1.000 0.866 0.958
vehicle3 0.691 0.691 0.267 0.696
winequality-red-4 0.179 0.000 0.229 0.717
wisconsin 0.968 0.973 0.964 0.977
yeast-0-2-5-6-vs-3-7-8-9 0.625 0.435 0.499 0.835
yeast1 0.619 0.542 0.000 0.687
glass0 0.800 0.718 0.471 0.754
glass6 0.851 0.894 0.768 0.992
pima 0.674 0.683 0.531 0.731
africa recession 0.383 0.280 0.275 0.607
insurance 0.601 0.000 0.000 0.818
Avg.G-Mean 0.708 0.643 0.494 0.834
Avg.Rank 2.333 2.778 2.778 1.111
Win/Tie/Loss 17/0/1 17/0/1 18/0/0 Base
Fr.T Base
Table 2: G-Mean of four methods on eighteen imbalanced datasets
Datasets Random Forest Adaboost GBDT MA-H-SAC-DF
ecoli-0-1-vs-2-3-5 0.887 0.861 0.681 0.935
ecoli-0-1-4-6-vs-5 0.777 0.777 0.595 0.970
ecoli-0-1-4-7-vs-2-3-5-6 0.786 0.820 0.725 0.888
ecoli-0-6-7-vs-5 0.802 0.786 0.715 0.926
ecoli2 0.857 0.812 0.749 0.883
haberman 0.549 0.578 0.500 0.600
new-thyroid1 0.962 0.894 0.900 0.983
new-thyroid2 0.940 1.000 0.875 0.957
vehicle3 0.694 0.694 0.536 0.696
winequality-red-4 0.521 0.500 0.524 0.712
wisconsin 0.968 0.973 0.964 0.977
yeast-0-2-5-6-vs-3-7-8-9 0.691 0.584 0.623 0.835
yeast1 0.650 0.608 0.500 0.682
glass0 0.808 0.728 0.632 0.735
glass6 0.961 0.900 0.792 0.992
pima 0.688 0.704 0.628 0.729
africa recession 0.574 0.545 0.552 0.586
insurance 0.654 0.500 0.500 0.812
Avg.AUC 0.759 0.737 0.667 0.828
Avg.Rank 2.333 2.833 2.722 1.111
Win/Tie/Loss 17/0/1 17/0/1 18/0/0 Base
Fr.T Base
Table 3: AUC of four methods on eighteen imbalanced datasets

5 Conclusion and future work

This paper proposes a new decision forest building method, MA-H-SAC-DF, which targets maximizing long-term returns via deep reinforcement learning and thus differs from bagging and boosting. In MA-H-SAC-DF, we first model the building process as a Dec-POMDP, in which all base classifiers are constructed jointly by a set of cooperative agents. The global state and local observations are defined from parent node information and position information, while the action space and reward function are the same as in our previous work. Last, Hybrid SAC is extended to a MAS under the CTDE framework to find an optimal decision forest building policy. The experimental results indicate that MA-H-SAC-DF performs better on imbalanced data.

In the future, we will continue to explore the efficiency optimization and scalability of MA-H-SAC-DF and extend it to discrete attribute problems. Another area we would like to explore is the extension of MA-H-SAC-DF under multi-class classification scenarios.


  • [1] Al Jarullah A A. Decision tree discovery for the diagnosis of type II diabetes[C]//2011 International conference on innovations in information technology. IEEE, 2011: 303-307.
  • [2] Azar A T, El-Metwally S M. Decision tree classifiers for automated medical diagnosis[J]. Neural Computing and Applications, 2013, 23(7): 2387-2403.
  • [3] Koyuncugil A S, Ozgulbas N. Risk modeling by CHAID decision tree algorithm[J]. ICCES, 2009, 11(2): 39-46.
  • [4] Kim S Y, Upneja A. Predicting restaurant financial distress using decision tree and AdaBoosted decision tree models[J]. Economic Modelling, 2014, 36: 354-362.
  • [5] Yang S B, Chen T L. Uncertain decision tree for bank marketing classification[J]. Journal of Computational and Applied Mathematics, 2020, 371: 112710.
  • [6] Olson D L, Chae B K. Direct marketing decision support through predictive customer response modeling[J]. Decision Support Systems, 2012, 54(1): 443-451.
  • [7] Sagi O, Rokach L. Ensemble learning: A survey[J]. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2018, 8(4): e1249.
  • [8] Biau G, Scornet E. A random forest guided tour[J]. Test, 2016, 25(2): 197-227.
  • [9] Cieslak D A, Chawla N V. Learning decision trees for unbalanced data[C]//Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2008: 241-256.
  • [10] Wen G, Wu K. Building Decision Tree for Imbalanced Classification via Deep Reinforcement Learning[C]//Asian Conference on Machine Learning. PMLR, 2021: 1645-1659.
  • [11] Delalleau O, Peter M, Alonso E, et al. Discrete and continuous action representation for practical rl in video games[J]. arXiv preprint arXiv:1912.11077, 2019.
  • [12] Lowe R, Wu Y I, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments[J]. Advances in neural information processing systems, 2017, 30.
  • [13] Zhou Z H. Ensemble learning[M]//Machine learning. Springer, Singapore, 2021: 181-210.
  • [14] Friedman J H. Greedy function approximation: a gradient boosting machine[J]. Annals of statistics, 2001: 1189-1232.
  • [15] Zhou Z H. Ensemble methods: foundations and algorithms[M]. CRC press, 2012.
  • [16] Du W, Ding S. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications[J]. Artificial Intelligence Review, 2021, 54(5): 3215-3238.

  • [17] Gronauer S, Diepold K. Multi-agent deep reinforcement learning: a survey[J]. Artificial Intelligence Review, 2022, 55(2): 895-943.
  • [18] Foerster J, Farquhar G, Afouras T, et al. Counterfactual multi-agent policy gradients[C]//Proceedings of the AAAI conference on artificial intelligence. 2018, 32(1).
  • [19] Sunehag P, Lever G, Gruslys A, et al. Value-decomposition networks for cooperative multi-agent learning[J]. arXiv preprint arXiv:1706.05296, 2017.
  • [20] Rashid T, Samvelyan M, Schroeder C, et al. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2018: 4295-4304.
  • [21] Son K, Kim D, Kang W J, et al. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2019: 5887-5896.
  • [22] Fujimoto S, Hoof H, Meger D. Addressing function approximation error in actor-critic methods[C]//International conference on machine learning. PMLR, 2018: 1587-1596.