Reinforced Data Sampling
With the rising number of machine learning competitions, the world has witnessed an exciting race for the best algorithms. However, the data selection process involved may fundamentally suffer from evidence ambiguity and concept drift, possibly leading to deleterious effects on the performance of various models. This paper proposes a new Reinforced Data Sampling (RDS) method that learns how to sample data adequately in the search for useful models and insights. We formulate the optimisation problem of model diversification δdiv in data sampling to maximise learning potentials and optimum allocation by injecting model diversity. This work advocates employing diverse base learners, such as neural networks, decision trees, or logistic regressions, as value functions to reinforce the selection of data subsets with multimodal belief. We introduce different ensemble reward mechanisms, including soft voting and stochastic choice, to approximate the optimal sampling policy. The evaluation conducted on four datasets highlights the benefits of using the RDS method over traditional sampling approaches. Our experimental results suggest that trainable sampling for model diversification is useful for competition organisers, researchers, and even beginners who want to realise the full potential of various machine learning tasks such as classification and regression. The source code is available at https://github.com/probeu/RDS.
Data sampling is the process of selecting subsets of data points from a larger dataset for analysis and reporting. In machine learning, it is a fundamental step in ensuring that learning methods generalise adequately to new observations. However, classical data sampling techniques such as randomisation are susceptible to issues that harm learning performance, including concept drift and evidence ambiguity [2]. Consider a machine learning task such as classification or regression, which aims to fit a model on training data samples so as to maximise task performance. The model concept, or the mapping from data observations to outputs, may drift, rendering a suboptimal fit to new data points because of hidden context changes or data-related problems. Often, there is simply insufficient or inappropriate information in the data samples to support adequate predictive strength, which is known as evidence ambiguity [20]. Hence, the nexus between data points and model outputs plays a crucial role in data selection to improve generalisation and to mitigate such issues [31]. This paper describes a new sampling method, named Reinforced Data Sampling (RDS), which learns how to sample data effectively, predicated on base models, in the search for useful models and insights. We focus on the ensemble use of multiple models to enhance the representational ability of the data and to select subsets of data points according to future learning potentials [11].
Models differ in their strengths and weaknesses; thus, exploiting their disagreements is a useful mechanism for better learning performance. Sample-based model diversification has proven to be an effective ensemble strategy in machine learning [7]. Given a set of learners, diverse ensembles can be derived from the collective behaviours of the members [20]. As typical learners are apt to mode-seeking behaviours, sample-based randomisation can be used to inject model diversity and gain more complementary information from multiple learners [7]. Therefore, learning how to sample with consideration of both task performance and model diversity is of interest for many applications in machine learning.
In this paper, we formulate a sampling problem for model diversification, in which a data sampler is trained to generate subsets of the larger dataset. Moreover, we advocate the use of diverse learners in our method to promote model diversity by design, including stable learning methods (e.g., support vector machines, regularised least square regression) and unstable learning methods (e.g., neural networks or decision trees) [1; 23]. Our approach entails reinforcement learning of observable evidence in the dataset to approximate the parameters of our sampler with ensemble value functions. We propose several novel ensemble reward mechanisms, including soft voting and stochastic choice. Furthermore, our method is designed to achieve proportionate allocation with regularisation of distributional property.
We evaluate the RDS approach using four datasets: Madelon (NIPS 2003 Feature Selection Challenge), Drug Reviews (Kaggle Hackathon), MNIST, and the Kalapa Credit Scoring Challenge. Our experiments cover a range of machine learning tasks, namely binary classification, multiclass classification, and regression, on multivariate, textual, and visual data. The results show that trainable data samples yield better performance than classical or prior data selection.
In data challenges, AI and data scientists have witnessed the deleterious effects of suboptimal data preparation in large-scale competitions. It limits the analysis and reporting available to the machine learning community, thereby restricting innovation and useful outcomes. In everyday settings, the same phenomenon may occur at the early stages of machine learning tasks. Therefore, we suggest our trainable sampling method for model diversification as a viable alternative to classical methods.
Firstly, this paper introduces Reinforced Data Sampling (RDS), a method to approximate optimum sampling for model diversification with ensemble rewarding to attain maximal machine learning potentials. A novel stochastic choice reward mechanism is developed as a viable way of injecting model diversity in reinforcement learning. Secondly, we implement an end-to-end framework for trainable data sampling, which can easily be adopted in the early stages of machine learning tasks, including classification and regression. Thirdly, we conduct comprehensive experiments to compare RDS against traditional data splitting methods on real-world datasets with various tasks. The results suggest that RDS is an effective method for data sampling with the objective of achieving high model diversification.
In machine learning, generalisation, or the ability to adapt to new, previously unseen observations, plays a vital role in creating useful models [9]. It entails the process of data sampling, which is employed to select and manipulate a representative subset of data points for performance estimation. Early approaches, such as simple random sampling or stratified sampling, have been widely adopted in numerous machine learning tasks to date. The use of simple randomisation (e.g., Knuth’s algorithm [15]) in data selection is very popular; however, it is susceptible to many sampling issues such as violation of statistical independence, bias, or covariate shift [22; 2]. Stratification is used to partition the dataset into homogeneous strata to ensure adequate representation of data points [8]. In computational learning theory, model performance and complexity have been formalised as factors of generalisation bounds according to Occam’s razor [26]. The holdout method [28] for data selection is commonly performed to estimate the predictive performance of a learner, and can be repeated multiple times to improve stability with less variance. Furthermore, modern datasets are typically associated with heterogeneous features, ambiguous evidence, and complex dependencies, thereby leading to concept drift in model performance [23]. Importance sampling by reweighting data points has been explored as a remedial mechanism [29; 21]. In recent years, many researchers have approached model drift and related dataset issues with ensemble learning [2; 23; 16; 1]. Multiple base models can be trained on blocks of data samples to address uncertainty by injecting model diversity, in the hope of maximising performance generalisation [17]. With recent advances in reinforcement learning [24; 5; 20], we explore how to sample informative data points that best generalise machine learning models with ensemble learning and model diversification.
This study aims to develop a novel approach that samples a dataset into relevant subsets for various machine learning tasks to achieve an optimum goal. This goal comprises task performance and model diversity, so as to maximise the candidate learning potentials with adequate allocation when searching for useful models and insights in subsequent machine learning processes. This paper formulates a sampling problem for sample-based model diversification.
Let denote the dataset of size , where are arbitrary inputs and are dependent outputs. We propose a data sampler to generate multiple subsets of the dataset with several properties.
First, we advocate the use of diverse K learners , including stable methods (e.g., support vector machines or regularised least square regression) and unstable methods (e.g., neural networks or decision trees). The goal is to find an optimal data sampler to maximise the ensemble learning potentials with diversity induced as the following:
(1) 
where is an ensemble learner of and is the criterion that measures the performance of .
We posit that the sampling procedure is stochastic, in which the allocation of samples to subsets follows parametric probability distributions:
(2) 
To maintain a sampling ratio , the third property of is described as the following:
(3) 
where is the mean of the probability distributions.
We assume that data samples are independent and identically distributed (i.i.d.). Hence, data subsets are representative of the true population with respect to statistical independence. We formulate the fourth property for each subset of size as below:
(4) 
where is the true distribution.
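The stochastic allocation and the sampling-ratio property described above can be illustrated with a minimal Python sketch. It assumes a two-way split (training vs. test) with one Bernoulli probability per sample; the names `sample_split`, `mean_probability`, and `target_ratio` are hypothetical and do not come from the RDS source code.

```python
import random

def sample_split(probs, rng):
    """Assign each sample to the training subset (1) or the test subset (0).

    probs[i] is the assumed probability that sample i goes to training.
    """
    return [1 if rng.random() < p else 0 for p in probs]

def mean_probability(probs):
    """Mean allocation probability, which should stay close to the target ratio."""
    return sum(probs) / len(probs)

rng = random.Random(0)
target_ratio = 0.8                 # assumed sampling ratio (80% training)
probs = [target_ratio] * 1000      # uninformed sampler: same probability everywhere
actions = sample_split(probs, rng)
train_idx = [i for i, a in enumerate(actions) if a == 1]
test_idx = [i for i, a in enumerate(actions) if a == 0]
```

A trained sampler would replace the constant `probs` with per-sample probabilities, while the mean probability is still expected to track the sampling ratio.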
This study searches for solutions that achieve optimum task performance; however, the problem is NP-hard, although it can be addressed by approximation. Therefore, we propose a reinforcement learning approach that learns how to sample by approximating solutions.
We propose the Reinforced Data Sampling (RDS) framework, based on the Markov Decision Process (MDP), to maximise the ensemble learning potentials. Without loss of generality, we discuss our approach for a data sampler that creates a training dataset and a test dataset.
RDS is a reinforcement learning framework, where an agent receives a data sample at each step, classifies which subset the sample belongs to, and interacts with an environment. As a result, a reward is given to the agent by the environment based on the outcome of its action. The agent reaches an optimum goal through its interactions with the environment to accumulate the maximum possible reward. The framework is described as a tuple with the following components:
• a finite set of states, where the decision process evolves sample by sample;
• the discrete action space of the agent;
• the reward set, where each reward is mapped from a state and an action;
• the transition from the current state to the next state.
The framework employs a stochastic policy which defines the probability of performing action by the agent given the state ; thus where the probability distribution is determined by the parameter according to Eq(2). Given policy , RDS starts from observing an initial state according to the probability distribution . At each step of interaction , it evolves according to:
(5) 
We denote as the trajectory of the RDS, where . The transition is deterministic as the agent moves from the current state to the next state according to the order of observations in the dataset. The optimisation problem in RDS is expressed by finding a good set of parameters to maximise the expected return:
(6) 
where is the finite-horizon undiscounted return computed based on the trajectory of steps.
In RDS, we investigate the use of supervised learning methods, which mimic the input-output process in nature. These function approximation methods may range from linear functions to decision trees or artificial neural networks. They receive data samples as observations of the state to predict values. In general, the objective function for machine learning is specified as . At the end of each episode, we apply our function approximation on the training dataset and the test dataset. Once the transition is terminated at step , we compute .
We utilise the policy gradient method to address the optimisation problem, in which policy weights are updated by stochastic gradient optimisation at the end of every episode as follows:
(7) 
where is the trajectory probability of .
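The episodic update can be sketched with a plain REINFORCE rule. The sketch below is an illustration under assumptions, not the authors' implementation: it swaps the recurrent policy for a two-parameter logistic policy on scalar inputs, and `toy_reward` is a hypothetical stand-in for training the learners on the sampled training subset and scoring them on the sampled test subset.

```python
import math
import random

def policy_prob(theta, x):
    """Probability of sending sample x to the training set (logistic policy)."""
    z = theta[0] + theta[1] * x
    return 1.0 / (1.0 + math.exp(-z))

def run_episode(theta, xs, rng):
    """Draw one action per sample, in dataset order, and accumulate grad log pi."""
    actions, grad = [], [0.0, 0.0]
    for x in xs:
        p = policy_prob(theta, x)
        a = 1 if rng.random() < p else 0
        actions.append(a)
        # For a Bernoulli-logistic policy, d/dtheta log pi(a|x) = (a - p) * [1, x]
        grad[0] += (a - p)
        grad[1] += (a - p) * x
    return actions, grad

def reinforce_step(theta, xs, reward_fn, rng, lr=0.01):
    """One episodic update: theta <- theta + lr * return * grad log P(trajectory)."""
    actions, grad = run_episode(theta, xs, rng)
    ret = reward_fn(xs, actions)
    new_theta = [t + lr * ret * g for t, g in zip(theta, grad)]
    return new_theta, ret

def toy_reward(xs, actions):
    """Placeholder return in [0, 1]; a real environment would fit and score learners."""
    hits = sum(1 for x, a in zip(xs, actions) if (x > 0) == (a == 1))
    return hits / len(xs)

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
theta = [0.0, 0.0]
for _ in range(20):
    theta, ret = reinforce_step(theta, xs, toy_reward, rng)
```

The return is computed once per episode, after every sample has been allocated, which mirrors the finite-horizon undiscounted setting described above.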
Once the policy has converged, we estimate a good approximation of our data sampler in Eq(1) as the following:
(8) 
where .
In general, the RDS process starts with sampling a training subset and a test subset of the dataset according to the policy . We apply the function approximation by training on the training set and evaluating on the test set to obtain an expected return . The policy is then updated accordingly and a new episode is started to reach the convergence of . Refer to the Appendices A.1 and A.2 for details of our design and algorithm.
The convergence of RDS is inherited from the convergence of the policy gradient method with function approximation [30]. The function approximation is designed with consistent parameter initialisation and hyperparameters; hence, the reward is fixed for each sampled dataset. We posit that the computational complexity of RDS is , where is the number of episodes for policy updating, is the size of the dataset, is the cost for state updating, and is the computational cost for function approximation from learners.
Our target is to train the agent to draw relevant samples with the policy to maximise the expected return , which entails performance potentials of function approximators. In this paper, we consider generalised function approximators such as linear estimators, decision trees, and neural networks, which are commonly useful in data challenges. Let be any arbitrary function learners; then the RDS process converges with the specified rewards [30].
We design the learning environment with the use of ensemble method of multiple function approximators to enrich model diversity for data selection by design; because each base model provides outcome reflecting multimodal belief [11]. Let denote the ensemble function approximator; hence we have , where is the number of base learning models for evaluating performance metrics. We fix the training procedure of the supervised learners, including parameter initialisation, model architecture, hyperparameters, and random seeds to ensure the same output from a given state for reproducibility. This paper investigates several reward mechanisms, including soft voting and stochastic choice, which are applicable for both classification and regression problems. In the soft voting approach, we define RDS using an ensemble approximator with the following value function:
(9) 
Thus, the environment is observed as deterministic with the reward .
In addition, we define a stochastic RDS process that depends on base models randomly picked from a stationary distribution at each epoch. This stochastic behaviour may help to escape local optima despite the noise it introduces. We define:
(10) 
The environment, therefore, is observed as stochastic with the reward .
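The two reward mechanisms can be contrasted in a short Python sketch. It reads soft voting as averaging the learners' predictions before scoring, and stochastic choice as scoring one learner drawn at random from a fixed distribution; both readings, the helper names, and the thresholded accuracy metric are assumptions made for illustration.

```python
import random

def soft_vote(predictions):
    """Average per-sample predictions of K learners; predictions[k][i] is learner k on sample i."""
    k = len(predictions)
    n = len(predictions[0])
    return [sum(p[i] for p in predictions) / k for i in range(n)]

def accuracy(y_true, scores, threshold=0.5):
    """Toy metric: fraction of samples whose thresholded score matches the label."""
    hits = sum(1 for y, s in zip(y_true, scores) if (s >= threshold) == (y == 1))
    return hits / len(y_true)

def soft_voting_reward(predictions, y_true, metric):
    """Deterministic reward: score the averaged ensemble prediction."""
    return metric(y_true, soft_vote(predictions))

def stochastic_choice_reward(predictions, y_true, metric, rng, weights=None):
    """Stochastic reward: score one learner drawn from a stationary distribution."""
    idx = rng.choices(range(len(predictions)), weights=weights, k=1)[0]
    return metric(y_true, predictions[idx])

preds = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.8]]   # three hypothetical base learners
labels = [1, 0]
det_reward = soft_voting_reward(preds, labels, accuracy)
sto_reward = stochastic_choice_reward(preds, labels, accuracy, random.Random(0))
```

With soft voting the same sampled split always yields the same reward, whereas the stochastic variant can return a different learner's score on each call.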
We argue that the choice of base models is crucial to achieving higher learning potentials with model diversity. In addition, preprocessing steps or pretrained feature mappings can be adopted in these learners to provide better representational abilities.
We implement the policy learning using the Gated Recurrent Unit (GRU) [3], with the feature size of the dataset . It is an intuitive choice, as the gated networks support data selection based on the sequence of samples, similar to the desired agent’s brain. During learning, the state at the step is encoded to create a hidden vector representation of . With the reset and update gates, the computation of is described as follows:
(11) 
A linear layer is adopted to derive the probability distribution of the action . Moreover, the policy is pretrained based on the sampling ratio to achieve faster convergence.
We use the log of the action probability as an equivalent loss function with the learning factor :
(12) 
We implement a regularisation loss to ensure the sampling ratio based on the distributional property of the action probability in Eq(3) as the following:
(13) 
where is the scale factor and is the sampling ratio.
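One plausible form of such a ratio regulariser is a scaled squared gap between the mean action probability and the target sampling ratio. The sketch below assumes that form; the exact expression of Eq(13) is not recoverable here, and `ratio_loss`, `scale`, and `ratio` are hypothetical names.

```python
def ratio_loss(train_probs, ratio, scale=1.0):
    """Penalise the gap between the mean training probability and the target ratio.

    train_probs[i] is the policy's probability of assigning sample i to training.
    The squared-error form is an assumption, not the paper's confirmed formula.
    """
    mean_p = sum(train_probs) / len(train_probs)
    return scale * (mean_p - ratio) ** 2

on_target = ratio_loss([0.8] * 10, ratio=0.8)
off_target = ratio_loss([0.5, 0.5], ratio=0.75, scale=2.0)
```

The penalty vanishes when the policy allocates samples at the requested ratio on average and grows as the allocation drifts away from it.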
In addition, we design a regularisation mechanism to ensure that training samples and testing samples are drawn from the same distribution, as described in Eq(4). This is important for both classification (e.g., class ratios) and regression (e.g., identically distributed). Given the probability density distributions of the training set and the testing set, we define:
(14) 
where is the scale factor and is the Kullback–Leibler divergence. In regression, we estimate the Kullback–Leibler divergence of continuous samples using Pérez-Cruz’s method [25].
The final loss for our policy optimisation is computed as:
(15) 
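For classification, the distributional regulariser can be sketched as a Kullback–Leibler divergence between the class distributions of the two subsets. The direction of the divergence (training relative to test), the small smoothing constant, and the function names are assumptions for illustration; the continuous estimator used for regression is not covered here.

```python
import math
from collections import Counter

def class_distribution(labels, classes, eps=1e-9):
    """Smoothed class frequencies, so that empty classes do not produce log(0)."""
    counts = Counter(labels)
    total = len(labels)
    return [(counts[c] + eps) / (total + eps * len(classes)) for c in classes]

def kl_divergence(p, q):
    """Discrete KL(p || q) for two distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def iid_loss(train_labels, test_labels, scale=1.0):
    """Scaled KL divergence between train and test class distributions."""
    classes = sorted(set(train_labels) | set(test_labels))
    p = class_distribution(train_labels, classes)
    q = class_distribution(test_labels, classes)
    return scale * kl_divergence(p, q)

balanced = iid_loss([0, 1, 0, 1], [0, 1])   # same class ratios in both subsets
skewed = iid_loss([0, 0, 0, 1], [0, 1])     # training subset over-represents class 0
```

A split that preserves class ratios gives a penalty near zero, which is consistent with the near-zero IID regularisation reported for a balanced dataset such as Madelon.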
In this section, we conduct experiments on four datasets to examine the effectiveness of the RDS method. It is done via evaluating model diversification reflected by the performance evaluated on proposed data samples by RDS in comparison with classical methods.
Madelon (MDL) [13] was artificially developed for the NIPS 2003 feature selection challenge. It has 500 numerical features, of which 20 are real features and 480 are distractors with no predictive capacity. Several preprocessing techniques were adopted to conceal the origin and patterns of the dataset in the search for functional feature extractors. We employ barebone Logistic Regression (LR), Random Forest (RF), and Multi-Layer Perceptron (MLP) models in our experiments. A pipeline of stability selection and Logistic Regression of feature interactions is adopted for public benchmarking [6].
Drug Review (DR) [12] provides patients’ reviews on specific drugs crawled from multiple online pharmaceutical review sites. It contains categorical features including drug name and patient condition, review text and date, and numerical features including review rating and useful counts. In total, there are 215,063 examples, which are split into a training set of 75% and a test set of 25%. In this experiment, we use three base learners: Ridge Regression (Ridge), Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN).
MNIST [19] consists of 70,000 handwritten digits and is one of the most well-known datasets in the deep learning community. MNIST is selected for our experiments because it is a representative multiclass image classification task. It is considered a balanced image classification dataset, and it is divided into 60K samples for training and 10K samples for testing. In this experiment, we use three base learners: Logistic Regression (LR), Random Forest (RF), and Convolutional Neural Network (CNN).
Kalapa Credit Scoring (KLP) [14] is a data challenge for the credit scoring task. The dataset consists of 30,000 training and 20,000 testing examples. It contains two labels (i.e., GOOD and BAD) associated with 62 variables, including demographics and financial status. The label distribution is imbalanced, with a ratio of 1.6% (i.e., only 486 BAD samples among 30,000 training samples). Moreover, 40 data fields have missing rates of more than 30%, which increases the difficulty of finding a good data selection for the data challenge. We consider three models, namely Logistic Regression (LR), Random Forest (RF), and Multi-Layer Perceptron (MLP), to investigate the effectiveness of the splitting methods. The first-ranked solution [18] is selected as the public model for comparisons.
Table 1: Madelon
Sampling | #Sample (Train) | #Sample (Test) | Class Ratio (Train) | Class Ratio (Test) | LR | RF | MLP | Ensemble | Public
Preset | 2000 | 600 | 1.0000 | 1.0000 | .6019 | .8106 | .5590 | .6783 | .9063
Random | 2000 | 600 | .9920 | 1.0270 | .5742 | .7729 | .5774 | .6453 | .9002
Stratified | 2000 | 600 | 1.0000 | 1.0000 | .5673 | .7470 | .6153 | .6360 | .8828
RDS | 2001 | 599 | 1.0375 | .9137 | .6192 | .8050 | .6228 | .6973 | .8915
RDS | 2021 | 579 | 1.0010 | .9966 | .6192 | .8050 | .6050 | .6947 | .9106
Table 2: Drug Review
Sampling | Train | Test | Ridge | MLP | CNN | Ensemble | Public
Preset | 161,297 | 53,766 | .4580 | .5787 | .7282 | .6660 | .7637
Random | 161,297 | 53,766 | .4597 | .4179 | .7353 | .6485 | .7503
RDS | 162,070 | 52,993 | .4646 | .5776 | .7355 | .6692 | .7649
RDS | 161,944 | 53,119 | .4647 | .5370 | .7509 | .6562 | .7600
The source code of RDS is implemented in PyTorch, whilst learning environments are built flexibly with various learning frameworks such as Keras, TensorFlow, or scikit-learn. Environmental learning models are optimised concurrently using a common evaluation metric. For the policy optimisation, the number of hidden units of the GRU is 256. The learning is run for 3400 episodes with the RMSprop optimiser and an initial learning rate of 0.001. Scaling factors are empirically selected, i.e., (1.0, 0.9, 0.1) for Madelon, (1.0, 1.0, 40) for Kalapa, (1.0, 0.1, 0.01) for MNIST, and (1.0, 0.9, 0.1) for Drug Review. For the KLP and DR datasets, we employ FastText [4] and BERT [10] language models to extract representations of textual content. All experiments are conducted on a similar computational environment with an Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 256GB RAM, and a Titan RTX 2080Ti GPU card.
We compare our proposed RDS approach with several traditional data sampling methods, including simple randomisation, denoted as Random, and stratification (only for classification), denoted as Stratified. We also include comparisons with the available splitting, denoted as Preset, which is provided either by the organisers of competitions or the authors of the datasets. Moreover, we select a number of prominent solutions shared by members of the public to examine the effects of various techniques on the datasets.
For the Madelon and Kalapa datasets, the tasks are binary classification; therefore, we use AUC to measure model performance. We employ the Micro-F1 metric for multiclass classification on the MNIST dataset. For the experiment on Drug Review, we use R-squared (R²) to measure model performance, as the task is regression.
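The three evaluation metrics can be written out in a few lines of plain Python. These are textbook definitions given as a sketch; the experiments presumably rely on library implementations, and the function names here are illustrative.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive is ranked above a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def micro_f1(y_true, y_pred):
    """Micro-averaged F1; for single-label multiclass tasks it equals accuracy."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_f1 = micro_f1([0, 1, 2, 2], [0, 2, 2, 2])
example_r2 = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

All three metrics are "higher is better", so any of them can serve directly as the return of an episode.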
Table 3: MNIST
Sampling | #Sample (Train) | #Sample (Test) | Class Ratio (Train) | Class Ratio (Test) | LR | RF | CNN | Ensemble | Public
Preset | 60000 | 10000 | .8571 | .1429 | .9647 | .9524 | .9824 | .9819 | .9917
Random | 59500 | 10500 | .8500 | .1500 | .9603 | .9465 | .9779 | .9768 | .9914
Stratified | 59500 | 10500 | .8500 | .1500 | .9625 | .9510 | .9795 | .9792 | .9901
RDS | 59938 | 10062 | .8562 | .1438 | .9495 | .9382 | .9757 | .9769 | .9927
RDS | 59496 | 10504 | .8499 | .1501 | .9583 | .9486 | .9851 | .9830 | .9931
Table 4: Kalapa Credit Scoring
Sampling | #Sample (Train) | #Sample (Test) | Class Ratio (Train) | Class Ratio (Test) | LR | RF | MLP | Ensemble | Public
Preset | 30000 | 20000 | .0165 | .0186 | .5799 | .5517 | .5635 | .5723 | .5953
Random | 30000 | 20000 | .0169 | .0179 | .5886 | .5374 | .5914 | .5856 | .6042
Stratified | 30000 | 20000 | .0173 | .0173 | .5952 | .5608 | .5780 | .5983 | .6014
RDS | 29999 | 20001 | .0180 | .0163 | .6045 | .5350 | .5802 | .6057 | .5362
RDS | 30031 | 19969 | .0172 | .0174 | .5997 | .5491 | .6354 | .6072 | .6096
The results demonstrate that our proposed RDS approach with various reward mechanisms works steadily on the four datasets. Figure 1 depicts the learning dynamics of RDS, in which the regression line is highlighted in red to indicate the improvement over time of the designed agent with diversification of multiple base models. Given a finite number of episodes, RDS establishes desirable optimisation behaviours regularised by the sampling assumptions of the problem. Likewise, Figure 2 illustrates the learning dynamics of the stochastic reward mechanism, in which fewer approximations are exhibited in the model performance on all datasets. The results show better optimisation of the learning gradients with this simple yet efficient method.
In detail, RDS yields good ensemble performance, which is the quantity it directly optimises. This upward trend can be clearly observed across all datasets. RDS also outperforms the baselines for several base learners, notably the LR model on Madelon (Table 1), the CNN models on DR (Table 2) and MNIST (Table 3), as well as on KLP (Table 4). Amongst the baselines, Stratified has the strength of maintaining class ratios for classification tasks, which can also be maintained by the proposed RDS methods. The Preset splitting, given by the competition organisers or authors, appears to be either Random or Stratified. Thus, it obtains performance comparable to randomisation and stratification but generally worse than the RDS variants. Although the preset allocation performs well in some settings, adequate performance of RDS is consistently observed in both ensemble evaluation and public benchmarking. The stochastic choice mechanism gains some advantages over the previously designed algorithms. Moreover, the assumption of statistical independence has a critical impact on the learning of the agent, which must be considerably regularised for imbalanced datasets. See Appendix A.4 for experiment notes.
Trainable data sampling for model diversification achieves good performance based on ensemble learning and publicly available solutions; thus, higher learning potentials are yet to be explored.
This paper proposes the Reinforced Data Sampling (RDS) method, which learns to select representative samples. The objective is to emphasise model diversification by maximising the learning potentials of various base learners. We introduce different reward mechanisms, including soft voting and stochastic choice, to train an optimal sampling policy under a reinforcement learning framework. Experiments conducted on four datasets highlight the benefits of using RDS over classical sampling approaches. Moreover, RDS’s sampling approach is configurable and can be applied to many different types of data and models.
This research is a fundamental step in advancing information processing that may elevate many tasks in machine learning. Our proposed Reinforced Data Sampling (RDS) approach may bring meaningful changes to the research community and related industries. In practice, we advocate the use of RDS over popular selection methods, such as simple randomisation, stratification, or holdout, in classification and regression. Promoting optimum sampling with model diversity may also have far-reaching impacts on the search for useful models and insights in diverse venues, including worldwide AI challenges and large-scale research projects. During our research, we contacted multiple participants and winners of recent data challenges with sizable monetary prizes; improper data selection with concept drift was the key issue causing the waste of vast amounts of time and computational power. On average, each AI competition attracts hundreds to thousands of individuals or teams with enormous resources. The adoption of our framework, therefore, has potential environmental impacts by minimising the possible loss from excessive experimentation and lost productivity. In addition, model diversification will also be beneficial for researchers, competition organisers, and large companies seeking to reach the maximal potential of data and models.
Improving generalization with active learning. Machine Learning 15 (2), pp. 201–221.
Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning. In Proceedings of the 2018 International Conference on Digital Health, pp. 121–125.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4707–4714.
This section covers the supplementary information for our approach and experiments as follows:
Overall Process of Reinforced Data Sampling for Model Diversification
RDS Algorithms
Datasets
Detailed Experiment Specifications and Notes
The RDS process starts with the initialisation of the sampling policy with an initial distribution based on the sampling ratio in Eq(3). During an episodic run, an action is drawn from the policy for each state along the trajectory of steps. The environment employs base learners to handle data samples based on the given action and transitions to the next step. RDS stores the approximated values of the data samples in a replay memory, which is used to compute the ensemble return for the policy update.
This subsection describes the algorithms for two variants of the RDS as the following.
In this paper, four datasets are selected for the analysis of the effectiveness of RDS. They cover a range of machine learning tasks, including binary classification, multiclass classification and regression. Our study simulates the way that the datasets are prepared for the tasks, given no prior knowledge on existing or public solutions.
We employ barebone base models with minimal settings to conduct experiments on the datasets, which mostly originate from data science or AI challenges. The existing train/test subsets of the datasets are obtained from their public websites, forums, or emails.
The primary purpose is to demonstrate the generalisability of RDS by applying base models for data sampling and evaluating the data samples with ensemble learning and publicly available solutions. We note that the existing solutions may have been designed to fit the preset allocation of the datasets. For example, hyperparameters or handcrafted rules may have been explicitly fine-tuned to the published data samples to perform highly in the competitions. Nevertheless, it is intriguing to examine the effects of RDS using existing public solutions, knowing that the learning potentials of trainable data samples may go even further.
Dataset | Task | Challenge | Size of Data | Evaluation | Year
MADELON | Binary Classification | NIPS 2003 Feature Selection | (multivariate) | AUC | 2003
DR | Regression | Drug Reviews (Kaggle Hackathon) | (multivariate, text) | R² | 2018
MNIST | Multiclass Classification | Hand Written Digit Recognition | (image) | Micro-F1 | 1998
KLP | Binary Classification | Kalapa Credit Scoring | (multivariate, text) | AUC | 2020
This subsection describes the specifications and notes of our experiments in detail.
Experiment  Notes 

Background  Madelon was artificially developed for the NIPS 2003 feature selection challenge. It has 500 numerical features, in which 20 real features and 480 distractors have no predictive capacity. Several preprocessing techniques were previously adopted to conceal the origin and patterns of the dataset on the search for functional feature extractors. 
Settings  (1.0, 0.9, 0.1) 
Base Models  • LR  Logistic Regression (solver='liblinear', penalty='l2', random_state=123) 
• RF  Random Forest (n_estimators=128, random_state=123)  
• MLP  Multilayer Perceptron (Adam, lr=1e-3, manual_seed=123)  
• SVC  Support Vector Classifier (kernel='rbf', coef0=1)  
Benchmark  A pipeline of stability selection and Logistic Regression of feature interactions is adopted for public benchmarking [6] 
Preprocessing  No preprocessing needed 
Run time  90s / epoch 
Observations  
Learning Dynamics as shown in Figure 1(a) and Figure 2(a)  
• LR has performed consistently on the Madelon dataset. There were some low performance points at the beginning; however, the upward trend can be observed during the agent learning.  
• Both LR and MLP are less stable with lower performance than RF.  
• The ensemble performance falls between the range of the performance of three base models. This observation indicates that there are disagreements among classifiers, which hints that model diversity is injected into model performance.  
• The regularisations serve as important mechanisms for optimum allocation.  
• The class ratios of the dataset are balanced; hence, the IID regularisation approaches zero.  
• Experiments on SVM yield similar observations at higher computational cost; thus, SVM results are not included in our report.  
Comparison between the deterministic and stochastic RDS variants  
• Consistent results are observed on both deterministic and stochastic reward mechanisms.  
• The agent learning has become stable within the first 30 episodes.  
• The stochastic choice shows the same learning dynamics with a noticeably smaller number of value approximations.  
Comparison between RDS and other methods as shown in Table 1  
• The preset selection shows the highest performance on RF, which may hint that the dataset was prepared with a holdout based on RF. However, it has the worst performance on MLP; therefore, it may have hindered neural network-based solutions during the challenge.  
• Both random and stratified sampling techniques show suboptimal performance in all evaluation metrics.  
• RDS has a good balance amongst all three base models. The performance of RDS on RF is slightly lower than the performance of the preset selection. However, it yields higher results on LR, MLP, and the ensemble.  
• RDS has achieved the highest performance in the ensemble use of three base models.  
• Preset and stratified allocation have a perfect sampling ratio.  
Public Benchmarking  
• The public solution shows higher performance than the bare-bones ensemble of three base models. The evaluation metrics are in agreement across multiple sampling techniques, except RDS.  
• RDS ranks top in public benchmarking.  
Summary  
• The Madelon experiment highlights the effectiveness of reinforced sampling for model diversification.  
• The model diversity is observed based on the ensemble and performance of various base models.  
• The regularisations play an important role in optimum sampling. 
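The soft-voting ensemble and the AUC metric referred to in the Madelon notes can be illustrated with a minimal sketch. It assumes that each base model outputs a positive-class probability per test sample; the probabilities, labels, and function names below are hypothetical toy values and helpers, not results or code from the experiments.

```python
from statistics import mean


def soft_vote(prob_lists):
    """Average positive-class probabilities across models, sample by sample."""
    return [mean(sample_probs) for sample_probs in zip(*prob_lists)]


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and model outputs (assumed toy data).
y_test = [1, 0, 1, 0, 1, 0]
p_lr = [0.6, 0.4, 0.3, 0.5, 0.7, 0.2]
p_rf = [0.9, 0.2, 0.8, 0.3, 0.6, 0.4]
p_mlp = [0.5, 0.6, 0.7, 0.4, 0.4, 0.3]

p_ens = soft_vote([p_lr, p_rf, p_mlp])
for name, probs in [("LR", p_lr), ("RF", p_rf), ("MLP", p_mlp), ("Ensemble", p_ens)]:
    print(name, round(auc(y_test, probs), 3))
```

Comparing the per-model scores with the ensemble score on a given split is one simple way to see whether the base models disagree on that split.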
Experiment  Notes 

Background  Drug Review (DR) [12] provides patients’ reviews on specific drugs crawled from multiple online pharmaceutical review sites. It contains categorical features including drug name and patient condition, review text and date, and numerical features including review rating and useful counts. In total, there are 215,063 examples, which are split into a training set of 75% and a test set of 25%. 
Settings  (1.0, 0.9, 0.1) 
Base Models  • Ridge  Ridge Regression (solver=‘sage’, random_state=2020) 
• MLP  Multilayer Perceptron (Adam,lr=1e-3,manual_seed=2020)  
• CNN  Convolutional Neural Network (Adam,lr=1e-3,manual_seed=2020)  
Benchmark  A public solution on Kaggle (https://www.kaggle.com/stasian/predictingreviewscoresusingneuralnetworks) using a two-layer bidirectional LSTM with Bahdanau attention pooling before the prediction. 
Preprocessing  Ridge and MLP use average-pooled word embeddings from the BERT-Base model (768 dimensions), while the CNN word embeddings are initialised from pretrained word2vec (300 dimensions). 
Run time  1200s / epoch 
Observations  
Learning Dynamics as shown in Figure 1(d) and Figure 2(d)  
• The overall trend is improving for all models, which is characterised by the average performance (coloured in red).  
• Ridge has performed consistently on the DR dataset. However, its performance is still the lowest, as Ridge is the simplest of the three base models.  
• MLP’s performance is the most unstable. We argue that this is due to the complexity of the task and that MLP is easily trapped in local minima of the optimisation space.  
• With the highest modelling capacity, CNN shows the best performance among the base models. CNN is more stable than MLP, though not as stable as Ridge.  
• The sampling ratio stabilises after several episodes and converges over time.  
Comparison between the deterministic and stochastic RDS variants  
• The optimisation is stable on both deterministic and stochastic reward mechanisms.  
• Sampling ratio has become stable within the first 30 episodes. Both converge to the expected ratio (0.75) over time.  
• IID regularisation is preserved better with deterministic reward mechanism.  
Comparison between RDS and other methods as shown in Table 2  
• The preset selection shows the highest performance on MLP, which may hint that the dataset was prepared with a holdout based on an MLP model. However, it has the worst performance on Ridge and CNN.  
• Random sampling technique shows suboptimal results in all comparisons. Notably, it yields the worst performance with the public solution.  
• RDS has a good balance amongst all three base models. The performance of RDS variants on MLP is slightly lower than the performance of the preset selection. However, it yields higher results in Ridge, CNN, and Ensemble.  
• RDS achieves the best performance for Ridge and CNN, while RDS achieves the highest results in the ensemble as well as on the public solution. This suggests the effectiveness of using RDS techniques for data sampling to capture full model potentials.  
Public Benchmarking  
• The public solution shows higher performance than the bare-bones ensemble of three base models. This is consistent across all sampling strategies.  
• The best performance is achieved by RDS, which agrees with the ensemble as well.  
Summary  
The experimental results on the Drug Review dataset suggest the effectiveness of reinforced sampling for model diversification. The observations are also consistent with the other experiments. 
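The average pooling mentioned in the Drug Review preprocessing can be sketched as follows. The sketch assumes that each review has already been converted into a list of per-token embedding vectors; the 4-dimensional toy vectors and the helper name `average_pool` are hypothetical stand-ins for the 768-dimensional embeddings used in the experiment.

```python
def average_pool(token_vectors):
    """Collapse equal-length token vectors into one vector by averaging each dimension."""
    if not token_vectors:
        raise ValueError("cannot pool an empty review")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must share one dimension")
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


# Hypothetical token embeddings for a three-token review (assumed toy data).
review_tokens = [
    [0.2, -0.1, 0.0, 0.5],
    [0.4, 0.3, -0.2, 0.1],
    [0.0, 0.1, 0.2, 0.3],
]

review_vector = average_pool(review_tokens)
print(review_vector)
```

The pooled vector has a fixed length regardless of review length, which is what allows models with fixed-size inputs such as Ridge and MLP to consume free text.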
Experiment  Notes 

Background  MNIST [19] consists of 70,000 handwritten digits and is one of the most well-known datasets in the deep learning community. MNIST is selected for experiments since it is well representative of the multiclass classification task on images. MNIST is considered a balanced image classification dataset, and it is divided into 60K samples for training and 10K samples for testing. In this experiment, we use three base learners, including Logistic Regression (LR), Random Forest (RF), and Convolutional Neural Network (CNN). 
Settings  (1.0, 0.9, 0.1) 
Base Models  • LR  Logistic Regression (solver=‘lbfgs’) 
• RF  Random Forest (n_estimators=50)  
• CNN  Convolutional Neural Network (Adam, lr=0.01)  
Benchmark  A high score solution on Kaggle for MNIST classification task [27]. 
Preprocessing  We extract Histogram of Oriented Gradients (HOG) features for the policy learner, LR, and RF. To reduce the dimensionality, we apply PCA with n_components of 0.95, meaning that the number of components is selected such that the amount of explained variance is greater than the specified percentage (i.e., 95%). The CNN model runs directly on RGB codes with normalisation. 
Run time  On average 88s / epoch 
Observations  
Learning Dynamics as shown in Figure 1(c) and Figure 2(c)  
• LR, RF, and CNN have performed consistently on the MNIST dataset. An upward trend can be observed during the agent learning across the three base models.  
• The class ratios of the dataset are balanced, as indicated by the IID regularisation approaching zero.  
Comparison between the deterministic and stochastic RDS variants  
• The optimisation process is stable and improves under both deterministic and stochastic reward mechanisms.  
• CNN achieves the highest performance under both RDS settings. The steady improvement of CNN suggests that the selected samples are well represented and that the model does not face a class-imbalance issue.  
• The stochastic reward mechanism has the same learning dynamics and achieves better performance in a shorter time.  
Comparison between RDS and other methods as shown in Table 3  
• The preset selection achieves the highest performance on RF. This might hint that the preset selection process was based on RF.  
• The stratified setting achieves the highest performance on LR. This shows that stratified splitting is a strong baseline for balanced data.  
• Both random sampling and RDS show suboptimal performance on all evaluation metrics. This hints that they might suffer from an imbalanced selection of samples between training and testing. For RDS, majority voting was used as the reward mechanism, which might be a reason for the diversity problem.  
Public Benchmarking  
• The public solution performs best in comparison to the ensemble of three base models. The same evaluation metric is used for all sampling techniques.  
• RDS achieves the best performance in public benchmarking.  
Summary  
• The MNIST experiment shows that the proposed approach can perform effectively on an image classification task for model diversification. The preset split of the MNIST dataset was well prepared in terms of both sample similarity and class balance. Nevertheless, the proposed RDS approach can select a better split, as the public solution achieves better performance on it than on the other splitting methods. 
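The variance-based choice of the number of PCA components described in the MNIST preprocessing can be sketched without any library. The sketch assumes the per-component explained variances are already known; the variances below and the helper name `components_for_variance` are hypothetical, and the strict "greater than" comparison follows the wording of the preprocessing note.

```python
def components_for_variance(explained_variances, threshold=0.95):
    """Return the smallest k whose top-k components explain more than `threshold` of the variance."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    ordered = sorted(explained_variances, reverse=True)
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total > threshold:
            return k
    # Reached only if rounding keeps the ratio at or below the threshold.
    return len(ordered)


# Hypothetical per-component variances of some feature matrix (assumed toy data).
variances = [5.0, 3.0, 1.0, 0.6, 0.3, 0.1]
print(components_for_variance(variances, threshold=0.95))
```

Passing a fraction rather than a fixed count lets the retained dimensionality adapt to how concentrated the variance of the extracted features is.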
Experiment  Notes 

Background  KLP was provided in the Kalapa credit scoring challenge [14]. It contains 50,000 profiles associated with good or bad labels. Each profile has 62 demographic and financial features. Originally, KLP is separated into 30,000 training and 20,000 testing examples with a strong class imbalance, i.e., only approximately 1.6% of the total number of profiles are labelled as ’good’. As another serious issue, over 40 fields have more than 30% missing values. 
Settings  (1.0, 1.0, 40) 
Base Models  • LR  Logistic Regression (solver=‘liblinear’, random_state=123) 
• RF  Random Forest (n_estimators=64, random_state=123)  
• MLP  Multilayer Perceptron (Adam, lr=1e-3, manual_seed=123)  
Benchmark  The 1st rank solution using Random Forest with WOE binning and 767 estimators [18] 
Preprocessing  Three text fields, namely ’Province’, ’Job’, and ’District’, are encoded as average-pooled word embeddings from a fine-tuned fastText model of 32 dimensions. Other fields are processed with traditional feature engineering techniques (e.g., MinMaxScaler, OrdinalEncoder, dummy variables). 
Run time  On average 18s / epoch 
Observations  
Learning Dynamics as shown in Figure 1(b) and Figure 2(b)  
• LR is quite stable with higher performance compared against RF and MLP.  
• RF is not always the most effective model for the classification task with the imbalanced and missing data problems.  
• MLP shows better performance when it is not constrained by the deterministic soft-voting reward mechanism.  
• Regularisations work properly to force the sampling ratio to converge to the expected value and to reduce the label distribution difference between the training and testing sets.  
Comparison between the deterministic and stochastic RDS variants  
• There is a consistent upward trend on classification performance of the ensemble model for both reward mechanisms.  
• The models are less constrained in the learning process with the stochastic reward mechanism; hence, they have better performance.  
• The sampling ratio and IID regularisation are fluctuating within the first 30 episodes. Subsequently, they are more stable and converged after 450 episodes. A similar trend is observed for both reward mechanisms.  
Comparison between RDS and other methods as shown in Table 4  
• The best performance with the preset selection method is observed on LR. It implies that LR was the base model used to split the data in the credit scoring challenge.  
• Both random and stratified sampling methods show suboptimal performance on all evaluation metrics.  
• RDS variants outperform traditional sampling methods for almost all individual and ensemble models. The performance of RDS on RF is slightly lower than the performance of the stratified selection due to the solid dependence of RS on feature distributions.  
Public Benchmarking  
• The public solution results in better performance compared to the bare-bones ensemble of three base models across all sampling methods, except RDS.  
• The best performance is achieved by RDS.  
Summary  
Experiments on KLP affirm the advantages of the reinforced sampling method for model diversification. With appropriate regularisation settings, the proposed method can help to effectively control sampling constraints, even if there are serious imbalanced and missing data problems. 
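The two quantities that the regularisations in the KLP notes are said to control, the sampling ratio and the label distribution difference between the training and testing sets, can be measured with a short sketch. The exact regularisation terms are not given in this section, so the absolute ratio gap and the total variation distance used below are assumed stand-ins, and all names and toy labels are hypothetical.

```python
from collections import Counter


def sampling_ratio_gap(n_train, n_total, target_ratio):
    """Absolute gap between the realised training share and the expected ratio."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return abs(n_train / n_total - target_ratio)


def label_distribution_gap(train_labels, test_labels):
    """Total variation distance between train and test label distributions (0 = identical)."""
    if not train_labels or not test_labels:
        raise ValueError("both label lists must be non-empty")
    train_counts, test_counts = Counter(train_labels), Counter(test_labels)
    classes = set(train_counts) | set(test_counts)
    return 0.5 * sum(
        abs(train_counts[c] / len(train_labels) - test_counts[c] / len(test_labels))
        for c in classes
    )


# Hypothetical imbalanced split: 60 training and 40 testing profiles (assumed toy data).
train = [1] * 2 + [0] * 58
test = [1] * 1 + [0] * 39

print(sampling_ratio_gap(len(train), len(train) + len(test), target_ratio=0.6))
print(label_distribution_gap(train, test))
```

Values near zero for both measures would indicate that a sampled split respects the expected train/test ratio and keeps the rare class similarly represented on both sides.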