Reinforced Data Sampling for Model Diversification

06/12/2020 ∙ by Hoang D. Nguyen, et al. ∙ Umeå universitet, Singapore Management University, University of Glasgow, VNU

With the rising number of machine learning competitions, the world has witnessed an exciting race for the best algorithms. However, the involved data selection process may fundamentally suffer from evidence ambiguity and concept drift issues, thereby possibly leading to deleterious effects on the performance of various models. This paper proposes a new Reinforced Data Sampling (RDS) method that learns how to sample data adequately in the search for useful models and insights. We formulate the optimisation problem of model diversification δ-div in data sampling to maximise learning potentials and optimum allocation by injecting model diversity. This work advocates the employment of diverse base learners, such as neural networks, decision trees, or logistic regressions, as value functions to reinforce the selection of data subsets with multi-modal belief. We introduce different ensemble reward mechanisms, including soft voting and stochastic choice, to approximate the optimal sampling policy. The evaluation conducted on four datasets clearly highlights the benefits of using the RDS method over traditional sampling approaches. Our experimental results suggest that trainable sampling for model diversification is useful for competition organisers, researchers, and even newcomers pursuing the full potential of various machine learning tasks such as classification and regression. The source code is available at https://github.com/probeu/RDS.

1 Introduction

Data sampling is the process of selecting subsets of data points from a larger dataset for analysis and reporting. In machine learning, it is a fundamental step to ensure that learning methods generalise adequately to new observations. However, classical data sampling techniques like randomisation are susceptible to detrimental issues in learning performance, including concept drift and evidence ambiguity [2]. Consider a machine learning task such as classification or regression, in which a model is fitted to training samples with the aim of maximising task performance. The model concept, or the mapping from data observations to outputs, may drift, rendering a sub-optimal fit to new data points due to hidden context changes or data-related problems. Often, there is simply insufficient or inappropriate information in the data samples to support adequate predictive strength, which is also known as evidence ambiguity [20]. Hence, the nexus between data points and model outputs plays a crucial role in data selection to improve generalisation and to mitigate such issues [31]. This paper describes a new sampling method, named Reinforced Data Sampling (RDS), which learns how to sample data effectively, predicated on base models, in the search for useful models and insights. We focus on the ensemble use of multiple models to enhance the representational ability of the data and to select subsets of data points according to future learning potentials [11].

Models differ in their strengths and weaknesses; thus, exploiting their disagreements is a useful mechanism for improving learning performance. Sample-based model diversification has been proven to be an effective ensemble strategy in machine learning [7]. Given a set of learners, diverse ensembles can be derived from the collective behaviours of their members [20]. As typical learners are apt to mode-seeking behaviours, sample-based randomisation can be used to inject model diversity and gain complementary information from multiple learners [7]. Therefore, learning how to sample with consideration of task performance and model diversity is appealing for many applications in machine learning.

In this paper, we formulate a sampling problem for model diversification, in which a data sampler is trained to generate subsets of the larger dataset. Moreover, we advocate the use of diverse learners in our method to promote model diversity by design, including stable learning methods (e.g., support vector machines, regularised least squares regression) and unstable learning methods (e.g., neural networks or decision trees) [1; 23]. Our approach entails reinforcement learning over observable evidence in the dataset to approximate the parameters of our sampler with ensemble value functions. We propose several novel ensemble reward mechanisms, namely soft voting and stochastic choice. Furthermore, our method is designed to achieve proportionate allocation via regularisation of distributional properties.

We evaluate the RDS approach using four datasets: the NIPS 2003 Feature Selection Challenge (Madelon), the Kaggle Hackathon on Drug Reviews, MNIST, and the Kalapa Credit Scoring Challenge. Our experiments cover a range of machine learning tasks, such as binary classification, multi-class classification, and regression, on multivariate, textual, and visual data. The results clearly highlight the performance benefits of trainable data sampling over classical or preset data selection.

In data challenges, AI and data scientists have witnessed the deleterious effects of sub-optimal data preparation in large-scale competitions. Such preparation limits the machine learning community in analysis and reporting, thereby restricting innovation and useful outcomes. In everyday settings, the same phenomenon may happen at the early stage of machine learning tasks. Therefore, we suggest our trainable sampling method for model diversification as a viable alternative to classical methods.

Contributions.

Firstly, this paper introduces Reinforced Data Sampling (RDS), a method to approximate optimum sampling for model diversification with ensemble rewarding to attain maximal machine learning potentials. A novel stochastic choice rewarding mechanism is developed as a viable means of injecting model diversity in reinforcement learning. Secondly, we implement an end-to-end framework for trainable data sampling, which can easily be adopted in the early stages of machine learning tasks, including classification and regression. Thirdly, we conduct comprehensive experiments comparing RDS against traditional data splitting methods on real-world datasets with various tasks. The results suggest that RDS is an effective method for data sampling with the objective of achieving high model diversification.

2 Related Work

In machine learning, generalisation, or the ability to adapt to new, previously unseen observations, plays a vital role in creating useful models [9]. It entails the process of data sampling, which is employed to select and manipulate a representative subset of data points for performance estimation. Early approaches, such as simple random sampling or stratified sampling, have been widely adopted in numerous machine learning tasks to date. The use of simple randomisation (e.g., Knuth's algorithm [15]) in data selection is widely popular; however, it is susceptible to many sampling issues such as violation of statistical independence, bias, or covariate shift [22; 2]. Stratification is used to partition the dataset into homogeneous strata to ensure the adequate representation of data points [8].

In computational learning theory, model performance and complexity have been formalised as factors in generalisation bounds according to Occam's razor [26]. The hold-out method [28] for data selection is commonly performed to estimate the predictive performance of a learner, and it can be repeated multiple times to improve stability with less variance. Furthermore, modern datasets are typically associated with heterogeneous features, ambiguous evidence, and complex dependencies, thereby leading to concept drift in model performance [23]. Importance sampling by reweighting data points has been explored as a remedial mechanism [29; 21]. In recent years, many researchers have approached model drift and related dataset issues with ensemble learning [2; 23; 16; 1]. Multiple base models can be trained on blocks of data samples to address uncertainty by injecting model diversity, in the hope of maximising performance generalisation [17]. With recent advances in reinforcement learning [24; 5; 20], we explore how to sample informative data points that best generalise machine learning models with ensemble learning and model diversification.

3 Proposed Framework

This study aims to develop a novel approach to sample a dataset into relevant subsets for various machine learning tasks to achieve an optimum goal. The goal combines task performance and model diversity to maximise candidate learning potentials with adequate allocation, in the search for useful models and insights in subsequent machine learning processes. This paper formulates a sampling problem for sample-based model diversification.

3.1 Problem Formulation

Let the dataset consist of N samples, each pairing an arbitrary input with a dependent output. We propose a data sampler that generates multiple subsets of the dataset satisfying several properties.

First, we advocate the use of K diverse learners, including stable methods (e.g., support vector machines or regularised least squares regression) and unstable methods (e.g., neural networks or decision trees). The goal is to find an optimal data sampler that maximises the ensemble learning potential with induced diversity, as follows:

(1)

where the ensemble learner is built from the K base learners and a criterion measures its performance on the sampled subsets.

We posit that the sampling procedure is stochastic, in which the allocation of samples to subsets is governed by parametric probability distributions with learnable parameters.

(2)

To maintain a target sampling ratio, the third property of the data sampler is described as follows:

(3)

where the mean of the parametric probability distributions is constrained to match the sampling ratio.

We assume that data samples should be independent and identically distributed (i.i.d.). Hence, the data subsets are representative of the true population with respect to statistical independence. We formulate the fourth property for each subset as follows:

(4)

where the comparison is made against the true data distribution.

This study searches for samplers that achieve optimum task performance; the problem is NP-hard but admits approximate solutions. Therefore, we propose a reinforcement learning approach that learns how to sample by approximating the optimal solution.
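To make the first properties concrete, the following minimal sketch (our own illustration; the array size, target ratio, and variable names are hypothetical rather than the paper's notation) draws a stochastic train/test allocation from per-sample Bernoulli probabilities whose mean matches the sampling ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000                      # hypothetical dataset size
target_ratio = 0.75           # desired share of training samples

# One Bernoulli parameter per data point; initialised at the target ratio here,
# whereas in RDS these probabilities would be produced by the learned policy.
probs = np.full(n, target_ratio)

# Stochastic allocation: 1 -> training subset, 0 -> test subset.
assignment = rng.binomial(1, probs)
train_idx = np.flatnonzero(assignment == 1)
test_idx = np.flatnonzero(assignment == 0)

# The empirical ratio fluctuates around the target, matching the ratio property.
print(len(train_idx) / n)
```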

3.2 Reinforced Data Sampling (RDS)

We propose the Reinforced Data Sampling (RDS) framework, based on a Markov Decision Process (MDP), to maximise the objective in Eq(1). Without loss of generality, we discuss our approach with the data sampler creating one training dataset and one test dataset.

RDS is a reinforcement learning framework in which an agent receives a data sample at each step, classifies which subset the sample belongs to, and interacts with an environment. As a result, a reward is given to the agent by the environment based on the outcome of its action. The agent reaches an optimum goal through its interactions with the environment by accumulating the maximum possible rewards. The process is described as a tuple of the following components:

  • A finite set of states, in which the decision process evolves sample by sample.

  • The discrete action space of the agent, i.e., assigning the current sample to the training subset or the test subset.

  • The reward set, in which a reward is mapped from a state and an action.

  • The transition from the current state to the next state, which is deterministic and follows the order of observations in the dataset.

The framework employs a stochastic policy that defines the probability of the agent performing an action given the state, where the probability distribution is determined by the parameters according to Eq(2). Given the policy, RDS starts from an initial state drawn according to its probability distribution and, at each step of interaction, evolves according to:

(5)

We denote the sequence of visited states and selected actions as the trajectory of the RDS process. The transition is deterministic, as the agent moves from the current state to the next state according to the order of observations in the dataset. The optimisation problem in RDS is expressed as finding a set of policy parameters that maximises the expected return:

(6)

where the expected return is the finite-horizon undiscounted return computed over the trajectory of steps.

In RDS, we investigate the use of supervised learning methods, which mimic the input-output process in nature. These function approximation methods may range from linear functions to decision trees or artificial neural networks. They receive data samples as observations of the state to predict output values. In general, the objective function for each machine learning task is specified by its evaluation criterion. At the end of each episode, we apply our function approximators to the sampled training dataset and test dataset. Once the transition is terminated at the final step, we compute the episode reward.

We utilise the policy gradient method to address the optimisation problem, in which the policy weights are updated by stochastic gradient optimisation at the end of every episode as follows:

(7)

where the gradient is taken over the probability of the trajectory under the current policy.
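For reference, a hedged reconstruction of the standard forms these quantities take in REINFORCE-style policy gradient methods; the symbols below are generic placeholders rather than the paper's original notation:

```latex
% Trajectory return (finite-horizon, undiscounted), objective, and its gradient
R(\tau) = \sum_{t=1}^{T} r_t, \qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right], \qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t=1}^{T}
    \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].
```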

Once the policy has converged, we estimate a good approximation of our data sampler in Eq(1) as follows:

(8)

where each sample's final subset assignment is taken from the converged policy.

In general, the RDS process starts by sampling a training subset and a test subset of the dataset according to the policy. We apply the function approximators by training on the training set and evaluating on the test set to obtain an expected return. The policy is then updated accordingly, and a new episode is started, until the policy converges. Refer to Appendices A.1 and A.2 for details of our design and algorithms.

The convergence of RDS is inherited from the convergence of the policy gradient method with function approximation [30]. The function approximation is designed with consistent parameter initialisation and hyper-parameters; hence, the reward is fixed for each sampled dataset. We posit that the computational complexity of RDS scales with the number of episodes for policy updating and the size of the dataset, with a per-sample cost for state updating and a per-episode cost for function approximation by the K learners.

3.3 Reward Mechanisms

Our target is to train the agent to draw relevant samples with the policy so as to maximise the expected return, which reflects the performance potential of the function approximators. In this paper, we consider generalised function approximators such as linear estimators, decision trees, and neural networks, which are commonly useful in data challenges. For any arbitrary function learners, the RDS process converges with the specified rewards [30].

We design the learning environment around an ensemble of multiple function approximators to enrich model diversity for data selection by design, because each base model provides an outcome reflecting a multi-modal belief [11]. The ensemble function approximator is built from K base learning models whose performance metrics are evaluated. We fix the training procedure of the supervised learners, including parameter initialisation, model architecture, hyper-parameters, and random seeds, to ensure the same output from a given state for reproducibility. This paper investigates several reward mechanisms, including soft voting and stochastic choice, which are applicable to both classification and regression problems. In the soft voting approach, we define RDS using an ensemble approximator with the following value function:

(9)

Thus, the environment is observed as deterministic, with the reward given by this ensemble value function evaluated on the sampled test set.

In addition, we define a stochastic RDS process that depends on a base model randomly picked from a stationary distribution at each episode. This stochastic behaviour is desirable as it may help escape local optima despite the noise introduced. We define:

(10)

The environment, therefore, is observed as stochastic, with the reward given by the randomly chosen base learner.

We argue that the choice of base models is crucial to achieving higher learning potentials with model diversity. In addition, pre-processing steps or pre-trained feature mappings can be adopted in these learners to provide better representational abilities.
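A minimal sketch of the two mechanisms is given below; the learner/metric interface (`fit`, `predict_proba`, and a `metric(y_true, y_pred)` callable) is our own simplification and not the paper's code:

```python
import random
import numpy as np

def soft_voting_reward(models, X_train, y_train, X_test, y_test, metric):
    """Deterministic reward: the metric of the soft-voting ensemble, i.e. the
    averaged base-model predictions evaluated on the sampled test set."""
    preds = []
    for model in models:
        model.fit(X_train, y_train)                # train on the sampled training set
        preds.append(model.predict_proba(X_test))  # use model.predict(...) for regression
    return metric(y_test, np.mean(preds, axis=0))  # soft vote, then score

def stochastic_choice_reward(models, X_train, y_train, X_test, y_test, metric, rng=random):
    """Stochastic reward: the score of a single base model drawn at random from a
    (here uniform) stationary distribution at each episode."""
    model = rng.choice(models)
    model.fit(X_train, y_train)
    return metric(y_test, model.predict_proba(X_test))
```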

3.4 Policy Optimisation

We implement the policy learning using a Gated Recurrent Unit (GRU) [3] whose input size matches the feature size of the dataset. It is an intuitive choice, as gated networks support data selection over the sequence of samples, akin to how the agent is meant to process the data stream. During learning, the state at each step is encoded to create a hidden vector representation. With its reset and update gates, the computation of the hidden state is described as follows:

(11)

A linear layer is adopted to derive the probability distribution over actions. Moreover, the policy is pre-trained based on the sampling ratio to achieve faster convergence.

We use the log of the action probability as an equivalent loss function with a learning factor:

(12)
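A minimal PyTorch sketch of such a policy network is shown below; the 256 hidden units and the two-way action space follow the descriptions in this paper, while the class and function names and their exact interfaces are our own assumptions:

```python
import torch
import torch.nn as nn

class SamplingPolicy(nn.Module):
    """GRU policy: encodes the stream of data samples and emits, at each step,
    a probability distribution over the subset-assignment actions."""

    def __init__(self, feature_dim, hidden_dim=256, n_actions=2):
        super().__init__()
        self.gru = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, states, hidden=None):
        # states: (batch, seq_len, feature_dim)
        out, hidden = self.gru(states, hidden)
        logits = self.head(out)                    # (batch, seq_len, n_actions)
        return torch.distributions.Categorical(logits=logits), hidden

def policy_loss(dist, actions, episode_return, alpha=1.0):
    """REINFORCE-style surrogate loss: negative log-probability of the chosen
    actions weighted by the episode return and a learning factor alpha."""
    return -alpha * episode_return * dist.log_prob(actions).sum()
```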

3.4.1 Sampling Regularisation

We implement a regularisation loss to enforce the sampling ratio based on the distributional property of the action probabilities in Eq(3), as follows:

(13)

where the term is weighted by a scale factor and anchored to the target sampling ratio.

In addition, we design a regularisation mechanism to ensure that training samples and testing samples are drawn from the same distribution, as described in Eq(4). This is important for both classification (e.g., preserving class ratios) and regression (e.g., keeping outputs identically distributed). Given the probability density distributions of the training set and the testing set, we define:

(14)

where the term is weighted by a scale factor and measured by the Kullback–Leibler divergence. In regression, we estimate the Kullback–Leibler divergence of continuous samples using Pérez-Cruz's method [25].

The final loss for our policy optimisation is computed as:

(15)
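The sketch below illustrates how the two regularisers and the combined loss could be computed for classification; the discrete class-ratio KL, the scale-factor arguments, and the function names are our simplifications (for regression, the KL term would instead be estimated on continuous outputs with Pérez-Cruz's nearest-neighbour method):

```python
import torch

def ratio_loss(train_probs, target_ratio, scale=1.0):
    """Penalise deviation of the mean probability of assigning a sample to the
    training subset from the desired sampling ratio (in the spirit of Eq(13))."""
    return scale * (train_probs.mean() - target_ratio) ** 2

def iid_loss(y_train, y_test, n_classes, scale=1.0, eps=1e-8):
    """Penalise divergence between the label distributions of the sampled training
    and test subsets (in the spirit of Eq(14), classification case)."""
    p = torch.bincount(y_train, minlength=n_classes).float() + eps
    q = torch.bincount(y_test, minlength=n_classes).float() + eps
    p, q = p / p.sum(), q / q.sum()
    return scale * torch.sum(p * torch.log(p / q))   # KL(train || test)

def total_loss(pg_loss, train_probs, y_train, y_test, n_classes, target_ratio,
               ratio_scale=1.0, iid_scale=1.0):
    """Combined objective in the spirit of Eq(15)."""
    return (pg_loss
            + ratio_loss(train_probs, target_ratio, ratio_scale)
            + iid_loss(y_train, y_test, n_classes, iid_scale))
```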

4 Experiments

Figure 1: Learning dynamics for the deterministic soft-voting reward mechanism on (a) MADELON, (b) DR, (c) MNIST, and (d) KLP.
Figure 2: Learning dynamics for the stochastic choice reward mechanism on (a) MADELON, (b) DR, (c) MNIST, and (d) KLP.

In this section, we conduct experiments on four datasets to examine the effectiveness of the RDS method. Effectiveness is assessed through model diversification, reflected by the performance obtained on the data samples proposed by RDS in comparison with classical methods.

Madelon (MDL) [13] was artificially developed for the NIPS 2003 feature selection challenge. It has 500 numerical features, of which 20 are real features and 480 are distractors with no predictive capacity. Several pre-processing techniques were adopted to conceal the origin and patterns of the dataset in the search for functional feature extractors. We employ bare-bone Logistic Regression (LR), Random Forest (RF), and Multi-Layer Perceptron (MLP) models in our experiments, and a pipeline of stability selection and Logistic Regression on feature interactions is adopted for public benchmarking [6].

Drug Review (DR) [12] provides patients' reviews on specific drugs crawled from multiple online pharmaceutical review sites. It contains categorical features, including drug name and patient condition; review text and date; and numerical features, including review rating and useful counts. In total, there are 215,063 examples, split into a training set of 75% and a test set of 25%. In this experiment, we use three base learners: Ridge Regression (Ridge), Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN).

MNIST [19] consists of 70,000 hand-written digits and is one of the most well-known datasets in the deep learning community. MNIST is selected for our experiments since it represents the multi-class image classification task well. MNIST is considered a balanced image classification dataset, and it is divided into 60K samples for training and 10K samples for testing. In this experiment, we use three base learners: Logistic Regression (LR), Random Forest (RF), and Convolutional Neural Network (CNN).

Kalapa Credit Scoring (KLP) [14] is a data challenge for a credit scoring task. The dataset consists of 30,000 training and 20,000 testing examples. It contains two labels (i.e., GOOD and BAD) associated with 62 variables, including demographics and financial status. The label distribution is imbalanced, with a ratio of 1.6% (i.e., only 486 BAD samples among 30,000 training samples). Moreover, 40 data fields have missing rates of more than 30%, which increases the difficulty of finding a good data selection for the data challenge. We consider three models, namely Logistic Regression (LR), Random Forest (RF), and Multi-Layer Perceptron (MLP), to investigate the effectiveness of the splitting methods. The first-ranked solution [18] is selected as the public model for comparison.

Sampling | #Train | #Test | Class Ratio (Train) | Class Ratio (Test) | LR | RF | MLP | Ensemble | Public
Preset | 2000 | 600 | 1.0000 | 1.0000 | .6019 | .8106 | .5590 | .6783 | .9063
Random | 2000 | 600 | .9920 | 1.0270 | .5742 | .7729 | .5774 | .6453 | .9002
Stratified | 2000 | 600 | 1.0000 | 1.0000 | .5673 | .7470 | .6153 | .6360 | .8828
RDS | 2001 | 599 | 1.0375 | .9137 | .6192 | .8050 | .6228 | .6973 | .8915
RDS | 2021 | 579 | 1.0010 | .9966 | .6192 | .8050 | .6050 | .6947 | .9106
Table 1: Madelon experiment (AUC). Public denotes a public solution of [6]. The two RDS rows correspond to the two reward mechanisms described in Section 3.3.
Sampling | #Train | #Test | Ridge | MLP | CNN | Ensemble | Public
Preset | 161,297 | 53,766 | .4580 | .5787 | .7282 | .6660 | .7637
Random | 161,297 | 53,766 | .4597 | .4179 | .7353 | .6485 | .7503
RDS | 162,070 | 52,993 | .4646 | .5776 | .7355 | .6692 | .7649
RDS | 161,944 | 53,119 | .4647 | .5370 | .7509 | .6562 | .7600
Table 2: Drug Review experiment (R-squared). Public denotes a public solution, a Bi-LSTM with attention, on Kaggle. The two RDS rows correspond to the two reward mechanisms described in Section 3.3.

4.1 Experimental Settings

Implementation.

The source code of RDS is implemented in PyTorch, whilst learning environments are built flexibly with various learning frameworks such as Keras, TensorFlow, or Scikit-learn. Environmental learning models are optimised concurrently using a common evaluation metric. For the policy optimisation, the number of hidden units of the GRU is 256. The learning is run for 300-400 episodes with the RMSprop optimiser and an initial learning rate of 0.001. Scaling factors are empirically selected, i.e., (1.0, 0.9, 0.1) for Madelon, (1.0, 1.0, 40) for Kalapa, (1.0, 0.1, 0.01) for MNIST, and (1.0, 0.9, 0.1) for Drug Review. For the KLP and DR datasets, we employ FastText [4] and BERT [10] language models for extracting representations of textual content. All experiments are conducted on the same computational environment: an Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 256GB RAM, and a Titan RTX 2080Ti GPU card.
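For quick reference, these reported settings can be collected in a small configuration sketch; the assumption that each scale-factor triple weights the policy, sampling-ratio, and IID loss terms in that order is ours:

```python
# Reported RDS hyper-parameters, gathered for reference. We assume (hypothetically)
# that each scale-factor triple weights the (policy, sampling-ratio, IID) loss terms.
RDS_CONFIG = {
    "gru_hidden_units": 256,
    "optimiser": "RMSprop",
    "initial_learning_rate": 1e-3,
    "max_episodes": 400,            # training runs for roughly 300-400 episodes
    "scale_factors": {
        "MADELON": (1.0, 0.9, 0.1),
        "KLP": (1.0, 1.0, 40),
        "MNIST": (1.0, 0.1, 0.01),
        "DR": (1.0, 0.9, 0.1),
    },
}
```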

Baselines.

We compare our proposed RDS approach with several traditional data sampling methods, including simple randomisation (denoted Random) and stratification, for classification only (denoted Stratified). We also compare with the available splitting (denoted Preset), which is provided either by the competition organisers or the dataset authors. Moreover, we select a number of prominent publicly shared solutions to examine the effects of various techniques on the datasets.
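For concreteness, the two classical baselines can be reproduced with scikit-learn as sketched below (the toy arrays and the 75/25 split are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)              # toy features
y = np.random.randint(0, 2, size=1000)    # toy binary labels

# Simple randomisation (the "Random" baseline)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Stratification (the "Stratified" baseline, classification only):
# preserves the class ratios across the training and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```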

Metrics.

For the Madelon and Kalapa datasets, the tasks are binary classification; therefore, we use AUC to measure model performance. We employ the Micro-F1 metric for the multi-class classification task on the MNIST dataset. For the Drug Review experiment, we use R-squared to measure model performance, as the task is regression.
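The corresponding scikit-learn calls are sketched below with illustrative arrays only:

```python
from sklearn.metrics import roc_auc_score, f1_score, r2_score

# AUC for binary classification (Madelon, Kalapa): expects scores or probabilities.
auc = roc_auc_score([0, 1, 1, 0, 1], [0.1, 0.8, 0.7, 0.3, 0.9])

# Micro-F1 for multi-class classification (MNIST): expects hard predictions.
micro_f1 = f1_score([0, 2, 1, 2], [0, 2, 2, 2], average="micro")

# R-squared for regression (Drug Review).
r2 = r2_score([3.0, 7.5, 10.0], [2.8, 7.9, 9.4])
```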

4.2 Results and Discussion

Sampling | #Train | #Test | Class Ratio (Train) | Class Ratio (Test) | LR | RF | CNN | Ensemble | Public
Preset | 60000 | 10000 | .8571 | .1429 | .9647 | .9524 | .9824 | .9819 | .9917
Random | 59500 | 10500 | .8500 | .1500 | .9603 | .9465 | .9779 | .9768 | .9914
Stratified | 59500 | 10500 | .8500 | .1500 | .9625 | .9510 | .9795 | .9792 | .9901
RDS | 59938 | 10062 | .8562 | .1438 | .9495 | .9382 | .9757 | .9769 | .9927
RDS | 59496 | 10504 | .8499 | .1501 | .9583 | .9486 | .9851 | .9830 | .9931
Table 3: MNIST experiment (Micro-F1). Public is a solution based on a CNN architecture. Note that the public solution does not use the same CNN architecture as RDS, whose CNN has fewer layers and takes a shorter time to train. The two RDS rows correspond to the two reward mechanisms described in Section 3.3.
Sampling | #Train | #Test | Class Ratio (Train) | Class Ratio (Test) | LR | RF | MLP | Ensemble | Public
Preset | 30000 | 20000 | .0165 | .0186 | .5799 | .5517 | .5635 | .5723 | .5953
Random | 30000 | 20000 | .0169 | .0179 | .5886 | .5374 | .5914 | .5856 | .6042
Stratified | 30000 | 20000 | .0173 | .0173 | .5952 | .5608 | .5780 | .5983 | .6014
RDS | 29999 | 20001 | .0180 | .0163 | .6045 | .5350 | .5802 | .6057 | .5362
RDS | 30031 | 19969 | .0172 | .0174 | .5997 | .5491 | .6354 | .6072 | .6096
Table 4: Kalapa experiment (AUC). Public denotes the first-ranked solution from the Kalapa challenge on the private leaderboard. The two RDS rows correspond to the two reward mechanisms described in Section 3.3.

The results demonstrate that our proposed RDS approach with its various reward mechanisms works steadily across the four datasets. Figure 1 depicts the learning dynamics of RDS, in which the regression line is highlighted in red to indicate the improvement over time of the designed agent with diversification of multiple base models. Given a finite number of episodes, RDS establishes desirable optimisation behaviours regularised by the sampling assumptions of the problem. Likewise, Figure 2 illustrates the learning dynamics of the stochastic reward mechanism, in which fewer value approximations are exhibited in the model performance on all datasets. The results show better optimisation of the learning gradients with this simple yet efficient method.

In detail, RDS yields good ensemble performance, which is the quantity it directly optimises. This upward trend can be clearly observed across all datasets. RDS also demonstrates clear outperformance for the base learners; the results are especially significant for the LR model on Madelon (Table 1), the CNN models on DR (Table 2) and MNIST (Table 3), as well as on KLP (Table 4). Amongst the baselines, Stratified has the strength of maintaining class ratios for classification, which the proposed RDS methods can also maintain. The Preset splitting, given by the competition organisers or authors, appears to be either Random or Stratified; thus, it obtains performance comparable to randomisation and stratification but worse than the RDS variants. Although the preset allocation performs well in some settings, adequate performance of RDS is consistently observed in both ensemble evaluation and public benchmarking. The stochastic choice mechanism gains some advantages over the deterministic soft-voting mechanism. Moreover, the assumption of statistical independence has a critical impact on the learning of the agent and must be carefully regularised for imbalanced datasets. See Appendix A.4 for experiment notes.

Trainable data sampling for model diversification achieves good performance based on ensemble learning and publicly available solutions; thus, higher learning potentials are yet to be explored.

5 Conclusion

This paper proposes the Reinforced Data Sampling (RDS) method, which learns to select representative samples. The objective is to emphasise model diversification by maximising the learning potentials of various base learners. We introduce different reward mechanisms, including soft voting and stochastic choice, to train an optimal sampling policy under a reinforcement learning framework. Experiments conducted on four datasets clearly highlight the benefits of using RDS over classical sampling approaches. Moreover, RDS's sampling approach is configurable and can be applied to many different types of data and models.

6 Broader Impact

This research is a fundamental step in the advancement of information processing that may elevate many tasks in machine learning. Our proposed Reinforced Data Sampling (RDS) approach will bring meaningful changes to the research community and related industries. In practice, we advocate that the use of RDS is preferable over popular selection methods, such as simple randomisation, stratification, or hold-out, in classification and regression. Promoting optimum sampling with model diversity will also have far-reaching impacts in the search for useful models and insights in diverse venues, including worldwide AI challenges and large-scale research projects. During our research, we contacted multiple participants and winners of recent data challenges with sizable monetary prizes; improper data selection with concept drift was the key issue causing the waste of vast amounts of hours and computational power. On average, each AI competition involves hundreds to thousands of individuals or teams expending enormous resources. The adoption of our framework therefore has potential environmental benefits by minimising the losses from excessive experimentation and reduced productivity. In addition, model diversification will be beneficial for researchers, competition organisers, and large companies to reach the maximal potential of data and models.

References

  • [1] S. H. Bach and M. A. Maloof (2008) Paired learners for concept drift. In 2008 Eighth IEEE International Conference on Data Mining, pp. 23–32. Cited by: §1, §2.
  • [2] S. Bach and M. Maloof (2010) A bayesian approach to concept drift. In Advances in neural information processing systems, pp. 127–135. Cited by: §1, §2, §2.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.4.
  • [4] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2016) Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. Cited by: §4.1.
  • [5] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234. Cited by: §2.
  • [6] B. Cannon (2018) Madelon madness feature selection. https://github.com/wblakecannon/madelon-madness. Cited by: Table 6, Table 1, §4.
  • [7] M. A. Carreira-Perpinán and R. Raziperchikolaei (2016) An ensemble diversity approach to supervised binary hashing. In Advances in Neural Information Processing Systems, pp. 757–765. Cited by: §1.
  • [8] W. G. Cochran (2007) Sampling techniques. John Wiley & Sons. Cited by: §2.
  • [9] D. Cohn, L. Atlas, and R. Ladner (1994) Improving generalization with active learning. Machine Learning 15 (2), pp. 201–221. Cited by: §2.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §4.1.
  • [11] Z. Gong, P. Zhong, and W. Hu (2019) Diversity in machine learning. IEEE Access 7, pp. 64323–64350. Cited by: §1, §3.3.
  • [12] F. Gräßer, S. Kallumadi, H. Malberg, and S. Zaunseder (2018) Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning. In Proceedings of the 2018 International Conference on Digital Health, pp. 121–125. Cited by: Table 7, §4.
  • [13] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror (2005) Result analysis of the nips 2003 feature selection challenge. In Advances in neural information processing systems, pp. 545–552. Cited by: §4.
  • [14] Kalapa (2020) Kalapa credit scoring challenge. https://challenge.kalapa.vn/. Cited by: Table 9, §4.
  • [15] D. E. Knuth (1997) The art of computer programming. Vol. 3, Pearson Education. Cited by: §2.
  • [16] J. Z. Kolter and M. A. Maloof (2007) Dynamic weighted majority: an ensemble method for drifting concepts. Journal of Machine Learning Research 8 (Dec), pp. 2755–2790. Cited by: §2.
  • [17] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413. Cited by: §2.
  • [18] L. Le (2020) The 1st rank solution for kalapa credit scoring challenge. https://github.com/klcute/kalapa/. Cited by: Table 9, §4.
  • [19] Y. LeCun (1998) The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: Table 8, §4.
  • [20] S. Lee, S. P. S. Prakash, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra (2016) Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems, pp. 2119–2127. Cited by: §1, §1, §2.
  • [21] A. R. Mahmood, H. P. van Hasselt, and R. S. Sutton (2014) Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 3014–3022. Cited by: §2.
  • [22] R. J. May, H. R. Maier, and G. C. Dandy (2010) Data splitting for artificial neural networks using som-based stratified sampling. Neural Networks 23 (2), pp. 283–294. Cited by: §2.
  • [23] L. L. Minku, A. P. White, and X. Yao (2009) The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Transactions on knowledge and Data Engineering 22 (5), pp. 730–742. Cited by: §1, §2.
  • [24] M. Peng, Q. Zhang, X. Xing, T. Gui, X. Huang, Y. Jiang, K. Ding, and Z. Chen (2019) Trainable undersampling for class-imbalance learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4707–4714. Cited by: §2.
  • [25] F. Pérez-Cruz (2008) Kullback-leibler divergence estimation of continuous distributions. In 2008 IEEE international symposium on information theory, pp. 1666–1670. Cited by: §3.4.1.
  • [26] C. E. Rasmussen and Z. Ghahramani (2001) Occam’s razor. In Advances in neural information processing systems, pp. 294–300. Cited by: §2.
  • [27] A. Scarlat (2018) MNIST with convoluted nn and keras. Note: https://www.kaggle.com/drscarlat/mnist-99-74-with-convoluted-nn-and-keras Cited by: Table 8.
  • [28] F. Schorfheide and K. I. Wolpin (2012) On the use of holdout samples for model selection. American Economic Review 102 (3), pp. 477–81. Cited by: §2.
  • [29] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe (2008) Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pp. 1433–1440. Cited by: §2.
  • [30] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §3.2, §3.3.
  • [31] T. Zhou, S. Wang, and J. A. Bilmes (2018) Diverse ensemble evolution: curriculum data-model marriage. In Advances in Neural Information Processing Systems, pp. 5905–5916. Cited by: §1.

Appendix A Appendices

This section covers the supplementary information for our approach and experiments as the following:

  • Overall Process of Reinforced Data Sampling for Model Diversification

  • RDS Algorithms

  • Datasets

  • Detailed Experiment Specifications and Notes

a.1 Overall RDS Process

The RDS process starts with the initialisation of the sampling policy with an initial distribution based on the sampling ratio in Eq(3). During an episodic run, an action is drawn from the policy for each state based on the trajectory of steps. The environment employs the base learners to handle data samples based on the given action and transitions to the next step. RDS stores the approximated values of the data samples in a replay memory, which is used to compute the ensemble return for the policy update.

Figure 3: Overall Process of Reinforced Data Sampling (RDS)

a.2 Algorithms

This subsection describes the algorithms for the two variants of RDS: the soft-voting variant (Algorithm 1) and the stochastic-choice variant (Algorithm 2).

Input: the dataset, the ensemble of base learners, and the sampling ratio
procedure RDS(dataset, learners, ratio)
    initialise the policy
    for each episode, up to the maximum number of episodes, do
        clear the replay memory
        for each sample in the dataset do
            observe the current state
            select an action (training or test subset) from the policy
            save the state, action, and log-probability to the replay memory
        end for
        split the dataset into training and test subsets according to the selected actions
        train the base learners on the sampled training set
        evaluate the base learners on the sampled test set
        compute the ensemble return according to Eq(9)
        update the policy
    end for
    generate the final training and test subsets according to Eq(8)
    return the training and test subsets
end procedure
Algorithm 1 Training Algorithm for RDS (soft-voting reward)
Input: the dataset, the ensemble of base learners, and the sampling ratio
procedure RDS(dataset, learners, ratio)
    initialise the policy
    for each episode, up to the maximum number of episodes, do
        clear the replay memory
        for each sample in the dataset do
            observe the current state
            select an action (training or test subset) from the policy
            save the state, action, and log-probability to the replay memory
        end for
        split the dataset into training and test subsets according to the selected actions
        choose a base learner from the stationary distribution
        train the chosen learner on the sampled training set
        evaluate the chosen learner on the sampled test set
        compute the return according to Eq(10)
        update the policy
    end for
    generate the final training and test subsets according to Eq(8)
    return the training and test subsets
end procedure
Algorithm 2 Training Algorithm for RDS (stochastic-choice reward)
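The following compact Python rendering of Algorithm 1 is a sketch under our own naming: the `dataset.states()` and `dataset.split()` helpers, the policy and reward interfaces, and the omission of the regularisation terms are all assumptions for illustration (Algorithm 2 would only swap in the stochastic-choice reward):

```python
import torch

def train_rds(policy, optimiser, dataset, learners, episodes, reward_fn):
    """Sketch of the RDS training loop (soft-voting variant); regularisers omitted."""
    for _ in range(episodes):
        log_probs, actions, hidden = [], [], None
        for state in dataset.states():                     # sample-by-sample trajectory
            dist, hidden = policy(state, hidden)
            action = dist.sample()                         # e.g. 1 -> train, 0 -> test
            log_probs.append(dist.log_prob(action))
            actions.append(int(action))

        train_set, test_set = dataset.split(actions)       # realise the sampled subsets
        reward = reward_fn(learners, train_set, test_set)  # Eq(9); Eq(10) for Algorithm 2

        loss = -reward * torch.stack(log_probs).sum()      # REINFORCE update, cf. Eq(7)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    # Final split from the converged policy, cf. Eq(8).
    with torch.no_grad():
        final_actions, hidden = [], None
        for state in dataset.states():
            dist, hidden = policy(state, hidden)
            final_actions.append(int(dist.probs.argmax()))
    return dataset.split(final_actions)
```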

a.3 Datasets

In this paper, four datasets are selected for the analysis of the effectiveness of RDS. They cover a range of machine learning tasks, including binary classification, multi-class classification and regression. Our study simulates the way that the datasets are prepared for the tasks, given no prior knowledge on existing or public solutions.

We employ bare-bone base models with minimal settings to conduct experiments on the datasets, which mostly originate from data science or AI challenges. The existing train/test subsets of the datasets are obtained from their public websites, forums, or emails.

The primary purpose is to demonstrate the generalisability of RDS by applying base models for data sampling and evaluating the resulting data samples with ensemble learning and publicly available solutions. We note that the existing solutions may have been designed to fit the preset allocation of the datasets. For example, hyper-parameters or hand-crafted rules may have been explicitly fine-tuned to the published data samples to perform highly in the competitions. Nevertheless, it is intriguing to examine the effects of RDS using existing public solutions, knowing that the learning potential of trainable data samples may go even further.

Dataset | Task | Challenge | Size of Data | Evaluation | Year
MADELON | Binary Classification | NIPS 2003 Feature Selection | 2,600 (multivariate) | AUC | 2003
DR | Regression | Drug Reviews (Kaggle Hackathon) | 215,063 (multivariate, text) | R-squared | 2018
MNIST | Multiclass Classification | Hand-Written Digit Recognition | 70,000 (image) | Micro-F1 | 1998
KLP | Binary Classification | Kalapa Credit Scoring | 50,000 (multivariate, text) | AUC | 2020
Table 5: List of datasets used in this paper.

a.4 Experiment Specifications and Notes

This subsection describes the specifications and notes of our experiments in detail.

Experiment Notes
Background Madelon was artificially developed for the NIPS 2003 feature selection challenge. It has 500 numerical features, of which 20 are real features and 480 are distractors with no predictive capacity. Several pre-processing techniques were previously adopted to conceal the origin and patterns of the dataset in the search for functional feature extractors.
Settings (1.0, 0.9, 0.1)
Base Models     LR - Logistic Regression (solver=’liblinear’,penalty=’l2’,random_state=123)
    RF - Random Forest(n_estimators=128,random_state=123)
    MLP - Multilayer Perceptron (Adam, lr=1e-3, manual_seed=123)
    SVC - Support Vector Classifier (kernel='rbf', coef0=1)
Benchmark A pipeline of stability selection and Logistic Regression of feature interactions is adopted for public benchmarking [6]
Pre-processing No pre-processing needed
Run time 90s / epoch
Observations
Learning Dynamics as shown in Figure 1(a) and Figure 2(a)
    LR has performed consistently on the Madelon dataset. There were some low performance points at the beginning; however, the upward trend can be observed during the agent learning.
    Both LR and MLP are less stable with lower performance than RF.
    The ensemble performance falls between the range of the performance of three base models. This observation indicates that there are disagreements among classifiers, which hints that model diversity is injected into model performance.
    The regularisations serve as important mechanisms for optimum allocation.
    The class ratios of the dataset are balanced; hence, the IID regularisation approaches almost zero.
    Experiments on SVM yield similar observations at higher computational cost; thus, SVM results are not included in our report.
Comparison between the two RDS variants
    Consistent results are observed on both deterministic and stochastic reward mechanisms.
    The agent learning has become stable within the first 30 episodes.
    The stochastic choice shows the same learning dynamics with a noticeably smaller number of value approximations.
Comparison between RDS and other methods as shown in Table 1
    The preset selection shows the highest performance on RF, which may hint that the dataset was prepared with a hold-out based on RF. However, it has the worst performance on MLP; therefore, it may have hindered neural network-based solutions during the challenge.
    Both random and stratified sampling techniques show sub-optimal performance in all evaluation metrics.
    RDS has a good balance amongst all three base models. The performance of RDS on RF is slightly lower than that of the preset selection. However, it yields higher results on LR, MLP, and the ensemble.
    RDS has achieved the highest performance in the ensemble use of three base models.
    Preset and stratified allocation have a perfect sampling ratio.
Public Benchmarking
    The public solution shows higher performance than the bare-bone ensemble of the three base models. The evaluation metrics are in agreement across the sampling techniques, except for one RDS variant.
    RDS performs best in public benchmarking.
Summary
    The Madelon experiment highlights the effectiveness of reinforced sampling for model diversification.
    The model diversity is observed based on the ensemble and performance of various base models.
    The regularisations play an important role in optimum sampling.
Table 6: Madelon - Experiment Notes
Experiment Notes
Background Drug Review (DR) [12] provides patients’ reviews on specific drugs crawled from multiple online pharmaceutical review sites. It contains categorical features including drug name and patient condition, review text and date, and numerical features including review rating and useful counts. In total, there are 215,063 examples, which is split into a training set of 75% and a test set of 25%.
Settings (1.0, 0.9, 0.1)
Base Models     Ridge - Ridge Regression (solver=‘sage’, random_state=2020)
    MLP - Multilayer Perceptron (Adam,lr=1e-3,manual_seed=2020)
    CNN - Convolutional Neural Network (Adam,lr=1e-3,manual_seed=2020)
Benchmark A public solution on Kaggle (https://www.kaggle.com/stasian/predicting-review-scores-using-neural-networks) using a two-layer bidirectional LSTM with Bahdanau attention pooling before the prediction.
Pre-processing Ridge and MLP use average-pooling word embeddings from BERT-Base model of 768 dimensions, while CNN word embeddings are initialised from pre-trained word2vec of 300 dimensions.
Run time 1200s / epoch
Observations
Learning Dynamics as shown in Figure 1(d) and Figure 2(d)
    The overall trend is improving for all models, which is characterised by the average performance (coloured in red).
    Ridge has performed consistently on the DR dataset, although its performance is still the lowest, as Ridge is the simplest of the three base models.
    MLP's performance is the most unstable. We argue that this is due to the complexity of the task, and MLP is easily trapped in local minima of the optimisation space.
    With the highest modelling capacity, CNN shows the best performance among the base models. CNN is more stable than MLP though not as stable as Ridge.
    The sampling ratio is stabilised after several episodes and converged nicely over time.
Comparison between the two RDS variants
    The optimisation is stable on both deterministic and stochastic reward mechanisms.
    Sampling ratio has become stable within the first 30 episodes. Both converge to the expected ratio (0.75) over time.
    The IID regularisation is preserved better with the deterministic reward mechanism.
Comparison between RDS and other methods as shown in Table 2
    The preset selection shows the highest performance on MLP, which may hint that the dataset was prepared with a hold-out based on an MLP model. However, it has the worst performance on Ridge and CNN.
    Random sampling technique shows sub-optimal results in all comparisons. Notably, it yields the worst performance with the public solution.
    RDS has a good balance amongst all three base models. The performance of RDS variants on MLP is slightly lower than the performance of the preset selection. However, it yields higher results in Ridge, CNN, and Ensemble.
    One RDS variant achieves the best performance for Ridge and CNN, while the other achieves the highest results in the ensemble as well as on the public solution. This suggests the effectiveness of using RDS techniques for data sampling to capture the full potential of the models.
Public Benchmarking
    The public solution shows higher performance than the bare-bone ensemble of the three base models, consistently across all sampling strategies.
    The best performance is achieved by an RDS variant, which agrees with the ensemble result as well.
Summary
The experimental results on the Drug Review dataset suggest the effectiveness of reinforced sampling for model diversification. The observations also agree with those of the other experiments.
Table 7: Drug Review - Experiment Notes
Experiment Notes
Background MNIST [19] consists of 70,000 hand-written digits and is one of the most well-known datasets in the deep learning community. MNIST is selected for our experiments since it represents the multi-class image classification task well. MNIST is considered a balanced image classification dataset, and it is divided into 60K samples for training and 10K samples for testing. In this experiment, we use three base learners, including Logistic Regression (LR), Random Forest (RF), and Convolutional Neural Network (CNN).
Settings (1.0, 0.9, 0.1)
Base Models     LR - Logistic Regression (solver=‘lbfgs’)
    RF - Random Forest (n_estimators=50)
    CNN - Convolutional Neural Network (Adam, lr=0.01)
Benchmark A high score solution on Kaggle for MNIST classification task [27].
Pre-processing We extract Histogram of Oriented Gradients (HOG) features for the policy learner, LR, and RF algorithms. To reduce the dimensionality, we apply PCA with n_components of 0.95, meaning that the number of components is selected such that the amount of variance explained is greater than the specified percentage (i.e., 95%). The CNN model runs directly on the normalised raw pixel values.
Run time On average 88s / epoch
Observations
Learning Dynamics as shown in Figure 1(c) and Figure 2(c)
    LR, RF, CNN have performed consistently on the MNIST dataset. The upward trend can be observed during the agent learning across three base models.
    The class ratios of the dataset are balanced; this is observed as the IID regularisation approaches almost zero.
Comparison between the two RDS variants
    The optimisation process is stable and gets better on both deterministic and stochastic reward mechanisms.
    CNN achieves the highest performance under both RDS settings. That CNN performs increasingly better suggests that the selected samples are well represented and the model does not face a class-imbalance issue.
    The stochastic reward mechanism has the same learning dynamics and achieves better performance in a shorter time.
Comparison between RDS and other methods as shown in Table 3
    The preset selection gets the highest performance on RF. This might hint that the preset selection process was based on RF.
    The stratified setting gets the highest performance on LR. It showcases that the stratified splitting approach is a strong baseline for balanced data.
    Both Random and one RDS variant show sub-optimal performance on several evaluation metrics, hinting that they might suffer from an imbalanced selection between training and testing samples. For that RDS variant, the voting-based reward mechanism was used, which might be a cause of the reduced diversity.
Public Benchmarking
    The public solution has the best performance in comparison to the ensemble of the three base models. The same evaluation metric is used for all sampling techniques.
    RDS gets the best performance in public benchmarking.
Summary
    The MNIST experiment shows that the proposed approach can perform effectively on an image classification task for model diversification. The preset split of the MNIST dataset was well prepared in terms of both sample similarity and class balance; nevertheless, our proposed RDS approach can select an even better split, as the public solution achieves higher performance on the RDS split than on the other splits.
Table 8: MNIST - Experiment Notes
Experiment Notes
Background KLP was provided in the Kalapa credit scoring challenge [14]. It contains 50,000 profiles associated with good or bad labels. Each profile has 62 demographic and financial features. Originally, KLP is separated into 30,000 training and 20,000 testing examples with a strongly imbalanced label distribution, i.e., only approximately 1.6% of the profiles are labelled as BAD. As another serious issue, over 40 fields have more than 30% missing values.
Settings (1.0, 1.0, 40)
Base Models     LR - Logistic Regression (solver=‘liblinear’, random_state=123)
    RF - Random Forest (n_estimators=64, random_state=123)
    MLP - Multilayer Perceptron (Adam, lr=1e-3, manual_seed=123)
Benchmark The 1st rank solution using Random Forest with WOE binnings and the number estimators of 767 [18]
Pre-processing Three text fields, namely 'Province', 'Job', and 'District', are encoded as average-pooled word embeddings from a fine-tuned FastText model of 32 dimensions. Other fields are processed with traditional feature engineering techniques (e.g., MinMaxScaler, OrdinalEncoder, dummy variables).
Run time On average 18s / epoch
Observations
Learning Dynamics as shown in Figure 1(b) and Figure 2(b)
    LR is quite stable with higher performance compared against RF and MLP.
    RF is not always the most effective model for the classification task with the imbalanced and missing data problems.
    MLP shows better performance when it is not constrained by the deterministic soft-voting reward mechanism.
    Regularisations work properly to force the sampling ratio converged to the expected value, and reduce the label distribution difference between the training and testing sets.
Comparison between the two RDS variants
    There is a consistent upward trend on classification performance of the ensemble model for both reward mechanisms.
    The models are less constrained during the learning process with the stochastic reward mechanism; hence, they achieve better performance.
    The sampling ratio and IID regularisation fluctuate within the first 30 episodes. Subsequently, they become more stable and converge after 450 episodes. A similar trend is observed for both reward mechanisms.
Comparison between RDS and other methods as shown in Table 4
    The best performance with the preset selection method is observed on LR, which implies that LR may have been the base model used to split the data in the credit scoring challenge.
    Both the random and stratified sampling methods show sub-optimal performance on all evaluation metrics.
    RDS variants outperform the traditional sampling methods for almost all individual and ensemble models. The performance of RDS on RF is slightly lower than that of the stratified selection due to the strong dependence of RF on feature distributions.
Public Benchmarking
    The public solution results in better performance than the bare-bone ensemble of the three base models across all sampling methods, except for one RDS variant.
    The best public performance is achieved by the other RDS variant.
Summary
Experiments on KLP affirm the advantages of the reinforced sampling method for model diversification. With appropriate regularisation settings, the proposed method can effectively control the sampling constraints, even when there are serious class imbalance and missing-data problems.
Table 9: KLP - Experiment Notes