Code for the paper "Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond"
Manipulating data, such as weighting data examples or augmenting with new instances, has been increasingly used to improve model training. Previous work has studied various rule- or learning-based approaches designed for specific types of data manipulation. In this work, we propose a new method that supports learning different manipulation schemes with the same gradient-based algorithm. Our approach builds upon a recent connection of supervised learning and reinforcement learning (RL), and adapts an off-the-shelf reward learning algorithm from RL for joint data manipulation learning and model training. Different parameterization of the "data reward" function instantiates different manipulation schemes. We showcase data augmentation that learns a text transformation network, and data weighting that dynamically adapts the data sample importance. Experiments show the resulting algorithms significantly improve the image and text classification performance in low data regime and class-imbalance problems.
The performance of machines often crucially depends on the amount and quality of the data used for training. It has become increasingly common to manipulate data to improve learning, especially in the low-data regime or in the presence of low-quality datasets (e.g., imbalanced labels). For example, data augmentation applies label-preserving transformations on original data points to expand the data size; data weighting assigns an importance weight to each instance to adapt its effect on learning; and data synthesis generates entire artificial examples. Different types of manipulation can be suitable for different application settings.
Common data manipulation methods are usually designed manually, e.g., augmenting by flipping an image or replacing a word with synonyms, and weighting with inverse class frequency or loss values (freund1997decision; malisiewicz2011ensemble). Recent work has studied automated approaches, such as learning the composition of augmentation operators with reinforcement learning (ratner2017learning; cubuk2018autoaugment), deriving sample weights adaptively from a validation set via meta learning (ren2018learning), or learning a weighting network by inducing a curriculum (jiang2017mentornet). These learning-based approaches have alleviated the engineering burden and produced impressive results. However, the algorithms are usually designed specifically for certain types of manipulation (e.g., either augmentation or weighting) and thus have limited application scope in practice.
In this work, we propose a new approach that enables learning for different manipulation schemes with the same single algorithm. Our approach draws inspiration from the recent work (tan2018connecting) that shows equivalence between the data in supervised learning and the reward function in reinforcement learning. We thus adapt an off-the-shelf reward learning algorithm (zheng2018learning) to the supervised setting for automated data manipulation. The marriage of the two paradigms results in a simple yet general algorithm, where various manipulation schemes are reduced to different parameterization of the data reward. Free parameters of manipulation are learned jointly with the target model through efficient gradient descent on validation examples. We demonstrate instantiations of the approach for automatically fine-tuning an augmentation network and learning data weights, respectively.
We conduct extensive experiments on text and image classification in challenging situations of very limited data and imbalanced labels. Both augmentation and weighting by our approach significantly improve over strong base models, even though the models are initialized with large-scale pretrained networks such as BERT (devlin2018bert) for text and ResNet (he2016deep) for images. Our approach, besides its generality, also outperforms a variety of dedicated rule- and learning-based methods for either augmentation or weighting, respectively. Lastly, we observe that the two types of manipulation tend to excel in different contexts: augmentation shows superiority over weighting with a small amount of data available, while weighting is better at addressing class imbalance problems.
The way we derive the manipulation algorithm represents a general means of problem solving through algorithm extrapolation between learning paradigms, which we discuss more in section 6.
Rich types of data manipulation have been increasingly used in modern machine learning pipelines. Each previous work has typically focused on a particular manipulation type. Data augmentation that perturbs examples without changing the labels is widely used, especially in the vision (simard1998transformation; krizhevsky2012imagenet) and speech (ko2015audio; park2019specaugment) domains. Common heuristic-based methods on images include cropping, mirroring, rotation (krizhevsky2012imagenet), and so forth. Recent work has developed automated augmentation approaches (cubuk2018autoaugment; ratner2017learning; lemley2017smart; peng2018jointly; tran2017bayesian). xie2019unsupervised additionally use large-scale unlabeled data. cubuk2018autoaugment; ratner2017learning learn to induce the composition of data transformation operators. Instead of treating data augmentation as a policy in reinforcement learning (cubuk2018autoaugment), we formulate manipulation as a reward function and use efficient stochastic gradient descent to learn the manipulation parameters. Text data augmentation has also achieved impressive success, such as contextual augmentation (kobayashi2018contextual; wu2018conditional), back-translation (sennrich2015improving), and manual approaches (wei2019eda; andreas2016learning). In addition to perturbing the input text as in classification tasks, text generation problems expose opportunities to add noise also in the output text, such as (norouzi2016reward; xie2017data). Recent work (tan2018connecting) shows that output noising in sequence generation can be treated as an intermediate approach between supervised learning and reinforcement learning, and develops a new sequence learning algorithm that interpolates across the spectrum of existing algorithms. We instantiate our approach for text contextual augmentation as in (kobayashi2018contextual; wu2018conditional), but enhance the previous work by additionally fine-tuning the augmentation network jointly with the target model.
Data weighting has been used in various algorithms, such as AdaBoost (freund1997decision), self-paced learning (kumar2010self), hard-example mining (shrivastava2016training), and others (chang2017active; katharopoulos2018not). These algorithms largely define sample weights based on training loss. Recent work (jiang2017mentornet; fan2018learning) learns a separate network to predict sample weights. Of particular relevance to our work is (ren2018learning), which induces sample weights using a validation set. The data weighting mechanism instantiated by our framework has a key difference in that sample weights are treated as parameters that are updated iteratively, instead of re-estimated from scratch at each step. We show improved performance of our approach. Besides, our data manipulation approach is derived from a different perspective of reward learning, instead of meta-learning as in (ren2018learning).
Another popular type of data manipulation involves data synthesis, which creates entire artificial samples from scratch. GAN-based approaches have achieved impressive results for synthesizing conditional image data (baluja2017adversarial; mirza2014conditional). In the text domain, controllable text generation (hu2017controllable) presents a way of co-training the data generator and classifier in a cyclic manner within a joint VAE (kingma2013auto) and wake-sleep (hinton1995wake) framework. It is interesting to explore the instantiation of the present approach for adaptive data synthesis in the future.
We first present the relevant work upon which our automated data manipulation is built. This section also establishes the notations used throughout the paper.
Let $x$ denote the input and $y$ the output. For example, in text classification, $x$ can be a sentence and $y$ is the sentence label. Denote the model of interest as $p_\theta(y|x)$, where $\theta$ is the model parameters to be learned. In the supervised setting, given a set of training examples $\mathcal{D} = \{(x^*, y^*)\}$, we learn the model by maximizing the data log-likelihood.
The recent work (tan2018connecting) introduced a unifying perspective of reformulating maximum likelihood supervised learning as a special instance of a policy optimization framework. In this perspective, data examples providing supervision signals are equivalent to a specialized reward function. Since the original framework (tan2018connecting) was derived for sequence generation problems, here we present a slightly adapted formulation for our context of data manipulation.
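To connect the maximum likelihood supervised learning with policy optimization, consider the model $p_\theta(y|x)$ as a policy that takes “action” $y$ given the “state” $x$. Let $R(x, y|\mathcal{D}) \in \mathbb{R}$ denote a reward function, and $p(x)$ be the empirical data distribution which is known given $\mathcal{D}$. Further assume a variational distribution $q(x, y)$ that factorizes as $q(x, y) = p(x)\,q(y|x)$. A variational policy optimization objective is then written as:
$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(x, y)}\left[R(x, y|\mathcal{D})\right] - \alpha\, \mathrm{KL}\left(q(x, y)\,\|\,p(x)\,p_\theta(y|x)\right) + \beta\, \mathrm{H}(q), \tag{1}$$
where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence; $\mathrm{H}(\cdot)$ is the Shannon entropy; and $\alpha, \beta > 0$ are balancing weights. The objective is in the same form as the RL-as-inference formalism of policy optimization (e.g., dayan1997using; levine2018reinforcement; abdolmaleki2018maximum). Intuitively, the objective maximizes the expected reward under $q$, and enforces the model $p_\theta$ to stay close to $q$, with a maximum entropy regularization over $q$. The problem is solved with an EM procedure that optimizes $q$ and $\theta$ alternatingly:
$$\text{E-step:}\quad q'(x, y) = \exp\left\{\frac{\alpha \log p(x)\,p_\theta(y|x) + R(x, y|\mathcal{D})}{\alpha + \beta}\right\} \Big/ Z, \qquad \text{M-step:}\quad \theta' = \arg\max_\theta\, \mathbb{E}_{q'(x, y)}\left[\log p_\theta(y|x)\right], \tag{2}$$
where $Z$ is the normalization term. With the established framework, it is easy to show that the above optimization procedure reduces to maximum likelihood learning by taking $\alpha \to 0$, $\beta = 1$, and the reward function:
$$R_\delta(x, y|\mathcal{D}) = \begin{cases} 1 & \text{if } (x, y) \in \mathcal{D}, \\ -\infty & \text{otherwise}. \end{cases} \tag{3}$$
That is, a sample $(x, y)$ receives a unit reward only when it matches a training example in the dataset, while the reward is negative infinity in all other cases. To make the equivalence to maximum likelihood learning clearer, note that the above M-step now reduces to
$$\theta' = \arg\max_\theta\, \mathbb{E}_{p(x)\exp\{R_\delta\}/Z}\left[\log p_\theta(y|x)\right], \tag{4}$$
where the joint distribution $p(x)\exp\{R_\delta\}/Z$ equals the empirical data distribution, which means the M-step is in fact maximizing the data log-likelihood of the model $p_\theta$.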
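The reduction to maximum likelihood can be spelled out in a few lines. The following is a sketch under stated assumptions rather than the paper's own derivation: it assumes the training set contains $N$ examples with distinct inputs, written $(x_i^*, y_i^*)$, and that the empirical input distribution puts mass $1/N$ on each of them. The symbols $N$ and the index $i$ are introduced here for illustration.

```latex
% Assumptions: D = {(x_i^*, y_i^*)}_{i=1}^N with distinct inputs, p(x_i^*) = 1/N.
% Exponentiating the delta reward gives e on training pairs and 0 elsewhere:
\exp\{R_\delta(x, y \mid \mathcal{D})\} =
\begin{cases} e & \text{if } (x, y) \in \mathcal{D}, \\ 0 & \text{otherwise}, \end{cases}
\qquad
Z = \sum_{x, y} p(x) \exp\{R_\delta(x, y \mid \mathcal{D})\} = e .

% Hence the sampling distribution is uniform over the training pairs:
\frac{p(x) \exp\{R_\delta(x, y \mid \mathcal{D})\}}{Z}
= \frac{1}{N}\, \mathbb{1}\big[(x, y) \in \mathcal{D}\big],

% and the M-step is the usual maximum likelihood objective:
\theta' = \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i^* \mid x_i^*) .
```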
There is a rich line of research on learning the reward in reinforcement learning. Of particular interest to this work is (zheng2018learning), which learns a parametric intrinsic reward that additively transforms the original task reward (a.k.a. extrinsic reward) to improve the policy optimization. For consistency of notation with the above, formally, let $p_\theta(y|x)$ be a policy where $y$ is an action and $x$ is a state. Let $R^{in}_\phi$ be the intrinsic reward with parameters $\phi$. In each iteration, the policy parameter $\theta$ is updated to maximize the joint rewards, through:
$$\theta' = \theta + \gamma \nabla_\theta \mathcal{L}^{ex+in}(\theta, \phi), \tag{5}$$
where $\mathcal{L}^{ex+in}$ is the expectation of the sum of extrinsic and intrinsic rewards; and $\gamma$ is the step size. The equation shows $\theta'$ depends on $\phi$, thus we can write $\theta' = \theta'(\phi)$.
The next step is to optimize the intrinsic reward parameters $\phi$. Recall that the ultimate measure of the performance of a policy is the value of extrinsic reward it achieves. Therefore, a good intrinsic reward is supposed to, when the policy is trained with it, increase the eventual extrinsic reward. The update to $\phi$ is then written as:
$$\phi' = \phi + \gamma \nabla_\phi \mathcal{L}^{ex}(\theta'(\phi)). \tag{6}$$
That is, we want the expected extrinsic reward $\mathcal{L}^{ex}(\theta')$ of the new policy $\theta'$ to be maximized. Since $\theta'$ is a function of $\phi$, we can directly backpropagate the gradient through $\theta'$ to $\phi$.
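The backpropagation step can be written out with the chain rule. This is a sketch under assumptions: the symbol names ($\theta$ for policy parameters, $\phi$ for intrinsic reward parameters, $\gamma$ for the step size, $\mathcal{L}^{ex+in}$ and $\mathcal{L}^{ex}$ for the joint and extrinsic objectives) are chosen here for illustration, and the current $\theta$ is treated as a constant with respect to $\phi$, so only the single most recent policy step is differentiated.

```latex
% One-step lookahead: the updated policy parameters as a function of phi.
\theta'(\phi) = \theta + \gamma\, \nabla_\theta \mathcal{L}^{ex+in}(\theta, \phi)
\quad\Longrightarrow\quad
\frac{\partial \theta'}{\partial \phi}
= \gamma\, \nabla_\phi \nabla_\theta \mathcal{L}^{ex+in}(\theta, \phi).

% Chain rule for the extrinsic objective evaluated at the updated policy.
\nabla_\phi \mathcal{L}^{ex}\big(\theta'(\phi)\big)
= \Big(\frac{\partial \theta'}{\partial \phi}\Big)^{\!\top}
  \nabla_{\theta'} \mathcal{L}^{ex}(\theta').
```

The mixed second derivative is the only place where $\phi$ enters, which is why the intrinsic reward must be differentiable in its parameters for this scheme to apply.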
We now develop our approach of learning data manipulation, through a novel marriage of supervised learning and the above reward learning. Specifically, from the policy optimization perspective, due to the $\delta$-function reward (Eq.3), the standard maximum likelihood learning is restricted to use only the exact training examples $\mathcal{D}$ in a uniform way. A natural idea of enabling data manipulation is to relax the strong restrictions of the $\delta$-function reward and instead use a relaxed reward $R_\phi(x, y|\mathcal{D})$ with parameters $\phi$. The relaxed reward can be parameterized in various ways, resulting in different types of manipulation. For example, when a sample $(x, y)$ matches a data instance, instead of returning the constant $1$ by $R_\delta$, the new $R_\phi$ can return varying reward values depending on the matched instance, resulting in a data weighting scheme. Alternatively, $R_\phi$ can return a valid reward even when $x$ matches a data example only in part, or $(x, y)$ is an entirely new sample not in $\mathcal{D}$, which in effect yields data augmentation and data synthesis, respectively, in which cases $\phi$ is either a data transformer or a generator. In the next section, we demonstrate two particular parameterizations for data augmentation and weighting, respectively.
We thus have shown that the diverse types of manipulation all boil down to a parameterized data reward $R_\phi$. Such a concise, uniform formulation of data manipulation has the advantage that, once we devise a method of learning the manipulation parameters $\phi$, the resulting algorithm can directly be applied to automate any manipulation type. We present a learning algorithm next.
To learn the parameters $\phi$ in the manipulation reward $R_\phi(x, y|\mathcal{D})$, we could in principle adopt any off-the-shelf reward learning algorithm in the literature. In this work, we draw inspiration from the above gradient-based reward learning (section 3) due to its simplicity and efficiency. Briefly, the objective of $\phi$ is to maximize the ultimate measure of the performance of model $p_\theta(y|x)$, which, in the context of supervised learning, is the model performance on a held-out validation set.
The algorithm optimizes $\theta$ and $\phi$ alternatingly, corresponding to Eq.(5) and Eq.(6), respectively. More concretely, in each iteration, we first update the model parameters $\theta$ in analogy to Eq.(5), which optimizes the intrinsic reward-enriched objective. Here, we optimize the log-likelihood of the training set enriched with data manipulation. That is, we replace $R_\delta$ with $R_\phi$ in Eq.(4), and obtain the augmented M-step:
$$\theta' = \arg\max_\theta\, \mathbb{E}_{p(x)\exp\{R_\phi(x, y|\mathcal{D})\}/Z}\left[\log p_\theta(y|x)\right]. \tag{7}$$
By noticing that the new $\theta'$ depends on $\phi$, we can write $\theta'$ as a function of $\phi$, namely, $\theta' = \theta'(\phi)$. The practical implementation of the above update depends on the actual parameterization of manipulation $R_\phi$, which we discuss in more detail in the next section.
The next step is to optimize $\phi$ in terms of the model validation performance, in analogy to Eq.(6). Formally, let $\mathcal{D}^v$ be the validation set of data examples. The update is then:
$$\phi' = \arg\max_\phi\, \mathbb{E}_{p(x)\exp\{R_\delta(x, y|\mathcal{D}^v)\}/Z}\left[\log p_{\theta'}(y|x)\right] = \arg\max_\phi\, \mathbb{E}_{(x, y) \sim \mathcal{D}^v}\left[\log p_{\theta'}(y|x)\right], \tag{8}$$
where, since $\theta'$ is a function of $\phi$, the gradient is backpropagated to $\phi$ through $\theta'(\phi)$. Taking data weighting for example where $\phi$ is the training sample weights (more details in section 4.2), the update is to optimize the weights of training samples so that the model performs best on the validation set.
The resulting algorithm is summarized in Algorithm 1. Figure 1 illustrates the computation flow. Learning the manipulation parameters effectively uses a held-out validation set. We show in our experiments that a very small set of validation examples (e.g., 2 labels per class) is enough to significantly improve the model performance in the low-data regime.
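The alternating procedure can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a one-parameter logistic model, softmax-normalized per-example weights as the manipulation parameters, a single gradient step as the model update, and a hand-derived chain rule in place of automatic differentiation. All function names, the toy data, and the step sizes are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def nll_grad(theta, x, y):
    """Gradient of -log p_theta(y|x) w.r.t. theta, with p_theta(y=1|x) = sigmoid(theta * x)."""
    return (sigmoid(theta * x) - y) * x

def manipulation_step(theta, phi, train, val, lr_theta, lr_phi):
    """One iteration: weighted model update, then weight update from validation data."""
    weights = softmax(phi)
    grads = [nll_grad(theta, x, y) for x, y in train]
    mean_grad = sum(w * g for w, g in zip(weights, grads))
    # Model step on the weighted training loss; theta_new is a function of phi.
    theta_new = theta - lr_theta * mean_grad
    # Gradient of the mean validation loss at the updated model.
    val_grad = sum(nll_grad(theta_new, x, y) for x, y in val) / len(val)
    # Chain rule: d theta_new / d phi_j = -lr_theta * w_j * (g_j - mean_grad).
    phi_new = [
        p - lr_phi * val_grad * (-lr_theta * w * (g - mean_grad))
        for p, w, g in zip(phi, weights, grads)
    ]
    return theta_new, phi_new

# Hypothetical toy data: the third training example carries a wrong label.
train = [(1.0, 1), (2.0, 1), (1.5, 0), (-1.0, 0), (-2.0, 0)]
val = [(1.0, 1), (-1.0, 0)]

theta, phi = 0.0, [0.0] * len(train)
for _ in range(100):
    theta, phi = manipulation_step(theta, phi, train, val, lr_theta=1.0, lr_phi=1.0)
print([round(w, 3) for w in softmax(phi)])
```

In this toy setting, a training example whose gradient disagrees with the validation gradient has its weight pushed down, which is the intended effect of optimizing the weights against held-out data.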
It is worth noting that some previous work has also leveraged validation examples, such as learning data augmentation with policy gradient (cubuk2018autoaugment) or inducing data weights with meta-learning (ren2018learning). Our approach is inspired by a distinct paradigm of (intrinsic) reward learning. In contrast to (cubuk2018autoaugment), which treats data augmentation as a policy, we instead formulate manipulation as a reward function and enable efficient stochastic gradient updates. Our approach is also more broadly applicable to diverse data manipulation types than (ren2018learning; cubuk2018autoaugment).
As a case study, we show two parameterizations of $R_\phi$ which instantiate distinct data manipulation schemes. The first example learns augmentation for text data, a domain that has been less studied in the literature compared to vision and speech (kobayashi2018contextual; giridhara2019survey). The second instance focuses on automated data weighting, which is applicable to any data domain.
The recent work (kobayashi2018contextual; wu2018conditional) developed a novel contextual augmentation approach for text data, in which a powerful pretrained language model (LM), such as BERT (devlin2018bert), is used to generate substitutions of words in a sentence. Specifically, given an observed sentence $x^*$, the method first randomly masks out a few words. The masked sentence is then fed to BERT, which fills the masked positions with new words. To preserve the original sentence class, the BERT LM is retrofitted as a label-conditional model, and trained on the task training examples. The resulting model is then fixed and used to augment data during the training of the target model. We denote the augmentation distribution as $g_{\phi_0}(x|x^*, y^*)$, where $\phi_0$ is the fixed BERT LM parameters.
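The masking step that precedes the LM fill-in can be sketched in a few lines. This is an illustrative sketch only: the function name, the example sentence, and the use of a literal `[MASK]` string are assumptions, and the label-conditional LM that would fill the masked positions is not implemented here.

```python
import random

MASK = "[MASK]"  # BERT-style mask token, assumed here for illustration

def mask_tokens(tokens, num_masks, rng):
    """Return a copy of `tokens` with `num_masks` random positions masked,
    together with the sorted list of masked positions."""
    num_masks = min(num_masks, len(tokens))
    positions = rng.sample(range(len(tokens)), num_masks)
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK
    return masked, sorted(positions)

rng = random.Random(0)
sentence = "a striking and heartfelt film".split()  # hypothetical example sentence
masked, positions = mask_tokens(sentence, 2, rng)
print(masked, positions)
```

A label-conditional LM would then propose a word for each masked position given the remaining context and the sentence label.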
The above process has two drawbacks. First, the LM is fixed after fitting to the task data. In the subsequent phase of training the target model, the LM augments data without knowing the state of the target model, which can lead to sub-optimal results. Second, in the cases where the task dataset is small, the LM can be insufficiently trained for preserving the labels faithfully, resulting in noisy augmented samples.
To address the difficulties, it is beneficial to apply the proposed learning data manipulation algorithm to additionally fine-tune the LM jointly with target model training. As discussed in section 4, this reduces to properly parameterizing the data reward function:
$$R^{aug}_\phi(x, y|\mathcal{D}) = \begin{cases} 1 & \text{if } x \sim g_\phi(x|x^*, y),\ (x^*, y) \in \mathcal{D}, \\ -\infty & \text{otherwise}. \end{cases} \tag{9}$$
That is, a sample $(x, y)$ receives a unit reward when $y$ is the true label and $x$ is the augmented sample by the LM (instead of the exact original data $x^*$). Plugging the reward into Eq.(7), we obtain the data-augmented update for the model parameters:
$$\theta' = \arg\max_\theta\, \mathbb{E}_{(x^*, y) \sim \mathcal{D}}\, \mathbb{E}_{x \sim g_\phi(x|x^*, y)}\left[\log p_\theta(y|x)\right]. \tag{10}$$
That is, we pick an example from the training set, and use the LM to create augmented samples, which are then used to update the target model. Regarding the update of augmentation parameters $\phi$ (Eq.8), since text samples are discrete, to enable efficient gradient propagation through $\theta'$ to $\phi$, we use a gumbel-softmax approximation (jang2016categorical) to $x$ when sampling substitution words from the LM.
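The Gumbel-softmax relaxation replaces a hard draw of a substitution word with a soft probability vector over the vocabulary, so that gradients can flow through the sample. Below is a minimal standard-library sketch of the sampling step only; the function name, the temperature value, and the toy logits are assumptions, and a practical implementation would use an automatic differentiation framework.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample from unnormalized log-probabilities."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
vocab_logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical LM scores for four candidate words
soft_sample = gumbel_softmax(vocab_logits, temperature=0.5, rng=rng)
print(soft_sample)
```

Lower temperatures push the vector closer to a one-hot choice, while higher temperatures give smoother vectors at the cost of a less faithful approximation to discrete sampling.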
We now demonstrate the instantiation of data weighting. We aim to assign an importance weight to each training example to adapt its effect on model training. We automate the process by learning the data weights. This is achieved by parameterizing $R_\phi$ as:
$$R^{w}_\phi(x, y|\mathcal{D}) = \begin{cases} \phi_i & \text{if } (x, y) = (x_i^*, y_i^*),\ (x_i^*, y_i^*) \in \mathcal{D}, \\ -\infty & \text{otherwise}, \end{cases} \tag{11}$$
where $\phi_i \in \mathbb{R}$ is the weight associated with the $i$th example. Plugging $R^{w}_\phi$ into Eq.(7), we obtain the weighted update for the model $\theta$:
$$\theta' = \arg\max_\theta\, \mathbb{E}_{(x_i^*, y_i^*) \sim \mathrm{softmax}(\phi_i)}\left[\log p_\theta(y_i^*|x_i^*)\right]. \tag{12}$$
In practice, when minibatch stochastic optimization is used, we approximate the weighted sampling by taking the softmax over the weights of only the minibatch examples. The data weights $\phi$ are updated with Eq.(8). It is worth noting that the previous work (ren2018learning) similarly derives data weights based on their gradient directions on a validation set. Our algorithm differs in that the data weights are parameters maintained and updated throughout the training, instead of re-estimated from scratch in each iteration. Experiments show the parametric treatment achieves superior performance in various settings. There are alternative parameterizations of $R_\phi$ other than Eq.(11). For example, replacing $\phi_i$ in Eq.(11) with $\log \phi_i$ in effect changes the softmax normalization in Eq.(12) to linear normalization, which is used in (ren2018learning).
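The minibatch approximation and the relation between the two normalizations can be checked numerically. The sketch below is illustrative: the function names and the toy numbers are assumptions, and the linear variant assumes strictly positive weights so that their logarithm is defined.

```python
import math

def softmax_weights(phi_batch):
    """Softmax over the weights of the examples in the current minibatch only."""
    m = max(phi_batch)
    exps = [math.exp(p - m) for p in phi_batch]
    total = sum(exps)
    return [e / total for e in exps]

def linear_weights(phi_batch):
    """Linear normalization; assumes every weight is strictly positive."""
    total = sum(phi_batch)
    return [p / total for p in phi_batch]

def weighted_nll(log_probs, weights):
    """Weighted negative log-likelihood of a minibatch."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

phi_batch = [1.0, 2.0, 5.0]            # hypothetical positive weights
log_probs = [-0.2, -1.5, -0.7]         # hypothetical log p(y|x) per example
via_log = softmax_weights([math.log(p) for p in phi_batch])
via_linear = linear_weights(phi_batch)
print(via_log, via_linear, weighted_nll(log_probs, via_linear))
```

Because the exponential of a logarithm returns the original value, a softmax over log-weights coincides with dividing each weight by the batch sum, which is the linear normalization mentioned above.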
We empirically validate the proposed data manipulation approach through extensive experiments on learning augmentation and weighting. We study both text and image classification, in two difficult settings of low data regime and imbalanced labels. (Code available at https://github.com/tanyuqian/learning-data-manipulation.)
Base Models. We choose strong pretrained networks as our base models for both text and image classification. Specifically, on text data, we use the BERT (base, uncased) model (devlin2018bert); while on image data, we use ResNet-34 (he2016deep) pretrained on ImageNet. We show that, even with the large-scale pretraining, data manipulation can still be very helpful to boost the model performance on downstream tasks. Since our approach uses validation sets for manipulation parameter learning, for a fair comparison with the base model, we train the base model in two ways. The first is to train the model on the training sets as usual and select the best step using the validation sets; the second is to train on the merged training and validation sets for a fixed number of steps. The step number is set to the average number of steps selected in the first method. We report the results of both methods.
Comparison Methods. We compare our approach with a variety of previous methods that were designed for specific manipulation schemes: (1) For text data augmentation, we compare with the latest model-based augmentation (wu2018conditional) which uses a fixed conditional BERT language model for word substitution (section 4.2). As with base models, we also tried fitting the augmentation model to both the training data and the joint training-validation data, and did not observe a significant difference. Following (wu2018conditional), we also study a conventional approach that replaces words with their synonyms using WordNet (miller1995wordnet). (2) For data weighting, we compare with the state-of-the-art approach (ren2018learning) that dynamically re-estimates sample weights in each iteration based on the validation set gradient directions. We follow (ren2018learning) and also evaluate the commonly used proportion method that weights data by inverse class frequency.
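The proportion baseline can be sketched as follows. This is a hedged illustration rather than the exact baseline used in the experiments: the function name is hypothetical, and normalizing the weights to sum to one is an assumption made here for readability.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each example by the inverse of its class count, normalized to sum to 1."""
    counts = Counter(labels)
    raw = [1.0 / counts[y] for y in labels]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical imbalanced label list: 20 examples of class 1, 1000 of class 2.
labels = [1] * 20 + [2] * 1000
weights = inverse_frequency_weights(labels)
print(sum(weights[:20]), sum(weights[20:]))
```

Under this scheme every class contributes the same total weight regardless of its size, and the weights stay fixed throughout training, in contrast to the adaptive schemes described above.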
For both the BERT classifier and the augmentation model (which is also based on BERT), we use Adam optimization with an initial learning rate of 4e-5. For ResNets, we use SGD optimization with a learning rate of 1e-3. For text data augmentation, we augment each minibatch by generating two or three samples for each data point (each with 1, 2 or 3 substitutions), and use both the samples and the original data to train the model. For data weighting, to avoid exploding values, we update the weight of each data point in a minibatch by decaying the previous weight value with a factor of 0.1 and then adding the gradient. All experiments were implemented with PyTorch (pytorch.org) and were performed on a Linux machine with 4 GTX 1080Ti GPUs and 64GB RAM. All reported results are averaged over 15 runs ± one standard deviation.
We study the problem where only very few labeled examples for each class are available. Both our augmentation and weighting boost base model performance, and are superior to the respective comparison methods. We also observe that augmentation performs better than weighting in the low-data setting.
For text classification, we use the popular benchmark datasets, including SST-5 for 5-class sentence sentiment (socher2013recursive), IMDB for binary movie review sentiment (maas2011learning), and TREC for 6-class question types (li2002learning). We subsample a small training set on each task by randomly picking 40 instances for each class. We further create small validation sets, i.e., 2 instances per class for SST-5, and 5 instances per class for IMDB and TREC, respectively. The reason we use slightly more validation examples on IMDB and TREC is that the model can easily achieve 100% validation accuracy if the validation sets are too small. Thus, the SST-5 task has $5 \times (40 + 2) = 210$ labeled examples in total, while IMDB has $2 \times (40 + 5) = 90$ labels and TREC has $6 \times (40 + 5) = 270$. Such extremely small datasets pose significant challenges for learning deep neural networks. Since the manipulation parameters are trained using the small validation sets, to avoid possible overfitting we restrict the training to a small number (e.g., 5 or 10) of epochs. For image classification, we similarly create a small subset of the CIFAR10 data, which includes 40 instances per class for training, and 2 instances per class for validation.
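The per-class subsampling and the resulting label budgets can be sketched as follows. The helper names and the synthetic pool are hypothetical; the class counts and per-class sizes are the ones stated above.

```python
import random

def subsample_per_class(examples, per_class, rng):
    """Randomly pick `per_class` examples for every label in `examples`.
    Raises ValueError if some class has fewer than `per_class` examples."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append((x, y))
    subset = []
    for y in sorted(by_label):
        subset.extend(rng.sample(by_label[y], per_class))
    return subset

def total_labeled(num_classes, train_per_class, val_per_class):
    """Total number of labeled examples used for training plus validation."""
    return num_classes * (train_per_class + val_per_class)

# Hypothetical pool: 5 classes with 100 placeholder examples each.
pool = [(f"sentence-{c}-{i}", c) for c in range(5) for i in range(100)]
train_subset = subsample_per_class(pool, 40, random.Random(0))
print(len(train_subset), total_labeled(5, 40, 2), total_labeled(2, 40, 5), total_labeled(6, 40, 5))
```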
| Model | SST-5 (40+2) | IMDB (40+5) | TREC (40+5) |
|---|---|---|---|
| Base model: BERT (devlin2018bert) | | | |
| Base model + val-data | | | |
| Fixed augmentation (wu2018conditional) | | | |
| Ours: Fine-tuned augmentation | | | |
Table 1 shows the manipulation results on text classification. For data augmentation, our approach significantly improves over the base model on all the three datasets. Besides, compared to both the conventional synonym substitution and the approach that keeps the augmentation network fixed, our adaptive method that fine-tunes the augmentation network jointly with model training achieves superior results. Indeed, the heuristic-based synonym approach can sometimes harm the model performance (e.g., SST-5 and IMDB), as also observed in previous work (wu2018conditional; kobayashi2018contextual). This can be because the heuristic rules do not fit the task or datasets well. In contrast, learning-based augmentation has the advantage of adaptively generating useful samples to improve model training.
Table 1 also shows the data weighting results. Our weight learning consistently improves over the base model and the latest weighting method (ren2018learning). In particular, instead of re-estimating sample weights from scratch in each iteration (ren2018learning), our approach treats the weights as manipulation parameters maintained throughout the training. We speculate that the parametric treatment can adapt weights more smoothly and provide historical information, which is beneficial in the small-data context.
It is interesting to see from Table 1 that our augmentation method consistently outperforms the weighting method, showing that data augmentation can be a more suitable technique than data weighting for manipulating small-size data. Our approach provides the generality to instantiate diverse manipulation types and learn with the same single procedure.
To investigate the augmentation model and how the fine-tuning affects the augmentation results, we show in Figure 2 the top-5 most probable word substitutions predicted by the augmentation model for two masked tokens, respectively. Comparing the results of epoch 1 and epoch 3, we can see the augmentation model evolves and dynamically adjusts the augmentation behavior as the training proceeds. Through fine-tuning, the model seems to make substitutions that are more coherent with the conditioning label and relevant to the original words (e.g., replacing the word “striking” with “bland” in epoch 1 vs. “charming” in epoch 3).
Table 2 shows the data weighting results on image classification. We evaluate two settings with the ResNet-34 base model being initialized randomly or with pretrained weights, respectively. Our data weighting consistently improves over the base model and (ren2018learning) regardless of the initialization.
|Base model: BERT (devlin2018bert)|
|Base model + val-data|
|Base model: ResNet (he2016deep)|
|Base model + val-data|
We next study a different problem setting where the training data of different classes are imbalanced. We show the data weighting approach greatly improves the classification performance. We also observe that the LM data augmentation approach, which performs well in the low-data setting, fails on the class-imbalance problems.
Though the methods are broadly applicable to multi-way classification problems, here we only study binary classification tasks for simplicity. For text classification, we use the SST-2 sentiment analysis benchmark (socher2013recursive); while for image, we select class 1 and 2 from CIFAR10 for binary classification. We use the same processing on both datasets to build the class-imbalance setting. Specifically, we randomly select 1,000 training instances of class 2, and vary the number of class-1 instances in $\{20, 50, 100\}$. For each dataset, we use 10 validation examples in each class. Trained models are evaluated on the full binary-class test set.
Table 3 shows the classification results on SST-2 with varying imbalance ratios. We can see our data weighting performs best across all settings. In particular, the improvement over the base model increases as the data gets more imbalanced, ranging from around 6 accuracy points on 100:1000 to over 20 accuracy points on 20:1000. Our method is again consistently better than (ren2018learning), validating that the parametric treatment is beneficial. The proportion-based data weighting provides only limited improvement, showing the advantage of adaptive data weighting. The base model trained on the joint training-validation data for fixed steps fails to perform well, partly due to the lack of a proper mechanism for selecting steps.
Table 4 shows the results on imbalanced CIFAR10 classification. Similarly, our method outperforms other comparison approaches. In contrast, the fixed proportion-based method sometimes harms the performance as in the 50:1000 and 100:1000 settings.
We also tested the text augmentation LM on the SST-2 imbalanced data. Interestingly, the augmentation tends to hinder model training and yields accuracy of around 50% (random guess). This is because the augmentation LM is first fit to the imbalanced data, which makes label preservation inaccurate and introduces lots of noise during augmentation. Though a more carefully designed augmentation mechanism can potentially help with imbalanced classification (e.g., augmenting only the rare classes), the above observation further shows that the varying data manipulation schemes have different applicable scopes. Our approach is thus favorable as the single algorithm can be instantiated to learn different schemes.
Conclusions. We have developed a new method of learning different data manipulation schemes with the same single algorithm. Different manipulation schemes reduce to just different parameterization of the data reward function. The manipulation parameters are trained jointly with the target model parameters. We instantiate the algorithm for data augmentation and weighting, and show improved performance over strong base models and previous manipulation methods. We are excited to explore more types of manipulations such as data synthesis, and in particular study the combination of different manipulation schemes.
The proposed method builds upon the connections between supervised learning and reinforcement learning (RL) (tan2018connecting) through which we extrapolate an off-the-shelf reward learning algorithm in the RL literature to the supervised setting. The way we obtained the manipulation algorithm represents a general means of innovating problem solutions based on unifying formalisms of different learning paradigms. Specifically, a unifying formalism not only offers new understandings of the seemingly distinct paradigms, but also allows us to systematically apply solutions to problems in one paradigm to similar problems in another. Previous work along this line has produced fruitful results in other domains. For example, an extended formulation of (tan2018connecting) that connects RL and posterior regularization (PR) (ganchev2010posterior; hu2016harnessing) has similarly enabled exporting a reward learning algorithm to the context of PR for learning structured knowledge (hu2018deep). By establishing a uniform abstraction of GANs (goodfellow2014generative) and VAEs (kingma2013auto), hu2018unifying exchange techniques between the two families and obtain improved generative modeling. Other work in a similar spirit includes (roweis1999unifying; samdani2012unified; finn2016connection, etc.).
By extrapolating algorithms between paradigms, one can go beyond crafting new algorithms from scratch as in most existing studies, which often requires deep expertise and yields unique solutions in a dedicated context. Instead, innovation becomes easier by importing rich ideas from other paradigms, and is repeatable as a new algorithm can be methodically extrapolated to multiple different contexts.