Good-Enough Compositional Data Augmentation
We propose a simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models. Under this protocol, synthetic training examples are constructed by taking real training examples and replacing (possibly discontinuous) fragments with other fragments that appear in at least one similar environment. The protocol is model-agnostic and useful for a variety of tasks. Applied to neural sequence-to-sequence models, it reduces relative error rate by up to 87% on problems from the diagnostic SCAN tasks and by 16% on a semantic parsing task. Applied to n-gram language modeling, it reduces perplexity by roughly 1% on small datasets in several languages.
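The fragment-substitution step can be sketched as follows. This is a minimal illustration, not the paper's released implementation: it assumes contiguous, whitespace-tokenized fragments (the protocol also allows discontinuous ones) and defines an environment as the sentence with the fragment gapped out; all function names here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

GAP = "<gap>"

def fragments_and_envs(sentence, max_len=3):
    """Yield (fragment, environment) pairs for every contiguous span.

    The environment is the sentence with the span replaced by a gap
    token. (Simplification: discontinuous fragments are omitted.)
    """
    toks = sentence.split()
    for i in range(len(toks)):
        for j in range(i + 1, min(i + max_len, len(toks)) + 1):
            frag = " ".join(toks[i:j])
            env = " ".join(toks[:i] + [GAP] + toks[j:])
            yield frag, env

def augment(corpus, max_len=3):
    """Generate synthetic sentences by swapping fragments that share
    at least one environment (the interchangeability test)."""
    env_to_frags = defaultdict(set)   # environment -> fragments seen in it
    frag_to_envs = defaultdict(set)   # fragment -> environments it fills
    for sent in corpus:
        for frag, env in fragments_and_envs(sent, max_len):
            env_to_frags[env].add(frag)
            frag_to_envs[frag].add(env)

    synthetic = set()
    for frags in env_to_frags.values():
        # Any two fragments that co-occur in one environment are treated
        # as interchangeable: substitute each into the other's environments.
        for a, b in combinations(frags, 2):
            for env in frag_to_envs[a] - frag_to_envs[b]:
                synthetic.add(env.replace(GAP, b))
            for env in frag_to_envs[b] - frag_to_envs[a]:
                synthetic.add(env.replace(GAP, a))
    return synthetic - set(corpus)

# Toy example: "cat" and "wug" share the environment "the <gap> sang",
# so "cat"'s other environment licenses a new sentence.
corpus = ["the cat sang", "the wug sang", "the cat danced"]
print(sorted(augment(corpus)))  # -> ['the wug danced']
```

Note the deliberately loose criterion: two fragments that share even a single environment are treated as interchangeable everywhere they occur, which is what makes the augmentation "good enough" rather than exact.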