Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data

04/28/2020
by Alana Marzoev, et al.

Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets can be the most challenging part of the development process. We address this problem by introducing a general-purpose technique for “simulation-to-real” transfer in language understanding problems with a delimited set of target behaviors, making it possible to develop models that can interpret natural utterances without natural training data. We begin with a synthetic data generation procedure, and train a model that can accurately interpret utterances produced by the data generator. To generalize to natural utterances, we automatically find projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, our approach matches or outperforms state-of-the-art models trained on natural language data in several domains. These results suggest that simulation-to-real transfer is a practical framework for developing NLP applications, and that improved models for transfer might provide wide-ranging improvements in downstream tasks.
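The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy bag-of-words vector in place of learned sentence embeddings, an exhaustive nearest-neighbor search over a tiny enumerated set of synthetic utterances, and a lookup table standing in for the model trained on synthetic data. The names `embed`, `cosine_distance`, `project`, `interpret`, and the `SYNTHETIC` table are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical synthetic utterances paired with their interpretations.
# In the setting the abstract describes, a model trained on generator
# output would produce these; a lookup table stands in for it here.
SYNTHETIC = {
    "turn on the light in the kitchen": ("set_light", "kitchen", "on"),
    "turn off the light in the kitchen": ("set_light", "kitchen", "off"),
    "play music in the bedroom": ("play_music", "bedroom"),
}


def embed(sentence):
    """Toy stand-in for a learned sentence embedding: word counts."""
    return Counter(sentence.lower().split())


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty utterance as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def project(utterance, synthetic_utterances):
    """Return the synthetic utterance closest to the natural one."""
    query = embed(utterance)
    return min(
        synthetic_utterances,
        key=lambda s: cosine_distance(query, embed(s)),
    )


def interpret(utterance):
    """Project onto the synthetic language, then interpret the projection."""
    return SYNTHETIC[project(utterance, SYNTHETIC)]


if __name__ == "__main__":
    print(interpret("please turn the kitchen light off"))
    print(interpret("put on some music in the bedroom"))
```

The sketch keeps the two stages separate: `project` only needs a distance over utterances, while `interpret` only ever sees utterances drawn from the synthetic language. A real system would swap in a learned embedding and a search procedure that scales beyond an enumerated list.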

research
05/07/2023

LatinCy: Synthetic Trained Pipelines for Latin NLP

This paper introduces LatinCy, a set of trained general purpose Latin-la...
research
12/17/2018

Multi-task learning to improve natural language understanding

Recent advancements in sequence-to-sequence neural network architectur...
research
01/31/2023

Recursive Neural Networks with Bottlenecks Diagnose (Non-)Compositionality

A recent line of work in NLP focuses on the (dis)ability of models to ge...
research
10/11/2021

Calibrate your listeners! Robust communication-based training for pragmatic speakers

To be good conversational partners, natural language processing (NLP) sy...
research
04/07/2021

Interpreting Verbal Metaphors by Paraphrasing

Metaphorical expressions are difficult linguistic phenomena, challenging...
research
03/24/2022

Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets

Natural language processing models often exploit spurious correlations b...
research
10/13/2022

Benchmarking Long-tail Generalization with Likelihood Splits

In order to reliably process natural language, NLP systems must generali...
