Don't paraphrase, detect! Rapid and Effective Data Collection for Semantic Parsing

08/26/2019
by Jonathan Herzig, et al.

A major hurdle on the road to conversational interfaces is the difficulty in collecting data that maps language utterances to logical forms. One prominent approach for data collection has been to automatically generate pseudo-language paired with logical forms, and paraphrase the pseudo-language to natural language through crowdsourcing (Wang et al., 2015). However, this data collection procedure often leads to low performance on real data, due to a mismatch between the true distribution of examples and the distribution induced by the data collection procedure. In this paper, we thoroughly analyze two sources of mismatch in this process: the mismatch in logical form distribution and the mismatch in language distribution between the true and induced distributions. We quantify the effects of these mismatches, and propose a new data collection approach that mitigates them. Assuming access to unlabeled utterances from the true distribution, we combine crowdsourcing with a paraphrase model to detect correct logical forms for the unlabeled utterances. On two datasets, our method leads to 70.6 accuracy on average on the true distribution, compared to 51.3 in paraphrasing-based data collection.
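The detection step described above can be illustrated with a minimal sketch. The abstract does not spell out how candidates are scored or presented to workers, so everything here is an assumption for illustration: the function names, the shortlist size `k`, and the token-overlap scorer, which is only a toy stand-in for a trained paraphrase model. The idea shown is that a scorer ranks generated pseudo-language utterances against an unlabeled utterance, and an annotator then picks the matching one (or none) from a short list instead of writing a paraphrase.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical candidate: (generated pseudo-language utterance, logical form).
Candidate = Tuple[str, str]
# Hypothetical annotator: returns the index of the correct shortlist entry, or None.
Annotator = Callable[[str, List[Candidate]], Optional[int]]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased tokens; a toy stand-in for a paraphrase model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def top_k_candidates(
    utterance: str,
    candidates: List[Candidate],
    k: int = 3,
    score: Callable[[str, str], float] = token_overlap,
) -> List[Candidate]:
    """Rank candidates by similarity to the utterance and keep the best k."""
    ranked = sorted(candidates, key=lambda c: score(utterance, c[0]), reverse=True)
    return ranked[:k]


def detect(
    utterance: str,
    candidates: List[Candidate],
    annotate: Annotator,
    k: int = 3,
) -> Optional[str]:
    """Return the logical form the annotator detects, or None if none matches."""
    shortlist = top_k_candidates(utterance, candidates, k)
    choice = annotate(utterance, shortlist)
    return shortlist[choice][1] if choice is not None else None


if __name__ == "__main__":
    # Invented pool of generated utterances and logical forms, for illustration only.
    pool = [
        ("author of article", "author(article)"),
        ("article published in 2019", "filter(article, year, 2019)"),
        ("article with most citations", "argmax(article, citations)"),
    ]

    # Simulated worker: picks the first shortlist entry that mentions the year.
    def worker(utt: str, shortlist: List[Candidate]) -> Optional[int]:
        for i, (text, _) in enumerate(shortlist):
            if "2019" in text:
                return i
        return None

    print(detect("articles published in 2019", pool, worker))
```

In a real pipeline the scorer would be a learned paraphrase model and the annotator a crowd worker; utterances for which no candidate is selected would simply remain unlabeled.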


research
05/18/2022

Addressing Resource and Privacy Constraints in Semantic Parsing Through Data Augmentation

We introduce a novel setup for low-resource task-oriented semantic parsi...
research
01/10/2019

Sentence Rewriting for Semantic Parsing

A major challenge of semantic parsing is the vocabulary mismatch problem...
research
06/01/2021

What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

Crowdsourcing is widely used to create data for common natural language ...
research
05/09/2017

Logical Parsing from Natural Language Based on a Neural Translation Model

Semantic parsing has emerged as a significant and powerful paradigm for ...
research
04/15/2021

Does Putting a Linguist in the Loop Improve NLU Data Collection?

Many crowdsourced NLP datasets contain systematic gaps and biases that a...
research
06/16/2016

Simpler Context-Dependent Logical Forms via Model Projections

We consider the task of learning a context-dependent mapping from uttera...
research
08/31/2023

The Smart Data Extractor, a Clinician Friendly Solution to Accelerate and Improve the Data Collection During Clinical Trials

In medical research, the traditional way to collect data, i.e. browsing ...
