Outlier Detection for Improved Data Quality and Diversity in Dialog Systems

04/05/2019
by Stefan Larson, et al.

In a corpus of data, outliers are either errors (mistakes in the data that are counterproductive) or unique samples (informative samples that improve model robustness). Identifying outliers can lead to better datasets by (1) removing noise from datasets and (2) guiding the collection of additional data to fill gaps. However, the problem of detecting both outlier types has received relatively little attention in NLP, particularly for dialog systems. We introduce a simple and effective technique for detecting both erroneous and unique samples in a corpus of short texts using neural sentence embeddings combined with distance-based outlier detection. We also present a novel data collection pipeline built atop our detection technique to automatically and iteratively mine unique data samples while discarding erroneous samples. Experiments show that our outlier detection technique is effective at finding errors, while our data collection pipeline yields highly diverse corpora that in turn produce more robust intent classification and slot-filling models.
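
The combination of sentence embeddings with distance-based outlier detection can be illustrated with a minimal sketch. The abstract does not state which distance or reference point is used, so the code below assumes one common choice: Euclidean distance from the mean embedding of the corpus. The function names `mean_vector` and `rank_outliers` are hypothetical, and the toy 2-d vectors stand in for the output of a real neural sentence encoder.

```python
import math
from typing import List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_outliers(
    texts: Sequence[str], embeddings: Sequence[Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (text, distance) pairs sorted from most to least outlying.

    Assumption: the outlier score is the Euclidean distance between a
    sample's embedding and the mean embedding of the whole corpus.
    """
    if len(texts) != len(embeddings) or not texts:
        raise ValueError("need one embedding per text, and at least one text")
    center = mean_vector(embeddings)
    scored = [(text, math.dist(vec, center)) for text, vec in zip(texts, embeddings)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-d vectors (hypothetical values) in place of real sentence embeddings.
texts = ["what is my balance", "show my balance", "balance please", "order a pizza"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [-2.0, 3.0]]

ranking = rank_outliers(texts, embeddings)
for text, distance in ranking:
    print(f"{distance:.2f}  {text}")
```

A ranking of this kind only surfaces candidates. Whether a high-scoring sample is an error or a unique sample has to be decided in a separate step, which the abstract does not detail.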
