Multi-Task Self-Supervised Learning for Disfluency Detection

08/15/2019 · Shaolei Wang, et al. · Harbin Institute of Technology · University of Oxford · The Regents of the University of California

Most existing approaches to disfluency detection heavily rely on human-annotated data, which is expensive to obtain in practice. To tackle the training data bottleneck, we investigate methods for combining multiple self-supervised tasks, i.e., supervised tasks where data can be collected without manual labeling. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled news data, and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words, and (ii) sentence classification to distinguish original sentences from grammatically-incorrect sentences. We then combine these two tasks to jointly train a network. The pre-trained network is then fine-tuned using human-annotated disfluency detection training data. Experimental results on the commonly used English Switchboard test set show that our approach can achieve competitive performance compared to the previous systems (trained using the full dataset) by using less than 1% (1000 sentences) of the training data. Our method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard.


1 Introduction

Automatic speech recognition (ASR) outputs often contain various disfluencies, which create barriers to subsequent text processing tasks like parsing, machine translation, and summarization. Disfluency detection (Zayats et al., 2016; Wang et al., 2016; Wu et al., 2015) focuses on recognizing the disfluencies from ASR outputs. As shown in Figure 1, a standard annotation of the disfluency structure indicates the reparandum (words that the speaker intends to discard), the interruption point (denoted as ‘+’, marking the end of the reparandum), an optional interregnum (filled pauses, discourse cue words, etc.) and the associated repair (Shriberg, 1994).

Figure 1: A sentence from the English Switchboard corpus with disfluencies annotated. RM=Reparandum, IM=Interregnum, RP=Repair. The preceding RM is corrected by the following RP.
Type         Annotation
repair       [ I just + I ] enjoy working
repair       [ we want + {well} in our area we want ] to
repetition   [ it's + {uh} it's ] almost like
restart      [ we would like + ] let's go to the
Table 1: Different types of disfluencies.

Ignoring the interregnum, disfluencies are categorized into three types: restarts, repetitions and corrections. Table 1 gives a few examples. Interregnums are relatively easy to detect, as they are often fixed phrases, e.g. “uh”, “you know”. Reparandums, on the other hand, are more difficult to detect because they appear in free form. As a result, most previous disfluency detection work focuses on detecting reparandums.

Most work (Zayats and Ostendorf, 2018; Lou and Johnson, 2017; Wang et al., 2017; Jamshid Lou et al., 2018; Zayats and Ostendorf, 2019) on disfluency detection heavily relies on human-annotated data, which is scarce and expensive to obtain in practice. In this paper, we investigate self-supervised learning methods (Agrawal et al., 2015; Fernando et al., 2017) to tackle this training data bottleneck. Self-supervised learning aims to train a network on auxiliary tasks where ground truth is obtained automatically. The merit of this line of work is that it does not require manual annotation yet still uses supervised learning, by inferring supervisory signals from the structure of the data. Networks pre-trained on these tasks can be fine-tuned to perform well on standard supervised tasks with less manually-labeled data than randomly-initialized networks. In the natural language processing domain, self-supervised research has mainly focused on word embeddings (Mikolov et al., 2013a,b) and language model learning (Bengio et al., 2003; Peters et al., 2018; Radford et al., 2018). Motivated by the success of self-supervised learning, we propose two self-supervised tasks for disfluency detection, as shown in Figure 2.

Figure 2: Illustration of our proposed methods.

The first task aims to tag the corrupted parts of a disfluent sentence, which is generated by randomly adding words to a fluent sentence. Although there are discrepancies between the distribution of the gold disfluency detection data and the generated sentences, this task trains the model to recover the fluent sentence from the disfluent one, which matches the final goal of disfluency detection.

The second task is sentence classification: distinguishing original sentences from corrupted sentences. We pair each original sentence from the news dataset with a disfluent sentence generated by randomly deleting or adding words to that fluent sentence. The goal of the task is to take these sentence pairs as input and predict which sentence is the fluent one. This task enables the model to distinguish grammatically-correct sentences from grammatically-incorrect ones. We hypothesize that this task is helpful for disfluency detection, as one core challenge of disfluency detection is keeping the output sentences grammatically correct.

The second task can help the first by modeling sentence-level grammatical information. Based on this hypothesis, we combine the two tasks to jointly train a network on the automatically-constructed pseudo training data. The pre-trained network is later fine-tuned using human-annotated disfluency detection data.

Our contributions can be summarized as follows:

  • We propose two self-supervised tasks for disfluency detection to tackle the training data bottleneck. To the best of our knowledge, this is the first work to investigate self-supervised representation learning for disfluency detection.

  • Building on the two self-supervised tasks, we further investigate multi-task methods for combining them during pre-training.

  • Experimental results on the commonly used English Switchboard test set show that our approach can achieve competitive performance compared to the previous systems (trained using the full dataset) by using less than 1% (1000 sentences) of the training data. Our method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard.

2 Proposed Approach

2.1 Self-Supervised Learning Task

Let $x = (x_1, x_2, \dots, x_n)$ be an ordered sequence of tokens taken from raw unlabeled news data, which we assume to be fluent. We then propose two self-supervised tasks.

Tagging Task

The input of the tagging task is a disfluent sentence $x'$, generated by randomly adding words to a fluent sentence. $x'$ is fed into a transformer encoder network to learn a representation of each word. The goal is to detect the added noisy words by assigning a label to each word, where one label marks an added (noisy) word and the other marks a fluent word. Although the distribution of the tagging-task data differs from that of the gold disfluency detection data, the training goal, making the sentence fluent by deleting the disfluent words, matches the goal of disfluency detection. We argue that the tagging model can thus capture sentence-level structural information that is helpful for disfluency detection.

We start from a fluent sequence $x$ and introduce random perturbations to generate a disfluent sentence $x'$. More specifically, we propose two types of perturbations:

  • Repetition($k$): the $k$ words (with $k$ randomly selected from a fixed small range) starting from a chosen position $i$ are repeated.

  • Inserting($k$): we randomly pick a $k$-gram (with $k$ randomly selected from a fixed small range) from the news corpus and insert it at position $i$.

For the input fluent sentence, we randomly choose one to three positions, and then randomly apply one of the two perturbations at each selected position to generate the disfluent sentence $x'$. It is important to note that in some cases $x'$ may itself form a fluent sentence and hence violate the definition of a disfluent sentence. We do not explicitly address this issue and assume that such cases are relatively rare and do not harm the training goal when the training data is large.
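
For concreteness, the following is a minimal Python sketch of how pseudo training data for the tagging task could be constructed. The label names, the span-length bound, and the fifty-fifty choice between the two perturbations are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical label names: "ADD" marks an injected word, "FLUENT" an original word.
ADD, FLUENT = "ADD", "FLUENT"

def corrupt_for_tagging(tokens, ngram_pool, max_positions=3, max_span=6):
    """Randomly repeat spans or insert news n-grams into a fluent sentence.

    Returns the disfluent token sequence and one label per token.
    The span-length bound (max_span) is an assumed value.
    """
    tagged = [(tok, FLUENT) for tok in tokens]
    for _ in range(random.randint(1, max_positions)):
        pos = random.randrange(len(tagged) + 1)
        if random.random() < 0.5:
            # Repetition: copy up to max_span words starting at pos; the copy is "added".
            k = random.randint(1, max_span)
            span = [tok for tok, _ in tagged[pos:pos + k]]
        else:
            # Insertion: take a random k-gram from the news corpus.
            span = random.choice(ngram_pool)
        tagged[pos:pos] = [(tok, ADD) for tok in span]
    noisy_tokens = [tok for tok, _ in tagged]
    labels = [lab for _, lab in tagged]
    return noisy_tokens, labels

# Toy usage with a tiny n-gram pool.
pool = [["you", "know"], ["in", "our", "area"]]
print(corrupt_for_tagging("we want to go home".split(), pool))
```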

Sentence Classification Task

The input of the sentence classification task is a sentence pair, where one sentence is fluent and the other is disfluent, generated by randomly adding or deleting words from the corresponding fluent sentence. The sentence pair is fed into a transformer encoder network to obtain a sentence-pair representation. The goal of the task is to discriminate between the fluent sentence and the corresponding disfluent one. We define a two-way label set whose labels indicate whether the first input sentence was generated by randomly adding words to, or deleting words from, the second sentence. We hypothesize that this task can capture sentence-level grammatical information, which is helpful for disfluency detection, whose goal is likewise to keep the output sentence fluent by deleting the disfluent words.

We construct two kinds of disfluent sentences for this task. We use the same method described in the tagging task to construct disfluent sentences with added noisy words. For disfluent sentences with deleted words, we introduce a new type of perturbation:

  • Delete($k$): for a selected position $i$, the $k$ words (with $k$ randomly selected from a fixed small range) starting from this position are deleted.

For the input fluent sentence, we randomly choose one to three positions and then apply the Delete perturbation to generate the disfluent sentence. Note that each sentence is used to generate only one kind of disfluent sentence, to prevent the model from learning statistical shortcuts (e.g., that the sentence of intermediate length is the fluent one) beyond our goals.
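
A matching sketch for building sentence classification pairs is shown below. It reuses corrupt_for_tagging from the earlier tagging-task sketch, and the two label names (encoding whether the first sentence was produced by adding or deleting words) are assumptions about how one might realize the label set described above.

```python
import random

def delete_perturb(tokens, max_positions=3, max_span=6):
    """Delete one to three short spans from a fluent sentence (assumed span bound)."""
    out = list(tokens)
    for _ in range(random.randint(1, max_positions)):
        if len(out) <= 1:
            break
        pos = random.randrange(len(out))
        k = min(random.randint(1, max_span), len(out) - 1)
        del out[pos:pos + k]
    return out

def make_classification_pair(tokens, ngram_pool):
    """Pair a fluent sentence with exactly one corrupted version of it.

    Each source sentence yields only one perturbation type, so the model
    cannot rely on length statistics alone (see the note above).
    Reuses corrupt_for_tagging() from the earlier tagging-task sketch.
    """
    if random.random() < 0.5:
        noisy, _ = corrupt_for_tagging(tokens, ngram_pool)
        label = "first_has_added_words"
    else:
        noisy = delete_perturb(tokens)
        label = "first_has_deleted_words"
    return (noisy, list(tokens)), label
```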

2.2 Network Structure

Figure 3: Model structure. The parameters of the input embedding layer, the encoder layer, and the tagging layer (yellow box) are shared between pre-training and fine-tuning.

As shown in Figure 3, the model consists of four parts: an input embedding layer, an encoder layer, a tagging layer for the tagging task, and a classification layer for the classification task.

In the input embedding layer, the representation of each token is constructed by summing the corresponding token and position embeddings. For the encoder layer, we use the multi-layer bidirectional transformer encoder described in Vaswani et al. (2017).

For the tagging task, the encoder takes an input sequence $x' = (x'_1, \dots, x'_m)$ and returns a representation sequence $(h_1, \dots, h_m)$. The representation sequence is then passed to the tagging layer to produce a label sequence $(y_1, \dots, y_m)$, where each $y_i$ indicates whether the corresponding word is an added word or a fluent word.

For the sentence classification task, the encoder takes a sentence pair as input and returns a representation of the pair. This representation is then passed to the classification layer to produce the classification label, which indicates how the disfluent sentence in the pair was generated.
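
The PyTorch sketch below shows one way to realize this structure with a shared embedding and encoder and two light-weight heads. The hidden size, head count, and layer count follow Section 3.2; the feed-forward size, the maximum sequence length, the class name, and the use of the first position as the pooled pair representation are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskDisfluencyModel(nn.Module):
    """Shared embedding + transformer encoder with a per-token tagging head
    and a sentence-pair classification head (illustrative sketch)."""

    def __init__(self, vocab_size, max_len=512, d_model=512, n_heads=8,
                 n_layers=6, n_tag_labels=2, n_cls_labels=2, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2048,
                                           dropout=dropout,
                                           activation="gelu",
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.tagging_head = nn.Linear(d_model, n_tag_labels)          # per-token labels
        self.classification_head = nn.Linear(d_model, n_cls_labels)   # pair-level label

    def encode(self, token_ids):
        # Sum token and position embeddings, then run the shared encoder.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.tok_emb(token_ids) + self.pos_emb(positions)[None, :, :]
        return self.encoder(h)

    def forward_tagging(self, token_ids):
        return self.tagging_head(self.encode(token_ids))       # (batch, seq, n_tag_labels)

    def forward_classification(self, pair_ids):
        h = self.encode(pair_ids)
        return self.classification_head(h[:, 0, :])             # pooled first position
```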

2.3 Multi-Task Pre-training Procedure

Multi-task learning helps in sharing information between different tasks and across domains. Our primary aim is to use the sentence classification task to help the tagging task by integrating sentence-level grammatical information.

Under the multi-task learning framework, the parameters of the input embedding layer and the encoder layer are shared between the two tasks. The total loss of the multi-task neural network is computed as

$\mathcal{L} = \mathcal{L}_{tag} + \mathcal{L}_{cls},$

where $\mathcal{L}_{tag}$ is the loss of the tagging task and $\mathcal{L}_{cls}$ is the loss of the sentence classification task.

In practice, we construct mini-batches of training examples in which 30% of the data are single sentences used for the tagging task and the other 70% are sentence pairs for the sentence classification task. Since the embedding and encoder layers are shared between the two tasks, we optimize both loss terms concurrently.
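
A minimal sketch of one joint optimization step is given below. It assumes the model interface from the earlier sketch and iterators that yield padded tensors for each task; it draws one mini-batch per task per step rather than literally mixing them at a 30/70 ratio, and it sums the two losses without weighting, all of which are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tagging_batches, classification_batches):
    """One multi-task optimization step over both self-supervised objectives."""
    optimizer.zero_grad()

    # Tagging objective: per-token cross-entropy over added/fluent labels.
    tok_ids, tag_labels = next(tagging_batches)            # (B, T), (B, T)
    tag_logits = model.forward_tagging(tok_ids)
    loss_tag = F.cross_entropy(tag_logits.transpose(1, 2), tag_labels)

    # Sentence classification objective: one label per sentence pair.
    pair_ids, cls_labels = next(classification_batches)    # (B, T), (B,)
    cls_logits = model.forward_classification(pair_ids)
    loss_cls = F.cross_entropy(cls_logits, cls_labels)

    loss = loss_tag + loss_cls    # summed multi-task loss (assumed unweighted)
    loss.backward()
    optimizer.step()
    return loss.item()
```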

2.4 Disfluency Detection Fine-tuning

We directly fine-tune the pre-trained tagging model (including the input embedding layer, the encoder layer, and the tagging layer) on gold human-annotated disfluency detection data. Given a pre-trained tagging model, this stage converges faster, as it only needs to adapt to the idiosyncrasies of the target disfluency detection data, and it allows us to train a robust disfluency detection model even on small datasets. For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, the learning rate, and the number of training epochs.
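
A sketch of this fine-tuning stage, under the assumption that the gold data loader yields padded token-id and label tensors and that the model exposes the tagging interface from the earlier sketches (learning rate and epoch count follow Section 3.2):

```python
import torch
import torch.nn.functional as F

def fine_tune(model, gold_loader, epochs=20, lr=1e-5):
    """Fine-tune the pre-trained tagging path (embeddings, encoder, tagging head)
    on gold disfluency detection data."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for tok_ids, tag_labels in gold_loader:             # padded (B, T) tensors
            logits = model.forward_tagging(tok_ids)
            loss = F.cross_entropy(logits.transpose(1, 2), tag_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```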

3 Experiment

3.1 Settings

Dataset. English Switchboard (SWBD) (Godfrey et al., 1992) is the standard and largest corpus used for disfluency detection, and we use it as our main data. Following the experimental settings in Charniak and Johnson (2001), we split the Switchboard corpus into train, dev, and test sets as follows: the train set consists of all sw[23].dff files, the dev set consists of all sw4[5-9].dff files, and the test set consists of all sw4[0-1].dff files. Following Honnibal and Johnson (2014), we lower-case the text and remove all punctuation and partial words (words are treated as partial if they are tagged as ‘XX’ or end with ‘-’). We also discard the ‘um’ and ‘uh’ tokens and merge ‘you know’ and ‘i mean’ into single tokens.
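
A rough Python sketch of this normalization is shown below. The underscore used to merge ‘you know’ and ‘i mean’ and the letter-based punctuation test are assumptions, and the partial-word check only uses the trailing ‘-’ since POS tags are not available in this sketch.

```python
import re

FILLERS = {"um", "uh"}

def preprocess_switchboard(tokens):
    """Lower-case, drop punctuation, partial words, and filled pauses,
    then merge 'you know' / 'i mean' into single tokens."""
    toks = [t.lower() for t in tokens]
    toks = [t for t in toks if re.search(r"[a-z]", t)]     # drop pure punctuation
    toks = [t for t in toks if not t.endswith("-")]        # drop partial words
    toks = [t for t in toks if t not in FILLERS]           # drop 'um', 'uh'
    merged, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) in {("you", "know"), ("i", "mean")}:
            merged.append(toks[i] + "_" + toks[i + 1])
            i += 2
        else:
            merged.append(toks[i])
            i += 1
    return merged

print(preprocess_switchboard("Uh I mean , we want- we want to go .".split()))
```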

Unlabeled sentences are randomly extracted from the WMT2017 monolingual language model training data (News Crawl: articles from 2016), consisting of English news (http://www.statmt.org/wmt17/translation-task.html). We then use the methods described in Section 2.1 to construct the pre-training dataset. The training set of the tagging task contains 3 million sentences, half of which are corrupted disfluent sentences and the other half fluent sentences directly extracted from the news corpus. We use 9 million sentence pairs for the sentence classification task.

Metric. Following previous work (Ferguson et al., 2015), token-based precision (P), recall (R), and F-score (F1) are used as the evaluation metrics.
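
Token-level P/R/F1 over the disfluent class can be computed as in the following sketch; the label name "ADD" matches the earlier illustrative sketches, whereas evaluation on gold data uses the Switchboard disfluent/fluent tags.

```python
def token_prf(gold_labels, pred_labels, disfluent="ADD"):
    """Token-level precision, recall, and F1 over the disfluent class."""
    assert len(gold_labels) == len(pred_labels)
    tp = sum(1 for g, p in zip(gold_labels, pred_labels) if g == p == disfluent)
    fp = sum(1 for g, p in zip(gold_labels, pred_labels) if g != disfluent and p == disfluent)
    fn = sum(1 for g, p in zip(gold_labels, pred_labels) if g == disfluent and p != disfluent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```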

Baseline. We build two baseline systems: (1) Transition-based (Wang et al., 2017) is a neural transition-based model that achieves the current state-of-the-art result by integrating complicated hand-crafted features; we directly use the code released by Wang et al. (2017) (https://github.com/hitwsl/transition_disfluency). (2) Transformer-based is a multi-layer bidirectional transformer encoder with random initialization, trained directly on human-annotated disfluency detection data.

Method                Full Dev (P/R/F1)    Full Test (P/R/F1)   1000 sents Dev (P/R/F1)   1000 sents Test (P/R/F1)
Transition-based      92.2 / 84.7 / 88.3   92.1 / 84.1 / 87.9   82.2 / 57.4 / 67.6        81.2 / 56.7 / 66.8
Transformer-based     86.5 / 70.4 / 77.6   86.1 / 71.5 / 78.1   78.2 / 51.3 / 62.0        79.1 / 51.1 / 62.1
Our self-supervised   92.9 / 88.1 / 90.4   93.4 / 87.3 / 90.2   90.0 / 82.8 / 86.3        88.6 / 83.7 / 86.1
Table 2: Experiment results on English Switchboard data, where “Full” means results using 100% of the human-annotated data, and “1000 sents” means results using less than 1% (1000 sentences) of the human-annotated data.
Method P R F1
UBT (Wu et al., 2015) 90.3 80.5 85.1
semi-CRF (Ferguson et al., 2015) 90.0 81.2 85.4
Bi-LSTM (Zayats et al., 2016) 91.8 80.6 85.9
LSTM-NCM (Lou and Johnson, 2017) - - 86.8
Transition-based (Wang et al., 2017) 91.1 84.1 87.5
Our self-supervised (1000 sents) 88.6 83.7 86.1
Our self-supervised (Full) 93.4 87.3 90.2

Table 3: Comparison with previous state-of-the-art methods on the test set of English Switchboard. “Full” means using 100% human-annotated data for fine-tuning, and “1000 sents” means using less than 1% (1000 sentences) human-annotated data for fine-tuning.

3.2 Training Details

In all experiments, including the transformer-based baseline and our self-supervised method, we use a transformer architecture with 512 hidden units, 8 attention heads, 6 hidden layers, GELU activations (Hendrycks and Gimpel, 2016), and a dropout of 0.1. We train our models with the Adam optimizer (Kingma and Ba, 2015).

For the joint tagging and sentence classification objectives, we use streams of 128 tokens and mini-batches of size 256, with a learning rate of 1e-4 and 30 training epochs. When fine-tuning on gold disfluency detection data, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs: we use a batch size of 32, a learning rate of 1e-5, and 20 training epochs.

3.3 Performance On English Switchboard

Table 2 shows the overall performance of our model on both the development and test sets. Our self-supervised method outperforms the baseline methods in all settings. Surprisingly, it achieves an almost 20-point improvement over the transition-based method when using less than 1% (1000 sentences) of the human-annotated disfluency detection data.

We compare our self-supervised model to five top-performing systems, which rely on large-scale human-annotated data and complicated hand-crafted features. As shown in Table 3, our model outperforms the state of the art, achieving a 90.2% F-score. It achieves a 2.7-point improvement over the transition-based method (Wang et al., 2017), the previous state-of-the-art method for disfluency detection. We attribute this success to the model's ability to learn global sentence-level structural information. Surprisingly, with less than 1% (1000 sentences) of the human-annotated training data, our model achieves an F1-score comparable to the previous top-performing systems trained on 100% of the human-annotated data, which shows that our self-supervised method can substantially reduce the need for human-annotated training data. Note that we do not compare our work with the semi-supervised method of Wang et al. (2018), since they treat interregnum and reparandum types equally when training and evaluating their model, while the other systems focus only on reparandums, which are more difficult to detect.

4 Analysis

4.1 Effectiveness of Self-Supervised Learning

We explore the contribution of our self-supervised tasks to the final experimental results. To verify this, we train the baseline transition-based method (Wang et al., 2017) and the transformer-based method by combining the gold training data with the pseudo training data (only the pseudo disfluent sentences, as described in Section 2.1). As shown in Figure 4 (a), the F1-scores of the two baseline methods keep decreasing as the amount of pseudo training data increases, while the F1-score of our self-supervised method keeps increasing. The results show that our self-supervised tasks are much more effective than directly combining the gold training data with the pseudo training data. We attribute the F1-score decrease of the baseline methods to the discrepancy between the distribution of the gold disfluency detection sentences and that of the pseudo disfluent sentences.

4.2 Varying Amounts of Pseudo Data

Figure 4: (a) Plot showing the effectiveness of self-supervised learning compared with the baseline methods of directly combining gold training data with the pseudo training data. (b) Plot showing the impact of pseudo training data size on disfluency detection. (c) Plot showing the impact of human-annotated data size when fine-tuning. (d) Plot showing the robustness of the self-supervised tasks. (e) Plot showing the effectiveness of multi-task learning.

We study the impact of pseudo training data size on the disfluency detection task. Figure 4 (b) reports the results of adding varying amounts of pseudo training data to the self-supervised pre-training model. We observe that the F1-score keeps growing as the amount of automatically-generated data increases. We conjecture that our two self-supervised tasks and the disfluency detection task can coexist harmoniously, and that more automatically-generated training data brings more structural information. Another surprising observation is that performance with the small supervised dataset (1000 sentences) grows faster, which shows that our method has great potential to tackle the training data bottleneck.

4.3 Varying Amounts of Supervised Data

We explore how fine-tuning scales with human-annotated data size, by varying the amount of human-annotated training data the model has access to. We plot F1-score with respect to the amounts of human-annotated disfluency detection data for fine-tuning in Figure  4 (c). Compared with the baseline systems, fine-tuning based on our self-supervised models improves performance considerably when limited gold human-annotated training data is available, but those gains diminish with more high-quality human-annotated data. Using only 2% of the labeled data, our approach already performs as well or better than the previous state-of-the-art transition-based method using 100% of the human-annotated training data, demonstrating that our self-supervised tasks are particularly useful on small datasets.

4.4 Robustness of Self-Supervised Tasks

Our self-supervised tasks are very similar to supervised tasks, except that the training data is collected without manual labeling. To demonstrate the robustness and learnability of our self-supervised tasks, we examine the performance of the self-supervised models on the pseudo test data during pre-training. We plot the performance with respect to the amount of pseudo training data in Figure 4 (d). Performance keeps growing as the amount of automatically-generated data increases, reaching about 80% F1-score on the tagging task and 97% accuracy on the sentence classification task. The results show that our self-supervised tasks are robust and learnable, and can indeed capture sentence-level structural information.

4.5 Ablation Test

As described in Section 2.3, the pre-training framework consists of the tagging task and the sentence classification task. We explicitly compare the impact of these two self-supervised tasks. As shown in Table 4, each of the two self-supervised tasks alone achieves higher performance than the baseline system with random initialization. The best performance is achieved by combining the two self-supervised tasks, which demonstrates that the two tasks can coexist harmoniously and share useful information with each other.

Method           Full (P/R/F1)        1000 sents (P/R/F1)
Random-Initial   86.1 / 71.5 / 78.1   79.1 / 51.1 / 62.1
Tagging          91.8 / 84.0 / 87.7   85.1 / 79.6 / 82.3
Classification   91.2 / 83.1 / 86.9   83.2 / 78.3 / 80.7
Multi-Task       93.4 / 87.3 / 90.2   88.6 / 83.7 / 86.1

Table 4: Results of the ablation experiments on English Switchboard test data. “Random-Initial” means training the transformer network on gold disfluency detection data with random initialization.

Method                F1 (Full)   F1 (1000 sents)
Random-Initial        78.1        62.1
BERT-fine-tune        90.1        82.4
Our self-supervised   90.2        86.1
Combine               91.4        87.8

Table 5: Comparison with BERT. “Random-Initial” means training the transformer network on gold disfluency detection data with random initialization. “Combine” means concatenating the hidden representations of BERT and our self-supervised model for fine-tuning.

4.6 Comparison with BERT

BERT (Devlin et al., 2019) is a strong pre-trained network trained on a corpus of about 3.3 billion words, which has advanced the state of the art for many NLP tasks. We compare our pre-trained model with BERT, using the large version of the pre-trained BERT model (24 transformer layers, hidden size 1024, 16 self-attention heads, 340M parameters in total). Compared with BERT, we use a much smaller training corpus and model (6 transformer layers, hidden size 512, and 8 self-attention heads), limited by our hardware. Results are shown in Table 5. Both our method and BERT outperform the baseline model with random initialization, which demonstrates the effectiveness of pre-training. Although our pre-training corpus and model are much smaller than BERT's, we achieve a result similar to BERT's when fine-tuning on the full gold training data. Surprisingly, our method achieves a 3.7-point improvement over BERT when fine-tuning on less than 1% (1000 sentences) of the gold training data. We also combine our pre-trained model and BERT by concatenating their hidden representations, which yields a further improvement. We also plot the performance with respect to the amount of human-annotated disfluency detection data in Figure 4 (e); the combination of our pre-trained model and BERT is consistently better. This shows that our model and BERT can coexist harmoniously and capture different aspects of information helpful for disfluency detection.

Method                Repetition   Non-repetition   Either
Transition-based      93.8         68.3             87.9
Transformer-based     93.6         58.9             78.1
Our self-supervised   93.7         70.8             90.2
Table 6: F-scores for different types of reparandums on English Switchboard test data.

4.7 Repetitions vs Non-repetitions

Repetition disfluencies are easier to detect, and even simple hand-crafted features can handle them well. Other types of reparandums, such as repairs, are more complex (Zayats et al., 2016; Ostendorf and Hahn, 2013). To better understand model performance, we evaluate each model's ability to detect repetition vs. non-repetition (other) reparandums. The results are shown in Table 6. All three models achieve high scores on repetition reparandums. Our self-supervised model is much better at predicting non-repetitions than the two baseline methods. We conjecture that this is because our self-supervised tasks capture more sentence-level structural information.

5 Related Work

Disfluency Detection

Most work on disfluency detection focuses on supervised learning methods, which fall into three main categories: sequence tagging, noisy-channel, and parsing-based approaches. Sequence tagging approaches label words as fluent or disfluent using a variety of techniques, including conditional random fields (CRF) (Georgila, 2009; Ostendorf and Hahn, 2013; Zayats et al., 2014), Max-Margin Markov Networks (M3N) (Qian and Liu, 2013), Semi-Markov CRF (Ferguson et al., 2015), and recurrent neural networks (Hough and Schlangen, 2015; Zayats et al., 2016; Wang et al., 2016). The main benefit of sequential models is their ability to capture long-term relationships between reparandums and repairs. Noisy-channel models (Charniak and Johnson, 2001; Johnson and Charniak, 2004; Zwarts et al., 2010; Lou and Johnson, 2017) use the similarity between reparandum and repair as an indicator of disfluency. Parsing-based approaches (Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014; Wu et al., 2015; Yoshikawa et al., 2016) jointly perform dependency parsing and disfluency detection. These joint models can capture long-range dependencies of disfluencies as well as chunk-level information; however, training a parsing-based model requires large annotated treebanks that contain both disfluencies and syntactic structures.

All of the above work heavily relies on human-annotated data, and there has been limited effort to tackle the training data bottleneck. Wang et al. (2018) use an autoencoder to help disfluency detection by jointly training the autoencoder and the disfluency detection model. They construct large-scale pseudo disfluent sentences using simple rules and use the autoencoder to reconstruct the disfluent sentences. We take inspiration from their method when generating disfluent sentences. They achieve higher performance by introducing pseudo training sentences; however, the performance of their method still heavily relies on annotated data.

Self-Supervised Representation Learning

Self-supervised learning aims to train a network on an auxiliary task where ground truth is obtained automatically. Over the last few years, many self-supervised tasks have been introduced in the image processing domain; they make use of non-visual signals, intrinsically correlated with the image, as a form of supervision for visual feature learning (Agrawal et al., 2015; Wang and Gupta, 2015; Doersch et al., 2015).

In the natural language processing domain, self-supervised research mainly focuses on word embeddings (Mikolov et al., 2013a,b) and language model learning (Bengio et al., 2003; Peters et al., 2018; Radford et al., 2018). For word embedding learning, the idea is to train a model that maps each word to a feature vector such that it is easy to predict the words in the context given the vector. This converts an apparently unsupervised problem into a “self-supervised” one: learning a function from a given word to the words surrounding it.

Language model pre-training (Bengio et al., 2003; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019) is another line of self-supervised learning. A trained language model learns a function to predict the likelihood of a word given the surrounding sequence of words in the text. There are two main strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018) and BERT (Devlin et al., 2019), introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters.

Motivated by the success of self-supervised learning, we propose a token-level tagging task and a sentence-level classification task that are especially effective for disfluency detection.

Multi-Task Learning

MTL (multi-task learning) has been used for a variety of NLP tasks, including named entity recognition and semantic labeling (Martínez Alonso and Plank, 2017), super-tagging and chunking (Bingel and Søgaard, 2017), and semantic dependency parsing (Peng et al., 2017). The benefits of MTL largely depend on the properties of the tasks at hand, such as the skewness of the data distribution (Martínez Alonso and Plank, 2017), the learning pattern of the auxiliary and main tasks, where “target tasks that quickly plateau” benefit most from “non-plateauing auxiliary tasks” (Bingel and Søgaard, 2017), and the “structural similarity” between the tasks (Peng et al., 2017). In our work, we use the sentence classification task to help the tagging task by integrating sentence-level grammatical information.

6 Conclusion

In this work, we propose two self-supervised tasks to tackle the training data bottleneck. Experimental results on the commonly used English Switchboard test set show that our approach can achieve competitive performance compared to the previous systems (trained using the full dataset) by using less than 1% (1000 sentences) of the training data. Our method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard.

References

  • Agrawal et al. (2015) Pulkit Agrawal, Joao Carreira, and Jitendra Malik. 2015. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45.
  • Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
  • Bingel and Søgaard (2017) Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 164–169, Valencia, Spain. Association for Computational Linguistics.
  • Charniak and Johnson (2001) Eugene Charniak and Mark Johnson. 2001. Edit detection and parsing for transcribed speech. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pages 1–9. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.
  • Doersch et al. (2015) Carl Doersch, Abhinav Gupta, and Alexei A Efros. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430.
  • Ferguson et al. (2015) James Ferguson, Greg Durrett, and Dan Klein. 2015. Disfluency detection with a semi-Markov model and prosodic features. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 257–262. Association for Computational Linguistics.
  • Fernando et al. (2017) Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. 2017. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3636–3645.
  • Georgila (2009) Kallirroi Georgila. 2009. Using integer linear programming for detecting speech disfluencies. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 109–112. Association for Computational Linguistics.
  • Godfrey et al. (1992) John J Godfrey, Edward C Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In ICASSP, pages 517–520. IEEE.
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415.
  • Honnibal and Johnson (2014) Matthew Honnibal and Mark Johnson. 2014. Joint incremental disfluency detection and dependency parsing. Transactions of the Association for Computational Linguistics, 2:131–142.
  • Hough and Schlangen (2015) Julian Hough and David Schlangen. 2015. Recurrent neural networks for incremental disfluency detection. In Sixteenth Annual Conference of the International Speech Communication Association.
  • Jamshid Lou et al. (2018) Paria Jamshid Lou, Peter Anderson, and Mark Johnson. 2018. Disfluency detection using auto-correlational neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4610–4619, Brussels, Belgium. Association for Computational Linguistics.
  • Johnson and Charniak (2004) Mark Johnson and Eugene Charniak. 2004. A tag-based noisy channel model of speech repairs. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 33. Association for Computational Linguistics.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Lou and Johnson (2017) Paria Jamshid Lou and Mark Johnson. 2017. Disfluency detection using a noisy channel model and a deep neural language model. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  • Martínez Alonso and Plank (2017) Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 44–53, Valencia, Spain. Association for Computational Linguistics.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Ostendorf and Hahn (2013) Mari Ostendorf and Sangyun Hahn. 2013. A sequential repetition model for improved disfluency detection. In INTERSPEECH, pages 2624–2628.
  • Peng et al. (2017) Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2037–2048, Vancouver, Canada. Association for Computational Linguistics.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Qian and Liu (2013) Xian Qian and Yang Liu. 2013. Disfluency detection using multi-step stacked learning. In HLT-NAACL, pages 820–825.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
  • Rasooli and Tetreault (2013) Mohammad Sadegh Rasooli and Joel R Tetreault. 2013. Joint parsing and disfluency detection in linear time. In EMNLP, pages 124–129.
  • Shriberg (1994) Elizabeth Ellen Shriberg. 1994. Preliminaries to a theory of speech disfluencies. Ph.D. thesis, Citeseer.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Wang et al. (2018) Feng Wang, Wei Chen, Zhen Yang, Qianqian Dong, Shuang Xu, and Bo Xu. 2018. Semi-supervised disfluency detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3529–3538. Association for Computational Linguistics.
  • Wang et al. (2016) Shaolei Wang, Wanxiang Che, and Ting Liu. 2016. A neural attention model for disfluency detection. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 278–287, Osaka, Japan. The COLING 2016 Organizing Committee.
  • Wang et al. (2017) Shaolei Wang, Wanxiang Che, Yue Zhang, Meishan Zhang, and Ting Liu. 2017. Transition-based disfluency detection using lstms. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2785–2794.
  • Wang and Gupta (2015) Xiaolong Wang and Abhinav Gupta. 2015. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802.
  • Wu et al. (2015) Shuangzhi Wu, Dongdong Zhang, Ming Zhou, and Tiejun Zhao. 2015. Efficient disfluency detection with transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 495–503. Association for Computational Linguistics.
  • Yoshikawa et al. (2016) Masashi Yoshikawa, Hiroyuki Shindo, and Yuji Matsumoto. 2016. Joint transition-based dependency parsing and disfluency detection for automatic speech recognition texts. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1036–1041.
  • Zayats and Ostendorf (2018) Vicky Zayats and Mari Ostendorf. 2018. Robust cross-domain disfluency detection with pattern match networks. arXiv preprint arXiv:1811.07236.
  • Zayats and Ostendorf (2019) Vicky Zayats and Mari Ostendorf. 2019. Giving attention to the unexpected: Using prosody innovations in disfluency detection. arXiv preprint arXiv:1904.04388.
  • Zayats et al. (2016) Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi. 2016. Disfluency detection using a bidirectional lstm. arXiv preprint arXiv:1604.03209.
  • Zayats et al. (2014) Victoria Zayats, Mari Ostendorf, and Hannaneh Hajishirzi. 2014. Multi-domain disfluency and repair detection. In Fifteenth Annual Conference of the International Speech Communication Association.
  • Zwarts et al. (2010) Simon Zwarts, Mark Johnson, and Robert Dale. 2010. Detecting speech repairs incrementally using a noisy channel approach. In Proceedings of the 23rd international conference on computational linguistics, pages 1371–1378. Association for Computational Linguistics.