Toward Grammatical Error Detection from Sentence Labels: Zero-shot Sequence Labeling with CNNs and Contextualized Embeddings

by Allen Schmaltz, et al.
Harvard University

Zero-shot grammatical error detection is the task of tagging token-level errors in a sentence when only given access to labels at the sentence-level for training. Recent work has explored attention- and gradient-based approaches for the task. We extend this line of research to CNNs by analyzing a straightforward decomposition of the sentence-level classifier. Without modification to the underlying architecture, a single-layer CNN can be used to achieve similar F1 scores to a bi-LSTM attention-based approach specifically modified for the task of zero-shot labeling on the standard dataset, as a result of relatively strong recall, but weaker precision. Interestingly, with the advantage of pre-trained contextualized embeddings, this approach yields competitive F1 scores (and with a limited amount of token-labeled data for tuning, F0.5 scores) with baseline (but no longer state-of-the-art) fully supervised bi-LSTM models (using standard pre-trained word embeddings), despite only having access to sentence-level labels for training.




1 Introduction

Recent work (Rei and Søgaard, 2018b) introduced the task of zero-shot sequence labeling to address the following question: “Can attention- or gradient-based visualization techniques be used to infer token-level labels for binary sequence tagging problems, using networks trained only on sentence-level labels?” For this task, models are trained as binary, sentence-level classifiers, and then post-hoc analysis methods are used to derive token-level labels at inference time. For the particular case of zero-shot grammatical error detection, models are only given access to sentence-level labels indicating whether or not each sentence contains an error, and then token-level error labels must be derived without access to explicit token-level supervision. This setup is of practical interest in the case of grammatical error detection, as it may be useful to collect an initial set of annotations at the sentence level and then apply a zero-shot approach as part of a human annotation pipeline to bootstrap further refinement of token-level labels. (There is an assumption here that these sentence-level annotations could be collected via semi-supervised approaches and/or via faster human annotation than at the token level, which we leave to future work.)

Most modern neural networks have complex sets of unidentifiable parameters (see, e.g., Hwang and Ding, 1997; Jain and Wallace, 2019), so interpreting such models with post-hoc methods for purposes of explainability is fraught with not readily resolvable challenges. The approach examined here is not suitable for model explainability, but as we show, may be useful as a (weak) sequence tagger.

Previous approaches for zero-shot grammatical error detection have been based on bidirectional LSTM models, which have been shown to work relatively well for the fully supervised case (e.g., Rei, 2017). Using an attention mechanism, combined with additional loss functions to encourage higher quality token-level labeling, has been found to be more effective than gradient-based approaches (Rei and Søgaard, 2018b).

We revisit this task with convolutional neural networks (CNNs) in order to begin to analyze whether the ngram-filter interactions of the standard CNN baseline classifier exhibit token-level labeling behavior similar to that of previously proposed approaches.

In this brief note, we demonstrate that a single-layer one-dimensional convolutional neural network can be trained for sentence-level classification, and then decomposed in a straightforward way to produce token labels with F1 scores comparable to previously proposed approaches, but with weaker precision. The approach does not require adding additional parameters or objectives to the standard classifier, and serves as a useful baseline. However, as with previously proposed approaches, it results in a very weak sequence labeler compared to fully supervised approaches for grammatical error detection.

Our baseline approach, as with previous zero-shot and fully supervised approaches for error detection, makes use of word embeddings pre-trained on billions of tokens. We also experiment with the recently proposed contextualized embeddings of Devlin et al. (2018). Interestingly, this is sufficient to yield a sequence tagger with token labeling effectiveness competitive with a baseline bidirectional LSTM fully-supervised model (using standard pre-trained embeddings), despite only being given access to labels at the sentence-level for training. Additionally, a small amount of token-labeled data can be used to tune a decision boundary, yielding scores that are—for perspective—competitive with the state-of-the-art approach in the literature from as recently as 2016.

2 Methods

Single-layer one-dimensional convolutional neural networks serve as strong baselines across a number of NLP classification tasks, despite their relative simplicity.

CNN classifier

In this work, we use a CNN architecture similar to that of Kim (2014), as used previously as a baseline for sentence-level classification of grammatical errors in Schmaltz et al. (2016).

Each token, $t_1, \ldots, t_N$, in the sentence is represented by a $D$-dimensional vector, where $N$ is the length of the sentence, including padding symbols, as necessary. The convolutional layer is then applied to this $\mathbb{R}^{D \times N}$ matrix, using a filter of width $Z$, sliding across the $Z$-sized ngrams of the input. The convolution results in a feature map $h_m \in \mathbb{R}^{N-Z+1}$ for each of $M$ total filters, with which we compute a ReLU non-linearity followed by a max-pool over the ngram dimension, resulting in $g \in \mathbb{R}^{M}$. A final linear fully-connected layer, $W \in \mathbb{R}^{C \times M}$, with a bias, $b \in \mathbb{R}^{C}$, followed by a softmax, produces the output distribution over $C$ class labels, $o \in \mathbb{R}^{C}$:

$$o = \mathrm{softmax}(W g + b)$$

In practice, multiple filter widths are used, concatenating the output of the max-pooling prior to the fully-connected layer.
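The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the names `conv_feature_map`, `cnn_forward`, `X`, `filters`, `W`, and `b` are chosen here for the example, the convolution bias is omitted for brevity, and the toy weights are made up.

```python
import math

def conv_feature_map(X, filt):
    """Slide one filter over the sentence.

    X: list of N token vectors, each of length D.
    filt: list of Z weight vectors (one per window position), each of length D.
    Returns N - Z + 1 pre-activation values, one per ngram window.
    """
    N, Z = len(X), len(filt)
    return [
        sum(w * x for j in range(Z) for w, x in zip(filt[j], X[i + j]))
        for i in range(N - Z + 1)
    ]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def cnn_forward(X, filters, W, b):
    """Return (o, g): the class distribution and the max-pooled features."""
    g = []
    for filt in filters:
        h = conv_feature_map(X, filt)
        g.append(max(max(0.0, a) for a in h))  # ReLU, then max-pool over ngrams
    logits = [sum(W[c][m] * g[m] for m in range(len(g))) + b[c]
              for c in range(len(W))]
    return softmax(logits), g

# Toy input: 4 tokens with 2-dimensional embeddings, two filters of width 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
W = [[1.0, 0.0], [0.0, 1.0]]  # fully connected layer, one row per class
b = [0.0, 0.0]
o, g = cnn_forward(X, filters, W, b)
```

Because `filters` is a plain list, filters of different widths can be mixed freely, which corresponds to concatenating the max-pooled outputs of several filter widths before the fully connected layer.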

Sequence labeling from a CNN

Roughly speaking, the matrix multiplication of the output of the max-pooling layer with the fully connected layer can be viewed as a weighted sum of the most relevant filter-ngram interactions for each prediction class. (However, note that the work of Jacovi et al. (2018) proposes a more nuanced view.) More specifically, let $g$ be the output of the max-pooling layer, with one entry for each of the $M$ filters, and let $W$ and $b$ be the weights and bias of the fully connected layer. Each term in $W_{1,1:M} \cdot g$ contributing to the negative class prediction (for the purposes here, the class at index 1), $W_{1,1} g_1, \ldots, W_{1,M} g_M$, can be deterministically traced back to its corresponding filter and the tokens in the window on which the filter was applied. For each token in the input, we assign a negative class contribution score by summing all applicable terms in $W_{1,1:M} \cdot g$ in which the token interacted with a filter that survived the max-pooling layer. Some tokens may have a resulting score of zero. Similarly, we assign a positive class contribution score by summing all applicable terms in $W_{2,1:M} \cdot g$. We now have a decision boundary for each token, assigning the positive class when the positive class contribution score is greater than the negative class contribution score. To account for the bias in the fully connected layer, we simply add the applicable bias value ($b_1$ or $b_2$) to the non-zero token contribution scores prior to token classification.
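The decomposition can be sketched as follows. This is an illustrative reading of the paragraph above, not the authors' code: the function names and toy weights are invented for the example, ties in the max-pool are broken by taking the first window (an assumption), and rows 0 and 1 of `W` in the code play the role of the negative and positive classes (indices 1 and 2 in the text).

```python
def conv_feature_map(X, filt):
    """Pre-activation value of one filter on every ngram window of X."""
    N, Z = len(X), len(filt)
    return [
        sum(w * x for j in range(Z) for w, x in zip(filt[j], X[i + j]))
        for i in range(N - Z + 1)
    ]

def token_contributions(X, filters, W, b):
    """Trace each term W[c][m] * g[m] back to the tokens in the winning window.

    Returns (neg, pos, labels) with one entry per token.
    """
    N = len(X)
    neg = [0.0] * N
    pos = [0.0] * N
    for m, filt in enumerate(filters):
        Z = len(filt)
        h = [max(0.0, a) for a in conv_feature_map(X, filt)]  # ReLU
        best = max(range(len(h)), key=h.__getitem__)  # window kept by max-pool
        g_m = h[best]
        for t in range(best, best + Z):  # tokens covered by that window
            neg[t] += W[0][m] * g_m
            pos[t] += W[1][m] * g_m
    # The bias is added only to non-zero contribution scores.
    neg = [s + b[0] if s != 0.0 else s for s in neg]
    pos = [s + b[1] if s != 0.0 else s for s in pos]
    labels = [1 if p > n else 0 for p, n in zip(pos, neg)]
    return neg, pos, labels

# Toy input: 4 tokens, 2-dimensional embeddings, two filters of width 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],  # fires most strongly on tokens 0-1
    [[0.0, 1.0], [1.0, 0.0]],  # fires most strongly on tokens 1-2
]
W = [[1.0, 0.5],   # negative-class weights
     [0.0, 2.0]]   # positive-class weights
b = [0.1, -0.1]
neg, pos, labels = token_contributions(X, filters, W, b)
```

In this toy case the last token is not covered by any winning window, so both of its scores stay at zero and it is labeled negative, which illustrates the remark that some tokens receive a score of zero.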

3 Experiments


We follow past work on error detection and use the standard training, dev, and test splits of the publicly released subset of the First Certificate in English (FCE) dataset (Yannakoudakis et al., 2011; Rei and Yannakoudakis, 2016), consisting of 28.7k, 2.2k, and 2.7k labeled sentences, respectively.

CNN Model

Our base CNN model uses filter widths of 3, 4, and 5, with 100 filter maps each. For consistency with past work on zero-shot grammatical error detection, we fine-tune 300-dimensional word embeddings with the publicly available GloVe embeddings (Pennington et al., 2014), with a vocabulary of size 7,500.


We also consider a model, CNN+BERT, which makes use of the large-scale pre-trained model of Devlin et al. (2018). This Bidirectional Encoder Representations from Transformers (BERT) model is a multi-layer bidirectional Transformer (Vaswani et al., 2017). The model is trained with masked-language modeling and next-sentence prediction objectives on a large unlabeled corpus of 3.3 billion words. We use the publicly available pre-trained BERT-LARGE model with the case-preserving WordPiece (Wu et al., 2016) model. (We use the PyTorch reimplementation of the original code base.)

We take a feature-based approach, concatenating the top four hidden layers of the pre-trained BERT model with the input word-embeddings of the CNN, resulting in 4,396-dimensional input embeddings. In the CNN+BERT experiments, we use the pre-trained Word2Vec word embeddings of Mikolov et al. (2013) for consistency with the past supervised detection work of Rei and Yannakoudakis (2016). Prior to evaluation, to maintain alignment with the original tokenization and labels, the WordPiece tokenization is reversed (i.e., de-tokenized), with positive/negative token contribution scores averaged over fragments for original tokens split into separate WordPieces.
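The de-tokenization step can be illustrated with a short sketch. It assumes the usual WordPiece convention in which continuation fragments carry a `##` prefix; the function name, the example pieces, and the scores are made up for illustration, and a real alignment to the original tokenization may need extra handling (for example, punctuation that the tokenizer splits off). The first assertion simply checks the stated input size under the assumption of 1,024-dimensional BERT-LARGE hidden layers and 300-dimensional Word2Vec vectors.

```python
# Input size check: four 1,024-dim hidden layers plus 300-dim word embeddings.
assert 4 * 1024 + 300 == 4396

def merge_wordpiece_scores(pieces, scores):
    """Re-join WordPiece fragments and average their contribution scores.

    pieces: WordPiece tokens; fragments continuing a word start with '##'.
    scores: one contribution score per piece.
    Returns (words, averaged_scores), one entry per original token.
    """
    words, groups = [], []
    for piece, score in zip(pieces, scores):
        if piece.startswith("##") and words:
            words[-1] += piece[2:]      # glue the fragment onto the open word
            groups[-1].append(score)
        else:
            words.append(piece)         # start a new original token
            groups.append([score])
    return words, [sum(g) / len(g) for g in groups]

# Hypothetical pieces and scores, not real tokenizer output.
pieces = ["She", "una", "##ffa", "##ble", "go", "."]
scores = [0.0, 0.3, 0.6, 0.9, 1.5, 0.0]
words, merged = merge_wordpiece_scores(pieces, scores)
```

The same function would be applied once to the positive scores and once to the negative scores before comparing them per original token.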

For both CNN and CNN+BERT, we optimize for sentence-level classification, choosing the training epoch with the highest sentence-level F1 score on the dev set, without regard to token-level labels. We set aside 1k token-labeled sentences from the dev set to tune the token-level F0.5 score for comparison purposes for the experiment labeled CNN+BERT+1k.

Previous Work

Recent work has approached zero-shot error detection by modifying and analyzing bidirectional LSTM taggers, which have been shown to work comparatively well on the task in the fully supervised setting. The work of Rei and Søgaard (2018b) adds a soft-attention mechanism to a bidirectional LSTM tagger, training with additional loss functions to encourage the attention weights to yield more accurate token-level labels, consistent with the token-level human annotations, which are not provided in training and must be learned indirectly from the sentence-level labels. The attention values at each token position are used to generate token-level classifications. We use the label LSTM-ATTN-SW, as in the original work, to refer to this model. Previous work also considered a gradient-based approach to analyze this same model (here, LSTM-ATTN-BP) and the model without the attention mechanism (LSTM-LAST-BP), by fitting a parametric model (a Gaussian) to the distribution of magnitudes of the gradients of the word representations with respect to the sentence-level loss.

Supervised Baselines

For comparison purposes, we include results with recent fully supervised sequence models appearing in the literature. Rei and Yannakoudakis (2016) compares various word-based neural sequence models, finding that a word-based bidirectional LSTM model was the most effective (LSTM-BASE-S*). Rei and Søgaard (2018b) compares against a bidirectional LSTM tagger, with character representations concatenated with word embeddings, trained with a cross-entropy loss against the token-level labels (LSTM-S*). The model of Rei (2017) extends this model with an auxiliary language modeling objective (LSTM-LM-S*). This model is further enhanced with a character-level language modeling objective and supervised attention mechanisms in Rei and Søgaard (2018a) (here, LSTM-JOINT-S*).

For reference, we also provide a Random baseline, which classifies based on a fair coin flip, and a MajorityClass baseline, which in this case always chooses the positive (i.e., with an error) class.
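Because the MajorityClass baseline marks every token as an error, its token-level recall is 100% by construction and its precision equals the proportion of error tokens. As a worked check, using the standard F-measure definition and the MajorityClass token-level precision of 15.20 reported in Table 1:

$$F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R}, \qquad F_1 = \frac{2 \cdot 15.20 \cdot 100}{15.20 + 100} = \frac{3040}{115.2} \approx 26.39$$

This agrees with the token-level F1 listed for MajorityClass in Table 1.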

4 Results

Table 1 contains the main results with the models only given access to sentence-level labels, as well as the fully supervised baseline LSTM-S* from previous work. We see that the proposed CNN model approach has a similar F1 score to the previously proposed LSTM-ATTN-SW, as a result of strong recall, but weaker precision. With the advantage of the pre-trained contextualized embeddings from BERT, the CNN+BERT model actually has a competitive F1 score with the model given full token-level supervision for training and signal from additional unlabeled data via pre-trained word embeddings.

The CNN+BERT model is significantly more effective than the baseline CNN model. The CNN model also utilizes signal from billions of tokens of unlabeled data via pre-trained embeddings, but the CNN+BERT has the advantage of the pre-trained contextualized embeddings (and the Transformer architecture itself) and a larger vocabulary due to the WordPiece embeddings. The Transformer architecture potentially allows the CNN+BERT model to consider longer-distance dependencies than the local windows of the standard CNN.

It is important to highlight that the zero-shot sequence labeling scores of the non-BERT models are only a modest delta from the majority class baseline, which highlights the challenging nature of the task. Furthermore, even with relatively strong sentence-level classifiers, the scores of the back-propagation-based approaches of LSTM-ATTN-BP and LSTM-LAST-BP reported in previous work actually fall below random baselines.

In Table 1, we see that the F1 score of the CNN+BERT model can be competitive with the fully supervised model LSTM-S*. This is the true zero-shot sequence-labeling setting, in which the model is not given access to token-level labels. For further reference, Table 2 compares with recent fully supervised approaches based on F0.5. For illustrative purposes, CNN+BERT+1k is given access to 1,000 token-labeled sentences to tune a single parameter, an offset on the decision boundary. In doing so, we see that the model becomes competitive with the fully supervised LSTM-BASE-S* model in terms of F0.5, which is the common metric for grammatical error detection, used under the assumption that end users prefer higher precision systems. The LSTM-BASE-S* model was the state-of-the-art model on the task as recently as 2016, although we also see in Table 2 that significant improvements in F0.5 have been achieved with the more recent supervised models, which are clearly superior to the CNN+BERT+1k model in terms of precision.
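Tuning a single offset for F0.5 can be sketched as a small grid search. The text does not specify how the offset enters the decision rule, so the form used here (a token is positive when its positive score exceeds its negative score plus the offset) is an assumption, as are the function names, the candidate grid, and the toy scores.

```python
def f_beta(precision, recall, beta):
    """Standard F-measure; beta < 1 weights precision more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def precision_recall(gold, pred):
    """Token-level precision and recall for binary labels (1 = error)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    n_pred, n_gold = sum(pred), sum(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall

def tune_offset(pos, neg, gold, candidates, beta=0.5):
    """Pick the offset that maximizes F_beta on held-out token labels."""
    best_offset, best_score = None, -1.0
    for offset in candidates:
        # Assumed decision rule: shift the boundary by a constant offset.
        pred = [1 if p > n + offset else 0 for p, n in zip(pos, neg)]
        score = f_beta(*precision_recall(gold, pred), beta)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

# Toy contribution scores for five tokens and their gold labels.
pos = [0.2, 1.0, 3.0, 0.5, 2.5]
neg = [0.0, 0.0, 0.0, 0.0, 0.0]
gold = [0, 0, 1, 0, 1]
best_offset, best_score = tune_offset(pos, neg, gold, [0.0, 0.75, 2.0, 2.75])
```

Raising the offset trades recall for precision, which is why a recall-heavy zero-shot labeler can gain on F0.5 from this one parameter. As a consistency check, `f_beta(47.11, 28.83, 0.5)` evaluates to roughly 41.81, the F0.5 value listed for CNN+BERT+1k in Table 2.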

The CNN+BERT model is a relatively strong sentence-level classifier. For reference, the sentence-level F1 score of 86.35 is similar to the sentence-level F1 score of 86.01 of LSTM-JOINT-S*, despite the latter being a significantly stronger token-level sequence labeler, at least on precision-oriented metrics. It is possible that there is a significant difference between what the sentence-level and token-level models are learning (and may diverge from the human labeled decisions), but the scoring difference may also be due to our simple decomposition approach not accurately representing what the sentence-level model is learning.

| Model | Sent F1 | Token-level P | Token-level R | Token-level F1 |
|---|---|---|---|---|
| LSTM-S* | - | 49.15 | 26.96 | 34.76 |
| Random | 58.30 | 15.30 | 50.07 | 23.44 |
| MajorityClass | 80.88 | 15.20 | 100.00 | 26.39 |
| LSTM-LAST-BP | 85.10 | 29.49 | 16.07 | 20.80 |
| LSTM-ATTN-BP | 85.14 | 27.62 | 17.81 | 21.65 |
| LSTM-ATTN-SW | 85.14 | 28.04 | 29.91 | 28.27 |
| CNN | 84.24 | 20.43 | 50.75 | 29.13 |
| CNN+BERT | 86.35 | 26.76 | 61.82 | 37.36 |

Table 1: FCE test set results. The LSTM model results are as reported in Rei and Søgaard (2018b). With the exception of LSTM-S*, all models only have access to sentence-level labels while training.

| Model | P | R | F0.5 |
|---|---|---|---|
| LSTM-JOINT-S* | 65.53 | 28.61 | 52.07 |
| LSTM-LM-S* | 58.88 | 28.92 | 48.48 |
| LSTM-BASE-S* | 46.1 | 28.5 | 41.1 |
| CNN+BERT+1k | 47.11 | 28.83 | 41.81 |

Table 2: Comparisons with recent state-of-the-art supervised detection models on the FCE test set. Models marked with -S* have access to approximately 28.7k token-level labeled sentences for training, with results as reported in previous work. The CNN+BERT+1k model has access to 28.7k sentence-level labeled sentences for training and 1,000 token-level labeled sentences for tuning.

5 Discussion and Limitations

The effectiveness of the CNN approach is interesting in that it requires no modification to the standard classifier, although the LSTM-ATTN-SW model is preferable when higher precision is needed. Furthermore, the CNN approach has the disadvantage of not providing a straightforward approach for training with partial token-level supervision (without, of course, adding additional layers and parameters), which has been shown to be possible and effective for the attention-based approach in Rei and Søgaard (2018a).

The zero-shot error detection effectiveness of the non-BERT models only trained on the FCE dataset (augmented with signal from unlabeled data via standard pre-trained word embeddings) is likely too low to be useful in practical applications, and token-labeled datasets are available in English. These approaches do not offer any statistical bounds nor guarantees and are not suitable for model explainability nor interpretability. Furthermore, we expect that the fully supervised approaches would also benefit significantly from the signal from BERT. However, the zero-shot sequence labeling effectiveness with the contextualized embeddings does open up the possibility that such an approach may be viable for bootstrapping systems for lower resource languages, which we plan to pursue in future work.

There are a large number of additional approaches for visualizing and analyzing networks, including Transformers themselves, which we have not considered. In this brief note, we focus on a straightforward decomposition of a CNN, demonstrating its similar effectiveness on recall-oriented metrics to previously proposed approaches for zero-shot grammatical error detection, despite its simplicity.

6 Conclusion

We have provided additional baselines for the task of zero-shot grammatical error detection. A CNN can be trained as a sentence classifier on the standard public FCE dataset, without access to token-level labels, and then token-level labels can be derived in a straightforward manner, yielding a weak sequence labeler with F1 scores comparable to previously proposed approaches, but with weaker precision. Additionally, pre-trained contextualized embeddings can be used as input features for a CNN, trained as a sentence classifier, to produce a sequence labeler that is competitive with basic (but no longer state-of-the-art) fully supervised bi-LSTM baselines (using standard pre-trained word embeddings) in terms of F1 scores, and with a small amount of token-level labeled data for tuning, F0.5 scores.