Data Annealing for Informal Language Understanding Tasks

04/24/2020 ∙ by Jing Gu, et al. ∙ University of California, Davis

There is a large performance gap between formal and informal language understanding tasks. Recent pre-trained models that improved performance on formal language understanding tasks have not achieved comparable results on informal language. We propose a data annealing transfer learning procedure to bridge this performance gap on informal natural language understanding tasks, allowing a pre-trained model such as BERT to be used effectively on informal language. In our data annealing procedure, the training set contains mainly formal text data at first; the proportion of informal text data is then gradually increased during training. Our data annealing procedure is model-independent and can be applied to various tasks. We validate its effectiveness in exhaustive experiments. When BERT is trained with our learning procedure, it outperforms all state-of-the-art models on three common informal language tasks.




1 Introduction and Related Work

Because of the noisy nature of informal language and the shortage of labelled data, progress on informal language has not been as promising as on formal language. Many tasks on formal data achieve high performance thanks to deep neural models (Lee et al., 2018; Peters et al., 2018; Devlin et al., 2018). However, these state-of-the-art models' excellent performance usually does not transfer directly to informal data. For example, when a BERT model is fine-tuned on informal data, its performance is less encouraging than on formal data. This is caused by the domain discrepancy between the pre-training corpus used by BERT and the target data.

To address these issues, we propose a model-agnostic data annealing procedure. The core idea of data annealing is to give the model more freedom to explore its update direction at the beginning of training. More specifically, we treat informal data as target data and formal data as source data. The training data at first contains mainly source data, so the data annealing procedure benefits from a good parameter initialization thanks to the clean nature of formal data. The proportion of source data then keeps decreasing while the proportion of target data keeps increasing. Recent work has validated the effect of changing the data composition during training. Curriculum learning suggests that a proper ordering of training data improves performance and speeds up training in a single-dataset setting (Bengio et al., 2009). Researchers have also demonstrated the effect of data selection in domain adaptation (Ruder and Plank, 2017; Ruder et al., 2017; van der Wees et al., 2017).

The philosophy behind data annealing is shared with other commonly used annealing techniques. One popular use of annealing is in setting the learning rate of a neural model: a gradually decayed learning rate gives the model more freedom to explore at the beginning and leads to better performance (Zeiler, 2012; Yang and Zhang, 2018; Devlin et al., 2018). Another popular application of annealing is simulated annealing (Bertsimas and Tsitsiklis, 1993), which reduces the probability of a model converging to a bad local optimum by introducing random noise into the training process. Data annealing has similar functionality to simulated annealing, but replaces the random noise with source data. The model is thus not only able to explore more of the parameter space at the beginning of training, but is also guided by the knowledge learned from the source domain.

Current state-of-the-art models on informal language tasks are usually designed for a specific task and cannot generalize to different tasks (Kshirsagar et al., 2018; Gui et al., 2018). Data annealing is model-independent and can therefore be employed on different informal language tasks. We validate our learning procedure with two popular neural network models in NLP, LSTM and BERT, on three popular natural language understanding tasks: named entity recognition (NER), part-of-speech (POS) tagging, and chunking on Twitter. When BERT is fine-tuned with our data annealing procedure, it outperforms all three state-of-the-art models with the same structure, setting new state-of-the-art results for the three informal language understanding tasks. Experiments also validate the effectiveness of data annealing when training resources in the target data are limited.

2 Data Annealing

A pre-trained model like BERT should avoid over-training when applied to a downstream task (Peters et al., 2019; Sun et al., 2019). It is not ideal to feed in too much source data, as doing so not only prolongs training time but also confuses the model. We therefore propose data annealing, a transfer learning procedure that gradually decreases the ratio of formal source data to informal target data during training, to solve the overfitting and noisy-initialization problems.

At the beginning of training, most training samples are source data, so the model obtains a good initialization from the abundant clean source data. We then gradually increase the proportion of target data and reduce the proportion of source data, letting the model explore a larger parameter space. In addition, the labelled source dataset acts as an auxiliary task. At the end of training, most of the training data is target data, so the model can focus on the target information.

We reduce the source data proportion with an exponential decay function. Let $\lambda_0$ be the initial proportion of source data in the training set, $t$ the current training step, and $T$ the total number of batches. $\alpha$ is the exponential decay rate of the source proportion, and $\lambda_s(t)$ and $\lambda_t(t)$ are the proportions of source and target data at time step $t$:

$$\lambda_s(t) = \lambda_0 \, \alpha^t, \qquad \lambda_t(t) = 1 - \lambda_s(t)$$

Let $N_s$ be the accumulated amount of source data used to train the model, and let $B$ be the batch size. We have

$$N_s = \sum_{t=0}^{T-1} B \, \lambda_s(t) = B \, \lambda_0 \, \frac{1 - \alpha^T}{1 - \alpha}$$

After the model has been updated for adequately many batches, $\alpha^T$ vanishes and we can approximate $N_s$ by

$$N_s \approx \frac{B \, \lambda_0}{1 - \alpha}$$

We can empirically estimate a proper $\alpha$ based on the relation between the source and target datasets. For example, the higher the similarity between the source and target data, the larger $\alpha$ should be, because the more similar the source data is to the target data, the more knowledge the target task can borrow from the source task. If researchers want to simplify hyperparameter tuning or constrain the influence of the source data, $\alpha$ can instead be set from a source-data budget $N_s$:

$$\alpha = 1 - \frac{B \, \lambda_0}{N_s}$$
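The schedule above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' released code; the function names are our own, and the parameters (`lam0`, `alpha`, `batch_size`) follow the quantities defined in this section.

```python
import random

def source_proportion(t, lam0=0.9, alpha=0.95):
    """Proportion of formal source data at training step t:
    lambda_s(t) = lam0 * alpha**t (exponential decay)."""
    return lam0 * alpha ** t

def sample_batch(source_pool, target_pool, batch_size, t,
                 lam0=0.9, alpha=0.95):
    """Draw one mixed batch whose source fraction follows the schedule."""
    n_src = min(round(batch_size * source_proportion(t, lam0, alpha)),
                batch_size)
    batch = random.sample(source_pool, n_src) + \
            random.sample(target_pool, batch_size - n_src)
    random.shuffle(batch)
    return batch

def total_source_examples(batch_size, lam0=0.9, alpha=0.95):
    """Closed-form estimate of accumulated source examples over a long run:
    N_s = sum_t B * lam0 * alpha**t ~= B * lam0 / (1 - alpha)."""
    return batch_size * lam0 / (1 - alpha)
```

With `lam0 = 0.9` and `alpha = 0.95`, early batches are dominated by source data and the mix shifts toward target data as `t` grows, while `total_source_examples` bounds how much source data the model ever sees.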
3 Experimental Design

We validate data annealing with two popular models, LSTM and BERT, on three tasks: named entity recognition (NER), part-of-speech (POS) tagging, and chunking. These tasks perform much worse on informal text (such as tweets) than on formal text (such as news).

3.1 Datasets

We use OntoNotes-nw as the source dataset and the RitterNER dataset as the target dataset for the NER task. We use the Penn Treebank (PTB) POS tagging dataset as the source dataset and RitterPOS as the target dataset for the POS tagging task. For the chunking task, we use CoNLL 2000 as the source dataset and RitterCHUNK as the target dataset. Please refer to Appendix B for more details about the datasets.

3.2 Model Setting

We implement BERT and LSTM to validate the effect of data annealing on all three tasks.

BERT. We implement both the BERTBASE and BERTLARGE models, with a CRF classifier on top of the BERT structure. In some tasks, the source and target datasets do not share the same label set, so we use two separate CRF classifiers for the source and target tasks.

LSTM. We use character and word embeddings as input features, following previous work (Yang and Zhang, 2018; Yang et al., 2017), and a one-layer bidirectional LSTM to process the input features. For the same reason as in the BERT implementation, we use two separate CRF classifiers on top of the LSTM structure.
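The shared-encoder, two-head design used for both BERT and LSTM can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class and parameter names are our own, and the encoder and CRF heads are stood in by arbitrary callables.

```python
class MultiHeadTagger:
    """Shared encoder with one classifier head per task, used when the
    source and target label sets differ."""

    def __init__(self, encoder, heads):
        self.encoder = encoder  # shared feature extractor (callable)
        self.heads = heads      # dict: task name -> classifier (callable)

    def forward(self, tokens, task):
        features = self.encoder(tokens)    # one feature vector per token
        return self.heads[task](features)  # task-specific label sequence
```

Every batch updates the shared encoder, but only the head matching the batch's task, so source supervision shapes the encoder without forcing the two label sets into one classifier.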

We compare data annealing with two popular transfer learning paradigms, parameter initialization (INIT) and multi-task learning (MULT) (Weiss et al., 2016; Mou et al., 2016). We now introduce the training strategies used in the experiments.

Data annealing. In all data annealing experiments, the initial source data ratio and the decay rate are tuned in the range (0.9, 0.99). When training the BERT model, we also calculate the estimated total number of source-data batches fed into the model using the approximation in Section 2. By limiting the accumulated source data, the model is less likely to suffer from the catastrophic forgetting mentioned in Section 2.
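The budget-based choice of the decay rate can be computed directly, assuming the accumulated source data follows the long-run approximation N_s ≈ B·λ₀/(1−α) from Section 2. The helper below is a hypothetical illustration of that inversion, not the authors' code.

```python
def alpha_for_budget(source_budget, batch_size, lam0):
    """Pick the decay rate alpha that caps accumulated source data at
    source_budget, by inverting N_s ~= batch_size * lam0 / (1 - alpha)."""
    return 1.0 - batch_size * lam0 / source_budget
```

A larger budget yields an alpha closer to 1 (slower decay, more source exposure); a tighter budget forces faster decay toward target-only batches.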

MULT. We implement MULT on both the LSTM-CRF and BERT-CRF structures. In all MULT experiments, following Yang et al. (2017) and Collobert and Weston (2008), we tune the ratio of source data in the range (0.1, 0.9).

INIT. We implement INIT on the BERT-CRF structure. In all INIT experiments, we train three times on the source data and pick the model that achieves the highest performance, then continue fine-tuning that model on the target task.

Figure 1: Performance on the named entity recognition task. DA BERTLARGE indicates Vanilla BERTLARGE fine-tuned with data annealing.

4 Experiment Results

The results on the three tasks are shown in Table 1. Vanilla means the model is trained without transfer learning, i.e., without using the source data. DA means the model is trained with the data annealing procedure. All numbers in the table are averages over three runs. It is worth noting that the state-of-the-art results on these three tasks are achieved by different models, while our proposed data annealing algorithm is applied to the same simple BERT or LSTM structure without task-specific modification.

model                    NER P   NER R   NER F1   POS Acc   Chk P   Chk R   Chk F1
Vanilla LSTM             75.55   55.75   64.05    88.65     83.76   83.78   83.77
MULT LSTM                74.51   58.48   65.49    88.81     83.92   84.48   84.20
DA LSTM                  75.51   61.01   67.45    89.16     83.81   85.37   84.58
Vanilla BERTBASE         68.73   62.74   65.58    91.05     85.05   85.96   85.50
INIT BERTBASE            69.28   63.74   66.40    90.85     85.48   86.77   86.13
MULT BERTBASE            70.42   62.38   66.12    91.39     86.01   87.75   86.87
DA BERTBASE              71.09   63.74   67.21    91.55     86.16   87.91   87.03
Vanilla BERTLARGE        68.41   67.45   67.88    91.88     85.55   86.78   86.16
INIT BERTLARGE           68.85   69.20   68.99    92.04     86.42   87.59   87.00
MULT BERTLARGE           70.05   66.08   68.00    92.06     86.29   87.21   86.54
DA BERTLARGE             70.61   68.81   69.69    92.54     86.71   88.15   87.53
*Over state-of-the-art   -5.51   +9.71   +3.16    +1.37     +2.24   +3.61   +3.03
**State-of-the-art       76.12   59.10   66.53    91.17     84.47   84.54   84.50
Table 1: Results on the NER, POS tagging, and chunking tasks (precision, recall, and F1 for NER and chunking; accuracy for POS tagging). * is the difference between DA BERTLARGE and the state-of-the-art results. ** The state-of-the-art results for the three tasks are achieved by different models: Yang et al. (2019), Gui et al. (2018), and Yang et al. (2017) proposed the state-of-the-art models on NER, POS tagging, and chunking respectively.

Named Entity Recognition (NER). Our annealing procedure outperforms the other transfer learning procedures in terms of F1, meaning data annealing is especially effective at striking a balance between precision and recall when extracting named entities from informal text. A sentence usually contains more non-entity words than entity words, so if the model is unsure whether a word is an entity, it is likely to predict it as a non-entity in order to reduce the training loss. The state-of-the-art models easily achieve the highest precision, but their recall is lower. This indicates that the state-of-the-art methods achieve high performance by predicting fewer entities, while the BERT models achieve high performance by both covering more entities and predicting them correctly.
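The precision/recall trade-off described above can be made concrete with a toy calculation (the counts below are invented for illustration, not taken from the experiments):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from entity-level counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A conservative tagger predicts few entities: high precision, low recall.
conservative = precision_recall_f1(tp=50, fp=5, fn=50)   # P ~ 0.91, R = 0.50
# A bolder tagger covers more entities: balanced precision and recall.
balanced = precision_recall_f1(tp=80, fp=20, fn=20)      # P = R = 0.80
```

Despite its higher precision, the conservative tagger ends up with the lower F1, which is exactly the pattern separating the state-of-the-art models from the BERT models in Table 1.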

Part-of-speech Tagging (POS tagging). LSTM achieves higher accuracy under our data annealing procedure than under the other two transfer learning procedures. Both BERTBASE and BERTLARGE under data annealing also outperform the other transfer learning procedures. The improvement over the state-of-the-art method is 1.37 points in accuracy.

Chunking. When LSTM, BERTBASE, and BERTLARGE are trained under our data annealing procedure, they achieve better performance than under the other transfer learning paradigms. Our best model outperforms the state-of-the-art model by 3.03 points in F1.

The Dataset Size Influence. To further evaluate our method's performance with limited labelled data, we randomly sample 10%, 20%, and 50% of the RitterNER training set, and compare our proposed DA BERTLARGE with the INIT BERTLARGE and Vanilla BERTLARGE baselines. The results in Figure 1 show that our model remains better than INIT BERTLARGE under limited-resource conditions and achieves a significant improvement over the Vanilla BERTLARGE baseline.

5 Error Analysis

We performed error analysis on the NER task on the RitterNER dataset. We first calculated the F1 score for each of the ten predefined entity types. Compared with Vanilla BERTLARGE and INIT BERTLARGE, DA BERTLARGE achieves a higher F1 score on two frequent entity types, "PERSON" and "OTHER". "PERSON" is a frequent concept in formal data, which shows that our method learns to utilize formal-data knowledge to improve "PERSON" detection. "OTHER" denotes entities that do not belong to the ten predefined entity types; higher performance on "OTHER" suggests that DA BERTLARGE has a better understanding of the general concept of an entity. INIT BERTLARGE achieves a higher F1 score on another frequent entity type, "GEO-LOC", showing the effectiveness of traditional transfer learning methods. We did not find clear differences on the other entity types.

We also found that if a word belongs to a rarely seen entity type, all three models are less likely to predict its entity type correctly. We suspect that if a word belongs to a frequently seen entity type, then even if the exact word never appeared in the training set, another word with a similar representation is likely to be in the training data, so the model can predict a word by learning from similar words of the same type. We plan to assign a larger penalty to infrequent entity types to tackle this issue in future work.

We noticed that the improvement in recently reported literature on these tasks is usually less than 0.5 absolute points in F1. Considering the noisy nature of informal text data, we suspect the models are close to their theoretical maximum performance. To probe this, we randomly sampled 30 sentences and found that a fairly large proportion are too noisy to predict correctly; please refer to Appendix A for some examples. This suggests that transfer learning has limited effect when the dataset is highly noisy. A denoising technique could be useful in this scenario, and a pre-trained model trained on noisy text could be another possible solution.

6 Conclusion

In this paper, we propose data annealing, a model-independent transfer learning procedure for informal language understanding tasks. It is applicable to various models such as LSTM and BERT, and exhaustive experiments show it is a good approach to transferring knowledge from formal to informal data. When data annealing is applied with BERT, it outperforms different state-of-the-art models on different informal language understanding tasks. Since large pre-trained models are now widely used, data annealing could also serve as a good fine-tuning method. It remains effective when labelled resources are limited.


  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, New York, NY, USA, pp. 41–48. External Links: ISBN 978-1-60558-516-1, Link, Document Cited by: §1.
  • D. Bertsimas and J. Tsitsiklis (1993) Simulated annealing. Statist. Sci. 8 (1), pp. 10–15. External Links: Document, Link Cited by: §1.
  • R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, New York, NY, USA, pp. 160–167. External Links: ISBN 978-1-60558-205-4, Link, Document Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §1, §1.
  • T. Gui, Q. Zhang, J. Gong, M. Peng, D. Liang, K. Ding, and X. Huang (2018) Transferring from formal newswire domain with hypernet for twitter POS tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2540–2549. External Links: Link, Document Cited by: §1, Table 1.
  • R. Kshirsagar, T. Cukuvac, K. R. McKeown, and S. McGregor (2018) Predictive embeddings for hate speech detection on twitter. CoRR abs/1809.10644. External Links: Link, 1809.10644 Cited by: §1.
  • J. Y. Lee, F. Dernoncourt, and P. Szolovits (2018) Transfer learning for named-entity recognition with neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. External Links: Link Cited by: §1.
  • L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin (2016) How transferable are neural networks in NLP applications?. CoRR abs/1603.06111. External Links: Link, 1603.06111 Cited by: §3.2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. CoRR abs/1802.05365. External Links: Link, 1802.05365 Cited by: §1.
  • M. E. Peters, S. Ruder, and N. A. Smith (2019) To tune or not to tune? adapting pretrained representations to diverse tasks. CoRR abs/1903.05987. External Links: Link, 1903.05987 Cited by: §2.
  • S. Ruder, P. Ghaffari, and J. G. Breslin (2017) Data selection strategies for multi-domain sentiment analysis. CoRR abs/1702.02426. External Links: Link, 1702.02426 Cited by: §1.
  • S. Ruder and B. Plank (2017) Learning to select data for transfer learning with bayesian optimization. CoRR abs/1707.05246. External Links: Link, 1707.05246 Cited by: §1.
  • C. Sun, X. Qiu, Y. Xu, and X. Huang (2019) How to fine-tune BERT for text classification?. CoRR abs/1905.05583. External Links: Link, 1905.05583 Cited by: §2.
  • M. van der Wees, A. Bisazza, and C. Monz (2017) Dynamic data selection for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1400–1410. External Links: Link, Document Cited by: §1.
  • K. Weiss, T. M. Khoshgoftaar, and D. Wang (2016) A survey of transfer learning. Journal of Big Data 3 (1), pp. 9. External Links: ISSN 2196-1115, Document, Link Cited by: §3.2.
  • J. Yang and Y. Zhang (2018) NCRF++: an open-source neural sequence labeling toolkit. CoRR abs/1806.05626. External Links: Link, 1806.05626 Cited by: §1, §3.2.
  • W. Yang, W. Lu, and V. W. Zheng (2019) A simple regularization-based algorithm for learning cross-domain word embeddings. CoRR abs/1902.00184. External Links: Link, 1902.00184 Cited by: Table 1.
  • Z. Yang, R. Salakhutdinov, and W. W. Cohen (2017) Transfer learning for sequence tagging with hierarchical recurrent networks. CoRR abs/1703.06345. External Links: Link, 1703.06345 Cited by: §3.2, §3.2, Table 1.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701. External Links: Link, 1212.5701 Cited by: §1.

Appendix A Mispredicted Sentences Examples on Named Entity Recognition Task

1 Is making me purchase windows{NO_ENTITY, B-PRODUCT} , antivirus and office{NO_ENTITY, B-PRODUCT}
2 ellwood{NO_ENTITY, B-PERSON} ’s sushi , a glass of pinot , " strokes{NO_ENTITY, B-OTHER} of{NO_ENTITY, I-OTHER} genius{NO_ENTITY, I-OTHER} " by john wertheim{NO_ENTITY, I-PERSON} , play at barksdale{NO_ENTITY, B-FACILITY} in a bit , lovely friday night :)
3 lalala{B-GEO-LOC, NO_ENTITY} south{B-GEO-LOC, NO_ENTITY} game tonight !!!! Go us . RT BunBTrillOG : Okay #teamtrill time to show them our power ! #BunB106andPark needs to trend now ! RT til it hurts ! I got ya twitter{NO_ENTITY, B-COMPANY} jail …
4 Chicago Weekend Events : Lebowski{NO_ENTITY, B-OTHER} Fest{NO_ENTITY, I-OTHER} , Dave{NO_ENTITY, B-PERSON} Matthews{NO_ENTITY, I-PERSON} , Latin Music And More : The lively weekend ( well , Friday throu …
5 RT @DonnieWahlberg : Soldiers … Familia … BH’s…{B-PERSON, NO_ENTITY} NK Fam … Homies … Etc . Etc . Etc …. I ’m gonna need some company next Friday in NYC …
6 tell ur dad2bring the ypp back in Hayes{B-GEO-LOC, NO_ENTITY} we sorted it out last time I’m like yea I’ll tell him *covers eyes*wat informing am I doing #llowit
7 #aberdeen RT flook_firehose2010Polar Bear Sun , 17 Oct 2010 at 10:28 am The Tunnels Carnegies{B-GEO-LOC, NO_ENTITY} Brae Aberdeen{B-GEO-LOC, NO_ENTITY} Un …
8 < 3 it RT Djcheapshot : Tonite I m DJing at Mai{NO_ENTITY, B-FACILITY} Tai{NO_ENTITY, I-FACILITY} in Long Beach{B-GEO-LOC, I-GEO-LOC} . I’m considering wearing MY TIE !! Get it ? My tie = Mai Tai ? No ? Sorry . Bye .
9 " I gotta admit , Alex{NO_ENTITY, B-PERSON} sounds hot when he talks in spanish during the ’ Alejandro{NO_ENTITY, B-OTHER} ’ Cover " -via someone ’s tumblr{NO_ENTITY, B-COMPANY} I’m pleased to have introduced TheSmokingGunn to twitter{NO_ENTITY, B-COMPANY} . May he become as inane as me .
10 Before I proceed into the paradise , let ’s not forget the Princess{NO_ENTITY, B-MOVIE} Lover{NO_ENTITY, I-MOVIE} OVA{NO_ENTITY, I-MOVIE} 1{NO_ENTITY, I-MOVIE} teaser pic , SFW{B-GEO-LOC, NO_ENTITY}
Table 2: Ten examples of mispredicted sentences. Braces show {predicted label, gold label}.

Appendix B Dataset Statistic

Task Type Category Dataset Train Tokens Dev Tokens Test Tokens
NER Formal OntoNotes-nw 848,220 144,319 49,235
Informal RitterNER 37,098 4,461 4,730
POS Tagging Formal PTB 2003 912,344 131,768 129,654
Informal RitterPOS 10,857 2,242 2,291
Chunking Formal CoNLL 2000 211,727 - 47,377
Informal RitterCHUNK 10,610 2,309 2,292
Table 3: Dataset statistics.