CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model

03/03/2020 · by Liang Xu, et al.

In this paper, we introduce the Chinese corpus from the CLUE organization, CLUECorpus2020, a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl. To better understand this corpus, we conduct language understanding experiments at both small and large scale, and the results show that models trained on this corpus achieve excellent performance on Chinese. We release a new Chinese vocabulary with a size of 8K, only one-third the size of the vocabulary used in the Chinese BERT released by Google. It saves computational cost and memory while working as well as the original vocabulary. We also release both large and tiny versions of models pre-trained on this corpus. The former achieves state-of-the-art results, and the latter retains most of the accuracy while being about eight times faster for training and prediction than BERT-base. To facilitate future work on self-supervised learning for Chinese, we release our dataset, new vocabulary, code, and pre-trained models on GitHub.


1 Introduction

Transfer learning in natural language processing (NLP), which first pre-trains a large model on raw text and then fine-tunes it on downstream tasks, has become the mainstream paradigm. It leverages large-scale raw text, which is abundant on the internet, and achieves excellent performance. For example, T5 (Raffel et al., 2019) treats all NLP problems as "text-to-text" problems, is trained on the Colossal Clean Crawled Corpus (C4) with 750 GB of raw text, and achieves state-of-the-art performance on GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016).

Behind the recent rapid development of NLP, new and better models keep becoming available, and large-scale raw corpora become more and more critical. Several large-scale pre-training datasets are publicly available in English. However, there is still a lack of open-source, large-scale Chinese corpora that can serve for pre-training language models. Therefore, we release CLUECorpus2020. For convenience of reference, we also name it C5, which stands for Colossal Clean Crawled Corpus for Chinese.

It contains 100 GB of Chinese raw text retrieved from Common Crawl. It is a well-defined dataset that can be used directly for pre-training without additional pre-processing. The training set of CLUECorpus2020 consists of around 29k separate files, each following the pre-training format; the development and test sets consist of a smaller number of files in the same format. Several experiments have been conducted on this dataset to test its quality.

To summarize, this paper makes the following contributions:

  • A large-scale Chinese raw corpus that can be used for pre-training, language generation, or learning word representations such as word embeddings.

  • Through our experiments, we show that a model trained on a small percentage of our corpus can achieve better performance than the same model trained on Chinese Wikipedia, which indicates the excellent quality and great potential of the dataset. With the whole dataset, we are able to match the state-of-the-art result on Chinese.

  • A compact vocabulary (vocab_clue) with only 8k tokens, one-third the size of the vocabulary of Chinese BERT (vocab_bert), that can be used for Chinese NLP tasks. Models trained with vocab_clue and vocab_bert achieve comparable performance, while our vocabulary is smaller, better suited to Chinese, and faster for training machine learning models.

  • We also release large and tiny versions of our models pre-trained on this dataset. The large version achieves state-of-the-art performance, while the tiny version can be used to accelerate experiments and real applications.

2 Related work

For English, a large number of open-source unlabeled corpora are available. For example: 1) Toronto Books Corpus (Zhu et al., 2015), a 4 GB dataset containing text extracted from eBooks, which represents a different domain of natural language. 2) WebText-like (Radford et al., 2019), a 17 GB WebText-like English dataset that only uses content from web pages that were submitted to the content aggregation website Reddit and received a "score" of at least 3. 3) English Wikipedia, a 16 GB collection of English Wikipedia text consisting of millions of collaboratively written encyclopedia articles, available in TensorFlow Datasets (https://www.tensorflow.org/datasets/catalog/wikipedia/), which omits all markup and reference sections from the articles. 4) The C4 dataset (Raffel et al., 2019), a 750 GB dataset of clean English text scraped from the web.

For Chinese, however, similar corpus collections are still relatively rare and small. For example: 1) THUCTC (Sun et al., 2016), a 2.19 GB dataset containing 740,000 news documents. 2) Chinese Wikipedia (https://github.com/brightmart/nlp_chinese_corpus/), a 1.1 GB dataset of Chinese Wikipedia text. In short, existing Chinese datasets are relatively small. In this paper, to address the lack of a large-scale unlabeled corpus in Chinese, we leverage Common Crawl, which is crawled from the whole internet, and pre-process the data in detail. As a result, we provide a larger, higher-quality, all-inclusive corpus.

3 Dataset Description

Dataset Tokens (B) Sentences (M) Size (GB)
Train 34.7 106 99.0
Dev 0.18 3.9 0.5
Test 0.18 3.9 0.5
Table 1: Statistics of CLUECorpus2020. "B": billion; "M": million. The dev and test sets were drawn from the same distribution as the training set and can be used to check the generalization ability of a model, e.g., the masked language model (LM) accuracy of a model during training, or the perplexity of a language model after training.

Before the release of this corpus, there were few large-scale, high-quality Chinese datasets designed for pre-training language models. This corpus is around 100 GB and comes from many different websites (see http://commoncrawl.org/the-data/get-started/). We split the data randomly into training, development, and test sets with a ratio of 99:0.5:0.5 (by size in GB). As we can see from samples, it covers all sorts of topics, such as news, entertainment, sports, health, international affairs, movies, celebrities, and so on. We organize the files of our dataset in the pre-training format: one sentence per line, with an empty line at the end of each document. The overall statistics of this corpus are given in Table 1. We elaborate on the data construction process in the next section.
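
For illustration, a file in this pre-training format looks as follows; the sentences are invented placeholders, not taken from the corpus:

```
这是第一个文档的第一句话。
这是第一个文档的第二句话。

这是第二个文档的第一句话。
```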

4 Dataset Construction

Large-scale unlabeled datasets for unsupervised learning play an increasingly important role in Chinese NLP. We believe that higher-quality data will have a greater impact on Chinese NLP tasks, so in this paper we select and provide a high-quality unlabeled dataset. To generate a dataset that satisfies our requirements, we leverage Common Crawl as a source of text scraped from the web.

Common Crawl is an organization that crawls the web and freely provides its archives and datasets to the public. It usually crawls internet web content once a month, and its web archives consist of petabytes of data collected since 2011. First, we extract text content from the scraped HTML files according to detailed rules. Unfortunately, the majority of the extracted text contains gibberish, such as dirty text or source code, that we consider useless for Chinese NLP tasks. Furthermore, the scraped text includes a lot of duplicate content. To solve these problems, we apply further filtering and extraction using the following heuristic rules, which add special treatment for Chinese on top of the filtering method of C4 (a code sketch of these filters follows the list):

  • Since we focus on Chinese tasks, we select sentences whose language type is Chinese, if a language is mentioned.

  • To avoid incomplete sentences, we remove characters from the end of the text until we find a Chinese terminal punctuation mark (i.e., a period, question mark, or closing double quotation mark).

  • Since sentences that contain words from the "List of Dirty, Naughty, Obscene or Other Bad Words" (https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/) have a bad effect on building a healthy and civilized internet environment, we remove all sentences that contain them.

  • Warnings stating that JavaScript should be enabled are unlikely to be helpful for NLP tasks, so we remove any line containing the word Javascript or JavaScript.

  • To deduplicate the dataset, we discard all but one of any four-sentence span occurring more than once in the dataset.

  • We replace consecutive blank characters (i.e., tabs, spaces, invisible characters, etc.), which are generally meaningless in sentences, with a single space.

  • Since the curly bracket "{" appears in many programming languages (such as JavaScript, widely used on the web) but not in natural text, we remove any sentence that contains a curly bracket.

  • To generate data in the pre-training format, we use pyltp (Che et al., 2010) to split the text content into sentences, one complete sentence per line.

  • Since sentences that are too short may be problematic or incomplete and are not suitable for language model training, we only retain sentences longer than 5 characters.
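
The sketch below is a minimal Python illustration of these filters, assuming documents that have already been identified as Chinese. It is not the exact pipeline used to build the corpus: a regular expression stands in for pyltp during sentence splitting, and the bad-word list is left as a placeholder.

```python
import re

# Placeholder for the LDNOOBW Chinese bad-word list referenced above.
BAD_WORDS = set()
TERMINALS = "。？”"   # Chinese period, question mark, closing double quote
MIN_LEN = 5           # only sentences longer than 5 characters are kept

def clean_document(text):
    """Apply the heuristic filters above to one (already Chinese) document."""
    # Collapse consecutive blank characters (tabs, spaces, etc.) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop trailing characters until a Chinese terminal punctuation mark is found.
    while text and text[-1] not in TERMINALS:
        text = text[:-1]
    # Split into sentences; the paper uses pyltp here, a regex stands in for it.
    sentences = [s.strip() for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    kept = []
    for s in sentences:
        if "{" in s:                            # looks like source code
            continue
        if "javascript" in s.lower():           # "enable JavaScript" warnings
            continue
        if any(w in s for w in BAD_WORDS):      # dirty / obscene words
            continue
        if len(s) <= MIN_LEN:                   # too short to be useful
            continue
        kept.append(s)
    return kept

def deduplicate(documents, span=4):
    """Drop sentences inside any four-sentence span that was seen before."""
    seen, result = set(), []
    for sents in documents:
        keep = [True] * len(sents)
        for i in range(len(sents) - span + 1):
            key = tuple(sents[i:i + span])
            if key in seen:
                keep[i:i + span] = [False] * span
            else:
                seen.add(key)
        result.append([s for s, k in zip(sents, keep) if k])
    return result
```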

We download the corpus crawled from July to December 2019 from Common Crawl. After applying the aforementioned filtering, we extract a corpus of about 100 GB, which is much larger than most previous datasets used for Chinese pre-training and consists of clean, natural Chinese text.

5 Creation of CLUE Vocab

Token Type Google CLUE
Simplified Chinese 11378 5689
Traditional Chinese 3264 0
English 3529 1320
Japanese 573 0
Korean 84 0
Emoji 56 0
Numbers 1179 140
Special Tokens 106 106
Other Tokens 959 766
Total 21128 8021
Table 2: Statistical information of the two versions of the vocabulary. "Special Tokens" include "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>" and 99 unused tokens.

The original BERT model uses character-based tokenization for Chinese, but its vocabulary contains many redundant tokens. We therefore compiled a refined vocabulary through automated scripts and manual review. A detailed comparison of the two versions of the vocabulary is given in Table 2. We remove many unnecessary tokens that are not used in most Chinese NLP tasks, such as traditional Chinese characters, Japanese, Korean, and emoji. For English, we remove most prefix tokens except single characters and retain most suffix tokens, to guarantee that English words can still be tokenized. Similarly, for numeric tokens, we keep only single digits and the commonly used tokens that represent years. In addition, we remove tokens composed of more than two special symbols.

As a result, the vocabulary size is only one-third of the original size of Chinese BERT. We call it “vocab_clue”.
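
As an illustration of this pruning procedure, the sketch below filters an existing BERT vocabulary along the lines described above. It is a simplified approximation, not the script used to build vocab_clue: the `simplified_chars` input is a hypothetical set of simplified Chinese characters, and the actual vocabulary also involved manual review.

```python
import re

def refine_vocab(tokens, simplified_chars):
    """Prune a BERT vocabulary down to a compact Chinese vocabulary.

    `tokens` is the original vocabulary (one token per entry); `simplified_chars`
    is a set of simplified Chinese characters, a hypothetical input that would be
    built from a simplified/traditional conversion table in practice.
    """
    kept = []
    for tok in tokens:
        if re.fullmatch(r"\[(PAD|UNK|CLS|SEP|MASK|unused\d+)\]", tok):
            kept.append(tok)                     # special tokens
        elif len(tok) == 1 and tok in simplified_chars:
            kept.append(tok)                     # simplified Chinese characters
        elif re.fullmatch(r"\d|(19|20)\d\d", tok):
            kept.append(tok)                     # single digits and common years
        elif tok.startswith("##") and tok[2:].isascii() and tok[2:].isalpha():
            kept.append(tok)                     # English suffix word pieces
        elif len(tok) == 1 and tok.isascii() and tok.isalpha():
            kept.append(tok)                     # single English letters
        # Everything else is dropped: traditional Chinese, Japanese, Korean,
        # emoji, multi-character English prefix tokens, and tokens made of
        # more than two special symbols.
    return kept
```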

6 Experiments

Index Model Vocab Data Steps AFQMC TNEWS IFLYTEK CMNLI AVG
1 BERT-base Google Wiki (1 GB) 125K 69.93 54.77 57.54 75.64 64.47
2 BERT-base Google C5 (1 GB) 125K 69.63 55.72 58.87 75.75 64.99
3 BERT-base CLUE C5 (1 GB) 125K 69.00 55.04 59.07 75.84 64.74
4 BERT-base† Google C5 (1 GB) 125K 69.57 55.17 59.69 75.86 65.07
Table 3: Performance of BERT models on the CLUE benchmark (http://www.cluebenchmark.com). BERT-base† stands for BERT-base mm, i.e., BERT-base with the minus and element-wise multiplication attention variant described in Section 6.2. For each experiment, we select the best model on the dev set after training, then submit it to the CLUE benchmark to obtain the score. Index 1 vs. 2 compares different corpora; Index 2 vs. 3 compares the two vocabularies; Index 2 vs. 4 compares attention mechanisms.
Index Model Vocab Data Steps AFQMC TNEWS IFLYTEK CMNLI AVG
5 BERT-base Google C5 (1 GB) 375K 69.85 55.97 59.62 76.41 65.46
6 BERT-base CLUE C5 (1 GB) 375K 69.93 56.38 59.35 76.58 65.56
7 BERT-base Google C5 (3 GB) 375K 70.22 56.41 59.58 76.70 65.73
8 BERT-base CLUE C5 (3 GB) 375K 69.49 55.97 60.12 77.66 65.81
Table 4: The effect of more C5 training data and more steps. With three times the steps (3 × 125k), BERT-base trained on C5 gains 0.47 to 0.82 points compared to the same model trained for 125k steps, across the two vocabularies. With three times the training data (3 × 1 GB), BERT-base gains 0.74 to 1.07 points compared to the same model trained on 1 GB for 125k steps, and 0.25 to 0.27 points compared to the same number of steps on 1 GB. The model trained with the CLUE vocabulary is consistently better than the one trained with the Google vocabulary.
Index Model Vocab Training Data Steps ACC of Masked LM Loss of Masked LM
1 BERT-base Google Wiki(1 GB) 125K 72.24% 1.2321
2 BERT-base Google C5 (1 GB) 125K 77.94% 0.9702
3 BERT-base CLUE C5 (1 GB) 125K 76.47% 1.0691
4 BERT-base mm Google C5 (1 GB) 125K 78.02% 0.9816
Table 5: Training metrics of the BERT models. The masked LM accuracy and loss are measured on the training set.
Task Length Batch Size Learning Rate Epoch Save Steps
AFQMC 128 16 2e-5 3 300
TNEWS 128 16 2e-5 3 300
IFLYTEK 128 32 2e-5 3 300
CMNLI 128 64 3e-5 2 300
Table 6: Hyper-parameters of fine-tuning on CLUE tasks. We keep all hyper-parameters the same throughout all the experiments.
Model Vocabulary Vocabulary Size Parameters Training Device Training Speed
BERT-base google_vocab 21128 102M TPU v3-8 1000 steps / 404 s
BERT-base clue_vocab 8021 (-62.04%) 92M (-9.80%) TPU v3-8 1000 steps / 350 s (+15.43% faster)
RoBERTa-tiny-clue clue_vocab 8021 (-62.04%) 7.5M (-92.6%) TPU v3-8 1000 steps / 50 s (+708.0% faster)
Table 7: Detailed speed comparison of "google_vocab" and "clue_vocab", and of RoBERTa-tiny-clue with BERT-base. The CLUE vocabulary is one-third the size of the Google vocabulary and gives a speedup of more than ten percent; RoBERTa-tiny-clue is 7 to 8 times faster than BERT-base.

6.1 Pre-training with CLUECorpus2020 and Wiki

In this section, we compare our new dataset with Chinese Wikipedia (Wiki) using the same model. We choose BERT (Devlin et al., 2018) as our baseline and pre-train BERT-base on the Wiki data and on C5 data, respectively. Due to limited computing resources, we design this comparison on a small-scale corpus; the performance on the large-scale corpus will be added in the next version. To make a fair comparison, we keep the parameters of both models the same, and the size of the Wiki data and the selected part of C5 are both 1 GB. We release both of these corpora to make our results reproducible. As the sequence length of most classification tasks is less than 128, we set the maximum sequence length for pre-training to 128, which also improves pre-training speed. We pre-train BERT with the masked language model (LM) prediction task and without the next sentence prediction (NSP) task, as the NSP task has been observed to hurt model performance in some recent work, such as RoBERTa (Liu et al., 2019).
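
For concreteness, the sketch below shows how one masked-LM-only training instance at sequence length 128 can be built, with no NSP segment pair. It assumes the standard BERT masking recipe (mask 15% of positions with the usual 80/10/10 replacement rule), which the paper does not restate, and it omits whole-word masking and sequence packing.

```python
import random

MAX_SEQ_LEN = 128   # maximum sequence length used for pre-training
MASK_PROB = 0.15    # standard BERT masking rate (an assumption, not stated in the paper)

def create_mlm_instance(token_ids, vocab_size, cls_id, sep_id, mask_id):
    """Build one masked-LM training instance; no NSP sentence pair is formed."""
    tokens = [cls_id] + token_ids[: MAX_SEQ_LEN - 2] + [sep_id]
    labels = [None] * len(tokens)          # None = position not predicted
    for i in range(1, len(tokens) - 1):    # never mask [CLS] or [SEP]
        if random.random() < MASK_PROB:
            labels[i] = tokens[i]          # predict the original token id
            r = random.random()
            if r < 0.8:
                tokens[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                tokens[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return tokens, labels
```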

The classification track of the CLUE benchmark comprises six tasks meant to test general language understanding ability. We use the following four tasks, which cover different kinds of problems and whose sequence length is 128 or less, to test the performance of our models: AFQMC, TNEWS, IFLYTEK, and CMNLI. During fine-tuning on the CLUE benchmark tasks, we also keep the hyper-parameters the same, as can be seen in Table 6.

As we can see in Table 3, the model pre-trained on C5 scores 0.52 points higher on average than the model pre-trained on Wiki. This suggests that our data quality is similar to or even better than Wiki's. Since we use only about one percent of the whole dataset here, there is great potential for further gains when using the whole dataset.

Index Model Vocab Data AFQMC TNEWS IFLYTEK CMNLI AVG
9 BERT-base (Devlin et al., 2018) Google / 73.70 56.58 60.29 79.69 67.57
10 ALBERT-tiny Google (30GB) 69.92 53.35 48.71 70.61 60.65
11 ELECTRA-joint-generator-tiny (Clark et al., 2019) Google / 69.90 54.63 52.31 73.17 62.50
12 RoBERTa-tiny-clue  (Cui et al., 2019) CLUE C5 (100 GB) 69.52 54.57 57.31 73.1 63.60
Table 8: Performance of tiny versions of pre-trained models. Our tiny model, RoBERTa-tiny-clue, retains most of the accuracy of BERT-base, scoring only about 4 points lower, while performing much better than ALBERT-tiny. All scores were obtained by submitting to the CLUE benchmark.

6.2 Comparison of Attention Mechanisms with C5

The backbone of pre-trained models such as BERT and its variants is the Transformer (Vaswani et al., 2017), and its key component is the self-attention mechanism. With our new dataset, we are able to explore variants of this mechanism. A self-attention module takes in a sequence of inputs and returns a sequence of outputs: the inputs interact with each other ("self") to find out which positions they should pay more attention to ("attention"), and the outputs are aggregates of these interactions weighted by the attention scores (see https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a).

We believe there is room for improvement in the attention mechanism; in particular, the heavily used self-attention may still be too simple to represent the importance of information in the input sequence. We therefore try a variant of the self-attention mechanism, as follows:

  • BERT-base mm

    (Minus and element-wise Multiplication): Given two vectors, we use a simple and inexpensive computation to measure their similarity, namely the element-wise absolute difference and the element-wise multiplication. We then transform the result with a dense layer and add it to the attention score (see the sketch after this list).
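
The numpy sketch below is our minimal reading of this variant; the exact placement of the dense layer, its output dimension, and the scaling in the released code may differ.

```python
import numpy as np

def attention_scores_mm(q, k, w_mm):
    """Self-attention scores with an extra minus / element-wise-multiplication term.

    q, k:  [seq_len, head_dim] query and key vectors of one attention head
    w_mm:  [2 * head_dim, 1] weights of the small dense layer (shape is an assumption)
    """
    head_dim = q.shape[-1]
    scores = q @ k.T / np.sqrt(head_dim)            # standard scaled dot-product scores
    diff = np.abs(q[:, None, :] - k[None, :, :])    # |q_i - k_j|, shape [seq, seq, dim]
    prod = q[:, None, :] * k[None, :, :]            # q_i * k_j,  shape [seq, seq, dim]
    feats = np.concatenate([diff, prod], axis=-1)   # [seq, seq, 2 * dim]
    extra = (feats @ w_mm)[..., 0]                  # dense layer -> [seq, seq]
    return scores + extra                           # added to the attention scores

# Toy usage: 4 positions, head dimension 8.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
w_mm = rng.normal(size=(16, 1)) * 0.01
print(attention_scores_mm(q, k, w_mm).shape)        # (4, 4)
```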

As we can see from Table 3 and Table 5, our variant performs similarly to or slightly better than the baseline. These results indicate that improvements to the attention mechanism may improve performance on downstream tasks; with our new dataset, NLP researchers can explore their own ideas in this area.

6.3 Performance Comparison of Two Vocabularies

In order to verify the rationality of our vocabulary, we train the BERT-base model using both the original vocabulary published by Google and our refined new vocabulary. We use the same model and keep all hyper-parameters the same; the only difference is the vocabulary. Similar to the previous section, we use the same four downstream tasks to evaluate the vocabulary. As can be seen from Table 3, for Index 2 and 3 the performance is similar, with only a 0.25-point difference. We believe that our new vocabulary, vocab_clue, can be used in downstream Chinese NLP tasks in the future, especially in situations with limited resources and computational power.

Table 7 gives a more detailed comparison. The size of clue_vocab is 62.04% smaller than the original vocabulary, which leads to about 9.8% fewer parameters compared to BERT-base, and training is 15.43% faster than with the original BERT-base vocabulary, with both models pre-trained on a TPU v3-8.
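
Most of this saving comes from the token-embedding matrix. A quick back-of-the-envelope check, assuming BERT-base's hidden size of 768 (not stated in the table), roughly reproduces the parameter gap reported in Table 7:

```python
HIDDEN_SIZE = 768        # BERT-base hidden size (assumed, not stated in the paper)
GOOGLE_VOCAB = 21128
CLUE_VOCAB = 8021
TOTAL_PARAMS = 102e6     # BERT-base parameter count from Table 7

saved = (GOOGLE_VOCAB - CLUE_VOCAB) * HIDDEN_SIZE   # embedding rows removed
print(f"embedding parameters saved: {saved / 1e6:.1f}M")          # ~10.1M
print(f"relative to BERT-base:      {saved / TOTAL_PARAMS:.1%}")  # ~9.9%
```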

6.4 More training data of C5 and steps

As we can see from Table 4, for the same number of steps, BERT-base trained with 3 GB of data scores 0.27 points higher than BERT-base trained with 1 GB (Index 5 vs. 7). Meanwhile, BERT-base trained with clue_vocab scores 0.1 points higher than BERT-base trained with the Google vocabulary (Index 5 vs. 6). We conclude that increasing the corpus size improves model performance, and that clue_vocab is better than the Google vocabulary once the BERT-base model is trained for enough steps.

6.5 Performance of Large Version

We generate our training data in the same way as RoBERTa (Liu et al., 2019) and remove the next sentence prediction (NSP) task. To compare with RoBERTa-wwm-large (https://github.com/ymcui/Chinese-BERT-wwm), which is currently the best Chinese model, we also use whole word masking as our masking strategy.
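
A minimal sketch of whole word masking for Chinese is shown below. It assumes a hypothetical word-segmentation function `segment` (any Chinese word segmenter could play this role; the paper does not specify the tool used here), and the 80/10/10 replacement rule from the earlier masked-LM sketch is omitted for brevity.

```python
import random

def whole_word_mask(text, segment, mask_prob=0.15):
    """Whole word masking for Chinese text.

    `segment` maps a string to a list of words; when a word is selected,
    all of its characters are masked together.
    """
    tokens, labels = [], []
    for word in segment(text):             # e.g. ["我们", "发布", "了", "语料库"]
        if random.random() < mask_prob:
            for ch in word:                # mask every character of the chosen word
                tokens.append("[MASK]")
                labels.append(ch)
        else:
            for ch in word:
                tokens.append(ch)
                labels.append(None)        # None = not predicted
    return tokens, labels

# Toy usage with a whitespace "segmenter" standing in for a real CWS tool.
toks, labs = whole_word_mask("我们 发布 了 语料库", str.split)
```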

To speed up pre-training, similar to BERT, we first train for 500k steps at sequence length 128 with batch size 8k. We then train for 600k steps at sequence length 512 with batch size 4k, which makes the model more suitable for tasks with longer sequences.

As we can see from Table 9, the performance of the model gradually improves as the number of training steps increases. Finally, our performance surpasses that of the original RoBERTa-wwm-large.

6.6 Performance of Tiny version

State-of-the-art models like BERT achieve very good performance compared to other models. However, because these models are very large and deep, with hundreds of millions of parameters, they are usually very slow at prediction time. To ease this problem, we release a small version of the pre-trained model, which we keep as small and fast as possible while retaining most of the accuracy. For tasks that are not too difficult, such as classification with few labels or sentence-pair tasks, we recommend using this small model to replace the big and slow BERT models.

We name it RoBERTa-tiny-clue, as it is based on the RoBERTa model and trained with the corpus and vocabulary from CLUE. We first train it at sequence length 128 for 500k steps with batch size 8k using the 100 GB corpus, then train it for an additional 200k steps with the same batch size using an additional 30 GB of data. In total, it is trained on 5.6 billion training instances.

The hyper-parameter configuration is kept the same as ALBERT-tiny (https://github.com/brightmart/albert_zh), with a hidden size of 312 and 4 layers. It is around ten times faster for training and prediction than BERT-base, and it gains an additional speedup of about ten percent even compared to ALBERT-tiny, since the vocabulary it uses, vocab_clue, is only one-third of BERT's vocabulary. Most importantly, its performance is much better than that of ALBERT-tiny; see Table 8 for a comparison with BERT-base and ALBERT-tiny.
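
For reference, a BERT-style configuration for RoBERTa-tiny-clue might look like the sketch below. Only the hidden size (312), layer count (4), and vocabulary size (8021) come from the text; every other field is an illustrative assumption, not the released configuration.

```python
# Hypothetical configuration dictionary for RoBERTa-tiny-clue.
roberta_tiny_clue_config = {
    "vocab_size": 8021,             # vocab_clue
    "hidden_size": 312,
    "num_hidden_layers": 4,
    "num_attention_heads": 12,      # assumed
    "intermediate_size": 1248,      # assumed (4 * hidden_size)
    "max_position_embeddings": 512, # assumed
    "hidden_act": "gelu",           # assumed
    "hidden_dropout_prob": 0.1,     # assumed
    "attention_probs_dropout_prob": 0.1,  # assumed
    "type_vocab_size": 2,           # assumed
}
```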

6.7 Transfer Learning Among Similar Tasks using Pre-trained Models

Index Model Vocab Data Steps Init AFQMC TNEWS IFLYTEK CMNLI AVG
13 RoBERTa-wwm-large’ Google / / 74.44 58.41 62.77 82.2 69.46
14 RoBERTa-wwm-large’ Google / / CMNLI 75.19 58.41 62.77 82.2 69.64
15 RoBERTa-large-clue CLUE C5 (100 GB) 100K 69.9 56.95 62.08 80.48 67.35
16 RoBERTa-large-clue CLUE C5 (100 GB) 200K 69.98 58.66 62.50 81.33 68.12
17 RoBERTa-large-clue CLUE C5 (100 GB) 500K 74.00 58.70 62.31 82.04 69.26
18 RoBERTa-large-clue CLUE C5 (100 GB) 500K CMNLI 74.41 58.70 62.31 82.04 69.37
19 RoBERTa-large-clue CLUE C5 (100 GB) 650K 70.01 58.52 62.54 82.68 68.44
20 RoBERTa-large-clue CLUE C5 (100 GB) 650K CMNLI 74.41 58.52 62.54 82.68 69.54
21 RoBERTa-large-clue CLUE C5 (130 GB) 800K CMNLI 74.41 58.38 63.58 82.36 69.68
Table 9: Performance of RoBERTa-large pre-trained with the 100 GB corpus and vocabulary from CLUE. Our model, RoBERTa-large-clue, trained with 100 GB, achieves overall performance comparable to RoBERTa-wwm-large' (Cui et al., 2019), with slightly better performance on two tasks, IFLYTEK and CMNLI. Init with CMNLI means that when training on AFQMC, the model is initialized from a model fine-tuned on CMNLI. All scores were obtained by submitting to the CLUE benchmark.

Pre-trained models are powerful, but they still struggle to learn tasks without enough training data. We observe that even the strong RoBERTa-large-clue model cannot learn the AFQMC sentence-pair task well. The CMNLI task, which is also a sentence-pair task, has a lot of training data (around 390k examples). Therefore, we first fine-tune our pre-trained model on CMNLI and then use the resulting model to initialize fine-tuning on AFQMC. This yields a performance boost of around 0.8 to 4 points compared to initializing from the pre-trained model directly; see the rows with CMNLI in the Init column of Table 9. We regard this as a form of transfer learning, which takes knowledge learned from one task and applies it to another similar task. We name the model fine-tuned on CMNLI RoBERTa-pair and release it in our repository. We believe that for many other sentence-pair tasks, people can also achieve better performance with this model than by initializing from general pre-trained models.
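
The sketch below illustrates this two-stage initialization, treating checkpoints as plain dictionaries of arrays (a simplified stand-in for any concrete framework's checkpoint loading): encoder weights are copied from the CMNLI-fine-tuned model, while the task head keeps its fresh random initialization because AFQMC has its own label space. The variable names and the `classifier` prefix are illustrative assumptions.

```python
import numpy as np

def init_from_cmnli(pretrained, cmnli_finetuned, head_prefix="classifier"):
    """Initialize a new task model from a CMNLI-fine-tuned checkpoint.

    Both arguments are {variable_name: array} dictionaries. Encoder weights are
    transferred from the CMNLI model; the task head keeps its random init.
    """
    init = {}
    for name, value in pretrained.items():
        if name.startswith(head_prefix):
            init[name] = value                         # new head: keep random init
        else:
            init[name] = cmnli_finetuned[name].copy()  # encoder: transfer from CMNLI
    return init

# Toy usage with random arrays standing in for real checkpoints.
rng = np.random.default_rng(0)
pretrained = {"encoder/layer_0/kernel": rng.normal(size=(4, 4)),
              "classifier/kernel": rng.normal(size=(4, 2))}   # AFQMC: 2 labels
cmnli = {"encoder/layer_0/kernel": rng.normal(size=(4, 4)),
         "classifier/kernel": rng.normal(size=(4, 3))}        # CMNLI: 3 labels
afqmc_init = init_from_cmnli(pretrained, cmnli)
```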

7 Conclusion

In this paper, we introduce CLUECorpus2020, a large-scale corpus that can be used directly for pre-training language models in Chinese. It is the first well-defined, large-scale, publicly available dataset that serves this purpose. We conduct experiments on a small portion of this new dataset and on Chinese Wikipedia; the results show that our dataset has good quality and great potential. In addition, we conduct experiments on the full dataset to pre-train a full-size model. We also release a new vocabulary that is small in size but works well for Chinese tasks. With our corpus and vocabulary, our model is able to match state-of-the-art performance in Chinese. We also observe that transfer learning among similar tasks is useful and can boost performance. We release our dataset, vocabulary, pre-trained models, and code on GitHub.

In this work, we focus on pre-training, especially for language understanding. However, this dataset can also be used for language generation and other NLP tasks. We leave these for further study.

8 Acknowledgements

Our research is supported by Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). We thank Zhe Zhao, Junyi Li, Shaomian Zheng, Zhenzhong Lan, and Peng Li for sharing the cost of the experiments.

References

  • W. Che, Z. Li, and T. Liu (2010) LTP: a Chinese language technology platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pp. 13–16.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2019) ELECTRA: pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
  • iFLYTEK CO., LTD. (2019) IFLYTEK: a multiple categories Chinese text classifier. Competition official website, http://challenge.xfyun.cn/2019/gamelist.
  • A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • Y. Cui, W. Che, T. Liu, B. Qin, Z. Yang, S. Wang, and G. Hu (2019) Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • M. Sun, J. Li, Z. Guo, Y. Zhao, Y. Zheng, X. Si, and Z. Liu (2016) THUCTC: an efficient Chinese text classifier. GitHub repository.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • A. Williams, N. Nangia, and S. R. Bowman (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27.