BURT: BERT-inspired Universal Representation from Learning Meaningful Segment

12/28/2020 ∙ by Yian Li, et al. ∙ Shanghai Jiao Tong University

Although pre-trained contextualized language models such as BERT achieve strong performance on various downstream tasks, current language representation still focuses only on linguistic objectives at a specific granularity, which may not be applicable when multiple levels of linguistic units are involved at the same time. Thus this work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space. We present a universal representation model, BURT (BERT-inspired Universal Representation from learning meaningful segmenT), to encode different levels of linguistic units into the same vector space. Specifically, we extract and mask meaningful segments based on point-wise mutual information (PMI) to incorporate objectives of different granularities into the pre-training stage. We conduct experiments on datasets for English and Chinese, including the GLUE and CLUE benchmarks, where our model surpasses its baselines and alternatives on a wide range of downstream tasks. We present our approach to constructing analogy datasets in terms of words, phrases and sentences, and experiment with multiple representation models to examine geometric properties of the learned vector space through a task-independent evaluation. Finally, we verify the effectiveness of our unified pre-training strategy in two real-world text matching scenarios. As a result, our model significantly outperforms existing information retrieval (IR) methods and yields universal representations that can be directly applied to retrieval-based question-answering and natural language generation tasks.




1 Introduction

Representations learned by deep neural models have attracted a lot of attention in Natural Language Processing (NLP). However, previous language representation learning methods such as Word2Vec [1], LASER [2] and USE [3] focus on either words or sentences. Later pre-trained contextualized language representations like ELMo [4], GPT [5], BERT [6] and XLNet [7] can seemingly handle input sequences of different sizes, but all of them still learn sentence-level specific representations for each word, leading to unsatisfactory performance in real-world situations. Although the more recent BERT-wwm-ext [8], StructBERT [9] and SpanBERT [10] perform masked language modeling (MLM) at a higher linguistic level, the masked segments (whole words, trigrams, spans) either follow a pre-defined distribution or focus on a specific granularity. Besides, the random sampling strategy ignores important semantic and syntactic information of a sequence, resulting in a large number of meaningless segments.

However, universal representation across different levels of linguistic units can offer great convenience when free text across the language hierarchy has to be handled in a unified way. It is well known that an embedding representation for a certain linguistic unit (i.e., words) enables linguistically meaningful arithmetic on vectors, also known as word analogy. For example, vector (“King”) - vector (“Man”) + vector (“Woman”) results in vector (“Queen”). Universal representation may thus generalize such analogy features or meaningful arithmetic operations to free text with all language levels involved together. For example, Eat an onion : Vegetable :: Eat a pear : Fruit. In fact, manipulating embeddings in the vector space reveals syntactic and semantic relations between the original sequences, and this feature is useful in real applications. For example, “London is the capital of England.” can be formalized as $v(\text{capital}) + v(\text{England}) \approx v(\text{London})$. Then, given two documents, one of which contains “England” and “capital” while the other contains “London”, we consider these two documents relevant.

In this paper, we explore the regularities of representations including words, phrases and sentences in the same vector space. To this end, we introduce a universal analogy task derived from Google’s word analogy dataset. To solve such a task, we present BURT, a pre-trained model that aims at learning universal representations for sequences of various lengths. Our model follows the architecture of BERT but differs in its masking and training scheme. Specifically, we propose to efficiently extract and prune meaningful segments (n-grams) from an unlabeled corpus with little human supervision, and then use them to modify the masking and training objective of BERT. The n-gram pruning algorithm is based on point-wise mutual information (PMI) and automatically captures different levels of language information, which is critical to improving the model's capability of handling multiple levels of linguistic objects in a unified way, i.e., embedding sequences of different lengths in the same vector space.

Overall, our pre-trained models improve the performance of the baselines in both English and Chinese. In English, BURT-base achieves a 0.7-point gain on average over Google BERT-base. In Chinese, BURT-wwm-ext obtains 74.5% on the WSC test set, a 13.4-point absolute improvement over BERT-wwm-ext, and exceeds the baselines by 0.2 to 0.6 points of accuracy on five other CLUE tasks including TNEWS, IFLYTEK, CSL, ChID and CMRC 2018. Extensive experimental results on our universal analogy task demonstrate that BURT is able to map sequences of variable lengths into a shared vector space where similar sequences are close to each other. Meanwhile, addition and subtraction of embeddings reflect semantic and syntactic connections between sequences. Moreover, BURT can be easily applied to real-world applications such as Frequently Asked Questions (FAQ) and Natural Language Generation (NLG) tasks, where it encodes words, sentences and paragraphs into the same embedding space and directly retrieves sequences that are semantically similar to the given query based on cosine similarity. All of the above experimental results demonstrate that our well-trained model leads to universal representation that can adapt to various tasks and applications.

2 Background

2.1 Word and Sentence Embeddings

Representing words as real-valued dense vectors is a core technique of deep learning in NLP. Word embedding models [1, 11, 12] map words into a vector space where similar words have similar latent representations. ELMo [4] attempts to learn context-dependent word representations through a two-layer bi-directional LSTM network. In recent years, more and more researchers have focused on learning sentence representations. The Skip-Thought model [13] is designed to predict the surrounding sentences for a given sentence. Logeswaran and Lee (2018) [14] improve the model structure by replacing the RNN decoder with a classifier. InferSent [15] is trained on the Stanford Natural Language Inference (SNLI) dataset [16] in a supervised manner. Subramanian et al. (2018) [17] and Cer et al. (2018) [3] employ multi-task training and report considerable improvements on downstream tasks. LASER [2] is a BiLSTM encoder designed to learn multilingual sentence embeddings. Most recently, contextualized representations with a language model training objective such as OpenAI GPT [5], BERT [6] and XLNet [7] are expected to capture complex features (syntax and semantics) for sequences of any length. In particular, BERT improves the pre-training and fine-tuning scenario, obtaining new state-of-the-art results on multiple sentence-level tasks. On the basis of BERT, further fine-tuning with a Siamese network on NLI data can effectively produce high-quality sentence embeddings [sbert]. Nevertheless, most of the previous work concentrates on a specific granularity. In this work we extend the training goal to a unified level and enable the model to leverage information of different granularities, including, but not limited to, words, phrases and sentences.

2.2 Pre-Training Tasks

BERT is trained on a large amount of unlabeled data with two training targets: Masked Language Model (MLM) for modeling deep bidirectional representations, and Next Sentence Prediction (NSP) for understanding the relationship between two sentences. ALBERT [18] is trained with Sentence-Order Prediction (SOP) as a substitute for NSP. StructBERT [9] has a sentence structural objective that combines the random sampling strategy of NSP and continuous sampling as in SOP. However, RoBERTa [19] and SpanBERT [10] use single contiguous sequences of 512 tokens for pre-training and show that removing the NSP objective improves the performance. Besides, BERT-wwm [8], StructBERT [9] and SpanBERT [10] perform MLM on higher linguistic levels, augmenting the MLM objective by masking whole words, trigrams or spans, respectively. In contrast, we concentrate on enhancing the masking and training procedures from a broader and more general perspective.

2.3 Analysis and Applications

Previous explorations of vector regularities mainly study word embeddings [1, 20]. After the introduction of sentence encoders and Transformer models [22], more work was done to investigate sentence-level embeddings. Usually the performance on downstream tasks is taken as the measure of a model's ability to represent sentences [15, 3, 23]. Some research proposes probing tasks to understand certain aspects of sentence embeddings [24, 25, 26]. Specifically, Rogers et al. (2020) [27] and Ma et al. (2019) [28] look into BERT embeddings and reveal their internal working mechanisms. Some work also explores the regularities in sentence embeddings [30, 31]. Nevertheless, little work analyzes words, phrases and sentences in the same vector space. In this paper, we work on embeddings for sequences of various lengths obtained by different models in a task-independent manner.

Transformer-based representation models have made great progress in measuring query-Question or query-Answer similarities. Damani et al. (2020) [35] analyze Transformer models and propose a neural architecture to solve the FAQ task. Sakata et al. (2019) [36] propose an FAQ retrieval system that combines the characteristics of BERT and rule-based methods. In this work, we also evaluate the well-trained universal representation models on an FAQ task.

Fig. 1: An illustration of n-gram pre-training.
Fig. 2: An example from the Chinese Wikipedia corpus. n-grams of different lengths are marked with dashed boxes in different colors in the upper part of the figure. During training, we randomly mask n-grams and only the longest n-gram is masked if there are multiple matches, as shown in the lower part of the figure.

3 Methodology

BURT follows the Transformer encoder [22] architecture, where the input sequence is first split into subword tokens and a contextualized representation is learned for each token. We only perform MLM training on single sequences, as suggested in [10]. The basic idea is to mask some of the tokens from the input and force the model to recover them from the context. Here we propose a unified masking method and training objective that consider linguistic units of different granularities.

Specifically, we apply a pruning mechanism to collect meaningful n-grams from the corpus and then perform n-gram masking and prediction. Our model differs from the original BERT and other BERT-like models in several ways. First, instead of the token-level MLM of BERT, we incorporate different levels of linguistic units into the training objective in a comprehensive manner. Second, unlike SpanBERT and StructBERT, which sample random spans or trigrams, our n-gram sampling approach automatically discovers structures within any sequence and is not limited to any granularity.

3.1 N-gram Pruning

In this subsection, we introduce our approach of extracting a large number of meaningful n-grams from the monolingual corpus, which is a critical step of data processing.

First, we scan the corpus and extract all n-grams with lengths up to $N$ using the SRILM toolkit (http://www.speech.sri.com/projects/srilm/download.html) [37]. In order to filter out meaningless n-grams and prevent the vocabulary from being too large, we apply pruning by means of point-wise mutual information (PMI) [38]. To be specific, mutual information $I(x, y)$ describes the association between tokens $x$ and $y$ by comparing the probability of observing $x$ and $y$ together with the probabilities of observing $x$ and $y$ independently. Higher mutual information indicates stronger association between the two tokens.

$$I(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$

In practice, $P(x)$ and $P(y)$ denote the probabilities of $x$ and $y$, respectively, and $P(x, y)$ represents the joint probability of observing $x$ followed by $y$. This alleviates bias towards high-frequency words and allows tokens that are rarely used individually but often appear together, such as “San Francisco”, to have higher scores. In our application, an n-gram denoted as $w = (x_1, \ldots, x_{L_w})$, where $L_w$ is the number of tokens in $w$, may contain more than two words. Therefore, we present an extended PMI formula as below:

$$\mathrm{PMI}(w) = \frac{1}{L_w} \left( \log P(w) - \sum_{k=1}^{L_w} \log P(x_k) \right)$$

where the probabilities are estimated by counting the number of observations of each token and n-gram in the corpus, and normalizing by the size of the corpus. $\frac{1}{L_w}$ is an additional normalization factor which avoids extremely low scores for longer n-grams. Finally, n-grams with PMI scores below the chosen threshold are filtered out, resulting in a vocabulary of meaningful n-grams.
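The pruning step can be illustrated with a minimal sketch. It assumes the score of an n-gram is its log-probability minus the sum of its tokens' log-probabilities, divided by the n-gram length, with all probabilities estimated as counts over the corpus size. The function names, the toy corpus and the threshold are illustrative assumptions, and plain Python counting stands in for the SRILM extraction step.

```python
import math
from collections import Counter


def extract_ngrams(tokens, max_len):
    """Yield every contiguous n-gram (as a tuple) with 2 <= n <= max_len."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def pmi_scores(tokens, max_len=3):
    """Length-normalized PMI for every n-gram observed in `tokens`.

    Probabilities are counts divided by the corpus size (number of tokens).
    """
    total = len(tokens)
    unigram = Counter(tokens)
    ngram = Counter(extract_ngrams(tokens, max_len))
    scores = {}
    for w, count in ngram.items():
        log_p_w = math.log(count / total)
        log_p_tokens = sum(math.log(unigram[x] / total) for x in w)
        scores[w] = (log_p_w - log_p_tokens) / len(w)
    return scores


def prune(scores, threshold):
    """Keep only n-grams whose score reaches the (hypothetical) threshold."""
    return {w for w, s in scores.items() if s >= threshold}


# Toy corpus: "san francisco" always co-occurs, "is big" does not.
corpus = "san francisco is big and san francisco is far and the big city is far".split()
scores = pmi_scores(corpus, max_len=2)
vocab = prune(scores, threshold=0.9)
```

On this toy corpus, ("san", "francisco") scores higher than ("is", "big") because its tokens never occur apart. N-grams built from tokens seen only once also score high, which illustrates why such counts are more reliable on a large corpus.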

3.2 N-gram Masking

For a given input $x = (x_1, \ldots, x_n)$, where $n$ is the number of tokens in $x$, special tokens [CLS] and [SEP] are added at the beginning and end of the sequence, respectively. Before feeding the training data into the Transformer blocks, we identify all the n-grams in the sequence using the aforementioned n-gram vocabulary. An example is shown in Figure 2, where n-grams overlap, which indicates the multi-granular inner structure of the given sequence. In order to make better use of higher-level linguistic information, the longest n-gram is retained if multiple matches exist. Compared with other masking strategies, our method has two advantages. First, n-gram extraction and matching can be done efficiently in an unsupervised manner without introducing random noise. Second, by utilizing n-grams of different lengths, we generalize the masking and training objective of BERT to a unified level where linguistic units of different granularities are integrated.

Following BERT, we mask 15% of all tokens in each sequence. The data processing algorithm uniformly samples one n-gram at a time until the maximum number of masked tokens is reached. 80% of the time we replace the entire n-gram with [MASK] tokens, 10% of the time it is replaced with random tokens, and 10% of the time we keep it unchanged. The original token-level masking is retained and considered a special case of n-gram masking where the n-gram length is 1. We employ dynamic masking as described by Liu et al. (2019) [19], which means that masking patterns for the same sequence are likely to differ across epochs.
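The masking procedure can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes greedy left-to-right longest-match segmentation, treats unmatched tokens as length-1 n-grams, and approximates "uniformly sample one n-gram at a time" by shuffling the segments. The names `find_ngrams` and `mask_sequence` and the toy vocabularies are hypothetical.

```python
import random


def find_ngrams(tokens, ngram_vocab, max_len):
    """Segment `tokens` into non-overlapping (start, end) spans.

    At each position the longest n-gram found in `ngram_vocab` wins;
    tokens with no match become length-1 spans (token-level masking).
    """
    spans, i = [], 0
    while i < len(tokens):
        match = 1
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in ngram_vocab:
                match = n
                break
        spans.append((i, i + match))
        i += match
    return spans


def mask_sequence(tokens, ngram_vocab, token_vocab, max_len=4, ratio=0.15, rng=random):
    """Mask whole n-grams until about `ratio` of the tokens are selected."""
    budget = max(1, round(len(tokens) * ratio))
    spans = find_ngrams(tokens, ngram_vocab, max_len)
    rng.shuffle(spans)  # uniform sampling without replacement
    out, chosen, used = list(tokens), [], 0
    for start, end in spans:
        if used + (end - start) > budget:
            continue  # this n-gram would exceed the masking budget
        r = rng.random()  # one 80/10/10 decision per n-gram
        for k in range(start, end):
            if r < 0.8:
                out[k] = "[MASK]"
            elif r < 0.9:
                out[k] = rng.choice(token_vocab)
            # otherwise keep the original token
        chosen.append((start, end))
        used += end - start
        if used == budget:
            break
    return out, sorted(chosen)


tokens = ("we flew to san francisco and then to new york city last week " * 2).split()
ngram_vocab = {("san", "francisco"), ("new", "york"), ("new", "york", "city")}
masked, chosen = mask_sequence(tokens, ngram_vocab, sorted(set(tokens)),
                               rng=random.Random(0))
```

Calling `mask_sequence` again with a different random state yields a different pattern for the same sequence, which is the effect dynamic masking relies on.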

3.3 Training Objective

As depicted in Figure 1, the Transformer encoder generates a fixed-length contextualized representation at each input position and the model only predicts the masked tokens. Ideally, a universal representation model is able to capture features for multiple levels of linguistic units. Therefore, we extend the MLM training objective to a more general situation, where the model is trained to predict n-grams rather than subwords.

$$\log P(w \mid \hat{x}) = \log P(x_i, \ldots, x_j \mid \hat{x}) = \sum_{k=i}^{j} \log P(x_k \mid \hat{x})$$

where $w$ is a masked n-gram and $\hat{x}$ is a corrupted version of the input sequence. $(i, j)$ represents the absolute start and end positions of $w$.
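One way to read this objective, stated here as an assumption rather than the authors' exact formulation, is that the log-likelihood of a masked n-gram decomposes into the sum of the log-probabilities the model assigns to its original tokens. The helper below is a hypothetical illustration; `token_probs[k]` stands for the probability the model gives to the correct token at position `k` of the corrupted input.

```python
import math


def ngram_log_likelihood(token_probs, start, end):
    """Sum of per-token log-probabilities over positions start..end (inclusive)."""
    return sum(math.log(token_probs[k]) for k in range(start, end + 1))


# Toy example: a two-token n-gram at positions 1..2.
probs = [0.9, 0.5, 0.5, 0.8]
score = ngram_log_likelihood(probs, 1, 2)  # log(0.5) + log(0.5)
```

Training would maximize this quantity (or minimize its negative) over all masked n-grams in a batch.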

GLUE      Length        #Train   #Dev   #Test   #L
CoLA      Short         8.5k     1k     1k      2
SST-2     Short         67k      872    1.8k    2
MNLI      Short-Short   393k     20k    20k     3
QNLI      Short-Short   105k     5.5k   5.5k    2
RTE       Short-Short   2.5k     277    3k      2
MRPC      Short-Short   3.7k     408    1.7k    2
QQP       Short-Short   364k     40k    391k    2
STS-B     Short-Short   5.8k     1.5k   1.4k    -

CLUE      Length        #Train   #Dev   #Test   #L
TNEWS     Short         53k      10k    10k     15
IFLYTEK   Long          12k      2.6k   2.6k    119
WSC       Short         1.2k     304    290     2
AFQMC     Short-Short   34k      4.3k   3.9k    2
CSL       Long-Short    20k      3k     3k      2
OCNLI     Short-Short   50k      3k     3k      3
CMRC18    Long          10k      1k     3.2k    -
ChID      Long          85k      3.2k   3.2k    -
C3        Short         12k      3.8k   3.9k    -
TABLE I: Statistics of datasets from the GLUE and CLUE benchmarks. #Train, #Dev and #Test are the sizes of the training, development and test sets, respectively. #L is the number of labels. Sequences are simply divided into two categories according to their length: “Long” and “Short”.

4 Task Setup

To evaluate the model's ability to handle different linguistic units, we apply our model to downstream tasks from the GLUE and CLUE benchmarks. Moreover, we construct a universal analogy task based on Google’s word analogy dataset to explore the regularity of universal representation. Finally, we present an insurance FAQ task and a retrieval-based language generation task, where the key is to embed sequences of different lengths in the same vector space and retrieve sequences with similar meaning to the given query.

4.1 General Language Understanding

Statistics of the GLUE and CLUE benchmarks are listed in Table I. Besides the diversity of task types, we also find that different datasets concentrate on sequences of different lengths, which satisfies our need to examine the model's ability to represent linguistic units of multiple granularities.

4.1.1 GLUE

The General Language Understanding Evaluation (GLUE) benchmark [39] is a collection of tasks that is widely used to evaluate the performance of English language models. We divide eight NLU tasks from the GLUE benchmark into three main categories.

Single-Sentence Classification The Corpus of Linguistic Acceptability (CoLA) [40] is to determine whether a sentence is grammatically acceptable or not. The Stanford Sentiment Treebank (SST-2) [41] is a sentiment classification task that requires the model to predict whether the sentiment of a sentence is positive or negative. In both datasets, each example is a sequence of words annotated with a label.

Natural Language Inference Multi-Genre Natural Language Inference (MNLI) [42], Question Natural Language Inference (QNLI) [43], which is derived from the Stanford Question Answering Dataset, and Recognizing Textual Entailment (RTE) [rte] are natural language inference tasks, where a pair of sentences is given and the model is trained to identify the relationship between the two sentences from entailment, contradiction, and neutral.

Semantic Similarity Semantic similarity tasks identify whether two sentences are equivalent or measure the degree of semantic similarity of two sentences according to their representations. The Microsoft Research Paraphrase Corpus (MRPC) [44] and the Quora Question Pairs (QQP) dataset are paraphrase datasets, where each example consists of two sentences and a label of “1” indicating they are paraphrases or “0” otherwise. The goal of the Semantic Textual Similarity benchmark (STS-B) [45] is to predict a continuous score from 1 to 5 for each pair as the similarity of the two sentences.

4.1.2 CLUE

The Chinese General Language Understanding Evaluation (ChineseGLUE or CLUE) benchmark [46] is a Chinese version of the GLUE benchmark for language understanding. We also find nine tasks from the CLUE benchmark can be classified into three groups.

Single Sentence Tasks We utilize three single-sentence classification tasks including TouTiao Text Classification for News Titles (TNEWS), IFLYTEK [47] and the Chinese Winograd Schema Challenge (WSC) dataset. Examples from TNEWS and IFLYTEK are short and long sequences, respectively, and the goal is to predict the category that the given single sequence belongs to. WSC is a coreference resolution task where the model is required to decide whether two spans refer to the same entity in the original sequence.

Sentence Pair Tasks The Ant Financial Question Matching Corpus (AFQMC), Chinese Scientific Literature (CSL) dataset and Original Chinese Natural Language Inference (OCNLI) [48] are three pairwise textual classification tasks. AFQMC contains sentence pairs and binary labels, and the model is asked to examine whether two sentences are semantically similar. Each example in CSL involves a text and several keywords. The model needs to determine whether these keywords are true labels of the text. OCNLI is a natural language inference task following the same collection procedures of MNLI.

Machine Reading Comprehension Tasks CMRC 2018 [49], ChID [50], and C3 [51] are span-extraction based, cloze style and free-form multiple-choice machine reading comprehension datasets, respectively. Answers to the questions in CMRC 2018 are spans extracted from the given passages. ChID is a collection of passages with blanks and corresponding candidates for the model to decide the most suitable option. C3 is similar to RACE and DREAM, where the model has to choose the correct answer from several candidate options based on a text and a question.

A : B :: C               Candidates
boy:girl::brother        daughter, *sister*, wife, father, son
bad:worse::big           *bigger*, larger, smaller, biggest, better
Beijing:China::Paris     *France*, Europe, Germany, Belgium, London
Chile:Chilean::China     Japanese, *Chinese*, Russian, Korean, Ukrainian
TABLE II: Examples from our word analogy dataset. The correct answers are marked with asterisks.

4.2 Universal Analogy

As a new task, universal representation has to be evaluated on a multi-granular analogy dataset. The purpose of proposing a task-independent dataset is to avoid judging the quality of the learned vectors and interpreting the model on the basis of one specific problem or situation. Since embeddings are essentially dense vectors, it is natural to apply mathematical operations to them. In this subsection, we introduce the procedure for constructing analogy datasets at different levels based on Google’s word analogy dataset.

4.2.1 Word-level analogy

Recall that in a word analogy task [1], two pairs of words that share the same type of relationship, denoted as $A$ : $B$ :: $C$ : $D$, are involved. The goal is to solve questions like “$A$ is to $B$ as $C$ is to ?”, which is to retrieve the last word from the vocabulary given the first three words. The objective can be formulated as maximizing the cosine similarity between the target word embedding and the linear combination of the given vectors:

$$d^{*} = \arg\max_{d} \cos(d, \; c - a + b)$$

where $a$, $b$, $c$, $d$ represent embeddings of the corresponding words and are all normalized to unit lengths.

To facilitate comparison between models with different vocabularies, we construct a closed-vocabulary analogy task based on Google’s word analogy dataset through negative sampling. Concretely, for each question, we use GloVe to rank every word in the vocabulary and the top 5 results are considered to be candidate words. If GloVe fails to retrieve the correct answer, we manually add it to make sure it is included in the candidates. During evaluation, the model is expected to select the correct answer from 5 candidate words. Examples are listed in Table II.
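The closed-vocabulary evaluation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three-dimensional embeddings are made-up toy values rather than output of GloVe or BURT, and `solve_analogy` is a hypothetical helper that picks the candidate whose embedding is closest, by cosine similarity, to the offset vector built from the first three items.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def solve_analogy(emb, a, b, c, candidates):
    """Answer "a is to b as c is to ?" by choosing among `candidates`."""
    va, vb, vc = (normalize(emb[w]) for w in (a, b, c))
    target = [y - x + z for x, y, z in zip(va, vb, vc)]  # b - a + c
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy embeddings with dimensions (person, royal, female); values are invented.
emb = {
    "man": [1.0, 0.0, 0.0],
    "king": [1.0, 1.0, 0.0],
    "woman": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}
answer = solve_analogy(emb, "man", "king", "woman", ["queen", "king", "woman", "man"])
```

The same function applies unchanged to phrase or sentence embeddings, since it only needs one vector per sequence.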

4.2.2 Phrase/Sentence-level analogy

To investigate the arithmetic properties of vectors for higher levels of linguistic units, we present phrase and sentence analogy tasks based on the proposed word analogy dataset. We only consider a subset of the original analogy task because we find that for some categories, such as “Australia” : “Australian”, the same template phrase or sentence cannot be applied to both words. Statistics are shown in Table III.

Dataset #p #q #c #l (p/s)
capital-common 23 506 5 6.0/12.0
capital-world 116 4524 5 6.0/12.0
city-state 67 2467 5 6.0/12.0
male-female 23 506 5 4.1/10.1
present-participle 33 1056 2 4.8/8.8
positive-comparative 37 1322 2 3.4/6.1
positive-negative 29 812 2 4.4/9.2
All 328 11193 - 5.4/10.7
TABLE III: Statistics of our analogy datasets. #p and #q are the number of pairs and questions for each category. #c is the number of candidates for each dataset. #l (p/s) is the average sequence length in phrase/sentence-level analogy datasets.
Fig. 3: Examples of Question-Answer pairs from our insurance FAQ dataset. The correct match to the query is highlighted.

Semantic Semantic analogies can be divided into four subsets: “capital-common”, “capital-world”, “city-state” and “male-female”. The first two sets can be merged into a larger dataset, “capital-country”, which contains pairs of countries and their capital cities; the third involves states and their cities; the last one contains pairs with gender relations. Considering GloVe’s poor performance on word-level “country-currency” questions (32%), we discard this subset in phrase and sentence-level analogies. Then we put words into contexts so that the resulting phrases and sentences also have linear relationships. For example, based on the relationship Athens : Greece :: Baghdad : Iraq, we select phrases and sentences that contain the word “Athens” from the English Wikipedia Corpus (https://dumps.wikimedia.org/enwiki/latest), such as “He was hired by the university of Athens as being professor of physics.”, and create examples: “hired by … Athens” : “hired by … Greece” :: “hired by … Baghdad” : “hired by … Iraq”. However, we found that such a question is identical to word-level analogy for bag-of-words (BOW) methods like averaging GloVe vectors, because they treat embeddings independently regardless of content and word order. To avoid lexical overlap between sequences, we replace certain words and phrases with their synonyms and paraphrases, e.g., “hired by … Athens” : “employed by … Greece” :: “employed by … Baghdad” : “hired by … Iraq”. Sentences selected from the corpus usually contain a lot of redundant information. To ensure consistency, we manually modify some words during the construction of templates. This procedure does not affect the relationship between sentences.

Category Topics
Daily Scenarios Traveling, Recipe, Skin care, Beauty makeup, Pets   22
Sport & Health Outdoor sports, Athletics, Weight loss, Medical treatment   15
Reviews Movies, Music, Poetry, Books   16
Persons Entrepreneurs, Historical/Public figures, Writers, Directors, Actors   17
General Festivals, Hot topics, TV shows     6
Specialized Management, Marketing, Commerce, Workplace skills   17
Others Relationships, Technology, Education, Literature   14
All - 107
TABLE IV: Details of the templates.

Syntactic We consider three typical syntactic analogies: Tense, Comparative and Negation, corresponding to three subsets: “present-participle”, “positive-comparative” and “positive-negative”, where the model needs to distinguish the correct answer from “past tense”, “superlative” and “positive”, respectively. For example, given the phrases “Pigs are bright” : “Pigs are brighter than goats” :: “The train is slow”, the model needs to give a higher similarity score to the sentence that contains “slower” than to the one that contains “slowest”. Similarly, we add synonyms and synonymous phrases for each question to evaluate the model's ability to learn context-aware embeddings rather than interpret each word in the question independently. For instance, “pleasant” ≈ “not unpleasant” and “unpleasant” ≈ “not pleasant”.

4.3 Retrieval-based FAQ

The sentence-level analogy discovers relationships between sentences by directly manipulating sentence vectors. In particular, we observe that sentences with similar meanings are close to each other in the vector space, which is consistent with the goal of information retrieval tasks such as Frequently Asked Questions (FAQ). Such a task is to retrieve relevant documents (FAQs) given a user query, which can be done by manipulating only the vectors representing the sentences, such as calculating and ranking vector distances in terms of cosine similarity. Thus, we present an insurance FAQ task in this subsection to explore the effectiveness of BURT in real-world retrieval applications.

An FAQ task involves a collection of Question-Answer (QA) pairs denoted as $\{(Q_1, A_1), \ldots, (Q_M, A_M)\}$, where $M$ is the number of QA pairs. The goal is to retrieve the most relevant QA pairs for a given query. We collect frequently asked questions and answers between users and customer service from our partners in a Chinese online financial education institution. It contains over 4 types of insurance questions, e.g., concept explanation (“what”), insurance consultation (“why”, “how”), judgement (“whether”) and recommendation. An example is shown in Figure 3. Our dataset is composed of 300 QA pairs that are carefully selected to avoid similar questions, so that each query has only one exact match. Because queries are mainly paraphrases of the standard questions, we use query-Question similarity as the ranking score. The test set consists of 875 queries, and the average lengths of questions and queries are 14 and 16, respectively. The evaluation metrics are Top-1 Accuracy (Acc.) and Mean Reciprocal Rank (MRR), because there is only one correct answer for each query.
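The evaluation protocol can be sketched as follows, assuming every query and standard question has already been encoded into a vector by some representation model. The two-dimensional vectors below are invented for illustration, and `rank_questions` and `evaluate` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_questions(query_vec, question_vecs):
    """Indices of the standard questions, most similar to the query first."""
    sims = [cosine(query_vec, q) for q in question_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


def evaluate(query_vecs, gold, question_vecs):
    """Top-1 accuracy and Mean Reciprocal Rank with one gold question per query."""
    top1, rr = 0, 0.0
    for q, g in zip(query_vecs, gold):
        rank = rank_questions(q, question_vecs).index(g) + 1
        top1 += rank == 1
        rr += 1.0 / rank
    return top1 / len(gold), rr / len(gold)


# Three standard questions and two queries (toy vectors).
questions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.6, 0.8]]
gold = [0, 1]  # index of the matching question for each query
acc, mrr = evaluate(queries, gold, questions)
```

Here the first query ranks its gold question first and the second ranks it second, so the toy accuracy is 0.5 and the MRR is 0.75.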

Batch size: 8, 16; Length: 128, 256; Epoch: 2, 3, 5, 50; lr: 1e-5, 2e-5, 3e-5
                Single Sentence            Sentence Pair            MRC
Models          TNEWS   IFLYTEK   WSC      AFQMC   CSL     OCNLI    CMRC18   ChID    C3      Avg.
                (acc)   (acc)     (acc)    (acc)   (acc)   (acc)    (EM)     (acc)   (acc)
BERT            56.6    60.3      62.0     73.7    80.4    72.2     71.6     82.0    64.5    69.3
MLM             56.5    60.2      70.7     73.3    79.3    70.6     69.1     81.3    64.8    69.5
Span            56.7    59.6      72.1     73.5    79.7    71.0     71.6     82.2    65.3    70.2
BURT            56.9    60.5      74.1     73.1    80.8    71.3     71.7     82.2    65.7    70.7
BERT-wwm-ext    56.8    59.4      61.1     74.1    80.6    73.4     74.0     82.9    68.5    70.1
MLM             56.7    59.4      71.0     74.0    80.4    72.8     73.1     82.1    67.4    70.8
Span            56.9    58.5      73.8     73.2    80.2    71.6     72.4     82.2    67.1    70.7
BURT-wwm-ext    57.3    60.1      74.5     73.8    81.0    72.2     74.2     83.0    67.6    71.5
TABLE V: CLUE test results scored by the evaluation server (https://www.cluebenchmarks.com/rc.html). “acc” and “EM” denote accuracy and Exact Match, respectively.
Batch size: 8, 16, 32, 64; Length: 128; Epoch: 3; lr: 3e-5
              Single Sentence    NLI                             Semantic Similarity
Models        CoLA    SST-2      MNLI         QNLI    RTE        MRPC    QQP     STS-B    Avg.
              (mc)    (acc)      m/mm(acc)    (acc)   (acc)      (F1)    (F1)    (pc)
BERT-base     52.1    93.5       84.6/83.4    90.5    66.4       88.9    71.2    87.1     79.7
MLM           51.9    93.5       84.5/83.9    90.7    65.0       88.1    71.6    86.2     79.5
Span          53.3    93.8       84.5/84.0    90.9    66.6       88.0    71.6    86.1     79.9
BURT-base     55.7    94.5       84.7/84.1    91.1    67.1       88.2    71.6    86.4     80.4
BERT-large    60.5    94.9       86.7/85.9    92.7    70.1       89.3    72.1    87.6     82.2
MLM           61.1    94.5       86.6/85.6    92.5    69.2       90.2    72.3    87.0     82.1
Span          60.1    94.8       86.5/85.9    92.6    69.4       89.3    72.3    87.3     82.0
BURT-large    62.6    94.7       86.8/86.0    92.7    70.8       89.7    72.3    87.3     82.5
TABLE VI: GLUE test results scored by the evaluation server (https://gluebenchmark.com). We exclude the problematic WNLI set and recalculate the “Avg.” score. Results for BERT-base and BERT-large are obtained from [6]. “mc” and “pc” are Matthews correlation coefficient [52] and Pearson correlation coefficient, respectively.

4.4 Natural Language Generation

Moving from word and sentence vectors towards representations for sequences of any length, a universal language model may be able to capture the semantics of free text and facilitate various applications that depend heavily on the quality of language representation. In this subsection, we introduce a retrieval-based Natural Language Generation (NLG) task. The task is to generate articles based on manually created templates. Concretely, the goal is to retrieve one paragraph at a time from the corpus which best describes a certain sentence from the template, and then combine the retrieved paragraphs into a complete passage. The main difficulty of this task lies in the need to compare the semantics of sentence-level queries (which usually contain only a few words) and paragraph-level documents (which often consist of multiple sentences).

We use articles collected by our partners in a media company as our corpus. Each article is split into several paragraphs and each document contains one paragraph. The corpus has a total of 656k documents and covers a wide range of domains, including news, stories and daily scenarios. In addition, we have a collection of manually created templates in 7 main categories, as shown in Table IV. Each template provides an outline of an article and contains up to sentences. Each sentence describes a particular aspect of the topic.

The problem is solved in two steps. First, an index over all the documents is built using BM25; for each query, it returns a set of candidate documents that are related to the topic. Second, we use representation models to re-rank the top 100 candidates: each query-document pair is mapped to a score, where the scoring function is based on cosine similarity. The quality of the generated passages was assessed by two native Chinese speakers, who were asked to examine whether the retrieved paragraphs were “relevant” to the topic and “conveyed the meaning” of the given sentence.
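The second (re-ranking) step can be sketched as follows. This is a minimal illustration, assuming the first-stage retriever has already returned candidate documents and that fixed-length embeddings are available for the query and each candidate; the function names and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(query_vec, candidates, top_k=1):
    """Re-rank first-stage candidates by cosine similarity to the query.

    candidates: list of (doc_id, doc_vec) pairs, e.g. the top 100 BM25 hits.
    Returns the top_k (doc_id, score) pairs, best first.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy example with made-up 2-d embeddings.
query = [1.0, 0.0]
candidates = [("d1", [0.0, 1.0]), ("d2", [2.0, 0.1]), ("d3", [1.0, 1.0])]
best = rerank(query, candidates, top_k=2)
```

Because cosine similarity ignores vector length, a long paragraph and a short query can be compared directly once both are embedded in the same space.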

5 Implementation

5.1 Data Processing

We download the English and Chinese Wikipedia corpora (https://dumps.wikimedia.org) and pre-process them with process_wiki.py (https://github.com/panyang/Wikipedia_Word2vec/blob/master), which extracts text from XML files. Then, for the Chinese corpus, we convert the data into simplified characters using OpenCC. To extract high-quality n-grams, we remove punctuation marks and characters in other languages using regular expressions, and finally obtain an English corpus of 2,266M words and a Chinese corpus of 380M characters.

We calculate PMI scores of all n-grams with a maximum length of for each document instead of the entire corpus, considering that different documents usually describe different topics. We manually evaluate the extracted n-grams and find that nearly 50% of the top 2000 n-grams contain 3 to 4 words (characters for Chinese), and fewer than 0.5% of n-grams are longer than 7. Although a larger n-gram vocabulary can cover longer n-grams, it also introduces too many meaningless n-grams. Therefore, for both the English and Chinese corpora, we empirically retain the top 3000 n-grams for each document, resulting in vocabularies of n-grams with average lengths of 4.6 and 4.5, respectively. Finally, for English, we randomly sample 10M sentences rather than use the entire corpus to reduce training time.
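Per-document n-gram scoring of this kind can be sketched as below. The sketch assumes a length-normalized PMI (the log joint probability minus the sum of log unigram probabilities, divided by the n-gram length); this may differ from the exact definition used in the paper, and the `max_len` default and function names are hypothetical.

```python
import math
from collections import Counter


def ngram_pmi(tokens, max_len=4):
    """Score every n-gram (2 <= n <= max_len) found in one tokenized document.

    Probabilities are estimated from this document only, not the whole corpus.
    """
    total = len(tokens)
    unigram = Counter(tokens)
    scores = {}
    for n in range(2, max_len + 1):
        n_positions = total - n + 1
        grams = Counter(tuple(tokens[i:i + n]) for i in range(n_positions))
        for gram, count in grams.items():
            log_joint = math.log(count / n_positions)
            log_indep = sum(math.log(unigram[t] / total) for t in gram)
            # Assumption: normalize by n so that different lengths are comparable.
            scores[gram] = (log_joint - log_indep) / n
    return scores


def top_ngrams(tokens, k, max_len=4):
    """Return the k highest-scoring n-grams of a document."""
    scores = ngram_pmi(tokens, max_len)
    return sorted(scores, key=scores.get, reverse=True)[:k]


doc = "new york is big and new york is old".split()
vocab = top_ngrams(doc, k=3, max_len=2)
```

On very short documents plain PMI favors rare n-grams, which is one reason a manual check of the extracted vocabulary, as described above, is useful.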

5.2 Pre-training

In BERT, sentence pairs are packed into a single sequence and the special [CLS] token is used for sentence-level prediction. However, in accordance with Joshi et al. (2020) [10], we find that single-sentence training is better than the original sentence-pair scenario. Thus, in our experiments, the input is a continuous sequence with a maximum length of 512.

Question: Barton’s inquiry was reasonable : Barton’s inquiry was not reasonable :: Changing the sign of numbers is an efficient algorithm : ?
Candidate (correct): changing the sign of numbers is an inefficient algorithm GloVe: 0.96 USE: 0.89 BURT-base: 0.97
Candidate: changing the sign of numbers is not an inefficient algorithm GloVe: 0.97 USE: 0.90 BURT-base: 0.96
Question: Members are aware of their political work : Members are not aware of their political work :: This ant is a known species : ?
Candidate (correct): This ant is an unknown species GloVe: 0.94 USE: 0.87 BURT-base: 0.96
Candidate: This ant is not an unknown species GloVe: 0.95 USE: 0.82 BURT-base: 0.95
TABLE VII: Questions and candidates from the sentence-level “positive-negative” analogy dataset and similarity scores for each candidate sentence computed by GloVe, USE and BURT-base. The correct sentences are marked “(correct)”.
Models Word Phrase Sentence Avg.
semantic syntactic Avg. semantic syntactic Avg. semantic syntactic Avg.
GloVe [11] 82.6 78.0 80.3   0.0 40.9 20.5   0.2 39.8 20.0 40.3
InferSent [15] 68.8 88.7 78.8   0.0 54.1 27.0   0.0 50.8 25.4 43.7
GenSen [17] 44.5 84.4 64.5   0.0 54.4 27.2   0.0 44.9 22.4 38.0
USE [3] 73.0 83.1 78.0   1.8 63.1 32.5   0.6 44.1 22.4 44.3
LASER [2] 26.9 78.2 52.6   0.0 63.3 31.7   1.6 55.4 28.5 37.6
ALBERT-base [18] 32.2 43.1 37.7   0.0 56.4 28.2   0.0 59.2 29.6 31.8
ALBERT-xxlarge 32.1 37.5 34.8   0.9 50.5 25.7   0.3 50.3 25.3 28.6
RoBERTa-base [19] 28.6 50.5 39.5   0.0 46.1 23.0   0.1 63.6 31.8 31.5
RoBERTa-large 34.2 55.9 45.0   0.2 50.6 25.4   0.9 50.9 25.9 32.1
XLNet-base [7] 23.2 49.1 36.1   1.9 65.6 33.8   0.8 63.5 32.2 34.0
XLNet-large 23.4 42.0 32.7   4.7 53.5 29.1   5.6 48.4 27.0 29.6
BERT-base [6] 51.3 60.2 55.8   0.3 69.3 34.8   0.1 68.3 34.2 41.6
MLM 62.9 61.1 62.0   2.7 59.8 31.3   0.2 61.8 31.0 41.4
Span 63.5 58.9 61.2   1.9 68.9 35.4   0.1 63.1 31.6 42.7
BURT-base 71.1 74.4 72.8   1.7 69.1 35.4   0.6 63.4 32.0 46.7
BERT-large 49.7 46.6 48.2   0.1 67.4 33.9   0.5 61.2 30.9 37.7
MLM 65.0 50.7 57.9   1.0 63.4 32.2   0.5 56.7 28.6 39.6
Span 66.5 54.1 60.3   2.7 64.2 33.5   0.7 58.4 29.6 40.5
BURT-large 84.7 74.0 79.4   4.9 58.6 31.8   1.0 52.4 26.7 46.0
SBERT-base [sbert] 71.2 73.7 72.4 41.8 63.6 52.7 23.2 58.7 40.9 55.3
SBURT-base 82.8 77.5 80.2 33.2 70.5 51.9 30.8 69.1 50.0 60.7
SBERT-large 72.5 74.2 73.3 57.8 55.0 56.4 18.4 52.4 35.4 55.0
SBURT-large 84.4 76.6 80.5 34.7 50.1 42.4 5.7 53.1 29.4 50.8
TABLE VIII: Performance of different models on universal analogy datasets. Mean-pooling is applied to Transformer-based models to obtain fixed-length embeddings. The last column shows the average accuracy of word, phrase and sentence analogy tasks.

Instead of training from scratch, we initialize both English and Chinese models with the officially released checkpoints (bert-base-uncased, bert-large-uncased, bert-base-chinese) and BERT-wwm-ext, which is trained from the Chinese BERT using whole word masking on extended data [8]. Base models comprise 12 Transformer layers, 12 heads, 768-dimensional hidden states and 110M parameters in total. The English BERT-large has 24 Transformer layers, 16 heads, 1024-dimensional hidden states and 340M parameters in total. We use the Adam optimizer [53] with an initial learning rate of 5e-5 and linear warmup over the first 10% of the training steps. The batch size is set to 16 and the dropout rate is 0.1. Each model is trained for one epoch.
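The learning-rate schedule described here can be sketched as a simple function of the training step. The peak rate (5e-5) and the 10% warmup fraction come from the text; the linear decay to zero after warmup is an assumption borrowed from the usual BERT recipe, since the text does not state what happens after warmup. The function name is hypothetical.

```python
def learning_rate(step, total_steps, peak_lr=5e-5, warmup_frac=0.1):
    """Learning rate at a given 0-based training step.

    Linear warmup to peak_lr over the first warmup_frac of training,
    then (assumed) linear decay to zero at total_steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps


total = 1000
schedule = [learning_rate(s, total) for s in range(total)]
```

With 1000 total steps, the rate rises over the first 100 steps, reaches its peak at the end of warmup, and then falls linearly.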

5.3 Fine-tuning

Following BERT, in the fine-tuning procedure, pairs of sentences are concatenated into a single sequence with a special token [SEP] in between. For both single-sentence and sentence-pair tasks, the hidden state of the first token [CLS] is used for softmax classification. We use the same sets of hyperparameters for all the evaluated models. All experiments on the GLUE benchmark are run with total training batch sizes between 8 and 64 and a learning rate of 3e-5 for 3 epochs. For tasks from the CLUE benchmark, we set batch sizes to 8 and 16, learning rates between 1e-5 and 3e-5, and train for 50 epochs on WSC and 2 to 5 epochs on the remaining tasks.

5.4 Downstream-task Models

On GLUE and CLUE, we compare our model with three variants: pre-trained models (Chinese BERT/BERT-wwm-ext, English BERT-base/BERT-large), models trained with the same number of additional steps as our model (MLM), and models trained using random span masking with the same number of additional steps as our model (Span). For the Span model, we simply replace our n-gram module with the masking strategy proposed by [10], where the sampling probability of span length $\ell$ is based on a geometric distribution $\ell \sim \mathrm{Geo}(p)$. We follow the parameter setting of [10] for $p$ and the maximum span length $\ell_{max}$.
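Sampling span lengths from a geometric distribution, as in the Span baseline, can be sketched as follows. The defaults p = 0.2 and a maximum length of 10 are the values published for SpanBERT [10]; that this paper uses exactly these values, and that the truncated distribution is renormalized rather than clipped, are assumptions. Function names are hypothetical.

```python
import random


def span_length_distribution(p=0.2, max_len=10):
    """Geometric distribution over lengths 1..max_len, renormalized after truncation."""
    weights = [p * (1 - p) ** (k - 1) for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_span_length(rng, p=0.2, max_len=10):
    """Draw one span length; shorter spans are more likely than longer ones."""
    probs = span_length_distribution(p, max_len)
    return rng.choices(range(1, max_len + 1), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed for reproducibility
lengths = [sample_span_length(rng) for _ in range(1000)]
```

Under these assumed defaults the expected span length is about 3.8 tokens, so random span masking mostly covers short spans, whereas the PMI-based approach selects spans by their statistical association.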

We also evaluate the aforementioned models on our universal analogy task. Baseline models include a bag-of-words (BoW) model built from pre-trained word embeddings (GloVe), sentence embedding models (InferSent, GenSen, USE and LASER), and pre-trained contextualized language models (BERT, ALBERT, RoBERTa and XLNet). To derive semantically meaningful embeddings, we fine-tune BERT and our model on the Stanford Natural Language Inference (SNLI) corpus [16] and the Multi-Genre NLI Corpus [multinli] using a Siamese structure following Reimers and Gurevych (2019) [sbert].

Query: 端午节的由来 (The Origin of the Dragon Boat Festival)
: 一个中学的高级教师陈老师生动地解读端午节的由来,诵读爱好者进行原创诗歌作品朗诵,深深打动了在场的观众… (Mr. Chen, senior teacher at a middle School, vividly introduced the origin of the Dragon Boat Festival and people are reciting original poems, which deeply moved the audience…)
: 今天是端午小长假第一天…当天上午,在车厢内满目挂有与端午节相关的民俗故事及有关诗词的文字… (Today is the first day of the Dragon Boat Festival holiday…There are folk stories and poems posted in the carriage…)
: …端午节又称端阳节、龙舟节、浴兰节,是中华民族的传统节日。端午节形成于先秦,发展于汉末魏晋,兴盛于唐… (…Dragon Boat Festival, also known as Duanyang Festival, Longzhou Festival and Yulan Festival is a traditional festival of the Chinese nation. It is formed in the Pre-Qin Dynasty, developed in the late Han and Wei-Jin, and prospered in the Tang…)
Comments: and is related to the topic but does not convey the meaning of the query.
Query: 狗的喂养知识 (Dog Feeding Tips)
: …创建一个“比特狗”账户,并支付99元领养一只“比特狗”。然后购买喂养套餐喂养“比特狗”,“比特狗”就能通过每天挖矿产生BTGS虚拟货币。 (…First create a “Bitdog” account and pay 99 yuan to adopt a “Bitdog”. Then buy a package to feed the “Bitdog”, which can generate virtual currency BTGS through daily mining.)
: 要养成定时定量喂食的好习惯,帮助狗狗更好的消化和吸收,同时也要选择些低盐健康的狗粮… (It is necessary to feed your dog regularly and quantitatively to help them digest and absorb better. Meanwhile, choose some low-salt and healthy food…)
: 泰迪犬容易褪色是受到基因和护理不当的影响,其次是饮食太咸…一定要注意正确护理,定期洗澡,要给泰迪低盐营养的优质狗粮… (Teddy bear dog’s hair is easy to fade because of its genes and improper care. It is also caused by salty diet… So we must take good care of them, such as taking a bath regularly, and preparing dog food with low salt…)
: 还可以一周自制一次狗粮给狗狗喂食,就是买些肉类,蔬菜,自己动手做。偶尔吃吃自制狗粮也能增加狗狗的营养,和丰富狗狗的口味。日常的话,建议选择些适口性强的狗粮,有助磨牙,防止口腔疾病。 (You can also make dog food once a week, such as meats and vegetables. Occasionally eating homemade dog food can also supplement nutrition and enrich the taste. In daily life, it is recommended to choose some palatable dog food to help their teeth grinding and prevent oral diseases.)
Comments: is not a relevant paragraph. is relevant to the topic but is inaccurate.
TABLE IX: Examples of the retrieved paragraphs and corresponding comments from the judges. “B”-BM25, “L”-LASER, “S”-Span, “U”-BURT.
Method Acc. MRR
TF-IDF 73.7 0.813
BM25 72.1 0.802
LASER 79.9 0.856
BERT 76.8 0.831
MLM 78.3 0.843
Span 78.6 0.846
BURT 82.2 0.872
BERT-wwm-ext 76.7 0.834
MLM 76.7 0.834
Span 79.3 0.856
BURT-wwm-ext 80.7 0.863
TABLE X: Comparison of model performance on the FAQ dataset.
R Judge1 60.3 63.9 65.9 65.0 69.3 71.8
R Judge2 61.8 61.6 67.3 67.5 71.5 71.0
R Avg. 61.1 62.8 66.6 66.3 70.4 71.4
CM Judge1 43.5 42.5 48.5 46.1 51.6 54.2
CM Judge2 41.2 38.4 47.8 45.5 53.9 56.5
CM Avg. 42.4 40.5 48.2 45.8 52.8 55.4
TABLE XI: Results on NLG according to human judgment. “R” and “CM” represent the percentage of paragraphs that are “relevant” and “convey the meaning”, respectively.

For FAQ and NLG, we compare our models with statistical methods such as TF-IDF and BM25, the sentence representation model LASER [2], the pre-trained BERT/BERT-wwm-ext and models trained with additional steps (MLM, Span). We observe that models further trained on the Chinese SNLI and MNLI datasets underperform BERT on the FAQ dataset. Therefore, we only consider pre-trained models for these two tasks.

6 Experiments

6.1 General Language Understanding

Table VI and the corresponding CLUE results table show the results on the GLUE and CLUE benchmarks, where we find that training BERT with additional MLM steps hardly brings any improvement except on the WSC task. In Chinese, the Span model is effective on WSC but is comparable to BERT on other tasks. BERT-wwm-ext is better than our model on classification tasks involving pairs of short sentences such as AFQMC and OCNLI, which may be due to its relatively strong capability of modeling short sequences. Overall, both BURT and BURT-wwm-ext outperform the baseline models on 4 out of 6 tasks with considerable improvement, which sheds light on their effectiveness in modeling sequences of different lengths. The most significant improvement is observed on WSC (3.5% over the updated BERT-wwm-ext and 0.7% over the Span model), where the model is trained to determine whether two given spans refer to the same entity in the text. We conjecture that the model benefits from learning to predict meaningful spans in the pre-training stage, so it is better at capturing the meanings of spans in the text. In English, our approach also improves the performance of BERT on various tasks from the GLUE benchmark, indicating that our proposed PMI-based masking method is general and independent of the language setting.

6.2 Universal Analogy

Results on analogy tasks are reported in Table VIII. Generally, semantic analogies are more challenging than the syntactic ones and higher-level relationships between sequences are more difficult to capture, which is observed in almost all the evaluated models. On word analogy tasks, all well pre-trained language models like BERT, ALBERT, RoBERTa and XLNet hardly exhibit arithmetic characteristics and increasing the model size usually leads to a decrease in accuracy. However, our method of pre-training using n-grams extracted by the PMI algorithm significantly improves the performance on word analogies compared with BERT, obtaining 72.8% (BURT-base) and 79.4% (BURT-large) accuracy, respectively. Further training BURT-large on SNLI and MNLI results in the highest accuracy (80.5%).

Despite the leading performance of GloVe, InferSent and USE on word-level analogy datasets, they do not generalize well to higher-level analogy tasks. We conjecture that their poor performance is caused by synonyms and paraphrases in sentences, which lead the models to assign lower similarity scores to the correct answers. In contrast, Transformer-based models are better at representing higher-level sequences and are good at identifying paraphrases and capturing relationships between sentences even when they have less lexical overlap. Moreover, fine-tuning pre-trained models achieves considerable improvements on high-level semantic analogies. Overall, SBURT-base achieves the highest average accuracy (60.7%).

Examples from the Negation subset are shown in Table VII. Notice that the word “not” does not explicitly appear in the correct answers. Instead, “inefficient” and “unknown” are indicators of negation. As expected, BoW gives a higher similarity score to the sentence that contains both “not” and “inefficient”, because the word-level information is simply added and subtracted regardless of the context. By contrast, contextualized models like BURT capture the meanings and relationships of words within the sequence in a comprehensive way, indicating that they have indeed learned universal representations across different linguistic units.
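The candidate scoring behind such analogy questions can be sketched as follows. The sketch assumes that each sequence is embedded by mean-pooling its token vectors (as stated for Transformer-based models in Table VIII) and that, for a question A : B :: C : ?, the candidate closest in cosine similarity to B - A + C is chosen; the exact scoring rule used in the paper may differ, and the toy vectors and names are made up.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into one fixed-length sequence embedding."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, candidates):
    """Answer A : B :: C : ? by picking the candidate nearest to B - A + C.

    candidates: dict mapping a label to its embedding.
    """
    target = [bi - ai + ci for ai, bi, ci in zip(a, b, c)]
    return max(candidates, key=lambda name: cosine(candidates[name], target))


# Toy 2-d embeddings (hypothetical values, for illustration only).
man, woman, king = [1.0, 0.0], [1.0, 1.0], [3.0, 0.0]
candidates = {"queen": [3.0, 1.1], "prince": [3.0, -1.0]}
answer = solve_analogy(man, woman, king, candidates)
```

The same procedure applies unchanged to phrases and sentences once they are embedded in the same space, which is what makes the analogy task a task-independent probe of the vector space.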

p_man: employed by the man p_woman: hired by the woman
p_king: employed by the king p_queen: hired by the queen
p_dad: employed by his dad p_mom: hired by his mom
s_man: He was employed by the man when he was 28.
s_woman: He was hired by the woman at age 28.
s_king: He was employed by the king when he was 28.
s_queen: He was hired by the queen at age 28.
s_dad: He was employed by his dad when he was 28.
s_mom: He was hired by his mom at age 28.
TABLE XII: Annotation of phrases and sentences in Figure 4.

Fig. 4: Two-dimensional PCA projection of the vectors representing “male” and “female” generated by BURT. Pairs are connected by dashed lines. Points in the figure are explained in detail in Table XII.
Fig. 5: t-SNE projection of patterns.

6.3 Retrieval-based FAQ

Results are reported in Table X. As we can see, LASER and all pre-trained language models significantly outperform TF-IDF and BM25, indicating the superiority of embedding-based models over statistical methods. In addition, continued training of BERT is often beneficial. Among all the evaluated models, our BURT yields the highest accuracy (82.2%) and MRR (0.872). BURT-wwm-ext achieves a slightly lower accuracy (80.7%) than BURT, but it still exceeds its baselines by 4.0% (MLM) and 1.4% (Span), respectively.
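For reference, the two metrics in Table X can be computed as in the sketch below. It assumes that “Acc.” is top-1 accuracy and that MRR is the mean reciprocal rank of the single correct answer per query, with a reciprocal rank of 0 when the answer is not retrieved; the function name and toy data are hypothetical.

```python
def accuracy_and_mrr(ranked_lists, gold):
    """Compute top-1 accuracy and mean reciprocal rank.

    ranked_lists: one ranked list of candidate ids per query, best first.
    gold: the correct candidate id for each query.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranking, answer in zip(ranked_lists, gold):
        if ranking and ranking[0] == answer:
            hits += 1
        if answer in ranking:
            reciprocal_rank_sum += 1.0 / (ranking.index(answer) + 1)
    n = len(gold)
    return hits / n, reciprocal_rank_sum / n


rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
acc, mrr = accuracy_and_mrr(rankings, gold)
```

MRR rewards a model for placing the correct answer near the top even when it is not ranked first, which is why it is always at least as high as top-1 accuracy.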

6.4 Natural Language Generation

Results are summarized in Table XI. Although nearly 62% of the paragraphs retrieved by BM25 are relevant to the topic, only two-thirds of them actually convey the original meaning of the template. Despite LASER’s comparable performance to BURT on FAQ, it is less effective when linguistic units of different granularity are involved at the same time. Re-ranking using BURT substantially improves the quality of the generated paragraphs. We show examples retrieved by BM25, LASER, the Span model and BURT in Table IX, denoted by B, L, S and U, respectively. BM25 tends to favor paragraphs that contain the keywords even when the paragraph conveys a different meaning, while BURT selects accurate answers according to the semantic meanings of queries and documents.

7 Visualization

7.1 Single Pattern

Mikolov et al. (2013) [20] use PCA to project word embeddings into a two-dimensional space to visualize a single pattern captured by the Word2Vec model, while in this work we consider embeddings for different granular linguistic units. All pairs in Figure 4 belong to the “male-female” category and subtracting the two vectors results in roughly the same direction.

7.2 Clustering

Given that embeddings of sequences with the same kind of relationship will exhibit the same pattern in the vector space, we obtain the difference between pairs of embeddings for words, phrases and sentences from different categories and visualize them by t-SNE. Figure 5 shows that by subtracting two vectors, pairs that belong to the same category automatically fall into the same cluster. Only the pairs from “capital-country” and “city-state” cannot be totally distinguished, which is reasonable because they all describe the relationship between geographical entities.

Fig. 6 point labels: “Can 80-year-old people get accident insurance?”; “Can seniors buy accident insurance?”; “Will managers recommend products to clients for their own benefit?”; “Can life insurance last until the age of 80?”; “Is there any insurance that you recommend?”; “Which insurance is suitable for me?”; “Can I get insurance for my boyfriend?”; “Can I get insurance after an accident?”
Fig. 6: t-SNE projection of BURT embeddings. Blue dots: queries, Red dots: sentences retrieved by BURT, Grey dots: sentences retrieved by TF-IDF and BM25.

7.3 FAQ

We show examples in Figure 6 where BURT successfully retrieves the correct answer while TF-IDF and BM25 fail. Both sentences “Can 80-year-old people get accident insurance?” and “Can life insurance last until the age of 80?” contain the word “80”, which is a possible reason why TF-IDF considers them a close match, ignoring that the two sentences actually describe two different issues. In contrast, using vector-based representations, BURT treats “seniors” as a paraphrase of “80-year-old people”. As depicted in Figure 6, queries are close to the correct responses and far from other sentences.

8 Conclusion

This paper formally introduces the task of universal representation learning and presents a pre-trained language model for this purpose. The model maps linguistic units of different granularity into the same vector space, where similar sequences have similar representations, and enables unified vector operations across different language hierarchies.

In detail, we focus on a less studied aspect of language representation, seeking to learn a uniform vector form across different linguistic unit hierarchies. Unlike approaches that learn only word-level or only sentence-level representations, our method extends BERT’s masking and training objective to a more general level, which leverages information from sequences of different lengths in a comprehensive way and effectively learns a universal representation for words, phrases and sentences.

Overall, our proposed BURT outperforms its baselines on a wide range of downstream tasks involving sequences of different lengths in both English and Chinese. In particular, we provide a universal analogy task, an insurance FAQ dataset and an NLG dataset for extensive evaluation, where our well-trained universal representation model shows promise for accurate vector arithmetic over words, phrases and sentences and for real-world retrieval applications.

References
  • [1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013.
  • [2] M. Artetxe and H. Schwenk, “Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond,” Trans. Assoc. Comput. Linguistics, vol. 7, pp. 597–610, 2019. [Online]. Available: https://transacl.org/ojs/index.php/tacl/article/view/1742
  • [3] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil, “Universal sentence encoder for English,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.   Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 169–174.
  • [4] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of NAACL-HLT, 2018, pp. 2227–2237.
  • [5] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018.
  • [6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018.
  • [7] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
  • [8] Y. Cui, W. Che, T. Liu, B. Qin, Z. Yang, S. Wang, and G. Hu, “Pre-training with whole word masking for chinese bert,” 2019.
  • [9] W. Wang, B. Bi, M. Yan, C. Wu, J. Xia, Z. Bao, L. Peng, and L. Si, “Structbert: Incorporating language structures into pre-training for deep language understanding,” in International Conference on Learning Representations, 2020.
  • [10] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, “Spanbert: Improving pre-training by representing and predicting spans,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020.
  • [11] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).   Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543.
  • [12] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” 2016.
  • [13] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, 2015, pp. 3294–3302.
  • [14] L. Logeswaran and H. Lee, “An efficient framework for learning sentence representations,” in International Conference on Learning Representations (ICLR), 2018.
  • [15] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.   Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 670–680.
  • [16] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.   Lisbon, Portugal: Association for Computational Linguistics, Sep. 2015, pp. 632–642.
  • [17] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal, “Learning general purpose distributed sentence representations via large scale multi-task learning,” in International Conference on Learning Representations, 2018.
  • [18] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” 2019.
  • [19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” 2019.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” 2013.
  • [21] O. Levy and Y. Goldberg, “Linguistic regularities in sparse and explicit word representations,” in CoNLL 2014, 2014.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [23] X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep neural networks for natural language understanding,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.   Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4487–4496.
  • [24] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg, “Fine-grained analysis of sentence embeddings using auxiliary prediction tasks,” CoRR, 2016.
  • [25] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni, “What you can cram into a single vector: Probing sentence embeddings for linguistic properties,” CoRR, 2018.
  • [26] G. Bacon and T. Regier, “Probing sentence embeddings for structure-dependent tense,” in Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, 2018.
  • [27] A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in bertology: What we know about how bert works,” 2020.
  • [28] X. Ma, Z. Wang, P. Ng, R. Nallapati, and B. Xiang, “Universal text representation from bert: An empirical study,” 2019.
  • [29] G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?” in ACL 2019, 2019.
  • [30] P. Barancíková and O. Bojar, “In search for linear relations in sentence embedding spaces,” in ITAT 2019, ser. CEUR Workshop Proceedings, 2019.
  • [31] X. Zhu and G. de Melo, “Sentence analogies: Exploring linguistic relationships and regularities in sentence embeddings,” CoRR, 2020.
  • [32] K. Hammond, R. Burke, C. Martin, and S. Lytinen, “Faq finder: a case-based approach to knowledge navigation,” in Proceedings of the 11th Conference on Artificial Intelligence for Applications, 1995.
  • [33] V. Jijkoun and M. de Rijke, “Retrieving answers from frequently asked questions pages on the web,” in CIKM 2005, 2005.
  • [34] E. Sneiders, “Automated faq answering with question-specific knowledge representation for web self-service,” 2009.
  • [35] S. Damani, K. N. Narahari, A. Chatterjee, M. Gupta, and P. Agrawal, “Optimized transformer models for FAQ answering,” in PAKDD 2020, ser. Lecture Notes in Computer Science, 2020.
  • [36] W. Sakata, T. Shibata, R. Tanaka, and S. Kurohashi, “FAQ retrieval using query-question similarity and bert-based query-answer relevance,” in SIGIR 2019, 2019.
  • [37] A. Stolcke, “Srilm - an extensible language modeling toolkit,” in INTERSPEECH, 2002.
  • [38] K. W. Church and P. Hanks, “Word association norms, mutual information, and lexicography,” Computational linguistics, vol. 16, no. 1, pp. 22–29, 1990.
  • [39] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Nov. 2018, pp. 353–355.
  • [40] A. Warstadt, A. Singh, and S. R. Bowman, “Neural network acceptability judgments,” 2019.
  • [41] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Oct. 2013, pp. 1631–1642.
  • [42] N. Nangia, A. Williams, A. Lazaridou, and S. Bowman, “The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations,” in Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, Sep. 2017, pp. 1–10.
  • [43] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Nov. 2016, pp. 2383–2392.
  • [44] W. B. Dolan and C. Brockett, “Automatically constructing a corpus of sentential paraphrases,” in Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
  • [45] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,” in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Aug. 2017, pp. 1–14.
  • [46] L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan, “CLUE: A Chinese language understanding evaluation benchmark,” in Proceedings of the 28th International Conference on Computational Linguistics, Dec. 2020, pp. 4762–4772.
  • [47] LTD. IFLYTEK CO, “Iflytek: a multiple categories chinese text classifier,” 2019.
  • [48] H. Hu, K. Richardson, L. Xu, L. Li, S. Kuebler, and L. S. Moss, “Ocnli: Original chinese natural language inference,” 2020.
  • [49] Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu, “A span-extraction dataset for Chinese machine reading comprehension,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Nov. 2019, pp. 5883–5889.
  • [50] C. Zheng, M. Huang, and A. Sun, “ChID: A large-scale Chinese IDiom dataset for cloze test,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, Jul. 2019, pp. 778–787.
  • [51] K. Sun, D. Yu, D. Yu, and C. Cardie, “Probing prior knowledge needed in challenging chinese machine reading comprehension,” ArXiv, vol. abs/1904.09679, 2019.
  • [52] B. Matthews, “Comparison of the predicted and observed secondary structure of t4 phage lysozyme,” Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442 – 451, 1975.
  • [53] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017.