Representations learned by deep neural models have attracted much attention in Natural Language Processing (NLP). However, previous language representation learning methods such as Word2Vec, LASER and USE focus on either words or sentences. Later pre-trained contextualized language representations such as ELMo, GPT, BERT and XLNet can seemingly handle inputs of different lengths, but they still produce sentence-level contextualized representations for each word, leading to unsatisfactory performance in real-world situations. Although the more recent BERT-wwm-ext, StructBERT and SpanBERT perform MLM at a higher linguistic level, the masked segments (whole words, trigrams, spans) either follow a pre-defined distribution or focus on a single granularity. Besides, their random sampling strategies ignore important semantic and syntactic information in a sequence, resulting in a large number of meaningless segments.
However, a universal representation across different levels of linguistic units would offer great convenience whenever free text must be handled in a unified way throughout the language hierarchy. As is well known, embedding representations of a given linguistic unit (e.g., the word) enable linguistically meaningful arithmetic among vectors, also known as word analogy. For example, vector("King") - vector("Man") + vector("Woman") results in vector("Queen"). A universal representation may generalize such analogy features and meaningful arithmetic operations to free text with all language levels involved together, e.g., Eat an onion : Vegetable :: Eat a pear : Fruit. In fact, manipulating embeddings in the vector space reveals syntactic and semantic relations between the original sequences, and this feature is indeed useful in real applications. For example, "London is the capital of England." can be formulated as vector("London") ≈ vector("England") + vector("capital"). Then, given two documents, one of which contains "England" and "capital" while the other contains "London", we consider these two documents relevant.
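The vector-offset property behind word analogy can be sketched with toy vectors. The embeddings below are hand-made stand-ins for illustration only, not outputs of any trained model:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings chosen so that the gender offset is (roughly) shared:
# vec(queen) ≈ vec(king) - vec(man) + vec(woman)
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.4]),
}

query = emb["king"] - emb["man"] + emb["woman"]
# Rank all words (excluding the three query words) by cosine similarity.
candidates = {w: cosine(query, v) for w, v in emb.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=candidates.get)
print(best)  # "queen" is the nearest remaining vector
```

The same nearest-neighbor search, applied to phrase or sentence vectors instead of word vectors, is what the universal analogy task below evaluates.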
In this paper, we explore the regularities of representations of words, phrases and sentences in the same vector space. To this end, we introduce a universal analogy task derived from Google's word analogy dataset. To solve such a task, we present BURT, a pre-trained model that aims to learn universal representations for sequences of various lengths. Our model follows the architecture of BERT but differs in its masking and training scheme. Specifically, we propose to efficiently extract and prune meaningful segments (n-grams) from an unlabeled corpus with little human supervision, and then use them to modify the masking scheme and training objective of BERT. The n-gram pruning algorithm is based on point-wise mutual information (PMI) and automatically captures different levels of language information, which is critical to improving the model's ability to handle multiple levels of linguistic objects in a unified way, i.e., embedding sequences of different lengths in the same vector space.
Overall, our pre-trained models improve the performance of baselines in both English and Chinese. In English, BURT-base achieves a 0.7% average gain over Google BERT-base. In Chinese, BURT-wwm-ext obtains 74.5% on the WSC test set, a 13.4-point absolute improvement over BERT-wwm-ext, and exceeds the baselines by 0.2%-0.6% accuracy on five other CLUE tasks including TNEWS, IFLYTEK, CSL, ChID and CMRC 2018. Extensive experimental results on our universal analogy task demonstrate that BURT is able to map sequences of variable lengths into a shared vector space where similar sequences are close to each other. Meanwhile, addition and subtraction of embeddings reflect semantic and syntactic connections between sequences. Moreover, BURT can easily be applied to real-world applications such as Frequently Asked Questions (FAQ) retrieval and Natural Language Generation (NLG) tasks, where it encodes words, sentences and paragraphs into the same embedding space and directly retrieves sequences semantically similar to a given query based on cosine similarity. All of these results demonstrate that our well-trained model yields a universal representation that can adapt to various tasks and applications.
2.1 Word and Sentence Embeddings
Representing words as real-valued dense vectors is a core technique of deep learning in NLP. Word embedding models [1, 11, 12] map words into a vector space where similar words have similar latent representations. ELMo learns context-dependent word representations through a two-layer bi-directional LSTM network. In recent years, more and more researchers have focused on learning sentence representations. The Skip-Thought model is designed to predict the surrounding sentences of a given sentence. Logeswaran and Lee (2018) improve the model structure by replacing the RNN decoder with a classifier. InferSent is trained on the Stanford Natural Language Inference (SNLI) dataset in a supervised manner. Subramanian et al. (2018) and Cer et al. (2018) employ multi-task training and report considerable improvements on downstream tasks. LASER is a BiLSTM encoder designed to learn multilingual sentence embeddings. Most recently, contextualized representations with a language model training objective, such as OpenAI GPT, BERT and XLNet, are expected to capture complex features (syntax and semantics) for sequences of any length. In particular, BERT improves the pre-training and fine-tuning scenario, obtaining new state-of-the-art results on multiple sentence-level tasks. On the basis of BERT, further fine-tuning with a Siamese network on NLI data can effectively produce high-quality sentence embeddings [sbert]. Nevertheless, most previous work concentrates on a specific granularity. In this work, we extend the training objective to a unified level and enable the model to leverage different granular information, including, but not limited to, words, phrases and sentences.
2.2 Pre-Training Tasks
BERT is trained on a large amount of unlabeled data with two training objectives: Masked Language Model (MLM) for modeling deep bidirectional representations, and Next Sentence Prediction (NSP) for understanding the relationship between two sentences. ALBERT is trained with Sentence-Order Prediction (SOP) as a substitute for NSP. StructBERT has a sentence structural objective that combines the random sampling strategy of NSP with continuous sampling as in SOP. In contrast, RoBERTa and SpanBERT pre-train on single contiguous sequences of 512 tokens and show that removing the NSP objective improves performance. Besides, BERT-wwm, StructBERT and SpanBERT perform MLM at higher linguistic levels, augmenting the MLM objective by masking whole words, trigrams or spans, respectively. In this work, we concentrate on enhancing the masking and training procedures from a broader and more general perspective.
2.3 Analysis and Applications
Previous explorations of vector regularities mainly study word embeddings [1, 20]. After the introduction of sentence encoders and Transformer models, more work investigated sentence-level embeddings. Usually, performance on downstream tasks is taken as the measure of a model's ability to represent sentences [15, 3, 23]. Some research proposes probing tasks to understand certain aspects of sentence embeddings [24, 25, 26]. Specifically, Rogers et al. (2020) and Ma et al. (2019) look into BERT embeddings and reveal their internal working mechanisms. Some work also explores regularities in sentence embeddings [30, 31]. Nevertheless, little work analyzes words, phrases and sentences in the same vector space. In this paper, we study embeddings for sequences of various lengths obtained by different models in a task-independent manner.
Transformer-based representation models have made great progress in measuring query-Question or query-Answer similarities. Damani et al. (2020) analyze Transformer models and propose a neural architecture for the FAQ task. Sakata et al. (2019) present an FAQ retrieval system that combines the characteristics of BERT with rule-based methods. In this work, we also evaluate the well-trained universal representation models on the FAQ task.
Our BURT follows the Transformer encoder architecture, where the input sequence is first split into subword tokens and a contextualized representation is learned for each token. We only perform MLM training on single sequences, as suggested in previous work. The basic idea is to mask some of the tokens in the input and force the model to recover them from context. Here we propose a unified masking method and training objective that considers linguistic units of different granularities.
Specifically, we apply a pruning mechanism to collect meaningful n-grams from the corpus and then perform n-gram masking and prediction. Our model differs from the original BERT and other BERT-like models in several ways. First, instead of BERT's token-level MLM, we incorporate different levels of linguistic units into the training objective in a comprehensive manner. Second, unlike SpanBERT and StructBERT, which sample random spans or trigrams, our n-gram sampling approach automatically discovers structures within any sequence and is not limited to a single granularity.
3.1 N-gram Pruning
In this subsection, we introduce our approach of extracting a large number of meaningful n-grams from the monolingual corpus, which is a critical step of data processing.
First, we scan the corpus and extract all n-grams up to a maximum length using the SRILM toolkit (http://www.speech.sri.com/projects/srilm/download.html). In order to filter out meaningless n-grams and prevent the vocabulary from growing too large, we apply pruning by means of point-wise mutual information (PMI). To be specific, mutual information describes the association between two tokens $x$ and $y$ by comparing the probability of observing $x$ and $y$ together with the probabilities of observing $x$ and $y$ independently:

$$\mathrm{PMI}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}$$

Higher mutual information indicates stronger association between the two tokens.
In practice, $p(x)$ and $p(y)$ denote the marginal probabilities of $x$ and $y$, respectively, and $p(x, y)$ represents the joint probability of observing $x$ followed by $y$. This alleviates the bias towards high-frequency words and allows tokens that are rarely used individually but often appear together, such as "San Francisco", to obtain higher scores. In our application, an n-gram, denoted as $g = (w_1, \dots, w_n)$ where $n$ is the number of tokens in $g$, may contain more than two words. Therefore, we present an extended PMI formula:

$$\mathrm{PMI}(g) = \frac{1}{n}\left(\log p(g) - \sum_{i=1}^{n}\log p(w_i)\right)$$

where the probabilities are estimated by counting the number of observations of each token and n-gram in the corpus and normalizing by the size of the corpus. The factor $\frac{1}{n}$ is an additional normalization that avoids extremely low scores for longer n-grams. Finally, n-grams with PMI scores below a chosen threshold are filtered out, resulting in a vocabulary of meaningful n-grams.
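The pruning step can be sketched as follows. This is a minimal illustration of the length-normalized PMI score over a single token list; the toy corpus, the maximum n-gram length, and the threshold are arbitrary choices for demonstration, not values from the paper:

```python
import math
from collections import Counter

def pmi_score(ngram, unigram_counts, ngram_counts, total):
    """Length-normalized PMI: (1/n) * [log p(g) - sum_i log p(w_i)].
    The 1/n factor keeps longer n-grams from getting extremely low scores."""
    n = len(ngram)
    log_joint = math.log(ngram_counts[ngram] / total)
    log_indep = sum(math.log(unigram_counts[w] / total) for w in ngram)
    return (log_joint - log_indep) / n

def extract_ngrams(tokens, max_n, threshold):
    """Collect all n-grams up to max_n and keep those scoring above threshold."""
    unigram_counts = Counter(tokens)
    ngram_counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram_counts[tuple(tokens[i:i + n])] += 1
    total = len(tokens)
    return {g for g in ngram_counts
            if pmi_score(g, unigram_counts, ngram_counts, total) > threshold}

tokens = "he moved to san francisco and she moved to san diego".split()
kept = extract_ngrams(tokens, max_n=3, threshold=0.5)
# Collocations like ("san", "francisco") survive the threshold because their
# joint probability is high relative to the product of their unigram probabilities.
```

In the real pipeline, counting would be done per document over a large corpus (as described in Section 5.1) rather than over one short token list.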
3.2 N-gram Masking
For a given input $x = (w_1, \dots, w_T)$, where $T$ is the number of tokens in $x$, special tokens [CLS] and [SEP] are added at the beginning and end of the sequence, respectively. Before feeding the training data into the Transformer blocks, we identify all n-grams in the sequence using the aforementioned n-gram vocabulary. An example is shown in Figure 2, where overlaps between n-grams indicate the multi-granular inner structure of the given sequence. In order to make better use of higher-level linguistic information, the longest n-gram is retained when multiple matches exist. Compared with other masking strategies, our method has two advantages. First, n-gram extraction and matching can be done efficiently in an unsupervised manner without introducing random noise. Second, by utilizing n-grams of different lengths, we generalize the masking and training objective of BERT to a unified level where linguistic units of different granularities are integrated.
Following BERT, we mask 15% of all tokens in each sequence. The data processing algorithm uniformly samples one n-gram at a time until the maximum number of masked tokens is reached. 80% of the time, we replace the entire n-gram with [MASK] tokens; 10% of the time, it is replaced with random tokens; and 10% of the time, it is kept unchanged. The original token-level masking is retained and treated as a special case of n-gram masking where $n = 1$. We employ dynamic masking as in Liu et al. (2019), which means masking patterns for the same sequence in different epochs are probably different.
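A minimal sketch of this masking procedure is given below. It assumes a greedy longest-match segmentation against the n-gram vocabulary and rolls the 80/10/10 decision once per sampled n-gram, so that a whole n-gram is corrupted consistently; the vocabulary and token sequence are made up for illustration:

```python
import random

MASK = "[MASK]"

def ngram_mask(tokens, ngram_vocab, mask_budget=0.15, seed=0):
    """Greedy longest-match n-gram identification followed by BERT-style
    80/10/10 corruption applied to whole n-grams instead of single tokens."""
    rng = random.Random(seed)
    max_n = max((len(g) for g in ngram_vocab), default=1)
    # Longest-match segmentation: prefer the longest n-gram at each position;
    # single tokens are the n = 1 special case and always match.
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            if n == 1 or tuple(tokens[i:i + n]) in ngram_vocab:
                spans.append((i, i + n))
                i += n
                break
    # Uniformly sample whole spans until ~15% of tokens are chosen.
    budget = max(1, int(mask_budget * len(tokens)))
    masked, out = 0, list(tokens)
    for start, end in rng.sample(spans, len(spans)):
        if masked >= budget:
            break
        roll = rng.random()  # one roll per n-gram, not per token
        for j in range(start, end):
            if roll < 0.8:
                out[j] = MASK                 # 80%: replace with [MASK]
            elif roll < 0.9:
                out[j] = rng.choice(tokens)   # 10%: random token
            # remaining 10%: keep unchanged
        masked += end - start
    return out
```

Re-running this with a different seed per epoch reproduces the dynamic-masking behavior mentioned above.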
3.3 Training Objective
As depicted in Figure 1, the Transformer encoder generates a fixed-length contextualized representation at each input position and the model only predicts the masked tokens. Ideally, a universal representation model is able to capture features for multiple levels of linguistic units. Therefore, we extend the MLM training objective to a more general situation, where the model is trained to predict n-grams rather than subwords.
$$\mathcal{L}(\theta) = -\sum_{g \in M}\log p\left(g \mid \tilde{x}\right) = -\sum_{g \in M}\sum_{t=s_g}^{e_g}\log p\left(w_t \mid \tilde{x}\right)$$

where $g$ is a masked n-gram from the set of masked n-grams $M$, and $\tilde{x}$ is a corrupted version of the input sequence. $s_g$ and $e_g$ represent the absolute start and end positions of $g$.
4 Task Setup
To evaluate the model's ability to handle different linguistic units, we apply our model to downstream tasks from the GLUE and CLUE benchmarks. Moreover, we construct a universal analogy task based on Google's word analogy dataset to explore the regularities of universal representations. Finally, we present an insurance FAQ task and a retrieval-based language generation task, where the key is to embed sequences of different lengths in the same vector space and retrieve sequences similar in meaning to a given query.
4.1 General Language Understanding
Statistics of the GLUE and CLUE benchmarks are listed in Table I. Besides the diversity of task types, we also find that different datasets concentrate on sequences of different lengths, which satisfies our need to examine the model's ability to represent linguistic units of multiple granularities.
The General Language Understanding Evaluation (GLUE) benchmark  is a collection of tasks that is widely used to evaluate the performance of English language models. We divide eight NLU tasks from the GLUE benchmark into three main categories.
Single-Sentence Classification The Corpus of Linguistic Acceptability (CoLA) is to determine whether a sentence is grammatically acceptable or not. The Stanford Sentiment Treebank (SST-2) is a sentiment classification task that requires the model to predict whether the sentiment of a sentence is positive or negative. In both datasets, each example is a sequence of words annotated with a label.
Natural Language Inference Multi-Genre Natural Language Inference (MNLI), Question Natural Language Inference (QNLI) and Recognizing Textual Entailment (RTE) [rte] are natural language inference tasks, where a pair of sentences is given and the model is trained to identify the relationship between the two sentences as entailment, contradiction, or neutral.
Semantic Similarity Semantic similarity tasks identify whether two sentences are equivalent or measure the degree of their semantic similarity according to their representations. The Microsoft Research Paraphrase Corpus (MRPC) and the Quora Question Pairs (QQP) dataset are paraphrase datasets, where each example consists of two sentences and a label of "1" indicating that they are paraphrases or "0" otherwise. The goal of the Semantic Textual Similarity benchmark (STS-B) is to predict a continuous score from 0 to 5 for each pair as the similarity of the two sentences.
The Chinese General Language Understanding Evaluation (ChineseGLUE or CLUE) benchmark is a Chinese counterpart of the GLUE benchmark for language understanding. The nine tasks from the CLUE benchmark can likewise be classified into three groups.
Single Sentence Tasks We utilize three single-sentence classification tasks including TouTiao Text Classification for News Titles (TNEWS), IFLYTEK  and the Chinese Winograd Schema Challenge (WSC) dataset. Examples from TNEWS and IFLYTEK are short and long sequences, respectively, and the goal is to predict the category that the given single sequence belongs to. WSC is a coreference resolution task where the model is required to decide whether two spans refer to the same entity in the original sequence.
Sentence Pair Tasks The Ant Financial Question Matching Corpus (AFQMC), Chinese Scientific Literature (CSL) dataset and Original Chinese Natural Language Inference (OCNLI)  are three pairwise textual classification tasks. AFQMC contains sentence pairs and binary labels, and the model is asked to examine whether two sentences are semantically similar. Each example in CSL involves a text and several keywords. The model needs to determine whether these keywords are true labels of the text. OCNLI is a natural language inference task following the same collection procedures of MNLI.
Machine Reading Comprehension Tasks CMRC 2018 , ChID , and C3  are span-extraction based, cloze style and free-form multiple-choice machine reading comprehension datasets, respectively. Answers to the questions in CMRC 2018 are spans extracted from the given passages. ChID is a collection of passages with blanks and corresponding candidates for the model to decide the most suitable option. C3 is similar to RACE and DREAM, where the model has to choose the correct answer from several candidate options based on a text and a question.
| A : B :: C | Candidates |
| boy : girl :: brother | daughter, sister, wife, father, son |
| bad : worse :: big | bigger, larger, smaller, biggest, better |
| Beijing : China :: Paris | France, Europe, Germany, Belgium, London |
| Chile : Chilean :: China | Japanese, Chinese, Russian, Korean, Ukrainian |
4.2 Universal Analogy
As a new task, universal representation must be evaluated on a multi-granular analogy dataset. The purpose of proposing a task-independent dataset is to avoid judging the quality of the learned vectors, and interpreting the model, through the lens of one specific problem or situation. Since embeddings are essentially dense vectors, it is natural to apply mathematical operations to them. In this subsection, we introduce the procedure for constructing analogy datasets at different linguistic levels based on Google's word analogy dataset.
4.2.1 Word-level analogy
Recall that a word analogy task involves two pairs of words that share the same type of relationship, denoted as $a : b :: c : d$. The goal is to solve questions like "$a$ is to $b$ as $c$ is to ?", which is to retrieve the last word $d$ from the vocabulary given the first three words. The objective can be formulated as maximizing the cosine similarity between the target word embedding and the linear combination of the given vectors:

$$d^{*} = \underset{d'}{\arg\max}\ \cos(d',\, b - a + c)$$

where $a$, $b$, $c$ and $d'$ represent embeddings of the corresponding words, all normalized to unit length.
To facilitate comparison between models with different vocabularies, we construct a closed-vocabulary analogy task based on Google's word analogy dataset through negative sampling. Concretely, for each question, we use GloVe to rank every word in the vocabulary and take the top 5 results as candidate words. If GloVe fails to retrieve the correct answer, we add it manually so that it is always included among the candidates. During evaluation, the model is expected to select the correct answer from the 5 candidates. Examples are listed in Table II.
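With a closed candidate set, evaluation reduces to ranking the candidates by cosine similarity against $b - a + c$. The sketch below uses hand-made stand-in embeddings (not vectors from GloVe or any trained model) to show the mechanics:

```python
import numpy as np

def solve_analogy(emb, a, b, c, candidates):
    """Pick the candidate d' maximizing cos(d', b - a + c),
    with all embeddings normalized to unit length first."""
    def unit(v):
        return v / np.linalg.norm(v)
    target = unit(unit(emb[b]) - unit(emb[a]) + unit(emb[c]))
    return max(candidates, key=lambda d: float(np.dot(unit(emb[d]), target)))

# Illustrative embeddings: the last dimension encodes a "gender" offset.
emb = {
    "boy":     np.array([1.0, 0.0, 0.2]),
    "girl":    np.array([1.0, 0.0, 0.9]),
    "brother": np.array([0.2, 1.0, 0.2]),
    "sister":  np.array([0.2, 1.0, 0.9]),
    "father":  np.array([0.6, 0.8, 0.1]),
    "son":     np.array([0.7, 0.7, 0.15]),
}
answer = solve_analogy(emb, "boy", "girl", "brother",
                       ["sister", "father", "son"])
print(answer)  # "sister"
```

Replacing the word vectors with phrase or sentence embeddings from an encoder turns this into the phrase/sentence-level analogy evaluation described next.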
4.2.2 Phrase/Sentence-level analogy
To investigate the arithmetic properties of vectors for higher-level linguistic units, we present phrase and sentence analogy tasks based on the proposed word analogy dataset. We only consider a subset of the original analogy task because we find that for some categories, such as "Australia" : "Australian", the same template phrase/sentence cannot be applied to both words. Statistics are shown in Table III.
Semantic Semantic analogies can be divided into four subsets: "capital-common", "capital-world", "city-state" and "male-female". The first two sets can be merged into a larger dataset, "capital-country", which contains pairs of countries and their capital cities; the third involves states and their cities; the last one contains pairs with gender relations. Considering GloVe's poor performance on word-level "country-currency" questions (32%), we discard this subset in the phrase- and sentence-level analogies. Then we put words into contexts so that the resulting phrases and sentences also have linear relationships. For example, based on the relationship Athens : Greece :: Baghdad : Iraq, we select phrases and sentences that contain the word "Athens" from the English Wikipedia Corpus (https://dumps.wikimedia.org/enwiki/latest), such as "He was hired by the university of Athens as being professor of physics.", and create examples: "hired by … Athens" : "hired by … Greece" :: "hired by … Baghdad" : "hired by … Iraq". However, such a question is identical to a word-level analogy for BOW methods like averaged GloVe vectors, because they treat embeddings independently regardless of content and word order. To avoid lexical overlap between sequences, we replace certain words and phrases with their synonyms and paraphrases, e.g., "hired by … Athens" : "employed by … Greece" :: "employed by … Baghdad" : "hired by … Iraq". Sentences selected from the corpus usually carry a lot of redundant information, so to ensure consistency we manually modify some words during the construction of templates; this procedure does not affect the relationship between sentences.
| Category | Topics | Count |
| Daily Scenarios | Traveling, Recipe, Skin care, Beauty makeup, Pets | 22 |
| Sport & Health | Outdoor sports, Athletics, Weight loss, Medical treatment | 15 |
| Reviews | Movies, Music, Poetry, Books | 16 |
| Persons | Entrepreneurs, Historical/Public figures, Writers, Directors, Actors | 17 |
| General | Festivals, Hot topics, TV shows | 6 |
| Specialized | Management, Marketing, Commerce, Workplace skills | 17 |
| Others | Relationships, Technology, Education, Literature | 14 |
Syntactic We consider three typical syntactic analogies: Tense, Comparative and Negation, corresponding to three subsets: "present-participle", "positive-comparative" and "positive-negative", where the model needs to distinguish the correct answer from the "past tense", "superlative" and "positive" distractors, respectively. For example, given "Pigs are bright" : "Pigs are brighter than goats" :: "The train is slow", the model needs to give a higher similarity score to the sentence containing "slower" than to the one containing "slowest". Similarly, we add synonyms and synonymous phrases for each question to evaluate the model's ability to learn context-aware embeddings rather than interpreting each word in the question independently; for instance, "pleasant" is paired with "not unpleasant", and "unpleasant" with "not pleasant".
4.3 Retrieval-based FAQ
The sentence-level analogy discovers relationships between sentences by directly manipulating sentence vectors. In particular, we observe that sentences with similar meanings are close to each other in the vector space, which is consistent with the goal of information retrieval tasks such as Frequently Asked Questions (FAQ) retrieval. Such a task is to retrieve relevant documents (FAQs) given a user query, which can be done accurately by manipulating only the vectors representing the sentences, e.g., calculating and ranking vector distances in terms of cosine similarity. Thus, we present an insurance FAQ task in this subsection to explore the effectiveness of BURT in real-world retrieval applications.
An FAQ task involves a collection of Question-Answer (QA) pairs, denoted as $\{(Q_i, A_i)\}_{i=1}^{N}$, where $N$ is the number of QA pairs. The goal is to retrieve the most relevant QA pairs for a given query. We collect frequently asked questions and answers between users and customer service from our partners at a Chinese online financial education institution. The collection contains over four types of insurance questions, e.g., concept explanation ("what"), insurance consultation ("why", "how"), judgement ("whether") and recommendation. An example is shown in Figure 3. Our dataset is composed of 300 QA pairs that are carefully selected to avoid similar questions, so that each query has only one exact match. Because queries are mainly paraphrases of the standard questions, we use query-Question similarity as the ranking score. The test set consists of 875 queries, and the average lengths of questions and queries are 14 and 16, respectively. The evaluation metrics are Top-1 Accuracy (Acc.) and Mean Reciprocal Rank (MRR), because there is only one correct answer for each query.
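The query-Question ranking and the MRR metric can be sketched as follows; the vectors below are purely illustrative stand-ins for encoder outputs:

```python
import numpy as np

def retrieve(query_vec, question_vecs):
    """Rank stored FAQ questions by cosine similarity to the query vector;
    returns question indices in descending order of similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    Q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    return np.argsort(-(Q @ q))

def mrr(ranks):
    """Mean Reciprocal Rank over the 1-based ranks of the correct answers."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three stored question embeddings and one query embedding (toy values).
questions = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
query = np.array([0.85, 0.2])
order = retrieve(query, questions)
top1 = int(order[0])  # index of the best-matching stored question
```

Top-1 Accuracy is then the fraction of queries whose correct question lands at position 0 of `order`, and MRR averages the reciprocal of the correct question's rank.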
[Table: results on CLUE (columns: Models | Single Sentence | Sentence Pair | MRC | Avg.); hyperparameters: batch size 8, 16; length 128, 256; epochs 2, 3, 5, 50; lr 1e-5, 2e-5, 3e-5]
[Table: results on GLUE (columns: Models | Single Sentence | NLI | Semantic Similarity | Avg.); hyperparameters: batch size 8, 16, 32, 64; length 128; epochs 3; lr 3e-5]
4.4 Natural Language Generation
Moving from word and sentence vectors towards representations for sequences of any length, a universal language model should be able to capture the semantics of free text and facilitate various applications that depend heavily on the quality of language representation. In this subsection, we introduce a retrieval-based Natural Language Generation (NLG) task: generating articles from manually created templates. Concretely, the goal is to retrieve, one at a time, the paragraph from the corpus that best describes a given sentence of the template, and then combine the retrieved paragraphs into a complete passage. The main difficulty of this task lies in comparing the semantics of sentence-level queries (usually containing only a few words) with paragraph-level documents (often consisting of multiple sentences).
We use articles collected by our partners at a media company as our corpus. Each article is split into several paragraphs, and each document contains one paragraph. The corpus has a total of 656k documents and covers a wide range of domains, including news, stories and daily scenarios. In addition, we have a collection of manually created templates covering 7 main categories, as shown in Table IV. Each template provides an outline of an article and contains up to a fixed number of sentences, each describing a particular aspect of the topic.
The problem is solved in two steps. First, an index of all documents is built using BM25; for each query, it returns a set of candidate documents related to the topic. Second, we use representation models to re-rank the top 100 candidates: each query-document pair is mapped to a score based on cosine similarity. The quality of the generated passages was assessed by two native Chinese speakers, who were asked to examine whether the retrieved paragraphs were "relevant" to the topic and "conveyed the meaning" of the given sentence.
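The two-stage retrieve-then-rerank pipeline can be sketched as below. The BM25 variant and its k1/b defaults are standard textbook choices rather than settings from the paper, and the embeddings in the second stage stand in for encoder outputs:

```python
import math
import numpy as np
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal BM25 over tokenized documents (first-stage retrieval)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def rerank(candidate_ids, query_vec, doc_vecs):
    """Second stage: order BM25 candidates by cosine similarity of embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    def cos(i):
        v = doc_vecs[i]
        return float(np.dot(v / np.linalg.norm(v), q))
    return sorted(candidate_ids, key=cos, reverse=True)

# Toy corpus: BM25 narrows the pool, embeddings decide the final order.
docs = [["dog", "food"], ["cat", "toy"], ["dog", "training"]]
scores = bm25_scores(["dog", "food"], docs)
```

In the full system, BM25 returns the top 100 candidates from the 656k-document index, and only those are re-ranked by the representation model.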
5.1 Data Processing
We download the English and Chinese Wikipedia Corpus (https://dumps.wikimedia.org) and pre-process it with process_wiki.py (https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py), which extracts text from the xml files. For the Chinese corpus, we then convert the data into simplified characters using OpenCC. In order to extract high-quality n-grams, we remove punctuation marks and characters in other languages based on regular expressions, finally obtaining an English corpus of 2,266M words and a Chinese corpus of 380M characters.
We calculate PMI scores of all n-grams up to a maximum length for each document rather than over the entire corpus, considering that different documents usually describe different topics. We manually evaluate the extracted n-grams and find that nearly 50% of the top 2000 n-grams contain 3-4 words (characters for Chinese), and fewer than 0.5% of the n-grams are longer than 7. Although a larger n-gram vocabulary can cover longer n-grams, it introduces too many meaningless ones. Therefore, for both the English and Chinese corpora, we empirically retain the top 3000 n-grams for each document, resulting in n-gram vocabularies with average lengths of 4.6 and 4.5, respectively. Finally, for English, we randomly sample 10M sentences rather than use the entire corpus, to reduce training time.
As in BERT, sentence pairs are packed into a single sequence and the special [CLS] token is used for sentence-level prediction. However, in accordance with Joshi et al. (2020), we find that single-sequence training is better than the original sentence-pair scenario. Thus, in our experiments, the input is a continuous sequence with a maximum length of 512.
| Barton’s inquiry was reasonable : Barton’s inquiry was not reasonable :: Changing the sign of numbers is an efficient algorithm | GloVe | USE | BURT-base |
| changing the sign of numbers is an inefficient algorithm | 0.96 | 0.89 | 0.97 |
| changing the sign of numbers is not an inefficient algorithm | 0.97 | 0.90 | 0.96 |
| Members are aware of their political work : Members are not aware of their political work :: This ant is a known species | GloVe | USE | BURT-base |
| This ant is an unknown species | 0.94 | 0.87 | 0.96 |
| This ant is not an unknown species | 0.95 | 0.82 | 0.95 |
Instead of training from scratch, we initialize both the English and Chinese models with the officially released checkpoints (bert-base-uncased, bert-large-uncased, bert-base-chinese) and BERT-wwm-ext, which is trained from the Chinese BERT using whole word masking on extended data. Base models comprise 12 Transformer layers, 12 heads, 768-dimensional hidden states and 110M parameters in total. The English BERT-large has 24 Transformer layers, 16 heads, 1024-dimensional hidden states and 340M parameters in total. We use the Adam optimizer with an initial learning rate of 5e-5 and linear warmup over the first 10% of the training steps. The batch size is set to 16 and the dropout rate is 0.1. Each model is trained for one epoch.
Following BERT, in the fine-tuning procedure, pairs of sentences are concatenated into a single sequence with a special token [SEP] in between. For both single-sentence and sentence-pair tasks, the hidden state of the first token [CLS] is used for softmax classification. We use the same sets of hyperparameters for all evaluated models. All experiments on the GLUE benchmark are run with total train batch sizes between 8 and 64 and a learning rate of 3e-5 for 3 epochs. For tasks from the CLUE benchmark, we set batch sizes to 8 and 16, learning rates between 1e-5 and 3e-5, and train for 50 epochs on WSC and 25 epochs on the remaining tasks.
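The classification head described above amounts to a softmax over an affine transform of the [CLS] hidden state. A minimal sketch with made-up dimensions (hidden size 4, 3 classes); the weights are random stand-ins for fine-tuned parameters:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(cls_hidden, W, b):
    """Linear layer + softmax over the [CLS] hidden state."""
    return softmax(W @ cls_hidden + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # (num_classes, hidden_size)
b = np.zeros(3)
cls_hidden = rng.normal(size=4)  # encoder output at the [CLS] position
probs = classify(cls_hidden, W, b)
```

During fine-tuning, `W` and `b` are learned jointly with the encoder by minimizing cross-entropy over `probs`.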
5.4 Downstream-task Models
On GLUE and CLUE, we compare our model with three variants: pre-trained models (Chinese BERT/BERT-wwm-ext, English BERT-base/BERT-large), models trained with the same number of additional steps as our model (MLM), and models trained using random span masking with the same number of additional steps as our model (Span). For the Span model, we simply replace our n-gram module with the masking strategy proposed for SpanBERT, where the span length is sampled from a geometric distribution; we follow the original settings for the distribution parameter and the maximum span length.
We also evaluate the aforementioned models on our universal analogy task. Baselines include a bag-of-words (BoW) model built from pre-trained GloVe word embeddings; the sentence embedding models InferSent, GenSen, USE and LASER; and the pre-trained contextualized language models BERT, ALBERT, RoBERTa and XLNet. To derive semantically meaningful embeddings, we fine-tune BERT and our model on the Stanford Natural Language Inference (SNLI) and Multi-Genre NLI [multinli] corpora using a Siamese structure, following Reimers and Gurevych (2019) [sbert].
|Query: 端午节的由来 (The Origin of the Dragon Boat Festival)|
|: 一个中学的高级教师陈老师生动地解读端午节的由来，诵读爱好者进行原创诗歌作品朗诵，深深打动了在场的观众… (Mr. Chen, senior teacher at a middle School, vividly introduced the origin of the Dragon Boat Festival and people are reciting original poems, which deeply moved the audience…)|
|: 今天是端午小长假第一天…当天上午，在车厢内满目挂有与端午节相关的民俗故事及有关诗词的文字… (Today is the first day of the Dragon Boat Festival holiday…There are folk stories and poems posted in the carriage…)|
|: …端午节又称端阳节、龙舟节、浴兰节，是中华民族的传统节日。端午节形成于先秦，发展于汉末魏晋，兴盛于唐… (…The Dragon Boat Festival, also known as the Duanyang Festival, Longzhou Festival and Yulan Festival, is a traditional festival of the Chinese nation. It took shape in the pre-Qin period, developed in the late Han and Wei-Jin era, and flourished in the Tang…)|
|Comments: and is related to the topic but does not convey the meaning of the query.|
|Query: 狗的喂养知识 (Dog Feeding Tips)|
|: …创建一个“比特狗”账户，并支付99元领养一只“比特狗”。然后购买喂养套餐喂养“比特狗”，“比特狗”就能通过每天挖矿产生BTGS虚拟货币。 (…First create a “Bitdog” account and pay 99 yuan to adopt a “Bitdog”. Then buy a package to feed the “Bitdog”, which can generate virtual currency BTGS through daily mining.)|
|: 要养成定时定量喂食的好习惯，帮助狗狗更好的消化和吸收，同时也要选择些低盐健康的狗粮… (It is necessary to feed your dog regularly and quantitatively to help them digest and absorb better. Meanwhile, choose some low-salt and healthy food…)|
|: 泰迪犬容易褪色是受到基因和护理不当的影响，其次是饮食太咸…一定要注意正确护理，定期洗澡，要给泰迪低盐营养的优质狗粮… (Teddy bear dog’s hair is easy to fade because of its genes and improper care. It is also caused by salty diet… So we must take good care of them, such as taking a bath regularly, and preparing dog food with low salt…)|
|: 还可以一周自制一次狗粮给狗狗喂食，就是买些肉类，蔬菜，自己动手做。偶尔吃吃自制狗粮也能增加狗狗的营养，和丰富狗狗的口味。日常的话，建议选择些适口性强的狗粮，有助磨牙，防止口腔疾病。 (You can also make dog food once a week, such as meats and vegetables. Occasionally eating homemade dog food can also supplement nutrition and enrich the taste. In daily life, it is recommended to choose some palatable dog food to help their teeth grinding and prevent oral diseases.)|
|Comments: is not a relevant paragraph. is relevant to the topic but is inaccurate.|
For FAQ and NLG, we compare our models with statistical methods such as TF-IDF and BM25, the sentence representation model LASER, the pre-trained BERT/BERT-wwm-ext, and models trained with additional steps (MLM, Span). We observe that models further trained on the Chinese SNLI and MNLI datasets underperform BERT on the FAQ dataset, so we only consider pre-trained models for these two tasks.
6.1 General Language Understanding
Table 6 and show the results on the GLUE and CLUE benchmarks, where we find that training BERT with additional MLM steps hardly brings any improvement except on the WSC task. In Chinese, the Span model is effective on WSC but comparable to BERT on the other tasks. BERT-wwm-ext is better than our model on classification tasks involving pairs of short sentences, such as AFQMC and OCNLI, which may be due to its relatively powerful capability of modeling short sequences. Overall, both BURT and BURT-wwm-ext outperform the baseline models on 4 out of 6 tasks with considerable improvement, demonstrating their effectiveness in modeling sequences of different lengths. The most significant improvement is observed on WSC (3.5% over the updated BERT-wwm-ext and 0.7% over the Span model), where the model must determine whether two given spans refer to the same entity in the text. We conjecture that the model benefits from learning to predict meaningful spans in the pre-training stage, so it is better at capturing the meanings of spans in text. In English, our approach also improves the performance of BERT on various tasks from the GLUE benchmark, indicating that our proposed PMI-based masking method is general and independent of the language setting.
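For reference, the PMI scoring that underlies our masking method can be sketched in a few lines. This toy version scores only adjacent bigrams over a single token stream, whereas the full method extracts n-grams of varying lengths from a large corpus:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score adjacent bigrams by pointwise mutual information:
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ).
    A high-PMI bigram is a candidate n-gram to be masked as a unit."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores
```

Masking the top-scoring segments, rather than random spans, is what steers pre-training toward semantically coherent units.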
6.2 Universal Analogy
Results on the analogy tasks are reported in Table VIII. Generally, semantic analogies are more challenging than syntactic ones, and higher-level relationships between sequences are more difficult to capture, a trend observed in almost all the evaluated models. On word analogy tasks, pre-trained language models like BERT, ALBERT, RoBERTa and XLNet hardly exhibit arithmetic characteristics, and increasing the model size usually leads to a decrease in accuracy. However, our method of pre-training on n-grams extracted by the PMI algorithm significantly improves performance on word analogies compared with BERT, obtaining 72.8% (BURT-base) and 79.4% (BURT-large) accuracy, respectively. Further training BURT-large on SNLI and MNLI results in the highest accuracy (80.5%).
Although GloVe, InferSent and USE lead on word-level analogy datasets, they do not generalize well to higher-level analogy tasks. We conjecture that their poor performance is caused by synonyms and paraphrases in sentences, which lead these models to assign lower similarity scores to the correct answers. In contrast, Transformer-based models are more advantageous in representing higher-level sequences: they are good at identifying paraphrases and capturing relationships between sentences even when lexical overlap is low. Moreover, fine-tuning pre-trained models achieves considerable improvements on high-level semantic analogies. Overall, SBURT-base achieves the highest average accuracy (60.7%).
Examples from the Negation subset are shown in Table VII. Notice that the word “not” does not explicitly appear in the correct answers; instead, “inefficient” and “unaware” are the indicators of negation. As expected, BoW assigns a higher similarity score to the sentence that contains both “not” and “inefficient”, because word-level information is simply added and subtracted regardless of context. By contrast, contextualized models like BURT capture the meanings and relationships of words within the sequence in a comprehensive way, indicating that they have indeed learned universal representations across different linguistic units.
|p_man: employed by the man||p_woman: hired by the woman|
|p_king: employed by the king||p_queen: hired by the queen|
|p_dad: employed by his dad||p_mom: hired by his mom|
|s_man: He was employed by the man when he was 28.|
|s_woman: He was hired by the woman at age 28.|
|s_king: He was employed by the king when he was 28.|
|s_queen: He was hired by the queen at age 28.|
|s_dad: He was employed by his dad when he was 28.|
|s_mom: He was hired by his mom at age 28.|
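The analogy evaluation illustrated above reduces to a nearest-neighbor search over offset vectors (the 3CosAdd scheme of Levy and Goldberg). A sketch with contrived toy embeddings, chosen only so that the arithmetic works out cleanly:

```python
import numpy as np

def solve_analogy(a, b, c, vocab):
    """Answer 'a : b :: c : ?' by returning the vocab key whose
    embedding is most cosine-similar to vec(b) - vec(a) + vec(c),
    excluding the three query terms themselves (3CosAdd)."""
    target = vocab[b] - vocab[a] + vocab[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -2.0
    for word, vec in vocab.items():
        if word in (a, b, c):
            continue
        sim = float(np.dot(target, vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```

The same routine applies unchanged to phrase and sentence embeddings, which is precisely what the universal analogy task exploits.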
6.3 Retrieval-based FAQ
Results are reported in Table X. As we can see, LASER and all pre-trained language models significantly outperform TF-IDF and BM25, indicating the superiority of embedding-based models over statistical methods. Besides, continued training of BERT is often beneficial. Among all the evaluated models, our BURT yields the highest accuracy (82.2%) and MRR (0.872). BURT-wwm-ext achieves a slightly lower accuracy (80.7%) than BURT but still exceeds its baselines by 4.0% (MLM) and 1.4% (Span), respectively.
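The MRR figures above are the average reciprocal rank of the gold answer over all queries; a minimal sketch of the metric:

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """ranked_lists[i] is the ranked list of candidate ids returned
    for query i; gold[i] is the correct id. Queries whose gold answer
    is absent from the ranking contribute 0."""
    total = 0.0
    for ranking, answer in zip(ranked_lists, gold):
        if answer in ranking:
            total += 1.0 / (ranking.index(answer) + 1)
    return total / len(ranked_lists)
```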
6.4 Natural Language Generation
Results are summarized in Table XI. Although nearly 62% of the paragraphs retrieved by BM25 are relevant to the topic, only two-thirds of them actually convey the original meaning of the template. Despite LASER’s comparable performance to BURT on FAQ, it is less effective when linguistic units of different granularities are involved at the same time. Re-ranking with BURT substantially improves the quality of the generated paragraphs. We show examples retrieved by BM25, LASER, the Span model and BURT in Table IX, denoted by , , and , respectively. BM25 tends to favor paragraphs that contain the keywords even when the paragraph conveys a different meaning, while BURT selects accurate answers according to the semantic meanings of queries and documents.
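For comparison, the BM25 baseline scores a document purely from term frequencies and inverse document frequencies, which explains its keyword bias. A minimal Okapi BM25 sketch; k1 and b below are standard default-style values, not tuned to our setup:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for `query` over the corpus `docs`.
    query and doc are token lists; docs is a list of token lists."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Because the score is zero whenever no query term appears in the document, a paraphrase with no lexical overlap is invisible to BM25, which is exactly the failure mode the embedding-based re-ranking addresses.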
7.1 Single Pattern
Mikolov et al. (2013) use PCA to project word embeddings into a two-dimensional space to visualize a single pattern captured by the Word2Vec model, whereas in this work we consider embeddings of linguistic units of different granularities. All pairs in Figure 4 belong to the “male-female” category, and subtracting the two vectors of each pair yields roughly the same direction.
Given that embeddings of sequences with the same kind of relationship will exhibit the same pattern in the vector space, we obtain the difference between pairs of embeddings for words, phrases and sentences from different categories and visualize them by t-SNE. Figure 5 shows that by subtracting two vectors, pairs that belong to the same category automatically fall into the same cluster. Only the pairs from “capital-country” and “city-state” cannot be totally distinguished, which is reasonable because they all describe the relationship between geographical entities.
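The projection step of this visualization can be outlined as follows. This sketch substitutes PCA via SVD for t-SNE to stay dependency-free; the input rows, emb(x) - emb(y) for each related pair, are the same in either case:

```python
import numpy as np

def project_differences(pairs, dim=2):
    """Project pair-difference vectors down to `dim` dimensions with
    PCA computed via SVD. `pairs` is a list of (emb_x, emb_y) tuples;
    each output row corresponds to emb_x - emb_y for one pair."""
    diffs = np.stack([x - y for x, y in pairs])
    centered = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T
```

Plotting the resulting 2-D points colored by relation category is then enough to see whether same-relation pairs fall into the same cluster.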
We show examples in Figure 6 where BURT successfully retrieves the correct answer while TF-IDF and BM25 fail. Both sentences “Can 80-year-old people get accident insurance?” and “Can life insurance last until the age of 80?” contain the word “80”, which likely explains why TF-IDF ranks them as a strong match, ignoring that the two sentences describe two different issues. In contrast, using vector-based representations, BURT recognizes “seniors” as a paraphrase of “80-year-old people”. As depicted in Figure 6, queries are close to the correct responses and far from other sentences.
This paper formally introduces the task of universal representation learning and presents a pre-trained language model for this purpose, which maps linguistic units of different granularities into the same vector space, where similar sequences have similar representations and unified vector operations are enabled across language hierarchies.
In detail, we focus on this less studied aspect of language representation, seeking to learn a uniform vector form across different linguistic unit hierarchies. Unlike methods that learn word-only or sentence-only representations, ours extends BERT’s masking and training objective to a more general level, leveraging information from sequences of different lengths in a comprehensive way and effectively learning a universal representation covering words, phrases and sentences.
Overall, our proposed BURT outperforms its baselines on a wide range of downstream tasks over sequences of different lengths in both English and Chinese. We additionally provide a universal analogy task, an insurance FAQ dataset and an NLG dataset for extensive evaluation, where our well-trained universal representation model demonstrates accurate vector arithmetic over words, phrases and sentences, as well as strong performance in real-world retrieval applications.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013.
-  M. Artetxe and H. Schwenk, “Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond,” Trans. Assoc. Comput. Linguistics, vol. 7, pp. 597–610, 2019. [Online]. Available: https://transacl.org/ojs/index.php/tacl/article/view/1742
-  D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil, “Universal sentence encoder for English,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 169–174.
-  M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of NAACL-HLT, 2018, pp. 2227–2237.
-  A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018.
-  Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
-  Y. Cui, W. Che, T. Liu, B. Qin, Z. Yang, S. Wang, and G. Hu, “Pre-training with whole word masking for Chinese BERT,” 2019.
-  W. Wang, B. Bi, M. Yan, C. Wu, J. Xia, Z. Bao, L. Peng, and L. Si, “Structbert: Incorporating language structures into pre-training for deep language understanding,” in International Conference on Learning Representations, 2020.
-  M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, “Spanbert: Improving pre-training by representing and predicting spans,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020.
-  J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543.
-  A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” 2016.
-  R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, 2015, pp. 3294–3302.
-  L. Logeswaran and H. Lee, “An efficient framework for learning sentence representations,” in International Conference on Learning Representations (ICLR), 2018.
-  A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 670–680.
-  S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, Sep. 2015, pp. 632–642.
-  S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal, “Learning general purpose distributed sentence representations via large scale multi-task learning,” in International Conference on Learning Representations, 2018.
-  Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” 2019.
-  Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” 2019.
-  T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” 2013.
-  O. Levy and Y. Goldberg, “Linguistic regularities in sparse and explicit word representations,” in CoNLL 2014, 2014.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep neural networks for natural language understanding,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4487–4496.
-  Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg, “Fine-grained analysis of sentence embeddings using auxiliary prediction tasks,” CoRR, 2016.
-  A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni, “What you can cram into a single vector: Probing sentence embeddings for linguistic properties,” CoRR, 2018.
-  G. Bacon and T. Regier, “Probing sentence embeddings for structure-dependent tense,” in Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, 2018.
-  A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in bertology: What we know about how bert works,” 2020.
-  X. Ma, Z. Wang, P. Ng, R. Nallapati, and B. Xiang, “Universal text representation from BERT: An empirical study,” 2019.
-  G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?” in ACL 2019, 2019.
-  P. Barancíková and O. Bojar, “In search for linear relations in sentence embedding spaces,” in ITAT 2019, ser. CEUR Workshop Proceedings, 2019.
-  X. Zhu and G. de Melo, “Sentence analogies: Exploring linguistic relationships and regularities in sentence embeddings,” CoRR, 2020.
-  K. Hammond, R. Burke, C. Martin, and S. Lytinen, “Faq finder: a case-based approach to knowledge navigation,” in Proceedings of the 11th Conference on Artificial Intelligence for Applications, 1995.
-  V. Jijkoun and M. de Rijke, “Retrieving answers from frequently asked questions pages on the web,” in CIKM 2005, 2005.
-  E. Sneiders, “Automated faq answering with question-specific knowledge representation for web self-service,” 2009.
-  S. Damani, K. N. Narahari, A. Chatterjee, M. Gupta, and P. Agrawal, “Optimized transformer models for FAQ answering,” in PAKDD 2020, ser. Lecture Notes in Computer Science, 2020.
-  W. Sakata, T. Shibata, R. Tanaka, and S. Kurohashi, “FAQ retrieval using query-question similarity and BERT-based query-answer relevance,” in SIGIR 2019, 2019.
-  A. Stolcke, “SRILM - an extensible language modeling toolkit,” in INTERSPEECH, 2002.
-  K. W. Church and P. Hanks, “Word association norms, mutual information, and lexicography,” Computational linguistics, vol. 16, no. 1, pp. 22–29, 1990.
-  A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Nov. 2018, pp. 353–355.
-  A. Warstadt, A. Singh, and S. R. Bowman, “Neural network acceptability judgments,” 2019.
-  R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Oct. 2013, pp. 1631–1642.
-  N. Nangia, A. Williams, A. Lazaridou, and S. Bowman, “The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations,” in Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, Sep. 2017, pp. 1–10.
-  P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Nov. 2016, pp. 2383–2392.
-  W. B. Dolan and C. Brockett, “Automatically constructing a corpus of sentential paraphrases,” in Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
-  D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, “SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation,” in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Aug. 2017, pp. 1–14.
-  L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan, “CLUE: A Chinese language understanding evaluation benchmark,” in Proceedings of the 28th International Conference on Computational Linguistics, Dec. 2020, pp. 4762–4772.
-  IFLYTEK CO., LTD., “IFLYTEK: a multiple categories Chinese text classifier,” 2019.
-  H. Hu, K. Richardson, L. Xu, L. Li, S. Kuebler, and L. S. Moss, “OCNLI: Original Chinese natural language inference,” 2020.
-  Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu, “A span-extraction dataset for Chinese machine reading comprehension,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Nov. 2019, pp. 5883–5889.
-  C. Zheng, M. Huang, and A. Sun, “ChID: A large-scale Chinese IDiom dataset for cloze test,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, Jul. 2019, pp. 778–787.
-  K. Sun, D. Yu, D. Yu, and C. Cardie, “Probing prior knowledge needed in challenging Chinese machine reading comprehension,” ArXiv, vol. abs/1904.09679, 2019.
-  B. Matthews, “Comparison of the predicted and observed secondary structure of t4 phage lysozyme,” Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442 – 451, 1975.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017.