
Transcormer: Transformer for Sentence Scoring with Sliding Language Modeling

Sentence scoring aims at measuring the likelihood of a sentence and is widely used in many natural language processing scenarios, such as reranking, which selects the best sentence from multiple candidates. Previous works on sentence scoring mainly adopted either causal language modeling (CLM), as in GPT, or masked language modeling (MLM), as in BERT, which have some limitations: 1) CLM only utilizes unidirectional information to estimate the probability of a sentence, without considering bidirectional context, which affects scoring quality; 2) MLM can only estimate the probability of a subset of tokens at a time and thus requires multiple forward passes to estimate the probability of the whole sentence, which incurs large computation and time costs. In this paper, we propose Transcormer, a Transformer model with a novel sliding language modeling (SLM) for sentence scoring. Specifically, our SLM adopts a triple-stream self-attention mechanism to estimate the probabilities of all tokens in a sentence with bidirectional context in only a single forward pass. SLM avoids the limitations of CLM (only unidirectional context) and MLM (multiple forward passes) while inheriting their advantages, and thus achieves both high effectiveness and efficiency in scoring. Experimental results on multiple tasks demonstrate that our method achieves better performance than other language modeling methods.


1 Introduction

Sentence scoring measures the quality (likelihood) of a sentence via language modeling. It has been widely used in many natural language processing (NLP) scenarios, especially reranking. Reranking aims to select the most suitable candidate from multiple candidates produced by a text generation model (e.g., a machine translation or speech recognition model) based on the sentence score. Precise sentence scoring makes it possible to select a better candidate. Therefore, how to estimate sentence scores is critical for these application scenarios.

With the development of deep learning, language modeling (LM) [20, 40, 43, 18] has become the most widely used technique for sentence rescoring, since it can estimate the probability of a whole sentence by computing the probability of each token and accumulating these (log-)probabilities as the sentence score. Specifically, causal language modeling (CLM) and masked language modeling (MLM) are the most representative LMs: CLM conditions on unidirectional context for next-token prediction, while MLM masks some tokens in the input sequence and utilizes bidirectional context to predict them. Some works [6, 17, 14, 27, 28, 3, 42, 35, 34, 7, 16, 29] pre-trained these LMs on large-scale corpora, enhancing their capacity and generalization. For example, GPT [27, 28, 3] adopted CLM to pre-train a large Transformer [36] model for natural language generation (NLG) tasks, while BERT [6] and its descendants [17, 14, 7] leveraged MLM for pre-training to handle natural language understanding (NLU) tasks. Inspired by these successes of CLM and MLM on NLG and NLU tasks, some works [30, 37, 32, 5, 39] have tried to use these pre-trained LMs for reranking in machine translation or speech recognition and achieved promising results.

However, we notice that both CLM and MLM still have deficiencies for calculating sentence scores. For instance, some works [39] utilized a GPT-style model for ASR reranking with a single inference pass, but due to the limitation of CLM such a model can only extract unidirectional information and ignores the semantics of the whole sentence, which affects the sentence score. To utilize bidirectional context, other works [37, 32, 5] applied a BERT model for rescoring. However, the nature of MLM requires the model to mask some tokens in the sentence for prediction, which means BERT must run multiple forward passes, each masking only one token. As a result, directly adopting MLM for sentence scoring is time-consuming. Overall, for calculating sentence scores, CLM needs only one inference pass but uses only unidirectional information, whereas MLM can use bidirectional context but is computationally costly. Therefore, we raise a natural question: is it possible to design a language modeling method that uses bidirectional context for sentence scoring with only a single inference pass?

To address the above issues, in this paper we introduce Transcormer, a Transformer model designed for sentence scoring. More specifically, Transcormer leverages a novel language modeling method, named sliding language modeling (SLM), which produces the probabilities of all tokens within a single inference pass while simultaneously utilizing bidirectional context. To fulfill this goal, we design a triple-stream self-attention mechanism, which consists of two content streams (a forward stream and a backward stream) and one query stream. By employing specifically designed mask strategies on the attention matrices, each token in the query stream leverages all tokens except itself (i.e., the tokens before and after it) for probability estimation, avoiding any information leakage. To the best of our knowledge, SLM is the first language modeling method tailored for sentence scoring. We pre-train our SLM on a large-scale corpus and then evaluate it on multiple datasets for reranking. Experimental results demonstrate that our Transcormer improves the baseline by up to +0.8/+0.6 BLEU on low-resource/rich-resource machine translation tasks and gives about 20% relative improvement on ASR datasets.

The contributions of our paper can be summarized as:

  • We analyze the pros and cons of CLM and MLM when using them for scoring sentences, and thus propose Transcormer with a new sliding language modeling, which uses bidirectional context for probability estimation within a single pass.

  • We introduce a novel triple-stream self-attention mechanism in SLM, which has two content streams to collect forward/backward semantics, and a query stream to estimate the probability of each token in a sentence.

  • Experimental results on multiple datasets demonstrate the effectiveness and efficiency of our SLM for sentence scoring.

2 Background

In this section, we describe the background of sentence scoring in NLP applications and of multiple-stream self-attention.

2.1 Sentence Scoring

Sentence scoring has a long history in NLP applications, especially in reranking tasks (e.g., reranking for machine translation [13, 33, 31, 21] or speech recognition [19]). Generally, given the $N$-best candidates generated by a text generation model, reranking aims at scoring each candidate to select the best answer. Early works [13, 33, 31, 21] mainly used statistical LMs to calculate sentence scores, and some works [1, 8] tried to combine statistical LMs with RNN-based LMs for sentence scoring. Recently, directly using end-to-end neural LMs has become the de facto approach for scoring and has been widely used in many NLP tasks [40, 20, 38]. Specifically, causal language modeling (CLM) and masked language modeling (MLM) are the most representative language modeling methods, popularized by GPT [27, 28, 3] and BERT [6] respectively. The details of CLM and MLM are described below.

CLM

CLM models a sequence based on the previous tokens in a chain-style (autoregressive) manner. Assuming the target sequence is $x = (x_1, \dots, x_N)$, where $x_i$ is the $i$-th token, the probability estimation function of CLM is $P(x) = \prod_{i=1}^{N} P(x_i \mid x_{<i})$. Hence, CLM can obtain the probability of each token of the sentence in a single pass, but can only capture unidirectional information.

MLM

Unlike CLM, MLM uses bidirectional context for prediction. Specifically, MLM replaces a random subset of tokens with a special [MASK] symbol for prediction. Assuming the masked subset is $\mathcal{M}$ and the corrupted sequence is $\hat{x}$, the objective of MLM is to optimize $\sum_{x_i \in \mathcal{M}} \log P(x_i \mid \hat{x})$. However, MLM only supports partial prediction, as each pass can mask only a subset of tokens. To calculate the sentence score, one solution [37, 32, 5] is to run $N$ forward passes and mask only one token each time. MLM-based sentence scoring is then formulated as $\sum_{i=1}^{N} \log P(x_i \mid x_{\setminus i})$, where $x_{\setminus i}$ denotes the sentence with the $i$-th token masked. We can see that MLM-based scoring requires $N$ inference passes, which is too time-consuming. To alleviate this problem, some works attempted to use stochastic estimation [37] or distillation [30] to approximate the probability of each token produced by MLM, but the approximated score cannot precisely estimate the token probability. Therefore, how to calculate the sentence score efficiently is the main challenge for MLM.

Overall, CLM needs only a single forward pass to estimate the probabilities of all tokens but cannot extract bidirectional context, while MLM leverages bidirectional information but needs multiple inference passes. Consequently, we raise a natural question: is it possible to design a pre-trained language model that predicts all tokens in a single pass and simultaneously leverages bidirectional information? This is exactly the motivation of our method.
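To make this cost contrast concrete, the sketch below (our illustration, not code from any cited work) scores a sentence once with a causal LM and once with a masked LM; the tiny embedding-plus-linear modules are placeholders for real pre-trained models, and `MASK_ID` is a hypothetical mask token id.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for real pre-trained LMs: any module mapping token ids of shape
# (1, N) to logits of shape (1, N, VOCAB) could be dropped in here.
VOCAB, MASK_ID, DIM = 100, 0, 32
causal_lm = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))
masked_lm = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))

def clm_score(tokens: torch.Tensor) -> float:
    """CLM-style scoring: a single forward pass; token i is scored from its left context."""
    logp = F.log_softmax(causal_lm(tokens[:, :-1]), dim=-1)        # (1, N-1, VOCAB)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum().item()

def mlm_score(tokens: torch.Tensor) -> float:
    """MLM-style scoring: N forward passes, masking exactly one token per pass."""
    total = 0.0
    for i in range(tokens.size(1)):
        corrupted = tokens.clone()
        corrupted[0, i] = MASK_ID                                   # mask only position i
        logp = F.log_softmax(masked_lm(corrupted), dim=-1)
        total += logp[0, i, tokens[0, i]].item()                    # score of the true token
    return total

sentence = torch.randint(1, VOCAB, (1, 8))                          # a toy 8-token sentence
print(clm_score(sentence), mlm_score(sentence))
```

With a sentence of N tokens, `clm_score` calls the model once while `mlm_score` calls it N times, which is exactly the efficiency gap discussed above.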

2.2 Multiple-Stream Self-Attention

The pioneer of multiple-stream self-attention is XLNet [42], which introduces two-stream self-attention to enable autoregressive pre-training for language understanding. It pre-trains a Transformer [36] in an autoregressive manner by maintaining a content stream and a query stream. In detail, at the $i$-th step the content stream can capture dependencies on the tokens up to and including step $i$ (i.e., $x_{\le i}$), while the query stream is only allowed to view the tokens before step $i$ (i.e., $x_{<i}$) to avoid information leakage. Besides, there are other variants of two-stream self-attention [35, 26, 41]. For example, MPNet [35] used two-stream self-attention to build masked and permuted pre-training; ProphetNet [26] designed multiple query streams to predict n-gram future steps for sequence-to-sequence tasks; and ERNIE-GEN [41] proposed a multi-flow generation model, which includes two query streams for span and word prediction. We observe that these works share a common pattern: they use a single content stream together with one or more query streams for prediction. Different from these works, we introduce a triple-stream self-attention mechanism that enables the query stream to leverage two content streams for prediction, and thus enjoys the benefit of bidirectional context when estimating token probabilities.

3 Transcormer

To inherit the advantages of CLM and MLM for sentence scoring while avoiding their limitations, we propose Transcormer, a Transformer model with a novel sliding language modeling for sentence scoring. First, we observe that an ideal language modeling method for sentence scoring should satisfy two requirements: 1) the model should be able to use bidirectional context for effective probability estimation of each token; 2) the model should produce the probabilities of all tokens in a sentence within a single inference pass for efficiency. To fulfill these two requirements, we formulate a new language modeling method, named sliding language modeling (SLM), described in Section 3.1. In SLM, we propose a triple-stream self-attention mechanism based on Transformer (see Section 3.2 for details) that uses bidirectional context for each token prediction while avoiding information leakage. We also discuss the differences between SLM and other LMs in Section 3.3. Figure 1 presents the pipeline of our Transcormer for sentence scoring with SLM.

3.1 Sliding Language Modeling

Figure 1: Transcormer with sliding language modeling. The left and right streams (in blue) are the forward and backward streams, respectively, and the middle stream (in green) is the query stream. For the query stream, the inputs are only the positional information. Gray and red lines represent the allowed attention positions in the content and query streams, respectively.

Considering the pros and cons of CLM and MLM for scoring, we notice that: 1) CLM can produce the probabilities of all tokens within one forward pass, and thus obtains the unidirectional information of the whole sentence; 2) MLM needs multiple inference passes for sentence scoring, so much of the context is computed repeatedly, wasting computation. So, is it possible to reuse token information to build bidirectional context for token prediction?

Therefore, we propose sliding language modeling (SLM) to address the inherent flaws of previous LMs (i.e., CLM and MLM) for sentence scoring. Specifically, we maintain two individual streams to collect the forward (left-to-right) context and the backward (right-to-left) context. For each token, we decompose the sentence information into its past tokens (the tokens before it) and its future tokens (the tokens after it). As a result, our SLM enforces each token to capture dependencies on its past tokens and its future tokens concurrently, so that each token can utilize the whole sentence (except itself) to estimate its probability. The objective function of SLM is formulated as:

\log P_{\mathrm{SLM}}(x) = \sum_{i=1}^{N} \log P(x_i \mid x_{<i}, x_{>i}; \theta)    (1)

where $x_{<i}$ and $x_{>i}$ respectively correspond to the tokens before and after the $i$-th token, and $\theta$ represents the parameters of SLM. Thanks to this design, our SLM can utilize bidirectional context to produce the probability of each token within one forward pass, and thus satisfies the above requirements for sentence scoring. However, previous works [42, 11] pointed out that hidden states carrying bidirectional information cause information leakage when propagated to the next layer (Footnote 1: For example, assume the sequence has 3 tokens and denote the hidden state of the $i$-th token at the first layer as $h_i$. Then $h_1$, $h_2$ and $h_3$ should collect information from positions {2, 3}, {1, 3} and {1, 2}, respectively. However, when $h_1$ and $h_3$ are fed to position 2 at the second layer, a cyclic leakage occurs, since position 2 should not obtain information about itself, which $h_1$ and $h_3$ already contain.). Therefore, how to implement SLM so as to avoid information leakage while maintaining the different states together is a non-trivial problem.
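As a worked instance of Eqn. (1), a three-token sentence $x = (x_1, x_2, x_3)$ is scored as

\log P_{\mathrm{SLM}}(x) = \log P(x_1 \mid x_2, x_3; \theta) + \log P(x_2 \mid x_1, x_3; \theta) + \log P(x_3 \mid x_1, x_2; \theta),

where all three terms are produced in a single forward pass, whereas one-token-at-a-time MLM scoring would need three passes to obtain them.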

Figure 2: (a) The structure of the triple-stream self-attention used in our sliding language modeling. The query stream reuses the hidden states from both the forward and backward (content) streams as the key and value in attention. (b) The attention mask matrices used in triple-stream self-attention. The top row shows the attention matrices for the forward and backward streams, and the bottom row shows the attention matrix for the query stream. Gray cells mark positions that cannot be attended.

3.2 Triple-Stream Self-Attention

Based on Eqn. (1), our SLM needs to maintain two states to collect the forward and backward contexts for prediction; we call these two states the forward stream and the backward stream, respectively. To avoid information leakage, we additionally maintain an individual state for prediction and constrain it to capture dependencies only from the forward and backward streams; we name this state the query stream. We therefore propose a novel triple-stream self-attention to update the three streams, described in detail below.

To fulfill our target, we choose Transformer [36] as our basic model, due to its flexibility in capturing global dependencies. Assume the input sequence is $x = (x_1, \dots, x_N)$ with positions $p = (p_1, \dots, p_N)$, where $x_i$ and $p_i$ represent the $i$-th token and its position in the sentence, and $N$ is the number of tokens. For the query stream, we only use the positions as the input. For the forward and backward streams, we maintain two individual states, and both use the token plus its position (i.e., $x_i + p_i$) as the input. For the $l$-th layer, we denote the forward and backward streams at position $i$ as $f_i^l$ and $b_i^l$, and they are updated as:

f_i^l = \mathrm{Attn}(\mathrm{Q} = f_i^{l-1}, \; \mathrm{K}, \mathrm{V} = f_{\le i}^{l-1})    (2)
b_i^l = \mathrm{Attn}(\mathrm{Q} = b_i^{l-1}, \; \mathrm{K}, \mathrm{V} = b_{\ge i}^{l-1})    (3)

where $\mathrm{Attn}(\cdot)$ refers to the self-attention [36] in Transformer and Q, K, V denote the query, key and value in self-attention. Hence, $f_i^l$ collects the information from the positions before $i$ and from $i$ itself, and $b_i^l$ collects the information from the positions after $i$ and from $i$ itself. For the query stream, we denote its hidden state at position $i$ as $q_i^l$, and we concatenate the forward and backward streams of the current layer to serve as the key/value of the query stream. So $q_i^l$ is updated as:

q_i^l = \mathrm{Attn}(\mathrm{Q} = q_i^{l-1}, \; \mathrm{K}, \mathrm{V} = [f_{<i}^{l}; \, b_{>i}^{l}])    (4)

Here $q_i^l$ is required to capture dependencies only on the forward stream before position $i$ and the backward stream after position $i$, never on position $i$ itself. Due to this design, the query stream can capture bidirectional context for estimating token probabilities while avoiding information leakage, which makes it more effective than CLM in using context. More importantly, our triple-stream self-attention enables the model to predict the probabilities of all tokens in a sentence within a single forward pass, which is far more efficient than MLM. Figure 2 presents the detailed design of our triple-stream self-attention. In the query stream, the mask matrix acts like a sliding window that lets each token view its preceding states in the forward stream and its following states in the backward stream, and that is why we name our method sliding language modeling.
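The three masks can be written down directly. The sketch below (an illustrative construction under our reading of Figure 2, not the released implementation) builds 0/1 attention masks where 1 means "may attend": a lower-triangular mask for the forward stream, an upper-triangular mask for the backward stream, and an (n, 2n) mask for the query stream over the concatenated content streams.

```python
import torch

def triple_stream_masks(n: int):
    """Attention masks for sliding language modeling (rows = query positions, 1 = may attend)."""
    idx = torch.arange(n)
    # Forward content stream: position i attends to positions j <= i.
    fwd = (idx.unsqueeze(0) <= idx.unsqueeze(1)).float()            # (n, n)
    # Backward content stream: position i attends to positions j >= i.
    bwd = (idx.unsqueeze(0) >= idx.unsqueeze(1)).float()            # (n, n)
    # Query stream: keys/values are [forward states; backward states], so the mask is (n, 2n):
    # strictly earlier forward states and strictly later backward states, never position i itself.
    q_fwd = (idx.unsqueeze(0) < idx.unsqueeze(1)).float()
    q_bwd = (idx.unsqueeze(0) > idx.unsqueeze(1)).float()
    qry = torch.cat([q_fwd, q_bwd], dim=1)
    return fwd, bwd, qry

fwd, bwd, qry = triple_stream_masks(4)
print(qry)   # the "sliding window" pattern over the two content streams
```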

LM     Model          Cost  Context             Scenario
CLM    GPT [28]       ×1    forward             NLG
MLM    BERT [6]       ×N    bidirectional       NLU
Bi-LM  ELMO [25]      ×2    forward + backward  NLU
SLM    Transcormer    ×3    bidirectional       Scoring
Table 1: Comparisons between SLM and other LMs. We assume all LMs adopt the same architecture (e.g., Transformer). The "Cost" column gives the relative computation compared with CLM when scoring a sentence with N tokens (×2 for Bi-LM and ×3 for SLM count the two and three streams, respectively). The "Context" column gives the contextual information used for prediction.

3.3 Discussion

To better understand our SLM, we analyze its advantages over other LMs. The comparisons are listed in Table 1. We select three representative LMs for comparison: CLM (GPT), MLM (BERT) and the bidirectional LM (Bi-LM) used in ELMO [25] (Footnote 2: ELMO pre-trains a left-to-right and a right-to-left LSTM [10] and concatenates the outputs of the last unidirectional LSTM layers for prediction.). From Table 1, we have the following observations:

  1. When compared with CLM, our SLM requires 3× the computation. However, our SLM can fully use the whole sentence for prediction, while CLM can only use unidirectional information. Even when scaling CLM to 3× the parameters, it still cannot use bidirectional context for prediction. This demonstrates the effectiveness of our SLM in using context.

  2. MLM is powerful at extracting bidirectional context, but it needs $N$ inference passes to score a whole sentence due to its masked prediction. Our SLM needs just a single inference pass and uses bidirectional information for prediction with only 3× computation. In particular, our SLM is far more efficient than MLM when $N$ is large.

  3. Bi-LM can also extract forward and backward contextual information, but it simply concatenates the forward and backward features for the final prediction, without any interaction between them. Instead, our SLM iteratively fuses the bidirectional information thanks to our triple-stream self-attention mechanism.

Overall, the design of SLM is dedicated to sentence scoring, while CLM is better suited to NLG tasks and MLM/Bi-LM to NLU tasks.

4 Experiments

In this section, we describe our experimental setup, and the results on NMT and ASR datasets.

4.1 Experimental Setup

We adopt Transformer [36] as the backbone network. Following previous works [6], we adopt a base setting for our model (a base Transcormer with 110M parameters), which consists of 12 Transformer layers, each with a hidden size of 768 and 12 attention heads. For pre-training, we use Wikipedia plus BookCorpus (16GB) as the training corpus, to be consistent with previous works [6]. Our model is trained at the sentence level (i.e., one sentence per sample). We choose Adam [12] as the default optimizer, with a learning rate of 5e-4 (see Table 7) and a weight decay of 0.01. The learning rate warms up over the first 10,000 steps and then decays linearly. The batch size is 8,192 tokens and we train for 125,000 steps on 32 NVIDIA Tesla 32GB GPUs with FP16 speedup. The total training takes 5.5 days for Transcormer. For more experimental settings, please refer to the Appendix. Our code and pre-trained models will be released later.
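For reference, the learning-rate schedule described above can be written as a small function. This is a hedged sketch: the 10,000 warm-up steps, 125,000 total steps and peak rate of 5e-4 come from the text and Table 7, while the decay to zero at the final step is our assumption.

```python
def learning_rate(step: int,
                  peak_lr: float = 5e-4,
                  warmup_steps: int = 10_000,
                  total_steps: int = 125_000) -> float:
    """Linear warm-up over the first 10k steps, then linear decay (assumed to reach zero)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# e.g. learning_rate(5_000) -> 2.5e-4, learning_rate(125_000) -> 0.0
```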

IWSLT WMT
Model De Es It Nl Pl Ro Ru Tr De-En
Oracle 41.80 48.69 41.89 44.38 27.90 46.01 29.60 27.25 39.17
Baseline 34.77 41.20 34.95 37.73 22.67 38.73 24.21 21.65 32.54
CLM (GPT) 34.96 41.39 35.14 38.08 22.91 39.03 24.62 22.14 32.88
MLM (BERT) 35.14 41.54 35.54 38.14 23.00 39.21 24.65 22.36 33.07
Bi-LM (ELMO) 35.10 41.52 35.21 38.03 23.09 39.07 24.53 21.91 32.90
SLM (Transcormer) 35.24 41.86 35.52 38.45 23.29 39.34 24.69 22.41 33.10
Table 2: Reranking results on IWSLT and WMT tasks; all LMs have the same model architecture as Transcormer. The translation direction of all IWSLT tasks is to English and all results are reported in BLEU. All LMs are pre-trained on Wikipedia + BookCorpus (16GB) with the same optimization. The Oracle row is the oracle score computed from the generated candidates.

4.2 Experiments on Neural Machine Translation

We choose the IWSLT14 dataset [4], which includes multiple translation tasks from low-resource languages to English, and the WMT14 English-German dataset (Footnote 3: Here we only evaluate the German→English direction, as our model is trained on English data.) for evaluation. We adopt Transformer [36] as the machine translation model, with 6 encoder and 6 decoder layers, to generate multiple candidates for reranking, with a beam size of 10. The hidden size and number of attention heads are set to 512/1024 and 8/16 for the IWSLT and WMT tasks, respectively. During reranking, we combine the original score produced by the machine translation model and the LM score with a hyper-parameter $\lambda$, following previous practice [43, 30]. The hyper-parameter $\lambda$ is tuned on the dev set over a grid of values, and the best value is then used to evaluate the test set. The results are reported in Table 2 in terms of BLEU [23]. Each task is evaluated five times with different pre-trained checkpoints, and we report the mean value (variance about 0.05). From Table 2, we have the following observations:

  • Our Transcormer obtains better performance than CLM and Bi-LM (Footnote 4: Here, we replace the LSTM with a Transformer in Bi-LM to keep the architecture consistent.) on both the low-resource IWSLT and the large-scale WMT tasks, which indicates the importance of bidirectional context for sentence scoring and further validates the ability of our SLM to utilize bidirectional information.

  • When compared with MLM, our Transcormer achieves comparable results. Considering that the computation of MLM for scoring is linear in the input length (it needs $N$ inference passes), our SLM shows much higher efficiency, needing only a single pass with 3× computation to maintain the three streams.

Overall, all of the results reveal that our model is more effective in using contextual information for probability estimation and more efficient with only a single forward pass, especially for long sentences.

Model dev-clean dev-other test-clean test-other
Baseline 2.80 6.90 3.06 7.05
CLM (GPT) 2.47 6.13 2.73 6.33
MLM (BERT) 2.30 5.65 2.59 5.90
Bi-LM (ELMO) 2.41 5.92 2.63 6.12
SLM (Transcormer) 2.23 5.54 2.49 5.72
Oracle 1.45 4.23 1.59 4.19
Table 3: Reranking results on the LibriSpeech dataset. All results are reported in WER.

4.3 Experiments on Automatic Speech Recognition

We choose LibriSpeech [22] to evaluate the performance of our model for reranking on an ASR task. We train a Conformer model [9] on LibriSpeech, which has 12 encoder layers and 6 decoder layers with a hidden size of 512 and 8 attention heads; the beam size is set to 10. In addition, we use SpecAugment [24] as a data augmentation technique to further improve the accuracy of the ASR system. We use word error rate (WER) to evaluate performance. We follow the same tuning procedure for the hyper-parameter $\lambda$ as in the NMT tasks, but with a larger search range. The results are reported in Table 3. From Table 3, we find that our model gives nearly 20% relative improvement over the baseline and also outperforms the other LMs, including CLM, MLM and Bi-LM. The results on the ASR task further demonstrate the generalization and effectiveness of our SLM for sentence scoring.
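Both the NMT and ASR experiments rerank with the same recipe: interpolate the generator's score with the LM score through a weight λ tuned on the dev set. The following sketch uses hypothetical function and variable names to illustrate that recipe; it is not the authors' code.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float, float]   # (text, generator_score, lm_score)

def rerank(nbest: Sequence[Candidate], lam: float) -> str:
    """Pick the candidate maximizing generator_score + lam * lm_score."""
    return max(nbest, key=lambda c: c[1] + lam * c[2])[0]

def tune_lambda(dev_nbest: List[Sequence[Candidate]],
                dev_refs: List[str],
                grid: Sequence[float],
                metric: Callable[[List[str], List[str]], float]) -> float:
    """Grid-search lam on the dev set; the best value is then reused on the test set."""
    best_lam, best_score = grid[0], float("-inf")
    for lam in grid:
        hyps = [rerank(nbest, lam) for nbest in dev_nbest]
        score = metric(hyps, dev_refs)      # e.g. BLEU for NMT, negative WER for ASR
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam
```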

5 Analyses

In this section, we present analyses comparing our proposed SLM with CLM and MLM. Further analyses are provided in the Appendix.

IWSLT WMT LibriSpeech
Model De Es De-En dev-clean dev-other
Baseline 34.77 41.20 32.54 2.80 6.90
CLM (GPT) 34.96 41.39 32.88 2.47 6.13
SLM (Transcormer) 35.05 41.58 32.94 2.48 5.95
Table 4: Comparisons between SLM (Transcormer) and CLM (GPT) under the same computation. NMT and LibriSpeech use BLEU and WER as metrics, respectively.

5.1 Comparable Computation between CLM and SLM

As aforementioned, our SLM needs 3× the computation of CLM. To make a fair comparison, we also pre-train a small Transcormer with 34M parameters in total, which consists of 6 Transformer layers, each with a hidden size of 512 and 8 attention heads. Hence, this small Transcormer has roughly one third of the parameters of the base Transcormer and thus a computation cost similar to a base-size GPT. We conduct experiments on three NMT tasks and one ASR task (the LibriSpeech dataset) for comparison, and the results are listed in Table 4. We find that even under the same computation, our model still outperforms CLM, which further validates the necessity of using bidirectional context for sentence scoring. Besides, considering that this Transcormer has fewer parameters, our model is also friendly to device deployment (e.g., on CPU).

Model Cost PPL dev-clean dev-other test-clean test-other
Baseline - - 2.80 6.90 3.06 7.05
MLM () 4.26 2.30 5.65 2.59 5.90
MLM () 8.41 2.41 5.87 2.70 6.20
MLM () 11.58 2.60 5.95 2.87 6.41
MLM () - 2.75 6.71 2.98 6.93
MLM () - 2.80 6.80 3.01 6.99
SLM 3.85 2.23 5.54 2.49 5.72
Table 5: Comparisons of masking different numbers of tokens per forward pass for MLM. We use the ASR reranking task on the LibriSpeech dataset to evaluate the results and also report PPL on a subset of sentences of the same length.

5.2 Varying Numbers of Forward Passes in MLM

As mentioned above, MLM needs $N$ forward passes when it masks only one token at a time. So what happens if we allow each pass to mask more tokens? We design experiments that let MLM mask $k$ tokens per forward pass, so that it only needs $\lceil N/k \rceil$ passes, and investigate the effect of different $k$. For a given $k$, we randomly split the sentence into $\lceil N/k \rceil$ subsets and mask one subset per pass for prediction. The comparisons are listed in Table 5. We find that using a larger $k$ severely harms performance, even though it reduces the number of inference passes. When $k$ is set so that the cost equals that of our SLM (i.e., about $N/3$ tokens per pass), MLM can hardly give any improvement over the baseline. We think that masking more tokens at a time makes it harder to estimate each token's probability. These comparisons further highlight the efficiency and effectiveness of our SLM for sentence scoring.
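The k-tokens-per-pass MLM scoring used in this comparison can be sketched as follows; this is our illustration with a generic `masked_lm` callable (anything mapping token ids to per-position logits) and a hypothetical `mask_id`, not the paper's code.

```python
import torch
import torch.nn.functional as F

def mlm_score_k(masked_lm, tokens: torch.Tensor, k: int, mask_id: int) -> float:
    """Approximate MLM sentence score using k masked tokens per forward pass."""
    n = tokens.size(1)
    perm = torch.randperm(n)                      # random split of the positions
    total = 0.0
    for start in range(0, n, k):                  # ceil(n / k) forward passes
        subset = perm[start:start + k]
        corrupted = tokens.clone()
        corrupted[0, subset] = mask_id            # mask this subset only
        logp = F.log_softmax(masked_lm(corrupted), dim=-1)
        total += logp[0, subset, tokens[0, subset]].sum().item()
    return total
```

Setting k = 1 recovers the standard one-token-per-pass scoring of Section 2.1, while larger k trades accuracy for fewer passes, as Table 5 shows.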

Figure 3: The average cross-entropy loss of each LM at each position. MLM uses $N$ passes, one to predict each position.

5.3 Sentence Scoring Quality at Each Position

To better analyze the effect of bidirectional context on estimating token probabilities, we compute the cross-entropy loss at each position for each LM. Specifically, we sample a subset of sentences, each with the same number of tokens (here set to 20). For each position, we compute the average cross-entropy over the sampled subset based on the output probabilities of each LM. The results are displayed in Figure 3. We find that: 1) for CLM, the cross-entropy loss is higher at the first several positions and gradually decreases at subsequent positions, but remains higher than that of MLM and SLM, which indicates that unidirectional information alone is not enough to measure the sentence score precisely; 2) SLM obtains almost the same loss as MLM at each position. Considering that SLM needs just a single pass while MLM needs $N$ passes, this further validates the superiority and efficiency of SLM for scoring sentences.
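The per-position analysis can be reproduced with a short routine; a minimal sketch, assuming each LM exposes the log-probabilities it assigns to the reference tokens of a batch of equal-length sentences, is:

```python
import torch

def avg_position_ce(logprob_fn, sentences: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy at each position over equal-length sentences.

    logprob_fn(sentences) is assumed to return the per-token log-probabilities the LM
    assigns to the reference tokens, with shape (batch, length).
    """
    token_logprobs = logprob_fn(sentences)        # (batch, length)
    return (-token_logprobs).mean(dim=0)          # (length,) = average CE per position
```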

6 Conclusion

In this paper, we propose Transcormer, a Transformer with a novel sliding language modeling (SLM) for sentence scoring. Our SLM produces the probability of each token in the sentence within a single forward pass while utilizing bidirectional context for prediction, and thus inherits the advantages of CLM and MLM while avoiding their deficiencies. To the best of our knowledge, the proposed Transcormer is the first pre-trained language model tailored for sentence scoring. Experimental results on multiple datasets demonstrate the effectiveness of our Transcormer in computing sentence scores for reranking tasks.

Besides, we summarize some potential directions of our Transcormer and SLM as the future works:

  • Currently our Transcormer is trained only on English data under the base setting, due to limited computation. We expect to develop a large-scale Transcormer and to train on different language domains or multilingual data in the future.

  • We design sliding language modeling for sentence scoring, and our experiments mainly focus on reranking. However, based on the characteristics of our SLM, we believe our model can also be used in other scenarios (e.g., error correction [leng2021fastcorrect, leng2021fastcorrect2, song2021neural], data selection), and we will explore the specific fine-tuning techniques for applying our SLM to different downstream tasks.

  • Besides, our Transcormer currently pre-trains SLM on an encoder-only framework. However, our SLM is not limited to this model structure. For example, SLM can easily be extended to an encoder-decoder framework [34] with paired data. Therefore, we also expect to explore the possibility of using SLM in different frameworks.

  • Although our paper mainly focuses on text data, we want to highlight that SLM can also be extended to other modalities with sequential characteristics (e.g., image, speech and time-series data). Consequently, how to apply SLM to other modalities is also a valuable topic for the future.

References

  • [1] M. Auli and J. Gao (2014) Decoder integration and expected BLEU training for recurrent neural network language models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pp. 136–142. Cited by: §2.1.
  • [2] S. Bhattacharyya, A. Rooshenas, S. Naskar, S. Sun, M. Iyyer, and A. McCallum (2021) Energy-based reranking: improving neural machine translation using energy-based models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4528–4537. Cited by: §A.4.
  • [3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, virtual, pp. 1–16. Cited by: §1, §2.1.
  • [4] M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico (2014-12) Report on the 11th IWSLT evaluation campaign. In Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign, Lake Tahoe, California, pp. 2–17. Cited by: §4.2.
  • [5] S. Chiu and B. Chen (2021) Innovative bert-based reranking language models for speech recognition. In IEEE Spoken Language Technology Workshop, SLT 2021, Shenzhen, China, January 19-22, 2021, pp. 266–271. Cited by: §1, §1, §2.1.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §1, §2.1, Table 1, §4.1.
  • [7] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In NeurIPS, Vancouver Convention Centre, Vancouver, Canada, pp. 13042–13054. Cited by: §1.
  • [8] T. Ehara (2018) In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation, WAT@PACLIC 2018, Hong Kong, December 1-3, 2018, Cited by: §2.1.
  • [9] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. In INTERSPEECH, Cited by: §4.3.
  • [10] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. Cited by: footnote 2.
  • [11] J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu (2020) Non-autoregressive machine translation with disentangled context transformer. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 5144–5155. Cited by: §3.1.
  • [12] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Cited by: §4.1.
  • [13] S. Kumar and W. J. Byrne (2004) Minimum bayes-risk decoding for statistical machine translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2-7, 2004, pp. 169–176. Cited by: §2.1.
  • [14] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia. Cited by: §1.
  • [15] A. Lee, M. Auli, and M. Ranzato (2021) Discriminative reranking for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 7250–7264. Cited by: §A.4.
  • [16] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, Online, pp. 7871–7880. Cited by: §1.
  • [17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §1.
  • [18] Y. Liu, L. Zhou, Y. Wang, Y. Zhao, J. Zhang, and C. Zong (2018) A comparable study on model averaging, ensembling and reranking in NMT. In Natural Language Processing and Chinese Computing - 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26-30, 2018, Proceedings, Part II, Lecture Notes in Computer Science, Vol. 11109, pp. 299–308. Cited by: §1.
  • [19] Y. Ma, E. Cambria, and B. Bigot (2017) ASR hypothesis reranking using prior-informed restricted boltzmann machine. In Computational Linguistics and Intelligent Text Processing - 18th International Conference, CICLing 2017, Budapest, Hungary, April 17-23, 2017, Revised Selected Papers, Part I, Vol. 10761, pp. 503–514. Cited by: §2.1.
  • [20] N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov (2019) Facebook fair’s WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, pp. 314–319. Cited by: §1, §2.1.
  • [21] F. J. Och (2003) Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 7-12 July 2003, Sapporo Convention Center, Sapporo, Japan, pp. 160–167. Cited by: §2.1.
  • [22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pp. 5206–5210. Cited by: §A.1.3, §4.3.
  • [23] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318. Cited by: §4.2.
  • [24] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In INTERSPEECH. Cited by: §4.3.
  • [25] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 2227–2237. Cited by: §3.3, Table 1.
  • [26] W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou (2020) ProphetNet: predicting future n-gram for sequence-to-sequence pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, Findings of ACL, Vol. EMNLP 2020, pp. 2401–2410. Cited by: §2.2.
  • [27] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1, §2.1.
  • [28] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1, §2.1, Table 1.
  • [29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §1.
  • [30] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff (2020) Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 2699–2712. Cited by: §A.3.2, §1, §2.1, §4.2.
  • [31] L. Shen, A. Sarkar, and F. J. Och (2004) Discriminative reranking for machine translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2-7, 2004, pp. 177–184. Cited by: §2.1.
  • [32] J. Shin, Y. Lee, and K. Jung (2019) Effective sentence scoring method using BERT for speech recognition. In Proceedings of The 11th Asian Conference on Machine Learning, ACML 2019, 17-19 November 2019, Nagoya, Japan, Proceedings of Machine Learning Research, Vol. 101, pp. 1081–1093. Cited by: §1, §1, §2.1.
  • [33] R. Shu and H. Nakayama (2017) Later-stage minimum bayes-risk decoding for neural machine translation. CoRR abs/1704.03169. Cited by: §2.1.
  • [34] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach Convention Center, Long Beach, pp. 5926–5936. Cited by: §1, 3rd item.
  • [35] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2020) MPNet: masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, virtual. Cited by: §1, §2.2.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, Long Beach Convention Center, Long Beach, pp. 5998–6008. Cited by: §1, §2.2, §3.2, §4.1, §4.2.
  • [37] A. Wang and K. Cho (2019) BERT has a mouth, and it must speak: BERT as a markov random field language model. CoRR abs/1902.04094. Cited by: §1, §1, §2.1.
  • [38] Y. Wang, S. Cheng, L. Jiang, J. Yang, W. Chen, M. Li, L. Shi, Y. Wang, and H. Yang (2017) Sogou neural machine translation systems for WMT17. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pp. 410–415. Cited by: §2.1.
  • [39] Y. Weng, S. S. Miryala, C. Khatri, R. Wang, H. Zheng, P. Molino, M. Namazifar, A. Papangelis, H. Williams, F. Bell, and G. Tür (2020) Joint contextual modeling for ASR correction and language understanding. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pp. 6349–6353. Cited by: §1, §1.
  • [40] Y. Xia, X. Tan, F. Tian, F. Gao, D. He, W. Chen, Y. Fan, L. Gong, Y. Leng, R. Luo, Y. Wang, L. Wu, J. Zhu, T. Qin, and T. Liu (2019) Microsoft research asia’s systems for WMT19. In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, pp. 424–433. Cited by: §1, §2.1.
  • [41] D. Xiao, H. Zhang, Y. Li, Y. Sun, H. Tian, H. Wu, and H. Wang (2020) ERNIE-GEN: an enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp. 3997–4003. Cited by: §2.2.
  • [42] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS, Vol. 32, Vancouver Convention Centre, Vancouver, Canada, pp. 5753–5763. Cited by: §1, §2.2, §3.1.
  • [43] K. Yee, Y. N. Dauphin, and M. Auli (2019) Simple and effective noisy channel modeling for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 5695–5700. Cited by: §1, §4.2.

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work?

    3. Did you discuss any potential negative societal impacts of your work?

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Please see Section 4.2 and the Appendix.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Please see Section 4.1

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Appendix

A.1 Datasets

A.1.1 IWSLT 2014

IWSLT 2014 is the evaluation campaign of the 11th International Workshop on Spoken Language Translation. It consists of a number of low-resource translation tasks collected from TED talks, from German (De), Spanish (Es), Italian (It), Dutch (Nl), Polish (Pl), Romanian (Ro), Russian (Ru) and Turkish (Tr) to English. We randomly split each dataset into a training set and a dev set with a ratio of 25:1, and for each task we concatenate TED.tst2010, TED.tst2011, TED.dev2010 and TED.tst2012 as the test set. The statistics of each sub-task are as follows:

De Es It Nl Pl Ro Ru Tr
Train 160K 169K 167K 153K 128K 167K 153K 109K
Valid 7.2K 7.6K 7.5K 6.9K 5.8K 7.6K 6.9K 4.9K
Test 5.5K 5.5K 5.5K 5.3K 5.4K 5.5K 5.5K 5.4K
Table 6: Statistics of the IWSLT datasets.

A.1.2 WMT14 English-German

WMT14 English-German comprises 4.5M bilingual sentence pairs collected from Europarl v7, the Common Crawl corpus and News Commentary. We concatenate newstest2012 and newstest2013 as the valid set, and choose newstest2014 as the test set for WMT14 English-German. Our experiments focus on the German→English direction.

A.1.3 LibriSpeech

LibriSpeech [22] includes 1,000 hours of speech data sampled at 16 kHz. It provides four subsets for evaluation: dev-clean, dev-other, test-clean and test-other.

A.2 Experimental Setup

The pre-training hyper-parameters of the base and small Transcormer models are described in Table 7.

Hyper-parameter Transcormer (base) Transcormer (small)
Number of Layers 12 6
Hidden Size 768 512
Filter Size 3072 2048
Attention heads 12 8
Dropout 0.1 0.1
Weight Decay 0.01 0.01
Learning Rate 5e-4 5e-4
Steps 125K 125K
Batch 8192 8192
Table 7: Pre-training hyper-parameters for the base and small Transcormer models.

A.3 Analyses

A.3.1 Pre-training Strategy

In our experimental setup, we use sentence-level data as the input for pre-training. To analyze the effect of different data processing strategies, we also conduct experiments with stream-level data (concatenating multiple sentences up to a fixed length, e.g., 512 tokens) for comparison. We apply the two strategies to the NMT and ASR tasks, and evaluate the average (top-1) token accuracy of SLM on sentences of different lengths based on its output probabilities. The results are reported in Table 8. We find that pre-training on stream-level data does not achieve good accuracy when sentences are short. We suspect this is because the model never fits short sentences, since it is always pre-trained on long concatenated sequences (i.e., 512 tokens). Considering that our downstream scenarios mainly involve single, relatively short sentences, directly using stream-level data for pre-training is not promising. As a result, we recommend sentence-level data for pre-training, and we expect to explore more effective pre-training strategies in the future.

IWSLT WMT LibriSpeech # Sent Len
Model De Es De-En dev-clean dev-other 20 250 500
Transcormer 35.24 41.86 33.10 2.23 5.54 60.0% 73.0% 78.8%
Using stream-level 34.84 41.38 32.70 2.56 6.31 20.0% 55.0% 78.5%
Table 8: Comparisons between sentence-level and stream-level pre-training. The translation direction of all IWSLT tasks is to English. We sample sentences from Wikipedia with fixed lengths (20, 250, 500) to evaluate the token accuracy of SLM (i.e., the top-1 accuracy of each token, averaged over all tokens in the sentence).

A.3.2 Domain Adaptation

Following previous experience [30], we also study the effect of using in-domain data for pre-training. For the NMT tasks, we randomly sample 20GB of monolingual data from NewsCrawl to build the pre-training corpus. For the ASR task, as LibriSpeech includes 4GB of in-domain text, we directly use it as the pre-training corpus. The results on NMT and ASR are reported in Table 9 and Table 10. We find that using in-domain data for pre-training is helpful for the downstream tasks.

IWSLT WMT
Model De Es It Nl Pl Ro Ru Tr De-En
Oracle 41.80 48.69 41.89 44.38 27.90 46.01 29.60 27.25 39.17
Baseline 34.77 41.20 34.95 37.73 22.67 38.73 24.21 21.65 32.54
SLM (Transcormer) 35.24 41.86 35.52 38.45 23.29 39.34 24.69 22.41 33.10
+ in-domain data 35.74 42.39 35.97 39.06 23.91 39.70 24.95 23.05 33.51
Table 9: Domain adaptation on NMT tasks. The translation direction of all IWSLT tasks is to English. All results are reported in BLEU.
Model dev-clean dev-other test-clean test-other
Oracle 1.45 4.23 1.59 4.19
Baseline 2.80 6.90 3.06 7.05
SLM (Transcormer) 2.23 5.54 2.49 5.72
+ in-domain data 2.01 5.12 2.12 5.23
Table 10: Domain adaptation on the LibriSpeech dataset. All results are reported in WER.

A.4 More Related Works

Besides using language models to produce the probability of each token, some works use discriminative language modeling to approximately estimate sentence scores. [2] borrowed the idea of energy-based models for sentence reranking, and [15] proposed a discriminative language model that minimizes the KL-divergence between the target distribution and the output distribution. These methods can be considered discriminative language modeling, which directly predicts a single value as the sentence score. Discriminative language models usually need target datasets for fine-tuning, while our proposed language model is independent of downstream tasks. We consider discriminative language models complementary to our work and leave the combination to future work.