, and text summarization Nallapati et al. (2016), exhibiting human-level performance Johnson et al. (2017). Despite some drawbacks (e.g., huge parallel corpora are needed to train seq2seq models, and expert knowledge is required to set the hyperparameters), seq2seq models are becoming increasingly popular and are now deployed in real-world applications McCann et al. (2019). During inference, a trained seq2seq model aims to find the best sentence given a source sentence. Since searching over all possible paths is both impractical and computationally expensive, existing work relies on beam-search-based algorithms to address this issue Lipton (2015).
Current solutions have limited performance due to three major constraints: (1) beam search selects sentences based on likelihood regardless of the evaluation metric; (2) during the generation process, the model only considers left-to-right dependencies (right-to-left dependencies are ignored); (3) seq2seq models strongly favor safe sentences Li et al. (2016a): during generation, the influence of the input decreases as words are generated Zhang et al. (2018), meaning that the end of the sequence is less likely to be relevant to the input. These limitations constrain beam search performance: on the Switchboard Corpus with a beam size of 50, optimal re-ranking would yield an improvement of 128% (see Supplementary for the full table). For the evaluation metric, we follow Li et al. (2016a); Colombo et al. (2019) and adopt the BLEU-4 score to compare the algorithms.
In this work, we introduce two novel algorithms based on vanilla beam search (VBS) with different ranking procedures: (1) BidiS, a generalisation of the work of Wen et al. (2015); Mimura et al. (2018) that uses a "reverse" decoder to re-score the produced sentences, penalizing sentences whose ending is less likely given the input; (2) BidiA, an algorithm that selects the closest pair according to a similarity measure between two beams of hypotheses: one with sentences generated in the regular order, the other with sentences generated in the reverse order. Our results show that leveraging the reverse order can boost beam search performance, leading to higher BLEU-4 scores and more diverse responses compared to VBS. Complexity analysis further shows that our proposed algorithms have dramatically reduced computational cost compared to traditional approaches.
We notice that limitations (2) and (3) introduced above can be addressed by introducing bidirectionality into the beam search. Indeed, training a seq2seq model to output the sentence in reverse order can model right-to-left dependencies and reduces the path length between the input and the end of the sentence (the shorter the paths, the easier it is to model dependencies), making the end of the sentence more dependent on the input and producing more diverse sequences.
2.1 Vanilla Beam Search (VBS)

We denote by $B$ the beam size, $T$ the maximum sentence length, and $V$ the vocabulary size. An RNN encoder takes an input sequence $X$ and learns the language model word by word during the training phase. When a language task is given, the decoder explores paths through the search graph at each time step and keeps the $B$ most likely sequences; at each step, VBS therefore considers at most $B \times V$ hypotheses. The sequence likelihood is measured using the score function of Wu et al. (2016):

$s(Y, X) = \frac{\log P(Y|X)}{lp(Y)}, \qquad lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}$

where $X$ is the source, $Y$ is the current target, and $\alpha$ is the length normalization factor. We select the value of $\alpha$ that produces a higher BLEU score, as illustrated in Wu et al. (2016). The beam search is stopped when exactly $B$ finished candidates have been found Luong et al. (2015). In the worst case, the algorithm runs for a maximum of $T$ steps.
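As a minimal sketch of the ranking score above (the functional form of the length penalty follows Wu et al. (2016); the default $\alpha = 0.7$ used here is an assumption taken from the range they report, not a value stated in this paper):

```python
def length_penalty(length: int, alpha: float) -> float:
    """Length normalization factor lp(Y) from Wu et al. (2016)."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def score(log_prob: float, length: int, alpha: float = 0.7) -> float:
    """Length-normalized log-likelihood used to rank beam hypotheses."""
    return log_prob / length_penalty(length, alpha)

# A longer hypothesis with the same total log-probability is penalized less,
# so length normalization counteracts the usual bias toward short sequences.
assert score(-6.0, length=10) > score(-6.0, length=4)
```

With $\alpha = 0$ the penalty is 1 and the score reduces to the raw log-likelihood, recovering plain beam search ranking.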
Regular and Reverse model training
For the bidirectional beam search we train two different networks on the same dataset. The first seq2seq network, called "regular", is trained to predict the sentence in the regular order. The second network, called "reverse", is trained to predict the sentence in the reversed order. For example, if the regular network is trained with the pair "What do you like ?" / "I like cats !", the reverse network is trained with the pair "What do you like ?" / "! cats like I". During decoding, the reverse model estimates right-to-left dependencies while the regular model estimates left-to-right dependencies. (From a graph-topology viewpoint, decoding from the right side is very different from decoding from the left side: when exploring the graph from left to right, the regular seq2seq faces a huge number of very likely first tokens, while the reverse seq2seq has a very restricted choice, mainly punctuation. More details are included in the Supplementary.) Two different settings have been explored: (1) training two independent seq2seq models; (2) sharing the encoder of the two seq2seq models and training with the loss $\mathcal{L} = \mathcal{L}_{reg} + \mathcal{L}_{rev}$, where:
$\mathcal{L}_{reg}$ is the cross-entropy loss computed with the regular decoder and $\mathcal{L}_{rev}$ is the cross-entropy loss computed with the reverse decoder. Since both approaches exhibit comparable performance, we choose to share the encoder to minimise the number of parameters in our model.
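The target-side reversal used to build the reverse model's training pairs can be sketched as follows (whitespace tokenization is assumed here purely for illustration):

```python
def make_reverse_pair(source, target):
    """Build the training pair for the reverse decoder:
    same source sentence, target tokens in reversed order."""
    reversed_target = " ".join(reversed(target.split()))
    return source, reversed_target

# The example pair from the text:
src, rev = make_reverse_pair("What do you like ?", "I like cats !")
assert rev == "! cats like I"
```

Only the target side is reversed; the source is shared, which is what makes sharing the encoder between the two decoders natural.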
2.2 Beam Search with Bidirectional Scoring (BidiS)
Beam search generates word by word from left to right: the token generated at time step $t$ depends only on past tokens and is not affected by future tokens. Inspired by the work of Li et al. (2016a), we propose a Beam Search with Bidirectional Scoring (BidiS), which scores the $B$ best candidates generated by the regular seq2seq model as follows:

$s_{BidiS}(Y) = \log p_{reg}(Y|X) + \gamma \log p_{rev}(\overline{Y}|X)$

where $Y$ and $\overline{Y}$ represent the final sequence in the regular order and the reversed order, respectively. Moreover, $\log p_{reg}$ is computed using the regular model while $\log p_{rev}$ is computed using the reverse model. (The hyperparameter $\gamma$ compensates for the difference of scale between $\log p_{reg}$ and $\log p_{rev}$ and is optimised on the validation set.) The intuition is as follows: after the regular seq2seq generates its sentences, the reverse model computes $\log p_{rev}(\overline{Y}|X)$ and assigns higher probabilities to sequences presenting a more likely right-to-left structure and a more likely ending given the input. Since the $B$-best lists produced by our models are grammatically correct, the final selected options are well formed and present the best combination of both directions.
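A minimal sketch of the BidiS re-scoring step. It assumes we already have the regular model's candidates with their forward log-probabilities and a callable giving the reverse model's log-probability of the reversed sentence; the toy reverse model below is hypothetical, used only to make the example runnable:

```python
def bidis_rescore(candidates, log_p_reverse, gamma=1.0):
    """Re-rank candidates from the regular model using the reverse model.

    candidates:    list of (tokens, forward_logprob) pairs
    log_p_reverse: callable scoring the reversed token sequence
                   under the reverse model
    gamma:         scale-compensation hyperparameter (tuned on validation)
    """
    def bidi_score(item):
        tokens, fwd_lp = item
        rev_lp = log_p_reverse(list(reversed(tokens)))
        return fwd_lp + gamma * rev_lp

    return max(candidates, key=bidi_score)[0]

# Toy example: a hypothetical reverse model that favors endings
# it finds likely given the input (here, sentences ending with "!").
cands = [(["i", "do", "not", "know"], -2.0),
         (["i", "like", "cats", "!"], -2.5)]
toy_reverse = lambda rev_tokens: 0.0 if rev_tokens[0] == "!" else -3.0
assert bidis_rescore(cands, toy_reverse) == ["i", "like", "cats", "!"]
```

The safe but generic candidate wins on forward likelihood alone; the reverse-model term flips the ranking toward the sentence with the more input-relevant ending.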
2.3 Beam Search with Bidirectional Agreement (BidiA)
The previous algorithm has two weaknesses. Firstly, it introduces a hyperparameter $\gamma$. Secondly, the reverse model is only used to re-score the sentences generated by the regular model, meaning that potentially good sentences generated by the reverse model are not considered. We solve these two problems by proposing a Beam Search with Bidirectional Agreement (BidiA): a hyperparameter-free algorithm that also uses the best sequences according to the reverse seq2seq model. Formally, if $\mathcal{Y}_{reg}$ and $\mathcal{Y}_{rev}$ are the sets containing the $B$ sequences generated by the regular and reverse model respectively, we output $Y^{*}$ such that:

$Y^{*} = \underset{(Y, \overline{Y}) \in \mathcal{Y}_{reg} \times \mathcal{Y}_{rev}}{\arg\max}\; sim(Y, \overline{Y})$
where $sim$ represents any similarity measure between two sentences ($sim$ does not need to be differentiable). For our experiments, we propose two different choices: (1) an adaptation of the BLEU score, with the brevity penalty modified to foster longer responses; formally:

$sim_{BLEU}(Y, \overline{Y}) = BP \cdot \exp\Big(\sum_{n=1}^{N} w_{n} \log p_{n}\Big)$

where the brevity penalty $BP$ introduces diversity and fosters longer sentences, the exponential term is the geometric average of the modified n-gram precisions $p_{n}$ (using n-grams up to length $N$), and the $w_{n}$ are positive weights summing to one; (2) an adaptation of the Word Mover's Distance ($WMD$) Kusner et al. (2015) (stopwords are removed and the final score is multiplied by the brevity penalty previously defined) that captures the relationship between words by computing the "transportation" from one phrase to another conveyed by each word (implementation details are given in the Supplementary).
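The agreement step itself is a simple search over all pairs across the two beams. A minimal sketch, using unigram overlap as a hypothetical stand-in for the BLEU/WMD similarity criteria defined above:

```python
from itertools import product

def bidia_select(regular_beam, reverse_beam, sim):
    """Return the regular-order sentence from the closest pair across beams.

    regular_beam: sentences generated left-to-right
    reverse_beam: sentences generated right-to-left, re-read in regular order
    sim:          any (non-differentiable) similarity between two sentences
    """
    best_pair = max(product(regular_beam, reverse_beam),
                    key=lambda pair: sim(pair[0], pair[1]))
    return best_pair[0]

# Toy similarity: unigram overlap (a stand-in for sim_BLEU or WMD).
overlap = lambda a, b: len(set(a.split()) & set(b.split()))
reg = ["i do not know", "i like cats"]
rev = ["cats i like", "maybe"]
assert bidia_select(reg, rev, overlap) == "i like cats"
```

The sentence both directions agree on is selected, with no $\gamma$-style hyperparameter to tune; the cost is the $B^2$ pairwise similarity computations analysed in the complexity section.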
3.1 Corpora and Metrics
Corpora: We evaluate our algorithms on two spoken datasets (specific phenomena appear when working with spoken language Dinkar et al. (2020) compared to written text): (1) the Switchboard Dialogue Act Corpus (SwDA), a telephone speech corpus Stolcke et al. (1998) consisting of about 2,400 two-sided telephone conversations; (2) the Cornell Movie Corpus Danescu-Niculescu-Mizil and Lee (2011), which contains around 10K movie characters and around 220K dialogues.
Metrics: To evaluate the performance and language response quality of each decoding strategy, we use two classical metrics at the sentence level. (1) The BLEU-4 score Papineni et al. (2002), computed on unigrams, bigrams, trigrams and four-grams and then micro-averaged. (2) A diversity score, distinct-n Li et al. (2016a), defined as the number of distinct n-grams divided by the total number of generated words. Indeed, in neural response generation we want to avoid generating generic responses such as "I don't know", "Yes", or "No", and to foster meaningful responses.
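The distinct-n definition above can be sketched directly (whitespace tokenization is assumed for illustration):

```python
def distinct_n(sentences, n):
    """distinct-n (Li et al., 2016a): number of distinct n-grams
    divided by the total number of generated words."""
    ngrams = set()
    total_words = 0
    for sent in sentences:
        tokens = sent.split()
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return len(ngrams) / total_words if total_words else 0.0

# Generic responses repeat the same n-grams and therefore score low:
assert distinct_n(["i do not know", "i do not know"], 1) == 4 / 8
assert distinct_n(["i like cats", "you like dogs"], 1) == 5 / 6
```

Higher distinct-n thus rewards decoders whose responses vary across the test set rather than collapsing onto a few safe sentences.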
3.2 Response Quality
Figure 1 shows the results of our proposed systems on the BLEU-4 metric. We see that our proposed methods (BidiS and BidiA) achieve better performance than VBS, showing that bidirectionality boosts performance. $BidiA_{BLEU}$ achieves the best result overall, yielding a relative improvement of 9% on Cornell and 5% on SwDA. From Figure 1, we see that for both similarity metrics BidiA leads to better results than both other algorithms. The improvement of BidiS over the baseline VBS shows that the optimisation of $\gamma$ on the validation set generalises well to the test set. $BidiA_{BLEU}$ is slightly better than $BidiA_{WMD}$, which is likely related to the choice of evaluation metric. From Figure 1 we can also see that the BLEU-4 score of VBS stops increasing beyond a moderate beam size, whereas BidiS and BidiA keep improving the quality of the sequence as more hypotheses are proposed. This suggests that our bidirectional beam search is more efficient at extracting the best sentence as the number of hypotheses increases. Finally, VBS, BidiS and $BidiA_{WMD}$ present a drop in performance between 20 and 40 hypotheses: when performing the beam search with 20 hypotheses, we observe that the seq2seq is very confident about sentences that lead to lower BLEU-4 scores. Those sentences are no longer selected as the beam size increases, and better sentences are extracted. $BidiA_{BLEU}$ does not present this drop in performance; this is due to its metric choice (based on n-gram overlap), which selects different sentences from $BidiA_{WMD}$.
3.3 Rank Analysis
In this section we compare the index returned by BidiA with that of the Best Hypothesis, as shown in Figure 2. Figure 2 illustrates one of the limitations of likelihood-based ranking when an off-the-shelf metric is used for evaluation: the most likely sentences are not the ones with the highest BLEU-4. Interestingly, the index distribution of the Best Hypothesis is very similar for both Cornell and Switchboard, whereas for $BidiA_{BLEU}$ and $BidiA_{WMD}$ it varies. $BidiA_{BLEU}$, which has a better BLEU-4 (see Figure 1) than $BidiA_{WMD}$, has an index distribution more similar to that of the Best Hypothesis.
3.4 Diversity of the responses
Table 1 shows the performance on the diversity metrics. Overall, BidiA has the best performance among all strategies (improvement of up to 8% over the baseline on Cornell). By looking for an agreement between the reverse seq2seq and the regular one, BidiA is able to extract sequences that are less likely according to VBS but more diverse. In all cases, we see that bidirectionality helps produce more diverse sentences. Since the influence of the input decreases during generation, bidirectional beam search outputs sentences whose beginning and ending are both meaningful with respect to the input.
3.5 Complexity Analysis
In practical applications it is important to evaluate algorithm complexity when resources are limited. Table 2 shows that BidiA is computationally cheaper than VBS and that BidiS has the same complexity as VBS.
In this paper we showed that bidirectional beam search strategies can be leveraged to boost the performance of beam search. We introduced two novel re-ranking criteria that select more diverse sentences with higher BLEU-4 scores while reducing computational complexity. Future work includes testing our novel bidirectional strategies with other pretrained models such as those introduced in Jalalzai et al. (2020); Chapuis et al. (2021, 2020); Witon et al. (2018), with other types of data (e.g., multimodal Garcia et al. (2019); Colombo et al. (2021a)), on different tasks (e.g., style transfer Colombo et al. (2021b)), as well as exploring other stopping criteria Colombo et al. (2021c).
- Bojanowski et al. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
- Chapuis et al. (2021). Code-switched inspired losses for generic spoken dialog representations. arXiv preprint arXiv:2108.12465.
- Chapuis et al. (2020). Hierarchical pre-training for sequence labelling in spoken dialog. arXiv preprint arXiv:2009.11152.
- Chung et al. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
- Colombo et al. (2021a). Improving multimodal fusion via mutual dependency maximisation. arXiv preprint arXiv:2109.00922.
- Colombo et al. (2021b). A novel estimator of mutual information for learning to disentangle textual representations. arXiv preprint arXiv:2105.02685.
- Colombo et al. (2021c). Automatic text evaluation through the lens of wasserstein barycenters. arXiv preprint arXiv:2108.12463.
- Colombo et al. (2019). Affect-driven dialog generation.
- Danescu-Niculescu-Mizil and Lee (2011). Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.
- Dinkar et al. (2020). The importance of fillers for text representations of speech transcripts. arXiv preprint arXiv:2009.11340.
- Garcia et al. (2019). From the token to the review: a hierarchical multimodal approach to opinion mining. arXiv preprint arXiv:1908.11216.
- Jalalzai et al. (2020). Heavy-tailed representations, text polarity classification & data augmentation. arXiv preprint arXiv:2003.11593.
- Johnson et al. (2017). Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kusner et al. (2015). From word embeddings to document distances. In International Conference on Machine Learning, pp. 957–966.
- LeCun et al. (2015). Deep learning. Nature 521 (7553), pp. 436.
- Li et al. (2016a). A diversity-promoting objective function for neural conversation models. In HLT-NAACL.
- Li et al. (2016b). A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 994–1003.
- Lipton (2015). A critical review of recurrent neural networks for sequence learning. CoRR abs/1506.00019.
- Luong et al. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421.
- McCann et al. (2019). The natural language decathlon: multitask learning as question answering.
- Mimura et al. (2018). Forward-backward attention decoder. In INTERSPEECH.
- Nallapati et al. (2016). Abstractive text summarization using sequence-to-sequence rnns and beyond. In CoNLL, pp. 280–290.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Stolcke et al. (1998). Dialog act modeling for conversational speech. In AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, pp. 98–105.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, Cambridge, MA, USA, pp. 3104–3112.
- Venugopalan et al. (2015). Sequence to sequence - video to text. In The IEEE International Conference on Computer Vision (ICCV).
- Wen et al. (2015). Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL).
- Witon et al. (2018). Disney at IEST 2018: predicting emotions using an ensemble. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 248–253.
- Wu et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144.
- Zhang et al. (2018). Learning to control the specificity in neural response generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1108–1117.
5 Supplementary Material
5.1 Corpus analysis: importance of reverse model
In this section, we discuss the importance of using a reverse model. Figure 3 shows the word distribution on Cornell for the 50 most common words at each position. From the top plot we observe that the regular seq2seq faces a large number of likely choices: all fifty most common words appear frequently in position 1. The reverse seq2seq faces far fewer very likely choices in position 1. The reverse seq2seq is therefore less likely to propagate a mistake at time step 2.
5.2 Ideal reranking
In Table 3 we report the BLEU-4 achieved by VBS and by the Best Hypothesis: the best hypothesis alive in the beam. It illustrates the limitations of the likelihood criterion and shows that changing the final re-ranking and sentence-selection criterion can yield a higher BLEU-4.
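The "Best Hypothesis" oracle above can be sketched as follows; the recall-style metric below is a hypothetical, simplified stand-in for BLEU-4, used only to keep the example self-contained:

```python
def oracle_best(hypotheses, reference, metric):
    """'Best Hypothesis': the beam candidate an oracle re-ranker would pick,
    i.e. the one maximizing the evaluation metric against the reference."""
    return max(hypotheses, key=lambda h: metric(h, reference))

# Toy metric: fraction of reference words recovered (stand-in for BLEU-4).
recall = lambda h, r: len(set(h.split()) & set(r.split())) / len(set(r.split()))
hyps = ["i do not know", "i like cats !"]
assert oracle_best(hyps, "i like cats !", recall) == "i like cats !"
```

The gap between VBS's likelihood-ranked choice and this oracle choice is exactly the headroom that the BidiS and BidiA re-ranking criteria aim to close.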
5.3 Implementation of the similarity measure ($sim$)
In this section we describe the implementation of each similarity measure used in Section 3. We introduced a brevity penalty for two main reasons:
preliminary experiments have shown that the regular seq2seq tends to generate short sentences due to the data distribution.
if no brevity penalty is introduced and both neural networks generate "I don't know", the selected sentence will be "I don't know", since the similarity measure will be 1. With a brevity penalty, the similarity metric can select a less generic choice.
$WMD$ uses the wmd-relax library (https://github.com/src-/wmd-relax); the embeddings come from the fastText library Bojanowski et al. (2017). First, stopwords according to the NLTK list are removed; then the Word Mover's Distance is computed and multiplied by the brevity penalty previously defined to obtain the score used in Equation 9.
5.4 Architecture details
We evaluate our proposed algorithms using off-the-shelf seq2seq models. For the encoder, we use a two-layer bidirectional GRU Chung et al. (2014). For the decoder, we use a one-layer unidirectional GRU with attention Luong et al. (2015). The embedding layer is initialized with fastText word vectors pre-trained on Wikipedia 2017, the UMBC web-based corpus and the statmt.org news dataset Bojanowski et al. (2017). We use the ADAM optimizer Kingma and Ba (2014), with the learning rate updated by a scheduler with fixed patience and decrease rate; gradient-norm clipping, weight decay and dropout LeCun et al. (2015) are also applied. The models have been implemented with PyTorch; they have been trained, validated and tested on disjoint splits of the data. Since our purpose is to show that bidirectionality can boost beam search, we use the setting of subsection 2.1.
5.5 Proofs of Complexity analysis
5.5.1 VBS complexity
For VBS, at each time step $t$, the $B \times V$ candidate hypotheses are re-ranked and the $B$ most likely are kept. The final average complexity is:

$\mathcal{O}_{VBS} = \mathcal{O}\left(T \cdot BV \log B\right)$
5.5.2 BidiS complexity
In the case of BidiS, the algorithm generates $B$ sequences using VBS, and then for each generated sequence it computes $\log p_{rev}$ with complexity $\mathcal{O}(TV)$. The final step includes a sorting of complexity $\mathcal{O}(B \log B)$. (These terms can be neglected since $B$ and $T$ are of much lower order than $V$.) BidiS complexity is therefore:

$\mathcal{O}_{BidiS} = \mathcal{O}\left(T \cdot BV \log B + B \cdot TV + B \log B\right) = \mathcal{O}\left(T \cdot BV \log B\right)$
5.5.3 BidiA complexity
Word Mover's Distance criterion: According to Kusner et al. (2015), the computational cost of the Word Mover's Distance is $\mathcal{O}(p^{3} \log p)$, where $p$ denotes the number of unique words in the documents. In our case the distance is computed between two sequences of length at most $T$, hence $p \leq 2T$. The $BidiA_{WMD}$ complexity with the Word Mover's Distance as selection criterion is given by the following formula:

$\mathcal{O}_{BidiA_{WMD}} = \mathcal{O}\left(T \cdot BV \log B + B^{2} \cdot T^{3} \log T\right)$

In general $V \gg B$ and $V \gg T$, so in Equation 9 the second term is small compared to the first term, hence $\mathcal{O}_{BidiA_{WMD}} \approx \mathcal{O}(T \cdot BV \log B)$. Even though $V$ dominates the complexity of the algorithm, $BidiA_{WMD}$ is still more efficient than VBS. (For typical values of $T$, $B$ and $V$ we indeed observe $\mathcal{O}_{BidiA_{WMD}} \leq \mathcal{O}_{VBS}$.)
BLEU criterion: the computational cost of the BLEU score is polynomial in $T$. The $BidiA_{BLEU}$ complexity with the BLEU score as selection criterion is given by the following formula:

$\mathcal{O}_{BidiA_{BLEU}} = \mathcal{O}\left(T \cdot BV \log B + B^{2} \cdot \mathrm{poly}(T)\right)$