Experiments with LVT and FRE for Transformer model

04/26/2020 ∙ by Ilshat Gibadullin, et al. ∙ JSC Innopolis

In this paper, we experiment with the Large Vocabulary Trick and Feature-rich encoding applied to the Transformer model for Text Summarization. We could not achieve better results than the analogous RNN-based sequence-to-sequence model, so we tried more models to find out what improves the results and what deteriorates them.




1 Introduction

There is a considerable amount of research on additional improvements for RNN-based sequence-to-sequence (seq2seq) models, but due to its novelty, there is a lack of such work for the Transformer model. The Transformer is becoming the state of the art in machine translation, showing significant improvements over seq2seq models, but its performance on abstractive summarization has not been explicitly investigated. This motivated us to apply two architecture-independent methods that gave improvements with seq2seq models, the Large Vocabulary Trick and Feature-rich encoding, to the Transformer model, and to compare the obtained results with those of the base Transformer model and with the results reported in Nallapati et al. (2016).

The paper is laid out as follows: Section 2 gives a short review of related work; Section 3 describes the approaches we applied; Section 4 describes the data and the system; Section 5 reports the results and their analysis; and finally Section 6 sums it all up.

2 Related Work

Currently there are several fundamental Neural Machine Translation models competing to be the state of the art, which are also applicable to Text Summarization: recurrent and convolutional neural networks Bahdanau et al. (2015); Gehring et al. (2017), and the more recent attention-based Transformer model Vaswani et al. (2017). All of these models consist of encoder and decoder parts. In the first, both encoder and decoder are LSTM layers with an attention layer between them; the second is based on convolutions; in the Transformer, the encoder stack consists of a number of layers composed of a multi-head self-attention layer and a feed-forward neural network, and the decoder stack is the same except that each layer has an intermediate attention layer over the encoder stack output. The default Transformer model uses Byte-Pair Encoding (BPE) Sennrich et al. (2016) to encode the text data.

Motivated by Nallapati et al. (2016), we decided to apply the Large Vocabulary Trick (LVT) Jean et al. (2015) and Feature-rich encoding (FRE) to the Transformer model. The main idea of LVT is to work only with the subset of the vocabulary relevant to the current batch: the words from the batch texts, padded with the most frequent words up to a fixed limit. LVT considerably lowers the training time when the vocabulary is very large. FRE adds extra information to the input: for each word, part-of-speech (POS) and named-entity (NER) tags, and term frequency (TF) and inverse document frequency (IDF) statistics.
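As a concrete illustration, the TF and IDF statistics used by FRE can be computed as follows. This is a minimal sketch using the standard definitions (the paper does not specify the exact formulas, so these are assumed), with invented toy documents:

```python
import math
from collections import Counter

def tf_idf_features(docs):
    """Per-word TF (within each document) and IDF (across documents).

    Standard definitions are assumed: tf = count / doc length,
    idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in df}
    features = []
    for doc in docs:
        counts = Counter(doc)
        tf = {w: counts[w] / len(doc) for w in counts}
        features.append([(w, tf[w], idf[w]) for w in doc])
    return features

docs = [["china", "orders", "soldiers"], ["china", "evacuates", "soldiers"]]
feats = tf_idf_features(docs)
```

In the model these continuous values are then discretized into bins, as described in Section 3.3.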

3 Models

3.1 BPE-based Transformer model

For comparison purposes, we used the Transformer with default settings, as described in Vaswani et al. (2017). Byte-Pair Encoding (BPE) Sennrich et al. (2016) was used to encode the input sequence, with a subword vocabulary size of 32K. The embedding layer was shared between the encoder and decoder parts of the model and was initialized randomly.
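For illustration, the core of BPE vocabulary learning (Sennrich et al., 2016) can be sketched as follows. This toy version learns only a few merges on three invented words, whereas the paper uses a 32K subword vocabulary learned from the corpus:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # each word starts as a sequence of characters plus an end-of-word marker
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the merge everywhere it occurs
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "lower", "lowest"], num_merges=2)
```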

3.2 Baseline

As a baseline model we took the Transformer based on a word vocabulary instead of BPE. The size of the vocabulary obtained from the dataset is 123K. Separate embedding layers were used for the encoder and decoder parts of the model, with weights also initialized randomly.

3.3 FRE fit-to-hidden (fre-f2h)

This model uses the Feature-rich encoding technique to extend the word embedding vector at the input of the Transformer encoder, so the output of the encoder embedding layer is a concatenation of the following sub-vectors: the word embedding, the POS tag of the word, and the TF and IDF of the word. POS tags are represented by one-hot encoding. Continuous features, such as TF and IDF, were converted into categorical values by discretizing them into a fixed number of bins and one-hot encoding the bin number they fall into. Word embedding weights were initialized randomly. The embedding layer of the decoder part of the Transformer was the same as in the Baseline model described in the previous subsection. The vectors produced by the encoder and decoder embedding layers must have dimensions equal to the hidden size of the Transformer model; thus, we limit the dimensions of the sub-vectors in the encoder embedding layer so that the concatenated vector exactly fits the hidden size.
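A minimal sketch of this fit-to-hidden encoder embedding, assuming a toy hidden size of 16, a two-tag POS set, and four TF/IDF bins (all illustrative choices; the real model uses the Transformer hidden size, a full tag set, and learned word embeddings):

```python
HIDDEN = 16
POS_TAGS = ["NOUN", "VERB"]  # assumed toy tag set
N_BINS = 4                   # assumed number of TF/IDF bins

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def bin_index(value, n_bins):
    """Discretize a value in [0, 1) into one of n_bins bins."""
    return min(int(value * n_bins), n_bins - 1)

def fre_f2h_embed(word_vec, pos, tf, idf):
    """Concatenate word embedding + one-hot POS + binned TF + binned IDF.

    The word embedding dimension is chosen so that the concatenation
    exactly fits the hidden size: HIDDEN - len(POS_TAGS) - 2 * N_BINS.
    """
    feat = (one_hot(POS_TAGS.index(pos), len(POS_TAGS))
            + one_hot(bin_index(tf, N_BINS), N_BINS)
            + one_hot(bin_index(idf, N_BINS), N_BINS))
    assert len(word_vec) + len(feat) == HIDDEN
    return word_vec + feat

word_vec = [0.1] * (HIDDEN - len(POS_TAGS) - 2 * N_BINS)  # 6-dim word embedding
vec = fre_f2h_embed(word_vec, pos="NOUN", tf=0.3, idf=0.7)
```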

3.4 FRE linear-map-to-hidden (fre-lm2h)

The difference from the previously described model is that here we use an additional linear layer without bias in the encoder embedding layer to map the concatenation of sub-vectors to the hidden size of the Transformer model. Thus, the dimension of the concatenated vector does not have to fit the hidden size of the model.
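The bias-free projection can be sketched as follows, with illustrative dimensions (the real model uses the Transformer hidden size and learns the weight matrix jointly with the rest of the network):

```python
import random

HIDDEN = 16      # assumed toy hidden size
CONCAT_DIM = 23  # word embedding + POS + TF + IDF sub-vectors, unconstrained

random.seed(0)
# weight matrix of the bias-free linear layer, HIDDEN x CONCAT_DIM
W = [[random.gauss(0, 0.02) for _ in range(CONCAT_DIM)] for _ in range(HIDDEN)]

def linear_map(x):
    """y = W x, no bias: projects the concatenated vector to HIDDEN dims."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

concat_vec = [1.0] * CONCAT_DIM
hidden_vec = linear_map(concat_vec)
```

The design trade-off versus fre-f2h is that the sub-vector dimensions become free hyperparameters, at the cost of extra parameters in the projection.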

3.5 FRE + LVT

In this model we modify the embedding layer of the decoder part of the Transformer with the Large Vocabulary Trick. For each training batch, we build a new batch vocabulary from the words of all texts in that batch. The required decoder vocabulary size is set to 2K, so if the words from the batch texts do not fill it, we extend it with the most frequent words. The weights of the decoder embedding layer are the same as in the previous models, but during training we use and update only the weights of those words that are in the current batch vocabulary. During inference we use the whole vocabulary. The encoder embedding layer is the same as in the previous model.
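The batch-vocabulary construction described above can be sketched as follows (a minimal version with invented toy data; `global_freqs` maps each vocabulary word to its corpus frequency):

```python
from collections import Counter

def batch_vocabulary(batch_texts, global_freqs, limit=2000):
    """LVT decoder vocabulary for one batch: all words seen in the batch
    texts, padded with the globally most frequent words up to `limit`."""
    vocab = set()
    for text in batch_texts:
        vocab.update(text.split())
    # pad with the most frequent corpus words until the limit is reached
    for word, _ in global_freqs.most_common():
        if len(vocab) >= limit:
            break
        vocab.add(word)
    return vocab

global_freqs = Counter({"the": 100, "a": 90, "china": 5, "floods": 3})
vocab = batch_vocabulary(["china orders soldiers"], global_freqs, limit=5)
```

During training, the softmax and the embedding updates are then restricted to this per-batch vocabulary.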

4 Experimental Setup

4.1 Data

We used the Gigaword corpus (https://github.com/alesee/abstractive-text-summarization) for training the models; it consists of 3.6 million article-title pairs. We could not acquire the annotated version of it, so we annotated it ourselves, but we did not add named-entity tags, because the Named-entity Recognition tools we tried (StanfordNERTagger, SennaNERTagger) performed poorly, since the corpus was lower-cased and that was the only version we could find. We deduplicated the data and divided it into two parts: a validation set of 2000 sentences and a training set. The validation files were used to monitor the convergence of training.

We used the DUC 2003 corpus for testing the models, so that we could also compare our results with those reported in Nallapati et al. (2016).

4.2 System Setup

The Transformer model Vaswani et al. (2017) with the base setting from tensorflow/official/models (https://github.com/tensorflow/models) was used in the experiments. To evaluate the quality of the summarization, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric Lin (2004) was used.
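For reference, the recall component of ROUGE-N can be sketched as follows. This is a minimal illustration of the metric on an invented sentence pair; the experiments use the full ROUGE toolkit, which also reports precision and F-scores, and ROUGE-L uses longest common subsequences instead of n-grams:

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=2):
    """ROUGE-N recall: fraction of reference n-grams found in the
    candidate, with clipped counts."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

score = rouge_n_recall("china orders soldiers to fight floods",
                       "china orders soldiers to fight yangtze floods", n=2)
```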

4.3 Hardware

Since most of the operations inside the model are numeric and easily parallelizable, an NVIDIA GTX 1080 Ti with 11 GB of GPU memory was used to speed up the process.

5 Results

Firstly, we trained the BPE-based Transformer model for 5 epochs; each epoch took 3 hours 3 minutes. We got good results: 7.96 Rouge-2 and 21.54 Rouge-L (Table 1).
Secondly, we trained the baseline model, also for 5 epochs; each one took 3 hours 52 minutes. The results got worse: 7.31 Rouge-2 and 20.29 Rouge-L (Table 1).

Thirdly, FRE fit-to-hidden was trained, also for 5 epochs; each took 3 hours 54 minutes. The results got worse again: 6.06 Rouge-2 and 18.88 Rouge-L (Table 1).

Fourthly, we tried FRE linear-map-to-hidden, also for 5 epochs; each epoch took 3 hours 55 minutes. The results on DUC 2003 got worse again: 5.87 Rouge-2 and 18.34 Rouge-L (Table 1), but its validation scores are very close to those of fre-f2h, and its Rouge-L outperformed fre-f2h on the 5th epoch, as can be seen in Figure 2.

Figure 1: Rouge-2 validation scores.
Figure 2: Rouge-L validation scores.
Model Rouge-1 Rouge-2 Rouge-L
bpe-based 23.43 7.96 21.54
baseline 21.98 7.31 20.29
fre-f2h 20.53 6.06 18.88
fre-lm2h 20.01 5.87 18.34
fre+lvt 9.62 1.47 9.16
TOPIARY 25.12 6.46 20.12
ABS 26.55 7.06 22.05
ABS+ 28.18 8.49 23.81
RAS-Elman 28.97 8.26 24.06
words-lvt2k-1s 28.35 9.46 24.59
words-lvt5k-1s 28.61 9.42 25.24
Table 1: DUC 2003 test scores. Our models' scores are in the upper part and the scores from Nallapati et al. (2016) in the lower part.

Finally, we trained FRE+LVT for 10 epochs; each took 2 hours 18 minutes. The results plummeted: 1.47 Rouge-2 and 9.16 Rouge-L (Table 1). Most probably, 10 epochs were not enough for the embeddings to train, since only 2K of them were updated on each batch. In Figure 3, we can see that the cross-entropy loss (blue) of FRE+LVT is approximately the same as that of the BPE-based model in Figure 4, but the evaluation line (red) is much higher, meaning worse. This is because the evaluation loss is computed using the whole vocabulary, while the training loss uses the 2K vocabulary, which is unique for each batch.

Figure 3: Cross-entropy loss during training of FRE+LVT model.
Figure 4: Cross-entropy loss during training of BPE-based Transformer model.

6 Conclusion

In this work we evaluated the default BPE-based Transformer model and a Transformer with a word vocabulary as baseline, and tried to apply the FRE and LVT approaches on top of the baseline. Validation scores showed that these approaches do not give improvements and even worsen the quality relative to the baseline. We found that the default BPE-based Transformer model gives the best result among all evaluated models. We used the DUC 2003 dataset as a test set to compare our models with those evaluated in Nallapati et al. (2016). The BPE-based Transformer model outperforms the TOPIARY and ABS models, but performs worse than the models proposed by the authors of Nallapati et al. (2016). The FRE and LVT approaches perform worse than the baseline, which in turn performs worse than the BPE-based model.

We think that FRE does not give improvements over the baseline because the quality of the Gigaword corpus we annotated ourselves is worse than that of the original annotated Gigaword dataset. LVT requires more iterations to converge, but the convergence is so slow that we are not sure it would even reach the baseline results. Thus, applying LVT to the Transformer in the form we described does not make sense: even if it shortens the training time of one epoch, it does not improve the overall result.

Also, we did not try any model with pretrained word embeddings.

Source Document
schizophrenia patients whose medication could n’t stop the imaginary voices in their
heads gained some relief after researchers repeatedly sent a magnetic field into a
small area of their brains .
Ground truth Summary
Magnetic pulse series sent through brain may ease schizophrenic voices
schizophrenia patients gain some relief
study shows link between schizophrenia patients
study links schizophrenia to schizophrenia
researchers say they can t stop some people from schizophrenia
nasal implants pose dilemma

Source Document
china was evacuating 330,000 people friday from land along the raging yangtze river
that officials were preparing to sacrifice to flooding to safeguard cities downstream .
Ground truth Summary
Chinese military personnel conducting extensive flood control efforts along Yangtze.
china orders soldiers to fight to the death
china orders soldiers to fight yangtze floods
china orders soldiers to fight floods
china orders soldiers to fight floods
china mobilizes soldiers to safeguard potable reservoirs

Source Document
the czech republic and hungary will not compete with each other in their bids to join
nato and the european union -lrb- eu -rrb- , the hungarian telegraph agency reported
today .
Ground truth Summary
Czech Republic, Hungary vow not to compete for NATO bid.
czech republic hungary not to compete in nato
czech hungary not to compete for nato eu membership
czech republic hungary to compete in nato
hungary czech republic not to compete in nato bids
prague czechs not intend to participate in joining nato
Table 2: Examples.


  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. CoRR abs/1705.03122.
  • S. Jean, K. Cho, R. Memisevic, and Y. Bengio (2015) On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL 2015, Beijing, China, pp. 1–10.
  • C. Lin (2004) Looking for a few good metrics: automatic summarization evaluation - how many samples are enough?. In Proceedings of the Fourth NTCIR Workshop on Research in Information Access Technologies, NTCIR-4, Tokyo, Japan.
  • R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, and B. Xiang (2016) Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, pp. 280–290.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762.