There is a considerable amount of research on boosting techniques for RNN-based sequence-to-sequence (seq2seq) models, but, due to its novelty, little for the Transformer model. The Transformer is becoming the state of the art in machine translation, showing significant improvements over seq2seq models, but its performance on the abstractive summarization task has not been explicitly investigated. This motivated us to apply two architecture-independent methods that gave improvements with seq2seq models, the Large Vocabulary Trick and Feature-rich encoding, to the Transformer model, and to compare the obtained results with those of the base Transformer model and of the models evaluated in Nallapati et al. (2016).
The paper is laid out as follows: Section 2 gives a short review of related work; Section 3 describes the approaches we applied; Section 4 describes the data and the system; Section 5 reports the results and their analysis; and Section 6 concludes.
2 Related Work
Currently, several fundamental Neural Machine Translation models, which are also applicable to text summarization, compete to be the state of the art: recurrent and convolutional neural networks Bahdanau et al. (2015); Gehring et al. (2017), and the more recent attention-based Transformer model Vaswani et al. (2017). All of them have an encoder-decoder structure, but in the first both the encoder and the decoder are LSTM layers with an attention layer between them; the second is based on convolutions; while in the Transformer the encoder stack consists of a number of layers, each composed of a multi-head self-attention layer and a feed-forward neural network, and the decoder stack is built the same way except that each layer has an intermediate attention layer over the encoder stack output. The default Transformer model uses Byte-Pair Encoding (BPE) Sennrich et al. (2016) to encode the text data.
Motivated by Nallapati et al. (2016), we decided to apply the Large Vocabulary Trick (LVT) Jean et al. (2015) and Feature-rich encoding (FRE) to the Transformer model. The main idea of LVT is to work only with the subset of the vocabulary that is relevant to the batch currently being processed: the words occurring in the batch, padded with the most frequent words up to a fixed limit. LVT considerably lowers the training time when the vocabulary is very large. FRE adds extra information to the input: for each word, its part-of-speech (POS) and named-entity (NER) tags, and its term frequency (TF) and inverse document frequency (IDF) statistics.
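To make the FRE statistics concrete, the TF and IDF values, and the binning applied to them before one-hot encoding, can be computed with plain Python. This is a minimal sketch: the function names and bin edges are our own illustration, not the original implementation.

```python
import math
from collections import Counter

def tf_idf_features(docs):
    """Compute per-word TF and IDF statistics as used by FRE.
    TF is a word's relative frequency within its document;
    IDF is log(N / df) over the whole document collection."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency
    idf = {w: math.log(n_docs / df[w]) for w in df}
    tf = []
    for doc in docs:
        counts = Counter(doc)
        tf.append({w: counts[w] / len(doc) for w in counts})
    return tf, idf

def discretize(value, bin_edges):
    """Map a continuous feature (TF or IDF) to a bin index,
    which is then one-hot encoded in the input vector."""
    for i, edge in enumerate(bin_edges):
        if value < edge:
            return i
    return len(bin_edges)
```

POS tags, being already categorical, go straight to one-hot encoding without this discretization step.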
3 Models

3.1 BPE-based Transformer model
For comparison purposes, we used the Transformer with default settings, as described in Vaswani et al. (2017). Byte-Pair Encoding (BPE) Sennrich et al. (2016) was used to encode the input sequence, with the sub-word vocabulary size set to 32K. The embedding layer was shared between the encoder and decoder parts of the model and was initialized randomly.
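BPE builds its sub-word vocabulary by repeatedly merging the most frequent adjacent symbol pair; a full run would use on the order of 32K merge operations, as in our setup. The following toy sketch of the merge loop (our own simplified illustration of Sennrich et al. (2016); the naive string replacement ignores symbol-boundary edge cases) shows the idea:

```python
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency.
    Words are stored as space-separated symbol sequences."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given pair into one symbol
    (naive replace, for illustration only)."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Learn BPE merge operations by greedily merging the most
    frequent pair num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

At inference time, applying the learned merges in order segments any word, including out-of-vocabulary ones, into known sub-words.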
3.2 Baseline model

As a baseline model we took a Transformer based on a word vocabulary instead of BPE. The size of the vocabulary obtained from the dataset is 123K. The embedding layers of the encoder and decoder parts of the model were separate, and their weights were also initialized randomly.
3.3 FRE fit-to-hidden (fre-f2h)
This model uses the Feature-rich encoding technique to extend the word embedding vectors at the input of the Transformer encoder, so the output of the encoder embedding layer is a concatenation of the following sub-vectors: the word embedding, the word's POS tag, and its TF and IDF values. POS tags are represented by one-hot encoding. Continuous features such as TF and IDF were converted into categorical values by discretizing them into a fixed number of bins and one-hot encoding the bin number they fall into. Word embedding weights were initialized randomly. The embedding layer of the decoder part of the Transformer was the same as in the baseline model described in the previous subsection. The vectors produced by the encoder and decoder embedding layers must have a dimension equal to the hidden size of the Transformer model; thus, we limit the dimensions of the sub-vectors in the encoder embedding layer so that the concatenated vector exactly fits the hidden size of the model.
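The dimension bookkeeping above can be sketched with NumPy. The sub-vector sizes below are hypothetical choices for illustration (the paper does not report them); the only constraint fre-f2h imposes is that they sum to the model hidden size.

```python
import numpy as np

HIDDEN = 512            # Transformer hidden size (default configuration)
POS_DIM = 45            # e.g. size of the Penn Treebank tagset (assumed)
TF_BINS = 10            # number of discretization bins (our choice)
IDF_BINS = 10
WORD_DIM = HIDDEN - (POS_DIM + TF_BINS + IDF_BINS)   # remaining 447 dims

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def encode_token(word_embedding, pos_id, tf_bin, idf_bin):
    """Concatenate the word embedding with one-hot POS, TF-bin and
    IDF-bin sub-vectors; the result already fits the hidden size."""
    assert word_embedding.shape == (WORD_DIM,)
    return np.concatenate([
        word_embedding,
        one_hot(pos_id, POS_DIM),
        one_hot(tf_bin, TF_BINS),
        one_hot(idf_bin, IDF_BINS),
    ])
```

Shrinking the word-embedding sub-vector to make room for the features is the cost this variant pays for staying within the hidden size.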
3.4 FRE linear-map-to-hidden (fre-lm2h)
The difference from the previously described model is that here we add a linear layer without bias to the encoder embedding layer, mapping the vector obtained by concatenating the sub-vectors to the hidden size of the Transformer model. Thus, the dimension of the concatenated vector does not have to match the hidden size of the model.
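The extra projection is a single matrix multiplication. In this sketch the concatenated feature dimension is a hypothetical 600, and the matrix is randomly initialized; in the real model it is a learned parameter trained jointly with the rest of the network.

```python
import numpy as np

HIDDEN = 512            # Transformer hidden size
FEATURE_DIM = 600       # hypothetical concatenated size; need not equal HIDDEN
rng = np.random.default_rng(0)

# Bias-free linear layer mapping features to the hidden size.
W = rng.normal(scale=0.02, size=(FEATURE_DIM, HIDDEN))

def project_to_hidden(concat_vec):
    """Map the concatenated feature vector to the model hidden size."""
    return concat_vec @ W
```

Because the projection absorbs the dimension mismatch, the word-embedding sub-vector no longer has to shrink to accommodate the feature sub-vectors.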
3.5 LVT + FRE
In this model we change the embedding layer of the decoder part of the Transformer using the Large Vocabulary Trick. For each training batch, we build a new batch vocabulary from the words of all texts in that batch. The required decoder vocabulary size is set to 2K, so if the words from the batch texts do not fill it, we extend it with the most frequent words. The weights of the decoder embedding layer are the same as in the previous models, but during training we use and update only the weights of the words present in the current batch vocabulary. During inference we use the whole vocabulary. The encoder embedding layer is the same as in the previous model.
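The batch-vocabulary construction can be sketched as follows; the function and parameter names are our own illustration, with the 2K target size taken from our experiments.

```python
from collections import OrderedDict
from itertools import chain

def build_batch_vocab(batch_texts, global_freq_order, target_size=2000):
    """LVT decoder vocabulary for one batch: every word occurring in the
    batch texts, topped up with the globally most frequent words until
    the target size is reached.

    batch_texts: list of token lists for the current batch.
    global_freq_order: all corpus words, sorted by descending frequency.
    """
    vocab = OrderedDict()
    for word in chain.from_iterable(batch_texts):
        vocab.setdefault(word, None)             # words from the batch first
    for word in global_freq_order:               # then frequent fillers
        if len(vocab) >= target_size:
            break
        vocab.setdefault(word, None)
    return list(vocab)[:target_size]
```

During training, the softmax and embedding updates are then restricted to the indices of these words, which is what shortens the epoch time.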
4 Experimental Setup
4.1 Data
We used the Gigaword corpus (https://github.com/alesee/abstractive-text-summarization) for training the models; it consists of 3.6 million article-title pairs. We could not acquire the annotated version of it, so we annotated it ourselves, but we did not add named-entity tags, because the Named-entity Recognition tools we tried (StanfordNERTagger, SennaNERTagger) performed poorly: the only version of the corpus we could find was lower-cased. We deduplicated the data and divided it into two parts: a validation set of 2,000 sentences and a training set. The validation files were used to monitor the convergence of training.
We used the DUC 2003 corpus for testing the models, so that we could also compare our results with those reported in Nallapati et al. (2016).
4.2 System Setup
Since most of the operations inside the model are numeric and easily parallelizable, an NVIDIA GTX 1080 Ti with 11 GB of GPU memory was used to speed up training.
5 Results
Firstly, we trained the BPE-based Transformer model for 5 epochs; each epoch took 3 hours 3 minutes. We got good results: 7.96 Rouge-2 and 21.54 Rouge-L (Table 1).
Secondly, we trained the baseline model, also for 5 epochs; each one took 3 hours 52 minutes. The results got worse: 7.31 Rouge-2 and 20.29 Rouge-L (Table 1).
Thirdly, FRE fit-to-hidden was trained, also for 5 epochs; each took 3 hours 54 minutes. The results got worse again: 6.06 Rouge-2 and 18.88 Rouge-L (Table 1).
Fourthly, we tried FRE linear-map-to-hidden, also for 5 epochs; each epoch took 3 hours 55 minutes. The results on DUC 2003 got worse again: 5.87 Rouge-2 and 18.34 Rouge-L (Table 1), but its validation scores are very close to those of fre-f2h, and its Rouge-L outperformed fre-f2h on the 5th epoch, as can be seen in Figure 2.
Finally, we trained LVT + FRE for 10 epochs; each took 2 hours 18 minutes. The results plummeted: 1.47 Rouge-2 and 9.16 Rouge-L (Table 1). Most probably, 10 epochs were not enough for the embeddings to train, since only 2K of them were updated in each batch. In Figure 3 we can see that the cross-entropy training loss (blue) of LVT + FRE is approximately the same as that of the BPE-based model in Figure 4, but the evaluation line (red) is much higher, i.e., worse. This is because the evaluation loss is computed over the whole vocabulary, while the training loss is computed over the 2K vocabulary, which is unique to each batch.
6 Conclusion
In this work we evaluated the default BPE-based Transformer model and, as a baseline, a Transformer with a word vocabulary, and tried to apply the FRE and LVT approaches on top of it. Validation scores showed that these approaches do not give improvements and even worsen the quality relative to the baseline. We found that the default BPE-based Transformer model gives the best result among all evaluated models. We used the DUC 2003 dataset as a test set to compare our models with the models evaluated in Nallapati et al. (2016). The BPE-based Transformer model outperforms the TOPIARY and ABS models, but performs worse than the models proposed by the authors of Nallapati et al. (2016). The FRE and LVT approaches also perform worse than the baseline, which in turn performs worse than the BPE-based model.
We think that FRE does not give improvements over the baseline because the quality of the Gigaword corpus we annotated ourselves is worse than that of the original annotated Gigaword dataset. LVT requires more iterations to converge, but the convergence is so slow that we are not sure it would even reach the baseline results. Thus, applying LVT to the Transformer in the form described here does not make sense: even though it shortens the training time of one epoch, it does not improve the overall result.
Also, we did not try any model with pretrained word embeddings.
Examples of source sentences, ground truth summaries, and generated summaries:

Source: schizophrenia patients whose medication could n't stop the imaginary voices in their heads gained some relief after researchers repeatedly sent a magnetic field into a small area of their brains .
Ground truth summary: Magnetic pulse series sent through brain may ease schizophrenic voices
Model outputs:
schizophrenia patients gain some relief
study shows link between schizophrenia patients
study links schizophrenia to schizophrenia
researchers say they can t stop some people from schizophrenia
LVT + FRE: nasal implants pose dilemma

Source: china was evacuating 330,000 people friday from land along the raging yangtze river that officials were preparing to sacrifice to flooding to safeguard cities downstream .
Ground truth summary: Chinese military personnel conducting extensive flood control efforts along Yangtze.
Model outputs:
china orders soldiers to fight to the death
china orders soldiers to fight yangtze floods
china orders soldiers to fight floods
china orders soldiers to fight floods
LVT + FRE: china mobilizes soldiers to safeguard potable reservoirs

Source: the czech republic and hungary will not compete with each other in their bids to join nato and the european union -lrb- eu -rrb- , the hungarian telegraph agency reported
Ground truth summary: Czech Republic, Hungary vow not to compete for NATO bid.
Model outputs:
czech republic hungary not to compete in nato
czech hungary not to compete for nato eu membership
czech republic hungary to compete in nato
hungary czech republic not to compete in nato bids
LVT + FRE: prague czechs not intend to participate in joining nato
References

- Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA.
- Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. (2017). Convolutional sequence to sequence learning. CoRR, abs/1705.03122.
- Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015), Beijing, China, pp. 1-10.
- Lin, C.-Y. (2004). Looking for a few good metrics: automatic summarization evaluation - how many samples are enough? In Proceedings of the Fourth NTCIR Workshop (NTCIR-4), Tokyo, Japan.
- Nallapati, R., Zhou, B., dos Santos, C., Gulcehre, C., and Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), Berlin, Germany, pp. 280-290.
- Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. CoRR, abs/1706.03762.