Recurrent neural networks (RNNs) and architectures built on them, such as the LSTM [Hochreiter and Schmidhuber1997], have been used to process sequential data for more than a decade. Recently, alternative architectures such as convolutional networks [Dauphin et al.2017, Gehring et al.2017] and the transformer [Vaswani et al.2017]
have been used extensively and have achieved state-of-the-art results on diverse natural language processing (NLP) tasks. In particular, pre-trained models based on the transformer architecture, such as the OpenAI transformer [Radford et al.2018] and BERT [Devlin et al.2018], have significantly improved accuracy on a range of benchmarks.
In this paper, we introduce a new dataset, which we call ParagraphOrdering, and test the ability of the above models on it. We drew inspiration for our task from the paper "Learning and Using the Arrow of Time" [Wei et al.2018].
That work sought to understand the arrow of time in videos: given ordered frames from a video, decide whether the video is playing forward or backward. The authors hypothesized that a deep learning algorithm must have a good grasp of physical principles (e.g., water flows downward) to be able to predict the order of frames in time.
Inspired by this work, we define a similar task in the domain of NLP: given two paragraphs, decide whether the second paragraph really comes after the first or the order has been reversed. This amounts to learning the arrow of time in stories and can be very beneficial for neural story generation tasks. Moreover, the task is self-supervised: the labels come from the text itself.
| First Paragraph | Now they were walking through the trees, one of them carrying him in its huge arms, quite gently. He was scarcely conscious of his surroundings. It was becoming more and more difficult to breathe. |
| Second Paragraph | Then he felt himself laid down on something soft and dry. The water was not falling on him now. He opened his eyes. |

Table 1: An example paragraph pair from the dataset.
2 Paragraph Ordering Dataset
We have prepared a dataset, ParagraphOrdering, consisting of around 300,000 paragraph pairs collected from Project Gutenberg. We wrote an API for gathering and pre-processing the data into the appropriate format for the task.[1] Each example contains two paragraphs and a label indicating whether the second paragraph really comes after the first (true order, label 1) or the order has been reversed (label 0; see Table 1). Detailed statistics of the data can be found in Table 2.

[1] API for downloading the dataset: https://github.com/ShenakhtPajouh/transposition-data. Implementations of the different algorithms: https://github.com/ShenakhtPajouh/transposition-simple
| Average Number of Tokens | |
| Average Number of Sentences | 9.31 |

Table 2: Statistics of the ParagraphOrdering dataset (per paragraph).
Different approaches have been used to solve this task. The best result belongs to classifying the order of the paragraphs with a fine-tuned pre-trained BERT model, which achieves around 0.84 accuracy on the test set and significantly outperforms the other models (Table 3).
| BERT Features (512 tokens) + Feed-Forward | 0.639 |
| BERT Classifier (30 tokens / 15 tokens from each paragraph) | 0.681 |
| BERT Classifier (128 tokens / 64 tokens from each paragraph) | 0.717 |
| BERT Classifier (256 tokens / 128 tokens from each paragraph) | 0.843 |

Table 3: Test-set accuracy of the different methods.
3.1 Encoding with LSTM and Gated CNN
In this method, the paragraphs are encoded separately, and the concatenation of the resulting encodings is passed to a classifier. First, each paragraph is encoded with an LSTM, and the hidden state at the end of each sentence is extracted. The resulting matrix is then passed through a gated CNN [Dauphin et al.2017] to extract a single encoding for each paragraph. The accuracy is barely above chance, which shows that this method is not very promising.
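The core of the gated CNN is the gated linear unit, which modulates one projection of the input by the sigmoid of another. A minimal NumPy sketch of a single GLU layer applied to a matrix of sentence encodings (a width-1 convolution for simplicity; the shapes and dimensions here are illustrative, not the configuration used in the experiments):

```python
import numpy as np

def glu_layer(X, W, V, b, c):
    """Gated linear unit: (X @ W + b) * sigmoid(X @ V + c).

    X: (seq_len, d_in) matrix of sentence encodings.
    W, V: (d_in, d_out) weights; b, c: (d_out,) biases.
    Returns a (seq_len, d_out) gated output.
    """
    gate = 1.0 / (1.0 + np.exp(-(X @ V + c)))  # sigmoid gate in (0, 1)
    return (X @ W + b) * gate

rng = np.random.default_rng(0)
X = rng.standard_normal((9, 16))      # 9 sentence encodings of dimension 16
W, V = rng.standard_normal((2, 16, 8))
b, c = np.zeros(8), np.zeros(8)
H = glu_layer(X, W, V, b, c)
print(H.shape)  # (9, 8)
```

Because the gate lies strictly in (0, 1), the layer can only attenuate the linear projection, which is what lets it act as a learned, per-dimension selection mechanism over the sentence encodings.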
3.2 Fine-tuning BERT
We used a pre-trained BERT in two different ways: first as a feature extractor without fine-tuning, and second by fine-tuning its weights during training. The classification setup follows the BERT paper, i.e., we represent the first and second paragraphs as a single packed sequence, with the first paragraph using the A embedding and the second paragraph using the B embedding. In the feature-extraction case, the network weights are frozen and the [CLS] token representation is fed to the classifier. In the fine-tuning case, we tried different maximum sequence lengths to test the capability of BERT on this task. First, only the last sentence of the first paragraph and the first sentence of the second paragraph were used for classification, to see whether two sentences are enough to determine the order. As we increased the number of tokens, accuracy increased accordingly. We found this method very promising: accuracy improves significantly over the previous methods (Table 3). This result suggests that a fine-tuned pre-trained BERT can approximately learn the order of paragraphs and the arrow of time in stories.
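The packed-sequence input described above can be sketched as follows. This toy version uses a naive whitespace split and an even token budget per paragraph; a real setup would use BERT's WordPiece tokenizer, and the truncation scheme here (end of the first paragraph, start of the second) is an assumption motivated by the two-sentence experiment:

```python
def pack_pair(first, second, max_len=30):
    """Build BERT-style inputs: [CLS] first [SEP] second [SEP].

    Returns (tokens, segment_ids); segment 0 = A embedding (first
    paragraph), segment 1 = B embedding (second paragraph).
    """
    a = first.split()
    b = second.split()
    # Reserve 3 positions for [CLS] and the two [SEP] markers,
    # splitting the remaining budget between the two paragraphs.
    budget = max_len - 3
    a = a[-(budget // 2):]       # keep the END of the first paragraph
    b = b[: budget - len(a)]     # keep the START of the second one
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    segments = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segments

toks, segs = pack_pair("he was scarcely conscious", "then he felt himself", max_len=12)
print(toks)
print(segs)
```

The segment ids are what give BERT the A/B distinction: every token of the first paragraph (plus [CLS] and the first [SEP]) receives the A embedding, and every token after it receives the B embedding.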
References

- [Radford et al.2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- [Wei et al.2018] Donglai Wei, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. 2018. Learning and using the arrow of time.
- [Devlin et al.2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [Gehring et al.2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- [Dauphin et al.2017] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, 70:933–941.