Transformer-based neural architectures Vaswani et al. (2017) have recently achieved great advances in representing and generating sequences of text Dai et al. (2019); Radford et al. (2018); Cer et al. (2018); Devlin et al. (2018). However, the task of representing and generating text that is both long and coherent still eludes state-of-the-art (SOTA) models. Specifically, the effectiveness of these models decreases sharply when modeling sequences that are longer than the span their Self-Attention layers work upon. This problem was quantified in Dai et al. (2019), where it was referred to as the Relative Effective Context Length (RECL), and it was shown that for relatively short self-attention spans (128 tokens), the RECL can extend up to 4.5 times the self-attention span.
A recent attempt to generate relatively long texts was presented in Radford et al. (2018), where a huge Transformer-based language model (1.5 billion parameters) was trained. This model was able to successfully generate long texts of ~200 words that seem nearly indistinguishable from human-generated ones. However, due to the $O(n^2)$ computational complexity of the Transformer’s Self-Attention layer, where n is the number of words, it is very expensive to extend the length of generated text while maintaining its coherence.
There have also been several recent attempts to encode long texts as vectors, mainly for the purpose of using those vectors for text classification tasks. In two such cases Devlin et al. (2018); Cer et al. (2018), SOTA results have been achieved for multiple tasks, while using text classification vectors of equal size to the (single token) embedding. While both the <CLS> vector presented in Devlin et al. (2018) and the summation of the contextual token vectors in the document presented in Cer et al. (2018) have achieved impressive results, the constraint of representing both words and long sequences of text as vectors in the same space becomes harder to satisfy as the text grows longer.
This paper is organized as follows: Section 2 describes how long texts are embedded. Section 3 describes the training process required to make such embeddings meaningful. Section 4 describes the generation and refining processes for novel texts. Section 5 describes how we keep proper nouns (such as character names) consistent throughout the document, and Section 6 reviews the entire system and its training process.
2 Text embedding
In this section we explain how text is embedded into a one-dimensional vector given a pre-trained model. For this purpose, we assume that the text was preprocessed, and that sentence, paragraph and chapter breaks were identified during this stage. In order to encode a document we only need to consider a small part of the network described here, which we denote as the Hierarchical Encoder (HE). The HE creates a document encoding according to the following steps:
Tokens are embedded into $\mathbb{R}^{d_{tok}}$, in a manner similar to Vaswani et al. (2017), using an embedding matrix.
These tokens are transformed using a Self-Attention based Encoder. Specifically, the Evolved Transformer Encoder model So et al. (2019) is used for this stage.
The encoded tokens are compressed into a one-dimensional vector by a Recurrent Neural Network (RNN), similar to Sutskever et al. (2014), with an output dimension of $d_s$, such that $d_s > d_{tok}$. For each sentence, the final state of this RNN is considered to be the sentence representation vector, and we will refer to this RNN as the Compressor. Please note that what we call a Compressor is referred to as an Encoder in Sutskever et al. (2014). For the RNN we use a Bidirectional LSTM, inspired by Graves and Schmidhuber (2005).
All of the sentence vectors belonging to the same paragraph are stacked and followed by a <SENTENCE_EoS> token to get a sentence embedding matrix.
The sentence embedding matrix is padded with trainable padding tokens, ensuring that all sentence embedding matrices are of the same dimensionality.
A Position Embedding matrix is added to the sentence embedding matrix in a manner similar to Vaswani et al. (2017).
Using the sentence embedding matrix, we repeat stages 2-6 to create a paragraph encoding vector of dimension $d_p$, such that $d_p > d_s$.
Chapter embeddings are created by repeating the above process with the paragraph as the input, and so forth.
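The hierarchical embedding steps above can be sketched in a few lines of NumPy. This is a toy illustration only, not the model itself: a vanilla RNN stands in for the Evolved Transformer Encoder and Bidirectional-LSTM Compressor, and all names and dimensions are made up for the example.

```python
import numpy as np

def compress(vectors, W_in, W_h, d_out):
    """Minimal stand-in for the Compressor: a vanilla RNN whose final
    hidden state is taken as the sequence's embedding. (The paper uses
    a Bidirectional LSTM; this toy keeps only the key idea.)"""
    h = np.zeros(d_out)
    for v in vectors:
        h = np.tanh(W_in @ v + W_h @ h)
    return h  # final state = sentence/paragraph/chapter vector

rng = np.random.default_rng(0)
d_tok, d_sent, d_par = 8, 16, 16
W_in = rng.normal(0, 0.1, (d_sent, d_tok))
W_h = rng.normal(0, 0.1, (d_sent, d_sent))

# A toy "paragraph": 3 sentences, each a matrix of 5 token embeddings.
paragraph = [rng.normal(size=(5, d_tok)) for _ in range(3)]

# Steps 1-3: each sentence is compressed to one d_sent vector.
sent_vecs = np.stack([compress(s, W_in, W_h, d_sent) for s in paragraph])

# Steps 4-7: stack the sentence vectors and compress again (paragraph level).
W_in2 = rng.normal(0, 0.1, (d_par, d_sent))
W_h2 = rng.normal(0, 0.1, (d_par, d_par))
par_vec = compress(sent_vecs, W_in2, W_h2, d_par)
```

Repeating the same compression over paragraph vectors would yield a chapter vector, and so forth up the hierarchy.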
We denote all the intermediate vectors created by the HE during the embedding of a document as the Document Vector Tree (DVT). The DVT will help us during the loss calculations in Section 3, and figure 1 illustrates the creation of a single (sentence-level) node in that tree.
3 Model Training and Loss Function
The loss function used to train our model contains three separate components for each level of the DVT (except the first and last levels, which only have two). These components are referred to as:
The Sequence Reconstruction (downward) loss
The MLM (in-level) loss
The Coherence (upward) loss
These losses are defined in the following subsections, and their summation is the main component of AGENT's loss function. For these steps we need the DVT; thus, calculating it for each training example (given the current weights of the HE) is the first step in computing these losses. Figure 2 illustrates these losses at the sentence level.
3.1 Sequence Reconstruction
When we use a Compressor to create a higher level embedding from a lower level one, for example creating a paragraph embedding vector from its sentence embedding matrix, we would like to ensure that as little data as possible is lost during the compression stage. To do so, we use another RNN (also a Bidirectional LSTM) denoted as the Decompressor, followed by a Self-Attention based Decoder that takes the hidden states of the Decompressor as the input. Specifically, we use the Evolved Transformer Decoder model So et al. (2019). When reconstructing token vectors from sentence vectors, the loss is computed in a way similar to Vaswani et al. (2017). When reconstructing vectors at the sentence level (or higher), from the output of their corresponding decoder, we make the following adjustments in order to compute the loss in the same manner:
The vector for the correct “label” of a sentence is the sentence vector that was constructed by the HE.
The vectors for the incorrect “labels” (corresponding to the embedding matrix in the token decoding stage) are all the sentence vectors created by the HE for the current batch.
The bias for each sentence is removed, which is equivalent to setting the bias vector of the token-level decoder to 0.
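The sentence-level adjustments above can be sketched as a softmax cross-entropy in which the batch's HE sentence vectors play the role of the output vocabulary and the bias is fixed at zero. This is a hedged toy version; all names are illustrative.

```python
import numpy as np

def sentence_reconstruction_loss(decoded, batch_sent_vecs, target_idx):
    """Cross-entropy where the "labels" are the sentence vectors the HE
    produced for the current batch, scored by dot product with no bias."""
    logits = batch_sent_vecs @ decoded           # one score per candidate, bias = 0
    logits -= logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]                # correct "label" = HE vector

rng = np.random.default_rng(1)
batch = rng.normal(size=(6, 16))                 # 6 HE sentence vectors in the batch
decoded = batch[2] + 0.01 * rng.normal(size=16)  # decoder output near vector 2
loss = sentence_reconstruction_loss(decoded, batch, target_idx=2)
```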
3.2 Masked Language Modeling
For each level of the DVT, we perform the Masked Language Modeling task described in Devlin et al. (2018). In this task, vectors (of tokens, sentences, paragraphs, etc.) are randomly masked, and the objective is to predict the corresponding vector that was generated by the HE. For the token level, we perform the task using the exact same loss function as Devlin et al. (2018), and reduce the mismatch between training and embedding by:
Replacing the missing vector with a vector corresponding to a <MASK> token that exists in our vocabulary 80% of the time.
Replacing the missing vector with a vector corresponding to a random token that exists in our vocabulary 10% of the time.
Keeping the “missing” vector unchanged 10% of the time.
For higher levels of the DVT, we make the following adjustments:
Replacing the missing vector with a <LEVEL_MASK> vector that is a learnable model variable 80% of the time. Each level of the DVT has its own “mask” token of the appropriate dimension.
Replacing the missing vector with a random vector, uniformly selected from all of the sentence vectors generated by the HE for the current batch 10% of the time.
Keeping the “missing” vector unchanged, as it was generated by the HE, 10% of the time.
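The 80/10/10 rules for the higher DVT levels can be sketched as follows. This is a hedged toy version: the real <LEVEL_MASK> is a trained variable, and the fraction of masked positions is a hyper-parameter assumed here to be 15%.

```python
import numpy as np

def mask_level(vectors, level_mask, rng, p_mask=0.15):
    """Select ~p_mask of the vectors; replace each selected one with
    <LEVEL_MASK> 80% of the time, with a random batch vector 10% of the
    time, and keep it unchanged 10% of the time."""
    out = vectors.copy()
    targets = []                                   # positions to predict
    for i in range(len(vectors)):
        if rng.random() >= p_mask:
            continue
        targets.append(i)
        r = rng.random()
        if r < 0.8:
            out[i] = level_mask                    # <LEVEL_MASK> vector
        elif r < 0.9:
            out[i] = vectors[rng.integers(len(vectors))]  # random HE vector
        # else: keep the "missing" vector unchanged
    return out, targets

rng = np.random.default_rng(2)
sent_vecs = rng.normal(size=(10, 16))              # HE sentence vectors
level_mask = np.zeros(16)                          # stands in for a trained variable
masked, targets = mask_level(sent_vecs, level_mask, rng)
```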
The classifier for this task shares some of its architecture and weights with parts of the HE. For the sentence level the process is as follows:
Tokens are encoded and compressed by the token level Encoder and Compressor of the HE to create sentence vectors.
A few of the sentence vectors are masked for the MLM task.
The sentence level Encoder of the HE is used to create contextual embeddings.
Like in Devlin et al. (2018), these contextual embeddings pass through one dense layer with GELU (Hendrycks and Gimpel, 2016) activation (which is not a part of the HE) and go through a matrix multiplication with the embedding matrix (which was generated by the HE, as in Section 3.1) to obtain the logits for the classification task.
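The classifier head described above can be sketched as follows. The dense layer size and the tanh-based GELU approximation are assumptions made for the example; the "vocabulary" here is the batch of HE sentence vectors.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks and Gimpel, 2016)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlm_logits(contextual, W_dense, b_dense, embedding_matrix):
    """One dense layer with GELU, then a matmul with the (HE-generated)
    embedding matrix to score every candidate vector."""
    h = gelu(contextual @ W_dense + b_dense)
    return h @ embedding_matrix.T       # one logit per candidate per position

rng = np.random.default_rng(3)
d = 16
contextual = rng.normal(size=(10, d))   # contextual sentence embeddings
W_dense = rng.normal(0, 0.1, (d, d))
b_dense = np.zeros(d)
embeddings = rng.normal(size=(10, d))   # batch sentence vectors act as the "vocab"
logits = mlm_logits(contextual, W_dense, b_dense, embeddings)
```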
3.3 Coherence
This task is a generalization of what is referred to in Devlin et al. (2018) as the “Next Sentence Prediction” task and is meant to replace it, both for the token level and the higher levels of the DVT. In this task (at the sentence level) all sentence vectors of the same paragraph are randomly split into two groups: group A and group B. When creating Segment Embeddings for the DVT (see Section 2), all sentences are considered to be in group A. However, when performing the Coherence task, a number P is randomly chosen: half the time P is set to 0, and half the time $P \sim U(0,1)$. Then, each sentence has a probability P of belonging to segment B.
After this selection, we add a Segment Embedding to each sentence vector, which is one of two special trainable tokens: <SENTENCE_A> and <SENTENCE_B>, thus creating a new sentence embedding matrix. We then create a paragraph vector using the sentence-level Encoder and Compressor of the HE. Note that in cases where P=0, this paragraph vector is identical to the paragraph vector created by the HE. This paragraph vector is passed to a component of our system that we denote as the Coherence Checker, which exists for each level of the DVT. The Coherence Checker of each level is a regressor that predicts the ratio of sentences in the paragraph that were, in fact, replaced. It is composed of L (a hyper-parameter, see Section 4.1 for more details) fully connected feed-forward layers with tanh activation, followed by a single unit with a sigmoid activation. The loss is the mean-squared-error (MSE) between the predicted and actual ratio of replaced sentences.
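The corruption procedure for this task can be sketched as follows; the regressor itself is omitted, and the replacement pool (random sentences from the batch) is an assumption for the example.

```python
import numpy as np

def coherence_example(sent_vecs, batch_vecs, rng):
    """Choose P (0 half the time, else P ~ U(0,1)), replace each sentence
    with a random batch sentence with probability P, and return the actual
    replacement ratio the Coherence Checker must regress to."""
    P = 0.0 if rng.random() < 0.5 else rng.random()
    out = sent_vecs.copy()
    replaced = 0
    for i in range(len(sent_vecs)):
        if rng.random() < P:
            out[i] = batch_vecs[rng.integers(len(batch_vecs))]
            replaced += 1
    return out, replaced / len(sent_vecs)

def mse(pred, actual):
    # the Coherence loss: mean-squared-error on the replacement ratio
    return (pred - actual) ** 2

rng = np.random.default_rng(4)
sent_vecs = rng.normal(size=(8, 16))    # one paragraph's sentence vectors
batch_vecs = rng.normal(size=(32, 16))  # sentence vectors from the batch
corrupted, ratio = coherence_example(sent_vecs, batch_vecs, rng)
```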
3.4 Auto-Encoder Regularization
In order to create one-dimensional text embedding vectors for the various levels of the DVT, we use components of our system that we have called Compressors and Decompressors. While these components perform a similar function to Encoders and Decoders, and in fact were used as such in Sutskever et al. (2014), we treat them as components separate from the Encoder and Decoder So et al. (2019) that we use alongside them. The reason is that, unlike the Transformer Encoder, the Compressor must produce a one-dimensional output, while the Decompressor is not allowed to use any input other than the compressed vector (e.g., it cannot access the pre-compressed embedding).
While reversing the transformation of a Compressor is not necessary to have an effective (Sequence Reconstruction loss minimizing) Decompressor, we can safely say that a Decompressor which reverses its corresponding Compressor is good enough, as it leaves us with the well tested Encoder and Decoder of So et al. (2019). To encourage this effect, we treat each corresponding Compressor-Decompressor pair as a part of an Auto-Encoder and add the following regularization term to our model for each such pair:
$$R_{AE} = \lambda \left\lVert X - \hat{X} \right\rVert^2$$ where $X$ is the Compressor input, $\hat{X}$ is the Decompressor output, and $\lambda$ is a small number (hyper-parameter) that decreases to zero as the training progresses.
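A minimal sketch of this regularization term, assuming a squared L2 penalty between the Compressor input and the Decompressor output (the exact norm is an assumption):

```python
import numpy as np

def ae_regularization(x, x_hat, lam):
    """Penalize the Decompressor for not reversing the Compressor; lam is
    a hyper-parameter annealed toward zero as training progresses."""
    return lam * np.sum((x - x_hat) ** 2)

rng = np.random.default_rng(5)
x = rng.normal(size=(5, 16))                 # Compressor input
x_hat = x + 0.1 * rng.normal(size=(5, 16))   # imperfect Decompressor output
term = ae_regularization(x, x_hat, lam=0.01)
```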
4 Text Generation
We attempt to achieve the goal of generating text that is indistinguishable from human generated text using a Generative Adversarial Network Goodfellow et al. (2014). For each level of the DVT (except for the token level), AGENT has a Generator that generates a text vector, and a Discriminator that detects whether a text vector was generated via the Generator or the HE.
4.1 Generators

To generate text at the paragraph level, we generate a random vector $z_p$ (where p stands for paragraph). Due to the fact that a randomly generated vector is unlikely to reside within a region of space that contains coherent text representations Bowman et al. (2015), this vector is passed through a generator $G_p$. $G_p$ is composed of L dense layers, where all except the last one are followed by a tanh activation. Following the reasoning of Goodfellow et al. (2014), if $G_p$ could be any arbitrary function, a unique solution would exist where $G_p(z_p)$ recovers the distribution of paragraph vectors generated by the HE. We aim to select L in such a way that $G_p$ can approximate this solution to a reasonable degree. The Generators of the other levels of the DVT are all created in a similar manner.
We also note that there exists an interesting similarity between the Generator and the Coherence Checker of the same level. While $G_p$ takes a random point and brings it to a region of the embedding space where coherent texts ought to reside, $CC_p$ takes a point in that space and determines whether or not it is in that region. For this reason we use the same architecture (L dense layers) for both.
4.2 Discriminators

In order to improve the quality of $G_p$ we create a Discriminator denoted as $D_p$. As in Goodfellow et al. (2014), the Discriminator aims to classify whether its input was derived from the data or produced by $G_p$, while $G_p$ is trained to maximize the loss of $D_p$. However, $D_p$ does not directly take the paragraph vector generated by $G_p$ as an input. Instead, this vector is transformed by the corresponding Decompressor and Decoder to obtain a sentence matrix for its input. On the other hand, inputs derived from real data are generated by applying the same transformation to paragraph vectors from the DVT. The Discriminators' architecture was chosen to be a simple CNN Kim (2014), as it was proven effective when employed over LSTM-generated word vectors (Zhang and Carin, 2016). However, unlike Zhang and Carin (2016), this LSTM is not a part of the Generator, and its weights do not change during the back-propagation of the Discriminator loss. Instead, its (and the Decoders') weights are optimized to reconstruct the sentence vectors of the DVT as described in Section 3.1. The Discriminators for the other levels of the DVT are all created in a similar manner, and figure 3 illustrates the above process.
4.3 Hierarchical Decoding and Copyediting
After the training of AGENT is done, we use the top level Generator to create a document vector. This vector is passed to the top level Decompressor and Decoder, and each resulting vector is passed in turn to the decoder of its level. By the end of this process, which we call the Hierarchical Decoding (HD), we are left with a DVT whose leaves can be translated into tokens by the lowest level decoder. However, before generating text in that manner, we perform an additional step, which we refer to as Copyediting. In this step the DVT is repeatedly traversed level by level, where all nodes of the same level are simultaneously updated according to the following formula:
$$v^{(i+1)} = (1-\epsilon)\, v^{(i)} + \epsilon \, \mathrm{MLM}\!\left(v^{(i)}\right)$$ where $\epsilon$ is a small constant, $v^{(i)}$ is the current node after i iterations of Copyediting, and $\mathrm{MLM}(v^{(i)})$ is the same node, replaced by the <MASK> vector of its level and reconstructed using the other (unchanged) nodes of the same level in accordance with the MLM task (Section 3.2). After updating all the nodes of a level, the lower level of the DVT is generated from the updated nodes and the traversal progresses.
The motivation behind this step is to use the knowledge we have gathered during the MLM task of how neighboring texts (sentences in the same paragraph, for example) relate to each other, in order to constrain the decoded text to exhibit such relations as well. In other words, the weights of the MLM component contain knowledge about how subsequent sentences (in the training examples) behave, and this knowledge might not completely overlap with the knowledge contained in the HD.
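A toy sketch of the Copyediting traversal at one level. Both the interpolation form and the stand-in MLM predictor (the mean of the other nodes) are assumptions for illustration only.

```python
import numpy as np

def copyedit_level(nodes, mlm_predict, eps, iters):
    """Repeatedly nudge every node of a level toward the vector the MLM
    component predicts for it from its (unchanged) neighbors; all nodes
    of the level are updated simultaneously."""
    v = nodes.copy()
    for _ in range(iters):
        preds = np.stack([mlm_predict(v, i) for i in range(len(v))])
        v = (1 - eps) * v + eps * preds
    return v

def toy_mlm(v, i):
    # Stand-in MLM: "reconstruct" node i as the mean of the other nodes.
    return (v.sum(axis=0) - v[i]) / (len(v) - 1)

rng = np.random.default_rng(6)
nodes = rng.normal(size=(4, 8))     # e.g., four sentence vectors of a paragraph
edited = copyedit_level(nodes, toy_mlm, eps=0.1, iters=3)
```

With this stand-in predictor the nodes contract toward each other while the level's mean is preserved, which mirrors the intuition that Copyediting pulls neighbors toward mutual consistency.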
5 Proper Nouns Data Base
One challenge in generating a document vector is that the different chapter vectors that are derived from it need to be consistent. For example, character names must remain consistent through all chapters and all of the paragraphs that are derived from these chapters. Though this challenge can be eased by using the Coherence loss during training and the Copyediting stage during text generation, complete consistency throughout an entire document might still be very hard to reach. In addition, many of the sentences in a book include duplicate information such as the name of the main character. While encoding those sentences into sentence vectors, the HE must make sure that names are encoded correctly (in such a way that they can be decoded). This, in turn, requires us to use sentence vectors of a dimension that is large enough to ensure that they can contain all of the names that might appear in a sentence.
To address these issues and increase our model’s consistency while reducing the size of its embeddings, we introduce an additional component to our model that we refer to as Proper Nouns Data Base (PNDB). Figure 4 illustrates the PNDB and we write to it in the following manner:
Writes to the PNDB are performed alongside the construction of the DVT.
We define a question matrix $Q \in \mathbb{R}^{q \times d_{tok}}$, which is a learnable model parameter; 'q' is a hyper-parameter describing the number of questions our model asks.
For each sentence ‘i’, the HE creates a contextual token embedding matrix Ei. For each Ei we create a ‘key’ matrix Ki through a linear projection, and a ‘value’ matrix Vi through an Ignore Gate, see section 5.1 for additional details.
For each Ei we create an 'answer' matrix Ai by applying scaled dot-product attention in a manner identical to Vaswani et al. (2017): $A_i = \mathrm{softmax}\!\left(Q K_i^{\top}/\sqrt{d_k}\right) V_i$.
A global 'answer' matrix A is created by average pooling all the Ai matrices.
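The write path above can be sketched with standard scaled dot-product attention followed by average pooling. Dimensions and names here are toy assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pndb_write(Q, keys, values):
    """Per-sentence dot-product attention (as in Vaswani et al., 2017)
    of the questions Q against (K_i, V_i), followed by average pooling
    of the per-sentence answer matrices into the global matrix A."""
    d_k = Q.shape[1]
    answers = [softmax(Q @ K.T / np.sqrt(d_k)) @ V for K, V in zip(keys, values)]
    return np.mean(answers, axis=0)

rng = np.random.default_rng(7)
q, d = 4, 16
Q = rng.normal(size=(q, d))                           # learnable question matrix
keys = [rng.normal(size=(7, d)) for _ in range(3)]    # K_i per sentence (linear projection)
values = [rng.normal(size=(7, d)) for _ in range(3)]  # V_i per sentence (post Ignore Gate)
A = pndb_write(Q, keys, values)
```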
We read from the PNDB in the following manner:
Reads from the PNDB are performed during loss calculations for the Reconstruction and Masked LM tasks.
For each sentence 'i' we define E2i as the decompressed token matrix for the Reconstruction task (as illustrated in figure 4), and as the contextual token embedding matrix for the MLM task.
For each E2i we create a 'key' matrix K2i through a linear projection, and an 'answer' matrix A2i by applying dot-product attention in a manner similar to Vaswani et al. (2017).
An embedding matrix E3i is created by combining E2i and A2i through an Update Gate, see section 5.1 for additional details.
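A matching sketch of the read path. The Update Gate is reduced here to a single linear-plus-sigmoid unit for brevity, which is an assumption; the paper's gate reuses the Ignore Gate architecture (section 5.1).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pndb_read(E2, W_k, Q, A, w_gate):
    """Project E2 to keys, attend over the global answer matrix A, then
    blend per token via a sigmoid Update Gate into E3."""
    K2 = E2 @ W_k                                      # 'key' matrix K2_i
    A2 = softmax(K2 @ Q.T / np.sqrt(Q.shape[1])) @ A   # per-token answers A2_i
    g = sigmoid(E2 @ w_gate)[:, None]                  # how much PNDB input each token needs
    return g * A2 + (1 - g) * E2                       # Update Gate output E3_i

rng = np.random.default_rng(8)
d, q = 16, 4
E2 = rng.normal(size=(7, d))        # decompressed / contextual token matrix
W_k = rng.normal(0, 0.1, (d, d))
Q = rng.normal(size=(q, d))         # question matrix
A = rng.normal(size=(q, d))         # global answer matrix from the write path
w_gate = rng.normal(0, 0.1, d)
E3 = pndb_read(E2, W_k, Q, A, w_gate)
```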
While using the PNDB we run the risk of circumventing our model’s learning process due to ‘label leaking’, which in this case means being able to save a word we are trying to predict to the PNDB, and then relying on the PNDB to ‘guess’ the missing word without needing any input from the rest of the model. To avoid such cases we have made the following decisions for the PNDB architecture:
The Q matrix is a model parameter, so we cannot 'ask' questions that are specific to a certain document.
’q’, the number of possible answers, is set to be much smaller than the number of tokens in a document.
The average pooling layer prevents us from reading answers from the PNDB that are specific to the i-th sentence (for any i). Thus, to reduce the overall loss of the model, we are forced to write only data that has relevance to the entire document to the PNDB.
Alternatively, it is also possible to change ‘A’ to be the average pooling of all Ai except for i=j when calculating loss for the j-th sentence, thus avoiding this issue completely.
5.1 The Ignore and Update Gates
While calculating the Masked LM loss for the following example sentence: “Alice is happy because Bob loves her very much”, we expect that masking a pronoun (“her”), adverb (“very”) or adjective (“much”) would make our task relatively easy, while masking a proper noun (“Alice” or “Bob”) would make our task harder. However, knowing that Alice appears somewhere else in the document can be of great help, as it means she is a character in our story. So, if we are able to save and retrieve the proper nouns in our document, the question of “Who is happy because…” becomes a multiple-choice question, which is presumably easier. Therefore, at the token level, the Ignore Gate was designed to facilitate the saving of proper nouns to the PNDB.
It has been shown that the attention heads of Transformer-based Encoders learn to detect part-of-speech (PoS) Strubell et al. (2018); Vaswani et al. (2017). It follows that Transformer-encoded tokens carry data regarding their PoS, and that this data can be accessed without the use of additional non-linear layers. Therefore, we use a unigram CNN with 'F' filters to capture this data. These filters are divided into groups of 8, and a softmax activation is applied over each group for each token; so for each token, each such group 'chooses' one of 8 possible categories. Then, for each token, the results of the 'F' post-activation filters are fed into a single logistic unit. The output of the Ignore Gate is equal to its input, where each token is multiplied by its corresponding logistic activation.
The Update Gate is designed to detect tokens in E2i that require additional input from the PNDB in order to have a comprehensible meaning, and retrieve that input from Ai when it is needed. To detect these tokens, we think of the concept of “requiring additional input” as a PoS, and use the architecture of the Ignore Gate and its sigmoid unit to detect it. Then, the output of the Update Gate for the j-th token of E2i is a weighted average of the j-th token of A2i and the j-th token of E2i, where the weight is determined by the activation of the sigmoid unit.
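A hedged sketch of the Ignore Gate's filter/softmax/logistic structure follows the description above; the weights are random here, and 'F' and all names are chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ignore_gate(E, W_filt, w_out):
    """F unigram CNN filters per token, softmax over groups of 8 (each
    group 'chooses' one of 8 categories), then one logistic unit whose
    activation scales the token's embedding."""
    F = W_filt.shape[1]
    feats = E @ W_filt                         # unigram conv = per-token matmul
    feats = feats.reshape(len(E), F // 8, 8)   # groups of 8 filters
    feats = softmax(feats, axis=-1).reshape(len(E), F)
    keep = sigmoid(feats @ w_out)              # one logistic activation per token
    return E * keep[:, None]                   # gated token embeddings (the V_i values)

rng = np.random.default_rng(9)
d, F = 16, 32                                  # F filters -> 4 groups of 8
E = rng.normal(size=(7, d))                    # Transformer-encoded tokens
W_filt = rng.normal(0, 0.1, (d, F))
w_out = rng.normal(0, 0.1, F)
V = ignore_gate(E, W_filt, w_out)
```

Because the logistic activation lies in (0, 1), each token is attenuated rather than rewritten, so tokens deemed irrelevant (non-proper-nouns) contribute little to the PNDB write.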
5.2 Data Base Generation
When using the PNDB component during the calculation of the Reconstruction loss, it becomes necessary to generate values for the 'A' matrix in order to generate a document. To generate the i-th answer vector in the PNDB, we define $G_{PNDB}$, which performs the following steps:
The i-th question vector in 'Q' is concatenated with the generated document vector, and the result is passed through a dense layer.
The result is concatenated with a random vector, and the i-th answer vector is obtained using L dense layers, as in Section 4.
6 Summary

In this paper we have presented a new way to encode long texts while avoiding two of the issues that hinder the performance of classifiers built over SOTA text-representation vectors such as Devlin et al. (2018); Cer et al. (2018). We reduced the number of operations needed to learn relationships between different words (quadratic in the sequence length for a standard Transformer) thanks to the hierarchical nature of the DVT, and we have generated text vectors whose dimensions can be set to be large enough to represent long documents. We hypothesize that training AGENT will exhibit the following steps:
AGENT will learn high quality token representation using the MLM task, similarly to Devlin et al. (2018).
Once the token representation is relatively stable, AGENT will attempt to represent sentences in such a way that:
Tokens can be recovered (to some extent) from their sentence vector due to the downward (Reconstruction) loss.
Coherent sentences reside in different parts of the embedding space from incoherent sentences, due to the upward (Coherence) loss.
Sentences with similar meaning (and different wording) should have similar vectors due to the in-level (MLM) loss. This is because surrounding sentences can give a lot of information about the meaning of a masked sentence but very little information about its exact wording and word order. Therefore, many sentences with the same meaning can be guessed to be the missing sentence, and in order to reduce the MLM loss their vectors should be close to each other but far from other, inappropriate sentences.
Once the sentence representation is relatively stable, AGENT will attempt to repeat step 2 for higher levels of text representation.
The training process of the Generators and Discriminators of AGENT is characterized by the fact that back-propagating their loss does not affect the weights of the HE and HD (though a change in either of them will affect the Discriminators). So one possibility is to save training time by training the Generator and Discriminator pairs only after the rest of the network. Another possibility is to train only the Generators, while the weights of the Discriminators are set to be equal to those of their corresponding Coherence Checkers, and to observe the generated texts during training.
- Bowman et al.  Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. arXiv e-prints, art. arXiv:1511.06349, Nov 2015.
- Cer et al.  Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal sentence encoder. CoRR, abs/1803.11175, 2018. URL http://arxiv.org/abs/1803.11175.
- Dai et al.  Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv e-prints, art. arXiv:1901.02860, Jan 2019.
- Devlin et al.  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
- Goodfellow et al.  Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv e-prints, art. arXiv:1406.2661, Jun 2014.
- Graves and Schmidhuber  Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks : the official journal of the International Neural Network Society, 18:602–10, 07 2005. doi: 10.1016/j.neunet.2005.06.042.
- Hendrycks and Gimpel  Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv e-prints, art. arXiv:1606.08415, Jun 2016.
- Kim  Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1181. URL http://aclweb.org/anthology/D14-1181.
- Radford et al.  Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2018. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
- So et al.  David R. So, Chen Liang, and Quoc V. Le. The evolved transformer, 2019.
- Strubell et al.  Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. Linguistically-Informed Self-Attention for Semantic Role Labeling. arXiv e-prints, art. arXiv:1804.08199, Apr 2018.
- Sutskever et al.  Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 3104–3112, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969173.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
- Zhang and Carin  Yizhe Zhang and Lawrence Carin. Generating text via adversarial training. 2016.