Neural Academic Paper Generation

by Samet Demir, et al.
Boğaziçi University

In this work, we tackle the problem of structured text generation, specifically academic paper generation in LaTeX, inspired by the surprisingly good results of basic character-level language models. Our motivation is to apply more recent and advanced methods of language modeling to a more complex dataset of LaTeX source files to generate realistic academic papers. Our first contribution is preparing a dataset of LaTeX source files from recent open-source computer vision papers. Our second contribution is experimenting with recent language modeling and text generation methods, such as the Transformer and Transformer-XL, to generate consistent LaTeX code. We report cross-entropy and bits-per-character (BPC) results of the trained models, and we also discuss interesting points about some examples of the generated code.








1 Introduction

Even with the recent advances in natural language generation (NLG) methods that employ deep learning architectures, it is still a challenge to generate semantically consistent and structured essays, or texts in general. Therefore, academic paper generation remains a compelling problem in the vast NLG field.

Almost all academic papers in the field of computer science, including this one, are written in the LaTeX typesetting system [1986lamport]. LaTeX has many syntactic rules for creating objects such as headers, figures, and tables. Therefore, in an academic paper written in LaTeX format, there are both semantic and syntactic long term dependencies inside the text. The semantic dependency, which refers to consistency in the flow of the text, is meant to be maintained while writing a paper. For instance, the introduction and conclusion should be consistent with the subject and motivation of the work, as well as with the rest of the paper. In addition to the semantic integrity of the essay, the syntactic dependencies among LaTeX keywords must be sustained for the file to compile successfully. There can be a complex table or a section with many subfigures, and it is not always easy to keep track of the brackets and special keywords needed to create a table successfully. Both of these long term dependencies are a challenge for an automated academic paper generation system.

Fortunately, there are many recent studies on natural language generation focusing on generating realistic texts by taking long term dependencies into account [Attention_is_All_you_Need, devlin2018bert, transformer_xl, radford2019language, ijcai2018-567]. There are also widely used machine learning methods such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) [Hochreiter:1997:LSM:1246443.1246450] that are designed to handle short and long term dependencies in any data involving such dependencies. Our primary motivation is to use some of these recent methods with a new dataset consisting of academic papers written in LaTeX, to see whether successful NLG models can also generate realistic academic papers.

The rest of the paper is organized as follows: the literature on academic paper generation and character-level text generation in general is reviewed in Section 2; details of the dataset and the models are given in Section 3; Section 4 contains the experimental setup and the results; and Section 5 contains the conclusion and future work.

2 Previous Work

To the best of our knowledge, there are only a few works on automated academic paper generation. One such work is SCIgen [SCIgenAnAutomaticCSPaperGenerator], which generates random sentences, graphs, and citations from a handwritten context-free grammar. [SCIgenAnAutomaticCSPaperGenerator] stated that some papers generated by the tool had been accepted to a few conferences. Our work aims to use modern machine learning and language modeling techniques, rather than a handwritten grammar, to generate more realistic academic papers.

Another work, by [char_rnn], showed that even simple RNN models capture interesting features and can generate any kind of text by learning character-level language models. [char_rnn] used the LaTeX source files of an algebraic geometry book to generate mathematical formulas and proofs, and claimed that it was possible to compile the generated LaTeX code with little post-processing. In this work, we take a few steps further by training recent models on a relatively large amount of LaTeX data, in order to generate fake academic papers.

In the general language modeling and text generation literature, many recent studies improve the state-of-the-art. In one of these studies, [Attention_is_All_you_Need] introduced the Transformer, which is built around a mechanism called attention that lets the model focus on specific parts of the sequential data for language modeling and language understanding. Since it is hard to train sequential models in a parallel manner, the attention mechanism completely eliminates the recurrence of the model and paves the way to much faster training on parallel GPUs, while improving on the success of earlier models. In the Transformer architecture, the sequence information is provided with positional encodings, and the model is expected to learn the joint probability of given sequences by using an encoder-decoder network.

Another model, namely Transformer-XL [transformer_xl], was proposed to eliminate the drawback of the fixed-length context in the Transformer. In other words, while the Transformer architecture uses a fixed-size context, Transformer-XL extends this idea to employ a variable-length context to learn the dependencies in the data, while also preserving its temporal structure. It is shown that Transformer-XL learns longer dependencies than both RNNs and Transformers.

There is also the GPT-2 model [radford2019language], which recently set the state-of-the-art in language modeling by training a Transformer with 1.5B parameters on a huge dataset called WebText. These outstanding results showed that it is possible to drastically improve the quality of language models by using more data and more parameters.

[ijcai2018-567] also studied language modeling, but on a more challenging task: writing an essay paragraph about a set of selected topics. They used an augmented LSTM structure to generate text sequences, each of which is related to the topics as well as to the preceding sequences. Although they showed how difficult it is to sustain the integrity of the writing, they managed to outperform other essay generation studies in both quantitative and qualitative results.

3 Methodology

In this work, we aim to generate academic papers in LaTeX, since it is a widely used typesetting system in academia and it is self-contained, in the sense that an academic paper can be produced by generating plain text alone. The process of preparing the new dataset of papers written in LaTeX is described in Section 3.1. Because determining word boundaries in LaTeX is difficult, we approached the problem as character-level language modeling. Besides, the vocabulary used in academic papers is extensive, so the computational complexity would have been higher with word-level rather than character-level language modeling.

In the experiments, the performance of the Transformer and Transformer-XL is compared against a baseline RNN architecture built on the LSTM structure. As is common in language modeling [graves2013generating, transformer_xl], the models explained in Section 3.2 are trained to predict $P(x_t \mid x_1, \dots, x_{t-1})$ and thereby learn the joint probability $P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$, where $x_1, \dots, x_T$ is the given sequence of characters.
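To make the character-level objective concrete, here is a small, self-contained illustration of how per-character conditional probabilities combine into a joint probability and a cross-entropy in nats per character. The conditional values are made up for illustration, not taken from the trained models:

```python
import math

def sequence_log_prob(cond_probs):
    """Log of the joint probability of a character sequence, given the
    per-step conditional probabilities P(x_t | x_1, ..., x_{t-1})."""
    return sum(math.log(p) for p in cond_probs)

# Hypothetical conditionals for a 4-character sequence.
cond_probs = [0.5, 0.25, 0.5, 0.125]
log_p = sequence_log_prob(cond_probs)     # log(0.5 * 0.25 * 0.5 * 0.125)
cross_entropy = -log_p / len(cond_probs)  # average nats per character
```

A language model is trained to make this cross-entropy as small as possible over the training corpus.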

Feature Value
unique tokens 102
total tokens 37,921,928
papers 799
Table 1: Some quantitative information about the dataset.

3.1 Dataset

(Footnote 1: We decided not to share the dataset because of ethical concerns. However, our code, which is shared, can be used to recreate similar datasets.)

The dataset is prepared from academic papers on arXiv [arxiv], since, to the best of our knowledge, no similar dataset consisting of academic papers written in LaTeX existed. Table 1 shows some quantitative information about the dataset.

3.1.1 Preparation

The following steps were completed in order to create the dataset of LaTeX files:

  1. Academic papers on arXiv which are tagged as Computer Vision and submitted between 2015 and 2018 are selected as the subset of academic papers on which to base the dataset.

  2. The source files of the selected academic papers are downloaded.

  3. Each paper, which initially consists of multiple files, is compiled into one LaTeX file.

3.1.2 Preprocessing

The raw dataset is preprocessed at the character level in order to remove the noise coming from LaTeX comments. Infrequent characters, those appearing fewer than 100 times in the whole dataset, are deleted. The cleaned dataset, which consists of multiple LaTeX files, is concatenated into one file to feed the models. The resulting sequence is segmented into sequences of equal length, and these sequences form batches of equal size.
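The preprocessing steps above can be sketched as follows. This is an illustrative reimplementation rather than the authors' released code; the function name and the default sequence length are our own choices, while the 100-occurrence threshold comes from the text:

```python
from collections import Counter

def preprocess(documents, min_count=100, seq_len=100):
    """Clean LaTeX sources and segment them into fixed-length
    character sequences (sketch of Section 3.1.2)."""
    cleaned = []
    for doc in documents:
        lines = []
        for line in doc.splitlines():
            # Strip LaTeX comments: drop everything from an unescaped '%'
            # to the end of the line (simple heuristic; a literal
            # backslash right before '%' is not handled).
            out, prev = [], ""
            for ch in line:
                if ch == "%" and prev != "\\":
                    break
                out.append(ch)
                prev = ch
            lines.append("".join(out))
        cleaned.append("\n".join(lines))
    # Concatenate all files into one long string.
    text = "\n".join(cleaned)
    # Delete characters that appear fewer than min_count times.
    counts = Counter(text)
    text = "".join(ch for ch in text if counts[ch] >= min_count)
    # Segment into same-length sequences (a trailing partial one is dropped).
    return [text[i:i + seq_len]
            for i in range(0, len(text) - seq_len + 1, seq_len)]
```

Escaped percent signs (\%) survive the comment stripping, which matters for LaTeX sources where \% is common in reported results.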

3.2 Models

In this study, a character-based LSTM structure is selected as the baseline model, since [char_rnn] demonstrated its performance on LaTeX generation. We also experimented with the Transformer [Attention_is_All_you_Need] because of its recent achievements in sequence modeling. Finally, we experimented with Transformer-XL [transformer_xl], since it is claimed to capture long term dependencies better than the others. The Char-LSTM, Transformer, and Transformer-XL models are described in Sections 3.2.1, 3.2.2, and 3.2.3 respectively.

Figure 1: Char-LSTM model used in the experiments, where $e_i$ is the embedding vector for the $i$-th character, N is the number of LSTM layers, and M is the number of recurrent units.

3.2.1 RNN: Char-LSTM

RNN architectures learn sequential information from their inputs, and RNNs became a popular approach in sequence modeling [Sutskever2014, char_rnn, Sutskever:2011:GTR:3104482.3104610, graves2013generating]. However, [Hochreiter:91] and [Bengio:1994:LLD:2325857.2328340] described the vanishing gradients problem that makes RNN training difficult. Thus, new RNN architectures such as LSTM [Hochreiter:1997:LSM:1246443.1246450] and GRU [cho-etal-2014-properties] have been developed to avoid the vanishing gradients issue and make RNNs practically more useful. [char_rnn] trained a multi-layer LSTM structure on the LaTeX source of an algebraic geometry book and evaluated its LaTeX generation performance, which both inspired this study and provides our baseline model. In this baseline model, which we call Char-LSTM for the rest of this study, a character sequence input is first processed by an embedding layer, which maps each character to a fixed-size embedding vector. The embedding vectors are then fed to sequentially connected LSTM layers. An LSTM layer at step $t$, while processing the given sequence, can be described as $h_t = \mathrm{LSTM}(h_{t-1}, x_t)$, where $h_{t-1}$ is the previous hidden state and $x_t$ is the input at the current step. The output of the LSTM layers is then given to a dense layer, and lastly the softmax function is applied to the output vector of the dense layer. A flow chart of the model is given in Figure 1.
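The per-step recurrence of an LSTM layer can be sketched in NumPy as below. The gate ordering and the stacked weight shapes are conventional assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: given input x_t of shape (D,), previous hidden state
    h_prev (H,) and cell state c_prev (H,), with stacked gate weights
    W (4H, D), U (4H, H) and bias b (4H,), return (h_t, c_t)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Stacking N such layers and unrolling over the character sequence reproduces the structure shown in Figure 1.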

3.2.2 Transformer

Figure 2: Transformer model, where $h$ is the number of self attention heads in a multi-head attention layer, $N$ is the number of hidden encoder and decoder layers, $\oplus$ is the concatenation layer, + is the vector addition layer, arrows ($\rightarrow$) refer to the flow of data, Q, K, and V are abbreviations for query, key, and value respectively, $X_Q$, $X_K$, and $X_V$ are the corresponding inputs, and $i$ is the index of a self attention head in a multi-head attention layer.

Recently, attention mechanisms have been integrated into RNNs in order to allow modeling distance-free dependencies [bahdanau2014neural, Kim2017StructuredAN]. The Transformer [Attention_is_All_you_Need] was introduced as a sequence transduction model based entirely on attention mechanisms, without recurrence, so that it can make more use of parallel processing than recurrent networks. In recent studies [devlin2018bert, radford2019language], the Transformer and its components have shown significant success and become a popular approach to sequence modeling. Therefore, it is a prominent candidate model for the academic paper generation problem.

The Transformer consists of embedding, positional encoding, multi-head attention, and other standard building blocks of deep learning architectures, such as normalization [Ioffe:2015:BNA:3045118.3045167] and dense layers. Positional encoding is used in order not to lose the positional information of the given sequence. Multi-head attention is a layer containing multiple self attention components connected in parallel. Self attention is described mathematically as

$\mathrm{Attention}(X) = \mathrm{softmax}\!\left(\frac{(X W_Q)(X W_K)^\top}{\sqrt{d_k}}\right) (X W_V),$

where $X$ is the input to the self attention, $W_Q$, $W_K$, and $W_V$ are parameter matrices used to project the input to queries, keys, and values respectively, and $d_k$ is the number of dimensions of the keys. The input feeds through an embedding layer, a concatenation layer where the positional encoding is concatenated, and the encoder layers. Furthermore, the shifted input goes through a similar embedding layer and the decoder layers.
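A single self attention head can be written out directly in NumPy. This is an illustrative sketch with arbitrary matrix shapes, not the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self attention for one head:
    softmax((X W_q)(X W_k)^T / sqrt(d_k)) (X W_v)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) attention weights
    return softmax(scores) @ V
```

Each output row is a convex combination of the value vectors, which is what lets the model "focus" on specific positions of the sequence.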

An encoder layer consists of a multi-head attention layer; an addition layer where the input and output of the multi-head attention are summed, just like a residual connection in ResNet [DBLP:journals/corr/HeZRS15]; a normalization layer; a dense layer; an addition layer where the input and output of the dense layer are summed; and finally a normalization layer, in the given order.

A decoder layer consists of a multi-head attention layer; an addition layer where the input and output of the multi-head attention are summed; a normalization layer; a second multi-head attention layer, where this time the keys (K) and values (V) come from the output of the encoder layers and the query (Q) comes from the output of the preceding normalization layer; dense and addition layers where the input and output of the dense layer are summed; and finally a normalization layer, in the given order. At the end, the output of the decoder layers passes through a dense layer and a softmax layer. Figure 2 shows the complete model, and [Attention_is_All_you_Need] describes the details of the model.

3.2.3 Transformer-XL

The Transformer model makes use of a constant number of context tokens, since the model takes fixed-size sequences. The context length is generally set to a few hundred in practice because of computational limitations. In addition, the Transformer architecture has no ability to carry information between context segments of a sequence. The problem that commonly occurs in practice is that a given sequence is segmented into fixed-size vectors of context tokens without respecting semantic boundaries, which is also a problem of long term dependencies. Similarly, LaTeX includes long term dependencies, such as the dependency between \begin and \end statements, since there can be a long text in-between.

Besides, the study by [khandelwal-etal-2018-sharp] showed that LSTM language models use about 200 context tokens on average. Intuitively, a model that cannot learn interconnections across segments of a sequence would not be sufficient to successfully generate LaTeX files, which require longer dependencies. Therefore, a model that handles longer dependencies between sequences becomes a more appropriate choice than Char-LSTM and the Transformer for the LaTeX generation task.

Dependencies in Transformer
Dependencies in Transformer-XL
Table 2: Example two-layer illustration of Transformer-XL in comparison to the Transformer, where $x_i$ is the input at time $i$, $h^n_i$ is the hidden state on layer $n$ at time $i$, arrows ($\rightarrow$) represent dependencies, $h^0_i$ is an abbreviation for $x_i$, the dashed line in the Transformer figure represents no information flow in-between, and the dashed arrows in the Transformer-XL figure show the newly added dependencies.

Recently, [transformer_xl] addressed the mentioned dependency problems and introduced Transformer-XL, an extended version of the Transformer which stores and makes use of previous hidden states, increasing the capacity of the model to capture long term dependencies. The main extension made by [transformer_xl] to the Transformer is to change the input of the self attention layers as follows:

$\tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right],$

where $h^{n-1}_{\tau+1}$, the input for the current self attention layer, is the hidden state of the previous layer for the $(\tau+1)$-th input segment of fixed size $L$; $\tilde{h}^{n-1}_{\tau+1}$ is the extended version of $h^{n-1}_{\tau+1}$; $\mathrm{SG}(\cdot)$ is the stop-gradient operator; and $\circ$ is the concatenation operator. Keys and values are computed from the extended hidden states, while queries come only from the current segment. Transformer-XL also includes a new, more suitable positional encoding, since the positional encoding introduced in [Attention_is_All_you_Need] fails to differentiate $x_{\tau,j}$ and $x_{\tau+1,j}$, where $x_{\tau,j}$ is the $j$-th token in the $\tau$-th input segment.
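The extended context amounts to concatenating cached and current hidden states along the time axis, as this small NumPy sketch shows. In a real implementation the cached states are detached from gradient computation (the stop-gradient above); NumPy has no gradients, so only the concatenation is modeled, and the memory-length parameter is our own illustrative knob:

```python
import numpy as np

def extended_context(h_prev, h_curr, mem_len):
    """Build the Transformer-XL style extended hidden states by
    concatenating up to mem_len cached states of the previous segment
    with the current segment's states along the time axis (axis 0).
    Keys/values are computed from the result; queries from h_curr only."""
    return np.concatenate([h_prev[-mem_len:], h_curr], axis=0)
```

Because each layer's extended states already contain the previous segment's states, information can propagate across many segments as depth increases, which is the source of the longer effective context.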

4 Experiments

(Footnote 3: The code for the experiments and the dataset can be found at …)

Hyperparameter Char-LSTM Transformer Transformer-XL
sequence length 100 128 variable*
batch size 64 4096 22
hidden layers (N) 1 2 12
embedding size 256 256 512
hidden size 1024 256 512
number of heads (h) - 4 8
dropout rate - 0.1 0.1
optimizer adam adam adam
learning rate schedule - custom** cosine
learning rate 0.001 0.2 0.00025
beta1 (adam) 0.9 0.9 0.9
beta2 (adam) 0.999 0.997 0.999
epsilon (adam)
Table 3: Hyperparameters used for the models. *The randomized sequence length option provided by [transformer_xl] is selected. **The default scheduler parameters are used.
(a) Char-LSTM
(b) Transformer
(c) Transformer-XL
Figure 3: Training and validation loss per epoch/step for each model. Each epoch in Char-LSTM refers to …K steps.

Implementations provided by the authors of the selected models are used in the experiments. TensorFlow, Keras, PyTorch [paszke2017automatic], and Tensor2Tensor [tensor2tensor] are used, since the models are originally implemented using one or more of these.

4.1 Experimental Setup

We used different computational resources to train the models: the Char-LSTM and Transformer models are trained on a Tesla K80 GPU, while the Transformer-XL model is trained using four Tesla V100 GPUs in parallel.

Hyperparameters used for the Char-LSTM, Transformer, and Transformer-XL models can be seen in Table 3. For the Char-LSTM model, we experimented with different values for the context vector length, hidden layer size, and hidden unit size; for the sake of simplicity, we decided to train the model with only one hidden layer as our baseline. For the Transformer model, we used the default hyperparameter settings given in Tensor2Tensor [tensor2tensor]. Hyperparameters for the Transformer-XL model are likewise chosen as the base setting of the model in the original work by [transformer_xl], as given in the official code released by the authors. The base setting of Transformer-XL is shallower and simpler than the full setting; we chose it because of the lack of computational resources, since the full setting makes the model considerably more complex and was originally trained on a TPU cluster.

For text generation with the trained models, we experimented with different values of the softmax temperature parameter. The temperature parameter controls the randomness of the outputs by scaling the values fed to the softmax: the computed logits are divided by the temperature. As the temperature approaches 1, the final output converges to the actual values of the logits, which makes the model more "random" during sampling, since the probabilities are more evenly distributed. When the temperature goes to 0, the logits with higher values are favored and the model becomes more "confident" but also "conservative", always choosing the most likely outcomes when sampling from the output probabilities.
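Temperature scaling amounts to dividing the logits by the temperature before the softmax; the following minimal sampling sketch (the function and its NumPy realization are ours, not from the paper's code) makes the effect concrete:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by the temperature, apply softmax, and sample an index.
    Low temperature -> near-greedy; temperature 1 -> unscaled logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)
```

With a very low temperature the distribution collapses onto the largest logit, while higher temperatures spread probability mass across more characters.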

4.2 Results

Model Training CE Validation CE Training BPC Validation BPC
Char-LSTM 0.99 1.15 1.43 1.66
Transformer 1.05 1.16 1.51 1.67
Transformer-XL 0.48 0.71 0.69 1.02
Table 4: Best training and validation cross entropy (CE) errors and bits-per-character (BPC) calculations on trained models.
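The CE and BPC columns of Table 4 are consistent with each other under the standard conversion from nats to bits, division by ln 2, which can be checked directly:

```python
import math

def nats_to_bpc(ce_nats):
    """Convert cross-entropy in nats per character to bits-per-character."""
    return ce_nats / math.log(2)

# (cross-entropy, reported BPC) pairs from Table 4.
pairs = [(0.99, 1.43), (1.15, 1.66), (1.05, 1.51),
         (1.16, 1.67), (0.48, 0.69), (0.71, 1.02)]
for ce, bpc in pairs:
    assert round(nats_to_bpc(ce), 2) == bpc
```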
Listing 1: Passage generated by Char-LSTM
We further analyze the effect of obtaining the proposed approach of our method and fine-tuning the spatial relationship between the surface instead of the annotation error (as provided by the original image) which are shown in Fig. \ref{fig:runtime}. The second step is trained on the training set in a standard environment with a single sample image of the same size with a single color (blue)

are present in the image. We then evaluated the performance of the pre-trained CNN features to obtain better performance than the state-of-the-art face recognition model in the appendix.

Listing 2: Passage generated by Transformer-XL
We verify the effectiveness of our proposed method on both the 50k \emph{and} the 2012 dataset available, all reported a reference performance of $66.1%$ on the validation set. We also drop our initial performance (denoted as RPN $\stackanchor{+}{}$) and add RRCN $\stackanchor{+}{}$

to get an average performance according to the three evaluation metrics. Results are shown in Table 

\ref{tab:accuracy}. The three evaluation metrics are measured by computing the average, and showing the difference of RRCN with respect to the obtained initial performance (denoted as \textit{initial performance}).
Table 5: Comparison of generated example passages, on which interesting points are highlighted.
\hyphil_im_impool/serre/budget & DBN \\
MNIST & Last y3\\
Listing 3: LaTeX figure generated by Char-LSTM
\caption{\small{Architecture of the VGGNet  (Siamese net), as illustrated in Figure~\ref{fig_arch}. The architectures have indeed the same architecture as VGGNet.} }
Listing 4: LaTeX figure generated by Transformer-XL
Table 6: Comparison of generated example LaTeX figures, on which basic LaTeX keywords are highlighted.
Quantitative Results

In this study, three models are employed to generate scientific text sequences that can be compiled as an academic paper. The baseline model, Char-LSTM, gave promising quantitative and qualitative results. However, the Transformer model could not improve on the baseline results because of its ineffective use of a fixed sequence length. Char-LSTM also uses a fixed sequence length, but it carries residual information between sequences, whereas the Transformer does not. This limitation can be tolerated in plain text generation tasks, yet its effects are severe in tasks such as LaTeX generation, where long dependencies are required.

Although the baseline model surpassed the performance of the Transformer due to the limitations mentioned above, the Transformer-XL model improved the quantitative results by allowing sequence segments to carry information to one another. Transformer-XL outperformed the rest of the models on both cross-entropy error (CE) and bits-per-character (BPC). The detailed comparison of quantitative results between models is given in Table 4. Besides, the validation losses of both the Char-LSTM and Transformer models converged to approximately 1.15, while the validation loss of Transformer-XL reached 0.71 at the end of training. The detailed loss curves from the training phase can be seen in Fig. 3.

Qualitative Results on Text Generation

The qualitative success of the studied models' outputs correlates with the quantitative results. The baseline model (Char-LSTM) consistently writes syntactically correct sentences, even though the model is trained at the character level. For instance, each sentence generated by Char-LSTM starts with a capital letter and ends with a punctuation mark. Furthermore, it uses explanatory phrases such as "(blue)", as shown in Table 5. Transformer-XL also generates syntactically correct sentences, yet its sentences are better formed and exhibit longer semantic dependencies than the baseline's. An example passage written by Transformer-XL is also given in Table 5.

As with most language models, we observed in the conducted experiments that the models may suffer from the repetition of words. Despite creating syntactically correct sequences, repetitive outputs can occur where the semantic integrity is shallow, such as the opening sentence of each section in a generated paper. To cope with this pitfall, we experimented with different softmax temperature values (as mentioned in Section 4.1) when generating text and analyzed the outputs qualitatively. We found that intermediate temperature values yielded better text generation, striking a good balance between confidence and diversity.

Qualitative Results on Syntactic Features of LaTeX

The Char-LSTM and Transformer-XL models are able to use simple syntactic features of LaTeX, such as referencing and mathematical expressions. However, the baseline model could not learn more complex features. Table 6 shows that Char-LSTM starts a table with a \begin statement, yet is unable to end it with an \end command. This problem is caused by the bottleneck of the sequence length hyperparameter: one sentence barely fits in a character sequence when the hyperparameter is set to 100 (shown in Table 3). Unlike the baseline model, Transformer-XL uses a variable sequence length, and thus it manages to learn longer dependencies. The qualitative results also show that Transformer-XL is more successful at using complex features of LaTeX documents.

5 Conclusion and Future Work

In this study, we compiled a new dataset from the LaTeX source files of academic papers on arXiv, trained some of the recent language modeling (or, in general, sequence modeling) methods on the new dataset, and evaluated their performance on academic paper generation. Based on the qualitative and quantitative results, we observed that the Transformer-XL model outperformed the Char-LSTM and Transformer models.

Structured text generation, which is a superset of academic paper generation, is an important problem for evaluating sequence models, since it has more explicit dependencies than general language modeling problems, and most of these dependencies, such as the relation between \begin and \end, can be tested in the generated outputs with basic context-free grammars. Developing evaluation metrics based on this idea may be an interesting direction for future work. Moreover, we deduced that the outputs tend to deviate from the general subject of the paper, although the sentences generated by the models exhibit short term dependencies. Therefore, our motivation for future work is to implement augmented architectures that sustain coherence on the subject, similar to [ijcai2018-567], in order to generate more realistic academic papers.
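As a concrete instance of such a grammar-based check, verifying that \begin and \end environments pair up correctly needs only a stack. The checker below is our illustration of the idea, not part of the paper's evaluation:

```python
import re

def begin_end_balanced(latex):
    """Return True iff every \\begin{env} in the string has a matching,
    properly nested \\end{env} -- a stack-based check of the simplest
    context-free dependency in LaTeX."""
    stack = []
    for kind, env in re.findall(r"\\(begin|end)\{([^}]*)\}", latex):
        if kind == "begin":
            stack.append(env)
        elif not stack or stack.pop() != env:
            return False  # stray or mismatched \end
    return not stack      # leftover \begin means unbalanced
```

Running such a checker over generated samples would give a cheap, automatic syntactic score to complement cross-entropy and BPC.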