As more and more academic papers are submitted to conferences and journals, having professionals evaluate all of them is time-consuming and can cause inequality due to the personal factors of the reviewers. To assist professionals in evaluating academic papers, we propose a novel task: automatic academic paper rating (AAPR), which automatically determines whether to accept academic papers. We build a new dataset for this task and propose a novel modularized hierarchical convolutional neural network to achieve automatic academic paper rating. Evaluation results show that the proposed model outperforms the baselines by a large margin. The dataset and code are available at <https://github.com/lancopku/AAPR>.
Every year, thousands of academic papers are submitted to conferences and journals. Rating all these papers can be exhausting, and rating scores can sometimes be affected by the personal factors of the reviewers, leading to inequality. Therefore, there is a great need for rating academic papers automatically. In this paper, we explore how to automatically rate academic papers based on their LaTeX source files and meta information, which we call the task of automatic academic paper rating (AAPR).
One of the earliest attempts to solve the automatic essay scoring (AES) task predicted the score using linear regression over expert-crafted textual features. Much of the following work applied similar methods, using various classifiers with more sophisticated features including grammar, vocabulary and style (Rudner and Liang, 2002; Attali and Burstein, 2004). These traditional methods can work almost as well as human raters. However, they all demand a large amount of feature engineering, which requires considerable expertise.
Recent studies have turned to deep neural networks, claiming that deep learning models can relieve the system from heavy feature engineering. Alikaniotis et al. (2016) proposed to use a long short-term memory network (Hochreiter and Schmidhuber, 1997) with a linear regression output layer to predict the score. They added a score prediction loss to the original C&W embedding (Collobert and Weston, 2008; Collobert et al., 2011), so that the word embeddings are related to the quality of the essay. Taghipour and Ng (2016) also applied recurrent neural networks to process the essay, except that they put a convolutional layer ahead of the recurrent layer to extract local features. Dong and Zhang (2016) proposed to apply a two-layer convolutional neural network (CNN) to model the essay. The first layer is responsible for encoding the sentence and the second layer encodes the whole essay. Dong et al. (2017) further proposed to add an attention mechanism to the pooling layer to automatically decide which part is more important in determining the quality of the essay.
Although there has been a lot of work on the AES task, researchers have not attempted the AAPR task. Unlike essays in language capability tests, academic papers are much longer and contain much more information, and their overall quality is affected by a variety of factors besides the writing. Therefore, we propose a model that considers the overall information of an academic paper, including the title, authors, abstract and the main content of the LaTeX source file of the paper.
Our main contributions are listed as follows:
We propose the task of automatically rating academic papers and build a new dataset for this task.
We propose a modularized hierarchical convolutional neural network model that considers the overall information of the source paper. Experimental results show that the proposed method outperforms the baselines by a large margin.
A source paper usually consists of several modules, such as abstract, title and so on (names such as abstract and title denote modules of the source paper). There is also a hierarchical structure from word level to sentence level in each module. This structural information is likely to help make more accurate predictions. In addition, the model can be improved by considering the different contributions of the various parts of the source paper. Based on this observation, we propose a modularized hierarchical CNN. An overview of our model is shown in Figure 1. For simplicity, Figure 1 depicts a source paper with 3 modules, 3 words, and a filter size of 2 (see Section 2.1 and Section 2.2 for detailed explanations).
Given a complete source paper, represented by a sequence of tokens, we first divide it into several modules based on the general structure of the source paper (abstract, title, authors, introduction, related work, methods and conclusion). For each module, the one-hot representation of each word is embedded into a dense vector through an embedding matrix. For the modules abstract, introduction, related work, methods and conclusion, we use a word-level attention-based CNN (described in Section 2.2) to obtain the representation of each sentence. Another attention-based CNN layer then encodes the sentence-level representations into the representation of the module.
There is only one sentence in the title of the source paper, so it is reasonable to get the module-level representation of title only using attention-based CNN in word-level. Besides, the weighted average method is applied to obtain the module-level representation of authors by Equation (1) because the authors are independent of each other.
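$$m_{\text{authors}} = \sum_{i=1}^{A} \gamma_i a_i \qquad (1)$$
where $\gamma = (\gamma_1, \ldots, \gamma_A)^\top$ is the weight parameter, $a_i$ is the embedding vector of the $i$-th author in the source paper, which is randomly initialized and learned during training, and $A$ is the maximum length of the author sequence.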
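The weighted combination of author embeddings can be illustrated with a minimal Python sketch. The function name, the toy embeddings and the weight values below are hypothetical; in the model both the embeddings and the weights are learned, and this sketch only shows the arithmetic of combining one weight per author slot with the corresponding embedding vector.

```python
def author_module_representation(author_embeddings, weights):
    """Combine author embedding vectors using one weight per author slot."""
    if len(author_embeddings) != len(weights):
        raise ValueError("need exactly one weight per author slot")
    dim = len(author_embeddings[0])
    result = [0.0] * dim
    for weight, embedding in zip(weights, author_embeddings):
        for d in range(dim):
            result[d] += weight * embedding[d]
    return result


# Toy example: three authors with 2-dimensional embeddings (illustrative values).
embeddings = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
weights = [0.5, 0.25, 0.25]
print(author_module_representation(embeddings, weights))  # [1.5, 1.5]
```

Because the combination does not depend on any interaction between neighboring authors, it matches the stated motivation that authors are independent of each other.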
Representations of all modules are aggregated into the paper-level representation of the source paper with an attentive pooling layer. A softmax layer takes this paper-level representation as input and predicts the probability of the paper being accepted. During training, the cross-entropy loss, which is widely used in classification tasks, is optimized as the objective function.
Attention-based CNN consists of a convolution layer and an attentive pooling layer. The convolution layer is used to capture local features and attentive pooling layer can automatically decide the relative weights of words, sentences, and modules.
Convolution layer: A sequence of vectors of length $m$ is represented as the row concatenation of $m$ $k$-dimensional vectors: $X = [x_1, x_2, \ldots, x_m]$. A filter $W_x \in \mathbb{R}^{h \times k}$ convolves with the window vectors at each position to generate a feature map $c$. Each element $c_j$ of the feature map is calculated as follows:
$$c_j = f\left(W_x \circ [x_j, x_{j+1}, \ldots, x_{j+h-1}] + b_x\right)$$
where $\circ$ is element-wise multiplication (the resulting products are summed to a scalar), $b_x$ is a bias term, and $f$ is a non-linear activation function. Here we choose $f$ to be ReLU (Nair and Hinton, 2010). $n$ different filters can be used to extract multiple feature maps $c_1, c_2, \ldots, c_n$. We get new feature representations $C$ as the column concatenation of the feature maps, $C = [c_1, c_2, \ldots, c_n]$. The $j$-th row of $C$ is the new feature representation generated at position $j$.
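The convolution layer described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the released implementation: the function names are hypothetical, the filters are hand-picked instead of learned, and no padding is applied, so a sequence of m vectors and a filter covering h positions yields m - h + 1 outputs.

```python
def relu(value):
    """Rectified linear unit."""
    return max(0.0, value)


def conv_feature_map(vectors, filt, bias):
    """Slide one filter (h rows of k weights) over a sequence of k-dim vectors."""
    h = len(filt)
    feature_map = []
    for j in range(len(vectors) - h + 1):  # assumption: no padding
        total = bias
        for row in range(h):
            for col in range(len(filt[row])):
                # element-wise product of filter and window, summed to a scalar
                total += filt[row][col] * vectors[j + row][col]
        feature_map.append(relu(total))
    return feature_map


def conv_layer(vectors, filters, biases):
    """Apply several filters and concatenate their feature maps column-wise."""
    maps = [conv_feature_map(vectors, f, b) for f, b in zip(filters, biases)]
    # Row j of the result holds the features produced at position j.
    return [list(row) for row in zip(*maps)]


# Toy example: 3 word vectors of dimension 2, two filters of size 2.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
biases = [0.0, 0.5]
print(conv_layer(words, filters, biases))  # [[2.0, 0.0], [1.0, 0.0]]
```

The second filter produces only negative pre-activations in this toy input, so ReLU maps its entire feature map to zero.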
Attentive pooling layer: Given a sequence $a_1, a_2, \ldots, a_T$ of $n$-dimensional vectors, attentive pooling aggregates the representations of the sequence by measuring the contribution of each vector to the high-level representation $o$ of the whole sequence. Formally, we have
$$z_i = \tanh(W_a a_i + b_a)$$
$$\alpha_i = \frac{\exp(w_z^\top z_i)}{\sum_{j=1}^{T} \exp(w_z^\top z_j)}$$
$$o = \sum_{i=1}^{T} \alpha_i a_i$$
where $W_a$ and $b_a$ are the weight matrix and bias vector, respectively, and $w_z$ is a randomly initialized vector that is learned during training.
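A minimal Python sketch of attentive pooling follows. It assumes a common formulation: each vector is passed through a tanh projection, scored against a learned context vector, and the scores are normalized with softmax to weight the original vectors. The function and parameter names are hypothetical, and the parameter values in the example are hand-picked rather than learned.

```python
import math


def attentive_pooling(vectors, weight, bias, context):
    """Return (pooled_vector, attention_weights) for a sequence of vectors."""
    scores = []
    for a in vectors:
        # Hidden projection of the input vector (assumed tanh non-linearity).
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, a)) + b)
            for row, b in zip(weight, bias)
        ]
        # Scalar relevance score against the context vector.
        scores.append(sum(c * z for c, z in zip(context, hidden)))
    # Softmax over the scores, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(al * a[d] for al, a in zip(alphas, vectors)) for d in range(dim)]
    return pooled, alphas


# With an all-zero context vector every score is equal, so pooling reduces
# to a plain average of the inputs.
vecs = [[1.0, 3.0], [3.0, 5.0]]
pooled, alphas = attentive_pooling(vecs, [[0.0, 0.0]], [0.0], [0.0])
print(pooled, alphas)  # [2.0, 4.0] [0.5, 0.5]
```

The same routine can be reused at the word, sentence and module levels, since it only requires a sequence of equally sized vectors.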
In this section, we evaluate our model on the dataset we build for this task. We first introduce the dataset, evaluation metric, and experimental details. Then, we compare our model with baselines. Finally, we provide the analysis and the discussion of experimental results.
Arxiv Academic Paper Dataset:
As there is no existing dataset that can be used directly, we create a dataset by collecting data on academic papers in the field of artificial intelligence from the arXiv website (https://arxiv.org/). The dataset consists of 19,218 academic papers. The information for each source paper consists of the venue, which marks whether the paper is accepted, and the source LaTeX file. We divide the dataset into training, validation, and test parts. The details are shown in Table 1.
Since author names differ from the common scientific words in the paper, we build separate vocabularies for authors and for the text words of source papers.
We follow the training strategies of Zhang and Wallace (2015) for CNN classifiers and tune the hyper-parameters based on accuracy on the validation set. The word and author embeddings are randomly initialized and learned during training. The size of the word and author embeddings is 128 and the batch size is 32. The Adam optimizer (Kingma and Ba, 2014) is used to minimize the cross-entropy loss. We apply dropout regularization (Srivastava et al., 2014) to avoid overfitting and clip the gradients (Pascanu et al., 2013) to a maximum norm of 5.0.
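Gradient clipping to a maximum norm of 5.0 can be illustrated with the short sketch below. It assumes the norm is the global L2 norm taken over all parameter gradients together, which is one common reading; the text does not specify whether clipping is global or per parameter. The function name is hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm=5.0):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(grad) for grad in gradients], norm
    scale = max_norm / norm
    return [[g * scale for g in grad] for grad in gradients], norm


clipped, norm = clip_by_global_norm([[6.0, 8.0]])
print(norm, clipped)  # 10.0 [[3.0, 4.0]]
```

Rescaling preserves the direction of the update and only limits its length, which is why it is often used to keep training stable when occasional large gradients occur.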
During training, we train the model for a fixed number of epochs and monitor its performance on the validation set after every 50 updates. Once training is finished, we select the model with the highest accuracy on the validation set as our final model and evaluate its performance on the testing set.
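The model selection step can be sketched as a simple scan over the validation log. The log format (pairs of update step and validation accuracy) and the accuracy values are hypothetical; ties are resolved in favor of the earliest checkpoint, which is an assumption the text does not address.

```python
def select_best_checkpoint(validation_log):
    """Pick the (update_step, accuracy) pair with the highest validation accuracy."""
    best_step, best_acc = None, float("-inf")
    for step, acc in validation_log:
        if acc > best_acc:  # strict comparison keeps the earliest best checkpoint
            best_step, best_acc = step, acc
    return best_step, best_acc


# Validation accuracy recorded after every 50 updates (illustrative values).
log = [(50, 0.60), (100, 0.65), (150, 0.65), (200, 0.63)]
print(select_best_checkpoint(log))  # (100, 0.65)
```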
We compare our model with the following baselines:
Randomly predict (RP): We randomly decide whether the source paper can be accepted. In other words, the probability of acceptance of every source paper is always 0.5 using this strategy.
Traditional machine learning algorithms:
We use various machine learning classifiers to predict the labels based on the tf-idf features of the text.
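As an illustration of the features used by these baselines, the sketch below computes tf-idf weights in plain Python. The exact tf-idf variant is not specified in the text; this version assumes whitespace tokenization, term frequency normalized by document length, and an unsmoothed logarithmic inverse document frequency. The function name and the two example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["we propose a model", "we evaluate a baseline"]
vectors = tfidf_vectors(docs)
print(vectors[0]["we"])  # 0.0, because "we" occurs in every document
print(vectors[0]["propose"] > 0.0)  # True
```

Terms shared by all documents receive zero weight, so a classifier trained on these features relies on the words that distinguish one paper from another.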
In this subsection, we present the evaluation results by comparing our proposed method with the baselines. Table 2 reports the experimental results of the various models. As shown in Table 2, the proposed MHCNN outperforms all the baselines mentioned above. The best baseline model, SVM, achieves an accuracy of 61.6%, while the proposed model achieves an accuracy of 67.7%. In addition, our MHCNN outperforms other representative deep-learning models by a large margin. For instance, the proposed MHCNN achieves an improvement of 6.4% in accuracy over the traditional CNN. This shows that our MHCNN can learn a better representation by considering the modularized hierarchical structure of the source paper. Our proposed MHCNN divides a long text into several modules and uses an attention mechanism to aggregate the representations of the modules into a final high-level representation of the complete source paper. By incorporating knowledge of the structure of the source paper and automatically selecting the most informative words, the model is capable of making more accurate predictions.
Here we perform further analysis on the model and experiment results.
As shown in Table 2, our MHCNN model outperforms all baselines by a large margin. Compared with the basic CNN model, the proposed model has a modularized hierarchical structure and uses multiple attention mechanisms. To explore the impact of the internal structure of the model, we remove the modularized hierarchical structure and the attention mechanisms in turn. The performance is shown in Table 3. “w/o Attention” means that we still use the modularized hierarchical structure but do not use any attention mechanism. “w/o Module” means that we use neither the attention mechanism nor the modularized hierarchical structure, which is the same as the CNN model in the baselines.
As is shown in Table 3, the accuracy of the model drops by 0.9% when the attention mechanism is removed from the model. This shows that there are differences in the contribution of textual content. For instance, the abstract of a source paper is more important than its title. Attention mechanism can automatically decide the relative weights of modules, which makes model predictions more accurate. However, the accuracy of the model drops by 6.4% when we remove the modularized hierarchical structure, which is much larger than 0.9%. It shows that the modularized hierarchical structure of the model is of great help to obtain better representations by incorporating knowledge of the structure of the source paper.
One interesting question is which part of the source paper best determines whether it can be accepted. To explore this, we remove each module from the complete source papers in turn and observe the change in the performance of the model. The experimental results are shown in Table 4.
As shown in Table 4, the performance of the model declines to different degrees when we remove different modules of the source paper. This shows that the modules of the source paper contribute differently to its acceptance, which further supports our use of the modularized hierarchical structure and the attention mechanism. All the declines are statistically significant under the $t$-test.
When we remove the authors module, the accuracy drops by 3.1%, which is the largest decline. This shows that the authors of the source paper largely determine whether it can be accepted. A source paper written by a proficient scholar tends to be good work, which has a higher probability of being accepted. Apart from authors, the two modules with the greatest effect on the probability of acceptance are conclusion and abstract, because they are the essence of the entire source paper and directly reflect its quality. However, the methods module of the source paper has little effect on the probability of being accepted according to Table 4. The reason may be that the methods of different source papers vary widely, which means that there is high variance in this module. Therefore, our model may not do well in capturing a unified internal pattern to make predictions. The impact of the title is the smallest: the accuracy of the model drops by only 1.1% when title is removed from the source paper.
Table 4 (excerpt): without the related work module, the accuracy is 66.0%* (a decline of 1.7%).
The most relevant task for our work is automatic essay scoring (AES). There are two main types of methods for the AES task: traditional machine learning algorithms and neural network models.
Most traditional methods for the AES task use supervised learning algorithms, including classification (Larkey, 1998; Rudner and Liang, 2002; Yannakoudakis et al., 2011; Chen and He, 2013), regression (Attali and Burstein, 2004; Phandi et al., 2015; Zesch et al., 2015) and so on. However, they all require many manually designed features, for instance, bag of words, spelling errors, or lengths, which can be time-consuming to build and require a large amount of expertise.
In recent years, some neural network models have also been used for the AES task and have achieved great success. Alikaniotis et al. (2016) proposed to use the LSTM model with a linear regression output layer to predict the score. Taghipour and Ng (2016) applied the CNN model followed by a recurrent layer to extract local features and model sequence dependencies. A two-layer CNN model was proposed by Dong and Zhang (2016) to cover more high-level and abstract information. Dong et al. (2017) further proposed to add an attention mechanism to the pooling layer to automatically decide which part is more important in determining the quality of the essay. Song et al. (2017) proposed a multi-label neural sequence labeling approach for discourse mode identification and showed that features extracted by this method can further improve the AES task.
In this paper, we propose the task of automatic academic paper rating (AAPR), which aims to automatically determine whether to accept academic papers. We propose a novel modularized hierarchical CNN for this task to make use of the structure of a source paper. Experimental results show that the proposed model outperforms various baselines by a large margin. In addition, we find that the conclusion and abstract parts have the most influence on whether the source paper can be accepted when setting aside the factor of authors.
This work is supported in part by National Natural Science Foundation of China (No. 61673028), National High Technology Research and Development Program of China (863 Program, No. 2015AA015404), and the National Thousand Young Talents Program. Xu Sun is the corresponding author of this paper.
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1741–1752.
Automated essay scoring using Bayes’ theorem. The Journal of Technology, Learning and Assessment, 1(2).