Argumentative Zoning using word2vec
In comparison with document summarization on articles from social media and newswire, argumentative zoning (AZ) is an important task in scientific paper analysis. Traditional approaches to this task rely on feature engineering at different levels. In this paper, three models for generating sentence vectors for sentence classification were explored and compared. The proposed approach builds sentence representations from word embeddings learned by a neural network. The learned word embeddings form a feature space onto which each examined sentence is mapped. These features are then fed into classifiers for supervised classification. Evaluation was conducted with 10-fold cross-validation on the Argumentative Zoning (AZ) annotated articles. The results showed that simply averaging the word vectors in a sentence works better than the paragraph vector algorithm, and that integrating specific cuewords into the loss function of the neural network can improve classification performance. In comparison with hand-crafted features, the word2vec method won for most of the categories. However, the hand-crafted features showed their strength in classifying some of the categories.
One of the crucial tasks for researchers carrying out scientific investigations is to detect existing ideas that are related to their research topics. Research ideas are usually documented in scientific publications. Normally, there is one main idea stated in the abstract, explicitly presenting the aim of the paper. There are also other sub-ideas distributed across the entire paper. As the growth rate of scientific publication has been rising dramatically, researchers are overwhelmed by the explosion of information. It is almost impossible to digest the ideas contained in the documents that emerge every day. Therefore, computer-assisted technologies such as document summarization are expected to play a role in condensing information and providing readers with more relevant short texts. Unlike document summarization from news circles, where the task is to identify centroid sentences or to extract the first few sentences of the paragraphs, summarization of scientific articles involves an extra text processing stage. After the highest-ranked texts are extracted, rhetorical status analysis is conducted on the selected sentences. Rhetorical sentence classification, also known as argumentative zoning (AZ), is the process of assigning a rhetorical status to the extracted sentences. The results of AZ provide readers with a general discourse context from which the scientific ideas can be better linked, compared and analyzed. For example, which sentences should be shown to the reader for a specific task depends on the features of the sentences. For the task of identifying a paper's unique contribution, sentences expressing the research purpose should be retrieved with higher priority. For comparing ideas, statements of comparison with other works would be more useful. Teufel et al. introduced a rhetorical annotation scheme that takes into account the aspects of argumentation, metadiscourse and relatedness to other works.
Their scheme resulted in seven categories of rhetorical status, and the categories are assigned to full sentences. Examples of human-annotated sentences with their rhetorical status are shown in Table 1. (These texts were randomly selected from the Argumentative Zoning Corpus, which is described in the dataset section.) The seven categories are aim, contrast, own, background, other, basis and textual.
|Category||Example sentence|
|AIM||This paper discusses the lexicographical concept of lexical functions Mel’cuk and Zolkovsky 1984 and their potential exploitation in the development of a machine translation lexicon designed to handle collocations.|
|CTR||In two of the tasks, the training data is generated by a probabilistic context-free grammar and in both tasks our algorithm outperforms the other techniques.|
|OWN||We have explored examples of the kinds of tree sets and string languages that this system can generate.|
|BKG||English has a very limited system, marking little more than plurality on nouns and a restricted range of verb properties.|
|OTH||For this small example, writing such an apply predicate is not difficult.|
|BAS||Following Pereira et al. 1993, we measure word similarity by the relative entropy, or Kullback-Leibler distance, between the corresponding conditional distributions.|
|TXT||The next section describes the binary representation and the length formulae derived from it in detail; readers satisfied with the intuitive descriptions presented so far should skip ahead to the Phonotactics sub-section.|
Analyzing the rhetorical status of sentences manually requires a huge amount of effort, especially when structuring information from multiple documents. Fortunately, computer algorithms have been introduced to solve this problem. With the development of artificial intelligence, machine learning and computational linguistics, Natural Language Processing (NLP) has become a popular research area [5, 6]. NLP covers applications ranging from document retrieval and text categorization to document summarization [9, 10]. These applications target different types of text resources, such as articles from social media and scientific publications. There are several approaches to tackle these tasks. From a machine learning perspective, text can be analysed via supervised, semi-supervised and unsupervised algorithms.
Document summarization from social media and news circles has received much attention over the past decades. These problems have been addressed from many angles, one of which is feature extraction and representation. In the early stage of document summarization, features were usually engineered manually. Although hand-crafted features have proven useful for document summarization and sentiment analysis [14, 10], they are not sufficient to capture the semantic relations between words, phrases and sentences. Moreover, building a sufficient pool of features manually is difficult, because it requires expert knowledge and is time-consuming. Teufel et al. built a feature pool of sixteen types of features to classify sentences, such as the position of the sentence, sentence length and tense. Widyantoro et al. used content features, qualifying adjectives and meta-discourse features to explore the AZ task. Engineering these features took considerable effort, and optimizing the combination of all the features is also time-consuming. With the advent of neural networks, it is possible for computers to learn feature representations automatically. Recently, the word embedding technique has been widely used in the NLP community. There are plenty of cases where word embeddings and sentence representations have been applied to short text classification and paraphrase detection. However, the effectiveness of this technique on AZ needs further study. The research question is: is it possible to extract word embeddings as features to classify sentences into the seven categories mentioned above using a supervised machine learning approach?
The word2vec tool proposed by Mikolov et al. has gained a lot of attention recently. With word2vec, word embeddings can be learnt from a large text corpus, and the semantic relationships between words can be measured by the cosine distances between the vectors. The idea behind word embeddings is to use a distributed representation that maps each word to a k-dimensional vector. How are these vectors generated by word2vec? The common method to derive the vectors is to use a neural probabilistic language model. The underlying representation of each word is obtained while training the language model. Similar to the mechanism in language models, Mikolov et al. introduced two architectures: the Skip-gram model and the continuous bag-of-words (CBOW) model. Each model can be trained with one of two strategies: hierarchical softmax or negative sampling. Both models have three layers: input, projection and output. The word vectors are obtained once the models are optimized, which is usually done with stochastic gradient descent. Training the models requires no labels, which makes word2vec more valuable than traditional supervised machine learning methods that require a large amount of annotated data. Given a large enough text corpus, word2vec can generate meaningful representations.
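To make the idea of measuring semantic relatedness with cosine concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for illustration and are not embeddings from this study; real word2vec vectors typically have 100 or 300 dimensions. Returning 0.0 for a zero vector is a convention chosen here, not something the source specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "method": [0.9, 0.1, 0.0],
    "approach": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["method"], toy_vectors["approach"])
sim_far = cosine_similarity(toy_vectors["method"], toy_vectors["banana"])
```

With these toy values, the pair pointing in nearly the same direction scores higher than the unrelated pair, which is the property that makes cosine useful for comparing word vectors.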
Word2vec has been applied to sentiment analysis [22, 23, 24] and text classification. Sadeghian and Sharafat explored averaging the word vectors in a sentiment review statement. Their results indicated that word2vec models significantly outperform the vanilla bag-of-words model. Amongst the word2vec-based models, softmax provides the best form of classification. Tang et al. used the concatenation of vectors derived from different convolutional layers to analyze sentiment statements. They also trained sentiment-specific word embeddings to improve Twitter sentiment classification results. This work aims at learning word embeddings for the task of AZ. The results were compared from three aspects: the impact of the training corpus, the effectiveness of specific word embeddings, and different ways of constructing sentence representations based on the learned word vectors.
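Le and Mikolov introduced the concept of word vector representation in a formal way. Given a sequence of training words \(w_1, w_2, \ldots, w_T\), the objective of the word2vec model is to maximize the average log probability:
\[ \frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}) \qquad (1) \]
Using the softmax technique, the prediction can be formalized as:
\[ p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \qquad (2) \]
Each \(y_i\) is the un-normalized log probability for output word \(i\):
\[ y = b + U h(w_{t-k}, \ldots, w_{t+k}; W) \qquad (3) \]
where \(U\) and \(b\) are the softmax parameters and \(h\) is constructed from the word vectors extracted from the word matrix \(W\).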
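The softmax step, which turns un-normalized log probabilities into a probability distribution over output words, can be sketched as follows. The parameter values, the three-word output vocabulary and the helper names `softmax` and `affine_scores` are assumptions made for illustration; the hidden vector `h` is taken as given rather than built from a trained model.

```python
import math


def softmax(scores):
    """Convert un-normalized log probabilities into probabilities."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine_scores(U, b, h):
    """Compute y = b + U h, with U given as a list of rows."""
    return [b_i + sum(u_ij * h_j for u_ij, h_j in zip(row, h))
            for row, b_i in zip(U, b)]


# Toy values, invented for illustration: 3 output words, 2 hidden units.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
h = [2.0, 1.0]

y = affine_scores(U, b, h)  # un-normalized log probabilities
p = softmax(y)              # probabilities over the 3 output words
```

In practice word2vec avoids the full sum in the denominator by using hierarchical softmax or negative sampling; the full softmax is shown here only because it is the simplest form to read.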
In this study, sentence embeddings were learned from a large text corpus and used as features to classify sentences into seven categories in the task of AZ. Three models were explored to obtain the sentence vectors: averaging the vectors of the words in one sentence, paragraph vectors, and specific word vectors.
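The first model, averaging word vectors (AVGWVEC), averages the vectors of the words in a word sequence \(w_1, w_2, \ldots, w_T\). The main process in this model is to learn the word embedding matrix \(W\):
\[ v(w_1, \ldots, w_T) = \frac{1}{T} \sum_{t=1}^{T} W_{w_t} \qquad (4) \]
where \(W_{w_t}\) is the word embedding for word \(w_t\), which is learned by the classical word2vec algorithm.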
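The averaging scheme of the first model can be sketched in a few lines of Python. The two-dimensional embeddings are invented for illustration. Skipping out-of-vocabulary tokens and returning an all-zero vector when no token is known are assumptions of this sketch; the source does not say how such cases were handled.

```python
def average_word_vectors(tokens, embeddings, dim):
    """Mean of the embeddings of the in-vocabulary tokens of a sentence."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # assumption: zero vector when no token is known
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy embeddings; "a" is deliberately out of vocabulary.
embeddings = {"we": [1.0, 0.0], "propose": [0.0, 1.0], "model": [1.0, 1.0]}
sentence = ["we", "propose", "a", "model"]
features = average_word_vectors(sentence, embeddings, dim=2)
```

The resulting fixed-length vector is what would be passed to the classifier, regardless of how many words the sentence contains.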
The second model, PARAVEC, aims at training paragraph vectors. It is also called the distributed memory model of paragraph vectors (PV-DM), which is an extension of word2vec. In comparison with the word2vec framework, the only change in PV-DM is in equation (3), where \(h\) is constructed from \(W\) and \(D\): matrix \(W\) holds the word vectors and \(D\) holds the paragraph vectors, in such a way that every paragraph is mapped to a unique vector represented by a column in matrix \(D\).
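As a rough sketch of how a hidden vector can be built from both a paragraph vector and context word vectors, the snippet below averages one column of the paragraph matrix with the columns of the context words. Averaging is one of the two combination options described for PV-DM (the other is concatenation); which one was used in this study is not stated, so the choice here is an assumption. Matrices are stored as lists of column vectors, and all values are invented.

```python
def pvdm_hidden(paragraph_id, context_ids, D, W):
    """Average one paragraph column of D with the context word columns of W."""
    columns = [D[paragraph_id]] + [W[i] for i in context_ids]
    dim = len(columns[0])
    return [sum(col[j] for col in columns) / len(columns) for j in range(dim)]


# Toy matrices stored column-wise: D has 2 paragraphs, W has 3 words.
D = [[1.0, 1.0], [0.0, 2.0]]
W = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]

h = pvdm_hidden(paragraph_id=0, context_ids=[1, 2], D=D, W=W)
```

Because the paragraph column takes part in every context window drawn from that paragraph, it acts as a memory of what is missing from the local context.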
The third model is constructed for the purpose of improving classification results for a certain category. In this study, the optimization task focused on identifying the category BAS. (This is a general case to show how to improve the classification result by integrating cuewords into the embeddings.) Specific word embeddings (BSWE) were trained, inspired by the Sentiment-Specific Word Embedding model of Tang et al. (unified model: \(SSWE_u\)). After obtaining the word vectors via BSWE, the same scheme as in the AVGWVEC model was used to average the vectors in one sentence.
The learned word embeddings are input into a classifier as features under a supervised machine learning framework. Similar to sentiment classification using word embeddings, where each tweet is predicted to be either positive or negative, in the task of AZ the embeddings are used to classify each sentence into one of the seven categories.
To evaluate the classification performance, precision, recall and F-measure were computed.
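For reference, per-category precision, recall and F-measure can be computed as in the sketch below. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which the source does not state explicitly, and it returns 0.0 when a denominator is zero. The gold and predicted labels are invented.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 of one category, treated one-vs-rest."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented labels for five sentences.
gold = ["AIM", "OWN", "OWN", "BAS", "AIM"]
pred = ["AIM", "OWN", "AIM", "BAS", "OWN"]

scores_aim = precision_recall_f1(gold, pred, "AIM")
scores_bas = precision_recall_f1(gold, pred, "BAS")
```

Reporting the three values per category, rather than overall accuracy, matters here because the category distribution is strongly skewed.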
ACL collection. The ACL Anthology Reference Corpus contains the canonical 10,921 computational linguistics papers, from which 622,144 sentences were generated after filtering out lower-quality sentences.
MixedAbs collection. This collection contains 6,778 sentences extracted from the titles and abstracts of publications provided by Web of Science.
AZ corpus. The Argumentative Zoning Corpus (AZ corpus) consists of 80 AZ-annotated conference articles in computational linguistics, originally drawn from the Cmplg arXiv (http://www.cl.cam.ac.uk/~). After concatenating sub-sentences, 7,347 labeled sentences were obtained.
To compare the effectiveness of the three models on the AZ task, all three were trained on the same ACL dataset (introduced in the dataset section). Word2vec was also trained with different parameters, such as different feature dimensions. To evaluate the impact of different domains, the first model was trained on different corpora.
The characteristics of the word embeddings obtained from the different models and datasets are listed in Table 2.
|Number of features||Vocabulary size|
Inspired by the work of Sadeghian and Sharafat (https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors), the word2vec parameters were set as follows: the minimum word count is 40, the number of threads to run in parallel is 4, and the context window is 10.
In imbalanced data sets, some classes are significantly outnumbered by other classes, which affects the classification results. In this experiment, the test dataset is imbalanced. Table 3 shows the distribution of rhetorical categories in the test dataset. The categories OWN and OTH significantly outnumber the other categories.
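To illustrate what the minimum word count and the context window control, the sketch below filters a vocabulary by frequency and then lists the (center, context) word pairs a window would produce. It is plain Python and not the word2vec tool itself. The function names and the tiny corpus are invented, rare words are assumed to be dropped before windows are formed, and the toy settings (`min_count=2`, `window=1`) are far smaller than the values of 40 and 10 reported above.

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}


def context_pairs(sentence, vocabulary, window):
    """List (center, context) pairs within `window` positions of each word."""
    kept = [tok for tok in sentence if tok in vocabulary]
    pairs = []
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs


corpus = [["we", "propose", "a", "model"], ["we", "evaluate", "a", "model"]]
vocab = build_vocabulary(corpus, min_count=2)
pairs = context_pairs(corpus[0], vocab, window=1)
```

A higher minimum count shrinks the vocabulary and removes noisy rare words, while a wider window lets each word see more distant neighbours, which tends to favour topical over syntactic similarity.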
|Category||Number of Sentences||Percentage|
To deal with the problem of classification on imbalanced data, the Synthetic Minority Over-sampling TEchnique (SMOTE) was applied to the original dataset. A 10-fold cross-validation scheme was adopted, and the results were averaged over the 10 iterations.
Tables 4 and 5 show the classification performance of the different methods. (Note that the results are not completely comparable with those of Teufel 2002, since the dataset differs due to the sentence concatenation in this paper. Teufel's reported results nevertheless serve as a reference.)
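The core idea of SMOTE is to create synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the implementation used in this study: it picks base points at random instead of iterating over every minority sample, and the function name, the default `k=2`, the seed and the toy points are all assumptions.

```python
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Invented 2-dimensional minority-class feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=5)
```

Each synthetic point lies on the line segment between two real minority points, so the minority class is enlarged without simply duplicating existing sentences. When combined with cross-validation, over-sampling is normally applied to the training folds only, so that synthetic points do not leak into the test fold; the source does not say which order was used.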
|Brown model 100||0.19/0.73/0.30||0.38/0.56/0.45||0.19/0.55/0.28||0.05/0.72/0.10|
|Brown model 100||0.30/0.72/0.42||0.56/0.52/0.54||0.42/0.66/0.51|
The results were examined from the following aspects:
When the feature dimension is set to 100 and the training corpus is ACL, the results generated by the different models were compared (AVGWVEC, PARAVEC, and AVGWVEC+BSWE for the BAS category only). In terms of F-measure, AVGWVEC performs better than PARAVEC, but PARAVEC gives better precision on several categories, such as AIM, CTR, TXT and OWN. The results showed that the PARAVEC model is not robust; for example, it performs badly on the BAS category. For specific category classification, taking the BAS category as an example, the BSWE model outperforms the others in terms of F-measure.
When the model is fixed to AVGWVEC and the training corpus is ACL, the feature size impact (300 and 100 dimensions) was investigated. From the F-measure, it can be seen that for some categories, 300-dimension features perform better than the 100-dimension ones, for example, CTR and BKG, but they are not as good as 100-dimension features for some categories, such as BAS.
When the model is set to AVGWVEC and the feature dimension is 100, the results computed from different training corpora were compared (ACL+AZ, MixedAbs and the Brown corpus). ACL+AZ outperforms the others, and the Brown corpus is better than MixedAbs for most of the categories, but the Brown corpus is not as good as MixedAbs for the category OWN.
Finally, the results of word embeddings were compared with those of cueword matching, Teufel 2002 and the baseline. To evaluate word embeddings on AZ, the AVGWVEC model trained on ACL+AZ was used for the comparison. As can be seen from Table 4, the word embedding model is better than the method using cueword matching. It also outperforms Teufel 2002 in most cases, except for AIM, BAS and OWN. It beats the baseline for most of the categories, except OWN.
The classification results showed that the type of word embeddings and the training corpus affect the AZ performance. The simplest model, AVGWVEC, performs better than the others, which indicates that averaging the word vectors in a sentence can capture the semantic properties of statements. Training specific argumentation word embeddings can improve the performance, as seen in the case of detecting BAS status using the BSWE model.
Feature dimension doesn’t dominate the results. There is no significant difference between the results generated with 300-dimensional features and those generated with 100-dimensional features.
Training corpus affects the results. The fact that ACL+AZ outperforms the others indicates that the topics of the training corpus are important factors in argumentative zoning. Although the Brown corpus has a larger vocabulary, it does not beat ACL+AZ.
In general, the classification performance of word embeddings is competitive in terms of F-measure for most of the categories. But for classifying the categories AIM, BAS and OWN, the manually crafted features proposed by Teufel et al. gave better results.
In this paper, different word embedding models were compared on the task of argumentative zoning. The results showed that word embeddings are effective for classifying sentences from scientific papers. Word embeddings trained on a relevant corpus can capture the semantic features of statements, and they are easier to obtain than hand-engineered features.
To improve the sentence classification for a specific category, integrating a category-specific word embedding strategy helps. Neither the size of the feature pool nor the vocabulary size matters much for the results. In comparison, the domain of the training corpus affects the classification performance.
T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001.
P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, “Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification,” Neurocomputing, 2015.
R. Socher, E. H. Huang, J. Pennington, C. D. Manning, and A. Y. Ng, “Dynamic pooling and unfolding recursive autoencoders for paraphrase detection,” in Advances in Neural Information Processing Systems, 2011, pp. 801–809.
J. Lilleberg, Y. Zhu, and Y. Zhang, “Support vector machines and word2vec for text classification with semantic features,” in Cognitive Informatics & Cognitive Computing (ICCI*CC), 2015 IEEE 14th International Conference on. IEEE, 2015, pp. 136–140.
International Journal of Knowledge Engineering and Soft Data Paradigms, vol. 3, no. 1, pp. 4–21, 2011.