Automatic Argumentative-Zoning Using Word2vec

by Haixia Liu, et al.

Compared with document summarization of articles from social media and newswire, argumentative zoning (AZ) is an important task in scientific paper analysis. Traditional approaches to this task rely on feature engineering at different levels. In this paper, three models for generating sentence vectors for the task of sentence classification were explored and compared. The proposed approach builds sentence representations using embeddings learned by a neural network. The learned word embeddings form a feature space onto which the examined sentence is mapped. Those features are input into classifiers for supervised classification. Using a 10-fold cross-validation scheme, evaluation was conducted on the Argumentative-Zoning (AZ) annotated articles. The results showed that simply averaging the word vectors in a sentence works better than the paragraph vector algorithm, and that integrating specific cue words into the loss function of the neural network can improve classification performance. In comparison with the hand-crafted features, the word2vec method performed better for most of the categories. However, the hand-crafted features showed their strength in classifying some of the categories.




1 Introduction

One of the crucial tasks for researchers carrying out scientific investigations is to detect existing ideas that are related to their research topics. Research ideas are usually documented in scientific publications. Normally, there is one main idea stated in the abstract, explicitly presenting the aim of the paper. There are also other sub-ideas distributed across the entire paper. As the growth rate of scientific publication has been rising dramatically, researchers are overwhelmed by the explosion of information. It is almost impossible to digest the ideas contained in the documents that emerge every day. Therefore, computer-assisted technologies such as document summarization are expected to play a role in condensing information and providing readers with more relevant short texts. Unlike document summarization for news articles, where the task is to identify centroid sentences [1] or to extract the first few sentences of the paragraphs [2], summarization of scientific articles involves an extra text processing stage [3]. After the highest-ranked texts are extracted, rhetorical status analysis is conducted on the selected sentences. Rhetorical sentence classification, also known as argumentative zoning (AZ) [4], is the process of assigning rhetorical status to the extracted sentences. The results of AZ provide readers with a general discourse context from which the scientific ideas can be better linked, compared and analyzed. For example, given a specific task, which sentences should be shown to the reader depends on the features of the sentences. For the task of identifying a paper’s unique contribution, sentences expressing the research purpose should be retrieved with higher priority. For comparing ideas, statements of comparison with other works would be more useful. Teufel et al. [3] introduced a rhetorical annotation scheme that takes into account the aspects of argumentation, metadiscourse and relatedness to other works.
Their scheme resulted in seven categories of rhetorical status, which are assigned to full sentences. Examples of human-annotated sentences with their rhetorical status are shown in Table 1 (these texts were randomly selected from the Argumentative Zoning Corpus, which is described in the dataset section). The seven categories are aim (AIM), contrast (CTR), own (OWN), background (BKG), other (OTH), basis (BAS) and textual (TXT).

Rhetorical Status    Example
AIM    This paper discusses the lexicographical concept of lexical functions Mel’cuk and Zolkovsky 1984 and their potential exploitation in the development of a machine translation lexicon designed to handle collocations.
CTR    In two of the tasks, the training data is generated by a probabilistic context-free grammar and in both tasks our algorithm outperforms the other techniques.
OWN    We have explored examples of the kinds of tree sets and string languages that this system can generate.
BKG    English has a very limited system, marking little more than plurality on nouns and a restricted range of verb properties.
OTH    For this small example, writing such an apply predicate is not difficult.
BAS    Following Pereira et al. 1993, we measure word similarity by the relative entropy, or Kullback-Leibler distance, between the corresponding conditional distributions.
TXT    The next section describes the binary representation and the length formulae derived from it in detail; readers satisfied with the intuitive descriptions presented so far should skip ahead to the Phonotactics sub-section.
Table 1: Examples of annotated sentences with their rhetorical status

Analyzing the rhetorical status of sentences manually requires a huge amount of effort, especially for structuring information from multiple documents. Fortunately, computer algorithms have been introduced to solve this problem. With the development of artificial intelligence, machine learning and computational linguistics, Natural Language Processing (NLP) has become a popular research area [5, 6]. NLP covers applications ranging from document retrieval, text categorization [7] and document summarization [8] to sentiment analysis [9, 10]. These applications target different types of text resources, such as articles from social media [11] and scientific publications [3]. There are several approaches to these tasks. From a machine learning perspective, text can be analysed via supervised [3], semi-supervised [12] and unsupervised [13] algorithms.

Document summarization for social media and news has received much attention over the past decades. These problems have been addressed from many angles, one of which is feature extraction and representation. In the early stages of document summarization, features were usually engineered manually. Although hand-crafted features have proven useful for document summarization and sentiment analysis [14, 10], there are few features that efficiently capture the semantic relations between words, phrases and sentences. Moreover, building a sufficient pool of features manually is difficult, because it requires expert knowledge and is time-consuming. Teufel et al. [3] built a feature pool of sixteen types of features to classify sentences, such as sentence position, sentence length and tense. Widyantoro et al. used content features, qualifying adjectives and meta-discourse features [15] to explore the AZ task. It took effort to engineer these features, and optimizing the combination of all the features is also time-consuming. With the advent of neural networks [16], it is possible for computers to learn feature representations automatically. Recently, the word embedding technique [17] has been widely used in the NLP community. There are plenty of cases where word embeddings and sentence representations have been applied to short text classification [18] and paraphrase detection [19]. However, the effectiveness of this technique on AZ needs further study. The research question is: is it possible to use word embeddings as features to classify sentences into the seven categories mentioned above using a supervised machine learning approach?

2 Related Work

The word2vec tool proposed by Mikolov et al. [17] has gained a lot of attention recently. With word2vec, word embeddings can be learnt from a large text corpus, and the semantic relationships between words can be measured by the cosine distances between the vectors. The idea behind word embeddings is to use a distributed representation [20] to map each word into a k-dimensional vector. How are these vectors generated by word2vec? The common method to derive the vectors is a neural probabilistic language model [21]. The underlying representations for each word are obtained while training the language model. Similar to the mechanism in the language model, Mikolov et al. [17] introduced two architectures: the Skip-gram model and the continuous bag-of-words (CBOW) model. Each model has two different training strategies: hierarchical softmax and negative sampling. Both models have three layers: input, projection and output. The word vectors are obtained once the models are optimized. Usually, this optimization is done using stochastic gradient descent. No labels are needed to train the models, which makes the word2vec algorithm more valuable compared with traditional supervised machine learning methods that require a large amount of annotated data. Given a large enough text corpus, word2vec can generate meaningful representations.
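To make the cosine-based comparison of word vectors concrete, the following is a minimal sketch in plain Python. The three-dimensional vectors and the words in `toy_vectors` are invented for illustration only and are not taken from any trained model; returning 0.0 for a zero vector is a convention assumed by this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not learned from a corpus).
toy_vectors = {
    "parser": [0.9, 0.1, 0.0],
    "grammar": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_vectors["parser"], toy_vectors["grammar"])
unrelated = cosine_similarity(toy_vectors["parser"], toy_vectors["banana"])
print(related > unrelated)
```

Cosine distance is one minus this similarity, so words with a smaller distance are treated as semantically closer.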

Word2vec has been applied to sentiment analysis [22, 23, 24] and text classification [25]. Sadeghian and Sharafat [26] explored averaging the word vectors in a sentiment review statement. Their results indicated that word2vec models significantly outperform the vanilla bag-of-words model. Amongst the word2vec-based models, softmax provides the best form of classification. Tang et al. [22] used the concatenation of vectors derived from different convolutional layers to analyze sentiment statements. They also trained sentiment-specific word embeddings to improve Twitter sentiment classification results. This work aims at learning word embeddings for the task of AZ. The results were compared from three aspects: the impact of the training corpus, the effectiveness of specific word embeddings, and different ways of constructing sentence representations based on the learned word vectors.

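Le and Mikolov [27] introduced the concept of word vector representation in a formal way:

Given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the word2vec model is to maximize the average log probability:

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}) \tag{1}$$

Using the softmax technique, the prediction can be formalized as:

$$p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \tag{2}$$

Each $y_i$ is the un-normalized log probability for output word $i$:

$$y = b + U h(w_{t-k}, \ldots, w_{t+k}; W) \tag{3}$$

where $U$ and $b$ are the softmax parameters and $h$ is constructed by concatenating or averaging the word vectors extracted from the word embedding matrix $W$.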
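As an illustration of the softmax prediction step formalized above, the sketch below computes un-normalized scores from an averaged context vector and normalizes them into a probability distribution. The names `U`, `b` and `h` follow the notation of Le and Mikolov [27]; the tiny matrices and the choice of averaging (rather than concatenation) are assumptions made for this example only.

```python
import math

def softmax(scores):
    """Normalize scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_distribution(context_vectors, U, b):
    """Return p(word | context) for every output word."""
    dim = len(context_vectors[0])
    # h: average of the context word vectors (an assumed choice).
    h = [sum(vec[d] for vec in context_vectors) / len(context_vectors)
         for d in range(dim)]
    # y_i = b_i + U_i . h for each output word i.
    y = [b_i + sum(u * x for u, x in zip(U_i, h)) for U_i, b_i in zip(U, b)]
    return softmax(y)

# Hypothetical toy parameters: 3 output words, 2-dimensional vectors.
context = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
probs = predict_distribution(context, U, b)
print(probs)
```

In practice word2vec avoids this full normalization over the vocabulary by using hierarchical softmax or negative sampling, as noted earlier in this section.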

3 Methodology

3.1 Models

In this study, sentence embeddings were learned from a large text corpus and used as features to classify sentences into seven categories in the task of AZ. Three models were explored to obtain the sentence vectors: averaging the vectors of the words in one sentence, paragraph vectors and specific word vectors.

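The first model, averaging word vectors (AVGWVEC), averages the vectors of the words in the sequence $w_1, w_2, \ldots, w_T$. The main process in this model is to learn the word embedding matrix $W$, from which the sentence vector $v_s$ is obtained as the mean of the word vectors:

$$v_s = \frac{1}{T} \sum_{i=1}^{T} v_{w_i}$$

where $v_{w_i}$ is the word embedding for word $w_i$, which is learned by the classical word2vec algorithm [17].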
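The averaging model can be sketched in a few lines of plain Python. The function name `average_word_vectors`, the toy `embeddings` dictionary, and the decision to skip out-of-vocabulary tokens (and to return a zero vector when no token is known) are assumptions of this sketch; the paper does not specify how unknown words were handled.

```python
def average_word_vectors(tokens, embeddings, dim):
    """Sentence vector = element-wise mean of the known word vectors."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # assumed fallback when no token is in the vocabulary
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

# Hypothetical 2-dimensional embeddings for illustration.
embeddings = {"we": [1.0, 0.0], "propose": [0.0, 1.0]}
sentence_vector = average_word_vectors(["we", "propose", "xyz"], embeddings, 2)
print(sentence_vector)
```

The resulting fixed-length vector is what is passed to the classifier, regardless of how many words the sentence contains.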

The second model, PARAVEC, aims at training paragraph vectors. It is also called the distributed memory model of paragraph vectors (PV-DM) [27], which is an extension of word2vec. In comparison with the word2vec framework, the only change in PV-DM is in equation (3), where $h$ is constructed from $W$ and $D$: matrix $W$ holds the word vectors and $D$ holds the paragraph vectors, in such a way that every paragraph is mapped to a unique vector represented by a column in matrix $D$.
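The change that PV-DM makes to the hidden representation can be illustrated as follows. In this sketch `W` and `D` are plain dictionaries standing in for the columns of the word and paragraph matrices, and the paragraph vector is averaged together with the context word vectors; Le and Mikolov [27] allow either averaging or concatenation, so averaging is an assumption here, as are all the toy values.

```python
def pvdm_hidden(paragraph_id, context_words, W, D):
    """Build h from the paragraph vector in D and the context word vectors in W."""
    vectors = [D[paragraph_id]] + [W[w] for w in context_words]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Hypothetical toy matrices (one paragraph, two words, 2 dimensions).
W = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
D = {0: [2.0, 2.0]}
h = pvdm_hidden(0, ["a", "b"], W, D)
print(h)
```

After training, the column of `D` belonging to a sentence serves directly as its feature vector, whereas the averaging model derives the sentence vector from `W` alone.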

The third model is constructed for the purpose of improving classification results for a certain category. In this study specifically, the optimization task was focused on identifying the category BAS (this is a general case to show how to improve the classification result by integrating cue words into the embeddings). Specific word embeddings were trained (AVGWVEC+BSWE), inspired by the Sentiment-Specific Word Embedding model of Tang et al. [22] (unified model: SSWE$_u$). After obtaining the word vectors via BSWE, the same scheme was used to average the vectors in one sentence as in the AVGWVEC model.

3.2 Classification and evaluation

The learned word embeddings are input into a classifier as features under a supervised machine learning framework. Similar to sentiment classification using word embeddings [22], where each tweet is predicted to be either positive or negative, in the task of AZ the embeddings are used to classify each sentence into one of the seven categories.

To evaluate the classification performance, precision, recall and F-measure were computed.
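A minimal sketch of the per-category evaluation is given below. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which is consistent with the reported triples: for instance, a precision of 0.29 and a recall of 0.82 give an F1 of about 0.43. The function name and the toy labels are hypothetical.

```python
def per_class_prf(gold, predicted, label):
    """Precision, recall and F1 for one category, treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold and predicted labels for four sentences.
gold = ["AIM", "OWN", "OWN", "BAS"]
predicted = ["AIM", "AIM", "OWN", "OWN"]
print(per_class_prf(gold, predicted, "AIM"))
```

Computing the scores per category, rather than a single accuracy figure, matters here because the categories are heavily imbalanced.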

4 Experimental Evaluation

4.1 Training Dataset

ACL collection. The ACL Anthology Reference Corpus contains the canonical 10,921 computational linguistics papers, from which 622,144 sentences were generated after filtering out lower-quality sentences.

MixedAbs collection. This collection contains 6,778 sentences, extracted from the titles and abstracts of publications provided by Web of Science.

4.2 Test Dataset

The Argumentative Zoning Corpus (AZ corpus) consists of 80 AZ-annotated conference articles in computational linguistics, originally drawn from the Cmplg arXiv. After concatenating sub-sentences, 7,347 labeled sentences were obtained.

4.3 Training strategy

To compare the effectiveness of the three models on the AZ task, the three models were trained on the same ACL dataset (introduced in the dataset section). The word2vec models were also trained with different parameters, such as different feature dimensions. To evaluate the impact of different domains, the first model was trained on different corpora.

The characteristics of the word embeddings based on different models and datasets are listed in Table 2.

Model and corpus        Number of features    Vocabulary size
AVGWVEC ACL+AZ 300      300                   13685
AVGWVEC ACL+AZ 100      100                   14325
PARAVEC ACL+AZ 100      100                   74261
AVGWVEC MixedAbs 100    100                   4126
AVGWVEC+BSWE 100        100                   643
Brown model             100                   56057
Table 2: Characteristics of word embeddings based on different model and dataset

4.4 Parameters

Following Sadeghian and Sharafat [26], the word2vec training parameters were set as follows: the minimum word count is 40, the number of threads run in parallel is 4, and the context window is 10.
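To show what the minimum word count and the context window control, the sketch below filters a vocabulary by frequency and enumerates (center, context) pairs within a fixed window. This is a simplified illustration in plain Python, not the word2vec implementation: the function names are hypothetical, the toy corpus uses much smaller values than the 40 and 10 stated above, and the real tool additionally randomizes the effective window size.

```python
from collections import Counter

def build_vocabulary(sentences, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def context_pairs(sentence, vocabulary, window):
    """List (center, context) pairs within `window` positions on each side."""
    kept = [t for t in sentence if t in vocabulary]  # rare words are dropped first
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

# Hypothetical two-sentence corpus.
corpus = [["we", "use", "parsing"], ["we", "use", "tagging"]]
vocab = build_vocabulary(corpus, min_count=2)
pairs = context_pairs(corpus[0], vocab, window=1)
print(sorted(vocab), pairs)
```

A higher minimum count shrinks the vocabulary, and a wider window lets each word be predicted from more distant neighbours. The number of threads only affects training speed.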

4.5 Strategy of dealing with unbalanced data

In imbalanced data sets, some classes are significantly outnumbered by other classes [28], which affects the classification results. In this experiment, the test dataset is imbalanced. Table 3 shows the distribution of rhetorical categories in the test dataset. The categories OWN and OTH significantly outnumber the other categories.

Category    Number of sentences    Proportion
OWN         4868                   0.54
OTH         1927                   0.21
BKG         644                    0.07
BAS         641                    0.07
CTR         451                    0.05
AIM         303                    0.03
TXT         191                    0.02
Table 3: Distribution of rhetorical categories

To deal with the problem of classification on unbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) [29] was applied to the original dataset. A 10-fold cross-validation scheme was adopted and the results were averaged over the 10 iterations.
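The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions: the function name `smote_sketch`, the default `k`, the seed handling and the toy points are all hypothetical, and this is not the implementation used in the experiments.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical 2-dimensional minority samples (at least two are required).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority, n_new=10, k=2)
print(len(new_points))
```

Because each synthetic point lies on a line segment between two real minority samples, the over-sampled class stays within the region already occupied by that class in the feature space.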

4.6 Classification results per category

Tables 4 and 5 show the classification performance of the different methods. Note that the results are not completely comparable with those of Teufel 2002, since the dataset is different due to the sentence concatenation in this paper. However, Teufel’s reported results can serve as a reference.

Model and corpus        AIM               CTR                 BKG               BAS
AVGWVEC ACL+AZ 300      0.29/0.82/0.43    0.34/0.75/0.47      0.36/0.72/0.48    0.10/0.72/0.17
AVGWVEC ACL+AZ 100      0.29/0.85/0.43    0.29/0.80/0.42      0.36/0.68/0.47    0.11/0.87/0.20
PARAVEC ACL+AZ 100      0.60/0.03/0.06    0.20/0.004/0.009    0.39/0.02/0.04    0.00/0.00/0.00
AVGWVEC MixedAbs 100    0.11/0.73/0.19    0.11/0.71/0.20      0.14/0.62/0.23    0.04/0.65/0.08
Brown model 100         0.19/0.73/0.30    0.38/0.56/0.45      0.19/0.55/0.28    0.05/0.72/0.10
AVGWVEC+BSWE 100        -                 -                   -                 0.14/0.63/0.23
Cuewords                0.13/0.55/0.21    0.33/0.20/0.25      -                 0.08/0.36/0.13
Teufel 2002             0.44/0.65/0.52    0.34/0.20/0.26      0.40/0.50/0.45    0.37/0.40/0.38
Baseline                0.30/0.07/0.11    0.31/0.12/0.17      0.32/0.17/0.22    0.15/0.05/0.07
Table 4: Performance of sentence classification per category I (precision/recall/F-measure)

Model and corpus        TXT               OWN               OTH
AVGWVEC ACL+AZ 300      0.51/0.87/0.64    0.61/0.71/0.65    0.49/0.65/0.56
AVGWVEC ACL+AZ 100      0.47/0.88/0.61    0.59/0.68/0.63    0.49/0.69/0.57
PARAVEC ACL+AZ 100      0.52/0.11/0.18    0.62/0.98/0.76    0.35/0.004/0.009
AVGWVEC MixedAbs 100    0.15/0.75/0.25    0.72/0.56/0.63    0.21/0.61/0.31
Brown model 100         0.30/0.72/0.42    0.56/0.52/0.54    0.42/0.66/0.51
Teufel 2002             0.57/0.66/0.61    0.84/0.88/0.86    0.52/0.39/0.44
Baseline                0.56/0.15/0.23    0.78/0.90/0.83    0.47/0.42/0.44
Table 5: Performance of sentence classification per category II (precision/recall/F-measure)

The results were examined from the following aspects:

When the feature dimension is set to 100 and the training corpus is ACL, the results generated by different models were compared (AVGWVEC, PARAVEC and AVGWVEC+BSWE for the BAS category only). In terms of F-measure, AVGWVEC performs better than PARAVEC, but PARAVEC gives better precision on several categories, such as AIM, CTR, TXT and OWN. The results showed that the PARAVEC model is not robust; for example, it performs badly on the BAS category. For specific category classification, taking the BAS category as an example, the BSWE model outperforms the others in terms of F-measure.

When the model is fixed to AVGWVEC and the training corpus is ACL, the feature size impact (300 and 100 dimensions) was investigated. From the F-measure, it can be seen that for some categories, 300-dimension features perform better than the 100-dimension ones, for example, CTR and BKG, but they are not as good as 100-dimension features for some categories, such as BAS.

When the model is set to AVGWVEC and the feature dimension is 100, the results computed from different training corpora were compared (ACL+AZ, MixedAbs and Brown corpus). ACL+AZ outperforms the others, and the Brown corpus is better than MixedAbs for most of the categories, but the Brown corpus is not as good as MixedAbs for the OWN category.

Finally, the results were compared between word embeddings and the methods of cuewords, Teufel 2002 and the baseline. To evaluate word embeddings on AZ, the AVGWVEC model trained on ACL+AZ was used for the comparison. As can be seen from Table 4, the word embedding model is better than the method using cueword matching. It also outperforms Teufel 2002 in most cases, except AIM, BAS and OWN. It outperforms the baseline for most of the categories, except OWN.

5 Discussion

The classification results showed that the type of word embeddings and the training corpus affect the AZ performance. Although it is the simplest model, AVGWVEC performs better than the others, which indicates that averaging the word vectors in a sentence can capture the semantic properties of statements. By training specific argumentation word embeddings, the performance can be improved, as can be seen from the case of detecting BAS status using the BSWE model.

Feature dimension doesn’t dominate the results. There is no significant difference between the results generated by 300-dimension features and those generated by 100-dimension features.

Training corpus affects the results. The fact that ACL+AZ outperforms the others indicates that the topics of the training corpus are important factors in argumentative zoning. Although the Brown corpus has a larger vocabulary, it does not outperform ACL+AZ.

In general, the classification performance of word embeddings is competitive in terms of F-measure for most of the categories. But for classifying the categories AIM, BAS and OWN, the manually crafted features proposed by Teufel et al. [3] gave better results.

6 Conclusion

In this paper, different word embedding models were compared on the task of argumentative zoning. The results showed that word embeddings are effective for sentence classification in scientific papers. Word embeddings trained on a relevant corpus can capture the semantic features of statements, and they are easier to obtain than hand-engineered features.

To improve sentence classification for a specific category, integrating a category-specific word embedding strategy helps. The size of the feature pool does not matter much for the results, nor does the vocabulary size. In comparison, the domain of the training corpus affects the classification performance.


  • [1] D. R. Radev, H. Jing, and M. Budzikowska, “Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies,” in Proceedings of the 2000 NAACL-ANLP Workshop on Automatic summarization.   Association for Computational Linguistics, 2000, pp. 21–30.
  • [2] C.-Y. Lin and E. Hovy, “Identifying topics by position,” in Proceedings of the fifth conference on Applied natural language processing.   Association for Computational Linguistics, 1997, pp. 283–290.
  • [3] S. Teufel and M. Moens, “Summarizing scientific articles: experiments with relevance and rhetorical status,” Computational linguistics, vol. 28, no. 4, pp. 409–445, 2002.
  • [4] S. Teufel et al., “Argumentative zoning: Information extraction from scientific text,” Ph.D. dissertation, Citeseer, 2000.
  • [5] C. D. Manning and H. Schütze, Foundations of statistical natural language processing.   MIT press, 1999.
  • [6] J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, vol. 349, no. 6245, pp. 261–266, 2015.
  • [7] P. Jackson and I. Moulinier, Natural language processing for online applications: Text retrieval, extraction and categorization.   John Benjamins Publishing, 2007, vol. 5.
  • [8] Z. Cao, F. Wei, L. Dong, S. Li, and M. Zhou, “Ranking with recursive neural networks and its application to multi-document summarization,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [9] T. Nasukawa and J. Yi, “Sentiment analysis: Capturing favorability using natural language processing,” in Proceedings of the 2nd international conference on Knowledge capture.   ACM, 2003, pp. 70–77.
  • [10] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and trends in information retrieval, vol. 2, no. 1-2, pp. 1–135, 2008.
  • [11] S. Asur, B. Huberman et al., “Predicting the future with social media,” in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1.   IEEE, 2010, pp. 492–499.
  • [12] V. Sindhwani and P. Melville, “Document-word co-regularization for semi-supervised sentiment analysis,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on.   IEEE, 2008, pp. 1025–1030.
  • [13] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine learning, vol. 42, no. 1-2, pp. 177–196, 2001.
  • [14] D. Das and A. F. Martins, “A survey on automatic text summarization,” Literature Survey for the Language and Statistics II course at CMU, vol. 4, pp. 192–195, 2007.
  • [15] D. H. Widyantoro, M. L. Khodra, B. Riyanto, and A. Aziz, “A multiclass-based classification strategy for rhetorical sentence categorization from scientific papers,” FormaMente: Rivista internazionale di ricerca sul futuro digitale, no. 3-2014, p. 223, 2015.
  • [16] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [17] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [18] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, “Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification,” Neurocomputing, 2015.
  • [19] R. Socher, E. H. Huang, J. Pennington, C. D. Manning, and A. Y. Ng, “Dynamic pooling and unfolding recursive autoencoders for paraphrase detection,” in Advances in Neural Information Processing Systems, 2011, pp. 801–809.
  • [20] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart, “Distributed representations,” 1986.
  • [21] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” The Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
  • [22] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, “Learning sentiment-specific word embedding for twitter sentiment classification,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2014, pp. 1555–1565.
  • [23] B. Xue, C. Fu, and Z. Shaobin, “A study on sentiment computing and classification of sina weibo with word2vec,” in Big Data (BigData Congress), 2014 IEEE International Congress on.   IEEE, 2014, pp. 358–363.
  • [24] D. Zhang, H. Xu, Z. Su, and Y. Xu, “Chinese comments sentiment classification based on word2vec and svm perf,” Expert Systems with Applications, vol. 42, no. 4, pp. 1857–1863, 2015.
  • [25] J. Lilleberg, Y. Zhu, and Y. Zhang, “Support vector machines and word2vec for text classification with semantic features,” in Cognitive Informatics & Cognitive Computing (ICCI* CC), 2015 IEEE 14th International Conference on.   IEEE, 2015, pp. 136–140.
  • [26] A. Sadeghian and A. R. Sharafat, “Bag of words meets bags of popcorn.”
  • [27] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” arXiv preprint arXiv:1405.4053, 2014.
  • [28] H. M. Nguyen, E. W. Cooper, and K. Kamei, “Borderline over-sampling for imbalanced data classification,” International Journal of Knowledge Engineering and Soft Data Paradigms, vol. 3, no. 1, pp. 4–21, 2011.
  • [29] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, pp. 321–357, 2002.