The number of papers published each year has grown greatly. For example, as shown in Figure 1, the number of new papers on IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp) has increased sharply over the past decade.
The paper boom in academic fields causes severe problems. Cortes and Lawrence (Cortes and Lawrence, 2021) examine the 2014 NeurIPS review experiment and find that peer review could not reliably pick out excellent research, although it could identify terrible papers. Chu and Evans (Chu and Evans, 2021) reveal that publishing too many papers each year in a field hinders its development, for two reasons. First, researchers are busy coping with the flood of papers and do not have enough time to fully absorb novel ideas; second, sustained attention on a promising idea can be broken up by the deluge of new ones.
One reason for the sharp increase in papers is that evaluation metrics for researchers and scholars focus on the number of papers. From scientific output and research funding to the evaluation of professional rank, papers play a very important role, and the more papers, the better. However, it is time to make changes. Quantitative metrics cannot evaluate the real academic impact of a scholar or a paper: they ignore the essential differences between citations, which is a fatal flaw. Seglen expresses strong opposition to impact factors that measure the academic influence of journals, because committees seldom have the specialist insight to assess primary research (Seglen, 1997).
We propose Phocus, a novel evaluation mechanism for scholars and publications. Phocus analyzes the sentence containing a citation and its context to predict the sentiment polarity towards the corresponding reference. Besides, Phocus also considers the total number of citations, the number of citations per sentence, author overlap, and the number of references, similar to (Valenzuela et al., 2015). Given the factors above, Phocus uses a Naive Bayes classifier to divide citations coarsely into 4 categories and utilizes the LambdaMART model to rank all references within a paper. Combining the categories and the ranking results, every reference gets a local influential factor relative to the citing paper. The global influential factor of the reference with respect to the citing paper is the product of the local influential factor and the total influential factor of the citing paper. Consequently, an author's academic influential factor is the sum of his contributions to each paper he co-authors.
2. Related Work
Our work involves citation classification, aspect-based sentiment analysis, ranking models, and evaluation metrics for academics, which are introduced in the subsections below.
2.1. Citation Classification
Many studies have already focused on citation classification. For example, Teufel et al. (Teufel et al., 2006) classify citation intents into 12 classes, using simple regular-expression matching to extract features. Valenzuela et al. (Valenzuela et al., 2015) divide citations into 4 classes: highly influential, background, method, and results citations, using an SVM with an RBF kernel and random forests, taking 13 features into consideration, including: total number of direct citations, number of direct citations per section, total number of indirect citations, number of indirect citations per section, author overlap, whether the citation is considered helpful, whether the citation appears in a table or caption, the similarity between abstracts, PageRank (Page et al., 1999), number of total citing papers after transitive closure, and the field of the cited paper. Jurgens et al. (Jurgens et al., 2016) define 7 classes of citation intents: background, motivation, uses, extension, continuation, comparison or contrast, and future, with a Random Forest classifier trained using 4 types of features: structural features; lexical, morphological, and grammatical features; field; and usage. Cohan et al. (Cohan et al., 2019) propose a multitask model using BiLSTM and an attention mechanism, in which classifying citation intents is the primary task, while predicting the section where the citation occurs and whether a sentence needs a citation are auxiliary tasks used to assist the primary one (https://github.com/allenai/scicite). They categorize intents into 3 classes: background information, method, and result comparison. Besides, Cohan et al. build a citation intent dataset, SciCite. These works classify citations according to intent but ignore the sentiment of the citing paper towards its references, which is vital.
Butt et al. (Butt et al., 2015) utilize a Naive Bayes classifier to predict the sentiment polarity of a sentence containing a citation and its context, whereas Liu (Liu, 2017) uses averaged word embeddings to represent sentence vectors and to classify sentiment polarities. However, the latter method yields the overall sentiment of the text rather than the precise sentiment towards the cited paper, so it cannot be applied directly.
2.2. Aspect-based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) is the task of identifying the sentiment expressed towards a particular aspect, which is exactly what citation sentiment requires. Usually, ABSA consists of two stages: locating aspects and analyzing sentiment. Some works solve the problem in such a two-stage way, while others do so jointly.
To detect citation spans in Wikipedia, Fetahu et al. (Fetahu et al., 2017) propose a sequence classification method using a linear-chain CRF to decide which text fragments are covered by a citation at the sub-sentence level, whereas Kaplan et al. (Kaplan et al., 2016) detect non-explicit citing sentences that surround an explicit citing sentence, utilizing relational, entity, lexical, and grammatical coherence between them. (Ma et al., 2018) and (Zerva et al., 2020) even try to find the sentences in the reference paper most relevant to the citing sentences. Qazvinian and Radev (Qazvinian and Radev, 2010) propose a method based on probabilistic inference to extract non-explicit citing sentences, modelling the sentences in an article and their lexical similarities as a Markov Random Field tuned to detect the patterns that context data create, and employing a Belief Propagation mechanism to detect likely context sentences. Abu-Jbara and Radev (Abu-Jbara and Radev, 2012) determine the citation block by first segmenting the sentences and then classifying each word in the sentence as being inside or outside the citation block; finally, they aggregate the labels of all the words contained in a segment to assign a label to the whole segment, using three different label aggregation rules (majority label of the words, at least one of the words, or all of them). Kaplan et al. (Kaplan et al., 2009) propose a coreference-chain-based method for extracting citation blocks from research papers.
Given aspects, Sun et al. (Sun et al., 2019) construct an auxiliary sentence from an aspect and feed the sentence pair into a BERT-based model. Gao et al. (Gao et al., 2019) utilize three target-dependent variations of the BERT model. Bai et al. (Bai et al., 2021) propose a novel relational graph attention network (https://github.com/muyeby/RGAT-ABSA), which integrates typed syntactic dependency information.
Since errors accumulate in a pipeline, some researchers explore solutions that detect aspects and classify sentiment jointly. Wang et al. (Wang et al., 2010) propose the latent aspect rating analysis problem, which aims at analyzing reviewers' latent opinions on an entity from several aspects. For a given entity, they define a set of aspect keywords and segment reviews to the aspect level. Given the aspect segmentation results, they use a novel latent rating regression model to calculate aspect ratings and corresponding weights. However, Wang et al. ignore the inter-dependencies between words and sentences, which causes great information loss. Ruder et al. (Ruder et al., 2016) propose a hierarchical bidirectional LSTM to model the inter-dependencies of sentences within a review; the aspect is represented by the average of its entity and attribute embeddings. Hoang et al. (Hoang et al., 2019) propose to use a sentence-pair classifier model from BERT (Devlin et al., 2019) to solve ABSA at the sentence and text levels. Hu et al. (Hu et al., 2019) propose a span-based extract-then-classify framework based on BERT (https://github.com/huminghao16/SpanABSA). Xu et al. (Xu et al., 2019) build a dataset, ReviewRC (https://howardhsu.github.io/dataset/), and extend BERT with an extra task-specific layer tuned for each task. Wallaart and Frasincar (Wallaart and Frasincar, 2019) propose a two-stage algorithm to solve ABSA for restaurant reviews: predicting the sentiment with a lexicalized domain ontology, and using a neural network with a rotatory attention mechanism (LCR-Rot) as a backup algorithm; the order of the rotatory attention operations is changed, and the mechanism is iterated multiple times. Trusca et al. (Trusca et al., 2020) extend (Wallaart and Frasincar, 2019) with deep contextual word embeddings and add an extra attention layer to its high-level representations. To address the imbalance issue and utilize the interaction between aspect terms, Luo et al. (Luo et al., 2020) propose a gradient harmonized and cascaded labelling model based on BERT. Chen et al. (Chen et al., 2020) utilize directional graph convolutional networks to perform the end-to-end ABSA task.
2.3. Ranking Model
Our ranking model is based on LambdaMART, the boosted-tree version of LambdaRank (Burges et al., 2006), which addresses the problem of defining gradients for the non-smooth cost functions used in ranking models. Burges (Burges, 2010) gives a review of RankNet, LambdaRank, and LambdaMART.
To illustrate the ranking network, we use c_{ij} to denote the j-th citation of the i-th reference paper. Our ranking network receives a matrix with one row per citation and 4 columns, where the 4 columns correspond to the feature quaternion (au_overlap, n_cit, cit_word, sen_label); among these, cit_word is the total number of words in c_{ij}. The network calculates a score for each citation individually, averaging over duplicate citations to get the score s_i of each reference paper r_i. Then s_i is used to rank all the reference papers.
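The grouping-and-averaging step can be sketched as follows. This is a minimal illustration: `score_citation` is a hypothetical stand-in for the trained LambdaMART network, with made-up weights, and only the aggregation logic reflects the description above.

```python
from collections import defaultdict

def score_citation(au_overlap, n_cit, cit_word, sen_label):
    # Hypothetical stand-in for the trained ranking network: any
    # function mapping the feature quaternion to a real-valued score.
    return 0.4 * au_overlap + 0.3 * n_cit + 0.1 * cit_word + 0.2 * sen_label

def rank_references(citations):
    """citations: list of (ref_id, feature_quaternion) pairs,
    one entry per individual citation occurrence."""
    per_ref = defaultdict(list)
    for ref_id, features in citations:
        per_ref[ref_id].append(score_citation(*features))
    # Average over duplicate citations of the same reference,
    # then sort references by the averaged score, best first.
    avg = {ref: sum(s) / len(s) for ref, s in per_ref.items()}
    return sorted(avg, key=avg.get, reverse=True)
```

Duplicate citations of the same reference (here, two occurrences of "r1") are averaged before ranking, as described above.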
2.4. Evaluation Metrics
In the academic field, there are journal-level, author-level, and paper-level metrics for measuring impact.
The Impact Factor (IF) (Milstead, 1980) and CiteScore (https://service.elsevier.com/app/answers/detail/a_id/14880/supporthub/scopus/) measure the impact of a journal based on the number of times its articles are cited during a fixed period. Besides, Journal Citation Reports (JCR) rank journals (https://jcr.clarivate.com/jcr/home), Eigenfactor scores (Bergstrom, 2007) measure how likely a journal is to be used, and the SCImago Journal Rank (SJR) (Gonzalez-Pereira et al., 2009) regards citations issued by more important journals as more important than those issued by less important ones. The Source Normalized Impact per Paper (SNIP) (Moed, 2010) treats a single citation as more important in subject areas where citations are scarcer, and vice versa.
Author-level metrics include the h-index, g-index, i10-index, and so on. The h-index, proposed by Jorge E. Hirsch (Hirsch, 2005), is defined as the largest number h such that h of the author's papers each have at least h citations. The g-index is defined as the largest number g such that the top g articles together received at least g^2 citations (Egghe, 2006). Google Scholar proposes the i10-index, the number of publications with at least 10 citations. These metrics are derived from citation counts and do not reveal the truth behind the citations.
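These definitions translate directly into code. The following self-contained sketch takes `citations`, a list of per-paper citation counts for one author, and computes the three indices:

```python
def h_index(citations):
    # Largest h such that at least h papers have >= h citations each.
    cits = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(cits, start=1) if c >= i)

def g_index(citations):
    # Largest g such that the top g papers together have >= g^2 citations.
    cits = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cits, start=1):
        total += c
        if total >= i * i:
            g = i
    return g

def i10_index(citations):
    # Number of papers with at least 10 citations.
    return sum(1 for c in citations if c >= 10)
```

For example, an author with citation counts [10, 8, 5, 4, 3] has h-index 4, g-index 5, and i10-index 1, which illustrates how the three metrics weigh the same citation record differently.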
Paper-level metrics are usually citation counts. Notably, Semantic Scholar made a first step towards citation classification: it divides citations into 4 classes: highly influential, background, method, and results citations (Valenzuela et al., 2015), using an SVM with an RBF kernel and random forests, based on the same features described in Section 2.1.
3. Method

As shown in Figure 2, our algorithm consists of 4 stages: pre-processing, calculating factors, evaluating contribution, and propagating influential factors. In the pre-processing stage, we clean the raw data and obtain simple factors. Complex factors, like sentiment polarity, are calculated in the second stage. Once all the needed factors are obtained, we classify citations into four classes, rank all references, and compute the local contribution factor of each reference. We initialize every new paper added to the database with an academic influential factor of 1.0 and propagate its impact to its references iteratively. The factors extracted from papers are listed in Table 1.
3.1. Pre-processing

Given a paper in string format, a series of steps processes the raw data for the next stage: parsing, segmentation, and matching. Parsing divides the input text into title, authors, sections, and references. We utilize flair (https://pypi.org/project/flair/) to parse the title, authors, and publication year of the input paper and its references. We segment the input paper at two levels: section level and sentence level. Section segmentation is based on keyword matching, and sections are classified into three categories: 0 representing related work, introduction, or other background citation; 1 representing the main body, including methodology, experiments, and so on; 2 representing the conclusion and other parts. Sentences are segmented using regular-expression matching and are then labelled with IDs according to their order of appearance. Reference parsing yields the title, authors, publication year, and citation markers of each reference. Given that information, we locate citations in each sentence and match citation markers with their corresponding reference papers. Then we can easily obtain the factors n_cit and cit_text. The factor au_overlap is calculated according to the following equation:
au_overlap = |A ∩ B| / |A ∪ B|,

where A is the author set of the citing paper and B is the author set of the reference paper.
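In code, this overlap is a one-liner over the two author sets. The sketch below assumes the Jaccard form of the overlap (intersection over union of the author sets); the guard against two empty author lists is a defensive addition:

```python
def au_overlap(citing_authors, reference_authors):
    """Author overlap between a citing paper and a reference paper,
    computed as Jaccard similarity of the two author sets:
    |A intersect B| / |A union B|."""
    a, b = set(citing_authors), set(reference_authors)
    if not (a | b):
        return 0.0  # both author lists empty: define overlap as 0
    return len(a & b) / len(a | b)
```

For example, papers authored by {X, Y} and {Y, Z} share one of three distinct authors, giving an overlap of 1/3.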
3.2. Calculating Factors
There are still three unsolved factors: context_a, context_b, and sen_label. We obtain context_a and context_b with BERT, and propose a novel aspect-based sentiment analysis algorithm to classify citation sentiment.
We fine-tune BERT on a manually annotated dataset containing over 1,000 sentence pairs labelled as "related" or "irrelevant"; each sentence pair is generated from a single academic paper. We reach an accuracy of 94.5% on the evaluation dataset. To obtain the citation context, we apply the above classifier iteratively to the sentence pairs (s_t, s_{t-k}), where s_t is the citing sentence in the list of all sentences of the paper and k increases from 1. Once an "irrelevant" pair is reported, the iteration is aborted and we take the accepted preceding sentences as context_a. Another stopping criterion is that s_{t-k} should always be in the same paragraph as s_t. A similar procedure is performed on the pairs (s_t, s_{t+k}) to get context_b.
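The backward expansion loop can be sketched as follows. Here `is_related` is a stand-in for the fine-tuned BERT sentence-pair classifier, and `paragraph_of` (mapping a sentence index to its paragraph id) is an assumed helper structure; both are hypothetical names, not part of the described system:

```python
def expand_context(sentences, paragraph_of, t, is_related):
    """Collect context_a for the citing sentence sentences[t]:
    walk backwards (k = 1, 2, ...) while the pair classifier reports
    "related" and the candidate stays in the same paragraph as s_t."""
    context_a = []
    k = 1
    while (t - k >= 0
           and paragraph_of[t - k] == paragraph_of[t]
           and is_related(sentences[t], sentences[t - k])):
        context_a.insert(0, sentences[t - k])  # keep original order
        k += 1
    return context_a
```

The forward pass for context_b is symmetric, walking over (s_t, s_{t+k}) instead.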
3.3. Evaluating Contribution
After gathering all the needed factors, we train a classifier to categorize citations into 4 classes: very important, important, neutral, and terrible. We also train a ranking model to predict the relative order of references in terms of their contributions to the paper.
In Table 2, label 2 corresponds to using the work, and label 0 to negative sentiment towards the work.
First, we classify citations into the four categories with a Naive Bayes classifier. The classifying standards are shown in Table 2; a larger label represents a greater contribution.
Second, we rank the references with the LambdaMART-based ranking model described in Section 2.3.
Based on the classes and the order of references, we project them into a bounded interval to get their local influential factors.
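The exact projection is not spelled out above; one simple possibility, shown purely for illustration, combines the class label and the normalized rank position into a score in [0, 1]. Both the equal weighting and the linear rank term are assumptions of this sketch, not the system's actual formula:

```python
def local_factor(label, rank, n_refs, n_classes=4):
    """Hypothetical projection of (class label, rank) onto [0, 1]:
    the label (0..n_classes-1) sets the coarse level, and the rank
    (1 = most contributing) refines it. Illustrative choice only."""
    class_part = label / (n_classes - 1)               # 0, 1/3, 2/3, 1
    rank_part = 1.0 - (rank - 1) / max(n_refs - 1, 1)  # best rank -> 1
    return 0.5 * class_part + 0.5 * rank_part
```

Under this choice, a reference labelled 3 and ranked first among 10 references gets factor 1.0, while one labelled 0 and ranked last gets 0.0.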
3.4. Propagating Influential Factors
Given a list of references and their influential factors with respect to the citing paper, we design rules to propagate their influence. The main idea is shown in Figure 3.
Let A denote a citing paper with academic influential factor F(A) initialized as 1, let R = {r_1, ..., r_n} denote the set of all references of A, and let c_i be the local contribution of reference r_i to A. Let P be the set of all papers that cite A and, for p ∈ P, let c_{A,p} be A's local contribution to p. Then the academic influential factor of A is:

F(A) = 1 + Σ_{p ∈ P} c_{A,p} · F(p).
For an author who publishes a set of papers {P_1, ..., P_m}, with contribution w_j to paper P_j, his academic influential factor is:

F(author) = Σ_j w_j · F(P_j).
For a paper and its authors, the contributions are defined analogously. There are two properties to verify to ensure that our method is sound: the first concerns marginal effects, and the second the propagation rules.
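Under the stated rules (each paper starts at 1.0 and receives, from every citing paper, the product of that paper's factor and the local contribution), the propagation can be sketched as a fixed-point iteration over the citation graph. The "1 +" base term and the fixed iteration count are this sketch's reading of the rules, not a verified reproduction:

```python
def propagate(citers, contrib, iters=50):
    """citers: paper -> list of papers that cite it.
    contrib[(p, q)]: local contribution of paper p to citing paper q.
    Every factor starts at 1.0 and is updated until stable (citation
    graphs are acyclic, so the iteration converges)."""
    papers = set(citers)
    f = {p: 1.0 for p in papers}
    for _ in range(iters):
        f = {p: 1.0 + sum(contrib[(p, q)] * f[q] for q in citers[p])
             for p in papers}
    return f

def author_factor(f, weights):
    # Author's factor: sum over papers of contribution * paper factor.
    return sum(w * f[p] for p, w in weights.items())
```

For a toy pool where B cites A with local contribution 0.5, B keeps factor 1.0 and A ends at 1.5; an author who fully contributed to A then has factor 1.5.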
4. Experiments

We conduct several experiments to demonstrate our new metrics, which measure the influential factors of an individual scientist or scholar and the citation impact of publications.
As the influential factor of a paper is the weighted sum, over all papers that cite it, of their factors and its corresponding contributions to them, the final, full citation network of papers should be constructed. However, we cannot yet complete this job, owing to a lack of access to some databases and limited time and computational resources. We therefore select some scholars and their publications as targets and utilize primary and secondary citation relationships. Besides, we compare our modules with other state-of-the-art algorithms to show the improvement we achieve.
4.1. Peer Comparison
Scholars and their publications. Let scholar Y denote some scholar. We show the difference between scholar Y and the Turing Award winner Pat Hanrahan (https://scholar.google.com/citations?hl=zh-CN&user=RzEnQmgAAAAJ). As we emphasize, the claim that Pat Hanrahan is much more influential than scholar Y rests not only on his Turing Award but also on solid citation statistics. For example, He et al. (He et al., 2015) take one paper of scholar Y as a baseline, and it outperforms only one of eleven baselines. Table 4 shows the evaluation results for scholar Y and Pat Hanrahan on Aminer (https://www.aminer.cn/), Google Scholar (https://scholar.google.com/), Semantic Scholar (https://www.semanticscholar.org/), and Phocus.
Table 3 columns: Scholar | Aminer | Google Scholar | Semantic Scholar.
Table 3 lists the numbers of publications and citations of scholar Y and Pat Hanrahan. Obviously, scholar Y is more productive than Pat Hanrahan. However, those numbers cover up significant truths: not all papers are equally influential, and not all citations imply agreement with the cited work.
Table 4 columns: Scholar | Aminer | Google Scholar | Semantic Scholar | Phocus (Primary).
Here h represents the h-index, g the g-index, i10 the i10-index, and HIC the number of highly influential citations as reported by Semantic Scholar; these metrics are defined in Section 2.4.
We collect XX papers that cite scholar Y out of 78663, and XX papers that cite Patrick Hanrahan out of 56383. Utilizing only primary citations, we find that the global academic influential factors of scholar Y and Patrick Hanrahan are 0.40 and 0.52, respectively. Figure 4
4.2. Mathematical Invariance
To verify the model, we conduct a series of experiments to prove that it is reasonable.
First, given the set of references within a paper, removing any one reference from the set does not change the relative order of the remaining references; this holds whichever reference is removed.
Also, the final score should be stable and insensitive to the propagation order for a given paper pool. Our strategy starts from a default influential factor of 1.0, traverses each paper, and updates the influential factors successively. Experiments show that, regardless of the updating order, the final score of each paper remains the same.
4.3. Citation Span
We conduct experiments with (Abu-Jbara and Radev, 2012) as our baseline. We annotate the citation spans of about 345 citing sentences as our dataset to train and test the baseline model.
First, we use the tokenizer that spaCy (https://spacy.io/) provides to segment the text of each citing sentence into tokens, and use its tagger and parser to assign part-of-speech tags and dependency labels to each token.
Then, we extract the features listed in Table 5 as the input of the baseline model. Training is performed using SVM, Logistic Regression, and CRF, respectively, with 10-fold cross-validation for training and testing.
Table 6 lists the precision, recall, and F1 for the three models.
As shown in Table 4, Phocus finds that the global academic influential factors of scholar Y and Patrick Hanrahan are 0.40 and 0.52, respectively; Patrick Hanrahan's is 30% higher than scholar Y's. These are the results using only primary citation data. In contrast, the evaluation results from Aminer, Google Scholar, and even Semantic Scholar suggest that scholar Y is more productive and influential than Patrick Hanrahan.
5. Conclusion

In this paper, we propose Phocus, a novel set of academic evaluation metrics for authors and publications based on citation judgements that utilize aspect-based sentiment analysis. To verify our evaluation mechanism, we conduct peer comparisons and ablation studies. The results show that our metrics are able to identify the true worthiness of a paper or a scholar, which is difficult for citation-count-based metrics such as the h-index and g-index.
Phocus still needs improvement. As shown in the experiments, we only use primary citation data, which is not enough to fully establish the reliability of Phocus; using more data, such as secondary and tertiary citations, could further reveal the gaps between scholars and between metrics. Many problems remain unsolved, such as "citation circles" (groups of researchers who cite one another's work) and self-citation.
- Abu-Jbara and Radev (2012). Reference scope identification in citing sentences. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 80–90.
- Bai et al. (2021). Investigating typed syntactic dependencies for targeted sentiment classification using graph attention neural network. IEEE/ACM Transactions on Audio, Speech and Language Processing 29, pp. 503–514.
- Bergstrom (2007). Eigenfactor: measuring the value and prestige of scholarly journals. College & Research Libraries News 68, pp. 314–316.
- Burges et al. (2006). Learning to rank with nonsmooth cost functions. In NIPS.
- Burges (2010). From RankNet to LambdaRank to LambdaMART: an overview.
- Butt et al. (2015). Classification of research citations (CRC). In CLBib@ISSI.
- Chen et al. (2020). Joint aspect extraction and sentiment analysis with directional graph convolutional networks. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 272–279.
- Chu and Evans (2021). Slowed canonical progress in large fields of science. Proceedings of the National Academy of Sciences of the United States of America 118.
- Cohan et al. (2019). Structural scaffolds for citation intent classification in scientific publications. arXiv:1904.01608.
- Cortes and Lawrence (2021). Inconsistency in conference peer review: revisiting the 2014 NeurIPS experiment. arXiv:2109.09774.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- Egghe (2006). Theory and practise of the g-index. Scientometrics 69, pp. 131–152.
- Fetahu et al. (2017). Fine grained citation span for references in Wikipedia. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1990–1999.
- Gao et al. (2019). Target-dependent sentiment classification with BERT. IEEE Access 7, pp. 154290–154299.
- Gonzalez-Pereira et al. (2009). The SJR indicator: a new indicator of journals' scientific prestige.
- He et al. (2015). Deep residual learning for image recognition.
- Hirsch (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America 102, pp. 16569–16572.
- Hoang et al. (2019). Aspect-based sentiment analysis using BERT. In NODALIDA.
- Hu et al. (2019). Open-domain targeted sentiment analysis via span-based extraction and classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 537–546.
- Jurgens et al. (2016). Citation classification for behavioral analysis of a scientific field. arXiv:1609.00435.
- Kaplan et al. (2009). Automatic extraction of citation contexts for research paper summarization: a coreference-chain based approach. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL), pp. 88–95.
- Kaplan et al. (2016). Citation block determination using textual coherence. Journal of Information Processing 24(3), pp. 540–553.
- Liu (2017). Sentiment analysis of citations using word2vec. arXiv:1704.00177.
- Luo et al. (2020). GRACE: gradient harmonized and cascaded labeling for aspect-based sentiment analysis. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 54–64.
- Ma et al. (2018). Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset. Scientometrics 116(2), pp. 1303–1330.
- Milstead (1980). Review of Citation Indexing: Its Theory and Application in Science, Technology and Humanities (Wiley, 1979). Information Processing and Management 16.
- Moed (2010). Measuring contextual citation impact of scientific journals. Journal of Informetrics 4(3), pp. 265–277.
- Page et al. (1999). The PageRank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab.
- Qazvinian and Radev (2010). Identifying non-explicit citing sentences for citation-based summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 555–564.
- Ruder et al. (2016). A hierarchical model of reviews for aspect-based sentiment analysis. In EMNLP.
- Seglen (1997). Why the impact factor of journals should not be used for evaluating research. BMJ 314, pp. 497.
- Sun et al. (2019). Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In NAACL.
- Teufel et al. (2006). Automatic classification of citation function. In EMNLP.
- Trusca et al. (2020). A hybrid approach for aspect-based sentiment analysis using deep contextual word embeddings and hierarchical attention. In ICWE.
- Valenzuela et al. (2015). Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data.
- Wallaart and Frasincar (2019). A hybrid approach for aspect-based sentiment analysis using a lexicalized domain ontology and attentional neural models. In ESWC.
- Wang et al. (2010). Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Xu et al. (2019). BERT post-training for review reading comprehension and aspect-based sentiment analysis. In NAACL.
- Zerva et al. (2020). Cited text span identification for scientific summarisation using pre-trained encoders. Scientometrics.