Venue Analytics: A Simple Alternative to Citation-Based Metrics

04/29/2019 ∙ by Leonid Keselman, et al. ∙ Carnegie Mellon University

We present a method for automatically organizing and evaluating the quality of different publishing venues in Computer Science. Since this method only requires paper publication data as its input, we can demonstrate our method on a large portion of the DBLP dataset, spanning 50 years, with millions of authors and thousands of publishing venues. By formulating venue authorship as a regression problem and targeting metrics of interest, we obtain venue scores for every conference and journal in our dataset. The obtained scores can also provide a per-year model of conference quality, showing how fields develop and change over time. Additionally, these venue scores can be used to evaluate individual academic authors and academic institutions. We show that using venue scores to evaluate both authors and institutions produces quantitative measures that are comparable to approaches using citations or peer assessment. In contrast to many other existing evaluation metrics, our use of large-scale, openly available data enables this approach to be repeatable and transparent.


1. Introduction

There exist many tools to evaluate professional academic scholarship. For example, Elsevier's Scopus provides many author-level and journal-level metrics to measure the impact of scholars and their work (Colledge et al., 2010; da Silva and Memon, 2017). Other publishers, such as the Public Library of Science, provide article-level metrics for their published work (Fenner, 2013). Large technology companies, such as Google and Microsoft, provide their own publicly available metrics for scholarship (Butler, 2011). Even independent research institutes, such as the Allen Institute's Semantic Scholar (Ammar et al., 2018), manage their own corpus and metrics for scholarly productivity. However, these author-based metrics (often derived from citation measurements) can be inconsistent, even across these large, established providers (da Silva and Dobránszki, 2018).

In this work, we propose a method for evaluating a comprehensive collection of published academic work by using an external evaluation metric. By taking a large collection of papers and using only information about their publication venue, who wrote them, and when, we provide a largely automated way of discovering each venue's value, as well as a system for automatically organizing venues. This is motivated by the desire for an open, reproducible, and objective metric that is not subject to some of the challenges inherent to citation-based methods (da Silva and Dobránszki, 2018; Bornmann and Daniel, 2008; Galiani and Gálvez, 2017).

We accomplish this by setting up a linear regression from a publication record to some metric of interest. We demonstrate three valid regression targets: status as a faculty member (a classification task), awarded grant amounts, and salaries. By using DBLP (Ley, 2002) as our source of publication data, NSF grants as our source of awards, University of California data for salaries, and CSRankings (Berger, 2018) for faculty affiliation status, we are able to formulate these as large-scale regression tasks, with design matrix dimensions on the order of a million in each dimension. However, since these matrices are sparse, regression weights can be obtained efficiently on a single laptop computer. Details of our method are explained in Section 4.

We call our results venue scores and validate their performance in the tasks of evaluating conferences, evaluating professors, and ranking universities. We show that our venue scores correlate highly with other influence metrics, such as h-index (Hirsch, 2005), citations, and highly influential citations (Valenzuela et al., 2015). Additionally, we show that university rankings derived from publication records correlate highly both with established rankings (News and Report, 2018; Education, 2018; Rankings, 2018) and with recently published quantitative metrics (Berger, 2018; Blackburn et al., 2018; Vucetic et al., 2018; Clauset et al., 2015). To help others build upon this work, all of our code and data are available at https://github.com/leonidk/venue_scores.

2. Related Work

Quantitative measures of academic productivity tend to focus on methods derived from citation counts. By using citation count as the primary method of scoring a paper, one can decouple an individual article from the authors who wrote it and the venue it was published in. Then, robust citation count statistics, such as h-index (Hirsch, 2005), can be used as a method of scoring either individual authors or a specific venue. Specific critiques of h-index scores arose almost as soon as the h-index was published, ranging from a claimed lack of utility (Lehmann et al., 2006) to a loss of discriminatory power (Tol, 2008).

Citations can also be automatically analyzed for whether or not they are highly influential to the citing paper, producing an "influential citations" metric used by Semantic Scholar (Valenzuela et al., 2015). Further, techniques from graph and network analysis can be used to understand systematic relationships in the citation graph (Bergstrom, 2007). Citation-based metrics can even be used to provide a ranking of different universities (Blackburn et al., 2018).

Citation-based metrics, despite their wide deployment in the scientometrics community, have several problems. For one, citation behavior varies widely by field (Bornmann and Daniel, 2008). Additionally, citations often exhibit non-trivial temporal behavior (Galiani and Gálvez, 2017), which also varies greatly by sub-field. These issues strongly affect one's ability to compare across disciplines and produce different scores at different times. Recent work suggests that citation-based metrics struggle to effectively capture a venue's quality with a single number (Walters, 2017). Comparing citation counts with statistical significance requires an order-of-magnitude difference in the citation counts (Kurtz and Henneken, 2017), which limits their utility in making fine-grained choices. Despite these quality issues, recent work (Vucetic et al., 2018) has demonstrated that citation-based metrics can be used to build a university ranking that correlates highly with peer assessment; we show that our method provides a similar quality of correlation.

Our use of straightforward publication data (Section 3) enables a much simpler model. This simplicity is key, as the challenges in maintaining good citation data have resulted in the major sources of h-index scores being inconsistent with one another (da Silva and Dobránszki, 2018).

While there exist many forms of ranking journals, such as Eigenfactor (Bergstrom, 2007) or SJR (Falagas et al., 2008), these tend to focus on journal-level metrics, while our work covers all venues, including conferences.

2.1. Venue Metrics

We are not the first to propose that scholars and institutions can be ranked by assigning scores to published papers (Ren and Taylor, 2007). However, in prior work the list of venues is often manually curated and assigned equal credit, a trend that holds for studies in the 1990s (Geist et al., 1996) and their modern online descendants (Berger, 2018). Instead, we propose a method for automatically obtaining scores for each venue, which requires no manual curation of valid venues.

Previous work (Yan and Lee, 2007) has developed methods for ranking venues automatically, generating unique scores based only on author data by labeling and propagating notions of "good" papers and authoritative authors. However, this work required a manually curated seed of what good work is, and it was only demonstrated on a small sub-field of conferences, as new publication cliques would require new labeling of "good" papers. Recent developments in network-based techniques for ranking venues have incorporated citation information (Zhang and Wu, 2018) and are able to produce temporal models of quality. In contrast, our proposed model does not require citation data to produce sensible venue scores.

Our work is, in some ways, most similar to that of CSRankings (Berger, 2018). CSRankings maintains a highly curated set of top-tier venues; venues selected for inclusion are given 1 point per paper, while excluded venues are given 0 points. Additionally, the university rankings produced by CSRankings use a manually curated set of categories, and rankings are produced via a geometric mean over these categories. In comparison, we produce unique scores for every venue and simply sum scores together when evaluating authors and institutions.

In one formulation of our method, we use authors' status as faculty (or not faculty) to generate our venue scores. We are not alone in this line of analysis; recent work has demonstrated that faculty hiring information can be used to generate university prestige rankings (Clauset et al., 2015).

Many existing approaches either focus only on journals (Falagas et al., 2008) or do not have their rankings available online. Since we do not have citation-level data for our dataset, we are unable to compare directly against certain existing methods. However, as these methods often deploy a variant of PageRank (Page et al., 1999), we describe a PageRank baseline in Section 6.1 and report its results.

3. Data

Our primary data source is the dblp computer science bibliography (Ley, 2002). DBLP contains millions of articles, with millions of authors across thousands of venues in Computer Science. We produced the results in this paper using the dblp-2019-01-01 snapshot. We restricted ourselves to conference and journal publications, skipping books, preprints, and articles under 6 pages or over 100 pages. We also merged dblp entries corresponding to the same conference (e.g., dblp contained "ECCV (1)" through "ECCV (16)"). This led to a dataset of 2,965,464 papers, written by 1,766,675 authors, across 11,255 uniquely named venues and 50 years of publications (1970 through 2019).
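As a concrete illustration, the sketch below shows the kind of filtering described above, assuming the public dblp.xml dump and its standard record types; the page-count heuristic, the venue-merging regex, and all function names are ours and are only illustrative.

```python
# Sketch: filtering dblp records to conference/journal papers of 6-100 pages,
# assuming the public dblp.xml dump. The page-count heuristic is illustrative.
import re
from lxml import etree

def page_count(pages):
    """Parse a dblp 'pages' field like '123-145'; return None if unparseable."""
    m = re.match(r"^(\d+)-(\d+)$", pages or "")
    return int(m.group(2)) - int(m.group(1)) + 1 if m else None

def iter_papers(path="dblp.xml"):
    keep = {"article", "inproceedings"}  # journals and conferences only
    for _, elem in etree.iterparse(path, load_dtd=True):
        if elem.tag in keep:
            n = page_count(elem.findtext("pages"))
            if n is not None and 6 <= n <= 100:
                venue = elem.findtext("journal") or elem.findtext("booktitle") or ""
                # Merge per-volume entries such as "ECCV (1)" ... "ECCV (16)".
                venue = re.sub(r"\s*\(\d+\)$", "", venue)
                authors = [a.text for a in elem.findall("author") if a.text]
                yield venue, authors, elem.findtext("year")
            elem.clear()  # keep memory bounded while streaming the large file
```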

Our first metric of interest is an individual's status as a faculty member at a university. For this, we used the faculty affiliation data from CSRankings (Berger, 2018), which is manually curated and contains hundreds of universities across the world and about 15,000 professors. For evaluation against other university rankings, we used the ScholarRank (Vucetic et al., 2018) data to obtain faculty affiliations, which contains a more complete survey of American universities (including more than 50 not currently included in CSRankings). While the CSRankings data is curated to have correct DBLP names for faculty, the ScholarRank data is not. To obtain valid affiliations, the names were automatically aligned with fuzzy string matching, resulting in about 4,000 faculty with seemingly unique DBLP names and correct university affiliations. Although both of these sources are manually curated, automatic surveys of faculty affiliations have recently been demonstrated (Morgan et al., 2018b).

Our second metric of interest was National Science Foundation award data, covering the years 1970 through 2018. This data is available directly from the NSF (Foundation, 2018). We adjusted all award amounts using annual CPI inflation data. To perform our analysis, we restricted ourselves to awards that had a finite amount, where we could match at least half of the Principal Investigators on the grant to DBLP names, and where the grant was above $20,000. Award amounts over 10 million dollars were clipped in a smooth way to avoid fitting to a few extreme outliers. This resulted in 407,012 NSF grants used in building our model.
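The exact smooth-clipping function is not specified here, so the sketch below shows one plausible soft cap (logarithmic growth above $10M is an assumption on our part), together with the filters described above.

```python
# Sketch of the NSF award preprocessing described above. The logarithmic soft cap
# above $10M is an assumption; the other thresholds come from the text.
import numpy as np

CAP = 10_000_000.0   # awards above this are clipped "in a smooth way"
MIN_AWARD = 20_000.0

def soft_clip(amount, cap=CAP):
    """Leave amounts below the cap unchanged; grow only logarithmically above it."""
    amount = np.asarray(amount, dtype=float)
    return np.where(amount <= cap, amount,
                    cap * (1.0 + np.log(np.maximum(amount, cap) / cap)))

def keep_award(amount_cpi_adjusted, n_pis, n_pis_matched):
    """Filters from Section 3: finite amount, at least $20k, at least half the PIs matched."""
    return (np.isfinite(amount_cpi_adjusted)
            and amount_cpi_adjusted >= MIN_AWARD
            and 2 * n_pis_matched >= n_pis)
```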

Our third and final metric of interest was University of California salary data (Institute, 2018). This was inspired by a paper that tried to predict ACM/IEEE Fellowships for 87 professors using salary data from the public universities in Florida (Nocka et al., 2014). We looked at professors across the entire University of California system, matching their names to DBLP entries in an automated way. We used the maximum salary for a given individual across the 2015, 2016, and 2017 datasets, skipping individuals whose salaries fell below a lower threshold or above an upper threshold. This resulted in 2,436 individuals, down from 3,102 matched names and an initial set of about 20,000 professors. As DBLP contains some chemistry, biology, and economics venues, we expect that some of these are likely not Computer Science professors.

4. Method

Our basic statistical model is that a paper in a given publication venue (either a conference or a journal), having passed peer review, is worth a certain amount of value. Some publication venues are more prestigious, impactful, or selective, and thus should have higher scores. Other venues have less strict standards, or perhaps simply provide less opportunity to disseminate ideas through a community (Morgan et al., 2018a), and should be worth less. While this model explicitly ignores differences in paper quality within a given publication venue, discarding this information allows us to use a large quantity of data to develop a statistically robust scoring system.

This methodology is not valid for all fields of science, nor for all models of how highly impactful ideas are developed and disseminated. In general, our method requires that individual authors have multiple publications across many different venues, which is more true in Computer Science than in some of the natural sciences or humanities, where publishing rates are lower (Lubienski et al., 2018). If we assume that all research ideas produce only a single paper, and that passing peer review is a noisy measurement of quality (Shah et al., 2018), then our proposed method would not work very well. Instead, the underlying process that would make our methodology valid is that good research ideas produce multiple research publications in selective venues; better ideas would produce more individual publications in higher quality venues. The adage that all models are wrong, but some are useful is our guiding principle here. This assumption allows us to obtain venue scores in an automatic way, and then use these scores to evaluate both authors and institutions.

4.1. Formal Setup

In general, we obtain a score for each venue by setting up a linear regression task of the form

$$X \mathbf{w} = \mathbf{y} \qquad (1)$$

Authors are listed along the rows of the design matrix X, and conferences (or any other venues) are listed along the columns; each entry records the number of publications that an author has in that venue. There is an additional column of 1s in order to learn a bias offset in the regression.

Different forms of counting author credit are discussed in Section 4.6, while different regression targets are discussed in Section 3. In Equation 1, the regression target y is shown as a binary variable indicating whether or not each author is currently a professor. If this linear system is solved, then the vector w will contain real-valued scores for every single publishing venue. Since our system is over-determined, there is generally no exact solution.
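A minimal sketch of assembling this sparse design matrix from filtered DBLP records is shown below; the helper names and data layout are ours, and the matrix uses the simplest one-point-per-paper credit model.

```python
# Sketch: building the sparse author-by-venue design matrix of Equation 1, with a
# trailing column of ones for the bias term. Helper names and layout are illustrative.
import numpy as np
import scipy.sparse as sp

def build_design_matrix(papers, author_index, venue_index):
    """papers: iterable of (venue, [authors]) tuples, filtered as in Section 3."""
    rows, cols, vals = [], [], []
    for venue, authors in papers:
        j = venue_index[venue]
        for a in authors:
            rows.append(author_index[a])
            cols.append(j)
            vals.append(1.0)  # one point per paper; see Section 4.6 for other credit models
    n_authors, n_venues = len(author_index), len(venue_index)
    X = sp.coo_matrix((vals, (rows, cols)), shape=(n_authors, n_venues)).tocsr()
    bias = sp.csr_matrix(np.ones((n_authors, 1)))  # duplicate (i, j) entries are summed by tocsr()
    return sp.hstack([X, bias], format="csr")
```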

Instead of solving this sparse linear system directly, we solve a regularized regression, using a robust loss and ℓ2 regularization. That is, we iteratively minimize the following expression via stochastic gradient descent (Robbins and Monro, 1951):

$$\min_{\mathbf{w}} \; \sum_{i} L\left(y_i, \mathbf{x}_i^{\top} \mathbf{w}\right) + \alpha \lVert \mathbf{w} \rVert_2^2 \qquad (2)$$

The ℓ2 regularization enforces a Gaussian prior on the learned conference scores. We can perform this minimization in Python using common machine learning software packages (Pedregosa et al., 2011). We tend to use a robust loss function, such as the Huber loss in the case of regression (Huber, 1964), which is quadratic for errors smaller than δ and linear for errors larger than δ. It can be written as

$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}\left(y - \hat{y}\right)^2 & \text{if } \lvert y - \hat{y} \rvert \le \delta \\ \delta\left(\lvert y - \hat{y} \rvert - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases} \qquad (3)$$

In the case of classification, we have labels y ∈ {-1, +1} and use the modified Huber loss (Zhang, 2004),

$$L(y, f(x)) = \begin{cases} \max\left(0,\, 1 - y f(x)\right)^2 & \text{if } y f(x) \ge -1 \\ -4\, y f(x) & \text{otherwise} \end{cases} \qquad (4)$$

We experimented with other loss functions, such as the logistic loss; while they tended to produce similar rankings and results, we found that the modified Huber loss provided better empirical performance on our test metrics, even though the resulting curves looked very similar upon qualitative inspection.
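The sketch below shows how such a minimization could be run with scikit-learn's SGD solvers (Pedregosa et al., 2011); the hyperparameter values are placeholders rather than the settings used for our reported results.

```python
# Sketch: minimizing Equation 2 with scikit-learn's SGD solvers (Pedregosa et al., 2011).
# The alpha, max_iter, and tol values are placeholders, not the paper's settings.
from sklearn.linear_model import SGDClassifier, SGDRegressor

def fit_faculty_classifier(X, is_faculty):
    """Faculty-status target: modified Huber loss (Eq. 4) with l2 regularization."""
    clf = SGDClassifier(loss="modified_huber", penalty="l2", alpha=1e-5,
                        fit_intercept=False,  # the bias column is already part of X
                        max_iter=50, tol=1e-4)
    clf.fit(X, is_faculty)
    return clf.coef_.ravel()  # one learned score per venue column (plus the bias weight)

def fit_award_regressor(X, targets):
    """NSF-award or salary target: Huber loss (Eq. 3) with l2 regularization."""
    reg = SGDRegressor(loss="huber", penalty="l2", alpha=1e-5,
                       fit_intercept=False, max_iter=50, tol=1e-4)
    reg.fit(X, targets)
    return reg.coef_.ravel()
```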

         Faculty   NSF   Salary
Faculty   1.00    0.90    0.74
NSF       0.90    1.00    0.79
Salary    0.74    0.79    1.00
Table 1. Spearman's ρ correlation between rankings produced by targeting different metrics of interest.

4.2. Metrics of Interest

As detailed in section 3, we targeted three metrics of interest: status as a faculty member, NSF award sizes, and professor salaries. Each of these metrics came from a completely independent data source, and we found that they each had their own biases and strengths (more in section 5).

For faculty status, we used the modified Huber classification loss and the CSRankings (Berger, 2018) faculty affiliations. To build venue scores that reward top-tier conferences more highly, we only gave professors at the top-k ranked universities positive labels. We tried several values of k and found that, while the resulting scores differed qualitatively, they performed roughly the same on our test metrics; unless otherwise stated, we used a single fixed value of k. The university ranking used to select the top k was CSRankings itself, and included international universities covering the Americas, Europe, and Asia. This classification is performed across all authors, leading to 1.7 million rows in our design matrix.

For the NSF awards, every Principal Investigator on an award had their papers up to the award year as the input features. We used a Huber loss and experimented with normalizing our award data to have zero mean and unit variance. Additionally, we built models for both raw award sizes and the log of award sizes; the raw award sizes appear to follow a power law, while the log award sizes appear to be distributed approximately as a Gaussian. Another modeling choice is whether to regress each NSF grant as an independent measurement or as a marginal measurement that tracks the cumulative total of NSF grants received by the authors. If not all authors on a grant were matched to DBLP names, we only used the fraction of the award corresponding to the fraction of identified authors. This regression had on the order of a million rows in its design matrix.

For the salary data, we found that normalizing the salaries to have zero mean and unit variance led to a very poor regression result, while using no normalization produced a good result. This regression only had the 2,436 matched individuals described in Section 3 as datapoints, and thus provided information about fewer venues than the other metrics of interest.

4.3. Modeling Change Over Time

In modeling conference values, we wanted to build a model that could assign different values in different years. For example, a venue may have been considered excellent in the 1980s, but may have declined in influence and prestige since then. To account for this behavior, we break our dataset into chunks of a fixed number of years and create a different regression variable for each conference in each chunk. The non-temporal model is obtained simply by using a single chunk that spans the entire dataset.

Our more sophisticated model creates an independent variable for each conference in each year. After setting the chunk size to a single year in the block model, we splat each publication as a truncated Gaussian centered on its publication year. By modifying the Gaussian's width σ, we are able to control the smoothness of the obtained weights. We apply the truncated Gaussian via a sparse matrix multiply of the design matrix with an appropriate band-diagonal sparse matrix. The use of a truncated Gaussian (where the tails are clipped and the Gaussian is re-normalized) allows the matrix to maintain a sparse structure. Unless otherwise stated, we used a σ that produced an effective window size of about 10 years. This can be seen visually in Figure 2.
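A minimal sketch of this splatting step is shown below; the particular σ and truncation radius are illustrative, chosen only to give a window of roughly the size described above.

```python
# Sketch: a truncated-Gaussian "splat" matrix over years (Section 4.3). Multiplying a
# venue's per-year columns by B spreads each paper's value across nearby years while
# keeping everything sparse. The sigma and radius are illustrative; the paper reports
# an effective window of roughly 10 years.
import numpy as np
import scipy.sparse as sp

def splat_matrix(n_years, sigma=2.5, radius=5):
    rows, cols, vals = [], [], []
    for y in range(n_years):
        lo, hi = max(0, y - radius), min(n_years, y + radius + 1)
        offsets = np.arange(lo, hi) - y
        w = np.exp(-0.5 * (offsets / sigma) ** 2)
        w /= w.sum()  # re-normalize the truncated Gaussian
        rows.extend([y] * len(offsets))
        cols.extend(range(lo, hi))
        vals.extend(w)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n_years, n_years))

# Usage: X_venue @ splat_matrix(n_years), where X_venue holds one column per year.
```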

Our different temporal models are compared in Table 5 by correlating against existing author-level, journal-level and university-level metrics. For evaluation details see Section 6.

4.4. Normalizing Differences Across Years

The temporal models described in the previous section have an inherent bias. Due to the temporal window of the DBLP publication history, there is variation in the distributed value caused by changes in annual NSF funding, the survivorship of current academic faculty, and so on. To adjust for this bias, we scale each conference-year value by the standard deviation of conference values in that year. This scaling can help or hurt performance, depending on which metric of interest the model is built against. It generally produces flatter value scores over time but leads to some artifacts; for example, it assigns extremely high value to theory/algorithms conferences in the 1970s (such as STOC/FOCS). Despite these issues, unless otherwise noted, all of our experiments performed this scaling. The effects of this normalization are shown in Figure 3 and Table 2.

Figure 2. Truncated Gaussian used to splat a publication's value across multiple years. Examples are centered at the year 2000 and the year 2018.

Figure 3. Results showing the effect of performing a normalization for venue year and size. See Sections 4.4 and 4.5.

4.5. Normalizing Differences In Venue Size

Our model uses ℓ2 regularization on the venue scores, which tends to squash the value of variables with less explanatory power. We found that this process often resulted in the under-valuing of smaller, more selective venues. To correct for this size bias, instead of giving each paper a full point of value in the design matrix, we give each paper a reduced credit that decreases with n, the number of papers published at that venue in that year; this step is performed before Gaussian splatting.
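The exact down-weighting function is not reproduced here; the sketch below uses an inverse-square-root weight purely as an illustrative choice.

```python
# Sketch of the venue-size normalization (Section 4.5). The exact down-weighting
# function is not reproduced here; 1/sqrt(n) is shown purely as an illustrative choice.
from collections import Counter
import numpy as np

def size_weights(papers):
    """papers: iterable of (venue, year) pairs; returns a per-(venue, year) paper weight."""
    counts = Counter(papers)
    return {key: 1.0 / np.sqrt(n) for key, n in counts.items()}
```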

Normalization influential citations (author) h-index (author) h-index (university) h-index (venue)
None 0.72 0.61 0.60 0.26
Year 0.72 0.66 0.58 0.18
Size 0.74 0.63 0.61 0.25
Year + Size 0.72 0.61 0.60 0.26
Table 2. Spearman correlation between our model and existing metrics, showing the effect of different normalization schemes. See Sections 4.4 and 4.5 for model details. See Section 6 for evaluation details.

4.6. Modeling Author Position

Another question to consider is how credit for a paper is divided up amongst the authors. We consider four models of authorship credit assignment:

  1. Authors get 1/n credit for each paper, where n is the number of authors on the paper. This is the model used by CSRankings (Berger, 2018).

  2. All authors get full credit (1 point) for each paper

  3. Authors receive less credit for later positions (the per-position weights decrease with author order and are normalized so total credit sums to 1). This model awards earlier authors more value; it has been documented in the literature (Sekercioglu, 2008) and is used by (Huang, 2018).

  4. The same as (3), except the last author is explicitly assigned equal credit with the first author before normalization.

Using Spearman correlation with Semantic Scholar's "highly influential citations", an evaluation metric described in depth in Section 6.3, we can evaluate each of these models. Specifically, there are two places where venue scores require a choice of authorship model. The first is how much credit is assigned to each paper when performing the regression (in the case of our faculty metric of interest). The second is when evaluating authors with the obtained regression vector. See Table 3 for a summary of experimental results, and the sketch below for the credit models themselves.

Regression Author Model (rows) vs. Evaluation Author Model (columns)

      1     2     3     4
1    0.70  0.72  0.65  0.70
2    0.68  0.71  0.61  0.67
3    0.71  0.73  0.66  0.71
4    0.70  0.72  0.65  0.71
Table 3. Correlation (Spearman's ρ) between our model and Semantic Scholar (Valenzuela et al., 2015), showing the properties of different authorship models. For details see Section 4.6.
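The sketch below enumerates the four credit models; the geometric decay used for models 3 and 4 is an assumption made for illustration, as the exact positional decay is not reproduced above.

```python
# Sketch of the four authorship-credit models of Section 4.6. The geometric decay in
# models 3 and 4 is an illustrative assumption.
import numpy as np

def author_credit(n_authors, model, decay=0.5):
    if model == 1:                        # equal split among authors (as in CSRankings)
        return np.full(n_authors, 1.0 / n_authors)
    if model == 2:                        # every author gets full credit
        return np.ones(n_authors)
    w = decay ** np.arange(n_authors)     # earlier author positions receive more credit
    if model == 4 and n_authors > 1:      # last author matches the first, pre-normalization
        w[-1] = w[0]
    return w / w.sum()                    # normalize so the total credit sums to 1
```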

For the purposes of evaluation, assigning full credit to authors (model 2) produced the best results, while model 3 consistently produced the lowest quality correlations. For the purposes of performing the classification task, the roles are flipped: assigning full credit (model 2) consistently produces the worst correlations, while model 3 produces the highest quality correlations.

4.7. Combining Models

Since the proposed metrics of interest (faculty status, NSF awards, salaries) were generated from independent regression targets with differently sized design matrices, there may be value in combining them to produce a joint model. The value of ensemble models is well documented in both theory (Freund and Schapire, 1997) and practice (Bell et al., 2009). In the absence of a preferred metric with which to cross-validate, we simply perform an unweighted average of our models to obtain a gold model. To ensure that the weights are of similar scale, the conference scores are normalized to have zero mean and unit variance before combining them, and venue scores that are too large or too small are clipped at 12 standard deviations. For the temporal models, this normalization is performed on a per-year basis. While Table 4 shows results for a simple combination, one could average together many models with different choices of hyperparameters, regression functions, datasets, filters to scrub the data, and so on.
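A minimal sketch of this combination step is shown below, assuming one score vector per metric of interest; the function name is ours.

```python
# Sketch of the ensemble in Section 4.7: per-model z-scoring, clipping at 12 standard
# deviations, and an unweighted average. For temporal models, the same steps run per year.
import numpy as np

def combine_models(score_vectors, clip=12.0):
    """score_vectors: list of equal-length 1-D arrays of venue scores, one per metric."""
    z = [(s - s.mean()) / s.std() for s in score_vectors]
    z = [np.clip(s, -clip, clip) for s in z]  # clip extreme venue scores
    return np.mean(z, axis=0)                 # the unweighted "gold" model
```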

Model citations h-index (Hirsch, 2005) influential citations (Valenzuela et al., 2015)
Faculty 0.59 0.68 0.71
NSF 0.63 0.66 0.67
Salary 0.36 0.36 0.41
Combined 0.69 0.77 0.75
Table 4. Correlation between our model and traditional measures of scholarly output on the dataset of CMU faculty. For model details see Section 4.7. For evaluation details see Section 6.3.

5. Results

A visual example of some of our venue scores is shown in Figure 1. We kept the y-axis fixed across the different Computer Science sub-disciplines to show how venue scores can be used to compare different fields on a unified scale. Additional results appear in our qualitative demonstration of normalization methods, Figure 3.

Due to the variation in rankings produced by one’s choice of hyperparameters, and the large set of venues being evaluated, we do not have a canonical set of rankings that can be presented succinctly here. Instead, we will focus on quantitative evaluations of our results in the following section.

Years Metric AI AH USN VH VC
Faculty 0.73 0.69 0.74 0.63 0.42
10 0.67 0.57 0.76 0.57 0.35
50 0.75 0.68 0.76 0.38 0.21
NSF 0.64 0.62 0.62 0.61 0.59
10 0.68 0.68 0.60 0.59 0.60
50 0.67 0.65 0.63 0.64 0.67
Salary 0.62 0.58 0.59 0.48 0.55
10 0.65 0.62 0.57 0.45 0.55
50 0.66 0.63 0.56 0.43 0.63
Table 5. Correlation between rankings produced by our model and rankings produced by traditional scholarly metrics. Different rows correspond to different hyperparameter choices for our model; each column corresponds to a traditional metric. AI = Author Highly Influential Citations, AH = Author H-index, USN = US News 2018, VH = Venue H-index, VC = Venue Citations.

6. Evaluation

To validate the venue scores obtained by our regression methods, our evaluation consists of correlating our results against existing rankings and metrics. We consider three classes of existing scholarly measurements to correlate against: those evaluating universities, authors, and venues. Each of these classes has different standard techniques and a different evaluation dataset, so they are described separately in Sections 6.2, 6.3, and 6.4.

In the case of our proposed method, venue scores, there is a simple way to turn them from a venue-based metric into an author-based or institution-based metric. Venues are evaluated directly with their scores. Authors are evaluated as the dot product of the venue scores and the author's publication vector. Universities are evaluated as the dot product of the venue scores and the total publication vector of all faculty affiliated with that university.
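A minimal sketch of this scoring scheme is shown below; the function names and data layout are ours.

```python
# Sketch of turning venue scores into author- and university-level metrics (Section 6).
import numpy as np

def author_score(venue_scores, author_pub_vector):
    """author_pub_vector: per-venue publication credits for one author."""
    return float(np.dot(venue_scores, author_pub_vector))

def university_score(venue_scores, faculty_pub_vectors):
    """Sum the affiliated faculty's publication vectors, then dot with the venue scores."""
    return float(np.dot(venue_scores, np.sum(faculty_pub_vectors, axis=0)))
```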

6.1. PageRank Baseline

Many existing approaches build on the idea of eigenvector centrality (Bergstrom, 2007; Yan and Lee, 2007; Zhang and Wu, 2018). We implemented PageRank (Page et al., 1999) using the power iteration method to compute a centrality measure for both author-level and venue-level metrics. Unlike most versions of PageRank, which use citation counts, we implement two variants based solely on co-authorship information.

Author-level PageRank (PageRankA) is computed on the 1.7M x 1.7M co-authorship graph, where an edge is added each time two authors co-author a paper. We found that the authors with the highest centrality measures are often common names with insufficient disambiguation information in DBLP.

Journal-level PageRank (PageRankC) is computed on the 11,000 x 11,000 venue co-authorship graph, where an edge is added for every author who publishes in both venues. When run on the unfiltered DBLP data, the highest scoring venue was arXiv, an expected result.
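A minimal sketch of the power-iteration computation is shown below; the damping factor and convergence settings are the usual defaults rather than values taken from our experiments, and dangling-node handling is simplified.

```python
# Sketch of the co-authorship PageRank baseline (Section 6.1) via power iteration on a
# row-normalized sparse adjacency matrix. Dangling-node mass is ignored for brevity.
import numpy as np
import scipy.sparse as sp

def pagerank(adj, damping=0.85, n_iter=100, tol=1e-9):
    """adj: symmetric sparse matrix of co-authorship (or shared-author) edge weights."""
    n = adj.shape[0]
    degree = np.asarray(adj.sum(axis=1)).ravel()
    degree[degree == 0] = 1.0                        # avoid dividing by zero for isolated nodes
    transition = sp.diags(1.0 / degree) @ adj        # row-stochastic transition matrix
    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        new_rank = (1.0 - damping) / n + damping * (transition.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank
```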

6.2. University Ranks

For this work, we produce university rankings simply as an evaluation method to demonstrate the quality and utility of our venue scoring system. The reader is cautioned that university ranking systems tend to produce undesirable gaming behavior (Johnes, 2018) and are prone to manipulation.

We obtained and aligned many existing university rankings for Computer Science departments. These include rankings curated by journalistic sources, such as the US News rankings (News and Report, 2018), the QS rankings (Rankings, 2018), the Shanghai Ranking (Rankings, 2015), the Times Higher Education rankings (Education, 2018), and the National Research Council report (Clauset et al., 2015). In addition, we consider purely quantitative evaluation systems such as ScholarRank (Vucetic et al., 2018), CSRankings (Berger, 2018), CSMetrics (Blackburn et al., 2018), and Prestige Rankings (Clauset et al., 2015). We additionally include ScholarRank's t10sum metric, computed over the same matched faculty that our venue scores use.

We follow a recent paper (Vucetic et al., 2018), which demonstrated the efficacy of a citation-based metric in producing rankings with strong correlation against the US News rankings. We extend these experiments to include more baselines. In contrast with (Vucetic et al., 2018), we use a rank correlation metric (namely Kendall's τ), which naturally handles ordinal ranking systems. While ScholarRank (Vucetic et al., 2018) reported a higher correlation with US News, that was under Pearson's correlation coefficient; under Kendall's τ, the result is 0.768 for the published version and 0.757 using full-precision ScholarRank.
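A minimal sketch of this comparison, using SciPy's implementation of Kendall's τ, is shown below.

```python
# Sketch of comparing two university rankings with Kendall's tau (Section 6.2).
from scipy.stats import kendalltau

def ranking_agreement(scores_a, scores_b):
    """Inputs: scores or ranks for the same, identically ordered list of universities."""
    tau, _p_value = kendalltau(scores_a, scores_b)
    return tau
```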

Our faculty-based regression generates the result with the highest correlation against the US News rankings. We perform even better than ScholarRank, which was designed to optimize this metric (although under a non-rank correlation measure).

Ranking Correlation with US News 2018
USN2018 (News and Report, 2018)   1.000
USN2010 (Clauset et al., 2015)   0.928
Venue Scores   0.780
ScholarRank (Vucetic et al., 2018)   0.768
ScholarRankFull   0.757
CSMetrics (Blackburn et al., 2018)   0.746
CSRankings (Berger, 2018)   0.724
Times (Education, 2018)   0.721
NRC95 (Clauset et al., 2015)   0.713
t10Sum (Vucetic et al., 2018)   0.713
Prestige (Clauset et al., 2015)   0.666
Citations (Vucetic et al., 2018)   0.665
Shanghai (Rankings, 2015)   0.586
# of papers   0.585
BestPaper (Huang, 2018)   0.559
PageRankA   0.535
PageRankC   0.532
QS (Rankings, 2018)   0.518
Table 6. Kendall's τ correlation across different university rankings (Section 6.2).

6.3. Author-level Metrics

To evaluate our venue scores in the application of generating author-level metrics, we use rank correlation (also known as Spearman's ρ (Spearman, 1904)) between our venue scores and traditional author-level metrics such as h-index. Google Scholar was used to obtain citations, h-index (Hirsch, 2005), and i10-index, and Semantic Scholar was used to obtain highly influential citations (Valenzuela et al., 2015). Prior work has critiqued the h-index measure (Yong, 2014) and proposed an alternative metric derived from citation counts; however, our use of a rank correlation means that monotonically transformed approximations of citation counts would lead to identical scores.

For evaluation, we collected a dataset for the largest Computer Science department in CSRankings, Carnegie Mellon University (N=148). The results are shown in Table 7. We can see that venue scores correlate highly with h-index, influential citations, i10-index, and CSRankings scores. The results from the author-based PageRank are surprisingly similar to our venue scores. However, the conference-based PageRank (PageRankC) performed worse than venue scores on every correlation metric.

papers citations h-index i10 CSR (Berger, 2018) venue score PageRankA PageRankC influence (Valenzuela et al., 2015)
papers 1.00 0.66 0.79 0.81 0.71 0.94 0.94 0.89 0.76
citations 0.66 1.00 0.93 0.88 0.49 0.66 0.68 0.60 0.81
h-index 0.79 0.93 1.00 0.97 0.56 0.75 0.81 0.68 0.80
i10 0.81 0.88 0.97 1.00 0.53 0.75 0.82 0.69 0.73
CSR (Berger, 2018) 0.71 0.49 0.56 0.53 1.00 0.84 0.64 0.80 0.64
venue score 0.94 0.66 0.75 0.75 0.84 1.00 0.86 0.92 0.78
PageRankA 0.94 0.68 0.81 0.82 0.64 0.86 1.00 0.83 0.72
PageRankC 0.89 0.60 0.68 0.69 0.80 0.92 0.83 1.00 0.67
influence (Valenzuela et al., 2015) 0.76 0.81 0.80 0.73 0.64 0.78 0.72 0.67 1.00
Table 7. Correlation between different author-level metrics for a dataset of professors (N=148). Details are in Section 6.3.
              papers  citations  h-index  PageRankC  venue scores
papers         1.00     0.81      0.52      0.95        0.57
citations      0.81     1.00      0.66      0.77        0.61
h-index        0.52     0.66      1.00      0.52        0.64
PageRankC      0.95     0.77      0.52      1.00        0.57
venue scores   0.57     0.61      0.64      0.57        1.00
Table 8. Spearman's ρ correlation between different journal-level metrics (N=1,308). For details see Section 6.4.

6.4. Journal-level metrics

To evaluate the fidelity of our venue scores for journals and conferences, we obtained the h-index (Hirsch, 2005) and citation count for 1,308 venues from the Microsoft Academic Graph (Schauerte, 2014). We continue to use Spearman's ρ as our correlation metric, even though rank-correlation metrics can be strongly affected by noisy data (Abdullah, 1990).

Under this metric, venue scores correlated highly with h-index. Notably, venue scores for conferences correlated with each conference's h-index about as well as its h-index correlated with its number of citations. See Table 8 for detailed results.

7. Discussion

Our results show a medium to strong correlation of venue scores against existing scholarly metrics, such as citation count and h-index. For author metrics, venue scores correlate with influential citations (Valenzuela et al., 2015) or h-index about as well as such measures correlate against each other or raw citation counts (see Table 7). For venue metrics, venue scores correlate with h-index (0.64) and citations (0.61) nearly as well as citations correlate with h-index (0.66). For university metrics, venue scores correlate as well with measures of peer assessment as citation-based metrics do (Vucetic et al., 2018).

As h-index and citation counts have their flaws, obtaining perfect correlation is not necessarily a desirable goal. Instead, these strong correlations serve as evidence for the viability of venue scores.

Venue scores appear robust to hyperparameter choices (Tables 2, 3, 4, and 5). Even venue scores produced from completely different data sources tend to look very similar (Table 1). Additionally, venue scores naturally capture the variation of conference quality over time (Figures 1 and 3).

As with any inductive method, venue scores are data-driven and will be subject to past biases. For example, venue scores can clearly be biased by hiring practices, pay inequality and NSF funding priorities. As these are the supervising metrics, bias in those datasets will be encoded in our results. For example, we found that the faculty hiring metric prioritized Theoretical Computer Science, while using NSF awards prioritized Robotics. The faculty classification task may devalue publishing areas where candidates pursue industry jobs, while the NSF grant regression task may devalue areas with smaller capital requirements. By using large datasets and combining multiple metrics in a single model (Section 4.7), the final model could reduce the biases in any individual dataset.

Each of our metrics of interest has an inherent bias in timescale, which our temporal normalization tries to correct for, though likely incompletely. Salaries are often higher for senior faculty. NSF awards can have a long response time and a preference towards established researchers. Faculty classification prioritizes the productive years of existing faculty. Additionally, faculty hiring as a metric will be biased towards work from prestigious universities (Clauset et al., 2015) and their venue preferences. Some of these issues also exist in citation metrics, which may be why our uncorrected models correlated better with them (Table 2).

Figure 4. Automatic clustering of venues in Computer Science; the largest venues in each cluster are labeled.

8. Similarity Metrics

While the previous sections of this paper have focused on evaluation, the same dataset can be used to organize venues into groups. For organization, we use a much smaller dataset, using data since 2005 and only evaluating the 1,155 venues that have at least 20 R1 universities with faculty publishing in them. We then build the venue-author matrix, counting the number of papers that each author published in each venue. Performing a 50-topic Latent Dirichlet Allocation (Blei et al., 2003), we obtain a 50-dimensional vector representing each conference in a meaningful way.
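A minimal sketch of this embedding step with scikit-learn is shown below; apart from the 50-topic count, the hyperparameters are library defaults.

```python
# Sketch of the venue embedding in Section 8: a 50-topic LDA over the venue-by-author
# count matrix. All hyperparameters other than the topic count are library defaults.
from sklearn.decomposition import LatentDirichletAllocation

def embed_venues(venue_author_counts, n_topics=50):
    """venue_author_counts: sparse matrix; rows = venues, columns = authors, entries = paper counts."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(venue_author_counts)  # one 50-dimensional vector per venue
```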

These vectors can then be clustered (Arthur and Vassilvitskii, 2007) to produce automatic categories for each conference. The high-dimensional vectors can also be embedded (van der Maaten and Hinton, 2008) into two dimensions to produce a visual map of Computer Science; see Figure 4 for our result. These clusters represent natural categories in Computer Science. For example, it is easy to see groups that could be called Theory, Artificial Intelligence, Machine Learning, Graphics, Vision, Parallel Computing, Software Engineering, and Human-Computer Interaction, among many others.

While some clusters are distinct and repeatable, others are not. When datasets contain challenging cases, the ideal clustering can be hard to estimate (Ben-David, 2015). Using silhouette scores (Rousseeuw, 1987), we can estimate how many natural clusters of publishing exist in Computer Science. In our experiments, silhouette scores were maximized with 40 to 45 clusters. As the clustering process is stochastic, we were unable to determine the optimal cluster number with statistical significance.
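A minimal sketch of this cluster-count selection is shown below; the range of candidate cluster counts is illustrative.

```python
# Sketch of choosing the number of venue clusters with silhouette scores (Section 8).
# k-means++ seeding (Arthur and Vassilvitskii, 2007) is scikit-learn's default init;
# the candidate range of cluster counts below is illustrative.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_cluster_count(venue_vectors, k_range=range(30, 56)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(venue_vectors)
        scores[k] = silhouette_score(venue_vectors, labels)
    return max(scores, key=scores.get), scores
```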

By embedding each author with the weighted average of their publications' vectors, we can also obtain a fingerprint that shows which areas of Computer Science each university focuses on. See Figure 6 for examples of such fingerprints for many top departments. The same clustering method can be used to analyze the focus areas of a single department; for an example, see Figure 7.

9. Conclusion

We have presented a method for ranking and organizing a scholarly field based only on simple publication information: a list of papers, each labeled with only its publication venue, authors, and year. By regressing venue scores against metrics of interest, one obtains a plausible set of venue scores. These scores can be compared across sub-fields and aggregated into author-level and institution-level metrics. The scores provided by this system, and their resulting rankings, correlate highly with other established metrics. As this system is based on easily obtainable, publicly available data, it is transparent and reproducible. Our method builds on simple techniques and demonstrates that their application to large-scale data can produce surprisingly robust and useful tools for scientometric analysis.

Figure 5. The career arcs of several accomplished Computer Scientists. The first row uses a simple model where all papers are given equal weight, first using raw counts and then normalizing by the number of papers published each year. The second row shows our model.
Figure 6. Heatmap showing the differences in research focus areas across top Computer Science universities. Figure 4 can be used as a guide.


Figure 7. An embedding of Carnegie Mellon University's School of Computer Science, with colors indicating sub-departments. For example, the Robotics Institute (RI) has clear clusters for Robotics, Graphics, and Computer Vision.

Acknowledgements.
Martial Hebert suggested the use of correlation metrics as a technique for quantitative evaluation, thereby providing the framework for every table of results in this paper. Emery Berger developed CSRankings (Berger, 2018), which was highly influential in the design and implementation of this project. Joseph Sill (Sill, 2010), Wayne Winston, Jeff Sagarin and Dan Rosenbaum developed Adjusted Plus-Minus, a sports analytics technique that partially inspired this work. Kevinjeet Gill was unrelenting in advocating for a year-by-year regression model to avoid sampling and quantization artifacts.

References

  • Abdullah (1990) Mokhtar Bin Abdullah. 1990. On a Robust Correlation Coefficient. Journal of the Royal Statistical Society. Series D (The Statistician) 39, 4 (1990), 455–460. http://www.jstor.org/stable/2349088
  • Ammar et al. (2018) Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Wang, Chris Willhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the Literature Graph in Semantic Scholar. In NAACL HLT. Association for Computational Linguistics, 84–91. https://doi.org/10.18653/v1/N18-3011
  • Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. 2007. K-means++: The Advantages of Careful Seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07). SIAM, Philadelphia, PA, USA, 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494
  • Bell et al. (2009) Robert M. Bell, Yehuda Koren, and Chris Volinsky. 2009. The BellKor solution to the Netflix Prize.
  • Ben-David (2015) Shai Ben-David. 2015. Clustering is Easy When ….What? arXiv e-prints, Article arXiv:1510.05336 (Oct 2015), arXiv:1510.05336 pages. arXiv:stat.ML/1510.05336
  • Berger (2018) Emery Berger. 2018. CSRankings: Computer Science Rankings. http://csrankings.org/.
  • Bergstrom (2007) Carl Bergstrom. 2007. Eigenfactor: Measuring the value and prestige of scholarly journals. College & Research Libraries News 68, 5 (2007), 314–316.
  • Blackburn et al. (2018) Steve Blackburn, Carla Brodley, H. V. Jagadish, Kathryn S McKinley, Mario Nascimento, Minjeong Shin, Sean Stockwel, Lexing Xie, and Qiongkai Xu. 2018. csmetrics.org: Institutional Publication Metrics for Computer Science. https://github.com/csmetrics/csmetrics.org/blob/master/docs/Overview.md.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
  • Bornmann and Daniel (2008) Lutz Bornmann and Hans-Dieter Daniel. 2008. What do citation counts measure? A review of studies on citing behavior. Journal of documentation 64, 1 (2008), 45–80.
  • Butler (2011) D. Butler. 2011. Computing giants launch free science metrics. Nature 476, 7358 (Aug 2011), 18.
  • Clauset et al. (2015) Aaron Clauset, Samuel Arbesman, and Daniel B Larremore. 2015. Systematic inequality and hierarchy in faculty hiring networks. Science Advances (2015).
  • Colledge et al. (2010) Lisa Colledge, Félix de Moya-Anegón, Vicente P Guerrero-Bote, Carmen López-Illescas, Henk F Moed, et al. 2010. SJR and SNIP: two new journal metrics in Elsevier’s Scopus. Insights 23, 3 (2010), 215.
  • da Silva and Dobránszki (2018) Jaime A Teixeira da Silva and Judit Dobránszki. 2018. Multiple versions of the h-index: Cautionary use for formal academic purposes. Scientometrics 115, 2 (2018), 1107–1113.
  • da Silva and Memon (2017) Jaime A Teixeira da Silva and Aamir Raoof Memon. 2017. CiteScore: A cite for sore eyes, or a valuable, transparent metric? Scientometrics 111, 1 (2017), 553–556.
  • Education (2018) Times Higher Education. 2018. World University Rankings, Computer Science. https://www.timeshighereducation.com/world-university-rankings/2018/subject-ranking/computer-science.
  • Falagas et al. (2008) Matthew E. Falagas, Vasilios D. Kouranos, Ricardo Arencibia-Jorge, and Drosos E. Karageorgopoulos. 2008. Comparison of SCImago journal rank indicator with journal impact factor. The FASEB Journal 22, 8 (2008), 2623–2628. https://doi.org/10.1096/fj.08-107938 arXiv:https://doi.org/10.1096/fj.08-107938 PMID: 18408168.
  • Fenner (2013) Martin Fenner. 2013. What can article-level metrics do for you? PLoS biology 11, 10 (2013), e1001687.
  • Foundation (2018) National Science Foundation. 2018. Download Awards by Year. https://www.nsf.gov/awardsearch/download.jsp.
  • Freund and Schapire (1997) Yoav Freund and Robert E Schapire. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. System Sci. 55, 1 (1997), 119 – 139. https://doi.org/10.1006/jcss.1997.1504
  • Galiani and Gálvez (2017) Sebastian Galiani and Ramiro H Gálvez. 2017. The life cycle of scholarly articles across fields of research. Technical Report. National Bureau of Economic Research.
  • Geist et al. (1996) Robert Geist, Madhu Chetuparambil, Stephen Hedetniemi, and A. Joe Turner. 1996. Computing Research Programs in the U.S. Commun. ACM 39, 12 (Dec. 1996), 96–99. https://doi.org/10.1145/240483.240505
  • Hirsch (2005) J E Hirsch. 2005. An index to quantify an individual’s scientific research output. Proc. Natl. Acad. Sci. U. S. A. 102, 46 (nov 2005), 16569–16572. https://doi.org/10.1073/pnas.0507655102
  • Huang (2018) Jeff Huang. 2018. Best Paper Awards in Computer Science (since 1996). https://jeffhuang.com/best_paper_awards.html.
  • Huber (1964) Peter J. Huber. 1964. Robust Estimation of a Location Parameter. Ann. Math. Statist. 35, 1 (03 1964), 73–101. https://doi.org/10.1214/aoms/1177703732
  • Institute (2018) Nevada Policy Research Institute. 2018. Transparent California. https://transparentcalifornia.com/agencies/salaries/#university-system.
  • Johnes (2018) Jill Johnes. 2018. University rankings: What do they really show? Scientometrics 115, 1 (01 Apr 2018), 585–606. https://doi.org/10.1007/s11192-018-2666-1
  • Kurtz and Henneken (2017) Michael J. Kurtz and Edwin A. Henneken. 2017. Measuring metrics - a 40-year longitudinal cross-validation of citations, downloads, and peer review in astrophysics. JASIST 68, 3 (2017), 695–708. https://doi.org/10.1002/asi.23689
  • Lehmann et al. (2006) Sune Lehmann, Andrew D Jackson, and Benny E Lautrup. 2006. Measures for measures. Nature 444, 7122 (2006), 1003.
  • Ley (2002) Michael Ley. 2002. The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. In String Processing and Information Retrieval, Alberto H. F. Laender and Arlindo L. Oliveira (Eds.). Springer Berlin Heidelberg, 1–10.
  • Lubienski et al. (2018) Sarah Theule Lubienski, Emily K. Miller, and Evthokia Stephanie Saclarides. 2018. Sex Differences in Doctoral Student Publication Rates. Educational Researcher 47, 1 (2018), 76–81. https://doi.org/10.3102/0013189X17738746
  • Morgan et al. (2018a) Allison C. Morgan, Dimitrios J. Economou, Samuel F. Way, and Aaron Clauset. 2018a. Prestige drives epistemic inequality in the diffusion of scientific ideas. EPJ Data Science 7, 1 (19 Oct 2018), 40. https://doi.org/10.1140/epjds/s13688-018-0166-4
  • Morgan et al. (2018b) Allison C. Morgan, Samuel F. Way, and Aaron Clauset. 2018b. Automatically assembling a full census of an academic field. PLOS ONE 13, 8 (08 2018), 1–18. https://doi.org/10.1371/journal.pone.0202223
  • News and Report (2018) US News and World Report. 2018. Best Computer Science Schools. https://www.usnews.com/best-graduate-schools/top-science-schools/computer-science-rankings.
  • Nocka et al. (2014) Andrew Nocka, Danning Zheng, Tianran Hu, and Jiebo Luo. 2014. Moneyball for Academia: Toward Measuring and Maximizing Faculty Performance and Impact. In 2014 IEEE International Conference on Data Mining Workshop. 193–197. https://doi.org/10.1109/ICDMW.2014.156
  • Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Rankings (2018) QS World University Rankings. 2018. Computer Science & Information Systems. https://www.topuniversities.com/university-rankings/university-subject-rankings/2018/computer-science-information-systems.
  • Rankings (2015) Shanghai Rankings. 2015. Academic Ranking of World Universities in Computer Science. http://www.shanghairanking.com/SubjectCS2015.html.
  • Ren and Taylor (2007) Jie Ren and Richard N Taylor. 2007. Automatic and versatile publications ranking for research institutions and scholars. Commun. ACM 50, 6 (2007), 81–85.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407.
  • Rousseeuw (1987) Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (1987), 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
  • Schauerte (2014) Boris Schauerte. 2014. Microsoft Academic: conference field ratings. http://www.conferenceranks.com/visualization/msar2014.html.
  • Sekercioglu (2008) Cagan H. Sekercioglu. 2008. Quantifying Coauthor Contributions. Science 322, 5900 (2008), 371–371. https://doi.org/10.1126/science.322.5900.371a
  • Shah et al. (2018) Nihar B. Shah, Behzad Tabibian, Krikamol Muandet, Isabelle Guyon, and Ulrike von Luxburg. 2018. Design and Analysis of the NIPS 2016 Review Process. JMLR 19, 49 (2018), 1–34. http://jmlr.org/papers/v19/17-511.html
  • Sill (2010) Joseph Sill. 2010. Improved NBA adjusted +/- using regularization and out-of-sample testing. In MIT Sloan Sports Analytics Conference.
  • Spearman (1904) Charles Spearman. 1904. The proof and measurement of association between two things. The American journal of psychology 15, 1 (1904), 72–101.
  • Tol (2008) Richard S. J. Tol. 2008. A rational, successive g-index applied to economics departments in Ireland. J. Informetrics 2 (2008), 149–155.
  • Valenzuela et al. (2015) Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data.
  • van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
  • Vucetic et al. (2018) Slobodan Vucetic, Ashis Kumar Chanda, Shanshan Zhang, Tian Bai, and Aniruddha Maiti. 2018. Peer Assessment of CS Doctoral Programs Shows Strong Correlation with Faculty Citations. Commun. ACM 61, 9 (Aug. 2018), 70–76. https://doi.org/10.1145/3181854
  • Walters (2017) W. H. Walters. 2017. Citation-Based Journal Rankings: Key Questions, Metrics, and Data Sources. IEEE Access 5 (2017), 22036–22053. https://doi.org/10.1109/ACCESS.2017.2761400
  • Yan and Lee (2007) Su Yan and Dongwon Lee. 2007. Toward Alternative Measures for Ranking Venues: A Case of Database Research Community. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’07). ACM, New York, NY, USA, 235–244. https://doi.org/10.1145/1255175.1255221
  • Yong (2014) Alexander Yong. 2014. Critique of Hirsch’s citation index: A combinatorial Fermi problem. Notices of the AMS 61, 9 (2014), 1040–1050.
  • Zhang and Wu (2018) Fang Zhang and Shengli Wu. 2018. Ranking Scientific Papers and Venues in Heterogeneous Academic Networks by Mutual Reinforcement. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (JCDL ’18). ACM, New York, NY, USA, 127–130. https://doi.org/10.1145/3197026.3197070
  • Zhang (2004) Tong Zhang. 2004. Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. In Proceedings of the Twenty-first International Conference on Machine Learning (ICML ’04). ACM, New York, NY, USA, 116–. https://doi.org/10.1145/1015330.1015332

Appendix A Credit Assignment

In order to address issues of collinearity raised by authors who publish papers together, we wanted to solve a credit assignment problem. We address this by adapting regularized plus-minus (Sill, 2010) from the sports analytics literature: in our case, we simply regress each publication's value onto indicator variables for its authors, as sketched below.

We found that this technique produced scores that correlated highly with the total value scores. Depending on the choice of regularization and loss function, it produced somewhat different rankings. This may be a valuable technique for understanding an individual's contribution, but we were unable to design an empirical test that would demonstrate the fidelity of this approach.
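A minimal sketch of this regression is shown below, using ridge regression as an illustrative choice of regularized model; the function and variable names are ours.

```python
# Sketch of the credit-assignment regression of Appendix A: regress each publication's
# value onto indicator variables for its authors (the adjusted plus-minus analogy).
# Ridge regression is an illustrative choice of regularized model.
import scipy.sparse as sp
from sklearn.linear_model import Ridge

def author_contributions(papers, paper_values, author_index):
    """papers: list of author-name lists, aligned with paper_values (per-paper venue scores)."""
    rows, cols = [], []
    for i, authors in enumerate(papers):
        for a in authors:
            rows.append(i)
            cols.append(author_index[a])
    A = sp.csr_matrix(([1.0] * len(rows), (rows, cols)),
                      shape=(len(papers), len(author_index)))
    model = Ridge(alpha=1.0, fit_intercept=False)
    model.fit(A, paper_values)
    return model.coef_  # one estimated contribution per author
```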

Appendix B Aging Curve

To evaluate whether our model makes sensible predictions over the timescale of a scholar's career, we built a model of what an average academic career looks like, conditioned on the author still publishing in those years. See Figure 8. Our model suggests a rise in productivity over the first 20 years of one's publishing history, followed by a steady decline.

Figure 8. The average productivity of all DBLP authors for that year of their publishing career.