Measuring Scientific Broadness

05/12/2018 ∙ by Tom Price, et al.

Who has not read letters of recommendation that comment on a student's ‘broadness’ and wondered what to make of it? We here propose a way to quantify scientific broadness by a semantic analysis of researchers' publications. We apply our methods to papers on the open-access server arXiv.org and report our findings.


1 Introduction

Most attempts at quantifying scientific output focus on productivity and popularity, measured by the number of papers, the number of papers in journals with high impact factor, media mentions, citation counts, or combinations thereof (for a recent review see [1]). This focus is problematic not only because it creates perverse incentives [2], but also because other criteria fall by the wayside. To identify a suitable candidate for a job opening, or to judge an applicant’s qualifications to lead a project to success, other factors besides productivity and popularity play a role. One of them is the researcher’s breadth of knowledge. In the present work, we propose a simple and efficient way of quantifying this breadth.

Our aim here is not to argue that any particular level of broadness is good or bad. Instead, our point of view is that different tasks call for different amounts of specialization, where here and in the following we will use the word ‘specialization’ to mean the opposite of ‘broadness.’ We also do not wish to suggest that the particular measure of broadness which we will propose in the following is the ‘right’ one. Instead, we merely want to demonstrate that it is a useful measure, and one that captures previously unexplored information.

2 Data

We based the present analysis on papers from the open-access server arXiv, available through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) interface [3]. The data used for this analysis was downloaded through the interface in February 2018. It contains the metadata of 1,358,923 papers. We use the title, abstract, authors’ names, date, and arXiv primary category. Before calculating the broadness values, we also remove all papers with more than 30 authors because we expect collaboration papers to be highly specialized by their nature and thus follow a different distribution. When we analyze the statistical properties of the distribution we further remove all authors with fewer than 20 papers because those researchers have too few publications to be meaningfully associated with a broadness value. The final sample contains 46,772 authors and 1,350,611 papers.

3 Analysis

We analyze the text of the papers in four steps, the details of which will be laid out in the following subsections. In brief, the procedure works like this:

  1. We extract terms from the papers’ titles and abstracts. We collect similar terms, such as “galaxy” and “galaxies”, into clusters which we refer to as “keywords”. We rank each keyword using a combination of how frequently it occurs and the distribution of arXiv primary categories of the papers it appears in. This ranking is based on the assumption that highly generic terms such as “paper” or “demonstrate”, which make poor keywords, will be more evenly distributed among different arXiv categories. We keep the 40,000 highest-ranking keywords.

  2. We create author identifications by matching similar names.

  3. We train a statistical model – latent Dirichlet allocation [4] – for the multiset of keywords used by an author.

  4. Once trained, this model allows us to infer a distribution over latent topics for each author. The broadness of an author is then determined as the Shannon entropy of this distribution over topics.

3.1 Keyword Generation

We extract the keywords from the titles and the abstracts of papers in our sample. While we could use pre-existing classification schemes, such as MSC [5], ACM [6], or PACS [7], this would greatly limit the flexibility of our method. The reader should be warned that what we refer to as “keyword” here is not necessarily a single word, but may be a sequence of words. For example, “dwarf galaxy” or “effective field theory” would each count as one keyword.

For each paper, we first obtain a sequence of sequences of words by the following steps:

  1. We concatenate the title and abstract, with the string “. ” (period and space) in between. We convert the resulting string to lowercase and remove all LaTeX commands.

  2. We obtain a sequence of strings by dividing the string above into contiguous sections. These sections end whenever a period, question mark, open or closed round bracket, open or closed square bracket, semicolon, colon, or comma is encountered.

  3. We break each string in the above sequence of strings into a sequence of words, by dividing it into contiguous sections of characters which contain no whitespace.

We then produce a list which contains all sequences of at most ten words which can be found in the title or abstract of at least 20 papers. We then remove all entries that begin or end with a stopword, i.e. a word like “the” or “a”.
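The tokenization steps and the term filter above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tiny stopword list, and the omission of LaTeX stripping are all assumptions made here. It filters stopword-bounded terms before counting papers rather than after, which selects the same terms.

```python
import re
from collections import Counter

# Delimiters named in the text: . ? ( ) [ ] ; : ,
SECTION_BREAK = re.compile(r"[.?()\[\];:,]")
# Illustrative stopword list only; the source does not give its list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "we"}

def paper_to_word_sequences(title, abstract):
    """Steps 1-3: join, lowercase, split into sections, split into words.
    (Removal of LaTeX commands is omitted in this sketch.)"""
    text = (title + ". " + abstract).lower()
    sections = SECTION_BREAK.split(text)
    return [s.split() for s in sections if s.split()]

def candidate_terms(word_sequences, max_len=10):
    """Contiguous runs of at most max_len words that neither begin nor
    end with a stopword, each reported once per paper."""
    terms = set()
    for words in word_sequences:
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                run = words[i:j]
                if run[0] in STOPWORDS or run[-1] in STOPWORDS:
                    continue
                terms.add(" ".join(run))
    return terms

def frequent_terms(papers, min_papers=20):
    """papers: iterable of (title, abstract). Keep terms found in at
    least min_papers papers."""
    counts = Counter()
    for title, abstract in papers:
        counts.update(candidate_terms(paper_to_word_sequences(title, abstract)))
    return {term for term, n in counts.items() if n >= min_papers}
```

Splitting at every period also breaks decimal numbers such as "0.5" apart; the text does not say whether the original procedure treats these specially.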

Next we convert each entry of the list into a reduced form. We do this by removing “’s” from the end of every word in a keyword (so “Einstein field equations” and “Einstein’s field equations” are the same), removing all diacritics, removing all non-alphanumeric characters, and applying the Porter Stemming Algorithm [8] to each word that is longer than four characters. Then we join the resulting words of each entry together with no whitespace in between. This means that now, for example, “noncompact”, “non-compact”, and “non compact” all have the reduced form “noncompact”, and “galaxies” and “galaxy” both have the reduced form “galaxi”. Having done this, we collect sets of terms with the same reduced form. We will henceforth use the term “keyword” to refer to a set of all terms sharing some common reduced form.
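A sketch of the reduced form may help make the order of operations concrete. The function names are hypothetical, and the stemmer is passed in as a parameter with an identity default; to reproduce the text one would supply a Porter stemmer implementation, which is not included here.

```python
import re
import unicodedata
from collections import defaultdict

def strip_diacritics(word):
    """Decompose characters and drop combining marks (e.g. ö -> o)."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def reduced_form(term, stem=lambda w: w):
    """Reduced form of a term. `stem` stands in for the Porter stemmer;
    the identity default is an assumption of this sketch."""
    parts = []
    for word in term.lower().split():
        word = re.sub(r"['’]s$", "", word)      # drop a trailing 's
        word = strip_diacritics(word)
        word = re.sub(r"[^a-z0-9]", "", word)   # drop non-alphanumerics
        if len(word) > 4:                        # stem only longer words
            word = stem(word)
        parts.append(word)
    return "".join(parts)                        # join without whitespace

def group_terms(terms, stem=lambda w: w):
    """Collect terms sharing a reduced form; each group is one keyword."""
    groups = defaultdict(set)
    for term in terms:
        groups[reduced_form(term, stem)].add(term)
    return dict(groups)
```

With the identity stemmer, "non-compact" and "non compact" already land in the same group; merging "galaxies" with "galaxy" additionally requires a real stemmer.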

We now need to identify the keywords that are most relevant. For this, we define for each paper a list of the keywords that occur in its title or abstract, and a probability distribution on keyword occurrences. By a keyword occurrence, we mean specifically a triple consisting of an author, a paper containing that author among its list of coauthors, and an occurrence of a keyword in that paper, or more specifically, an entry of that paper’s keyword list. We give the details of the definitions of the keyword lists and of the probability distribution in Appendix B.

Consider a keyword occurrence randomly selected according to this probability distribution. The prior category distribution gives, for each arXiv primary category, the probability that the paper of the selected occurrence has that primary category. For each keyword, the keyword probability is the probability that the selected occurrence has that keyword, and the posterior category distribution gives the probability of each primary category given that the occurrence has that keyword.

We can then define the rank of a keyword using the Kullback-Leibler divergence between the posterior and prior distributions over arXiv categories, as well as the keyword probability and a manually chosen constant:

(1)

The effect of the constant can be roughly summarized as follows: with higher values of the constant, greater precedence is given to commonly used terms. We have found that a value of roughly 3 divided by the number of authors in the unfiltered set gives good results, and this value has been used for the following analysis. We keep the 40,000 keywords that have the highest rank.
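The divergence ingredient of the rank can be illustrated with a short sketch. Two assumptions are made here: keyword occurrences are weighted uniformly (the actual distribution is defined in Appendix B), and the function names are hypothetical. The sketch returns the keyword probability and the divergence separately and does not attempt to combine them into the rank of Eq. (1).

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """D_KL(p || q) in nats for dicts mapping outcome -> probability."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

def category_divergences(occurrences):
    """occurrences: iterable of (keyword, category) pairs, each counted
    with equal weight (an assumption of this sketch).
    Returns {keyword: (keyword probability, D_KL(posterior || prior))}."""
    occurrences = list(occurrences)
    total = len(occurrences)
    prior_counts = Counter(cat for _, cat in occurrences)
    prior = {cat: n / total for cat, n in prior_counts.items()}
    by_keyword = {}
    for keyword, cat in occurrences:
        by_keyword.setdefault(keyword, Counter())[cat] += 1
    result = {}
    for keyword, cats in by_keyword.items():
        n_kw = sum(cats.values())
        posterior = {cat: n / n_kw for cat, n in cats.items()}
        result[keyword] = (n_kw / total, kl_divergence(posterior, prior))
    return result
```

A generic term spread over categories in proportion to the prior gets a divergence near zero, while a term confined to one category gets a large one, which is the behaviour the ranking relies on.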

Since a keyword then refers to a set of similar terms (such as “galaxies” and “galaxy”) rather than a single one, we use the keyword’s most probable form (as determined by the probability distribution on keyword occurrences mentioned above) as its representative.

This completes the generation of the keyword list.

3.2 Author Identification

Author names tend to appear in a variety of different forms. For example, a middle name may be included or omitted. A name might be given in full, or only as an initial. There might be an inconsistency in the usage of diacritics. We therefore use the following procedure to collect names which likely refer to the same person.

Note that, in the following process, each name must consist of at least two words. We ignore all authors whose name, as given by the arXiv data, consists of only one word.

First, we normalize each name by removing all periods and commas, converting it to lower case, and removing all diacritics. We then work with the set of all normalized names encountered.

Next, we create a binary relation that measures the compatibility of two name parts, where a name part is either a first, middle, or last name but not a combination thereof. We call two name parts compatible if they are identical or one is just the initial of the other.

Using this, we define another relation for two full normalized names, each composed of name parts. Two full names are compatible if their last names are identical, their first names are compatible as name parts, and at least one of the following two conditions holds:

  1. At least one of the two names has no middle names given.

  2. Each name has the same number of middle names given, and each middle name from one is compatible with the corresponding middle name from the other.

Compatibility of full names is not an equivalence relation: it is reflexive and symmetric, but not transitive. However, we can create an equivalence relation from it. We start by defining the new relation as equal to compatibility of full names. Then, whenever transitivity fails, say a name B is compatible with both A and C while A and C are not compatible with each other, we remove every relation between B and any other name. In other words, every name that is in the middle of some failure of transitivity loses all its neighbors. The resulting relation must then be an equivalence relation, and its equivalence classes are what we will use as author identifiers.
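The matching procedure can be sketched as follows, assuming names are already normalized and given as tuples of name parts (first, optional middle names, last). The function names and this representation are choices made for the illustration, not taken from the authors' implementation.

```python
from itertools import combinations

def parts_compatible(x, y):
    """Name parts are compatible if identical or one is the other's initial."""
    return (x == y
            or (len(x) == 1 and y.startswith(x))
            or (len(y) == 1 and x.startswith(y)))

def names_compatible(a, b):
    """a, b: tuples of normalized name parts with at least two entries."""
    if a[-1] != b[-1] or not parts_compatible(a[0], b[0]):
        return False
    mid_a, mid_b = a[1:-1], b[1:-1]
    if not mid_a or not mid_b:          # condition 1: a name lacks middle names
        return True
    return (len(mid_a) == len(mid_b)    # condition 2: middle names pair up
            and all(map(parts_compatible, mid_a, mid_b)))

def author_classes(names):
    """Equivalence classes after isolating every name that sits in the
    middle of a failure of transitivity."""
    names = list(dict.fromkeys(names))  # drop duplicates, keep order
    neighbors = {n: {m for m in names if m != n and names_compatible(n, m)}
                 for n in names}
    ambiguous = {b for b, nb in neighbors.items()
                 if any(y not in neighbors[x] for x, y in combinations(nb, 2))}
    classes, seen = [], set()
    for n in names:
        if n in seen:
            continue
        cls = {n} if n in ambiguous else {n} | (neighbors[n] - ambiguous)
        classes.append(frozenset(cls))
        seen |= cls
    return classes
```

For example, "j smith" is compatible with both "john smith" and "jane smith", which are not compatible with each other, so all three end up in separate classes. The pairwise comparison here is quadratic in the number of names; at the scale of arXiv one would first group names by last name.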

The author identification leaves us with 664,057 authors.

3.3 Creating the LDA model

Latent Dirichlet Allocation (LDA) is a way of generating a probability distribution for a collection of documents. Here, a document is a sequence of words, and a word is an element of a finite set referred to as ‘the vocabulary’. LDA works by representing the documents as mixtures of ‘latent’ topics, and then characterizes these topics by a distribution over words [4].

To apply LDA to the set of arXiv authors, we take the vocabulary to be our list of 40,000 keywords. Each ‘document’ corresponds to an author, and the sequence of words within each document is the sequence of keywords used in all of the titles and abstracts of that author’s papers. To determine the keywords used in each paper and their multiplicities, we use the procedure described in Appendix B for creating the restricted keyword lists. These are similar to the keyword lists used in Section 3.1, but for the restricted set of the 40,000 highest-ranking keywords.

For training the model, we used the Python library Gensim [9]. This library uses a training algorithm based on the one described in [10]. We used 50 topics, 5 passes, and set the alpha and eta parameters to auto.
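The settings above map onto Gensim roughly as in the following sketch. `Dictionary`, `doc2bow`, and `LdaModel` with the keyword arguments shown are part of Gensim's API; the function name `train_author_lda` and the input format are assumptions of this illustration, and any settings not mentioned in the text are left at Gensim's defaults.

```python
# Settings stated in the text: 50 topics, 5 passes, alpha and eta set to auto.
LDA_SETTINGS = {"num_topics": 50, "passes": 5, "alpha": "auto", "eta": "auto"}

def train_author_lda(author_keywords):
    """author_keywords: list of keyword lists, one list per author
    (each author is one LDA 'document'). Requires gensim."""
    # Imported lazily so that the settings can be inspected without gensim.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(author_keywords)            # keyword <-> id map
    corpus = [dictionary.doc2bow(doc) for doc in author_keywords]
    model = LdaModel(corpus=corpus, id2word=dictionary, **LDA_SETTINGS)
    return model, dictionary
```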

Even though many improvements to LDA have been proposed [11, 12, 13, 14], we decided to use standard LDA because it is widely recognized and implementations exist in popular open-source libraries.

3.4 Measuring Broadness

With the trained LDA model, we can compute the joint probability density of a probability distribution over latent topics, a sequence of topics, and a sequence of keywords. In principle, given the sequence of keywords used by an author, we can obtain a single probability distribution over topics for this author by taking the expected value of the distribution over latent topics given that keyword sequence.

Computing this value is intractable in general [4, Section 5.1]. However, it is possible to compute an approximation to the posterior distribution given the keyword sequence. More specifically, we can choose the approximation to be the probability distribution which minimizes the Kullback-Leibler divergence with the posterior among all probability distributions in a certain family. The details on the definition of this family of probability distributions, and an iterative algorithm for computing the approximation, are given in [4, Section 5.2].

We therefore modify the above definition of the topic distribution of an author in order to make it computationally tractable: instead of taking the expected value of the distribution over latent topics according to the true posterior, we take it according to the approximating distribution. Under the approximation, the marginal distribution of the distribution over latent topics is a Dirichlet distribution, for which there exists a simple explicit formula for the expected value. This operation of determining the approximation and taking the expected value is performed by the Gensim function LdaModel.get_document_topics. We set the minimum_probability parameter to 0 and kept the default values of all other optional parameters.

For assigning a topic distribution to an author, we use only their papers with at most 30 coauthors. This is to avoid measuring an author as extremely specialized because they have many papers with a single highly specialized collaboration. We don’t apply this filter at any prior stage of the analysis.

We assume that broader authors will have a less predictable topic distribution. The unpredictability of a distribution can be quantified by the Shannon entropy [15]. We therefore define the broadness of an author to be the Shannon entropy of their topic distribution.
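As a sketch, the broadness computation reduces to a few lines once a trained model is available. The helper names are hypothetical, `model` and `dictionary` are assumed to be a Gensim `LdaModel` and `Dictionary`, and the use of the natural logarithm is an assumption, since the text does not state the base.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def author_broadness(model, dictionary, keywords):
    """Entropy of the inferred topic distribution for one author.
    `keywords` is the author's keyword sequence."""
    bow = dictionary.doc2bow(keywords)
    topics = model.get_document_topics(bow, minimum_probability=0)
    return shannon_entropy(p for _, p in topics)
```

With 50 topics and natural logarithms, broadness ranges from 0 (all weight on one topic) to ln 50 ≈ 3.91 (uniform over topics).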

4 Validity

In this section, we consider the question of whether latent topic entropy is a valid measurement of scientific broadness. To give an affirmative answer to this, we would need to discuss what is meant specifically by “scientific broadness”, for example, by constructing a nomological network [16]. We won’t attempt that in this paper. However, we will take steps in the same direction by showing that latent topic entropy has some properties that we would expect a valid measurement of scientific broadness to have, for most reasonable interpretations of “scientific broadness”.

4.1 Correlations with other broadness metrics

One way to test whether latent topic entropy qualifies as a measure of scientific broadness is by checking the correlation between latent topic entropy and simpler, more direct measurements. To this end, we have measured the correlation between latent topic entropy and two other metrics based on the arXiv primary categories of an author’s papers. (The details on these two metrics are given in Appendix A.)

The first alternative metric, arXiv category entropy, measures how unpredictable the arXiv categories of an author’s papers are. The second, which is a measurement of specialization rather than broadness (that is, it should be lower rather than higher for broader authors), measures how different the category distribution of an author’s papers is from the average category distribution of all authors. The correlations are 0.45 and -0.195 respectively, which both have the sign one would expect if these are valid measurements of broadness and specialization, respectively.

4.2 Typical keywords of latent topics

Interpreting latent topic entropy as a measure of scientific broadness requires the assumption that the latent topics discovered by LDA correspond to distinct scientific topics, instead of being, for example, random distributions of unrelated words. We provide a list of the 20 most common keywords of each latent topic so that the reader can see that we have reason to think this assumption holds true (see http://lostinmathbook.com/topic%20keywords.txt).

4.3 Consistency

Even though we use an author’s papers in order to determine their latent topic entropy, our intention is to measure an intrinsic property of the author’s research style, not a property of a particular set of papers. Hence, if latent topic entropy is a valid measurement of scientific broadness, different subsets of an author’s papers should tend to give similar measurements of latent topic entropy.

We have tested this hypothesis by measuring the correlation between two different latent topic entropy values for each author with at least 40 papers. The first measurement uses a random half of the author’s papers (rounded down), and the second measurement uses the remaining papers. We measured a Pearson’s r of 0.94 between the two broadness values, indicating that our broadness metric is not very sensitive to the specific set of papers used to compute broadness, as we would hope.
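The split-half test can be sketched as below. The names are hypothetical, and `broadness_of` stands for whatever function maps a set of papers to a latent topic entropy; it is left as a parameter because it depends on the trained model.

```python
import math
import random

def split_half(papers, rng):
    """Random half (rounded down) and the remaining papers."""
    shuffled = list(papers)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def split_half_consistency(papers_by_author, broadness_of, min_papers=40, seed=0):
    """Correlate the broadness of the two halves across eligible authors."""
    rng = random.Random(seed)
    first, second = [], []
    for papers in papers_by_author.values():
        if len(papers) < min_papers:
            continue
        a, b = split_half(papers, rng)
        first.append(broadness_of(a))
        second.append(broadness_of(b))
    return pearson_r(first, second)
```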

5 Results

In this section, we restrict our attention to authors who have at least 20 papers with no more than 30 coauthors. This is so that we have sufficient data to get a meaningful estimate of their broadness.

5.1 Total Population

In Figure 1 we depict the distribution of values of broadness over authors together with a Gaussian fit. The data has a mean value of 1.584 and standard deviation of 0.500. It is close to normal, with a skewness of 0.132 and an excess kurtosis of -0.058.

Figure 1: Broadness distribution over authors (blue). Mean value: 1.584, standard deviation: 0.500, total number of authors: 46,772. Normal distribution shown in red.

We note as an aside that if one does not remove papers with more than 30 authors (i.e., keeps papers of large collaborations), the broadness distribution has a second mode (not shown) which peaks at low broadness. This second mode consists mainly of authors whose papers are mostly with a highly specialized collaboration such as LHC-b or LIGO/VIRGO.

5.2 ArXiv Categories

Next we look at authors that are primarily associated with a certain arXiv category, where we identify an author with a category if it is the primary category of at least 60% of their papers. Because of the low statistics, we omit categories with fewer than 100 associated authors. In Table 1 we list the broadest categories and in Table 2 we list the least broad categories. The complete list can be downloaded online (fias.uni-frankfurt.de/~hossi/Physics/author_category.txt).

It is instructive to compare these results to the findings of [17], which studied (among other things) the frequency with which papers in a subfield of physics reference the same subfield. In [17] it was found that nuclear physics, astrophysics, the physics of elementary particles and fields, and plasma physics have the highest ratio of self-citations. For the first three of these, there is a tendency for the associated arXiv categories to have low broadness, especially when measured by mean paper broadness. Plasma physics, however, we find to have a high broadness.

One possible reason for this discrepancy is that [17] did not use the arXiv categories, so what they refer to as ‘plasma physics’ is not identical to the category we refer to. Another reason is that broadness simply measures a different property from the frequency of self-citations. A category can be broad because its concepts are commonly used in other categories as well. This may or may not mean that people who primarily work in this category commonly refer to papers outside their discipline.

category              # authors   mean    standard deviation
physics.plasm-ph      106         1.927   0.331
math.NA               113         1.880   0.306
cond-mat.stat-mech    354         1.870   0.332
math.PR               458         1.787   0.305
math-ph               181         1.771   0.324
cond-mat.soft         281         1.760   0.243
physics.atom-ph       164         1.734   0.269
physics.optics        231         1.723   0.347
quant-ph              1714        1.719   0.340
cond-mat.mes-hall     1043        1.646   0.285
Table 1: ArXiv categories with the highest mean author broadness.

category              # authors   mean    standard deviation
math.GT               159         1.293   0.276
cond-mat.str-el       912         1.219   0.290
nucl-ex               350         1.215   0.441
math.OA               109         1.192   0.292
math.GR               120         1.180   0.296
astro-ph.CO           406         1.179   0.378
hep-th                1930        1.162   0.391
math.AG               407         1.040   0.259
math.RT               115         1.008   0.327
astro-ph.GA           476         0.920   0.409
Table 2: ArXiv categories with the lowest mean author broadness.

Using our trained LDA model, we can also associate a broadness value to a paper in a similar way as for authors. We treat each paper as an LDA document whose sequence of words is given by the paper’s keyword list (defined precisely in Appendix B) of keywords appearing in the title and abstract. We have calculated the broadness value of each arXiv category as the average broadness of the papers that have this category as their primary category. We omit categories with fewer than 100 associated papers. Since this is a much less restrictive criterion than having at least 100 associated authors, smaller arXiv categories are better represented here. The results are displayed in Tables 3 and 4. The complete list can be downloaded online (fias.uni-frankfurt.de/~hossi/Physics/paper_category.txt).

category              # papers    mean    standard deviation
physics.pop-ph        781         2.084   0.343
math.HO               1568        2.033   0.400
physics.hist-ph       1830        2.012   0.343
physics.med-ph        1429        1.977   0.290
nlin.CG               360         1.962   0.302
q-bio.OT              406         1.951   0.307
physics.data-an       2308        1.944   0.353
physics.class-ph      3100        1.939   0.352
patt-sol              542         1.937   0.284
physics.geo-ph        1724        1.930   0.342
Table 3: ArXiv categories with the highest mean paper broadness.

category              # papers    mean    standard deviation
nucl-ex               7551        1.414   0.438
math.RT               9178        1.406   0.419
astro-ph.EP           9758        1.400   0.430
cond-mat.str-el       32240       1.373   0.390
astro-ph.HE           18998       1.366   0.408
astro-ph.SR           25478       1.366   0.418
astro-ph              93615       1.363   0.442
math.KT               1689        1.330   0.401
astro-ph.CO           25986       1.258   0.456
astro-ph.GA           20852       1.160   0.447
Table 4: ArXiv categories with the lowest mean paper broadness.

Note that the standard deviations quoted in Tables 1 and 2 are for the distribution in each category. The values do not quantify the deviation of each category’s mean value from that of the entire sample.

For both the mean author broadness and mean paper broadness, applying a one-way ANOVA F-test yields an undetectably small p-value, showing that the differences between categories are exceedingly unlikely to be random fluctuations.

5.3 Country broadness

We next quantify the typical broadness per country as the mean broadness of authors in that country. We used the following procedure to associate countries with authors. First, we used arXiv’s bulk pdf access [18] to download pdf files of arXiv papers up to January 2018. We used Grobid [19] to extract the countries of authors from the affiliation data provided in these pdf files. We associated a country with an author if a country was extracted by Grobid for this author in at least one paper, and all countries extracted by Grobid for this author were the same. To get meaningful statistical values, we do not consider countries which have fewer than 100 associated authors. The results are displayed in Table 5 and in Figure 2. The total number of authors here is smaller because we were not able to link each author to a country, and authors who are linked to countries with fewer than 100 authors in total are not represented.
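The country-assignment rule can be sketched as follows. This is a hypothetical illustration: the function names and input format are assumptions, and the Grobid extraction step itself is not shown.

```python
def author_country(extracted_countries):
    """extracted_countries: the countries extracted for one author, one
    entry per paper in which extraction succeeded. A country is assigned
    only if at least one was extracted and all extractions agree."""
    unique = set(extracted_countries)
    return unique.pop() if len(unique) == 1 else None

def mean_broadness_by_country(authors, min_authors=100):
    """authors: iterable of (broadness, extracted_countries) pairs.
    Countries with fewer than min_authors associated authors are dropped."""
    groups = {}
    for broadness, countries in authors:
        country = author_country(countries)
        if country is not None:
            groups.setdefault(country, []).append(broadness)
    return {country: sum(values) / len(values)
            for country, values in groups.items()
            if len(values) >= min_authors}
```

Requiring all extractions to agree excludes authors who moved between countries, which is one reason the number of authors with an assigned country is smaller than the full sample.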

Country                      # authors   mean    standard deviation
Israel                       281         1.745   0.436
Austria                      127         1.705   0.398
China                        998         1.639   0.496
France                       1409        1.634   0.459
Netherlands                  204         1.624   0.462
India                        450         1.619   0.473
Belgium                      142         1.610   0.428
Hungary                      107         1.609   0.459
Italy                        1195        1.600   0.482
Australia                    320         1.599   0.475
Poland                       301         1.595   0.428
Russian Federation           554         1.593   0.446
Brazil                       382         1.590   0.455
Switzerland                  197         1.590   0.425
Germany                      1397        1.583   0.451
United States                5411        1.578   0.474
Canada                       378         1.570   0.456
UK and Northern Ireland      1006        1.568   0.463
Sweden                       151         1.560   0.472
Spain                        459         1.556   0.489
Japan                        1370        1.482   0.462
Iran, Islamic Republic of    116         1.430   0.545
Korea, Republic of           181         1.404   0.438
Table 5: Mean broadness by country.

Figure 2: Mean broadness by country.

Applying a one-way ANOVA F-test yields a p-value small enough to show that the differences between countries are exceedingly unlikely to be random fluctuations.

We further looked at the correlation between our measure of broadness and the Nature Index [20]. For this we used the weighted fractional count (physical sciences only). By their Pearson coefficient, the two measures are uncorrelated.

5.4 Gender, career-termination, and h-index

We matched author names with the lists of common female and male names from the 1990 United States Census [21] to identify the gender of an author where possible. This way we were able to identify 6,295 likely male and 3,502 likely female authors. (We want to remind the reader that this sample only includes authors with at least 20 papers.) We find small differences in the mean values and variances of these distributions, but the results are not consistent for the four measures of broadness we have tried (see Appendix A). We thus conclude that either the gender differences are insignificant or our present methods do not allow us to resolve them.

Next, we analyzed our sample for a correlation between broadness and sudden career terminations. An author is in the terminated-career set if there exists an active period of 10 years in which they have published at least 10 papers, immediately followed by an inactive period of at least 10 years, extending until the time the data was collected, during which at most 3 papers were published. This is in addition to the usual criterion that they have at least 20 papers with at most 30 coauthors. Our sample contains a total of 1,672 authors with such terminated careers.
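The terminated-career criterion can be made concrete with a short sketch. It assumes publication dates at whole-year granularity and a hypothetical function name; the text does not specify how period boundaries are handled, and the separate 20-paper requirement is not checked here.

```python
def has_terminated_career(years, collected, active_len=10, min_active=10,
                          inactive_len=10, max_inactive=3):
    """years: publication years of one author's papers (integers).
    collected: year in which the data was collected.
    True if some `start` year splits the record into an active window
    [start - active_len, start) with at least min_active papers and an
    inactive period [start, collected] of at least inactive_len years
    with at most max_inactive papers."""
    if not years:
        return False
    for start in range(min(years) + 1, collected - inactive_len + 1):
        active = sum(start - active_len <= y < start for y in years)
        inactive = sum(y >= start for y in years)
        if active >= min_active and inactive <= max_inactive:
            return True
    return False
```

An author with twenty papers up to 1995 and none afterwards qualifies for data collected in 2018, whereas an author who published only in 2010 does not, because fewer than ten years have passed.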

We found that, in the terminated-career set, the mean broadness was 1.483 and the standard deviation was 0.469. We remind the reader that the mean broadness of the whole sample is slightly greater at 1.584, and the standard deviation of the whole sample is 0.500. According to Welch’s t-test, this is a statistically significant difference in broadness between groups. For all other broadness metrics we investigated (see Appendix A) we also observed that the terminated-career authors were more specialized. The largest p-value was obtained with the arXiv category Kullback-Leibler divergence metric.

Therefore, from our analysis, it appears that sudden career terminations are associated with specialized authors. Although the size of the effect on the mean broadness is small, the difference between the means is highly significant.

We further computed an h-index value for each author using the arXiv citation data published by Paperscape [22]. We used the data published in May 2016, which includes citation data up to 2015. Note that the h-index value we computed is not necessarily the same as the author’s true h-index, because the author may not have all their papers on the arXiv.

We found a slight negative Pearson correlation between h-index and broadness. With all other broadness metrics we tried, we also found a slight negative correlation between h-index and broadness, except for the arXiv category Kullback-Leibler divergence metric. We suggest a possible explanation for this anomaly in Appendix A.3. Therefore, from this analysis, it appears that there may be a weak positive correlation between specialization and h-index or, equivalently, a weak negative correlation between broadness and h-index.

5.5 Keyword Broadness

We can also associate a broadness value to each keyword. For this, we use a probability distribution on the restricted keyword occurrences. This is analogous to the distribution on keyword occurrences used in Section 3.1, but it uses the restricted set of 40,000 highest-ranking keywords, the restricted set of papers with at most 30 coauthors, and the restricted set of authors with at least 20 papers in the restricted set. The details on this are given in Appendix B.

We can use this restricted distribution to define a broadness value for each keyword: the broadness of a keyword is the expected value of the broadness of the author of a randomly selected keyword occurrence, given that the occurrence has that keyword.
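The conditional expectation above amounts to a weighted average of author broadness over the occurrences of a keyword. The sketch below assumes each occurrence carries an explicit weight proportional to its probability; the function name and input format are hypothetical, and the actual weights are those defined in Appendix B.

```python
def keyword_broadness(occurrences, author_broadness):
    """occurrences: iterable of (author, keyword, weight), with weight
    proportional to the probability of that keyword occurrence.
    author_broadness: dict mapping author -> broadness.
    Returns the expected author broadness conditional on each keyword."""
    weighted_sum, total_weight = {}, {}
    for author, keyword, weight in occurrences:
        weighted_sum[keyword] = (weighted_sum.get(keyword, 0.0)
                                 + weight * author_broadness[author])
        total_weight[keyword] = total_weight.get(keyword, 0.0) + weight
    return {keyword: weighted_sum[keyword] / total_weight[keyword]
            for keyword in weighted_sum if total_weight[keyword] > 0}
```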

In Table 6, we list the top ten and bottom ten keywords, subject to the additional restriction that they occur with at least a minimum probability (about 10 divided by the size of the restricted set of authors) according to this distribution. A complete list can be downloaded online (fias.uni-frankfurt.de/~hossi/Physics/keywords.txt).

We note that the keyword broadness fits well with the category broadness (Tables 1 and 2) in that the most specialized keywords are typical for the astro-ph.X categories and the broadest keywords are typical for many-particle systems found in numerical (math.NA) or probabilistic studies (math.PR) or cond-mat.X applications thereof.

      broadest           most specialized
1     agents             molecular gas
2     chaos              z = 0
3     synchronization    star-forming
4     chaotic            star formation rate
5     fractal            early-type galaxies
6     sensors            stellar mass
7     network            z2
8     memory             SFR
9     logic              z1
10    percolation        star-forming galaxies
Table 6: Keyword broadness

6 Conclusion

We have proposed and analyzed a new measure to quantify and aggregate research activity whose purpose is to capture the breadth of a scientist’s publications, or their specialization, respectively. We have found that broadness has little correlation with the h-index (of individual authors) or the Nature Index (of countries), suggesting that it captures previously unused information. While we do not think that the specific way of measuring broadness put forward here is the only correct one, we wish to suggest that broadness is a valuable indicator, in particular for nations, institutions, or individuals that strive to improve their interdisciplinary research.

Acknowledgements

We thank Tobias Mistele for helpful communication. This work was made possible through support by the Foundational Questions Institute (FQXi).

Appendix A: Other Measures

We tried some other ways to measure broadness before settling on the latent topic Shannon entropy used in the main text. For completeness, we here list other methods that we investigated.

1. Kullback-Leibler Divergence

Instead of measuring the broadness of an author as the entropy of their topic distribution, we could measure it using the Kullback-Leibler divergence of their topic distribution (as the left argument) with the average topic distribution of all authors (as the right argument). We can interpret authors for whom this KL-divergence is low as being broader, and authors for whom it is high as more specialized. The justification for this interpretation is the assumption that a maximally broad author should have a topic distribution equal to the average topic distribution, and so this divergence measures how different the author’s topic distribution is from that of a maximally broad author.

Note that we can’t use the divergence with the two arguments exchanged to define a broadness metric: the Kullback-Leibler divergence is only well-defined if all events that have a probability of 0 according to the right distribution also have a probability of 0 according to the left. That is not the case here, since small probabilities in the computed topic distributions often become rounded to 0.

Note that, in general, the entropy of a distribution is linearly related to its Kullback-Leibler divergence with the uniform distribution on the same underlying sample space. From this perspective, we can see that this broadness metric based on Kullback-Leibler divergence is closely related to the main one. The only difference is that the main metric assumes that a perfectly broad author has a uniform topic distribution, while this one assumes that a perfectly broad author has a topic distribution equal to the average topic distribution.
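The linear relation can be written out explicitly. The symbols here are introduced only for this illustration: $p$ is a distribution on $n$ outcomes (for the latent topics, $n = 50$) and $u$ is the uniform distribution with $u_i = 1/n$.

```latex
D_{\mathrm{KL}}(p \,\|\, u)
  = \sum_{i=1}^{n} p_i \log \frac{p_i}{1/n}
  = \log n + \sum_{i=1}^{n} p_i \log p_i
  = \log n - H(p),
\qquad\text{so}\qquad
H(p) = \log n - D_{\mathrm{KL}}(p \,\|\, u).
```

Entropy is therefore maximal, at $\log n$, exactly when the divergence from the uniform distribution vanishes.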

2. ArXiv Primary Categories

Instead of measuring an author’s broadness using their latent topic distribution, we may use distributions derived from the arXiv primary categories of their papers.

Suppose that the arXiv primary categories of the papers of an author are sampled from an ideal category distribution for that author, which can be estimated based on the observed categories of this author’s papers, but cannot be known. An estimator of the entropy of this ideal distribution may be interpreted as a measurement of the author’s broadness. Taking the entropy of the maximum-likelihood estimate of the ideal distribution (that is, the distribution where the probability of a category is proportional to the number of times it was used in all of the author’s papers) is known to be a negatively biased estimator of the true entropy, with the bias becoming less severe as the sample size increases [23]. For example, no matter how broad an author’s interests are, if they only have a single paper on arXiv, we will always estimate their category entropy as 0, since every paper of that author is in the same category.

Because of this, we estimated the category entropy of an author by taking a random sample of 20 of their papers without replacement (recall that we restrict our attention to authors with at least 20 papers, so this is always possible), and taking the entropy of the primary category distribution of these 20 papers. This increases the magnitude of the bias of our entropy estimator in most cases, but it becomes more consistent between authors with different numbers of papers, so we avoid systematically measuring a higher broadness value for authors with more papers.
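A sketch of this fixed-sample-size estimator is given below. The function name is hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math
import random
from collections import Counter

def sampled_category_entropy(categories, sample_size=20, rng=None):
    """Entropy (in nats) of the empirical primary-category distribution
    of sample_size papers drawn without replacement.
    categories: one primary category per paper of the author."""
    rng = rng or random.Random()
    # random.sample raises ValueError if there are fewer papers than
    # sample_size, matching the restriction to authors with >= 20 papers.
    sample = rng.sample(list(categories), sample_size)
    counts = Counter(sample)
    return -sum((n / sample_size) * math.log(n / sample_size)
                for n in counts.values())
```

Because every author is measured from the same number of papers, the downward bias of the estimator is comparable across authors, at the cost of discarding information for prolific authors.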

Similarly, we examined another broadness metric obtained by taking the Kullback-Leibler divergence of the category distribution of a 20-paper subset of an author’s papers with the average category distribution of all authors. Note that, like the latent topic Kullback-Leibler divergence metric, this is really a measure of specialization since it should decrease for broader authors.

3. A comment on h-index correlations

For all the metrics mentioned above, we also measured the correlation with h-index, as in section 4.4. We found, for all but the arXiv category Kullback-Leibler divergence metric (henceforth referred to as cat-KLD), a slight negative correlation between broadness and h-index, in agreement with section 4.4. We offer here a possible explanation for why the arXiv category Kullback-Leibler divergence disagreed with the others.

Let $p$ be the arXiv category distribution of the 20 randomly selected papers of some author used to compute their cat-KLD. Let $q$ be the average arXiv category distribution among all authors. The cat-KLD metric for the author is then given by

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p).$$

Here, $H(p, q) = -\sum_c p(c) \log q(c)$ is the cross-entropy between $p$ and $q$, and $H(p) = -\sum_c p(c) \log p(c)$ is the entropy of $p$.
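The decomposition of the Kullback-Leibler divergence into a cross-entropy term minus an entropy term can be checked numerically. The following Python sketch is only an illustration: the two-category average distribution is invented toy data, the names `p` (author distribution) and `q` (average distribution) are chosen here, and it assumes `q` is positive wherever `p` is.

```python
import math


def entropy(p):
    """Entropy (in nats) of a distribution given as {category: probability}."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy between p and q; assumes q[c] > 0 wherever p[c] > 0."""
    return -sum(pi * math.log(q[c]) for c, pi in p.items() if pi > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q, computed directly."""
    return sum(pi * math.log(pi / q[c]) for c, pi in p.items() if pi > 0)


# Toy average distribution: one very active and one much less active category.
q = {"active": 0.9, "quiet": 0.1}

# Two equally specialized authors (entropy 0), each in a single category.
p_active = {"active": 1.0}
p_quiet = {"quiet": 1.0}
```

In this toy setting both authors have zero entropy, so the whole difference in divergence comes from the cross-entropy term: the author in the less active category receives the larger value, log(10) compared with log(1/0.9).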

The cross-entropy can be interpreted as a measure of how much the author tends to publish in less active arXiv categories. The cat-KLD metric will therefore tend to measure authors as more specialized if they publish in less active arXiv categories. This could explain why it correlates negatively with h-index, in contrast with the other metrics (recall that cat-KLD is a measure of specialization and not broadness): authors with high cat-KLD could be receiving fewer citations because they tend to publish in less active categories, where there are fewer authors who might cite their work.

As for why the latent topic KLD metric does not show the same reversed correlation with h-index: while the arXiv categories vary in size by orders of magnitude, the latent topics have relatively consistent average probabilities. The cross-entropy term therefore has much less significance in this case.

Appendix B: Details on keyword occurrences, their distribution, and keyword lists

In section 3.1, we describe the rank of a keyword, which quantifies, roughly, a combination of how common the keyword is and how much information it gives about the topic of the paper in question. Our procedure for determining the rank of a keyword depends on the probability distribution on keyword occurrences. By a keyword occurrence, we mean specifically a triple consisting of an author, a paper containing that author among its list of coauthors, and an occurrence of a keyword in that paper, or more specifically, an entry of the paper's keyword list. Here, the keyword list of a paper is a list, possibly with repetition, of the keywords occurring in that paper.

We define the probability of a keyword occurrence using the following process:

  1. Choose an author uniformly at random.

  2. Choose one of this author’s papers uniformly at random.

  3. Choose an entry of that paper's keyword list uniformly at random.

The probability of a keyword occurrence is then the probability of choosing that author, paper, and keyword list entry in this process.
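The three-step process above can be sketched in Python as follows. The data layout (two dictionaries) and all names are hypothetical choices for this illustration, and the sketch assumes every paper has a non-empty keyword list, which the text does not state.

```python
import random
from fractions import Fraction


def sample_keyword_occurrence(papers_by_author, keywords_by_paper, rng):
    """Draw one (author, paper, entry index) triple by the three-step process.

    papers_by_author: {author: [paper ids]} (hypothetical layout)
    keywords_by_paper: {paper id: [keywords, possibly repeated]}
    """
    author = rng.choice(sorted(papers_by_author))         # step 1
    paper = rng.choice(papers_by_author[author])           # step 2
    index = rng.randrange(len(keywords_by_paper[paper]))   # step 3
    return author, paper, index


def occurrence_probability(author, paper, papers_by_author, keywords_by_paper):
    """Exact probability of any single keyword occurrence for (author, paper)."""
    if paper not in papers_by_author[author]:
        return Fraction(0)
    return Fraction(
        1,
        len(papers_by_author)
        * len(papers_by_author[author])
        * len(keywords_by_paper[paper]),
    )
```

Under this process, every author carries the same total weight regardless of how many papers they have written, and every paper of a given author carries the same weight regardless of how many keywords it contains.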

It remains to give a precise definition of the keyword list of a paper. For this, we use the following procedure:

  1. Start with the sequence of sequences of words associated with the paper, as described at the beginning of section 3.1. Initialize the keyword list as an empty list.

  2. Perform the remaining steps for each nonempty sequence of words in this sequence of sequences.

  3. If the sequence begins with a keyword, remove the longest possible keyword from the beginning of the sequence (keep in mind that a keyword may contain more than one word, and may contain prefixes that are distinct keywords, such as “black hole evaporation” and “black hole”). Add the removed keyword to the keyword list. If the sequence does not begin with a keyword, remove a single word from the beginning.

  4. Repeat the previous step until the sequence is empty.
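The procedure above amounts to a greedy longest-match scan, which might be sketched in Python as follows. The representation of keywords as tuples of words and the function name are assumptions for this illustration; the word sequences are taken to be already preprocessed as described in section 3.1.

```python
def extract_keywords(word_sequences, keywords):
    """Greedy longest-match keyword extraction.

    word_sequences: list of lists of words for one paper.
    keywords: set of keywords, each a tuple of one or more words.
    Returns the keyword list of the paper, possibly with repetition.
    """
    max_len = max((len(k) for k in keywords), default=0)
    found = []
    for words in word_sequences:
        i = 0
        while i < len(words):
            # Try the longest candidate first, so that "black hole evaporation"
            # wins over its prefix "black hole".
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = tuple(words[i:i + n])
                if candidate in keywords:
                    found.append(" ".join(candidate))
                    i += n
                    break
            else:
                # No keyword starts at this position: drop a single word.
                i += 1
    return found
```

Advancing an index through each sequence is equivalent to repeatedly removing words from its beginning, and avoids copying the sequence at every step.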

For each paper, we can also define a restricted list of keywords (used in section 3.3) in an analogous way, by performing the process above with the restricted set of 40,000 top-ranking keywords instead of the full set. We define the restricted keyword occurrences and their distribution (used in section 4.5) in the same way as the unrestricted keyword occurrences and their distribution, except that we use the restricted keyword list in place of the full one, the restricted set of papers with at most 30 coauthors, and the restricted set of authors with at least 20 papers in the restricted set.

References

  • [1] L. Waltman. A review of the literature on citation impact indicators. arXiv:1507.02099 [cs.DL], 2015.
  • [2] Marc A. Edwards and Siddhartha Roy. Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environ Eng Sci, 34(1):51–61, 2017.
  • [3] arXiv OAI-PMH interface. https://arxiv.org/help/oa/index.
  • [4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
  • [5] American Mathematical Society. Mathematics subject classification. https://mathscinet.ams.org/msc/msc2010.html.
  • [6] Association for Computing Machinery. Computing classification system. https://www.acm.org/publications/class-2012.
  • [7] American Institute of Physics. Physics and astronomy classification scheme. https://journals.aps.org/PACS.
  • [8] M. F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.
  • [9] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
  • [10] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent Dirichlet allocation. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010.
  • [11] David M. Blei and John D. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17–35, 2007.
  • [12] A. Potapenko and K. Vorontsov. Robust PLSA performs better than LDA. Lecture Notes in Computer Science, 7814, 2013.
  • [13] Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, and Luís A. Nunes Amaral. A high-reproducibility and high-accuracy method for automated topic classification. Phys. Rev. X, 5(011007), 2014.
  • [14] Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann. A network approach to topic models. arXiv:1708.01677 [stat.ML], 2017.
  • [15] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
  • [16] Lee J. Cronbach and Paul E. Meehl. Construct validity in psychological tests. Psychological Bulletin, 52(4):281–302, 1955.
  • [17] Roberta Sinatra, Pierre Deville, Michael Szell, Dashun Wang, and Albert-Laszlo Barabasi. A century of physics. Nature, pages 791–796, 2015.
  • [18] arXiv bulk data access. https://arxiv.org/help/bulk_data_s3.
  • [19] Patrice Lopez. Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Maristella Agosti, José Borbinha, Sarantos Kapidakis, Christos Papatheodorou, and Giannis Tsakonas, editors, Research and Advanced Technology for Digital Libraries, pages 473–474, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
  • [20] Nature Publishing Group. Nature index. https://www.natureindex.com/.
  • [21] Department of Commerce US Census Bureau. Names from census 1990. https://catalog.data.gov/dataset/names-from-census-1990.
  • [22] Damien P. George and Robert Knegjens. Paperscape. http://paperscape.org.
  • [23] G. P. Basharin. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and its Applications, 4:333–336, 1959. Translated from Russian.