1 Introduction
Most attempts at quantifying scientific output focus on productivity and popularity, measured by the number of papers, the number of papers in journals with high impact factor, media mentions, citation counts, or combinations thereof (for a recent review see [1]). This focus is problematic not only because it creates perverse incentives [2], but also because other criteria fall by the wayside. To identify a suitable candidate for a job opening, or judge an applicant’s qualifications to lead a project to success, other factors besides productivity and popularity play a role. One of them is the researcher’s breadth of knowledge. In this present work, we want to propose a simple and efficient way of quantifying this breadth.
Our aim here is not to argue that any particular level of broadness is good or bad. Instead, our point of view is that different tasks call for different amounts of specialization, where here and in the following we will use the word ‘specialization’ to mean the opposite of ‘broadness.’ We also do not wish to suggest that the particular measure of broadness which we will propose in the following is the ‘right’ one. Instead, we merely want to demonstrate that it is a useful measure, and one that captures previously unexplored information.
2 Data
We based this present analysis on papers from the open-access server arXiv, available through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) interface [3]. The data used for this analysis was downloaded through the interface in February 2018. It contains the metadata of 1,358,923 papers. We use the title, abstract, author names, date, and arXiv primary category. Before calculating the broadness values, we also remove all papers with more than 30 authors because we expect collaboration papers to be highly specialized by their nature and thus follow a different distribution. When we analyze the statistical properties of the distribution we further remove all authors with fewer than 20 papers because those researchers have too few publications to be meaningfully associated with a broadness value. The final sample contains 46,772 authors and 1,350,611 papers.
3 Analysis
We analyze the text of the papers in four steps, the details of which will be laid out in the following subsections. In brief, the procedure works like this:

We extract terms from the papers’ titles and abstracts. We collect similar terms, such as “galaxy” and “galaxies”, into clusters which we refer to as “keywords”. We rank each keyword using a combination of how frequently it occurs and the distribution of arXiv primary categories of the papers it appears in. This ranking is based on the assumption that highly generic terms such as “paper” or “demonstrate”, which make poor keywords, will be more evenly distributed among different arXiv categories. We keep the 40,000 highest-ranking keywords.

We create author identifications by matching similar names.

We train a statistical model – latent Dirichlet allocation [4] – for the multiset of keywords used by an author.

Once trained, this model allows us to infer a distribution over latent topics for each author. The broadness of an author is then determined as the Shannon entropy of this distribution over topics.
3.1 Keyword Generation
We extract the keywords from the titles and the abstracts of papers in our sample. While we could use pre-existing classification schemes, such as MSC [5], ACM [6], or PACS [7], this would greatly limit the flexibility of our method. The reader be warned that what we refer to as “keyword” here is not necessarily a single word, but may be a sequence of words. For example, “dwarf galaxy” or “effective field theory” would each count as one keyword.
For each paper, we first obtain a sequence of sequences of words by the following steps:

We concatenate the title and abstract, with the string “. ” (period and space) in between. We convert the resulting string to lowercase and remove all LaTeX commands.

We obtain a sequence of strings by dividing the string above into contiguous sections. These sections end whenever a period, question mark, open or closed round bracket, open or closed square bracket, semicolon, colon, or comma is encountered.

We break each string in the above sequence of strings into a sequence of words, by dividing it into contiguous sections of characters which contain no whitespace.
We then produce a list which contains all sequences of at most ten words which can be found in the title or abstract of at least 20 papers. We then remove all entries that begin or end with a stopword, i.e. a word like “the” or “a”.
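The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: the regular expression for LaTeX commands, the tiny stopword set, and the function names `split_into_word_sequences` and `candidate_terms` are all hypothetical, since the text does not specify them.

```python
import re
from collections import defaultdict

# Sections end at: . ? ( ) [ ] ; : ,
SECTION_DELIMITERS = re.compile(r"[.?()\[\];:,]")
# Rough pattern for LaTeX commands such as \alpha (an assumption; the text
# does not say how commands are detected).
LATEX_COMMAND = re.compile(r"\\[a-zA-Z]+")
# Tiny illustrative stopword set; the actual list is not given in the text.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to", "is", "we"}


def split_into_word_sequences(title, abstract):
    """Return one list of words per punctuation-delimited section."""
    text = (title + ". " + abstract).lower()
    text = LATEX_COMMAND.sub(" ", text)
    sections = SECTION_DELIMITERS.split(text)
    return [s.split() for s in sections if s.split()]


def candidate_terms(papers, max_len=10, min_papers=20):
    """Word sequences of length <= max_len found in >= min_papers papers,
    excluding those that begin or end with a stopword."""
    paper_count = defaultdict(int)
    for title, abstract in papers:
        seen = set()  # count each term at most once per paper
        for words in split_into_word_sequences(title, abstract):
            for i in range(len(words)):
                for j in range(i + 1, min(i + max_len, len(words)) + 1):
                    seen.add(tuple(words[i:j]))
        for term in seen:
            paper_count[term] += 1
    return {
        term
        for term, n in paper_count.items()
        if n >= min_papers
        and term[0] not in STOPWORDS
        and term[-1] not in STOPWORDS
    }
```

Because sequences never cross a punctuation boundary, a term such as “dwarf galaxy” is only counted when both words appear in the same section.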
Next we convert each entry of the list into a reduced form. This we do by removing “’s” from the end of every word in a keyword (so “Einstein field equations” and “Einstein’s field equations” are the same), removing all diacritics, removing all non-alphanumeric characters, and applying the Porter Stemming Algorithm [8] to each word that is longer than four characters. Then we join the resulting words of each entry together with no whitespace in between. This means that now, for example, “noncompact”, “non-compact”, and “non compact” all have reduced form “noncompact”, and “galaxies” and “galaxy” both have the reduced form “galaxi”. Having done this, we collect sets of terms with the same reduced form. We will henceforth use the term “keyword” to refer to a set of all terms sharing some common reduced form.
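A minimal sketch of the reduction step follows. The function names are hypothetical, and the `stem` argument is a stand-in for the Porter stemmer: its identity default is only a placeholder so that the sketch runs without third-party packages (one could pass, for instance, the `stem` method of NLTK's `PorterStemmer`).

```python
import re
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def reduced_form(term, stem=lambda w: w):
    """Map a term (words separated by whitespace) to its reduced form.

    `stem` stands in for the Porter stemmer (assumption: any callable
    mapping a word to its stem can be supplied).
    """
    reduced_words = []
    for word in term.lower().split():
        # Drop a trailing possessive, with either apostrophe style.
        word = re.sub(r"(’|')s$", "", word)
        word = strip_diacritics(word)
        # Keep only ASCII letters and digits.
        word = re.sub(r"[^a-z0-9]", "", word)
        if len(word) > 4:
            word = stem(word)
        if word:
            reduced_words.append(word)
    return "".join(reduced_words)
```

Terms can then be grouped into keywords by using the reduced form as a dictionary key.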
We now need to identify the keywords that are most relevant. For this, we define a list $K_p$ for each paper $p$ of keywords that occur in the title or abstract of $p$, and a probability distribution $P$ on keyword occurrences. By a keyword occurrence, we mean specifically a triple consisting of an author $a$, a paper $p$ containing that author among its list of coauthors, and an occurrence of a keyword in $p$, or more specifically, an entry of $K_p$. We give the details on the definition of $K_p$ and $P$ in Appendix B.
Let $P(c)$ be the probability that, in a keyword occurrence randomly selected with probability determined by $P$, the paper’s arXiv primary category is $c$. For a keyword $k$, let $P(k)$ be the probability that the keyword is $k$, and let $P(c \mid k)$ be the probability that the category is $c$ given that the keyword is $k$.
We can then define the rank of a keyword $k$ using the Kullback-Leibler divergence between the posterior distribution $P(c \mid k)$ and the prior distribution $P(c)$ over arXiv categories, as well as the keyword probability $P(k)$, and a manually chosen constant $\alpha$:
(1) 
The effect of $\alpha$ can be roughly summarized as follows: with higher values of $\alpha$, greater precedence is given to commonly used terms. We have found that a value of $\alpha$ of roughly 3 divided by the number of authors in the unfiltered set gives good results, and this value has been used for the following analysis. We keep the 40,000 keywords that have the highest rank.
Since a keyword then refers to a set of similar terms (such as “galaxies” and “galaxy”) rather than a single one, we use the keyword’s most probable form (as determined by the distribution $P$ mentioned above) as representative.
This completes the generation of the keyword list.
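To illustrate the divergence that enters the rank, the sketch below computes the Kullback-Leibler divergence between the category distribution of one keyword's occurrences and the category distribution of all occurrences. It is not the rank itself: the keyword probability and the manually chosen constant are left out, and, as a simplifying assumption, all occurrences are weighted equally instead of using the distribution from Appendix B. The function name is hypothetical.

```python
import math
from collections import Counter


def category_kl_divergence(keyword_categories, all_categories):
    """KL divergence (in nats) of posterior from prior over categories.

    keyword_categories: primary categories of the occurrences of one keyword.
    all_categories: primary categories of all keyword occurrences; it must
        contain every category that appears in keyword_categories.
    Both are treated as equally weighted samples (a simplification).
    """
    posterior = Counter(keyword_categories)
    prior = Counter(all_categories)
    n_post = sum(posterior.values())
    n_prior = sum(prior.values())
    divergence = 0.0
    for category, count in posterior.items():
        p_post = count / n_post
        p_prior = prior[category] / n_prior
        divergence += p_post * math.log(p_post / p_prior)
    return divergence
```

A generic term spread over categories like the whole corpus gets a divergence near zero, while a term confined to one category gets a large one.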
3.2 Author Identification
Author names tend to appear in a variety of different forms. For example, a middle name may be included or omitted. A name might be given in full, or only as an initial. There might be an inconsistency in the usage of diacritics. We therefore use the following procedure to collect names which likely refer to the same person.
Note that, in the following process, each name must consist of at least two words. We ignore all authors whose name, as given by the arXiv data, consists of only one word.
First, we normalize each name by removing all periods and commas, converting it to lower case, and removing all diacritics. Let $N$ denote the set of all the normalizations of the names encountered.
Next, we create a binary relation, $\sim$, that measures the compatibility of two name parts $x$ and $y$, where a name part is either a first, middle, or last name but not combinations thereof. We call two name parts compatible, $x \sim y$, if they are identical or one is just the initial of the other.
Using this, we define another relation, $\approx$, for two full names $m$ and $n$ in $N$, composed of name parts. These names are compatible – that is, $m \approx n$ – if the last names are identical, the first names are compatible according to $\sim$, and at least one of the following two conditions holds:

At least one of the two names has no middle names given.

Each name has the same number of middle names given, and each middle name from one is compatible with the corresponding middle name from the other.
The relation $\approx$ is not an equivalence relation: it is reflexive and symmetric, but not transitive. However, we can create an equivalence relation, $\equiv$, from $\approx$. We start with defining $\equiv$ as equal to $\approx$, but whenever we have a failure of transitivity of $\approx$, say $l \approx m$ and $m \approx n$ but not $l \approx n$, we remove $m \equiv x$ and $x \equiv m$ for all possible $x \neq m$. In other words, every name that is in the middle of some failure of transitivity loses all its neighbors. The resulting relation then must be an equivalence relation and its equivalence classes are what we will use as author identifiers.
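The procedure can be sketched as follows. This is a simplified illustration, not the original implementation: the function names are hypothetical, names are assumed to be given as "first [middle ...] last", and the pairwise comparison is quadratic in the number of names, which would need blocking by last name at arXiv scale.

```python
import unicodedata
from itertools import combinations


def normalize(name):
    """Remove periods and commas, lower-case, strip diacritics, split."""
    name = name.replace(".", "").replace(",", "").lower()
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    return tuple(name.split())


def parts_compatible(x, y):
    """Identical, or one is the initial of the other."""
    return (x == y
            or (len(x) == 1 and y.startswith(x))
            or (len(y) == 1 and x.startswith(y)))


def names_compatible(m, n):
    """m, n: tuples (first, *middle, last) with at least two parts."""
    if m[-1] != n[-1] or not parts_compatible(m[0], n[0]):
        return False
    mid_m, mid_n = m[1:-1], n[1:-1]
    if not mid_m or not mid_n:
        return True
    return len(mid_m) == len(mid_n) and all(
        parts_compatible(x, y) for x, y in zip(mid_m, mid_n))


def author_classes(raw_names):
    """Equivalence classes of normalized names, as a list of sets."""
    names = {n for n in map(normalize, raw_names) if len(n) >= 2}
    neighbours = {
        n: {m for m in names if m != n and names_compatible(m, n)}
        for n in names
    }
    # A name is "in the middle" of a transitivity failure if two of its
    # neighbours are not compatible with each other.
    middles = {
        n for n, nbrs in neighbours.items()
        if any(not names_compatible(a, b) for a, b in combinations(nbrs, 2))
    }
    classes, assigned = [], set()
    for n in sorted(names):
        if n in assigned:
            continue
        group = {n}
        if n not in middles:
            group |= {m for m in neighbours[n] if m not in middles}
        classes.append(group)
        assigned |= group
    return classes
```

Note how conservative the rule is: given “John Smith”, “John A. Smith” and “John B. Smith”, the first name is compatible with both others, which are incompatible with each other, so all three end up in separate classes.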
The author identification leaves us with 664,057 authors.
3.3 Creating the LDA model
Latent Dirichlet Allocation (LDA) is a way of generating a probability distribution for a collection of documents. Here, a document is a sequence of words, and a word is an element of a finite set referred to as ‘the vocabulary’. LDA works by representing the documents as mixtures of ‘latent’ topics, and then characterizing these topics by a distribution over words [4].
To apply LDA to the set of arXiv authors, we take the vocabulary to be our list of 40,000 keywords. Each ‘document’ corresponds to an author, and the sequence of words within each document is the sequence of keywords used in all of the titles and abstracts of that author’s papers. To determine the keywords used in each paper and their multiplicities, we use the procedure described in Appendix B for creating the keyword lists $K'_p$. These are similar to the keyword lists $K_p$ used in section 3.1, but for the restricted set of the 40,000 highest-ranking keywords.
3.4 Measuring Broadness
With the trained LDA model, we can compute the joint probability density $p(\theta, z, w)$ of a probability distribution over latent topics $\theta$, a sequence of topics $z$, and a sequence of keywords $w$. In principle, given a sequence of keywords $w_a$ used by an author $a$, we can obtain a single probability distribution over topics for this author by taking the expected value of $\theta$ given $w_a$.
Computing this value is intractable in general [4, Section 5.1]. However, it is possible to compute an approximation $q(\theta, z)$ to $p(\theta, z \mid w_a)$. More specifically, we can choose $q$ to be the probability distribution which minimizes the Kullback-Leibler divergence $D_{\mathrm{KL}}\big(q(\theta, z) \,\|\, p(\theta, z \mid w_a)\big)$ among all probability distributions in a certain family. The details on the definition of this family of probability distributions, and an iterative algorithm for computing $q$, are given in [4, Section 5.2].
We therefore modify the above definition of the topic distribution of an author in order to make it computationally tractable: instead of taking the expected value of $\theta$ according to the distribution $p(\theta \mid w_a)$, we take it according to the distribution $q(\theta)$. The marginal distribution $q(\theta)$ is given by a Dirichlet distribution, for which there exists a simple explicit formula for the expected value. This operation of determining $q$ and taking the expected value of $\theta$ is performed by the gensim function LdaModel.get_document_topics. We set the minimum_probability parameter to 0, and kept all other optional parameters at their default values.
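For reference, the explicit formula for the expected value of a Dirichlet distribution is the standard one below. The symbols are assumptions made here for illustration: $K$ denotes the number of latent topics and $\gamma_1, \dots, \gamma_K$ the parameters of the Dirichlet distribution, following the notation of [4].

```latex
\mathbb{E}[\theta_i] \;=\; \frac{\gamma_i}{\sum_{j=1}^{K} \gamma_j},
\qquad i = 1, \dots, K .
```

The resulting vector is non-negative and sums to one, so it can be used directly as the author's topic distribution.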
For assigning a topic distribution to an author, we use only their papers with at most 30 coauthors. This is to avoid measuring an author as extremely specialized because they have many papers with a single highly specialized collaboration. We don’t apply this filter at any prior stage of the analysis.
We assume that broader authors will have a less predictable topic distribution. The unpredictability of a distribution can be quantified by the Shannon entropy [15]. We therefore define the broadness of an author to be the Shannon entropy of their topic distribution.
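As a minimal sketch, the entropy of a topic distribution can be computed as below. The function name is hypothetical, and the use of the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def shannon_entropy(distribution):
    """Shannon entropy in nats of a discrete probability distribution.

    Zero-probability entries contribute nothing, by the usual convention.
    """
    return -sum(p * math.log(p) for p in distribution if p > 0)
```

An author whose weight is spread evenly over $K$ topics attains the maximum value $\log K$, while an author concentrated on a single topic has entropy 0.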
4 Validity
In this section, we consider the question of whether latent topic entropy is a valid measurement of scientific broadness. To give an affirmative answer to this, we would need to discuss what is meant specifically by “scientific broadness”, for example, by constructing a nomological network [16]. We won’t attempt that in this paper; however, we will take steps in the same direction by showing that latent topic entropy has some properties that we would expect a valid measurement of scientific broadness to have, for most reasonable interpretations of “scientific broadness”.
4.1 Correlations with other broadness metrics
One way to test whether latent topic entropy qualifies as a measure of scientific broadness is by checking the correlation between latent topic entropy and simpler, more direct measurements. To this end, we have measured the correlation between latent topic entropy and two other metrics based on the arXiv primary categories of an author’s papers. (The details on these two metrics are given in Appendix A.)
The first alternative metric, arXiv category entropy, measures how unpredictable the arXiv categories of an author’s papers are. The second, which is a measurement of specialization rather than broadness (that is, it should be lower rather than higher for broader authors), is basically how different the category distribution of an author’s papers is from the average category distribution of all authors. The correlations are 0.45 and −0.195 respectively, which are both in the direction that one should expect from the assumption that these are valid measurements of broadness or specialization.
4.2 Typical keywords of latent topics
Interpreting latent topic entropy as a measure of scientific broadness requires the assumption that the latent topics discovered by LDA correspond to distinct scientific topics, instead of being, for example, random distributions of unrelated words. We provide a list of the 20 most common keywords of each latent topic at http://lostinmathbook.com/topic%20keywords.txt so that the reader can see that we have reason to think this assumption holds true.
4.3 Consistency
Even though we use an author’s papers in order to determine their latent topic entropy, our intention is to measure an intrinsic property of the author’s research style, not a property of a particular set of papers. Hence, if latent topic entropy is a valid measurement of scientific broadness, different subsets of an author’s papers should tend to give similar measurements of latent topic entropy.
We have tested this hypothesis by measuring the correlation between two different latent topic entropy values for each author with at least 40 papers. The first measurement uses a random half of the author’s papers (rounded down), and the second measurement uses the remaining papers. We measured a Pearson’s $r$ of 0.94 between the two broadness values, indicating that our broadness metric is not very sensitive to the specific set of papers used to compute broadness, as we would hope.
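The split-half test can be sketched as follows. The data layout and function names are hypothetical; in particular, `broadness` stands for any callable that maps a list of papers to a broadness value, which in the analysis above would involve the trained LDA model.

```python
import math
import random


def split_half(papers, rng):
    """Return a random half (rounded down) and the remaining papers."""
    shuffled = list(papers)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_correlation(papers_by_author, broadness, min_papers=40, seed=0):
    """Correlate the broadness of two disjoint halves of each author's papers.

    papers_by_author: dict author -> list of papers (hypothetical layout).
    broadness: hypothetical callable, list of papers -> float.
    """
    rng = random.Random(seed)
    first, second = [], []
    for papers in papers_by_author.values():
        if len(papers) < min_papers:
            continue
        half_a, half_b = split_half(papers, rng)
        first.append(broadness(half_a))
        second.append(broadness(half_b))
    return pearson_r(first, second)
```

Fixing the seed makes the random split reproducible, which is convenient when comparing several broadness metrics on the same halves.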
5 Results
In this section, we restrict our attention to authors who have at least 20 papers with no more than 30 coauthors. This is so that we have sufficient data to get a meaningful estimate of their broadness.
5.1 Total Population
In Figure 1 we depict the distribution of values of broadness over authors together with a Gaussian fit. The data has a mean value of 1.584 and standard deviation of 0.500. It is close to normal, with a skewness of 0.132 and an excess kurtosis of 0.058.
We note as an aside that if one does not remove papers with more than 30 authors (i.e., keeps papers of large collaborations), the broadness distribution has a second mode (not shown) which peaks at low broadness. This second mode consists mainly of authors whose papers are mostly with a highly specialized collaboration such as LHCb or LIGO/VIRGO.
5.2 ArXiv Categories
Next we look at authors that are primarily associated with a certain arXiv category, where we identify an author with a category if it is the primary category of at least 60% of their papers. Because of the low statistics, we omit categories with fewer than 100 associated authors. In Table 1 we list the most broad categories and in Table 2 we list the least broad categories. The complete list can be downloaded online at fias.uni-frankfurt.de/~hossi/Physics/author_category.txt.
It is instructive to compare these results to the findings of [17] which studied (among other things) the frequency by which papers in a subfield of physics reference the same subfield. In [17] it was found that nuclear physics, astrophysics, the physics of elementary particles and fields, and plasma physics have the highest ratio of self-citations. For the first three of these, there is a tendency for the associated arXiv categories to have low broadness, especially when measured by mean paper broadness. Plasma physics, however, we find to have a high broadness.
One possible reason for this discrepancy is that [17] did not use the arXiv categories, so what they refer to as ‘plasma physics’ is not identical to the category we refer to. Another reason is that broadness just measures a different property to the frequency of self-citations. A category can be broad because its concepts are commonly used also in other categories. This may or may not mean that people who primarily work in this category commonly refer to papers outside their discipline.
Table 1: The ten arXiv categories with the highest mean author broadness.

category             # authors   mean    standard deviation
physics.plasm-ph     106         1.927   0.331
math.NA              113         1.880   0.306
cond-mat.stat-mech   354         1.870   0.332
math.PR              458         1.787   0.305
math-ph              181         1.771   0.324
cond-mat.soft        281         1.760   0.243
physics.atom-ph      164         1.734   0.269
physics.optics       231         1.723   0.347
quant-ph             1714        1.719   0.340
cond-mat.mes-hall    1043        1.646   0.285

Table 2: The ten arXiv categories with the lowest mean author broadness.

category             # authors   mean    standard deviation
math.GT              159         1.293   0.276
cond-mat.str-el      912         1.219   0.290
nucl-ex              350         1.215   0.441
math.OA              109         1.192   0.292
math.GR              120         1.180   0.296
astro-ph.CO          406         1.179   0.378
hep-th               1930        1.162   0.391
math.AG              407         1.040   0.259
math.RT              115         1.008   0.327
astro-ph.GA          476         0.920   0.409

Using our trained LDA model, we can also associate a broadness value to a paper in a similar way as for authors. We treat each paper as an LDA document whose sequence of words is given by the list (defined precisely in Appendix B) of keywords appearing in the title and abstract. We have calculated the broadness values for arXiv categories as per the average broadness of the papers that have this respective primary category. We omit categories with fewer than 100 associated papers. Since this is a much less restrictive criterion than having at least 100 associated authors, smaller arXiv categories are better represented here. The results are displayed in Tables 3 and 4. The complete list can be downloaded online at fias.uni-frankfurt.de/~hossi/Physics/paper_category.txt.
Table 3: The ten arXiv categories with the highest mean paper broadness.

category             # papers    mean    standard deviation
physics.pop-ph       781         2.084   0.343
math.HO              1568        2.033   0.400
physics.hist-ph      1830        2.012   0.343
physics.med-ph       1429        1.977   0.290
nlin.CG              360         1.962   0.302
q-bio.OT             406         1.951   0.307
physics.data-an      2308        1.944   0.353
physics.class-ph     3100        1.939   0.352
patt-sol             542         1.937   0.284
physics.geo-ph       1724        1.930   0.342

Table 4: The ten arXiv categories with the lowest mean paper broadness.

category             # papers    mean    standard deviation
nucl-ex              7551        1.414   0.438
math.RT              9178        1.406   0.419
astro-ph.EP          9758        1.400   0.430
cond-mat.str-el      32240       1.373   0.390
astro-ph.HE          18998       1.366   0.408
astro-ph.SR          25478       1.366   0.418
astro-ph             93615       1.363   0.442
math.KT              1689        1.330   0.401
astro-ph.CO          25986       1.258   0.456
astro-ph.GA          20852       1.160   0.447

Note that the standard deviations quoted in Tables 1 and 2 are for the distribution in each category. The values do not quantify the deviation of each category’s mean value from that of the entire sample.
For both the mean author broadness and mean paper broadness, applying a one-way ANOVA F-test yields an undetectably small p-value, showing that the differences between categories are exceedingly unlikely to be random fluctuations.
5.3 Country broadness
We next quantify the typical broadness per country as the mean broadness of authors in that country. We used the following procedure to associate countries with authors. First, we used arXiv’s bulk pdf access [18] to download pdf files of arXiv papers up to January 2018. We used Grobid [19] to extract the countries of authors from the affiliation data provided in these pdf files. We associated a country with an author if a country was extracted by Grobid for this author in at least one paper, and all countries extracted by Grobid for this author were the same. To get meaningful statistical values, we do not consider countries which have fewer than 100 associated authors. The results are displayed in Table 5 and in Figure 2. The total number of authors here is smaller because we were not able to link each author to a country, and authors who are linked to countries with fewer than 100 authors in total are not represented.
Table 5: Mean author broadness by country.

Country                     # authors   mean    standard deviation
Israel                      281         1.745   0.436
Austria                     127         1.705   0.398
China                       998         1.639   0.496
France                      1409        1.634   0.459
Netherlands                 204         1.624   0.462
India                       450         1.619   0.473
Belgium                     142         1.610   0.428
Hungary                     107         1.609   0.459
Italy                       1195        1.600   0.482
Australia                   320         1.599   0.475
Poland                      301         1.595   0.428
Russian Federation          554         1.593   0.446
Brazil                      382         1.590   0.455
Switzerland                 197         1.590   0.425
Germany                     1397        1.583   0.451
United States               5411        1.578   0.474
Canada                      378         1.570   0.456
UK and Northern Ireland     1006        1.568   0.463
Sweden                      151         1.560   0.472
Spain                       459         1.556   0.489
Japan                       1370        1.482   0.462
Iran, Islamic Republic of   116         1.430   0.545
Korea, Republic of          181         1.404   0.438

Applying a one-way ANOVA F-test yields a p-value of , showing that the differences between countries are exceedingly unlikely to be random fluctuations.
We further looked at the correlation between our measure of broadness and the Nature Index [20]. For this we used the weighted fractional count (physical sciences only). The two measures are uncorrelated with a Pearson coefficient of .
5.4 Gender, career termination, and $h$-index
We matched author names with the lists of common female and male names from the 1990 United States Census [21] to identify the gender of an author where possible. This way we were able to identify 6,295 likely male and 3,502 likely female authors. (We want to remind the reader that this sample only includes authors with at least 20 papers.) We find small differences in the mean values and variances of these distributions, but the results are not consistent for the four measures of broadness we have tried (see Appendix A). We thus conclude that either the gender differences are insignificant or our present methods do not allow us to resolve them.
Next we have analyzed our sample for a correlation between broadness and sudden career terminations. An author is in the terminated-career set if there exists an active period of 10 years in which they have published at least 10 papers, immediately followed by an inactive period, of at least 10 years and extending until the time the data was collected, during which at most 3 papers were published. This is in addition to the usual criterion that they have at least 20 papers with at most 30 coauthors. Our sample contains a total of 1,672 authors with such terminated careers.
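One way to read this criterion is sketched below. The function name and the representation of dates as (possibly fractional) years are assumptions, as is the treatment of the window boundaries; the additional requirement of at least 20 papers is not checked here.

```python
def has_terminated_career(paper_years, collection_year,
                          active_span=10, min_active=10,
                          inactive_span=10, max_inactive=3):
    """Check the terminated-career criterion for one author.

    paper_years: publication dates as numbers (e.g. 2004.5).
    collection_year: date at which the data was collected.
    The active window is taken as (end - active_span, end] and the inactive
    period as (end, collection_year]. It suffices to try window ends that
    coincide with a paper date: moving the end later without reaching the
    next paper cannot add papers to the active window.
    """
    years = sorted(paper_years)
    for end in years:
        if end > collection_year - inactive_span:
            break  # the inactive period would be shorter than required
        active = sum(1 for y in years if end - active_span < y <= end)
        inactive = sum(1 for y in years if y > end)
        if active >= min_active and inactive <= max_inactive:
            return True
    return False
```

For example, an author with one paper per year from 1994 to 2005 and none afterwards satisfies the criterion for data collected in 2018, while an author who keeps publishing every year until 2017 does not.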
We found that, in the terminated-career set, the mean broadness was 1.483 and the standard deviation was 0.469. We remind the reader that the mean broadness of the whole sample is slightly greater at 1.584, and the standard deviation of the whole sample is 0.5. This is a statistically significant difference in broadness between groups: Welch’s t-test gives a p-value of . For all other broadness metrics we investigated (see Appendix A) we also observed that the terminated-career authors were more specialized. The largest p-value obtained was , by the arXiv category Kullback-Leibler divergence metric. Therefore, from our analysis, it appears that sudden career terminations are associated with specialized authors. Although the size of the effect on the mean broadness is small, the difference between the means is highly significant.
We further computed an $h$-index value for each author using the arXiv citation data published by Paperscape [22]. We used the data published in May 2016, which includes citation data up to 2015. Note that the $h$-index value we computed is not necessarily the same as the author’s true $h$-index, because the author may not have all their papers on the arXiv.
We found a Pearson’s $r$ value of  between $h$-index and broadness. With all other broadness metrics we tried, we found a slight negative correlation between $h$-index and broadness, except for the arXiv category Kullback-Leibler divergence metric. We suggest a possible explanation for this anomaly in Appendix A.3. Therefore, from this analysis, it appears that there may be a weak positive correlation between specialization and $h$-index, or a weak negative correlation between broadness and $h$-index, respectively.
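For readers unfamiliar with it, the $h$-index is the largest number $h$ such that $h$ of an author's papers have at least $h$ citations each. A minimal sketch, with a hypothetical function name and the citation counts assumed to be given as a plain list:

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h
```

Since only papers on the arXiv enter the list, the value obtained this way is a lower bound on the author's true $h$-index for the same citation data.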
5.5 Keyword Broadness
We can also associate a broadness value to each keyword. For this, we use a probability distribution $P'$ on the restricted keyword occurrences. This is analogous to the distribution $P$, but it uses the restricted set of 40,000 highest-ranking keywords, the restricted set of papers with at most 30 coauthors, and the restricted set of authors with at least 20 papers in the restricted set. The details on this are given in Appendix B.
We can use $P'$ to define a broadness value for each keyword $k$: the broadness of $k$ is the expected value of the broadness of the author given that the keyword is $k$.
In Table 6, we list the top ten and bottom ten keywords, subject to the additional restriction that they occur with probability at least about 10 divided by the size of the restricted set of authors, according to $P'$. A complete list can be downloaded online at fias.uni-frankfurt.de/~hossi/Physics/keywords.txt.
We note that the keyword broadness fits well with the category broadness (Tables 1 and 2) in that the most specialized keywords are typical for the astro-ph.X categories and the broadest keywords are typical for many-particle systems found in numerical (math.NA) or probabilistic studies (math.PR) or cond-mat.X applications thereof.
Table 6: The ten broadest and the ten most specialized keywords.

rank   broadest          most specialized
1      agents            molecular gas
2      chaos             z = 0
3      synchronization   star-forming
4      chaotic           star formation rate
5      fractal           early-type galaxies
6      sensors           stellar mass
7      network           z2
8      memory            SFR
9      logic             z1
10     percolation       star-forming galaxies

6 Conclusion
We have proposed and analyzed a new measure to quantify and aggregate research activity whose purpose is to capture the breadth of a scientist’s publications, or their specialization, respectively. We have found that broadness has little correlation with the $h$-index (of individual authors) or the Nature Index (of countries), suggesting that it captures previously unused information. While we do not think that the specific way of measuring broadness put forward here is the only correct one, we wish to suggest that broadness is a valuable indicator in particular for nations, institutions, or individuals which strive to improve their interdisciplinary research.
Acknowledgements
We thank Tobias Mistele for helpful communication. This work was made possible through support by the Foundational Questions Institute (FQXi).
Appendix A: Other Measures
We tried some other ways to measure broadness before settling on the latent topic Shannon entropy used in the main text. For completeness, we here list other methods that we investigated.
1. Kullback-Leibler Divergence
Instead of measuring the broadness of an author as the entropy of their topic distribution $\theta_a$, we could measure it using the Kullback-Leibler divergence $D_{\mathrm{KL}}(\theta_a \,\|\, \bar{\theta})$ with the average topic distribution of all authors $\bar{\theta}$. We can interpret authors for whom this KL divergence is low as being broader, and authors for whom it is high as more specialized. The justification for this interpretation is the assumption that a maximally broad author should have a topic distribution equal to $\bar{\theta}$, and so the quantity $D_{\mathrm{KL}}(\theta_a \,\|\, \bar{\theta})$ measures how different the author’s topic distribution is from a maximally broad author.
Note that we can’t use $D_{\mathrm{KL}}(\bar{\theta} \,\|\, \theta_a)$ to define a broadness metric: the Kullback-Leibler divergence is only well-defined if all events that have a probability 0 according to the right distribution also have a probability of 0 according to the left. That is not the case here, since small probabilities in the computed topic distributions often become rounded to 0.
Note that, in general, the entropy $H(p)$ of a distribution $p$ is linearly related to the Kullback-Leibler divergence $D_{\mathrm{KL}}(p \,\|\, u)$ with the uniform distribution $u$ on the same underlying sample space as $p$: for a sample space with $n$ elements, $H(p) = \log n - D_{\mathrm{KL}}(p \,\|\, u)$. From this perspective, we can see that this broadness metric based on Kullback-Leibler divergence is closely related to the main one. The only difference is that the main metric assumes that a perfectly broad author has a uniform topic distribution, while this one assumes that a perfectly broad author has a topic distribution equal to the average topic distribution.
2. ArXiv Primary Categories
Instead of measuring an author’s broadness using their latent topic distribution, we may use distributions derived from the arXiv primary categories of their papers.
Suppose that the arXiv primary categories of the papers of an author are sampled from an ideal category distribution for that author, which can be estimated based on the observed categories of this author’s papers, but cannot be known. An estimator of the entropy of this distribution may be interpreted as a measurement of the author’s broadness. Taking the entropy of the maximum-likelihood estimate of this distribution (that is, the distribution where the probability of a category is proportional to the number of times it was used in all of the author’s papers) is known to be a negatively biased estimator of the true entropy, with the bias becoming less severe as the sample size increases [23]. For example, no matter how broad an author’s interests are, if they only have a single paper on arXiv, we will always estimate their category entropy as 0, since every paper of that author is in the same category.
Because of this, we estimated the category entropy of an author by taking a random sample of 20 of their papers without replacement (recall that we restrict our attention to authors with at least 20 papers, so this is always possible), and taking the entropy of the primary category distribution of these 20 papers. This increases the magnitude of the bias of our entropy estimator in most cases, but it becomes more consistent between authors with different numbers of papers, so we avoid systematically measuring a higher broadness value for authors with more papers.
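The subsampled estimator can be sketched as follows. The function name is hypothetical and the natural logarithm is an assumption; `random.Random.sample` draws without replacement and raises a `ValueError` for authors with fewer than `sample_size` papers.

```python
import math
import random
from collections import Counter


def category_entropy(categories, sample_size=20, rng=None):
    """Entropy (in nats) of the primary categories of a random subsample.

    categories: primary category of each of the author's papers.
    rng: optional random.Random instance, for reproducibility.
    """
    if rng is None:
        rng = random.Random(0)
    sample = rng.sample(list(categories), sample_size)
    counts = Counter(sample)
    return -sum(
        (c / sample_size) * math.log(c / sample_size)
        for c in counts.values()
    )
```

With a fixed sample size, the estimate is bounded by $\log 20$ for every author, regardless of how many papers they have written.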
Similarly, we examined another broadness metric obtained by taking the Kullback-Leibler divergence of the category distribution of a 20-paper subset of an author’s papers with the average category distribution of all authors. Note that, like the latent topic Kullback-Leibler divergence metric, this is really a measure of specialization since it should decrease for broader authors.
3. A comment on $h$-index correlations
For all the metrics mentioned above, we also measured the correlation with $h$-index as in section 5.4. We found, for all but the arXiv category Kullback-Leibler divergence metric (henceforth referred to as catKLD), a slight negative correlation between broadness and $h$-index, in agreement with section 5.4. We offer here a possible explanation for why the arXiv category Kullback-Leibler divergence disagreed with the others.
Let $p$ be the arXiv category distribution of the 20 randomly selected papers of some author used to compute their catKLD. Let $q$ be the average arXiv category distribution among all authors. The catKLD metric for the author is then given by

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p).$$

Here, $H(p, q)$ is the cross-entropy between $p$ and $q$ and $H(p)$ is the entropy of $p$.
The cross-entropy can be interpreted as a measure of how much the author tends to publish in less active arXiv categories. The catKLD metric will therefore tend to rate authors as more specialized if they publish in less active arXiv categories. This could explain why it correlates negatively with the h-index (which conflicts with the other metrics, since catKLD is a measure of specialization and not broadness): authors with high catKLD could be receiving fewer citations because they tend to publish in less active categories, where there are fewer authors who might cite their work.
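The effect of category size on catKLD can be illustrated with a small numerical sketch. The symbols p (an author's category distribution) and q (the average distribution over all authors), the category names, and the probabilities below are all hypothetical, and the natural logarithm is an assumption. The sketch only relies on the standard identity that the Kullback-Leibler divergence equals the cross-entropy minus the entropy.

```python
import math


def entropy(p):
    """Shannon entropy of a distribution given as {category: probability}."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy of p relative to q; q must be positive wherever p is."""
    return -sum(pi * math.log(q[c]) for c, pi in p.items() if pi > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q, computed directly."""
    return sum(pi * math.log(pi / q[c]) for c, pi in p.items() if pi > 0)


# Hypothetical average distribution: one very active and one small category.
q = {"big": 0.9, "small": 0.1}

# Two hypothetical authors, each publishing in a single category only.
author_big = {"big": 1.0}
author_small = {"small": 1.0}

for name, p in [("big", author_big), ("small", author_small)]:
    print(name, entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

Both hypothetical authors have zero category entropy, so an entropy-based metric treats them as equally specialized. Their divergences differ only through the cross-entropy term, which is larger for the author who publishes in the small category.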
As for why the latent topic KLD metric does not have the opposite correlation with the h-index for the same reason: while the arXiv categories vary in size by orders of magnitude, the latent topics have relatively consistent average probabilities. The cross-entropy term is therefore much less significant in this case.
Appendix B: Details on the keyword occurrences and their distributions
In section 3.1, we describe the rank of a keyword, which quantifies, roughly, a combination of how common the keyword is and how much information it gives about the topic of the paper in question. Our procedure for determining the rank of a keyword depends on the probability distribution on keyword occurrences. By a keyword occurrence, we mean specifically a triple consisting of an author, a paper that lists that author among its coauthors, and an occurrence of a keyword in that paper or, more specifically, an entry of the paper's keyword list. Here, the keyword list of a paper is a list, possibly with repetition, of the keywords occurring in that paper.
We define the probability of a keyword occurrence using the following process:
1. Choose an author uniformly at random.
2. Choose one of this author's papers uniformly at random.
3. Choose an entry of that paper's keyword list uniformly at random.
The probability of a keyword occurrence is then the probability of choosing that author, paper, and list entry in this process.
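Since the three choices are made uniformly and in sequence, the probability of a keyword occurrence is the product of three reciprocals: the number of authors, the number of papers of the chosen author, and the length of the chosen paper's keyword list. The sketch below illustrates this on toy data. The function names, the dictionary layout, and the data are hypothetical, and we assume that every paper has at least one keyword so that the last step is well defined. The marginal probability of a keyword is added here only as an illustration of how the distribution might be used.

```python
from fractions import Fraction


def occurrence_probability(author, paper, papers_by_author, keywords_by_paper):
    """Probability of one keyword occurrence (author, paper, list entry)."""
    p_author = Fraction(1, len(papers_by_author))          # step 1
    p_paper = Fraction(1, len(papers_by_author[author]))   # step 2
    p_entry = Fraction(1, len(keywords_by_paper[paper]))   # step 3
    return p_author * p_paper * p_entry


def keyword_probability(keyword, papers_by_author, keywords_by_paper):
    """Total probability of all occurrences whose entry equals keyword."""
    total = Fraction(0)
    for author, papers in papers_by_author.items():
        for paper in papers:
            hits = keywords_by_paper[paper].count(keyword)
            total += hits * occurrence_probability(
                author, paper, papers_by_author, keywords_by_paper
            )
    return total


# Toy data: paper "p2" is coauthored by both authors.
papers_by_author = {"A": ["p1", "p2"], "B": ["p2"]}
keywords_by_paper = {
    "p1": ["black hole", "entropy"],
    "p2": ["black hole", "black hole", "inflation"],
}
```

Exact fractions are used so that one can check directly that the probabilities of all keyword occurrences sum to one.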
It remains to give a precise definition of the keyword list of a paper. For this, we use the following procedure:
1. Initialize a working list as the sequence of sequences of words associated with the paper, as described at the beginning of section 3.1. Initialize the keyword list as an empty list.
2. Perform the remaining steps for each nonempty sequence of words in the working list.
3. If the sequence begins with a keyword, remove the longest possible keyword from the beginning of the sequence (keep in mind that a keyword may contain more than one word, and may contain prefixes that are distinct keywords, such as “black hole evaporation” and “black hole”). Add the removed keyword to the keyword list. If the sequence does not begin with a keyword, remove a single word from the beginning.
4. Repeat the previous step until the sequence is empty.
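The procedure above amounts to a greedy longest-match scan over each word sequence. A minimal sketch is given below, assuming that keywords are stored as tuples of words in a set. The function name and the toy keyword set are hypothetical. Instead of removing words from the front of the sequence, the sketch advances an index, which yields the same keyword list.

```python
def extract_keywords(word_sequences, keywords):
    """Return the list of keywords (with repetition) found by greedy
    longest-match scanning of each sequence of words.

    word_sequences: iterable of lists of words
    keywords: set of tuples of words, e.g. {("black", "hole")}
    """
    max_len = max((len(k) for k in keywords), default=0)
    found = []
    for words in word_sequences:
        i = 0
        while i < len(words):
            # Try the longest candidate first, then shorter ones.
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = tuple(words[i:i + n])
                if candidate in keywords:
                    found.append(" ".join(candidate))
                    i += n  # remove the matched keyword from the front
                    break
            else:
                i += 1  # no keyword starts here: drop a single word
    return found


# Toy keyword set in which one keyword is a prefix of another.
toy_keywords = {
    ("black", "hole"),
    ("black", "hole", "evaporation"),
    ("entropy",),
}
toy_sequences = [
    ["the", "black", "hole", "evaporation", "rate"],
    ["black", "hole", "entropy"],
]
print(extract_keywords(toy_sequences, toy_keywords))
```

In the first toy sequence the longer keyword “black hole evaporation” is preferred over its prefix “black hole”, while in the second sequence only the shorter keyword matches.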
For each paper, we can also define a restricted list of keywords (used in section 3.3) in an analogous way, by performing the process above with the restricted set of 40,000 top-ranking keywords instead of the full set. We define the restricted keyword occurrences and their distribution (used in section 4.5) in the same way as the unrestricted keyword occurrences and their distribution, except that we use the restricted keyword list in place of the full keyword list, the restricted set of papers with at most 30 coauthors, and the restricted set of authors with at least 20 papers in the restricted set.
References
 [1] L. Waltman. A review of the literature on citation impact indicators. arXiv:1507.02099 [cs.DL], 2015.
 [2] Marc A. Edwards and Siddhartha Roy. Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environ Eng Sci, 34(1):51–61, 2017.
 [3] arXiv OAI-PMH interface. https://arxiv.org/help/oa/index.
 [4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
 [5] American Mathematical Society. Mathematics subject classification. https://mathscinet.ams.org/msc/msc2010.html.
 [6] Association for Computing Machinery. Computing classification system. https://www.acm.org/publications/class2012.
 [7] American Institute of Physics. Physics and astronomy classification scheme. https://journals.aps.org/PACS.
 [8] M. F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.
 [9] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
 [10] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent Dirichlet allocation. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010.
 [11] David M. Blei and John D. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17–35, 2007.
 [12] Potapenko A. and Vorontsov K. Robust PLSA performs better than LDA. Lecture Notes in Computer Science, 7814, 2013.
 [13] Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, and Luís A. Nunes Amaral. A high-reproducibility and high-accuracy method for automated topic classification. Phys. Rev. X, 5(011007), 2014.
 [14] Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann. A network approach to topic models. arXiv:1708.01677 [stat.ML], 2017.
 [15] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
 [16] Lee J. Cronbach and Paul E. Meehl. Construct validity in psychological tests. Psychological Bulletin, 52(4):281–302, 1955.
 [17] Roberta Sinatra, Pierre Deville, Michael Szell, Dashun Wang, and Albert-László Barabási. A century of physics. Nature, pages 791–796, 2015.
 [18] arXiv bulk data access. https://arxiv.org/help/bulk_data_s3.
 [19] Patrice Lopez. Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Maristella Agosti, José Borbinha, Sarantos Kapidakis, Christos Papatheodorou, and Giannis Tsakonas, editors, Research and Advanced Technology for Digital Libraries, pages 473–474, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
 [20] Nature Publishing Group. Nature index. https://www.natureindex.com/.
 [21] Department of Commerce US Census Bureau. Names from census 1990. https://catalog.data.gov/dataset/namesfromcensus1990.
 [22] Damien P. George and Robert Knegjens. Paperscape. http://paperscape.org.

 [23] G. P. Basharin. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and its Applications, 4:333–336, 1959. Translated from Russian.