Using text analysis to quantify the similarity and evolution of scientific disciplines

by   Laercio Dias, et al.

We use an information-theoretic measure of linguistic similarity to investigate the organization and evolution of scientific fields. An analysis of almost 20M papers from the past three decades reveals that linguistic similarity is related to, but different from, expert and citation-based classifications, leading to an improved view of the organization of science. A temporal analysis of the similarity of fields shows that some fields (e.g., computer science) are becoming increasingly central, but that on average the similarity between pairs of fields has not changed in the last decades. This suggests that tendencies of convergence (e.g., multi-disciplinarity) and divergence (e.g., specialization) of disciplines are in balance.




I Introduction

The digitization of scientific production opens new possibilities for quantitative studies on scientometrics and science of science evans.2011 , bringing new insights into questions such as how knowledge is organized (maps of science) boerner.2003 ; Shiffrin2004 ; boyack.2005 ; Rosvall2008 ; glaser.2017 , how impact evolves over time (bibliometrics) Wang2013 ; moreira.2015 , or how to measure the degree of interdisciplinarity lariviere.book2014 ; Noorden2015 . At the heart of these questions lie the problems of identifying scientific fields and how they relate to each other. The difficulty of these problems, and the inadequacy of a purely essentialist approach, was clear to K. R. Popper already in the 1950s popper.1952 : “The belief that there is such a thing as physics, or biology, or archaeology, and that these ’studies’ or ’disciplines’ are distinguishable by the subject matter which they investigate, appears to me to be a residue from the time when one believed that a theory had to proceed from a definition of its own subject matter. But subject matter, or kinds of things, do not, I hold, constitute a basis for distinguishing disciplines.” popper.1952 . Instead, he argued that disciplines have a cognitive and a social dimension balsiger.book2005 , i.e. they “are distinguished partly for historical reasons and reasons of administrative convenience (such as the organization of teaching and of appointments), and partly because the theories which we construct to solve our problems have a tendency to grow into unified systems.” popper.1952 .

On the one hand, the social dimension of scientific fields can be defined in terms of different institutions establishing stable recurring patterns of behavior guntau.book1991 : producing and reproducing institutions such as research institutes and universities, communicative institutions such as scientific societies, journals or conferences, collecting institutions (journals, libraries), as well as directing institutions (ministries, scientific advisory boards), etc. All these institutions contribute to the formation, stabilization, and reproduction of a discipline as well as its distinction from others. On the other hand, the cognitive dimension has been specified in Ref. guntau.book1991 as a number of fundamental invariants in the procedural knowledge, which lead to the categorical construction of scientific knowledge. If this process causes a change in the cognitive realm for an object of knowledge, it constitutes a certain discipline.

The brief discussion above is sufficient to show that both the definition of and the relation between scientific fields depend on multiple dimensions (e.g., essentialist, social, and cognitive). Traditional (expert) classifications are mostly motivated by the “subject matters” under investigation and can be associated with an essentialist view. The empirical analysis of citation networks, an approach with a long tradition in scientometrics Garfield1964 ; DeSollaPrice1965 , can be regarded as capturing the social dimension (i.e. collecting institutions in the form of journals). While citations offer valuable insights into the structure and dynamics of science, they reflect only one particular dimension of the relationship between publications (or scientists), largely ignoring the actual content of the scientific articles. In contrast, the cognitive dimension can be operationalized with the help of linguistic features (e.g., keywords as indicators for conceptual imprints of disciplines). The increasing availability of the full text of scientific articles (e.g. of Open Access journals) provides new opportunities to study the latter aspect in the form of written language. Examples include i) the tracking of the spread of individual words (memes) kuhn.2014 or ideas chavalarias.2013 , ii) quantifying differences in the scientific discourse between subdomains in biomedical literature lippincott.2011 or “hard” and “soft” science evans.2016a , or iii) efforts to combine citation and textual information braam.1991 ; boerner.2003 ; vilhena.2014 ; silva.2016 ; sienkiewicz.2016 .

In this work we advance the idea that the organization and evolution of science should be studied through different, complementary dimensions. We add a new methodology that provides a meaningful, language-based organization of scientific disciplines based on written text, we study how it compares to classifications obtained from experts as well as citations, and we study the temporal evolution of the relation between different scientific disciplines. More specifically, we introduce an unsupervised methodology to analyze the text of scientific articles. Our methodology is based on an information-theoretic dissimilarity measure we proposed recently Gerlach2016 (more technically, it is a generalized and normalized Jensen-Shannon divergence between two corpora). The main advantage of this measure is that it has an absolute meaning (i.e., it is not based on relative comparisons) and it is statistically more robust than traditional approaches Gerlach2016 ; Altmann2017 , e.g. with respect to the detection of spurious trends due to rare words and increasing corpus sizes. We measure the similarity between scientific fields based on abstracts from the last three decades (Web of Science database). Comparing our language analysis to a citation analysis and an expert classification, we find that language and citations are more similar to each other than either is to the expert classification, and that language is even more distinct from the expert classification than citations are. Following the relation between scientific fields over time, our language analysis reveals the scientific fields that are becoming more central in science. However, overall (averaged over all pairs of disciplines) we find that the similarity between the language of different fields is not increasing.

II Dissimilarity measures of scientific fields

We are interested in the general problem boerner.2003 ; boyack.2005 of quantifying the relationship between two scientific fields $i$ and $j$ through the computation of dissimilarity measures $D(i,j)$, i.e., a quantification of how different $i$ and $j$ are. Dissimilarity measures are symmetric, $D(i,j) = D(j,i)$, non-negative, $D(i,j) \geq 0$, and satisfy $D(i,i) = 0$ webb.2002 . Each scientific field is defined by (at least hundreds of) papers classified by Web of Science as belonging to the same category (see Methods Sec. V.1 for details on the data). We consider dissimilarities computed from the following three different sources of information.

II.1 Experts

The classification of disciplines by their relationship is as old as science itself. The most used structure is a strict hierarchical tree, as seen in the traditional departmental division of universities. The collection of papers used here, provided by ISI Web of Science webOfScience , classifies papers according to the OECD classification of fields of science and technology bibliographicWoSClassification . This scheme is a hierarchical tree with scientific fields defined at 3 levels (domains, disciplines, and specialties). For instance, Applied Mathematics (a specialty) is part of Mathematics (a discipline) which is part of Natural Sciences (a domain). The natural dissimilarity measure $D_{\mathrm{exp}}(i,j)$ between two fields in this structure is the number of links needed to reach a common ancestor of $i$ and $j$. For instance, considering $i$ and $j$ at the specialty level, $D_{\mathrm{exp}}(i,j)$ can assume three different values: $D_{\mathrm{exp}} = 1$ for specialties belonging to the same discipline (e.g., Applied Mathematics and Statistics & Probability), $D_{\mathrm{exp}} = 2$ for specialties belonging to the same domain (e.g., Applied Mathematics and Condensed Matter Physics), and $D_{\mathrm{exp}} = 3$ for the other pairs of specialties (e.g., Applied Mathematics and Linguistics). While researchers have pointed out potential issues with the classification into categories of ISI Web of Science boyack.2005 , it offers the most extensively available classification and remains widely used to relate articles and journals to disciplines porter.2009 ; lariviere.book2014 .
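The tree-based dissimilarity can be illustrated with a minimal Python sketch. It assumes each specialty is stored with its (domain, discipline) labels; the names `TAXONOMY` and `expert_dissimilarity` are hypothetical, and the four-entry taxonomy is a toy excerpt built from the examples named in the text (the placement of Condensed Matter Physics and Linguistics is an assumption made for illustration).

```python
# Toy excerpt of a three-level tree: specialty -> (domain, discipline).
# Placements of the last two specialties are assumptions for illustration.
TAXONOMY = {
    "Applied Mathematics": ("Natural Sciences", "Mathematics"),
    "Statistics & Probability": ("Natural Sciences", "Mathematics"),
    "Condensed Matter Physics": ("Natural Sciences", "Physical Sciences"),
    "Linguistics": ("Humanities", "Languages and Literature"),
}


def expert_dissimilarity(a, b):
    """Number of links from a specialty up to the closest common ancestor."""
    if a == b:
        return 0
    domain_a, discipline_a = TAXONOMY[a]
    domain_b, discipline_b = TAXONOMY[b]
    if (domain_a, discipline_a) == (domain_b, discipline_b):
        return 1  # siblings: the shared discipline is one link up
    if domain_a == domain_b:
        return 2  # the shared domain is two links up
    return 3      # only the root of the tree is shared


print(expert_dissimilarity("Applied Mathematics", "Linguistics"))
```

Because the measure only counts tree levels, it is symmetric by construction and takes very few distinct values, which is one reason rank-based comparisons with the other measures are convenient.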

II.2 Citations

Another popular approach is to consider that fields $i$ and $j$ are more similar if there are citations from (to) papers in $i$ to (from) papers in $j$ Garfield1964 ; DeSollaPrice1965 ; boyack.2005 . Here we consider a dissimilarity measure $D_{\mathrm{cit}}(i,j)$ which decreases for every citation between papers in $i$ and $j$, increases with every citation from $i$ that is not to $j$ (and vice-versa), but remains unchanged by citations that involve neither $i$ nor $j$. These requirements are achieved using (for $i \neq j$) a symmetrized Jaccard-like dissimilarity webb.2002 ; leydesdorff.2008 , Eq. (1), where $c_{ij}$ is the number of citations from $i$ to $j$, combined with the total numbers of out-going and in-coming citations of each field. [Footnote 1: Each of the two terms in Eq. (1) can be interpreted as a directed Jaccard distance in the sense that we divide the number of edges that are out-links of field $i$ ($j$) and in-links of field $j$ ($i$) by the number of edges that are out-links of field $i$ ($j$) or in-links of field $j$ ($i$).]
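The verbal description of the directed Jaccard terms can be turned into a short Python sketch. It makes two assumptions that are not confirmed by the text: that the two directed terms are averaged (rather than, e.g., summed), and that self-citations count towards a field's out-links and in-links. The function names and the matrix layout `c[i][j]` (citations from field `i` to field `j`) are hypothetical.

```python
def directed_jaccard_distance(c, i, j):
    """1 - (out-links of i AND in-links of j) / (out-links of i OR in-links of j)."""
    out_i = sum(c[i])                 # all citations leaving field i
    in_j = sum(row[j] for row in c)   # all citations arriving at field j
    union = out_i + in_j - c[i][j]    # inclusion-exclusion on the two edge sets
    if union == 0:
        return 1.0                    # no citations at all: maximally distant
    return 1.0 - c[i][j] / union


def citation_dissimilarity(c, i, j):
    """Symmetrized version (assumed here to be the average), intended for i != j."""
    return 0.5 * (directed_jaccard_distance(c, i, j)
                  + directed_jaccard_distance(c, j, i))


# Toy citation matrix for three fields (values are made up for illustration).
toy = [[0, 2, 0],
       [2, 0, 0],
       [0, 0, 5]]
print(citation_dissimilarity(toy, 0, 1), citation_dissimilarity(toy, 0, 2))
```

Under these assumptions the sketch satisfies the three requirements stated above: a citation between the two fields lowers the value, a citation from one of them to a third field raises it, and citations among other fields leave it unchanged.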

Figure 1: Dissimilarity between specialties measured in three different dimensions: (a) expert classification bibliographicWoSClassification , $D_{\mathrm{exp}}$; (b) citations dissimilarity $D_{\mathrm{cit}}$, Eq. (1); (c) language dissimilarity $D_{\mathrm{lang}}$, Eq. (2). The specialties of the OECD classification scheme are considered; see Sec. V.1 for details on the data.

II.3 Language

We compare the language of fields $i$ and $j$ based on the frequency of words in each field using methods from Information Theory. Measuring the frequency $p_i(w)$ of word $w$, for each field $i$ we obtain a vector of frequencies $\mathbf{p}_i = (p_i(1), \ldots, p_i(V))$ for $w = 1, \ldots, V$, where $V$ is the size of the vocabulary (i.e. number of different words). From this, following Ref. Gerlach2016 , the dissimilarity between two fields $i$ and $j$ is

$$D_{\mathrm{lang}}(i,j) = \frac{H_\alpha\!\left(\frac{\mathbf{p}_i + \mathbf{p}_j}{2}\right) - \frac{1}{2} H_\alpha(\mathbf{p}_i) - \frac{1}{2} H_\alpha(\mathbf{p}_j)}{D_\alpha^{\max}(\mathbf{p}_i, \mathbf{p}_j)} \qquad (2)$$

where $H_\alpha$ is the generalized entropy of order $\alpha$ and the denominator ensures normalization (i.e., $0 \leq D_{\mathrm{lang}}(i,j) \leq 1$). In order to increase the discrimination power and to avoid statistical biases in our estimation, we removed a list of stop words and included only the most frequent words (see Methods Sec. V.3 for a justification). The dissimilarity (2) corresponds to a generalized (and normalized) Jensen-Shannon divergence which yields statistically robust estimations in texts Gerlach2016 ; Altmann2017 (for details and motivation, see Methods Sec. V.4).

The advantages of Eq. (2) are twofold. On the one hand, it is well-founded in Information Theory and its statistical properties (in terms of systematic and statistical errors) are well understood grosse.2002 ; Gerlach2016 , distinguishing it from other heuristic approaches. On the other hand, it has convenient properties: i) ; ii) it depends only on the papers contained in fields $i$ and $j$; and iii) it does not require training corpora. As a result, the measured distance between two fields, $D_{\mathrm{lang}}(i,j)$, has an absolute meaning. This is in contrast to alternative similarity measures boyack.2005 ; boerner.2003 , including machine-learning approaches (e.g., topic models Landauer2004 ; Boyack2011 ) based on (un-) supervised classification of documents into coherent subgroups. Here, the main limitations stem from the fact that either i) the division into subgroups is typically based on statistically significant differences in the usage of words between the different subgroups independent of the actual effect size, or ii) the resulting distance between two fields depends on all other fields as well (e.g. the distance between ’Physics’ and ’Chemistry’ depends on whether one includes articles about ’Anthropology’ in the classification).

III Results

We now present and interpret results obtained by computing the three dissimilarity measures ($D_{\mathrm{exp}}$, $D_{\mathrm{cit}}$, and $D_{\mathrm{lang}}$) reported above for scientific fields defined by papers published in different time intervals and categorized (by Web of Science) as belonging to the same specialty (e.g., Applied Mathematics), discipline (e.g., Mathematics), or domain (e.g., Natural Sciences).

III.1 Comparison of dissimilarity measures

Figure 1 shows the three dissimilarities $D(i,j)$ at the level of specialties for the complete time interval (1991-2014). The concentration of low values of $D$ close to the diagonal shows that both the citations and the language of scientific papers partially reflect the disciplinary classification done by the experts. However, visual inspection already reveals that citations and our language analysis show relationships not present in the expert classification, e.g., the low dissimilarity between Engineering and Natural Sciences (most clearly between Electrical Engineering and Physical Sciences) and between Agriculture and Biological Sciences.

We start by quantifying the relationship between the three different dissimilarity measures ($D_{\mathrm{exp}}$, $D_{\mathrm{cit}}$, and $D_{\mathrm{lang}}$) across all pairs of specialties $(i,j)$. In Tab. 1 we report the rank-correlation between the three measures, which we obtain by ranking, for each dissimilarity measure, the pairs $(i,j)$ according to $D(i,j)$. The choice of this non-parametric correlation is motivated by the fact that the range of the three measures differs dramatically. The positive, statistically significant correlation between all pairs of $D$'s confirms the visual impression described above. The correlation between citations and language is higher than the correlation of either with the expert classification. Remarkably, language and citations show a very similar correlation with experts, but language is systematically less correlated than citations (for both Spearman-$\rho$ and Kendall-$\tau$). [Footnote 2: Obtained from bootstrapping samples of each joint distribution, i.e. comparison of pairs of correlation values.] We conclude that the language dissimilarity introduced here is able to retrieve the well-known relationships between disciplines to a similar extent as the (well-studied) citation analysis.

Time   lang-cite   lang-exp   cite-exp
All, 1991-2014   ()   ()   ()
1st half, 1991-2002   ()   ()   ()
2nd half, 2003-2014   ()   ()   ()
Table 1: Rank correlation between the dissimilarity measures obtained from different dimensions, computed over all specialty pairs $(i,j)$. All values are significantly different from zero. The two values in each cell denote the Kendall-$\tau$ and Spearman-$\rho$ (in parentheses). Qualitatively equivalent results are obtained in three different time intervals (indicated in the left column).
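The two rank correlations reported in Table 1 can be sketched in plain Python. This is a minimal illustration, not the code used for the table: it assumes the tie-corrected variants (tie-averaged ranks for Spearman, tau-b for Kendall), which matter because the expert dissimilarity takes only a few distinct values. All function names and the toy inputs are hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    conc = disc = only_x_tied = only_y_tied = 0
    for a, b in combinations(range(len(x)), 2):
        dx, dy = x[a] - x[b], y[a] - y[b]
        if dx == 0 and dy == 0:
            continue                  # tied in both variables: ignored
        elif dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / sqrt((conc + disc + only_x_tied) * (conc + disc + only_y_tied))


# Toy dissimilarities of four field pairs under two measures (made-up values).
d_a = [1, 2, 3, 4]
d_b = [1, 3, 2, 4]
print(spearman_rho(d_a, d_b), kendall_tau_b(d_a, d_b))
```

Both functions depend only on the ordering of the values, so measures with very different ranges can be compared directly.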

We now explore how the relationship between the different dimensions depends on the different scientific fields. The results in Fig. 2 confirm the conclusions of the aggregated analysis but show further interesting features. First, the correlation between language and experts ($D_{\mathrm{lang}}$, $D_{\mathrm{exp}}$) is smaller than that between citations and experts ($D_{\mathrm{cit}}$, $D_{\mathrm{exp}}$) mainly in the natural sciences. Second, while the correlation between citations and language remains largely constant, large fluctuations in the correlations between experts and citations (as well as experts and language) exist. This is seen both in the strong downward spikes and in the manifest dependence on disciplines and domains. The titles of the specialties at the low peaks already suggest that these are specialties with interdisciplinary connections. For instance, Chemistry, Medicinal is a specialty that (according to the expert classification) belongs to the discipline Basic Medicine and to the domain Medical Science. Therefore $D_{\mathrm{exp}}$ takes its largest value between Chemistry, Medicinal and all specialties of the Natural Sciences (in particular, for all specialties from the discipline Chemical Sciences). Instead, the dissimilarities measured by citations and language yield much smaller values, revealing the proximity of Chemistry, Medicinal to the Natural Sciences and thus explaining the low correlations ($D_{\mathrm{lang}}$, $D_{\mathrm{exp}}$) and ($D_{\mathrm{cit}}$, $D_{\mathrm{exp}}$). The central role of the natural sciences in other disciplines also explains the other spikes: computing, for a list of selected specialties, the pairs which suffered the largest rank change, we find that 9 of the top 10 specialties which increased most in rank were from the domain Natural Sciences (some of them from the discipline Chemical Sciences, including the top 2 specialties).

Figure 2: Correlation between the different dissimilarity measures varies across fields. The Kendall correlation (shown on the vertical axis) for two measures $D_a$ and $D_b$ is computed between $D_a(i,j)$ and $D_b(i,j)$ over all specialties $j$ for a fixed specialty $i$ (shown on the horizontal axis). The three possible comparisons are indicated in the legend. Six specialties (one from each domain) with low correlation are highlighted.

Figure 3: Hierarchical clusterings at the level of domains (top row) and disciplines (bottom row). Results for citations (language) were obtained by agglomerative hierarchical clustering, applying the Group Average Method Sokal1958 to $D_{\mathrm{cit}}$ ($D_{\mathrm{lang}}$). The x-axis shows the clustering dissimilarity (i.e., the dissimilarity of two clusters that are merged). The colors reflect the clustering obtained at the dashed line, which corresponds to a clustering dissimilarity equal to the 0.92 percentile of the values of all cluster dissimilarities for each measure (citations/language).

III.2 Hierarchical Clustering

A strict hierarchical classification of scientific fields is both aesthetically appealing and of practical use in bibliographical and document classification tasks. It also allows us to further highlight the differences in the relationship between scientific fields revealed by the different dissimilarity measures (in particular by $D_{\mathrm{lang}}$). While $D_{\mathrm{exp}}$ is precisely based on one such hierarchical classification, $D_{\mathrm{cit}}$ and $D_{\mathrm{lang}}$ are not. In Fig. 3 we show the hierarchical classifications induced by $D_{\mathrm{cit}}$ and $D_{\mathrm{lang}}$ through the computation of a simple clustering method at the level of domains and disciplines.
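How a dissimilarity matrix induces a hierarchy can be sketched with a plain-Python version of agglomerative clustering with group-average linkage (the method named in the caption of Fig. 3). This is a naive illustration under stated assumptions, not the implementation behind the figure: the function name, the `frozenset`-keyed dictionary layout, and the three toy fields with their values are all hypothetical.

```python
from itertools import combinations


def group_average_clustering(labels, d):
    """Agglomerative clustering with group-average linkage.

    labels: list of field names.
    d: dict mapping frozenset({a, b}) -> dissimilarity of fields a and b.
    Returns the merges in order as (cluster_a, cluster_b, linkage_value).
    """
    clusters = [frozenset([name]) for name in labels]
    merges = []

    def linkage(ca, cb):
        # Mean dissimilarity over all pairs with one member in each cluster.
        total = sum(d[frozenset((a, b))] for a in ca for b in cb)
        return total / (len(ca) * len(cb))

    while len(clusters) > 1:
        # Merge the two clusters that are currently closest to each other.
        ca, cb = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((ca, cb, linkage(ca, cb)))
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca | cb]
    return merges


# Toy example with three hypothetical fields and made-up dissimilarities.
toy_d = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.6,
}
for left, right, value in group_average_clustering(["A", "B", "C"], toy_d):
    print(sorted(left), sorted(right), value)
```

The linkage values recorded at each merge play the role of the clustering dissimilarity on the x-axis of a dendrogram; cutting the list of merges at a chosen value yields a flat clustering.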

At the top level of the domains (top row in Fig. 3), the clustering obtained from citations and from language are very similar. In particular, both identify Engineering-Natural Sciences and Humanities-Social Science as clusters that separate from the other domains in a similar fashion. The only difference is that, based on citations, Agriculture appears more isolated while based on language this happens for Medical Science. A more detailed picture of the differences between language and citation is revealed at the level of disciplines (bottom row in Fig. 3). While at the first division, both citations and language create a cluster in which all disciplines of the domains Humanities and Social Sciences appear, further divisions show more subtle differences between the two dissimilarity measures.

Remarkably, the hierarchy obtained from language creates a cluster containing all and only Humanities disciplines. In contrast, the hierarchy based on citations creates one cluster with three of the five Humanities disciplines (Lang. and Literature, Arts, and Other Humanities), while the two remaining ones (History & Archaeology and Philosophy, ethics, religion) are clustered together in the middle of a cluster of disciplines in Social Science. Another interesting difference between the clusterings is revealed by looking at 3 disciplines of the domain Medicine: in the analysis based on citations, the minimum cluster that includes the three disciplines also includes Biological sciences and Other natural sciences, while in the language analysis this cluster additionally includes three related Engineering disciplines (Medical eng., Ind. biotechnology, and Envir. biotechnology).

Probably the most remarkable feature of the clustering obtained by both citations and language is that it repeatedly clusters together related disciplines from Natural Sciences with disciplines from Engineering and Medicine (e.g., Chemical Sciences and Materials Science). This clustering, not present in the expert classification, suggests that the distinction between fundamental and applied sciences present in the expert classification has no strong effect on citations and the language of the publications. Instead, in this specific case, the citation and language analyses seem to be capturing a connection between “subject matters” that was necessarily absent from the strict hierarchical expert classification.

Figure 4: Evolution of the similarity between disciplines in the last three decades. Left panel: distance $D_{\mathrm{lang}}$ between Physical Sciences and five other selected disciplines (three-year moving averages). Right panel: total variation, defined in Eq. (3), of the distance for pairs of disciplines with sufficiently long histories. Each boxplot corresponds to the distribution of the variation for pairs of disciplines where we fixed one of the disciplines. At position (a) we fixed Computer and information sciences, at (b) Chemical sciences, at (c) Psychology, and at (d) we used all pairs of disciplines.

III.3 Temporal evolution

While in the previous sections we looked at a static snapshot of the relation between disciplines, here we are interested in how the linguistic relationship between pairs of disciplines evolved over the last three decades. [Footnote 3: We work at the level of disciplines because most specialties fail to have enough publications in a single year.] In Figure 4 we show the temporal evolution of $D_{\mathrm{lang}}(i,j)$ for five selected pairs $(i,j)$, with focus on the discipline Physical Sciences, illustrating different types of dynamic patterns. On the one hand, the dissimilarities to Chemical Sciences (its most similar discipline) and Mathematics stay roughly constant over time. On the other hand, we also observe systematic trends of disciplines becoming more or less similar over time. While the proximity to Biological Sciences and Computer and information sciences has steadily increased (decreased dissimilarity $D_{\mathrm{lang}}$) in the later years of the period, the opposite trend is seen for Electrical, electronic, and information engineering. These observations are consistent with the increasing number of biological and computational-related publications in Physics, and with a departure from the historical connections to Engineering.

The observations reported above raise the question of whether scientific disciplines are showing an overall tendency to become more similar to each other. In a more general context, this amounts to the question of whether the purported increase in interdisciplinarity leads to a larger overlap in the language used by different disciplines. We address this question by computing, for each pair of disciplines, the mean yearly variation of $D_{\mathrm{lang}}$, Eq. (3), where the time interval was usually from 1991 to 2014. The distribution of its values for all discipline pairs is shown in the (rightmost) box plot in the right panel of Fig. 4. We see that there are both positive and negative variations, consistent with our qualitative observations in the example of Physical Sciences in the left panel of Fig. 4. However, the average variation over all pairs of disciplines is not distinguishable from zero (according to both the one-sample t-test for the mean and the non-parametric Wilcoxon test), i.e. the typical dissimilarity remains unchanged. This result suggests that, while there are systematic trends for individual pairs of disciplines, on average there is no significant increase or decrease in interdisciplinarity for science as a whole in the last three decades, as measured by the language.
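The two steps of this analysis (a per-pair yearly variation, then a test of whether its mean over pairs differs from zero) can be sketched as follows. The exact form of Eq. (3) is not reproduced here, so the sketch assumes the simplest reading: the difference between the last and first available values divided by the elapsed years. The function names and toy numbers are hypothetical, and only the t statistic is computed, not its p-value.

```python
from math import sqrt
from statistics import mean, stdev


def mean_yearly_variation(series):
    """series: dict year -> dissimilarity of one pair of disciplines.

    Assumed definition: (last value - first value) / number of elapsed years.
    """
    first_year, last_year = min(series), max(series)
    return (series[last_year] - series[first_year]) / (last_year - first_year)


def one_sample_t(values, mu0=0.0):
    """t statistic for the null hypothesis that the mean of `values` is mu0."""
    return (mean(values) - mu0) / (stdev(values) / sqrt(len(values)))


# Made-up series for three hypothetical pairs of disciplines.
pairs = [
    {1991: 0.50, 2014: 0.27},   # becoming more similar
    {1991: 0.40, 2014: 0.63},   # becoming less similar
    {1991: 0.55, 2014: 0.55},   # unchanged
]
variations = [mean_yearly_variation(s) for s in pairs]
print(variations, one_sample_t(variations))
```

In practice the p-values would come from a statistics library; for instance, SciPy provides `scipy.stats.ttest_1samp` and `scipy.stats.wilcoxon` for the two tests named in the text.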

On a more fine-grained level, however, we observe systematic trends that suggest that individual disciplines tend to become more (less) central. For this, we focus on the discipline pairs which experienced the most extreme variation in the last decade (one standard deviation away from the mean). These pairs have typically meaning that their (normalized) dissimilarity changes roughly in a decade. The three disciplines that are most frequently seen in the left tail (decreasing dissimilarity) are: 1-02 Computer and information sciences, 2-08 Environmental biotechnology, and 3-01 Basic medicine. The language of these disciplines became significantly more similar to the language of other disciplines in the last three decades, suggesting that these disciplines became more central. In contrast, the three disciplines that experienced most strongly the opposite effect (most frequently seen in the right tail, increasing dissimilarity) are: 5-01 Psychology, 2-05 Materials engineering, and 2-02 Electrical engineering, electronic engineering, information engineering.

IV Discussion

We investigated the similarity between scientific fields from different perspectives: an expert classification, a citation analysis, and a newly proposed measure of linguistic similarity. We found that these different dimensions are related yet different, thus yielding new insights into the relationship between disciplines, their hierarchical organization, and their temporal evolution.

Our first main finding is that the language and citation relationships between disciplines are similar to each other and substantially different from the expert classification. This is consistent with the motivation laid out in our introduction, which associated the expert classification with the (largely idealized) essentialist view of scientific disciplines, while the citation (social) and language (cognitive) analyses were closer to dimensions that play a more important role in the relationship between fields. Interestingly, our results indicate that the language-relation of fields is more distinct from the expert classification than the citation-relation is, especially in the natural sciences.

Our second main finding is that in the last 30 years the language of different scientific fields has remained, on average, at the same distance from all other fields. While individual disciplines show clear trends of increasing (or decreasing) centrality, this suggests that, overall, diverging tendencies in science (e.g., specialization) are in balance with converging tendencies (e.g., multidisciplinarity). This is a remarkable quantitative finding because of the substantial changes observed in this period.

The latter result demonstrates that our textual measure is of practical relevance for the study of interdisciplinarity. In recent years, interdisciplinary research achieved a central position Noorden2015 due to its broader relation to the concept of diversity stirling.2007 and its effect on impact uzzi.2013 ; wang.2015 and performance of teams lungeanu.2014 as well as its implications for policy making, e.g. in terms of funding academy.book2004 . Is it just a fashion, or is science really becoming more and more interdisciplinary? A usual way to assess interdisciplinarity is based on citation networks using heuristic approaches porter.2009 ; wagner.2011 ; lariviere.book2014 or methods from complex networks pan.2012 ; sayama.2012 ; sinatra.2015 ; omodei.2016 . In line with the arguments laid out in the introduction, interdisciplinarity can be viewed through different dimensions, and the cognitive dimension would be best measured using textual data. However, there are only very few works bache.2013 ; nichols.2014 ; evans.2016 relating textual measures to interdisciplinarity, despite the increasing availability of the text of scientific articles. In this view, the significance of our approach is that it provides a measure of interdisciplinarity based on how much the usage of words in different disciplines overlaps.

Finally, we hope our results and methodology will stimulate a multiple-dimensional approach in other problems related to the study of sciences, profiting from the modern availability of large (textual) databases of scientific publications that allow us to go beyond traditional bibliometric analysis evans.2011 ; lariviere.book2014 . These include, but are not limited to, the formulation of more meaningful bibliometric indicators mann.2006 , the identification and prediction of influential papers and disciplines gerrish.2010 ; foulds.2013 ; whalen.2015 , or the inclusion of textual information in recommending related scientific papers achakulvisut.2016 .

V Materials and Methods

V.1 Data and grouping of corpora

We use the Web of Science database webOfScience and explore the following information available for individual articles: citations, title, abstract, and the classification in one scientific specialty (per OECD classification bibliographicWoSClassification ). We use all papers published between 1991 and 2014 because the number of articles with text in the abstract is substantial only after 1991 and because, at the time we started our analysis, 2014 was the last complete year available to us. The text of an article was built by concatenating its title and abstract. The corpus representing a specialty in a given year is obtained from the concatenation of the text of all articles for that specialty in that year. The corpus for one discipline (or domain) concatenates all articles in all specialties belonging to that discipline (or domain).

Our analysis is based on the articles for which the textual and classification information were available (a subset of all articles indexed in Web of Science between 1991 and 2014). In our analysis we considered only citations from and to the papers in our list because only for these papers did we have a reliable classification of specialties. These citations corresponded to roughly half of the citations associated with these papers.

V.2 Data processing

For each article in our database we performed the following steps to process the textual information:

  1. The copyright information contained in the abstract was removed.

  2. Title and abstract were concatenated.

  3. The text was converted to lowercase.

  4. Contractions were replaced by their non-contracted form.

  5. The text was tokenized, and the nouns and verbs were lemmatized using the Natural Language Toolkit nltk .

  6. Symbols (except hyphens, to avoid removing significant compound modifiers) inside tokens were replaced by white space, therefore generating two or more distinct tokens.

  7. Tokens consisting only of numbers or of a single letter were removed.

  8. Tokens belonging to a preset stop-word list were discarded.
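The steps above can be sketched as a small Python function. This is a simplified illustration under several assumptions: it uses only the standard library, so the copyright removal of step 1 and the NLTK lemmatization of step 5 are omitted; tokenization is reduced to splitting on white space; and the stop-word and contraction lists are tiny hypothetical excerpts rather than the lists actually used.

```python
import re

STOP_WORDS = {"the", "of", "and", "a", "in", "we", "is", "do", "not"}  # hypothetical
CONTRACTIONS = {"don't": "do not", "it's": "it is"}                   # hypothetical


def preprocess(title, abstract):
    """Return the list of tokens kept for one article (simplified pipeline)."""
    text = f"{title} {abstract}".lower()              # steps 2 and 3
    for short, full in CONTRACTIONS.items():          # step 4
        text = text.replace(short, full)
    tokens = []
    for raw in text.split():                          # step 5, without lemmatization
        # Step 6: every symbol except the hyphen becomes white space,
        # which may split one raw token into several.
        tokens.extend(re.sub(r"[^\w\-]|_", " ", raw).split())
    # Step 7: drop pure numbers and single-character tokens.
    tokens = [t for t in tokens if not t.isdigit() and len(t) > 1]
    # Step 8: drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("Spin-orbit coupling",
                 "We don't measure 42 samples in H2O/D2O (a test)."))
```

Keeping the hyphen in step 6 preserves compound modifiers such as "spin-orbit" as single tokens, while a slash or parenthesis splits or trims the surrounding token.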

V.3 Minimum corpus size

We computed $D_{\mathrm{lang}}$ using only the most frequent word types, disregarding the scientific fields for which there was not enough data to achieve this cut-off. This choice is motivated by the slow convergence of entropy estimations (and thus of $D_{\mathrm{lang}}$) Gerlach2016 . By choosing a fixed number of word types we reduce the effect of the remaining bias (in the estimation of $D_{\mathrm{lang}}$) on our comparative analysis of textual dissimilarity between pairs of fields. This happens because the residual bias acts as an offset in all cases (when a fixed cut-off is chosen) instead of affecting each case differently (as would happen if the maximum amount of data were used in each case). The bias decays with the number of word types used because the more frequent types are responsible for almost all the dissimilarity Altmann2017 . Using types as a cut-off, we estimated the textual dissimilarity relative standard deviation, computed over multiple samples of the same scientific field, to be . Our cut-off of types is a conservative choice to ensure that .

v.4 Generalized Jensen-Shannon Divergence

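Given two texts (indexed by $p$ and $q$), we define the probability distributions over all words $i = 1, \ldots, V$ as $p = (p_1, \ldots, p_V)$ and $q = (q_1, \ldots, q_V)$. An information-theoretic measure to quantify their similarity is the generalized Jensen-Shannon divergence

$$D_\alpha(p, q) = H_\alpha\!\left(\frac{p + q}{2}\right) - \frac{1}{2} H_\alpha(p) - \frac{1}{2} H_\alpha(q), \tag{5}$$

based on the generalized entropy of order $\alpha$, where

$$H_\alpha(p) = \frac{1}{1 - \alpha} \left( \sum_{i} p_i^{\alpha} - 1 \right). \tag{6}$$

Here, we consider a normalized similarity Gerlach2016

$$\tilde{D}_\alpha(p, q) = \frac{D_\alpha(p, q)}{D_\alpha^{\max}(p, q)}, \tag{7}$$

such that $\tilde{D}_\alpha \in [0, 1]$, where $D_\alpha^{\max}$ is the maximum possible $D_\alpha$ between $p$ and $q$, obtained by assuming that the sets of symbols in each distribution (i.e., the supports of $p$ and $q$) are disjoint.

Note that for $\alpha = 1$, Eq. (6) yields the Shannon entropy cover2006 , i.e. $H_{\alpha=1}(p) = -\sum_i p_i \log p_i$, and $D_{\alpha=1}$ is the well-known Jensen-Shannon divergence lin1991 . Ref. Gerlach2016 identifies the order $\alpha$ that provides the most robust statistical measure of similarity of texts.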
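As a minimal sketch, the normalized divergence can be computed directly from its verbal definition. The code assumes the standard order-$\alpha$ entropy $H_\alpha(p) = (\sum_i p_i^\alpha - 1)/(1 - \alpha)$ with the Shannon limit at $\alpha = 1$, and takes both distributions as lists aligned over the same vocabulary. Instead of a closed-form maximum, it evaluates the divergence on copies of the two distributions placed on disjoint supports. The function names are illustrative.

```python
import math


def entropy(p, alpha):
    """Generalized entropy of order alpha; alpha == 1 gives the Shannon entropy."""
    if alpha == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** alpha for x in p if x > 0) - 1) / (1 - alpha)


def jsd(p, q, alpha):
    """Generalized Jensen-Shannon divergence between aligned distributions p and q."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mix, alpha) - 0.5 * entropy(p, alpha) - 0.5 * entropy(q, alpha)


def normalized_jsd(p, q, alpha):
    """Divergence divided by its maximum, reached when the supports are disjoint."""
    p_disjoint = list(p) + [0.0] * len(q)   # p occupies the first block of symbols
    q_disjoint = [0.0] * len(p) + list(q)   # q occupies a separate block
    return jsd(p, q, alpha) / jsd(p_disjoint, q_disjoint, alpha)


same = normalized_jsd([0.5, 0.5], [0.5, 0.5], alpha=2)      # identical texts
disjoint = normalized_jsd([1.0, 0.0], [0.0, 1.0], alpha=2)  # no shared words
```

Identical distributions give a value of 0 and distributions without shared words give 1, matching the stated range of the normalized measure.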


L.D. received financial support from CNPq/Brazil through the program “Science without Borders”. We thank M. Palzenberger and the Max Planck Digital Library for providing access to the data, M. de Domenico for insightful discussions, and S. Haan and the Centre for Translational Data Science (University of Sydney) for helping with Figs. 1 and 3.


  • (1) J. A. Evans, J. G. Foster, Science 331, 721 (2011).
  • (2) K. Börner, C. Chen, K. W. Boyack, Annual review of information science and technology 37, 179 (2003).
  • (3) R. M. Shiffrin, K. Börner, Proceedings of the National Academy of Sciences 101, 5183 (2004).
  • (4) K. W. Boyack, R. Klavans, K. Börner, Scientometrics 64, 351 (2005).
  • (5) M. Rosvall, C. T. Bergstrom, Proceedings of the National Academy of Sciences 105, 1118 (2008).
  • (6) J. Gläser, W. Glänzel, A. Scharnhorst, Scientometrics 111, 979 (2017).
  • (7) D. Wang, C. Song, A.-L. Barabási, Science 342, 127 (2013).
  • (8) J. A. G. Moreira, X. H. T. Zeng, L. A. N. Amaral, PLoS ONE 10, e0143108 (2015).
  • (9) V. Larivière, Y. Gingras, Beyond Bibliometrics (MIT Press, 2014).
  • (10) R. V. Noorden, Nature 525, 306 (2015).
  • (11) K. R. Popper, The British Journal for the Philosophy of Science 3, 124 (1952).
  • (12) P. W. Balsiger, Transdisziplinarität : systematisch-vergleichende Untersuchung disziplinenübergreifender Wissenschaftspraxis (Fink, 2005).
  • (13) M. Guntau, H. Laitko, World Views and Scientific Discipline Formation, R. W. Woodward, R. S. Cohen, eds. (Springer Netherlands, 1991).
  • (14) E. Garfield, I. H. Sher, R. J. Torpie, The use of citation data in writing the history of science (Institute for Scientific Information, Philadelphia, 1964).
  • (15) D. J. de Solla Price, Science 149, 510 (1965).
  • (16) T. Kuhn, M. Perc, D. Helbing, Physical Review X 4, 041036 (2014).
  • (17) D. Chavalarias, J.-P. Cointet, PLoS ONE 8, e54847 (2013).
  • (18) T. Lippincott, D. Ó. Séaghdha, A. Korhonen, BMC bioinformatics 12, 212 (2011).
  • (19) E. Evans, C. Gomez, D. McFarland, Sociological Science 3, 757 (2016).
  • (20) R. R. Braam, H. F. Moed, A. F. J. van Raan, Journal of the American Society for Information Science 42, 233 (1991).
  • (21) D. Vilhena, et al., Sociological Science 1, 221 (2014).
  • (22) F. N. Silva, D. R. Amancio, M. Bardosova, L. d. F. Costa, O. N. Oliveira, Journal of Informetrics 10, 487 (2016).
  • (23) J. Sienkiewicz, E. G. Altmann, Royal Society Open Science 3, 160140 (2016).
  • (24) M. Gerlach, F. Font-Clos, E. G. Altmann, Physical Review X 6, 021009 (2016).
  • (25) E. G. Altmann, L. Dias, M. Gerlach, Journal of Statistical Mechanics: Theory and Experiment 2017, 014002 (2017).
  • (26) A. Webb, Statistical Pattern Recognition (Wiley, 2002).
  • (27) Web of Science is a product of Thomson Reuters.
  • (28) Working Party of National Experts on Science and Technology, OECD (2006) available at
  • (29) A. L. Porter, I. Rafols, Scientometrics 81, 719 (2009).
  • (30) L. Leydesdorff, Journal of the American Society for Information Science and Technology 59, 77 (2008).
  • (31) Each of the two terms in Eq. (1) can be interpreted as a directed Jaccard distance () in the sense that we divide the number of edges that are out-links of field () and in-links of field () by the number of edges that are out-links of field () or in-links of field ().
  • (32) I. Grosse, et al., Physical Review E 65, 041905 (2002).
  • (33) T. K. Landauer, D. Laham, M. Derr, Proceedings of the National Academy of Sciences 101 Suppl, 5214 (2004).
  • (34) K. W. Boyack, et al., PLoS One 6, e18029 (2011).
  • (35) Obtained from bootstrapping samples of each joint distribution and , i.e. comparison of pairs of correlation values.
  • (36) R. Sokal, C. Michener, University of Kansas Science Bulletin 38, 1409 (1958).
  • (37) We work at the level of disciplines because most specialties fail to have enough publications in a single year.
  • (38) A. Stirling, Journal of The Royal Society Interface 4, 707 (2007).
  • (39) B. Uzzi, S. Mukherjee, M. Stringer, B. Jones, Science 342, 468 (2013).
  • (40) J. Wang, B. Thijs, W. Glänzel, PLoS ONE 10, e0127298 (2015).
  • (41) A. Lungeanu, Y. Huang, N. S. Contractor, Journal of Informetrics 8, 59 (2014).
  • (42) Committee on Facilitating Interdisciplinary Research; Committee on Science, Engineering, P. P. I. of Medicine; Policy, G. A. N. A. of Sciences; National Academy of Engineering, Facilitating Interdisciplinary Research (National Academies Press, 2004).
  • (43) C. S. Wagner, et al., Journal of Informetrics 5, 14 (2011).
  • (44) R. K. Pan, S. Sinha, K. Kaski, J. Saramäki, Scientific Reports 2, 1 (2012).
  • (45) H. Sayama, J. Akaishi, PLoS ONE 7, e38747 (2012).
  • (46) R. Sinatra, P. Deville, M. Szell, D. Wang, A.-L. Barabási, Nature Physics 11, 791 (2015).
  • (47) E. Omodei, M. D. Domenico, A. Arenas, Network Science pp. 1–12 (2016).
  • (48) K. Bache, D. Newman, P. Smyth, Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’13 (ACM Press, 2013).
  • (49) L. G. Nichols, Scientometrics 100, 741 (2014).
  • (50) E. D. Evans, Socius: Sociological Research for a Dynamic World 2 (2016).
  • (51) G. S. Mann, D. Mimno, A. McCallum, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’06 (ACM Press, 2006).
  • (52) S. Gerrish, D. M. Blei, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel (2010), pp. 375–382.
  • (53) J. Foulds, P. Smyth, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013), pp. 113–123.
  • (54) R. Whalen, Y. Huang, A. Sawant, B. Uzzi, N. Contractor, Quantifying and Analysing Scholarly Communication on the Web (ASCW’15) (2015).
  • (55) T. Achakulvisut, D. E. Acuna, T. Ruangrong, K. Kording, PLoS ONE 11, e0158423 (2016).
  • (56) Natural language toolkit,
  • (57) T. M. Cover, J. A. Thomas, Elements of Information Theory (Wiley-Interscience, 2006).
  • (58) J. Lin, IEEE Transactions on Information Theory 37, 145 (1991).