I Introduction
The digitization of scientific production opens new possibilities for quantitative studies in scientometrics and the science of science evans.2011 , bringing new insights into questions such as how knowledge is organized (maps of science) boerner.2003 ; Shiffrin2004 ; boyack.2005 ; Rosvall2008 ; glaser.2017 , how impact evolves over time (bibliometrics) Wang2013 ; moreira.2015 , or how to measure the degree of interdisciplinarity lariviere.book2014 ; Noorden2015 . At the heart of these questions lie the problems of identifying scientific fields and determining how they relate to each other. The difficulty of these problems, and the inadequacy of a purely essentialist approach, was already clear to K. R. Popper in the 1950s popper.1952 : “The belief that there is such a thing as physics, or biology, or archaeology, and that these ‘studies’ or ‘disciplines’ are distinguishable by the subject matter which they investigate, appears to me to be a residue from the time when one believed that a theory had to proceed from a definition of its own subject matter. But subject matter, or kinds of things, do not, I hold, constitute a basis for distinguishing disciplines.” popper.1952 . Instead, he argued that disciplines have a cognitive and a social dimension balsiger.book2005 , i.e. they “are distinguished partly for historical reasons and reasons of administrative convenience (such as the organization of teaching and of appointments), and partly because the theories which we construct to solve our problems have a tendency to grow into unified systems.” popper.1952 .
On the one hand, the social dimension of scientific fields can be defined in terms of different institutions establishing stable recurring patterns of behavior guntau.book1991 : producing and reproducing institutions such as research institutes and universities, communicative institutions such as scientific societies, journals or conferences, collecting institutions (journals, libraries), as well as directing institutions (ministries, scientific advisory boards), etc. All these institutions contribute to the formation, stabilization, and reproduction of a discipline as well as its distinction from others. On the other hand, the cognitive dimension has been specified in Ref. guntau.book1991 as a number of fundamental invariants in the procedural knowledge, which lead to the categorical construction of scientific knowledge. If this process causes a change in the cognitive realm for an object of knowledge, it constitutes a certain discipline.
The brief discussion above is sufficient to show that both the definition of scientific fields and the relation between them depend on multiple dimensions (e.g., essentialist, social, and cognitive). Traditional (expert) classifications are mostly motivated by the “subject matters” under investigation and can be associated with an essentialist view. The empirical analysis of citation networks, an approach with a long tradition in scientometrics Garfield1964 ; DeSollaPrice1965 , can be regarded as capturing the social dimension (i.e. collecting institutions in the form of journals). While citations offer valuable insights into the structure and dynamics of science, they reflect only one particular dimension of the relationship between publications (or scientists), largely ignoring the actual content of the scientific articles. In contrast, the cognitive dimension can be operationalized with the help of linguistic features (e.g., keywords as indicators for conceptual imprints of disciplines). The increasing availability of the full text of scientific articles (e.g. of Open Access journals) provides new opportunities to study the latter aspect in the form of written language. Examples include i) the tracking of the spread of individual words (memes) kuhn.2014 or ideas chavalarias.2013 , ii) quantifying differences in the scientific discourse between subdomains in biomedical literature lippincott.2011 or between “hard” and “soft” science evans.2016a , or iii) efforts to combine citation and textual information braam.1991 ; boerner.2003 ; vilhena.2014 ; silva.2016 ; sienkiewicz.2016 .
In this work we advance the idea that the organization and evolution of science should be studied through different, complementary dimensions. We add a new methodology that provides a meaningful, language-based organization of scientific disciplines from written text, we study how it compares to classifications obtained from experts as well as from citations, and we study the temporal evolution of the relation between different scientific disciplines. More specifically, we introduce an unsupervised methodology to analyze the text of scientific articles. Our methodology is based on an information-theoretic dissimilarity measure we proposed recently Gerlach2016 (more technically, it is a generalized and normalized Jensen-Shannon divergence between two corpora). The main advantage of this measure is that it has an absolute meaning (i.e., it is not based on relative comparisons) and that it is statistically more robust than traditional approaches Gerlach2016 ; Altmann2017 , e.g. with respect to the detection of spurious trends due to rare words and increasing corpus sizes. We measure the similarity between scientific fields based on abstracts from the last 3 decades (Web of Science database). Comparing our language analysis to a citation analysis and an expert classification, we find that the language and citation analyses are more similar to each other than either is to the expert classification, and that the language analysis is even more distinct from the expert classification than the citation analysis is. Following the relation between scientific fields over time, our language analysis reveals the scientific fields that are becoming more central in science. However, overall (averaged over all pairs of disciplines) we find that the similarity between the language of different fields is not increasing.
II Dissimilarity measures of scientific fields
We are interested in the general problem boerner.2003 ; boyack.2005 of quantifying the relationship between two scientific fields $i$ and $j$ through the computation of dissimilarity measures $D(i,j)$, i.e., a quantification of how different $i$ and $j$ are. Dissimilarity measures are symmetric, $D(i,j) = D(j,i)$, non-negative, $D(i,j) \geq 0$, and satisfy $D(i,i) = 0$ webb.2002 . Each scientific field is defined by (at least hundreds of) papers classified by Web of Science as belonging to the same category (see Methods Sec. V.1 for details on the data). We consider dissimilarities computed from the following three sources of information.

II.1 Experts
The classification of disciplines by their relationship is as old as science itself.
The most used structure is a strict hierarchical tree, as seen in the traditional departmental division of Universities.
The collection of papers used here, provided by ISI Web of Science webOfScience , classifies papers according to the OECD classification of fields of science and technology bibliographicWoSClassification .
This scheme is a hierarchical tree with scientific fields defined at 3 levels (domains, disciplines, and specialties).
For instance, Applied Mathematics (a specialty) is part of Mathematics (a discipline) which is part of Natural Sciences (a domain).
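The natural dissimilarity measure $D_{\text{exp}}(i,j)$ between two fields in this structure is the number of links needed to reach a common ancestor of $i$ and $j$.
For instance, at the specialty level, $D_{\text{exp}}$ can assume three different values: $D_{\text{exp}} = 1$ for specialties belonging to the same discipline (e.g., Applied Mathematics and Statistics & Probability), $D_{\text{exp}} = 2$ for specialties in different disciplines of the same domain, and $D_{\text{exp}} = 3$ for specialties in different domains.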
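The tree distance described above can be sketched in a few lines of Python. This is only an illustration: the dictionary `TREE`, the function name `expert_dissimilarity`, and the entry "Chemistry, Organic" are assumptions made for the example, and links are assumed to be counted from a specialty up to the first node shared with the other specialty (which gives the values 1, 2, or 3).

```python
# Hypothetical excerpt of the three-level tree: specialty -> (discipline, domain).
# "Chemistry, Organic" is an illustrative placeholder entry.
TREE = {
    "Applied Mathematics": ("Mathematics", "Natural Sciences"),
    "Statistics & Probability": ("Mathematics", "Natural Sciences"),
    "Chemistry, Organic": ("Chemical Sciences", "Natural Sciences"),
    "Chemistry, Medicinal": ("Basic Medicine", "Medical Science"),
}


def expert_dissimilarity(a, b, tree=TREE):
    """Links from a specialty up to the first node shared with the other one."""
    if a == b:
        return 0
    disc_a, dom_a = tree[a]
    disc_b, dom_b = tree[b]
    if disc_a == disc_b:
        return 1  # same discipline
    if dom_a == dom_b:
        return 2  # same domain, different disciplines
    return 3      # only the root of the tree is shared
```

Because the measure takes only a handful of integer values, it produces many ties, which is one reason to compare it with the other measures through rank-based statistics.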
II.2 Citations
Another popular approach is to consider that fields $i$ and $j$ are more similar if there are citations from (to) papers in $i$ to (from) papers in $j$ Garfield1964 ; DeSollaPrice1965 ; boyack.2005 . Here we consider a dissimilarity measure $D_{\text{cite}}(i,j)$ which decreases with every citation between papers in $i$ and $j$, increases with every citation from $i$ that is not to $j$ (and vice-versa), and remains unchanged by citations that involve neither $i$ nor $j$. These requirements are achieved using (for $i \neq j$) a symmetrized Jaccard-like dissimilarity webb.2002 ; leydesdorff.2008
(1) |
where $c_{ij}$ is the number of citations from $i$ to $j$, , and . Each of the two terms in Eq. (1) can be interpreted as a directed Jaccard distance, in the sense that we divide the number of edges that are out-links of field $i$ ($j$) and in-links of field $j$ ($i$) by the number of edges that are out-links of field $i$ ($j$) or in-links of field $j$ ($i$).
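To make the three stated requirements concrete, the following Python sketch implements one reading of a symmetrized, directed Jaccard dissimilarity. It is not a transcription of Eq. (1): the function names, the nested-list citation matrix `c` (with `c[i][j]` the number of citations from field `i` to field `j`), the inclusion of within-field citations in the out- and in-link totals, and the averaging of the two directed terms are all assumptions made for illustration.

```python
def directed_jaccard(c, i, j):
    """Share of edges that are out-links of i AND in-links of j
    among edges that are out-links of i OR in-links of j."""
    out_i = sum(c[i])                   # all citations made by field i
    in_j = sum(row[j] for row in c)     # all citations received by field j
    union = out_i + in_j - c[i][j]      # c[i][j] is counted in both totals
    return c[i][j] / union if union > 0 else 0.0


def citation_dissimilarity(c, i, j):
    """Assumed form: one minus the average of the two directed terms."""
    return 1.0 - 0.5 * (directed_jaccard(c, i, j) + directed_jaccard(c, j, i))
```

Under these assumptions the value lies between 0 and 1, decreases when `c[i][j]` or `c[j][i]` grows, increases when field `i` cites a third field, and does not change when only citations among other fields change.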
II.3 Language
We compare the language of fields $i$ and $j$ based on the frequency of words in each field using methods from Information Theory. Measuring the frequency $p_i(w)$ of word $w$ for each field $i$, we obtain a vector of frequencies $\mathbf{p}_i = (p_i(1), \ldots, p_i(V))$ for $w = 1, \ldots, V$, where $V$ is the size of the vocabulary (i.e. the number of different words). From this, following Ref. Gerlach2016 , the dissimilarity between two fields $i$ and $j$ is

$$D_{\text{lang}}(i,j) = \frac{H_\alpha\!\left(\frac{\mathbf{p}_i + \mathbf{p}_j}{2}\right) - \frac{1}{2} H_\alpha(\mathbf{p}_i) - \frac{1}{2} H_\alpha(\mathbf{p}_j)}{D_\alpha^{\max}(\mathbf{p}_i, \mathbf{p}_j)} \tag{2}$$

where $H_\alpha$ is the generalized entropy of order $\alpha$ and the denominator ensures normalization (i.e., $0 \leq D_{\text{lang}}(i,j) \leq 1$). In order to increase the discrimination power and to avoid statistical biases in our estimation, we removed a list of stop words and included only the most frequent words (see Methods Sec. V.3 for a justification). The dissimilarity (2) corresponds to a generalized (and normalized) Jensen-Shannon divergence which yields statistically robust estimations in texts Gerlach2016 ; Altmann2017 (for details and motivation, see Methods Sec. V.4).

The advantages of Eq. (2) are twofold. On the one hand, it is well-founded in Information Theory and its statistical properties (in terms of systematic and statistical errors) are well understood grosse.2002 ; Gerlach2016 , distinguishing it from other heuristic approaches. On the other hand, it has convenient properties: i) ; ii) it depends only on the papers contained in fields $i$ and $j$; and iii) it does not require training corpora. As a result, the measured distance between two fields, $D_{\text{lang}}(i,j)$, has an absolute meaning. This is in contrast to alternative similarity measures boyack.2005 ; boerner.2003 , including machine-learning approaches (e.g., topic models Landauer2004 ; Boyack2011 ) based on (un)supervised classification of documents into coherent subgroups. Here, the main limitations stem from the fact that either i) the division into subgroups is typically based on statistically significant differences in the usage of words between the different subgroups, independent of the actual effect size, or ii) the resulting distance between two fields depends on all other fields as well (e.g. the distance between ‘Physics’ and ‘Chemistry’ depends on whether one includes articles about ‘Anthropology’ in the classification).

III Results
We now present and interpret results obtained by computing the three dissimilarity measures ($D_{\text{exp}}$, $D_{\text{cite}}$, and $D_{\text{lang}}$) reported above for scientific fields defined by papers published in different time intervals and categorized (by Web of Science) as belonging to the same specialty (e.g., Applied Mathematics), discipline (e.g., Mathematics), or domain (e.g., Natural Sciences).
III.1 Comparison of dissimilarity measures
Figure 1 shows the three dissimilarities $D$ at the level of specialties for the complete time interval 1991-2014. The concentration of low $D$ close to the diagonal shows that both the citations and the language of scientific papers partially reflect the disciplinary classification done by the experts. However, visual inspection already reveals that citations and our language analysis show relationships not present in the expert classification, e.g., the low dissimilarity between Engineering and Natural Sciences (most clearly between Electrical Engineering and Physical Sciences) and between Agriculture and Biological Sciences.
We start by quantifying the relationship between the three different dissimilarity measures, i.e. ($D_{\text{exp}}$, $D_{\text{cite}}$, and $D_{\text{lang}}$), across all pairs of specialties $(i,j)$.
In Tab. 1 we report the rank correlation between the three measures, which we obtain by ranking, for each dissimilarity $D$, the pairs $(i,j)$ according to $D(i,j)$.
The choice of this non-parametric correlation is motivated by the fact that the range of the three measures differs dramatically (e.g. and ).
The positive, statistically significant correlation between all pairs of $D$’s confirms the visual impression described above.
The correlation between citations and language is higher than the correlation of either with the experts classification. Remarkably, language and citations show a very similar correlation with experts, but language is systematically less correlated than citations ( for Spearman-$\rho$ and for Kendall-$\tau$; obtained from bootstrapping samples of each joint distribution).
Time | lang-cite | lang-exp | cite-exp |
---|---|---|---|
All, 1991-2014 | () | () | () |
1st half, 1991-2002 | () | () | () |
2nd half, 2003-2014 | () | () | () |
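The rank-correlation comparison can be sketched as follows, using only the Python standard library. The function names, the number of bootstrap replicas, and the choice to resample specialty pairs jointly for all three measures are assumptions made for illustration; the bootstrap actually used may differ in these details. Only the Spearman coefficient is shown.

```python
import random


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined for a constant sample
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_fraction(d_lang, d_cite, d_exp, n_boot=200, seed=0):
    """Fraction of resamples in which lang-exp is at least as correlated as cite-exp."""
    rng = random.Random(seed)
    n = len(d_exp)
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        lang = [d_lang[k] for k in idx]
        cite = [d_cite[k] for k in idx]
        exp = [d_exp[k] for k in idx]
        if spearman(lang, exp) >= spearman(cite, exp):
            hits += 1
    return hits / n_boot
```

A small returned fraction would indicate that the language measure is systematically less correlated with the expert classification than the citation measure is. Tie handling matters here because the expert measure takes only a few distinct values.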
We now explore how the relationship between the different dimensions depends on the different scientific fields. The results in Fig. 2 confirm the conclusions of the aggregated analysis but show further interesting features. First, the correlation in (lang-exp) is smaller than in (cite-exp) mainly in the natural sciences. Second, while the correlation between citations and language remains largely constant, large fluctuations exist in the correlations between experts and citations (as well as experts and language). This is seen both in the strong downward spikes and in the manifest dependence on disciplines and domains. The titles of the specialties at the low peaks already suggest that these are specialties with interdisciplinary connections. For instance, Chemistry, Medicinal is a specialty that (according to the experts classification) belongs to the discipline Basic Medicine and to the domain Medical Science. Therefore $D_{\text{exp}} = 3$ between Chemistry, Medicinal and all specialties of the Natural Sciences (in particular, for all specialties from the discipline Chemical Sciences). Instead, the dissimilarities measured by citations and language yield much smaller values, revealing the proximity of Chemistry, Medicinal to the Natural Sciences and thus explaining the low correlation in (lang-exp) and in (cite-exp). The central role of the natural sciences in other disciplines also explains the other spikes: computing, for a list of selected specialties, the pairs which suffered the largest rank change, we find that 9 of the top 10 specialties which increased most in rank (comparing with ) were from the domain Natural Sciences ( of them from the discipline Chemical sciences, including the top 2 specialties).


III.2 Hierarchical Clustering
A strict hierarchical classification of scientific fields is both aesthetically appealing and of practical use in bibliographical and document classification tasks. It also allows us to further highlight the differences in the relationship between scientific fields revealed by the different dissimilarity measures (in particular by $D_{\text{lang}}$). While $D_{\text{exp}}$ is precisely based on one such hierarchical classification, $D_{\text{cite}}$ and $D_{\text{lang}}$ are not. In Fig. 3 we show the hierarchical classifications induced by $D_{\text{cite}}$ and $D_{\text{lang}}$ through the computation of a simple clustering method at the level of domains and disciplines.
At the top level of the domains (top row in Fig. 3), the clustering obtained from citations and from language are very similar. In particular, both identify Engineering-Natural Sciences and Humanities-Social Science as clusters that separate from the other domains in a similar fashion. The only difference is that, based on citations, Agriculture appears more isolated while based on language this happens for Medical Science. A more detailed picture of the differences between language and citation is revealed at the level of disciplines (bottom row in Fig. 3). While at the first division, both citations and language create a cluster in which all disciplines of the domains Humanities and Social Sciences appear, further divisions show more subtle differences between the two dissimilarity measures.
Remarkably, the hierarchy obtained from language creates a cluster containing all and only Humanities disciplines. In contrast, the hierarchy based on citations creates one cluster with three of the five Humanities disciplines (Lang. and Literature, Arts, and Other Humanities), while the two remaining ones (History & Archaeology and Philosophy, ethics, religion) are clustered together in the middle of a cluster of Social Science disciplines. Another interesting difference between the clusterings is revealed by looking at three disciplines of the domain Medicine: in the analysis based on citations, the smallest cluster that includes the three disciplines also includes Biological sciences and Other natural sciences, while in the language analysis this cluster additionally includes three related Engineering disciplines (Medical eng., Ind. biotechnology, and Envir. biotechnology).
Probably the most remarkable feature of the clustering obtained by both citations and language is that it repeatedly clusters together related disciplines from Natural Sciences with disciplines from Engineering and Medicine (e.g., Chemical Sciences and Materials Science). This clustering, not present in the experts classification, suggests that the distinction between fundamental and applied sciences present in the expert classification has no strong effect on the citations and the language of the publications. Instead, in this specific case, the citation and language analyses seem to capture a connection between “subject matters” that was necessarily absent from the strict hierarchical expert classification.
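As an illustration of how a hierarchy can be induced from a matrix of pairwise dissimilarities, the sketch below implements agglomerative clustering with average linkage in plain Python. The text above only speaks of "a simple clustering method", so the choice of average linkage, the function name `average_linkage`, and the dictionary-of-pairs input format are assumptions; the numerical values in the usage note are invented and are not results of the analysis.

```python
from itertools import combinations


def average_linkage(labels, d):
    """Agglomerative clustering with average linkage.

    `d` maps frozenset({a, b}) to the dissimilarity of fields a and b.
    Returns the list of merges as (cluster_a, cluster_b, height).
    """
    clusters = [(label,) for label in labels]
    merges = []

    def between(a, b):
        # Mean dissimilarity over all cross-cluster pairs of fields.
        total = sum(d[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges
```

For example, with four domains and invented dissimilarities of 0.1 between Engineering and Natural Sciences, 0.2 between Humanities and Social Sciences, and 0.8 for every other pair, the first two merges join those two pairs and the final merge joins the two resulting clusters at height 0.8.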

III.3 Temporal evolution
While in the previous sections we looked at a static snapshot of the relation between disciplines, here we are interested in how the linguistic relationship between pairs of disciplines evolved over the last three decades (we work at the level of disciplines because most specialties fail to have enough publications in a single year). In Figure 4 we show the temporal evolution of $D_{\text{lang}}$ for five out of pairs $(i,j)$, with a focus on the discipline Physical Sciences, illustrating different types of dynamic patterns. On the one hand, the dissimilarities to Chemical Sciences (its most similar discipline) and to Mathematics stay roughly constant over time. On the other hand, we also observe systematic trends of disciplines becoming more or less similar over time. While the proximity to Biological Sciences and to Computer and information Science has steadily increased (decreased dissimilarity $D_{\text{lang}}$) after the year , the opposite trend is seen for Electrical, electronic, and information Engineering. These observations are consistent with the increasing number of biological and computation-related publications in Physics, and with a departure from the historical connections to Engineering.
The observations reported above raise the question whether scientific disciplines are showing an overall tendency to become more similar to each other. In a more general context, this amounts to the question whether the purported increase in interdisciplinarity leads to a larger overlap in the language used by different disciplines. We address this question by computing, for each pair of disciplines, the mean yearly variation
(3) | |||||
(4) |
where the time interval was usually from to . The distribution of values of for all discipline pairs is shown in the (rightmost) box plot in the right panel of Fig. 4. We see that there are both positive and negative variations, consistent with our qualitative observations in the example of Physical Sciences in the left panel of Fig. 4. However, the average variation over all pairs of disciplines is not distinguishable from zero (the null hypothesis of has a p-value = in the t-test for the mean of one sample and a p-value = in the non-parametric Wilcoxon test), i.e. the typical dissimilarity remains unchanged. This result suggests that, while there are systematic trends for individual pairs of disciplines, on average there is no significant increase or decrease in the interdisciplinarity of science as a whole in the last 3 decades, as measured by the language.

On a more fine-grained level, however, we observe systematic trends that suggest that individual disciplines tend to become more (less) central. For this, we focus on the discipline pairs which experienced the most extreme variation in the last decade (one standard deviation away from ). These pairs have typically , meaning that their (normalized) dissimilarity changes roughly in a decade. The three disciplines that are most frequently seen in the left tail () are: 1-02 Computer and information sciences, 2-08 Environmental biotechnology, and 3-01 Basic medicine. The language of these disciplines became significantly more similar to the language of other disciplines in the last 3 decades, suggesting that these disciplines became more central. In contrast, the three disciplines that experienced most strongly the opposite effect (most frequently seen in the right tail, ) are: 5-01 Psychology, 2-05 Materials engineering, and 2-02 Electrical engineering, electronic engineering, information engineering.

IV Discussion
We investigated the similarity between scientific fields from different perspectives: an expert classification, a citation analysis, and a newly proposed measure of linguistic similarity. We found that these different dimensions are related yet different, thus yielding new insights into the relationship between disciplines, their hierarchical organization, and their temporal evolution.
Our first main finding is that the language and citation relationships between disciplines are similar to each other and substantially different from the expert classification. This is consistent with the motivation laid out in our introduction, which associated the expert classification with the (largely idealized) essentialist view of scientific disciplines, while the citation (social) and language (cognitive) analyses are closer to dimensions that play a more important role in the relationship between fields. Interestingly, our results indicate that the language relation of fields is more distinct from the expert classification than the citation relation is, especially in the natural sciences.
Our second main finding is that in the last 30 years the language of different scientific fields has remained, on average, at the same distance from that of all other fields. While individual disciplines show clear trends of increasing (or decreasing) centrality, this suggests that, overall, diverging tendencies in science (e.g., specialization) are in balance with converging tendencies (e.g., multidisciplinarity). This is a remarkable quantitative finding because of the substantial changes observed in this period.
The latter result demonstrates that our textual measure is of practical relevance for the study of interdisciplinarity. In recent years, interdisciplinary research has achieved a central position Noorden2015 due to its broader relation to the concept of diversity stirling.2007 and its effect on impact uzzi.2013 ; wang.2015 and on the performance of teams lungeanu.2014 , as well as its implications for policy making, e.g. in terms of funding academy.book2004 . Is it just a fashion, or is science really becoming more and more interdisciplinary? A usual way to assess interdisciplinarity is based on citation networks, using heuristic approaches porter.2009 ; wagner.2011 ; lariviere.book2014 or methods from complex networks pan.2012 ; sayama.2012 ; sinatra.2015 ; omodei.2016 . In line with the arguments laid out in the introduction, interdisciplinarity can be viewed through different dimensions, and the cognitive dimension would be best measured using textual data. However, there are only very few works bache.2013 ; nichols.2014 ; evans.2016 relating textual measures to interdisciplinarity, despite the increasing availability of the text of scientific articles. In this view, the significance of our approach is that it provides a measure of interdisciplinarity based on how much the usage of words in different disciplines overlaps.
Finally, we hope our results and methodology will stimulate a multiple-dimensional approach in other problems related to the study of sciences, profiting from the modern availability of large (textual) databases of scientific publications that allow us to go beyond traditional bibliometric analysis evans.2011 ; lariviere.book2014 . These include, but are not limited to, the formulation of more meaningful bibliometric indicators mann.2006 , the identification and prediction of influential papers and disciplines gerrish.2010 ; foulds.2013 ; whalen.2015 , or the inclusion of textual information in recommending related scientific papers achakulvisut.2016 .
V Materials and Methods
V.1 Data and grouping of corpora
We use the Web of Science database webOfScience and explore the following information available for individual articles: citations, title, abstract, and the classification in one scientific specialty (per OECD classification bibliographicWoSClassification ). We use all papers published between 1991 and 2014 because the number of articles with text in the abstract is substantial only after 1991 and because 2014 was the last complete year available to us at the time we started our analysis. The text of an article was built by concatenating its title and abstract. The corpus representing a specialty in a given year is obtained from the concatenation of the text of all articles for that specialty in that year. The corpus for one discipline (or domain) concatenates all articles in all specialties belonging to that discipline (or domain).
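The grouping of articles into corpora can be sketched as below. The record fields (`title`, `abstract`, `specialty`, `year`), the mapping `discipline_of`, and the function name are hypothetical names chosen for the example; they do not reflect the actual schema of the Web of Science data.

```python
from collections import defaultdict


def build_corpora(articles, discipline_of):
    """Concatenate article texts per (specialty, year) and per (discipline, year).

    `articles` is an iterable of dicts with the hypothetical keys
    'title', 'abstract', 'specialty', and 'year'; `discipline_of` maps
    each specialty to its parent discipline.
    """
    by_specialty = defaultdict(list)
    by_discipline = defaultdict(list)
    for art in articles:
        text = art["title"] + " " + art["abstract"]
        by_specialty[(art["specialty"], art["year"])].append(text)
        by_discipline[(discipline_of[art["specialty"]], art["year"])].append(text)

    def join(groups):
        return {key: " ".join(texts) for key, texts in groups.items()}

    return join(by_specialty), join(by_discipline)
```

The same pattern extends to domains by adding a second mapping from disciplines to domains.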
Our analysis is based on articles for which the textual and classification information were available ( of all articles indexed in Web of Science between 1991 and 2014). In our analysis we considered only citations from and to the papers in our list because only for these papers did we have a reliable classification of specialties. These citations corresponded to roughly half of the M citations associated with these papers.
V.2 Data processing
For each article in our database we performed the following steps to process the textual information:
- The copyright information contained in the abstract was removed.
- Title and abstract were concatenated.
- The text was converted to lowercase.
- Contractions were replaced by their non-contracted form.
- The text was tokenized, and the nouns and verbs were lemmatized using the Natural Language Toolkit nltk .
- Symbols (except hyphens, to avoid removing significant compound modifiers) inside tokens were replaced by white space, therefore generating two or more distinct tokens.
- Tokens consisting only of numbers or of a single letter were removed.
- Tokens belonging to a preset stop-word list were discarded.
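The steps above can be sketched with the Python standard library alone. This is a simplified illustration, not the pipeline that was actually run: the copyright pattern, the contraction table, and the stop-word list are small hypothetical stand-ins, tokenization is reduced to whitespace splitting, and the lemmatization step (done with the Natural Language Toolkit in the original procedure) is deliberately omitted to keep the example self-contained.

```python
import re

# Hypothetical stand-ins for the real resources.
COPYRIGHT = re.compile(r"(\(c\)|©)\s*\d{4}.*$", re.IGNORECASE)
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "we", "it", "to", "do", "not"}


def preprocess(title, abstract):
    """Return the list of cleaned tokens for one article."""
    abstract = COPYRIGHT.sub("", abstract)           # 1. drop copyright notice
    text = (title + " " + abstract).lower()          # 2.-3. concatenate, lowercase
    for short, full in CONTRACTIONS.items():         # 4. expand contractions
        text = text.replace(short, full)
    tokens = text.split()                            # 5. tokenize (no lemmatization here)
    pieces = []
    for token in tokens:                             # 6. split on symbols, keep hyphens
        pieces.extend(re.sub(r"[^\w\s-]|_", " ", token).split())
    pieces = [p.strip("-") for p in pieces]
    return [
        p for p in pieces
        if p and not p.isdigit() and len(p) > 1      # 7. drop numbers, single letters
        and p not in STOP_WORDS                      # 8. drop stop words
    ]
```

Keeping the hyphen means that a compound modifier such as "time-dependent" survives as a single token, while a token such as "h2o/co2" is split at the slash.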
V.3 Minimum corpus size
We computed $D_{\text{lang}}$ using only the most frequent word types, disregarding the scientific fields for which there was not enough data to achieve this cut-off. This choice is motivated by the slow convergence of entropy estimations (and thus of $D_{\text{lang}}$) Gerlach2016 . By choosing a fixed number of word types we reduce the effect of the remaining bias (in the estimation of $D_{\text{lang}}$) on our comparative analysis of textual dissimilarity between pairs of fields. This happens because the residual bias acts as an offset in all cases (when a fixed cut-off is chosen) instead of affecting each case differently (as would happen if the maximum amount of data were used in each case). The bias decays with the number of word types used because the more frequent types are responsible for almost all the dissimilarity, especially for Altmann2017 . Using types as a cut-off, we estimated the relative standard deviation of the textual dissimilarity, computed over multiple samples of the same scientific field, to be . Our cut-off of types is a conservative choice to ensure that .
V.4 Generalized Jensen-Shannon Divergence
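Given two texts (indexed by $p$ and $q$), we define the probability distributions over all words $w = 1, \ldots, V$ as $\mathbf{p} = (p_1, \ldots, p_V)$ and $\mathbf{q} = (q_1, \ldots, q_V)$. An information-theoretic measure to quantify their similarity is the generalized Jensen-Shannon divergence

$$D_\alpha(\mathbf{p}, \mathbf{q}) = H_\alpha\!\left(\frac{\mathbf{p} + \mathbf{q}}{2}\right) - \frac{1}{2} H_\alpha(\mathbf{p}) - \frac{1}{2} H_\alpha(\mathbf{q}) \tag{5}$$

based on the generalized entropy of order $\alpha$ ($H_\alpha$), where

$$H_\alpha(\mathbf{p}) = \frac{1}{1-\alpha}\left(\sum_{w=1}^{V} p_w^{\alpha} - 1\right). \tag{6}$$

Here, we consider a normalized similarity Gerlach2016

$$\tilde{D}_\alpha(\mathbf{p}, \mathbf{q}) = \frac{D_\alpha(\mathbf{p}, \mathbf{q})}{D_\alpha^{\max}(\mathbf{p}, \mathbf{q})} \tag{7}$$

such that $0 \leq \tilde{D}_\alpha \leq 1$, where $D_\alpha^{\max}(\mathbf{p}, \mathbf{q})$ is the maximum possible $D_\alpha$ between $\mathbf{p}$ and $\mathbf{q}$ assuming that the sets of symbols in each distribution (i.e., the supports of $\mathbf{p}$ and $\mathbf{q}$) are disjoint.
Note that for $\alpha \to 1$, Eq. (6) yields the Shannon entropy cover2006 , i.e. $H_{1}(\mathbf{p}) = -\sum_w p_w \log p_w$, and $D_{1}$ is the well-known Jensen-Shannon divergence lin1991 . Ref. Gerlach2016 shows that provides the most robust statistical measure of similarity of texts.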
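A minimal Python sketch of the normalized divergence is given below. It assumes that the generalized entropy has the form $H_\alpha(\mathbf{p}) = (\sum_w p_w^\alpha - 1)/(1-\alpha)$, with the Shannon entropy as the $\alpha = 1$ case, and it obtains the normalization directly from the definition, by evaluating the divergence on copies of the two distributions whose supports are made disjoint. The function names are illustrative, the order `alpha` is left as a free parameter, and the cut-off on the number of word types described in Sec. V.3 is not applied.

```python
import math
from collections import Counter


def generalized_entropy(p, alpha):
    """Generalized entropy of order alpha (Shannon entropy for alpha == 1)."""
    if alpha == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** alpha for x in p if x > 0) - 1.0) / (1.0 - alpha)


def jsd(p, q, alpha):
    """Generalized Jensen-Shannon divergence between aligned vectors p and q."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return (generalized_entropy(m, alpha)
            - 0.5 * generalized_entropy(p, alpha)
            - 0.5 * generalized_entropy(q, alpha))


def jsd_max(p, q, alpha):
    """Divergence obtained if the supports of p and q were disjoint."""
    p_shifted = list(p) + [0.0] * len(q)
    q_shifted = [0.0] * len(p) + list(q)
    return jsd(p_shifted, q_shifted, alpha)


def normalized_jsd(p, q, alpha):
    return jsd(p, q, alpha) / jsd_max(p, q, alpha)


def aligned_frequencies(tokens_a, tokens_b):
    """Relative word frequencies of two token lists over their joint vocabulary."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(count_a) | set(count_b))
    n_a, n_b = sum(count_a.values()), sum(count_b.values())
    return ([count_a[w] / n_a for w in vocab],
            [count_b[w] / n_b for w in vocab])
```

With this construction the normalized value is 0 for identical distributions and 1 for distributions that share no words, for any order `alpha`.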
Acknowledgements.
L.D. received financial support from CNPq/Brazil through the program “Science without Borders”. We thank M. Palzenberger and the Max Planck Digital Library for providing access to the data, M. de Domenico for insightful discussions, and S. Haan and the Centre for Translational Data Science (University of Sydney) for helping with Figs. 1 and 3.

References
- (1) J. A. Evans, J. G. Foster, Science 331, 721 (2011).
- (2) K. Börner, C. Chen, K. W. Boyack, Annual review of information science and technology 37, 179 (2003).
- (3) R. M. Shiffrin, K. Börner, Proceedings of the National Academy of Sciences 101, 5183 (2004).
- (4) K. W. Boyack, R. Klavans, K. Börner, Scientometrics 64, 351 (2005).
- (5) M. Rosvall, C. T. Bergstrom, Proceedings of the National Academy of Sciences 105, 1118 (2008).
- (6) J. Gläser, W. Glänzel, A. Scharnhorst, Scientometrics 111, 979 (2017).
- (7) D. Wang, C. Song, A.-L. Barabási, Science 342, 127 (2013).
- (8) J. A. G. Moreira, X. H. T. Zeng, L. A. N. Amaral, PLoS ONE 10, e0143108 (2015).
- (9) V. Larivière, Y. Gingras, Beyond Bibliometrics (MIT Press, 2014).
- (10) R. V. Noorden, Nature 525, 306 (2015).
- (11) K. R. Popper, The British Journal for the Philosophy of Science 3, 124 (1952).
- (12) P. W. Balsiger, Transdisziplinarität : systematisch-vergleichende Untersuchung disziplinenübergreifender Wissenschaftspraxis (Fink, 2005).
- (13) M. Guntau, H. Laitko, World Views and Scientific Discipline Formation, R. W. Woodward, R. S. Cohen, eds. (Springer Netherlands, 1991).
- (14) E. Garfield, I. H. Sher, R. J. Torpie, The use of citation data in writing the history of science (Institute for Scientific Information, Philadelphia, 1964).
- (15) D. J. de Solla Price, Science 149, 510 (1965).
- (16) T. Kuhn, M. Perc, D. Helbing, Physical Review X 4, 041036 (2014).
- (17) D. Chavalarias, J.-P. Cointet, PLoS ONE 8, e54847 (2013).
- (18) T. Lippincott, D. Ó. Séaghdha, A. Korhonen, BMC bioinformatics 12, 212 (2011).
- (19) E. Evans, C. Gomez, D. McFarland, Sociological Science 3, 757 (2016).
- (20) R. R. Braam, H. F. Moed, A. F. J. van Raan, Journal of the American Society for Information Science 42, 233 (1991).
- (21) D. Vilhena, et al., Sociological Science 1, 221 (2014).
- (22) F. N. Silva, D. R. Amancio, M. Bardosova, L. d. F. Costa, O. N. Oliveira, Journal of Informetrics 10, 487 (2016).
- (23) J. Sienkiewicz, E. G. Altmann, Royal Society Open Science 3, 160140 (2016).
- (24) M. Gerlach, F. Font-Clos, E. G. Altmann, Physical Review X 6, 021009 (2016).
- (25) E. G. Altmann, L. Dias, M. Gerlach, Journal of Statistical Mechanics: Theory and Experiment 2017, 014002 (2017).
- (26) A. Webb, Statistical Pattern Recognition (Wiley, 2002).
- (27) Web of Science is a product of Thomson Reuters.
- (28) Working Party of National Experts on Science and Technology, OECD (2006) available at http://www.oecd.org/science/inno/38235147.pdf.
- (29) A. L. Porter, I. Rafols, Scientometrics 81, 719 (2009).
- (30) L. Leydesdorff, Journal of the American Society for Information Science and Technology 59, 77 (2008).
- (31) Each of the two terms in Eq. (1) can be interpreted as a directed Jaccard distance, in the sense that we divide the number of edges that are out-links of field $i$ ($j$) and in-links of field $j$ ($i$) by the number of edges that are out-links of field $i$ ($j$) or in-links of field $j$ ($i$).
- (32) I. Grosse, et al., Physical Review E 65, 041905 (2002).
- (33) T. K. Landauer, D. Laham, M. Derr, Proceedings of the National Academy of Sciences 101 Suppl, 5214 (2004).
- (34) K. W. Boyack, et al., PLoS One 6, e18029 (2011).
- (35) Obtained from bootstrapping samples of each joint distribution and , i.e. comparison of pairs of correlation values.
- (36) R. Sokal, C. Michener, University of Kansas Science Bulletin 38, 1409 (1958).
- (37) We work at the level of disciplines because most specialties fail to have enough publications in a single year.
- (38) A. Stirling, Journal of The Royal Society Interface 4, 707 (2007).
- (39) B. Uzzi, S. Mukherjee, M. Stringer, B. Jones, Science (New York, N.Y.) 342, 468 (2013).
- (40) J. Wang, B. Thijs, W. Glänzel, PLoS ONE 10, e0127298 (2015).
- (41) A. Lungeanu, Y. Huang, N. S. Contractor, Journal of Informetrics 8, 59 (2014).
- (42) Committee on Facilitating Interdisciplinary Research; Committee on Science, Engineering, P. P. I. of Medicine; Policy, G. A. N. A. of Sciences; National Academy of Engineering, Facilitating Interdisciplinary Research (National Academies Press, 2004).
- (43) C. S. Wagner, et al., Journal of Informetrics 5, 14 (2011).
- (44) R. K. Pan, S. Sinha, K. Kaski, J. Saramäki, Scientific Reports 2, 1 (2012).
- (45) H. Sayama, J. Akaishi, PLoS ONE 7, e38747 (2012).
- (46) R. Sinatra, P. Deville, M. Szell, D. Wang, A.-L. Barabási, Nature Physics 11, 791 (2015).
- (47) E. Omodei, M. D. Domenico, A. Arenas, Network Science pp. 1–12 (2016).
- (48) K. Bache, D. Newman, P. Smyth, Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’13 (ACM Press, 2013).
- (49) L. G. Nichols, Scientometrics 100, 741 (2014).
- (50) E. D. Evans, Socius: Sociological Research for a Dynamic World 2 (2016).
- (51) G. S. Mann, D. Mimno, A. McCallum, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’06 (ACM Press, 2006).
- (52) S. Gerrish, D. M. Blei, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel (2010), pp. 375–382.
- (53) J. Foulds, P. Smyth, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing pp. 113–123 (2013).
- (54) R. Whalen, Y. Huang, A. Sawant, B. Uzzi, N. Contractor, Quantifying and Analysing Scholarly Communication on the Web (ASCW’15) (2015).
- (55) T. Achakulvisut, D. E. Acuna, T. Ruangrong, K. Kording, PLoS ONE 11, e0158423 (2016).
- (56) Natural language toolkit, http://www.nltk.org/.
- (57) T. M. Cover, J. A. Thomas, Elements of Information Theory (Wiley-Interscience, 2006).
- (58) J. Lin, IEEE Transactions on Information Theory 37, 145 (1991).