I Data and method
In our datasets, each paper is represented by a data entry that includes the year of publication, the subject classification code, and the numbers of author(s), reference(s) and citations as recorded in the Web of Science. We use the established subject classification scheme for each discipline to identify the subfields to which each paper belongs. The classification schemes used in this work are the Physics and Astronomy Classification Scheme (PACS) for physics, the Mathematics Subject Classification (MSC) for mathematics and the Journal of Economic Literature (JEL) codes for economics. These schemes are all hierarchical, and in this work, we use the fourth level of the physics classification scheme (e.g., 03.67), the third level of the mathematics classification scheme (e.g., 92B) and the second level of the economics classification scheme (e.g., N3).
The physics dataset is a collection of all papers published by the American Physical Society (APS) Physical Review journals from to . Here, we consider only those research papers, e.g., articles, brief reports and rapid communications, with PACS numbers. In total, the dataset includes papers, PACS numbers and classification labels.
The mathematics dataset is a collection of papers published from to and classified using the 2010 Mathematics Subject Classification. Here, we consider only those journal papers that have entries in both Mathematical Reviews and Web of Science. The MSC codes were obtained from the Mathematical Reviews records, and the other information was obtained from Web of Science. This dataset includes papers, MSC codes and classification labels.
The economics dataset is a collection of all economics papers collected by the American Economic Association Journal of Economic Literature from to . Here, we consider only those papers that have records with both the JEL and Web of Science. The JEL Classification Codes were obtained from the Journal of Economic Literature, and the other information was obtained from Web of Science. This dataset includes papers, JEL codes and classification labels.
An allometric scaling-law relation between one quantity and another is assumed to have the form
$$Y_i = Y_0 N_i^{\beta}, \qquad (1)$$
where $Y_i$ denotes an output (such as the number of papers) or input (such as the number of references) of a subfield $i$, and $N_i$ denotes the size of that subfield (such as the number of authors). $Y_0$ is a normalization constant.
$\beta$ is the exponent, which we obtain through an ordinary least-squares (OLS) regression in log-log coordinates. The goodness of the regression is measured in terms of the coefficient of determination $R^2$, where $R$ is calculated as the correlation coefficient between $\ln N_i$ and $\ln Y_i$. This OLS analysis can be applied to the all-year data, in which case the values of $Y_i$ and $N_i$ are taken to be the cumulative values up through the last year covered by each dataset, or to single-year data. In the latter case, the exponent for year $t$ is denoted by $\beta_t$.
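The log-log OLS fit described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function name `fit_scaling_law`, its arguments `size` and `output` (the subfield size and the output or input quantity), and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_scaling_law(size, output):
    """Fit output = y0 * size**beta by OLS on log-transformed data."""
    log_n = np.log(np.asarray(size, dtype=float))
    log_y = np.log(np.asarray(output, dtype=float))
    # A degree-1 polyfit returns (slope, intercept) of the least-squares line.
    beta, log_y0 = np.polyfit(log_n, log_y, 1)
    # R^2 is the squared correlation coefficient between ln(size) and ln(output).
    r = np.corrcoef(log_n, log_y)[0, 1]
    return beta, np.exp(log_y0), r ** 2

# Synthetic check (not real data): an exact power law with beta = 1.1, y0 = 2.
authors = np.array([10.0, 50.0, 200.0, 1000.0, 5000.0])
papers = 2.0 * authors ** 1.1
beta, y0, r2 = fit_scaling_law(authors, papers)
```

On noiseless synthetic data the fit recovers the exponent and the normalization constant; on real subfield counts, entries with zero counts would have to be excluded before taking logarithms.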
Scaling laws, which follow a power-law function, describe the relation between two variables. In scaling laws, the scaling exponents are generally obtained by OLS regression Growth ; Invention ; interaction ; road ; Metropolitan ; innovation1 . However, scaling laws differ from the power laws that have attracted recent interest, which generally describe probability distributions, e.g., the distribution of citations powerlaw ; Innovation . For power laws, the exponents are better estimated by the maximum likelihood method maxhood ; maxhood1 ; maxhood2 than by OLS regression, for two reasons. First, a normalizable probability distribution function usually follows the power law only in its tail. Second, rare events at the very end of the tail are noisy, so a cut-off sometimes has to be introduced. When there is a scaling law between two variables and one of the two variables follows a power-law distribution, then clearly so does the other. Therefore, scaling laws and power laws often appear together. However, this is not the case in our analysis.
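For contrast with the OLS fit used for scaling laws, the maximum likelihood estimate of a power-law exponent can be sketched as below. The sketch assumes a continuous power law above a chosen lower cut-off `x_min`, for which the standard estimator is alpha = 1 + n / sum(ln(x / x_min)). The function name, the cut-off and the synthetic sample are illustrative assumptions, and citation counts are discrete, so this is only an approximation in that setting.

```python
import numpy as np

def powerlaw_mle(values, x_min):
    """MLE of alpha for p(x) ~ x**(-alpha), using only the tail x >= x_min."""
    x = np.asarray(values, dtype=float)
    tail = x[x >= x_min]
    n = tail.size
    alpha = 1.0 + n / np.sum(np.log(tail / x_min))
    # Asymptotic standard error of the estimate.
    std_err = (alpha - 1.0) / np.sqrt(n)
    return alpha, std_err

# Synthetic sample drawn by inverse-transform sampling (not real data).
rng = np.random.default_rng(0)
true_alpha, x_min = 2.5, 1.0
u = rng.random(100_000)
samples = x_min * (1.0 - u) ** (-1.0 / (true_alpha - 1.0))
alpha_hat, err = powerlaw_mle(samples, x_min)
```

Unlike a least-squares line through a binned histogram, this estimator uses every tail observation directly, which is why it is less sensitive to the noisy end of the tail.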
The relative stage of development of a subfield is measured in terms of the following deviation of the empirical value $Y_i$ for that subfield with respect to the value $Y_0 N_i^{\beta}$ predicted according to the allometric scaling relation for a subfield of size $N_i$:
$$\xi_i = \ln \frac{Y_i}{Y_0 N_i^{\beta}}. \qquad (2)$$
It is independent of the absolute size of the subfield.
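A minimal sketch of this deviation, assuming it is taken as the residual of the log-log OLS fit (the logarithm of the empirical value divided by the predicted value). The function name and the toy numbers are hypothetical.

```python
import numpy as np

def scaling_deviations(size, output):
    """Residuals ln(Y_i) - ln(Y_0 * N_i**beta) of the log-log OLS fit."""
    log_n = np.log(np.asarray(size, dtype=float))
    log_y = np.log(np.asarray(output, dtype=float))
    beta, log_y0 = np.polyfit(log_n, log_y, 1)
    return log_y - (log_y0 + beta * log_n)

# Toy data (not real): output equals size, except the third subfield,
# which produces three times as much as the others at its size.
size = np.array([10.0, 100.0, 1000.0, 10000.0, 100000.0])
output = size.copy()
output[2] *= 3.0
xi = scaling_deviations(size, output)
```

Because the deviation is a log-ratio, a subfield producing three times the predicted output receives the same deviation whether it has ten authors or ten thousand.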
Author name disambiguation.
In the following, we investigate the possible scaling relationships between the number of papers and the numbers of author instances and authors. The number of author instances simply counts the authors of all papers in a field, regardless of whether some papers have the same or overlapping authors, whereas the number of authors counts only unique authors. For the latter, we must address the problem of author disambiguation. In this paper, we adopt the simple all initials method (full last name plus all initials) to identify author names Authorname , in which authors who have the same last name and all the same initials are considered to be the same author. For example, A Smith, AB Smith and AC Smith would be identified as distinct authors, but Alice Smith and Alysia Smith would be regarded as the same author.
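The difference between author instances and unique authors under the all initials method can be illustrated with a short sketch. It assumes names are written as "Given ... Last" and that a short all-uppercase token such as "AB" is a run of initials. These parsing rules and the function names are assumptions for the example, not the procedure of the cited method.

```python
def all_initials_key(name):
    """Return (last name, initials) for a name written as 'Given ... Last'."""
    *given, last = name.replace(".", " ").split()
    initials = ""
    for token in given:
        # A short all-uppercase token such as "AB" is read as a run of initials.
        if token.isupper() and len(token) <= 3:
            initials += token
        else:
            initials += token[0].upper()
    return last.lower(), initials

def count_authors(papers):
    """papers: list of author-name lists. Returns (author instances, unique authors)."""
    instances = sum(len(authors) for authors in papers)
    unique = {all_initials_key(a) for authors in papers for a in authors}
    return instances, len(unique)

# Hypothetical author lists for three papers.
papers = [["Alice Smith", "AB Smith"], ["Alysia Smith", "AC Smith"], ["A Smith"]]
instances, unique = count_authors(papers)
```

Here the five author instances collapse to three unique authors: Alice Smith, Alysia Smith and A Smith share the key (smith, A), while AB Smith and AC Smith remain distinct.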
The all initials method has been claimed to have relatively low “contamination” rates in certain disciplines, such as in mathematics and in economics Authorname . We also performed our own small-scale validation of this approach. In the physics dataset, the subfield 42.50.Dv (Nonclassical states of the electromagnetic field) contains author-paper pairs. A total of distinct scientists were found after the disambiguation process. To validate the all initials method, we randomly selected 200 pairs of authors with similar names, each consisting of two papers considered to be from the same author. We then verified whether they were indeed the same person by performing a search on the APS website and the authors’ research homepages. We found the false positive rate (i.e., the number of authors considered to be the same person whereas, in reality, they are not) to be . We also performed a manual examination of the false negative rate (i.e., the number of identical authors incorrectly identified as different individuals using the all initials method) and found it to be approximately .
First, let us consider the relation between the number of papers and the number of authors, which, in a sense, is similar to the relation between the output and size of cities. Here, we use the cumulative data up through the last year covered by the dataset for all three disciplines. We see that the values of the exponent are () for physics (Fig. 1(a)), () for mathematics (Fig. 1(c)) and () for economics (Fig. 1(d)). This means that the number of papers per author very weakly increases as the number of authors increases. This, in turn, indicates that there are marginal increasing returns in physics, mathematics and economics.
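The link between the exponent and the number of papers per author can be made explicit with a one-line derivation. The symbols are notation chosen for this illustration: $Y$ is the number of papers, $N$ the number of authors, $Y_0$ a normalization constant and $\beta$ the scaling exponent.

```latex
Y = Y_0 N^{\beta}
\quad\Longrightarrow\quad
\frac{Y}{N} = Y_0 N^{\beta - 1}.
```

The number of papers per author therefore grows with $N$ when $\beta > 1$, stays constant when $\beta = 1$ and shrinks when $\beta < 1$. An exponent only slightly above 1 gives a correspondingly weak increase.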
When we consider Fig. 1(a) in further detail, we note that some subfields (29.20.xx) of physics are much less productive than predicted by the scaling law. For example, subfield 29.20.xx (Storage rings and colliders) is related to high-energy experiments and has approximately authors per article on average. Such experimental subfields in physics generally require many scientists to work together. This might make the scaling exponent systemically smaller. To exclude these subfields, we restrict the analysis only to papers with at most ten authors (denoted by physics (subsets)). As shown in Fig. 1(b), with this approach, the scaling exponent becomes (). For cities, the scaling exponent between the number of new patents and the urban population is , and that between the number of inventors and the urban population is Growth . Therefore, we can roughly estimate the scaling exponent between the numbers of new patents and inventors to be approximately . This rough estimation shows that our results are qualitatively consistent with the relation between the numbers of patents and inventors, which, in a sense, is similar to the relation between the numbers of papers and authors, as deduced from studies of scaling relations in cities Growth . However, the exponent values of () in physics, in mathematics and in economics for the development of science/patents are quite different from the exponent relating the output and size of cities, which is roughly . This means that the effect of increasing returns in science/patents is only marginal and not as high as the effect seen for production processes in cities. We do not know the reason for this difference. We can only speculate that it may be more difficult to increase scientific output than it is to increase industrial production by simply expanding in size.
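The rough estimate of the patent-inventor exponent used above follows from eliminating the urban population between two scaling relations. In this sketch the symbols are notation chosen for the illustration: $P$ is the number of new patents, $I$ the number of inventors, $N$ the urban population, and $\beta_P$, $\beta_I$ the corresponding exponents.

```latex
P \propto N^{\beta_P}, \qquad I \propto N^{\beta_I}
\quad\Longrightarrow\quad
N \propto I^{1/\beta_I}
\quad\Longrightarrow\quad
P \propto I^{\beta_P / \beta_I}.
```

The estimated exponent is thus the ratio $\beta_P / \beta_I$. It is only a rough estimate because each relation holds with scatter, and the scatter in the two relations need not be independent.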
Next, let us check whether the scaling exponents have remained stable during all investigated years of development of the fields by performing a scaling analysis on the single-year data for each year. We know that the average numbers of authors and references in papers today are much larger than those in earlier times. However, we find that the values of the scaling exponent have remained quite stable, as shown in Fig. 2, especially for physics (subsets); the exception is physics as a whole, for which the value is smaller than for the other fields and slightly decreasing. The fact that similar scaling laws are observed in various disciplines implies that there might be a common mechanism governing the scientific progress of these disciplines, and the fact that the exponent values have remained similar and stable over time indicates that the underlying mechanism, if it exists, tends to be preserved over time. Physics as a whole shows a smaller and slightly decreasing exponent, and this is due to papers with more than ten authors, which are often related to high-energy experimental physics. This suggests that physics might have developed to a stage in which it often requires large teams to solve certain difficult problems and is thus less productive. It should be noted that the exponents in the yearly data analysis are smaller than the exponents for the cumulative data, for reasons that we do not yet know.
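The single-year analysis amounts to repeating the log-log OLS fit once per year. A minimal sketch follows, assuming the data are available as (year, number of authors, number of papers) records with one record per subfield and year. The record layout, the function name and the synthetic values are assumptions for the example.

```python
from collections import defaultdict

import numpy as np

def yearly_exponents(records):
    """records: iterable of (year, n_authors, n_papers), one per subfield and year."""
    by_year = defaultdict(list)
    for year, n_authors, n_papers in records:
        by_year[year].append((n_authors, n_papers))
    exponents = {}
    for year, rows in sorted(by_year.items()):
        n, y = np.array(rows, dtype=float).T
        # Slope of the log-log OLS line is the exponent for that year.
        beta_t, _ = np.polyfit(np.log(n), np.log(y), 1)
        exponents[year] = beta_t
    return exponents

# Synthetic records (not real data): exponent 1.0 in 2000 and 0.9 in 2001.
sizes = [10.0, 100.0, 1000.0, 10000.0]
records = [(2000, n, n ** 1.0) for n in sizes]
records += [(2001, n, 2.0 * n ** 0.9) for n in sizes]
betas = yearly_exponents(records)
```

Plotting the resulting exponents against the year gives the kind of stability check discussed above.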
Let us also compare this relation with the relation between the number of papers and the number of author instances (each appearance of an author, including duplicate authors, increments the total number of author instances by 1). Interestingly, in this case, it is found that all exponents are smaller than 1, with () for physics (Fig. 1(e)), () for physics (subsets) (Fig. 1(f)), () for mathematics (Fig. 1(g)) and () for economics (Fig. 1(h)). This means that the marginal effect of increasing returns previously observed in Fig. 1(a-d) disappears when the number of author instances is considered, and the number of papers per author instance decreases as the number of author instances increases.
Although the goodness of fit of the fitted curves is very high overall, there are some outliers that are relatively far from the fitted curves in the above figures, and the relative positions of the subfields often change from year to year. The residual is a measure of the deviation of a true value from the corresponding value predicted by the scaling law (Eq. (2)). These deviations provide a meaningful way to rank cities interaction and universities university2 . In Fig. 3, we show the ranking of the deviations by magnitude and sign for physics and economics in 2013 as well as those for mathematics in 2010. Let us focus on a few subfields that deviate strongly and positively from the scaling law. For example, the output of classical general relativity (04.20) is ranked 3rd in physics in Fig. 3(a) according to its deviation, but it is a relatively small subfield (ranked 200th by size). Quantum information (03.67) ranks 4th in physics in Fig. 3(a) but is ranked 49th according to its size. When they are ranked according to their sizes, classical general relativity is not ranked similarly to quantum information. However, when they are ranked according to their deviations, we see that they are both among the top subfields in physics. These findings are broadly consistent with our intuition regarding these subfields: one is small and one is big, but both are very active subfields. This implies that, at least in part, the deviation from the fitted scaling law provides a reasonable indicator of the ranking of the subfields that is independent of their sizes.
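The contrast between ranking by size and ranking by deviation can be sketched with toy numbers. The subfield labels and counts below are hypothetical, and the deviation is assumed to be the residual of the log-log OLS fit.

```python
import numpy as np

def rank_descending(values):
    """Rank 1 is assigned to the largest value."""
    order = np.argsort(-np.asarray(values, dtype=float))
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

labels = ["A", "B", "C", "D", "E"]  # hypothetical subfields
authors = np.array([10.0, 100.0, 1000.0, 10000.0, 100000.0])
papers = authors.copy()
papers[1] *= 3.0  # subfield B is small but unusually productive

log_n, log_y = np.log(authors), np.log(papers)
beta, log_y0 = np.polyfit(log_n, log_y, 1)
deviation = log_y - (log_y0 + beta * log_n)

size_rank = dict(zip(labels, rank_descending(authors)))
deviation_rank = dict(zip(labels, rank_descending(deviation)))
```

In this toy case subfield B is second to last by size but first by deviation, which mirrors how a small subfield can rank near the top once size is factored out.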
We also rank the subfields of mathematics and economics. It is found that topological geometry (51H) is ranked 1st in mathematics according to its deviation, whereas it is ranked 632nd according to its size. In addition, it is found that game theory and bargaining theory (C7) is the top subfield in economics according to its deviation but is ranked 44th according to its size. Judging from our limited knowledge of economics, we believe that it is reasonable for game theory to be considered among the top subfields: it is not large but is a core subfield of economics, which can partially be seen from the fact that (according to Wikipedia) there have been game theorists among the Nobel laureates in economics, and some economists even believe that it is the core of the whole of economic theory Levin:GameTheory .
Let us now look at other outputs of the investigated scientific fields vs. the numbers of authors in their subfields. It is found that the exponents relating the numbers of citations and authors are larger than 1, with () for physics, () for physics (subsets), () for mathematics and () for economics (Fig. 4). This means that authors working in larger subfields receive, on average, more citations than those in smaller subfields. In addition, the exponents relating the numbers of citations and papers are () for physics, () for physics (subsets), () for mathematics and () for economics. These findings are similar to the scaling laws between the numbers of citations and papers when universities university1 ; university2 and research groups group1 ; group2 ; group3 are treated as the relevant units. However, the exponent values for the latter cases are approximately , larger than those found here. This means that whereas authors are more likely to cite papers from the same university, the same research group and the same subfield, the degrees of affinity for universities and research groups are even stronger than those for subfields.
Next, we find that the scaling-law exponent relating the numbers of references and authors is smaller than the exponent relating the numbers of citations and authors. We have () for physics, () for physics (subsets), () for mathematics and () for economics (Fig. 5). Similarly, in the scaling laws of cities, the exponent relating supplies and population is smaller than the exponent relating outputs and population Growth . We might expect these exponents in the case of scientific publications to be higher since, intentionally or unintentionally, people may cite references more carelessly than they would use living supplies because there is no cost for citing more references, whereas there is a cost associated with the use of living supplies. However, the fact that these exponents are close to, although clearly slightly higher than, that relating the housing/water/energy supplies and populations in cities implies that perhaps researchers do not cite many unnecessary references.
III Conclusions and Discussion
In this paper, we first examined and confirmed the allometric scaling relations between the numbers of papers, citations, and references and numbers of authors in subfields of three disciplines, namely, physics, mathematics and economics, which are analogous to the relation between the numbers of patents and inventors of patents in cities Growth and the relations between various outputs/inputs and population size for cities Growth and countries countries . One of the reasons for the development of cities is that there is an effect of increasing returns between the output and size of a city, which results in a lower effective cost for intra-city transactions than for inter-city transactions. Perhaps there are similar factors driving the formation of scientific subfields, which cause the observed allometric scaling relations to arise in the development of research subfields. Furthermore, the values of the exponents for all three disciplines were found to be similar and to have remained stable over time. We do not yet know why the various disciplines display similar exponents and temporal stability. However, we believe that this common allometric law across disciplines and time requires further investigation: certain common underlying mechanisms may exist that drive the development of scientific fields in various disciplines.
We found that the exponents relating the numbers of papers and authors are much smaller than those relating the various outputs of cities to their size Growth . This means that the effect of increasing returns observed in scientific production is much lower than the corresponding effect on production in cities. However, the exponents relating the numbers of citations and authors are more similar to those relating the various outputs of cities to their size, indicating that there is a stronger effect of increasing returns between the numbers of citations and authors. This suggests that on average, as the number of authors increases, there is only a marginal effect of increasing returns on the number of papers but a much larger effect of increasing returns on the number of citations. In addition, through several examples, we showed that deviations of individual subfields from the predictions of the allometric scaling relations can provide a size-independent but still meaningful ranking of those subfields.
The current study has several limitations. Our datasets contained only the portions of WOS that overlap with the relevant subject classification schemes (PACS, MSC and JEL), which restricted our ability to study the scaling relations governing the properties of all publications. In particular, the results for physics consider only those papers published in Physical Review journals. Moreover, the method used for author name disambiguation could be further improved.
- (1) J. S. Katz, The self-similar science system. Res Policy 28: 501-517 (1999).
- (2) J. S. Katz, Scale-independent indicators and research evaluation. Sci Public Policy 27: 23-36 (2000).
- (3) J. S. Katz, What is a complex innovation system? PLOS ONE 11(6):e0156150 (2016).
- (4) X. Gao, J. Guan, A scale-independent analysis of the performance of the Chinese innovation system. Journal of Informetrics 3: 321-331 (2009).
- (5) L. M. A. Bettencourt, J. Lobo, D. Helbing, C. Kuhnert, G. West, Growth, innovation, scaling, and the pace of life in cities, Proceedings of the National Academy of Sciences of the United States of America 104 7301-7306 (2007).
- (6) L.M.A. Bettencourt, J. Lobo, D. Strumsky, G. B. West, Urban scaling and its deviations: revealing the structure of wealth, innovation and crime across cities. PLoS One 5: e13541 (2010).
- (7) M. Herrera, D. C. Roberts, N. Gulbahce, Mapping the evolution of scientific fields. PLoS ONE 5: e10355 (2010).
- (8) L. M. A. Bettencourt, D. I. Kaiser, J. Kaur, C. Castillo-Chavez, D. E. Wojick, Population modeling of the emergence and development of scientific fields. Scientometrics 75(3):495-518 (2008).
- (9) A. F. J. Van Raan, Bibliometric statistical properties of the 100 largest European research universities: Prevalent scaling rules in the science system. J Am Soc Inf Sci Technol 59: 461-475 (2008).
- (10) A. F. J. Van Raan, Universities scale like cities. PLoS One 8: e59384 (2013).
- (11) A. F. J. van Raan, Statistical Properties of Bibliometric Indicators: Research Group Indicator Distributions and Correlations. J Am Soc Inf Sci Technol 57: 408-430 (2006).
- (12) A. F. J. van Raan, Performance-related differences of bibliometric statistical properties of research groups: cumulative advantages and hierarchically layered networks. J Am Soc Inf Sci Technol 57: 1919-1935 (2006).
- (13) A. F. J. van Raan, Scaling rules in the science system: Influence of field-specific citation characteristics on the impact of research groups. J Am Soc Inf Sci Technol 59: 565-576 (2008).
- (14) Ö. Nomaler, K. Frenken, G. Heimeriks, On Scaling of Scientific Knowledge Production in U.S. Metropolitan Areas, PLoS ONE 9(10): e110805 (2014).
- (15) R. Naroll, L. von Bertalanffy, The principle of allometry in biology and the social sciences, General Systems Yearbook 1 76-89 (1956).
- (16) L. M. A. Bettencourt, G. B. West, A unified theory of urban living. Nature 467: 912-913 (2010).
- (17) J. Brown, G. West, Scaling in Biology, Oxford University Press, 2000.
- (18) M. Kleiber, Body size and metabolism, Hilgardia 6 315-353 (1932).
- (19) G. B. West, J. Brown, The origin of allometric scaling laws in biology from genomes to ecosystems: towards a quantitative unifying theory of biological structure and organization, The Journal of Experimental Biology 208 1575-1592 (2005).
- (20) M. Batty, The Size, Scale, and Shape of Cities. Science 319: 769-771 (2008).
- (21) G. B. West, J. H. Brown, B. J. Enquist, The Fourth Dimension of Life: Fractal Geometry and Allometric Scaling of Organisms. Science 284: 1677-1679 (1999).
- (22) J. Brown, Toward a metabolic theory of ecology, Ecology 85 (7) 1771-1789 (2004).
- (23) L. M. A. Bettencourt, J. Lobo, D. Strumsky, Invention in the city: increasing returns to scale in metropolitan patenting, Research Policy 36 107-120 (2007).
- (24) S. G. Ortman, A. H. F. Cabaniss, J. O. Sturm, L. M. A. Bettencourt, Settlement scaling and increasing returns in an ancient society, Sci. Adv. 2015;1:e1400066.
- (25) C. Kuhnert, D. Helbing, G. West, Scaling laws in urban supply networks, Physica A 363 96-103 (2006).
- (26) Y. Lee, An allometric analysis of the US urban system: 1960-80, Environment and Planning A 21 463-476 (1989).
- (27) S. Lammer, B. Gehlsen, D. Helbing, Scaling laws in the spatial structure of urban road networks, Physica A 363 89-95 (2006).
- (28) J. Zhang, T. Yu, Allometric scaling of countries, Physica A 389 4887-4896 (2010).
- (29) S. Milojević, Accuracy of simple, initials-based methods for author name disambiguation, Journal of Informetrics 7 767-773 (2013).
- (30) D. K. Levine, What is game theory? http://www.dklevine.com/general/whatis.htm, accessed: 2016-12-20 (2016).
- (31) J. S. Katz, Indicators for complex innovation systems. Research Policy 35: 893-909 (2006).
- (32) S. Redner, How popular is your paper? An empirical study of the citation distribution. Eur Phys J B 4: 131-134 (1998).
- (33) P. T. Nicholls, Estimation of Zipf parameters. J Am Soc Inf Sci Technol 38(6): 443-445 (1987).
- (34) S. Milojević, Power law distributions in information science: Making the case for logarithmic binning. J Am Soc Inf Sci Technol 61(12): 2417-2425 (2010).
- (35) A. Clauset, C. R. Shalizi, M. E. J. Newman, Power-law distributions in empirical data. SIAM Rev 51(4): 661-703 (2009).