The evolution of scientific literature as metastable knowledge states

by Sai Dileep Koneru, et al.

The problem of identifying common concepts in the sciences and deciding when new ideas have emerged is an open one. Metascience researchers have sought to formalize principles underlying stages in the life-cycle of scientific research, determine how knowledge is transferred between scientists and stakeholders, and understand how new ideas are generated and take hold. Here, we model the state of scientific knowledge immediately preceding new directions of research as a metastable state and the creation of new concepts as combinatorial innovation. We find that, through the combined use of natural language clustering and citation graph analysis, we can predict the evolution of ideas over time and thus connect a single scientific article to past and future concepts in a way that goes beyond traditional citation and reference connections.






Early work in metascience can be traced back at least half a century[morris1946significance], although only in the last decade or so has a robust literature emerged exploring co-authorship networks, citation networks, topical networks and similar static and one-dimensional representations of complex interactions amongst researchers and their work. Much of this has been powered by the increased availability of digital data on scientific processes and by improvements in information retrieval, network science, machine learning, and computational power, allowing researchers to derive meaningful insights. A substantial subset of this literature has focused on quantifying and predicting success in publishing – how we should measure success, who will have it, and what factors contribute to having it. Seminal work has focused on modeling citation patterns for papers[wang2013quantifying] and researchers[sinatra2016quantifying], with more recent work setting out to explain hot streaks in researchers’ career trajectories[liu2018hot], unique patterns of productivity and collaboration amongst the scientific elite[li2020scientific], and even the role of luck in driving scientific success[pluchino2019exploring, janosov2020success]. We are also seeing the emergence of metascience as a social movement[peterson2020metascience], catalyzed by the last decade’s reproducibility crisis[schooler2014metascience], aiming to describe and evaluate science at a macro scale in order to diagnose biases in research practice[lariviere2013bibliometrics, hofstra2020diversity], highlight flaws in publication processes[franco2014publication], understand how researchers select new work to pursue[rzhetsky2015choosing, jia2017quantifying], identify opportunities for increased efficiency (e.g., automated hypothesis generation[spangler2014automated]), and forecast emergence of research topics[prabhakaran2016predicting, chen2018modeling].

Prior work on the evolution of research can be broadly viewed in three categories based on method: network-based, language-based, and hybrid methods using both networks and language. Language-based methods commonly use topic models such as Latent Dirichlet Allocation (LDA) and predict changes in topics[kleminski2017identifying, uban2021studying]. Other language-based approaches include tracking usage of keywords[faust2018documenting], analyzing linguistic context[prabhakaran2016predicting], and modeling topics sequentially[chen2018modeling]. Studies using network-based methods usually apply community detection algorithms to citation networks, such as topological clustering methods[shibata2008detecting] or clique percolation methods[salatino2018augur], to identify the emergence of new fields, while other network approaches use temporal[sun2020evolution] or multiplex[zamani2020evolution] networks, or projections of citation networks such as co-authorship networks[sarigol2014predicting, sun2016mapping]. Hybrid usage of both language- and network-based methods to predict the evolution of scientific fields includes keyword-generated networks used to predict changes in topics[krenn2020predicting] or approaches that mostly rely on network analysis, applying linguistic techniques such as LDA for explanatory labels only[sasaki2020emerging]. Still others[zhang2017detecting] have used LDA and co-occurrence networks of topics to study changes in knowledge-based systems. However, to the best of our knowledge, these hybrid methods do not incorporate state-of-the-art language embeddings, nor do they incorporate insights from both the language models and the citation network. It is our hypothesis that the only way to truly capture the evolution of ideas and knowledge in the literature is through the integration of network and linguistic techniques.

A premise of the work discussed in this paper is that neither citation networks alone (or derivatives thereof) nor purely language-driven models of the scientific corpus can explain the evolution of fields and the emergence of new ideas. We show that these two frameworks capture overlapping but distinct and complementary aspects of dynamics in scientific research. We use pre-trained neural network models to generate vectorized representations of the literature while separately leveraging citation network measures (e.g., betweenness centrality), combining these two inputs to build predictive models of topical evolution. The intuition behind the mechanisms explored herein is that scientific disciplines can be described at a high level by aggregation of related ideas. When a discipline is beginning to show signs of fracture or change via the emergence or synthesis of new ideas, we model this moment by borrowing from physics the concept of metastability: a state easily perturbed into a new state. We suggest that measures of interdisciplinarity may be indicators of this transition and thus useful predictors of change in the scientific ecosystem.

Recent efforts have elevated the role of interdisciplinarity in scientific practice[klein1990interdisciplinarity, jacobs2009interdisciplinarity, repko2019introduction, pan2012evolution]. Prior work has shown interdisciplinarity to have an effect on innovation and research impact[molas2014relationship, hofstra2020diversity]. Calls for collaboration across disciplines are prominent throughout research institutions and funding agencies (see, e.g., the U.S. National Science Foundation’s Growing Convergence Research program), but some have argued that the promises of interdisciplinarity are overstated and misplaced[jacobs2014defense]. The bibliometric community has offered a data-driven framing for interdisciplinary studies, e.g., defining interdisciplinarity as a process of integrating different bodies of knowledge[wagner2011approaches, porter2009science].

The definition of interdisciplinarity varies broadly in the literature, with different definitions capturing different aspects of this concept[wang2020consistency]; these definitions can be broadly classified into two groups: subject-based and network-based[wang2020consistency]. Subject-based metrics rely on multi-classification systems to calculate interdisciplinarity, leaning on pre-defined subject categories, e.g., from the Web of Science (WoS)[WoS]. These approaches generally are imposed at the journal level, focusing on the distribution of subject categories, e.g., the percentage of references cited by publications in journals outside the category of the journal of interest[porter1985indicator, morillo2001approach]. In some cases, metrics are borrowed from other fields, such as the Gini index from economics and Shannon entropy from information theory, to quantify diversity[wang2015interdisciplinarity]; these are also based on pre-made categories. Alternatively, network-based interdisciplinarity metrics are often assessed based on the location of a publication in a citation network[leydesdorff2007betweenness], with centrality measures frequently being the focus. For example, betweenness centrality, which is independent of third-party categorization, was one of the first metrics used in this way[leydesdorff2007betweenness, leydesdorff2011indicators] and has likewise been used to predict future network trends[gao2021potential, chen2009towards].
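To make the subject-based diversity measures mentioned above concrete, the following is a minimal sketch of Shannon entropy and the Gini coefficient applied to the subject categories of a paper's references. The function names and the toy category labels are hypothetical and are not taken from the cited works; the cited studies may define their variants differently.

```python
import math
from collections import Counter


def shannon_entropy(categories):
    """Shannon entropy (in nats) of the category labels of a reference list.

    Higher values mean references are spread more evenly across categories.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def gini_coefficient(values):
    """Gini coefficient of non-negative values (e.g., references per category).

    0 means perfectly even; values near 1 mean concentration in one category.
    Categories that receive no references must be passed explicitly as zeros.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on ascending-sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical reference list: subject category of each cited paper.
refs = ["Economics", "Economics", "Psychology", "Sociology"]
print(shannon_entropy(refs))
print(gini_coefficient([2, 1, 1, 0]))  # counts per category, including an unused one
```

Both measures depend on the pre-made category scheme that is passed in, which is the limitation of subject-based metrics noted above.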

To study knowledge evolution in the scientific literature, we: (1) develop methods that utilize transformer-based language models and unsupervised clustering to track the evolution of ideas over time; (2) quantify interdisciplinarity using complementary text- and citation-based metrics; and (3) explore the utility of metastability, measured through interdisciplinarity, as a predictor of scientific evolutionary events (Fig. 1).

Figure 1: Data analysis workflows. (Top) Text-based analysis. Title and abstract are concatenated and input to a language embedding model, then dimensionally reduced and fed into a clustering algorithm; clusters of embedded papers are then used for event modeling and interdisciplinarity scoring. (Bottom) Citation-based analysis. Citation information is used to create undirected citation graphs; the Louvain algorithm is used to identify network communities and betweenness centrality is used for interdisciplinarity scoring. Interdisciplinary metrics are jointly used to predict disciplinary evolution.


Our dataset contains detailed records of 19,177 scientific papers published in the years 2011 through 2018, with 2,300 to 2,500 papers for each year, representing a substantial stratified random sample of papers published in 62 prominent journals from the following disciplines as strata: Criminology; Economics and Finance; Education; Health; Management; Marketing and Organizational Behavior; Political Science; Psychology; Public Administration; and Sociology. (The dataset was collected in conjunction with DARPA’s SCORE program; for a complete listing of journals see[alipourfard2021systematizing].) Metadata for these papers were collected using the Web of Science as a primary source. Digital Object Identifiers (DOIs) were used to merge WoS records with Semantic Scholar (S2) records[ammar2018construction, lo2019s2orc] for completeness of metadata coverage and author name disambiguation. When DOIs were not available from WoS, we used Crossref[lammey2015crossref] to fill in missing DOIs for more complete record linking between the Web of Science and Semantic Scholar. For citation network analyses, we also included all papers referenced by these papers. Our complete dataset includes records of 839,096 papers and about 1.45 million citations.


We use parallel workflows to model dynamics in bibliometric data – one based on text and one based on citation networks (Fig. 1). For each we derive a measure of interdisciplinarity useful for prediction of knowledge evolution and we describe our explanatory and predictive experiments to evaluate our measures.

SPECTER-based topic modeling

We use language-embedding-based topic modeling to identify topics within our corpus for a given year. To do so, we extract embeddings for each publication in our dataset using the concatenated title and abstract as an input to SPECTER (Scientific Paper Embeddings using Citation-informed TransformERs)[cohan2020specter], a model for generating document-level embeddings of scientific documents via pre-training on scientific papers and their citation graphs. (Specifically, we use the huggingface implementation[wolf-etal-2020-transformers] of the pre-trained SPECTER model.) SPECTER embeddings have been shown to outperform competitive baselines on benchmark document-level tasks such as citation prediction, document classification and recommendation[cohan2020specter].

To identify disciplines and subdisciplines, we use an unsupervised, non-parametric, hierarchical clustering algorithm, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Specifically, we soft-cluster SPECTER embeddings to reflect that papers may belong to multiple (sub)disciplines with different probabilities. As the performance of HDBSCAN generally degrades as the dimensionality of input data increases, we use UMAP to reduce the dimensionality of SPECTER embeddings prior to clustering with HDBSCAN. We use multi-objective Bayesian hyperparameter tuning[turner2021bayesian] for the UMAP-HDBSCAN pipeline to balance five evaluative criteria related to inter- vs. intra-cluster density, number of clusters, and persistence of clusters over multiple runs of the algorithm. The successfully clustered papers are considered “strong members” of their cluster.

We refer to papers that cannot be confidently assigned by the clustering algorithm as “weak members”. We assign each weak member to the cluster with which it has the highest semantic similarity. Downstream analyses are reported with and without inclusion of weak members. We consider this distinction because we suggest that weak members represent research which is significantly different (and potentially truly innovative) relative to existing disciplines, and as such can help explain shifts in the trajectories of fields. (HDBSCAN refers to these non-confident assignments as noise; however, we expect these not to be noise in the traditional sense, e.g., an outlier or data worthy of discarding as it provides no analytical value, but instead to potentially add value as extremely novel research.)

Figure 2: UMAP projection of publications in the year 2011 colored by HDBSCAN-generated cluster labels with corresponding cluster-level keyphrases. Each cluster plotted here contains at least 2.5% of total papers for the year and the size of each point is proportional to that publication’s language-based interdisciplinarity score. Small blue points represent weak members. Note that most clusters shown are well-separated and not homogeneous in shape, suggesting that UMAP is doing a good job of dimensionally reducing the feature space in such a way that it is reasonably straightforward to partition and that a variable-density-based clustering algorithm, such as HDBSCAN, is well-suited to identifying clusters in such a dataset.

For each cluster, we generate representative keyphrases using a procedure similar to the KeyBERT library[grootendorst2020keybert], with modifications (e.g., more performant aggregation of embeddings from large numbers of documents belonging to the same cluster). Deriving keyphrases provides explanatory power for clusters and adds more nuanced understanding of the clusters than other commonly used approaches to grouping knowledge products, e.g., WoS categories. Clusters identified in our dataset for the year 2011 and their corresponding keyphrases are shown in Figure 2. Our approach identifies a total of 371 clusters over the dataset, i.e., years 2011 through 2018.

Citation graphs and communities

Per common practice, our citation-based analysis considers the citation network wherein nodes in the graph represent papers in our dataset and undirected edges represent citation relationships. We detect communities in this network using the Louvain community detection algorithm[lu2015parallel]. This commonly used algorithm seeks to maximize the modularity of the network, a measure that compares the density of intra-community edges with what would be expected if edges were placed at random[lancichinetti2009community]. Specifically, for a given time window/year of interest we consider the subgraph containing only papers published in that year and earlier, as well as their references. This approach allows us to make predictions for past papers without fear that future papers citing them will cause information leakage into the dataset (e.g., a model trying to predict the evolution of an idea tied to a paper from 2017 should not have access to information about papers from 2018 citing it during model training). An example of the community structure discovered via the Louvain method is shown in Figure 3.
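The time-windowing step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary-and-tuple data layout, and the toy paper identifiers are hypothetical, and the actual pipeline presumably operates on a graph library object rather than plain sets.

```python
def time_windowed_edges(pub_year, citations, year):
    """Return the undirected citation edges visible at the end of `year`.

    pub_year  -- dict mapping paper id -> publication year (sampled papers)
    citations -- iterable of (citing_id, cited_id) pairs
    year      -- last publication year allowed into the subgraph

    An edge is kept only if the citing paper was published in `year` or
    earlier; its references are kept even when their own year is unknown.
    """
    edges = set()
    for citing, cited in citations:
        if citing == cited:
            continue  # ignore self-loops
        # Papers with unknown publication year never act as citing papers.
        if pub_year.get(citing, float("inf")) <= year:
            edges.add(frozenset((citing, cited)))  # undirected edge
    return edges


# Hypothetical toy data: paper C (2018) cites B (2017), which cites A (2016).
pub_year = {"A": 2016, "B": 2017, "C": 2018}
citations = [("B", "A"), ("C", "B"), ("A", "R")]
print(time_windowed_edges(pub_year, citations, 2017))
```

In the toy example the 2018 citation of B is excluded from the 2017 window, so features computed for B cannot be influenced by a paper from its future.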

Figure 3: An exemplary snapshot of the dense network and communities found by the Louvain community detection algorithm for the year 2011. Communities comprising less than 2.5% of total papers for the year are colored grey. Note that a clear community structure can be observed for this graph-only approach much like it was for the language-only clustering presented earlier.

Quantifying interdisciplinarity

Language-based interdisciplinarity: Our text-based interdisciplinarity (ID) metric scores each publication based on its soft clustering membership probabilities (i.e. the probability of a publication belonging to each possible cluster identified by standard or “hard” clustering), considering only strong member publications. It does so by assuming that one representation of interdisciplinarity is the diversity of language pulled from different fields. This metric is calculated using Equation 1 which considers the spread in its cluster assignment probabilities. Formally:


The terms of Equation 1 are the total number of clusters in the dataset, the probability of the paper belonging to each cluster, the probability of the paper being a weak member of any cluster, and the standard deviation of the membership probabilities over all clusters. This formulation is more intuitive when extreme cases are considered. For example, consider a corpus with 9 clusters for the year of interest. A paper that sits very clearly within a single well-defined scientific discipline has all of its membership probability on a single cluster (and, consequently, no probability of being a weak member), and receives the lowest interdisciplinarity score. Alternatively, imagine a paper with membership probabilities that are equivalent for all clusters, with the same probability that it may be a weak member. Such a paper receives a high score, reflecting that it belongs to a wide array of disciplines/clusters equally, but there is also some chance that it may be a weak member (which can also be interpreted as a global uncertainty in the membership probabilities), and this keeps it from achieving the maximum score.

Citation-based interdisciplinarity: We use betweenness centrality for each publication in the network as an interdisciplinarity metric, with higher centrality generally indicating higher interdisciplinarity, as has been done in previous literature[rafols2007diversity]. As we do for community detection, we use time-windowed subgraphs for centrality measurement. Betweenness centrality is lightly modified for use as an ID metric: the centrality of each paper is normalized to a [0,1] scale relative to the set of all centrality values for papers published in the same calendar year.
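One simple way to put centrality on a [0,1] scale within each publication year is min-max scaling, sketched below. This is an assumption made for illustration: the exact normalization used for the citation-based ID metric is not reproduced here and may differ, and the function name and inputs are hypothetical. The centrality values themselves would come from a graph library computed on the time-windowed subgraph.

```python
def normalize_by_year(centrality, pub_year):
    """Min-max scale centrality values to [0, 1] within each publication year.

    centrality -- dict mapping paper id -> raw betweenness centrality
    pub_year   -- dict mapping paper id -> publication year
    Assumption: min-max scaling stands in for the paper's normalization.
    """
    # Collect all centrality values observed for each publication year.
    by_year = {}
    for paper, value in centrality.items():
        by_year.setdefault(pub_year[paper], []).append(value)

    scores = {}
    for paper, value in centrality.items():
        values = by_year[pub_year[paper]]
        lo, hi = min(values), max(values)
        # If all papers in a year share one value, no spread can be scaled.
        scores[paper] = 0.0 if hi == lo else (value - lo) / (hi - lo)
    return scores


# Hypothetical centrality values for four papers.
raw = {"a": 0.0, "b": 0.2, "c": 0.4, "d": 0.5}
years = {"a": 2011, "b": 2011, "c": 2011, "d": 2012}
print(normalize_by_year(raw, years))
```

Scaling within a year keeps scores comparable across time windows whose graphs differ in size and density.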

Text-based dynamic event modeling

We identify and track critical knowledge evolution events, borrowing from the literature on tracking communities in dynamic social networks[greene2010tracking]. Specifically, a representative embedding for each cluster is calculated as the element-wise mean of the embeddings of the papers in that cluster, and clusters are compared across consecutive years by calculating the pairwise cosine similarity of the representative embeddings of each pair of clusters in the two years[greene2010tracking]. We then link a cluster with its best-matching cluster(s) in the consecutive time step if the cosine similarity is above 0.95. We employ the following taxonomy[greene2010tracking]:

  • A birth event is identified when a cluster in a given year has no matching cluster(s) in the preceding year.

  • A death event is identified when a cluster in a given year has no matching cluster(s) in the following year.

  • Multiple clusters have merged when one cluster in the later year matches two or more clusters in the earlier year.

  • A cluster has split when two or more clusters in the later year match a single cluster in the earlier year.

  • A continuation event is observed when one cluster in a given year is matched to exactly one cluster in the following year.

We group these events into two types for subsequent analyses: (1) dynamic – split or merge and (2) stable – continuation or death. (Not only does treating splits and merges as a single class emerge from our metastability mental model but, given that they often co-occur, this treatment creates non-overlapping classes.) We disregard birth events at present since a birth event has no preceding data from which to build a model and is unrelated to the concept of combinatorial innovation being described by metastability. Figure 4 gives a notional example of merge and continuation events. We note that events may occur in combination; e.g., a cluster may split into two, and those two clusters may simultaneously merge with two other clusters.
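The matching and event taxonomy above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: all function names are hypothetical, every pair above the 0.95 threshold is linked (rather than selecting best matches first), and events are labeled from the point of view of the earlier-year cluster.

```python
import math


def mean_embedding(vectors):
    """Element-wise mean of a cluster's paper embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_clusters(earlier, later, threshold=0.95):
    """Link every (earlier_id, later_id) pair whose similarity exceeds threshold."""
    return {(e, l) for e, ev in earlier.items()
            for l, lv in later.items() if cosine(ev, lv) > threshold}


def classify_events(earlier_ids, links):
    """Label each earlier-year cluster with the set of events it undergoes."""
    predecessors = {}
    for e, l in links:
        predecessors.setdefault(l, set()).add(e)
    events = {}
    for e in earlier_ids:
        successors = {l for (x, l) in links if x == e}
        labels = set()
        if not successors:
            labels.add("death")
        if len(successors) >= 2:
            labels.add("split")
        if any(len(predecessors[l]) >= 2 for l in successors):
            labels.add("merge")
        if not labels:
            labels.add("continuation")  # exactly one exclusive successor
        events[e] = labels
    return events


def is_dynamic(labels):
    """Binary grouping: dynamic (split or merge) vs. stable (continuation or death)."""
    return bool(labels & {"split", "merge"})


links = {("a", "x"), ("b", "x"), ("c", "y")}
print(classify_events(["a", "b", "c", "d"], links))
```

Because a cluster can carry both "split" and "merge" labels at once, collapsing them into a single dynamic class yields non-overlapping targets, which matches the grouping described above.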

Event-tracking and prediction

We hypothesize that interdisciplinarity scores and cluster size are indicators of metastability and can therefore be used to predict cluster evolution, i.e., dynamic vs. stable events, which serves as the endogenous (target) variable. In particular, for each language cluster in a given year, we use as exogenous model inputs: the cluster-wise mean language-based interdisciplinarity score (which does not include weak member papers); the mean citation-based interdisciplinarity score for weak and strong members, treated as separate features in order to discern whether considering weak members changes predictive power; and the number of weak and strong member papers in the cluster.

Figure 4: Notional continuation and merge events showing weak (significantly different from existing clusters) and strong members (high confidence in its membership) of each cluster.

To choose the most powerful features and test their predictive power (and thus value for further analyses), we use multinomial logistic regression and a Random Forest classifier with a binary target representing whether a dynamic event type (split or merge) is observed for a cluster, as shown in Equation 3.


We use the entire dataset with multinomial logistic regression for explanatory power. For the random forest, we use cluster events in the period 2011 through 2017 for training and the year 2018 for testing, resulting in roughly an 86%/14% train/test split by cluster count with 275 events for training (split/merge: 136; continuation/death: 139) and 43 testing events (split/merge: 21; continuation/death: 22). Using the above input features and the corresponding event types, we fit a random forest with 100 trees using the default hyperparameter values from the scikit-learn Python library[scikit-learn].
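The year-based hold-out described above can be sketched as below. The record layout and the function name are hypothetical; only the event counts in the arithmetic check come from the text.

```python
def temporal_split(events, test_year):
    """Split cluster-event records into train (before test_year) and test sets.

    events -- list of dicts, each with at least a "year" key
    Splitting by year, rather than at random, keeps the test set strictly
    in the future of everything the model was trained on.
    """
    train = [e for e in events if e["year"] < test_year]
    test = [e for e in events if e["year"] == test_year]
    return train, test


# Hypothetical records: one cluster event per dict.
events = [
    {"year": 2016, "dynamic": 1},
    {"year": 2017, "dynamic": 0},
    {"year": 2018, "dynamic": 1},
]
train, test = temporal_split(events, test_year=2018)
print(len(train), len(test))

# Arithmetic check of the reported split: 275 training and 43 testing events.
train_share = 100 * 275 / (275 + 43)
print(round(train_share), round(100 - train_share))
```

The resulting lists could then be turned into feature matrices and passed to any classifier; the random forest itself is omitted here to keep the sketch dependency-free.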


In the following, we first show that language and network frameworks capture different information by comparing the overlap between clusters identified using text and citation-based communities. We then further investigate the nature of the information provided by both frameworks by discussing how these representations, when considered together, not only serve to predict the evolution of disciplines and sub-fields but are equally important when doing so.

Comparing clusters and communities suggests valuable incomplete overlap

Figure 3 gives a snapshot of network communities in 2011; comparison with Figure 2 illustrates differences in grouping across the two approaches. In general, the Louvain algorithm detects communities in the citation network at a finer resolution than our text-based clustering. For reference, Figure 5 shows the number of clusters and communities in our dataset, in addition to a measure of overlap between the two that we describe below. The number of network communities generally decreases over time, reflecting a more integrated citation graph emerging amongst the papers in our sample.

Figure 5: Plot with number of clusters/communities identified by text-based (brown) and networks-based (blue) frameworks with inset plot showing percentage of language clusters associated with at least one network-derived community. Note that overlap values are consistently below 100% but well above 0%, suggesting unique and complementary insights added by each.

As both our language- and citation-based frameworks are unsupervised, to compare them we need to identify clusters with one another across frameworks. For this, we measure pairwise Jaccard similarity between clusters and communities, effectively looking at the number of publications shared by a language cluster and a network community relative to the total number of distinct papers belonging to either. If the similarity between a cluster and a community is above 0.1, then we consider them similar. This threshold-based method (and the 0.1 threshold specifically) has been used in the literature for tracking clusters and communities over time[greene2010tracking, asur2009event] and performs well across a variety of synthetic graphs. Going back to Figure 5, the inset shows the percentage of language clusters with similar (Jaccard similarity above 0.1) network communities. It can be seen that while there is an overlap between the communities and clusters, the overlap is not complete, which suggests that each approach adds unique insight.
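The overlap measure described above can be sketched as follows, using the 0.1 threshold from the text. The function names and the toy cluster and community memberships are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: shared members divided by members in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_percentage(clusters, communities, threshold=0.1):
    """Percentage of language clusters similar to at least one network community.

    clusters, communities -- dicts mapping a group id -> set of paper ids
    """
    if not clusters:
        return 0.0
    matched = sum(
        1 for members in clusters.values()
        if any(jaccard(members, comm) > threshold for comm in communities.values())
    )
    return 100.0 * matched / len(clusters)


# Hypothetical memberships: cluster c1 shares paper 3 with community k1.
clusters = {"c1": {1, 2, 3}, "c2": {10, 11}}
communities = {"k1": {3, 4}, "k2": {20}}
print(jaccard(clusters["c1"], communities["k1"]))
print(overlap_percentage(clusters, communities))
```

A low threshold such as 0.1 only requires modest shared membership, so a value well below 100% still indicates that many language clusters have no citation-community counterpart at all.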

Illustrating knowledge evolution events

To illustrate the types of knowledge events we identify and track in this work, let us consider an example from our dataset. Figure 6 shows the evolution of a full chain of language cluster evolutionary events over the period 2011 through 2018. Every cluster in this chain has ’Business and Finance’ and ’Economics’ as the most common WoS categories among member papers. In contrast, the keyphrases generated via our language clustering approach (shown in the Figure) reflect a much greater resolution, including phrases like “income hedging” and “intangible capital”. This chain starts with a 2011 cluster that appears related to the (then recent) U.S. housing market crisis and Great Recession. There is a strong focus on work discussing corporate governance and government spending. This focus on organizational-level finance and economics mostly continues through 2017, with only a few deviations that are more focused on overall market trends. This is epitomized by the representative paper for one of the 2016 clusters, focused on European banking. Then something happens in 2018: topics appear to shift substantially from organizational/macroeconomic concepts to research focused on individual-level spending, finance, and decision-making, as can be seen both from the keyphrases representing those linguistic clusters, as well as from the representative 2018 paper focused on accounting for consumer behaviors in investing. It is interesting to note that this timing corresponds with Richard Thaler’s 2017 Nobel Prize in Economics, awarded for contributions to behavioural economics.

Figure 6: Evolution of a set of language clusters from 2011 to 2018 (left to right) with keyphrases for each, along with representative papers for two of the clusters. Note the marked change in focus between 2016 and 2018 evidenced by representative titles and cluster keyphrases. The split event for the 2017 cluster was successfully predicted by the random forest classifier described later (green box).

Knowledge evolution is significantly associated with interdisciplinarity and weak members

We use multinomial logistic regression with the mentioned endogenous and exogenous variables to evaluate how knowledge evolution may be explained through our interdisciplinarity scores, cluster size and network metrics. Per common practice, we insert a constant and a year variable to account for potential temporal effects. We attempt to explain whether or not clusters split or merge first, in order to evaluate the strength of associations between our hypothesized inputs and outputs.

Per Table 1, we see significant positive associations between a cluster splitting or merging and both the language interdisciplinarity score and the network interdisciplinarity score computed over strong members only (i.e., without weak members). (Following common best practice, we first conducted tests with all features and, finding some insignificant, repeated with only significant features; see Supplementary Materials for details of this purposeful selection.) We also see a positive association with the number of weak members associated with a cluster, and a negative association with the year. (Year was included per common practice to remove potential associations from time passing. Note that this model had a higher pseudo-R² than a model without the year included. Future work should investigate any temporal associations through, e.g., time series analyses.) Though all marginal effects are on the same order of magnitude, ranking by those effects, the language interdisciplinarity score is most important, followed by the number of weak members and the network score without weak members. Next, we further investigate this statistical relationship by testing the predictive power of a model trained on only a subset of these cluster data.

                                                Model estimates           Marginal effects
Model input (per cluster)                       Coefficient   p-value     Effect    p-value
Mean language ID score (strong members only)     0.534        0.000        0.116    0.000
Number of weak members                           0.449        0.003        0.097    0.002
Mean network ID score (strong members only)      0.292        0.030        0.063    0.025
Publication year                                -0.372        0.007       -0.081    0.005
Constant                                        -0.009        0.941

Table 1: Multinomial logistic regression results describing associations with split or merge vs. continuation or death. Note significant positive associations with language score, network score, and number of weak members, and a negative association for the year. All other features were not significant and were left out via purposeful selection for a more parsimonious model; see Supplementary Materials.

Validating our statistical result with predictive power - equal importance of interdisciplinarity scores

Model input feature                             Gini importance
Mean language ID score (strong members only)    0.336
Number of weak members                          0.234
Mean network ID score (strong members only)     0.315
Publication year                                0.115

Table 2: Random forest results on a held-out test set predicting the type of event a given cluster would experience in the next year, with the same features as in Table 1. Performance was evaluated with a micro-averaged score on the held-out test set and a class-specific score for the class representing knowledge evolution (splits and merges). Per reported Gini feature importance of each independent variable, both interdisciplinarity scores are approximately equally important, followed by number of weak members, then year. Note that the sort order of this table is identical to that of Table 1 to allow for more direct comparison of logistic regression coefficients to random forest feature importances.

We have shown significant associations between knowledge splitting and merging on the one hand, and interdisciplinarity and weak members on the other. Here we go further by performing predictive modeling with a random forest classifier. Including only the features shown to be statistically significant, we achieve a micro-averaged on our held-out test set, with on our class representing knowledge evolution (i.e., splitting or merging), a performance that is significantly better than random chance. Table 2 shows that both interdisciplinarity scores are approximately equally important to this predictive power. The number of weak members associated with a given cluster is the next most important feature, followed by the publication year. We also checked for two known issues that can distort the Gini feature importance values from a random forest: multicollinearity among features, and a bias towards numeric and high-cardinality categorical features [strobl2007bias]. The first is not a problem in this case, as the highly correlated features were removed as a result of the logistic regression analysis discussed earlier. The second is expected to be only a minimal concern for this analysis, as the only non-numeric feature in this model is the publication year. Because this is a low-cardinality categorical feature, its importance may be underestimated and, as a result, the year’s true ranking in the feature importance table could be higher than indicated. As this would not materially change our analysis, correcting for this bias is beyond the scope of this work. Taken together, our results underscore the importance of including both the linguistic and network viewpoints of interdisciplinarity.
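The kind of analysis described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the feature names and the three-class labels are hypothetical stand-ins for the features in Table 2, and the hyperparameters are arbitrary. Permutation importance on the held-out set is included as one commonly used cross-check on impurity-based (Gini) importances; it is not a step reported in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Hypothetical feature names mirroring Table 2; all values are synthetic.
feature_names = [
    "mean_language_id_score",
    "n_weak_members",
    "mean_network_id_score",
    "publication_year",
]
X = np.column_stack([
    rng.random(n),                # language interdisciplinarity score
    rng.poisson(5, n),            # number of weak members
    rng.random(n),                # network interdisciplinarity score
    rng.integers(2011, 2019, n),  # publication year (arbitrary range)
])

# Synthetic three-class target driven by the first three features plus noise.
signal = X[:, 0] + X[:, 2] + 0.1 * X[:, 1] + rng.normal(0, 0.3, n)
y = np.digitize(signal, np.quantile(signal, [0.4, 0.8]))  # classes 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The default split criterion is Gini impurity.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

micro_f1 = f1_score(y_test, y_pred, average="micro")
class_f1 = f1_score(y_test, y_pred, average=None)  # one score per class

# Impurity-based (Gini) importances, normalized to sum to 1.
gini_importance = dict(zip(feature_names, clf.feature_importances_))

# Cross-check: drop in held-out score when each feature is shuffled.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = dict(zip(feature_names, perm.importances_mean))

print(f"micro-averaged F1: {micro_f1:.3f}")
print("class-specific F1:", np.round(class_f1, 3))
for name in feature_names:
    print(f"{name}: gini={gini_importance[name]:.3f}, "
          f"permutation={perm_importance[name]:.3f}")
```

If the two importance rankings disagree noticeably for a feature such as the year, that is a hint that the impurity-based value may be affected by the bias discussed above.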

Conclusions and Future Work

In this paper, we proposed a hybrid language- and network-based framework that uses state-of-the-art semantic embeddings and citation information to model metastability of ideas in order to identify dynamic events associated with the rise, fall, combination, and dispersion of topics in the scholarly corpus. We show that this hybrid approach is distinctly different from those based on linguistic or citation information alone. The approach we propose relies on a multi-dimensional view of interdisciplinarity as a predictor of scientific knowledge transitions.

Through both explanatory and predictive efforts, we show that language as well as network interdisciplinarity has positive effects on metastable knowledge combining and mixing. Interestingly, network interdisciplinarity of strong member papers is significantly predictive of these mixing events, even though the number of strong members is not. By contrast, even though weak members’ network interdisciplinarity is not significantly predictive, a larger number of weak members is predictive of knowledge combining and mixing. This suggests that papers that do not cluster neatly are indicative of combinatorial innovation that is expressed as the knowledge mixing events discussed herein. As such, if one is interested in spurring broad interdisciplinarity, one should focus on encouraging more weakly-clustered research, regardless of its own network-derived interdisciplinarity. Future work should further investigate these relationships, in particular over longer time scales and on a more complete body of the scientific corpus. Additionally, a comparison with other popular interdisciplinarity metrics is a natural extension of this work, to determine how well other established measures can predict the knowledge evolution events we have explored. A lack of consensus on the most useful interdisciplinarity metrics [wang2020consistency], however, makes this a challenge that must be tackled in later analyses.

This work also motivates and lays groundwork for new hybrid models that align multiple views of the literature (e.g., linguistic, bibliometric) into unified modeling frameworks. Looking beyond traditional single-view approaches, such frameworks would be better suited to capture the richness of the scholarly record. This can be achieved through so-called graph machine learning, which allows an integrated representation of a datum reflecting both its content (e.g., language in the case of a scientific paper) and its context within a network. Further, the work we describe here is mostly based on unsupervised learning. This is necessitated by the nature of this work, as there is no readily available ground truth that is universally acknowledged to reflect the changing nature of scientific thought, disciplines, and sub-disciplines at a time scale reflective of how ideas mature and evolve. Future work should build benchmark datasets with which the metascience community can engage to evaluate and test these approaches more thoroughly than is currently possible. Possible proxy datasets that do exist at the moment include citation records; the prediction of above-average citation growth, for example, could be another modeling task that further determines the utility of the interdisciplinarity metrics presented in this paper.

Data Availability

Parts of the data that support the findings of this study are available from Clarivate but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Clarivate.

Code Availability

The code used for data processing and model development for the current study is available from the corresponding author on reasonable request.



Acknowledgements

This research was supported by the National Center for Science and Engineering Statistics (NCSES) at the National Science Foundation through award 49100420C0030. The authors would like to thank Dr. Ashley Arigoni for her work on cluster comparison visualizations, as well as Mr. Joe Gorney and Mr. Alex Wade of Semantic Scholar for their aid in troubleshooting data engineering issues, and Dr. Ilya Rahkovsky of the Center for Security and Emerging Technology at Georgetown University and Dr. Phoebe Wong of Quantitative Scientific Solutions for their insights on the final analyses and drafts of this paper.

Author contributions statement

D.G., S.R. conceived the experiment(s); S.K., D.R.M. and M.S. conducted the experiment(s). All authors analyzed the results and reviewed the manuscript.

Additional information

The authors declare no competing interests.