Unsupervised embedding of trajectories captures the latent structure of mobility

12/04/2020 ∙ by Dakota Murray, et al. ∙ Indiana University Bloomington

Human mobility and migration drive major societal phenomena such as the growth and evolution of cities, epidemics, economies, and innovation. Historically, human mobility has been strongly constrained by physical separation – geographic distance. However, geographic distance is becoming less relevant in the increasingly-globalized world in which physical barriers are shrinking while linguistic, cultural, and historical relationships are becoming more important. As understanding mobility is becoming critical for contemporary society, finding frameworks that can capture this complexity is of paramount importance. Here, using three distinct human trajectory datasets, we demonstrate that a neural embedding model can encode nuanced relationships between locations into a vector-space, providing an effective measure of distance that reflects the multi-faceted structure of human mobility. Focusing on the case of scientific mobility, we show that embeddings of scientific organizations uncover cultural and linguistic relations, and even academic prestige, at multiple levels of granularity. Furthermore, the embedding vectors reveal universal relationships between organizational characteristics and their place in the global landscape of scientific mobility. The ability to learn scalable, dense, and meaningful representations of mobility directly from the data can open up a new avenue of studying mobility across domains.


1 Introduction

How far apart are two places? The question is surprisingly hard to answer when it involves human mobility. Although geographic distance has been constraining human movements throughout history, it is becoming less relevant in a world connected by rapid transit and global airline networks. For instance, a person living in Australia is more likely to migrate to the United Kingdom, a far-away country with similar language and culture, than to a much closer country such as Indonesia pew2018migration. Similarly, a student in South Korea is more likely to attend a university in Canada than one in North Korea unesco2019students. Although geographic distance has been used as the most prominent basis for models of mobility, such as the Gravity zipf1946gravity and Radiation simini2012universal models, there have been attempts to define alternative notions of distance, or functional distances boschma2005proximity, brown1970migration, kim2018functional, from real-world data or a priori relationships between geographic entities.

Yet, functional distances are often low-resolution, computed at the level of countries rather than regions, cities, or organizations, and focus on only a single facet of mobility at a time, whereas real-world mobility is multi-faceted, influenced simultaneously by geography, language, culture, history, and economic opportunity. A low-dimensional distance alone cannot represent the multitude of inter-related factors that drive mobility. Networks offer a solution to representing many dimensions of mobility, yet edges only encode simple relationships between connected entities. Capturing the complexity of mobility requires moving beyond simple functional distances and networks, to learning high-dimensional landscapes of mobility that incorporate the many facets of mobility into a single fine-grained and continuous representation.

Here, we apply a neural embedding framework to real-world mobility trajectories and demonstrate that it can encode the complex landscape of human mobility into a dense and continuous vector-space representation, from which we can not only derive a meaningful functional distance between locations but also probe relationships based on culture, language, and even prestige along with the geographic relationship. We embed trajectories from three massive datasets: U.S. passenger flight itinerary records, Korean accommodation reservations, and a dataset of scientists’ career mobility between organizations captured in bibliometric records.

The flight itinerary data, from the Airline Origin and Destination Survey, consists of records of more than 300 million itineraries between 1993 and 2020 documenting domestic flights between 828 airports in the United States. A trajectory is constructed for each passenger flight itinerary, forming an ordered sequence of unique identifiers of the origin and destination airports. The Korean accommodation reservations consist of customer reservation histories across 2018 and 2020 for 1,038 unique accommodation locations in Seoul, South Korea. A trajectory is constructed for each customer, containing the ordered sequences of accommodations they reserved over time. Finally, we use scientific mobility data that captures the affiliation trajectories of nearly 3 million scientists across ten years. We focus in more detail on scientific mobility due to its richness and importance. Scientific mobility—which is a central driver of the globalized scientific enterprise czaika2018globalisation, box2008competition and strongly related to innovation braunerhjelm2020labor, kaiser2018innovation, impact sugimoto2017mostimpact, petersen2018multiscale, collaboration rodrigues2016mobility, and the diffusion of knowledge braunerhjelm2020labor, morgan2018prestige—is not only an important topic in the Science of Science but also ideal for our study thanks to its well-known structural properties such as the centrality of scientifically advanced countries and the strong prestige hierarchy clauset2015hierarchy, deville2014career. In spite of its importance, understandings of scientific mobility have been limited by the sheer scope and complexity of the phenomenon robinson2019mobility, deville2014career, being further confounded by the diminishing role of geography in shaping the landscape of scientific mobility.

Trajectories of scientific mobility are constructed using more than three million name-disambiguated authors who were mobile (having more than one affiliation) between 2008 and 2019, as evidenced by their publications indexed in the Web of Science database (see Methods). As a scientist’s career progresses, they move between organizations or pick up additional (simultaneous) affiliations, forming affiliation trajectories (Fig. 1). Thus, the trajectories encode both migration and co-affiliation (the holding of multiple simultaneous affiliations, which involves sharing time and capital between locations) that is typical of scientific mobility rodrigues2016mobility, sugimoto2017mostimpact (see Supporting Information).

A vector-space embedding of locations (airports, accommodations, and organizations) is learned by using trajectories as input to the standard skip-gram with negative sampling (word2vec) neural-network architecture (see Methods). This neural embedding model, originally designed for learning language models mikolov2013word2vec, has been making breakthroughs by revealing novel insights into texts tshitoyan2019mat2vec, garg2018gender, kozlowski2018geometry, hamilton2016diachronic, le2014doc2vec, nakandala2017gendered and networks grover2016node2vec, linzhuo2020hyperbolic. The model is also computationally efficient, robust to noise, and can encode relations between entities as geometric relationships in the vector space levy2014neural, nakandala2017gendered, kozlowski2018geometry, an2018semaxis. As a result, each location is encoded into a single vector representation, and vectors relate to one another based on the likelihood of locations appearing adjacent to one another in the same trajectory.
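To make the "adjacent in the same trajectory" idea concrete, the sketch below lists the (target, context) pairs that a skip-gram model would be trained on for one trajectory. It is only an illustration: the function name `skipgram_pairs`, the location IDs, and the window size of 1 are assumptions, not settings reported in the text, and the negative-sampling training step itself is omitted.

```python
def skipgram_pairs(trajectory, window=1):
    """List (target, context) pairs for locations within `window` steps of each other.

    `window` is a hypothetical setting chosen for illustration only.
    """
    pairs = []
    for i, target in enumerate(trajectory):
        start = max(0, i - window)
        stop = min(len(trajectory), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a location is not its own context
                pairs.append((target, trajectory[j]))
    return pairs


# Hypothetical trajectory of three location IDs.
trajectory = ["org_A", "org_B", "org_C"]
pairs = skipgram_pairs(trajectory, window=1)
print(pairs)
```

Each adjacent pair appears in both orders, so in this sketch a move from `org_A` to `org_B` contributes the same training signal as a move in the opposite direction.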

Figure 1: Construction of affiliation trajectories from publication records. a. An author published five papers across five time periods, with only one affiliation listed in the byline of each paper. A unique identifier is assigned to each organization and they are assembled into an affiliation trajectory ordered by year of publication. b. If an author lists multiple organization affiliations within the same year, then organization IDs within that year are placed in random order in each training iteration of the word2vec model (for more detail, see Supporting Information).
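The construction described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions: the record format `(year, org_id)`, the function name `build_trajectory`, and the choice to emit one entry per organization per year are all hypothetical, and the Supporting Information may specify details differently.

```python
import random
from collections import defaultdict


def build_trajectory(records, rng):
    """Assemble an affiliation trajectory from (year, org_id) records.

    Organizations are ordered by publication year. Organizations listed in
    the same year are shuffled, so calling this once per training iteration
    yields a fresh within-year ordering each time.
    """
    orgs_by_year = defaultdict(set)
    for year, org_id in records:
        orgs_by_year[year].add(org_id)

    trajectory = []
    for year in sorted(orgs_by_year):
        orgs = sorted(orgs_by_year[year])  # fixed base order before shuffling
        rng.shuffle(orgs)                  # random order within the year
        trajectory.extend(orgs)
    return trajectory


# Hypothetical author with two simultaneous affiliations in 2010.
records = [(2008, "org_A"), (2009, "org_A"), (2010, "org_B"),
           (2010, "org_C"), (2011, "org_C")]
rng = random.Random(0)
trajectory = build_trajectory(records, rng)
print(trajectory)
```

Only the relative order of `org_B` and `org_C` in 2010 varies between calls; the year-to-year order is fixed.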

To validate our approach, we evaluate the quality of vector representations with their performance in predicting real-world mobility flows using the gravity model framework zipf1946gravity. The gravity model is a widely used mobility model curiel2018citygravity, jung2008highwaygravity, hong2016busgravity, truscott2012epidemicgravity that models the expected flux, $\hat{T}_{ij}$, between locations $i$ and $j$ based on their populations and distance:

$$\hat{T}_{ij} = C \, m_i m_j \, f(r_{ij}) \qquad (1)$$

where $m_i$ is the population of location $i$, $f(r_{ij})$ is a decay function with respect to the distance $r_{ij}$ between locations, and $C$ is a constant estimated from data (see Methods). For the flight itinerary data, we use as the population $m_i$ the total number of unique passengers who passed through each airport; for the Korean accommodation reservation data, we use the total number of unique customers who booked with each accommodation; and for scientific mobility, we use the mean annual number of unique mobile and non-mobile authors who were affiliated with each organization. $\hat{T}_{ij}$, which is often referred to as “expected flux” simini2012universal, is the expected frequency of the co-occurrence of locations $i$ and $j$ in the trajectory in the gravity model. The gravity model dictates that the expected flow, $\hat{T}_{ij}$ ($\hat{T}_{ij} = \hat{T}_{ji}$), is proportional to the locations’ populations, $m_i m_j$, and decays as a function of their distance, $r_{ij}$. We define the distance function in terms of either the geographic distance between locations or their functional distance in the vector space, which is calculated as the cosine distance between their vectors, termed the embedding distance. For geographic distance, we define $f(r_{ij})$ as the standard power-law function, and for the embedding distance, we use the exponential function, selected as the best performing for each case (Fig. S6 and Fig. S7).
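The gravity model calculation can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: it writes the expected flux as a constant times the product of the two populations times a decay of distance, and it assumes the common parameterizations r^(-alpha) for the power-law decay and exp(-beta * r) for the exponential decay. The parameter names `alpha`, `beta`, and `c`, and all numeric values, are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Embedding distance: one minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def power_law_decay(r, alpha):
    """Assumed form of the power-law decay used with geographic distance."""
    return r ** (-alpha)


def exponential_decay(r, beta):
    """Assumed form of the exponential decay used with embedding distance."""
    return math.exp(-beta * r)


def expected_flux(m_i, m_j, decay_value, c=1.0):
    """Gravity model: constant * population_i * population_j * decay."""
    return c * m_i * m_j * decay_value


# Hypothetical populations, distance, vectors, and decay parameters.
m_i, m_j = 100.0, 50.0
r_geo = 10.0
r_emb = cosine_distance([1.0, 0.0], [0.0, 1.0])

flux_geo = expected_flux(m_i, m_j, power_law_decay(r_geo, alpha=2.0))
flux_emb = expected_flux(m_i, m_j, exponential_decay(r_emb, beta=2.0))
print(flux_geo, flux_emb)
```

Because both the population product and the distance are symmetric in the two locations, swapping them leaves the expected flux unchanged.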

Embeddings provide functional distance between locations

We show that the embedding distance better predicts actual mobility flows than the geographic distance across three disparate datasets. In the case of flight itineraries, the embedding distance explains more than twice the expected flux between airports (, Fig. 2a) than does geographic distance (). Also, the embedding distance produces better predictions of actual flux between airports than does the geographic distance (Fig. 2b). In the case of Korean accommodation reservations, embedding distance better explains the expected flux (, Fig. 2c) than does geographic distance (), and predictions made using the embedding distance outperform those made with geographic distance (Fig. 2d). This performance is consistent in the case of scientific mobility: the embedding distance explains more than twice the expected flux (, Fig. 2e) than does the geographic distance (), and predictions made using the embedding distance outperform those using the geographic distance (Fig. 2f). These patterns hold for the subsets of only domestic (within-country organization pairs, Fig. S6 and Fig. S8c) and only international mobility flows (across-country organization pairs, Fig. S8d and Fig. S7). The embedding distance also out-performs alternative diffusion-based network distance measures including the personalized-page rank scores calculated from the underlying mobility network (Fig. S11, Fig. S12). In sum, our results demonstrate that, consistently, the embedding distance better captures patterns of actual mobility than does the geographic distance.
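One way to score such predictions is sketched below: a squared Pearson correlation and a root-mean-squared error, both computed on log-transformed flux. This is an assumption-laden illustration. The text reports correlations on a log-log scale and an RMSE between actual and predicted values, but the exact procedure is in the Methods, and the function names and numbers here are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_scale_scores(actual, predicted):
    """Return (R^2, RMSE), both computed on log10-transformed flux values."""
    log_actual = [math.log10(a) for a in actual]
    log_predicted = [math.log10(p) for p in predicted]
    r_squared = pearson_r(log_actual, log_predicted) ** 2
    squared_errors = [(a - p) ** 2 for a, p in zip(log_actual, log_predicted)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return r_squared, rmse


# Hypothetical flux values: every prediction is exactly twice the actual flux.
actual = [10.0, 100.0, 1000.0]
predicted = [20.0, 200.0, 2000.0]
r_squared, rmse = log_scale_scores(actual, predicted)
print(r_squared, rmse)
```

In this toy case the correlation is perfect even though every prediction is off by a factor of two, while the RMSE stays above zero. The two measures are therefore complementary: correlation captures how well the trend is explained, and RMSE captures how close the predicted values actually are.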

Figure 2: Embedding distance encodes functional distance and better predicts mobility in flights, accommodation reservations, and global scientific mobility. a. Embedding distance (cosine distance between organization vectors, top) better explains the expected flux of passengers between U.S. airports than does geographic distance (bottom). The red line is the line of the best fit. Black dots are mean flux across binned distances. 99% confidence intervals are plotted but are too small to be visible. Correlation is calculated on the data in the log-log scale. The lightness of each hex bin indicates the frequency of organization pairs within it. b. Predictions of flux between airport pairs made using embedding distance (top) outperform those made using geographic distance (bottom). Boxplots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps with ; a perfect prediction yields all points on the black line. “RMSE” is the root-mean-squared error between the actual and predicted values. Results are consistent in the case of scientific mobility. For Korean accommodation reservations, embedding distance better explains the expected flux than does geographic distance (c), and produces better predictions (d). Similarly, in the case of global scientific mobility, embedding distance explains the expected flux between organizations (e) and allows for better predictions (f) than geographic distance.

Embeddings capture global structure of mobility

In the remainder of the paper, we focus on scientific mobility, leveraging its richness to investigate how the geometric space generated by the neural embedding method sheds light on the multi-faceted relationships between organizations. To explore the topological structure of the embedding, we use a topology-based dimensionality reduction method (UMAP mcinnes2018umap) to obtain a two-dimensional representation of the embedding space (Fig. 3a). By showing relationships between individual organizations, rather than aggregates such as nations or cities, this projection constitutes the largest and highest resolution “map” of scientific mobility to date.

Globally, the geographic constraints are conspicuous; organizations tend to form clusters based on their national affiliations and national clusters tend to be near their geographic neighbors. At the same time, the embedding space also reflects a mix of geographic, historic, cultural, and linguistic relationships between regions much more clearly than alternative network representations (Fig. S13) that have been common in studies of scientific mobility chinchilla2018global, czaika2018globalisation.

The embedding space also allows us to zoom in on subsets and re-project them to reveal local relationships. For example, re-projecting organizations located in Western, Southern, and Southeastern Asia with UMAP (Fig. 3b) reveals a gradient of countries between Egypt and the Philippines that largely corresponds to geography, but with some exceptions seemingly stemming from cultural and religious similarity; Malaysia, with its official religion of Islam, is nearer to Middle Eastern countries in the embedding space than many geographically-closer South Asian countries. We validate this finding quantitatively with the cosine distance between nations (the centroids of organizations vectors belonging to that country). Malaysia is nearer to many Islamic countries such as Iraq (), Pakistan (), and Saudi Arabia () than neighboring but Buddhist Thailand () and predominantly-secular Vietnam ().
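The country-level comparison can be sketched as a centroid computation followed by a cosine distance. The example below is illustrative only: the two-dimensional vectors and the country labels are made up, and real embedding vectors have many more dimensions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical organization vectors grouped by (made-up) country labels.
org_vectors_by_country = {
    "country_X": [[1.0, 0.0], [1.0, 0.2]],
    "country_Y": [[0.9, 0.1], [1.1, 0.1]],
    "country_Z": [[0.0, 1.0], [0.0, 1.0]],
}
country_vectors = {
    country: centroid(vectors)
    for country, vectors in org_vectors_by_country.items()
}

d_xy = cosine_distance(country_vectors["country_X"], country_vectors["country_Y"])
d_xz = cosine_distance(country_vectors["country_X"], country_vectors["country_Z"])
print(d_xy, d_xz)
```

A smaller distance between two centroids is read as the two countries being closer in the landscape of mobility, regardless of how far apart they are geographically.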

Linguistic and historical ties also affect scientific mobility. We observe that Spanish-speaking Latin American nations are positioned near Spain (Fig. 3c), rather than Portuguese-speaking Brazil ( vs. for Mexico and vs. for Chile) reflecting linguistic and cultural ties. Similarly, North-African countries that were once under French rule such as Morocco are closer to France () than to similarly geographically-distant European countries such as Spain (), Portugal (), and Italy (). Comparable patterns exist even within a single country. For example, organizations within Quebec in Canada are located nearer France () than the United States ().

Mirroring the global pattern, organizations in the United States are largely arranged according to geography (Fig. 3d). Re-projecting organizations located in Massachusetts (Fig. 3e) reveals structure based on urban centers (Boston vs. Worcester), organization type (e.g., hospitals vs. universities), and university systems (University of Massachusetts system vs. Harvard & MIT). For example, even though UMass Boston is located in Boston, it clusters with other universities in the UMass System () rather than the other typically more highly-ranked and research-focused organizations in Boston (), implying a relative lack of mobility between the two systems. Similar structures can be observed in other states such as among New York’s CUNY and SUNY systems (Fig. S14), Pennsylvania’s state system (Fig. S15), Texas’s Agricultural and Mechanical universities (Fig. S16), and between the University of California and State University of California systems (Fig. S17).

Figure 3: Projection of embedding space reveals complex multi-scale structure of organizations. a. UMAP projection mcinnes2018umap of the embedding space reveals country-level clustering. Each point corresponds to an organization and its size indicates the average annual number of mobile and non-mobile authors affiliated with that organization from 2008 to 2019. Color indicates the region. The separation of organizations in Quebec and the rest of Canada is highlighted. b. Zooming into (re-projecting) the area containing countries in Western, South, and Southeast Asia shows a geographic and cultural gradient of country clusters. c. Similarly, zooming into the area containing organizations in Spain, Portugal, South, and Central America shows clustering by most widely-spoken majority language group: Spanish and Portuguese. d. Doing the same for organizations in the United States reveals geographic clustering based on state, roughly grouped by Census Bureau-designated regions. e. Zooming in further on Massachusetts reveals clustering based on urban center (Boston, Worcester), organizational sector (hospitals vs. universities), and university systems and prestige (UMass system vs. Harvard, MIT, etc.).

Just as the embedding space makes it possible to zoom in on subsets of organizations, it is also possible to zoom out by aggregating organizational vectors. We can examine the country-level structure that governs scientific mobility. For this purpose, we define the representative vector of each country as the average of its organizational vectors and, using their cosine similarities, perform hierarchical clustering of nations that have at least 25 organizations represented in the embedding space, excluding the United States, which is a dominant hub well connected to most other countries (see Fig. 4a). The six identified clusters roughly correspond to countries in East Asia (orange), Scandinavia (dark purple), the British Commonwealth (light purple), and Central and Eastern Europe (light blue), along with two remaining diverse clusters that contain a mix of European, Latin American, and Mediterranean countries. The cluster structure shows that not only geography but also linguistic ties between countries are related to scientific mobility.

We quantify the relative importance of geography (by region) and language (by the most widely-spoken language of each country) using the element-centric clustering similarity gates2019element, a method that can compare hierarchical clusterings by explicitly adjusting the relative importance of different levels of the hierarchy with a scaling parameter that acts like a zooming lens. If the scaling parameter is high, the similarity is based on the lower levels of the dendrogram, whereas when it is low, the similarity is based on higher levels. Fig. 4b demonstrates that regional relationships play a major role at higher levels of the clustering process (low values of the scaling parameter), and language (family) explains the clustering more at the lower levels (high values). This suggests that the embedding space captures the hierarchical structure of mobility.

Figure 4: Geography, then language, conditions international mobility. a. Hierarchically clustered similarity matrix of country vectors aggregated as the mean of all organization vectors within the country, excluding the United States and countries with fewer than 25 organizations. Color of matrix cells corresponds to the cosine similarity between country vectors. Color of country names corresponds to their cluster. Color of three cell columns separated from the matrix corresponds to, from left to right, the region of the country, the language family ethnologue, and the dominant language. b. Element-centric cluster similarity gates2019element reveals the factors dictating hierarchical clustering. Region better explains the grouping of country vectors at higher levels of the clustering. Language family, and then the most widely-spoken language, better explain the fine-grained grouping of countries.

Embeddings capture latent prestige hierarchy

Prestige hierarchy is known to underpin the dynamics of scientific mobility, in which researchers tend to move to similar or less prestigious organizations deville2014career, clauset2015hierarchy. Could the embedding space, to which no explicit prestige information is given, encode a prestige hierarchy? This question is tested by exploiting the geometric properties of the embedding space with SemAxis an2018semaxis. Here, we use SemAxis to operationalize the abstract notion of academic prestige, defining an axis in the embedding space where poles are defined using known high- and low-ranked universities. As an external proxy of prestige, we use the Times Ranking of World Universities (we also use research impact from the Leiden Ranking waltman2012leidenrankings, see Supporting Information); the high-rank pole is defined as the average vector of the top five U.S. universities according to the rankings, whereas the low-rank pole is defined using the five bottom-ranked (geographically-matched by U.S. census region) universities. We derive an embedding-based ranking for universities based on the geometrical spectrum from the high-ranked to low-ranked poles (see Data and Methods).
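The axis construction can be sketched as follows, assuming the usual SemAxis recipe: the axis is the high-rank pole vector minus the low-rank pole vector, and each organization is scored by the cosine similarity between its vector and that axis. The organization names and two-dimensional vectors are hypothetical, and only two universities per pole are used here instead of five.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semaxis_scores(org_vectors, high_pole_names, low_pole_names):
    """Score every organization along the axis from the low pole to the high pole."""
    high_pole = mean_vector([org_vectors[name] for name in high_pole_names])
    low_pole = mean_vector([org_vectors[name] for name in low_pole_names])
    axis = [h - l for h, l in zip(high_pole, low_pole)]
    return {name: cosine_similarity(vec, axis) for name, vec in org_vectors.items()}


# Hypothetical embedding vectors.
org_vectors = {
    "high_1": [1.0, 0.0],
    "high_2": [0.9, 0.1],
    "low_1": [0.0, 1.0],
    "low_2": [0.1, 0.9],
    "mid": [0.5, 0.5],
}
scores = semaxis_scores(org_vectors, ["high_1", "high_2"], ["low_1", "low_2"])

# Embedding-based ranking: highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Organizations that were not used to define the poles (here `mid`) still receive a score, which is what allows the axis to rank institutions that the external ranking does not cover.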

The embedding space encodes a prestige hierarchy of U.S. universities that is coherent with real-world university rankings. The embedding-based ranking is strongly correlated with the Times ranking (Spearman’s $\rho$, Fig. 5a). For reference, the correlation between the Times ranking and the publication impact scores from the Leiden Ranking waltman2012leidenrankings, a bibliometrically-based university ranking, is 0.87 (Spearman’s $\rho$, Fig. 5b). The correlation between the embedding-based ranking and the Times ranking is robust regardless of the number of organizations used to define the axes (Fig. S18), such that even using only the single top-ranked and bottom-ranked universities produces a ranking that is significantly correlated with the Times ranking (Spearman’s $\rho$, Fig. S18a). The correlation is also comparable to more direct measures such as node strength (sum of edge weights) and eigenvector centrality (see Supporting Information) from the mobility network. The strongest outliers that were ranked more highly in the Times ranking than in the embedding-based ranking tend to be large state universities such as Arizona State University and the University of Florida. Those ranked higher in the embedding-based ranking tend to be relatively-small universities near major urban areas such as the University of San Francisco and the University of Maryland Baltimore County, possibly reflecting exchanges of scholars with nearby high-ranked institutions at these locations. In sum, our results suggest that the embedding space is capable of capturing information about academic prestige, even when the representation is learned using data without explicit information on the direction of mobility (as in other formal models clauset2015hierarchy), or prestige.
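For readers unfamiliar with the statistic, the sketch below computes Spearman's rank correlation between two rankings. It uses the textbook shortcut formula, which is valid only when there are no tied values; that no-ties condition is an assumption of this sketch, and the two rankings are invented for illustration.

```python
def to_ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rank_x = to_ranks(xs)
    rank_y = to_ranks(ys)
    n = len(xs)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * sum_d_squared / (n * (n * n - 1))


# Hypothetical rank positions of five universities in two rankings.
embedding_rank = [1, 2, 3, 4, 5]
external_rank = [2, 1, 3, 5, 4]
rho = spearman_rho(embedding_rank, external_rank)
print(rho)
```

Because the statistic depends only on rank order, it is insensitive to how the embedding-based scores are scaled along the axis.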

The axes can be visualized to examine the relative position of organizations along the prestige axis, and along a geographic axis between California and Massachusetts. Prestigious universities such as Columbia, Stanford, MIT, Harvard, and Rockefeller are positioned towards the top of the axis (Fig. 5c). Universities at the bottom of this axis tend to be regional universities with lower national profiles (yet still ranked by Times Higher Education) and with more emphasis on teaching, such as Barry University and California State University at Long Beach. By projecting other types of organizations onto the prestige axis, SemAxis offers a new means of reasoning about the prestige of organizations for which rankings are often low-resolution, incomplete, or entirely absent, such as regional and liberal arts universities (Fig. 5d), research institutes (Fig. 5e), and government organizations (Fig. 5f). Their estimated prestige is speculative, though we find that it significantly correlates with their citation impact (Fig. S22).

Figure 5: Embedding captures latent geography and prestige hierarchy. a. Comparison between the ranking of organizations in the Times ranking and the embedding ranking derived using SemAxis. Un-filled points are those top and bottom five universities used to span the axis. Even when considering only a total of ten organization vectors, the estimate of the Spearman’s rank correlation between the embedding and Times ranking is (), which increases when more top-and-bottom ranked universities are included (Fig. S18). b. The Times ranking is correlated with Leiden Ranking of U.S. universities with Spearman’s . c-f. Illustration of SemAxis projection along two axes; the latent geographic axis, from California to Massachusetts (left to right) and the prestige axis. Shown for U.S. Universities (c), Regional and liberal arts colleges (d), Research institutes (e), and Government organizations (f). Full organization names are listed in Table S1.

We also find that the size (L2 norm) of the organization embedding vectors provides insights into the characteristics of organizations. Up to a point (around 1,000 researchers), the size of U.S. organizations’ vectors tends to increase proportionally to the number of researchers (both mobile and non-mobile) with published work; these organizations are primarily teaching-focused institutions, agencies, and hospitals that either are not ranked or have a low ranking. However, beyond around 1,000 researchers, the size of the vector decreases as the number of researchers increases. These organizations are primarily research-intensive and prestigious universities with higher rank, research outputs, R&D funding, and doctoral students (Fig. S23). A similar pattern has been observed in applications of neural embedding to natural language, in which the size of word vectors was found to represent the word’s specificity, i.e., the word associated with the vector frequently co-appears with particular context words schakel2015measuring. If the word in question is universal, appearing frequently in many different contexts, it would not have a large norm due to a lack of strong association with a particular context. Likewise, an organization with a small norm, such as Harvard, appears in many contexts alongside many different organizations in affiliation trajectories; it is well-connected. The concavity of the curve emerges in part from the relationship between the size of the vector and the expected connectedness of the organization, given its size (). Large, prestigious, and well-funded research universities such as Princeton and Harvard have smaller vector norms because they appear in many different contexts compared to more teaching-focused organizations such as NY Medical College and the University of Michigan at Flint. Some universities, such as the University of Alaska at Fairbanks, have considerably smaller vectors, which may be a result of their remote locations and unique circumstances.
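The vector size discussed here is the ordinary Euclidean (L2) norm, which can be computed as in the short sketch below. The organization labels and vectors are hypothetical and chosen only so that the norms are easy to check by hand.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


# Hypothetical embedding vectors for two organizations.
org_vectors = {
    "org_many_contexts": [0.3, 0.4],
    "org_few_contexts": [3.0, 4.0],
}
norms = {name: l2_norm(vec) for name, vec in org_vectors.items()}
print(norms)
```

Under the interpretation given in the text, the organization with the smaller norm would be the one that appears alongside many different organizations in affiliation trajectories.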

We report that this curve is almost universal across many countries. For instance, China’s curve closely mirrors that of the United States (Fig. 6b). Smaller but scientifically advanced countries such as Australia and other populous countries such as Brazil also exhibit curves similar to the United States (Fig. 6b, inset). Other nations exhibit different curves which lack the portions with decreasing norm, probably indicating the lack of internationally-prestigious institutions. Similar patterns can be found across many of the 30 countries with the most total researchers (Fig. S24; see Supporting Information for more discussion).

Figure 6: Size of organization embedding vectors captures prestige and size of organizations. a. Size (L2 norm) of organization embedding vectors compared to the number of researchers for U.S. universities. Color indicates the rank of the university from the Times ranking, with 1 being the highest ranked university. Uncolored points are universities not listed on the Times ranking. A concave-shape emerges, wherein larger universities tend to be more distant from the origin (large L2 norm); however, the more prestigious universities tend to have smaller L2 norms. b. We find a similar concave-curve pattern across many countries such as the United States, China, Australia, Brazil, and others (inset, and Fig. S24). Some countries exhibit variants of this pattern, such as Egypt, which is missing the right side of the curve. The loess regression lines are shown for each selected country, and for the aggregate of remaining countries, with 99% confidence intervals. Loess lines are also shown for organizations in Australia, Brazil, and Egypt (inset).

Conclusion

Neural embedding approaches offer a novel, data-driven solution for learning an effective and robust representation of locations based on trajectory data, encoding the complex and multi-faceted nature of mobility. We demonstrated that a functional distance derived from the embedding can be used with the gravity model of mobility to better predict real-world mobility than does geographic distance. Embedding distance outperformed geographic distance across distinct and disparate domains, including U.S. flight itineraries, Korean hotel accommodation reservations, and global scientific mobility. Focusing on scientific mobility, we find that neural embedding’s performance is driven by its ability to encode many aspects of scientific mobility into a single representation, including global and regional geography, shared languages, and prestige hierarchies, even without explicit information on these factors.

While we focus on three domains of mobility, this approach may be broadly applicable to other kinds of mobility data, such as general human migration, transit-network mobility, and more. Moreover, this approach can be used to learn a functional distance even between entities for which no physical analog exists, such as between occupational categories based on individuals’ career trajectories. In addition to providing a functional distance that supports modeling and predicting mobility patterns, the structure of the neural embedding space is amenable to a range of unique applications for studying mobility. As we have shown, the embedding space allows the visualization of the complex structure of scientific mobility at high resolution across multiple scales, providing a large and detailed map of the landscape of global scientific mobility. Embedding also allows us to quantitatively explore abstract notions such as academic prestige, and can potentially be generalized to other abstract axes. Investigation of the structure of the embedding space, such as the vector norm, reveals universal patterns based on the organization’s size and their vector norm that could be leveraged in future studies of mobility.

In spite of its promise, our study has several limitations. First, the skip-gram word2vec model does not leverage directionality, meaning that embedding will be less effective at capturing mobility for which directionality is critical. Second, the neural embedding approach is most useful in cases of mobility between discrete geographic units such as between countries, cities, and businesses; this approach is less useful in the case of mobility between locations represented using geographic coordinates, such as in the modeling of animal movements. Neither of these methodological limitations is insurmountable, and future work can aim to incorporate directionality and identify meaningful representations of continuous mobility data. Finally, the case of scientific mobility presents domain-specific limitations. Reliance on bibliometric metadata means that we capture only long-term mobility such as migration, rather than the array of more frequent short-term mobility such as conference travel and temporary visits. The kinds of mobility we do capture, migration and co-affiliation, are conceptually different but are treated identically by our model. Our data might also suffer from bias based on publication rates: researchers at prestigious organizations tend to have more publications, leading to these organizations appearing more frequently in affiliation trajectories.

Mobility and migration are at the core of human nature and history, driving societal phenomena as diverse as the growth and evolution of cities wef2017migration, curiel2018citygravity, epidemics kraemer2020covid, truscott2012epidemicgravity, economies kaluza2010cargo, kerr2011immigration, and innovation kaiser2018innovation, sugimoto2017mostimpact, petersen2018multiscale, morgan2018prestige, rodrigues2016mobility. However, the paradigm of scientific migration may be changing. Traditional hubs of migration have experienced many politically-motivated policy changes that affect scientific mobility, such as travel restrictions in the U.S. and U.K. chinchilla2018travelban. At the same time, other nations, such as China, are growing into major scientific powers and attractors of talent cao2020returning. Unprecedented health crises such as the COVID-19 pandemic threaten to bring drastic global changes to travel and migration by tightening borders and halting travel. With the changing paradigm of global mobility, we need now, more than ever, new tools and approaches to capture and understand human mobility in order to inform sensible, effective, sustainable, and humane policies.

Methods

U.S. flight itinerary data

We source U.S. airport itinerary data from the Origin and Destination Survey (DB1B), provided by the Bureau of Transportation Statistics at the United States Department of Transportation. DB1B is a sample of 10 percent of domestic airline tickets between 1993 and 2020, comprising 307,760,841 passenger itineraries between 828 U.S. airports. Each itinerary is associated with a trajectory of airports including the origin, destination, and intermediary stops.

Korean accommodation reservation data

We source Korean accommodation reservation data from a collaboration with Goodchoice Company LTD. The data contains customer-level reservation trajectories spanning the period of August 2018 through July 2020 and comprising 1,038 unique accommodation locations in Seoul, South Korea.

Scientific mobility data

We source co-affiliation trajectories of authors from the Web of Science database hosted by the Center for Science and Technology Studies at Leiden University. Trajectories are constructed from author affiliations listed on the byline of publications for an author. Given the limitations of author-name disambiguation, we limit our analyses to papers published after 2008, when the Web of Science began providing full names and institutional affiliations caron2014disambiguation that improved disambiguation (see Supporting Information). This yields 33,934,672 author-affiliation combinations representing 12,963,792 authors. Each author-affiliation combination is associated with the publication year and an ID linking it to one of 8,661 disambiguated organizational affiliations (see Supporting Information for more detail). Trajectories are represented as the list of author-affiliation combinations, ordered by year of publication, and randomly ordered for combinations within the same year. The most fine-grained geographic unit in this data is the organization, such as a university, research institute, business, or government agency.

Here, authors are classified as mobile when they have at least two distinct organization IDs in their trajectory, meaning that they have published using two or more distinct affiliations between 2008 and 2019. Under this definition, mobile authors constitute 3,007,192 or 23.2% of all authors and 17,700,095 author-affiliation combinations. Mobile authors were associated with 2.5 distinct organizational affiliations on average. Rates of mobility differ across countries. For example, France, Qatar, the USA, Iraq, and Luxembourg had the highest proportions of mobile authors, whereas Jamaica, Serbia, Bosnia & Herzegovina, and North Macedonia had the lowest (Fig. S2c). However, due to its size, the USA accounted for nearly 40% of all mobile authors worldwide (Fig. S2a), with 10 countries accounting for 80% of all mobility (Fig. S2b). In most cases, countries with a high degree of inter-organization mobility also have a high degree of international mobility, indicating that a high proportion of their total mobility is international (Fig. S2d); however, some countries such as France and the United States seem to have more domestic mobility than international mobility. While the number of publications has increased year-to-year, the mobility and disciplinary makeup of the dataset has not notably changed across the period of study (Fig. S1).

Embedding

We embed trajectories by treating them analogously to sentences and locations analogously to words. For U.S. airport itineraries, trajectories are formed from the flight itineraries of individual passengers, in which airports correspond to unique identifiers. In the case of Korean accommodation reservations, trajectories comprise a sequence of accommodations reserved over a customer’s history. For scientific mobility, an “affiliation trajectory” is constructed for each mobile author by concatenating their ordered list of unique organization identifiers, as demonstrated in Fig. 1a. In more complex cases, such as listing multiple affiliations on the same paper or publishing with different affiliations on multiple publications in the same year, the order is randomized within that year, as shown in Fig. 1b.

These trajectories are used as input to the standard skip-gram negative sampling word embedding, commonly known as word2vec mikolov2013word2vec. word2vec constructs dense and continuous vector representations of words and phrases, in which distance between words corresponds to a notion of semantic distance. By embedding trajectories, we aim to learn a dense vector for every location, for which the distance between vectors relates to the tendency for two locations to occur in similar contexts. Suppose a trajectory, denoted by $(a_1, a_2, \ldots, a_L)$, where $a_t$ is the $t$th location in the trajectory. A location, $a_t$, is considered to have context locations, $a_{t-w}, \ldots, a_{t-1}, a_{t+1}, \ldots, a_{t+w}$, that appear in the window surrounding $a_t$ up to a time lag of $w$, where $w$ is the window size parameter and the window is truncated at the start ($t = 1$) and end ($t = L$) of the trajectory. Then, the model learns the probability $p(a_{t+\tau} \mid a_t)$, where $-w \leq \tau \leq w$ and $\tau \neq 0$, by maximizing its log likelihood given by

$$\mathcal{J} = \sum_{t=1}^{L} \sum_{\substack{-w \leq \tau \leq w \\ \tau \neq 0}} \log p(a_{t+\tau} \mid a_t), \tag{2}$$

where

$$p(j \mid i) = \frac{\exp(u_j \cdot v_i)}{\sum_{k \in \mathcal{A}} \exp(u_k \cdot v_i)}, \tag{3}$$

where $i = a_t$, $j = a_{t+\tau}$, $\mathcal{A}$ is the entire set of unique locations represented in the data, and $v$ and $u$ are the “in-vector” and “out-vector” respectively. We follow the standard practice and only use the in-vector, $v_i$, which is known to be superior to the out-vector in link prediction benchmarks linzhuo2020hyperbolic, tshitoyan2019mat2vec, garg2018gender, kozlowski2018geometry, hamilton2016diachronic, le2014doc2vec, nakandala2017gendered.
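To make the notion of a context window concrete, the following minimal sketch enumerates the (location, context location) pairs that a skip-gram model would see for a single trajectory. The function name `skipgram_pairs` and the toy trajectory are hypothetical and chosen only for illustration; the sketch covers the windowing step, not the training of the vectors themselves.

```python
from typing import List, Tuple


def skipgram_pairs(trajectory: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for one trajectory.

    The window reaches up to `window` steps on either side of the center
    and is truncated at the start and end of the trajectory.
    """
    pairs = []
    for t, center in enumerate(trajectory):
        lo = max(0, t - window)
        hi = min(len(trajectory), t + window + 1)
        for s in range(lo, hi):
            if s != t:  # a location is not its own context
                pairs.append((center, trajectory[s]))
    return pairs


# Hypothetical affiliation trajectory with four entries.
example = ["A", "B", "C", "A"]
pairs = skipgram_pairs(example, window=1)
```

With a window of one, each location is paired only with its immediate neighbors in the trajectory, so locations that are frequently adjacent end up being trained toward similar vectors.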

We used the word2vec implementation in the Python package gensim. The skip-gram negative sampling word2vec model has several tunable hyper-parameters, including the embedding dimension $d$, the size of the context window $w$, the minimum frequency threshold $f_{\min}$, the initial learning rate $\eta$, and the number of iterations. For the main results regarding scientific mobility, we used the values of $d$ and $w$ that best explained the flux between locations, though results were robust across different settings (Fig. S5). We use the same settings for the U.S. airport itinerary and Korean accommodation reservation data. To mitigate the effect of less common locations, we set $f_{\min} = 50$, limiting the embedding to locations appearing at least 50 times across the training trajectories; 744 unique airports for U.S. airport itineraries, 1,004 unique accommodations for the Korean accommodation reservation data, and 6,580 unique organizations for scientific mobility appear in the resulting embedding. We set $\eta$ to its default value of 0.025 and iterate five times over all training trajectories. For scientific mobility, across each training iteration, the order of organizations within a single year is randomized to remove unclear sequential order.

Distance

We calculate $T_{ij}$ as the total number of co-occurrences between two locations $i$ and $j$ across the dataset. In scientific mobility, $T_{ij} = 10$ indicates that organizations $i$ and $j$ co-occurred 10 times between 2008 and 2019, as evidenced from their publications. Here, we treat $T_{ij} = T_{ji}$ for the sake of simplicity and, in the case of scientific mobility, because directionality cannot easily be derived from bibliometric records, or may not be particularly informative (see Supporting Information).
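A symmetric co-occurrence count of this kind can be sketched as follows. The counting rule used here, one co-occurrence per unordered pair of distinct locations per trajectory, is an assumption made for illustration; the text does not spell out the exact rule, and the names `cooccurrence_flux` and `get_flux` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_flux(trajectories):
    """Count, for each unordered pair of distinct locations, the number of
    trajectories in which both appear (an assumed counting rule)."""
    flux = Counter()
    for traj in trajectories:
        # sorted(set(...)) gives each unordered pair one canonical key
        for i, j in combinations(sorted(set(traj)), 2):
            flux[(i, j)] += 1
    return flux


def get_flux(flux, i, j):
    """Symmetric lookup: the flux from i to j equals the flux from j to i."""
    return flux[tuple(sorted((i, j)))]


# Hypothetical toy trajectories.
trajectories = [["A", "B"], ["B", "A", "C"], ["A", "A", "B"]]
flux = cooccurrence_flux(trajectories)
```

Storing each pair under a sorted key enforces the symmetry directly, so no direction ever needs to be inferred from the records.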

We calculate two forms of distance between locations. The geographic distance, $d^{\mathrm{geo}}_{ij}$, is the pairwise geographic distance between locations, calculated as the great-circle distance, in kilometers, between pairs of locations. In the case of U.S. flight itineraries and scientific mobility, we impute the distance to 1 km when it is less than one kilometer. In the case of the Korean accommodation reservation data, because these trajectories describe intra-city mobility, we impute the distance to 0.01 km when it is less than 0.01 km. The embedding distance, $d^{\mathrm{emb}}_{ij}$, is the cosine distance, calculated as $d^{\mathrm{emb}}_{ij} = 1 - \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$, where $v_i$ and $v_j$ are the embedding vectors for locations $i$ and $j$, respectively. Note that $d^{\mathrm{emb}}_{ij}$ is not a formal metric because it does not satisfy the triangle inequality. Nevertheless, cosine distance is often shown to be useful in practice lerman2007embedding, brown1970migration, kim2018functional.
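Both distances are simple to compute. The sketch below uses the haversine formula for the great-circle distance with a lower bound on short distances, together with a plain cosine distance. The Earth radius constant, the function names, and the toy vectors are assumptions for illustration and are not taken from the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2, floor_km=1.0):
    """Haversine great-circle distance in km, imputed to `floor_km` when shorter."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    dist = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
    return max(dist, floor_km)


def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors showing that cosine distance can violate the triangle inequality:
# the direct distance x -> z exceeds the detour x -> y -> z.
x, y, z = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
direct = cosine_distance(x, z)
detour = cosine_distance(x, y) + cosine_distance(y, z)
```

For the intra-city case, the same function could be called with `floor_km=0.01`.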

Gravity Law

We model the co-occurrences $T_{ij}$ for locations $i$ and $j$ (referred to as flux) using the gravity law of mobility zipf1946gravity. The gravity law of mobility, which was inspired by Newton’s law of gravity, postulates that attraction between two locations is a function of their population and the distance between them. This formulation and its variants have proven useful for modeling and predicting many kinds of mobility jung2008highwaygravity, curiel2018citygravity, truscott2012epidemicgravity, hong2016busgravity. In the gravity law of mobility, the expected flux, $\hat{T}_{ij}$, between two locations $i$ and $j$ is defined as

$$\hat{T}_{ij} = C \, m_i m_j \, f(d_{ij}), \tag{4}$$

where $m_i$ and $m_j$ are the populations of the locations, defined as the total number of passengers who passed through each airport for U.S. airport itineraries, the total number of customers who booked with each accommodation for Korean accommodation reservations, and the yearly-average count of unique authors, both mobile and non-mobile, affiliated with each organization for scientific mobility. $f(d_{ij})$ is a decay function of the distance between locations $i$ and $j$. There are two popular forms for $f(d_{ij})$: one is a power-law function of the form $f(d_{ij}) = d_{ij}^{-\alpha}$, and the other is an exponential function of the form $f(d_{ij}) = e^{-\beta d_{ij}}$ chen2015distance. The parameters of $f(d_{ij})$ and $C$ are fit to given mobility data using a log-linear regression jung2008highwaygravity, curiel2018citygravity, truscott2012epidemicgravity, hong2016busgravity, simini2012universal.

We consider separate variants of $f(d_{ij})$ for the geographic distance, $d^{\mathrm{geo}}_{ij}$, and the embedding distance, $d^{\mathrm{emb}}_{ij}$, and report the best-fit model for each distance. For the geographic distance, we use the power-law function of the gravity law, $f(d^{\mathrm{geo}}_{ij}) = (d^{\mathrm{geo}}_{ij})^{-\alpha}$ (Eq. 5). For the embedding distance, we use the exponential function, $f(d^{\mathrm{emb}}_{ij}) = e^{-\beta d^{\mathrm{emb}}_{ij}}$ (Eq. 6).

$$\ln \frac{T_{ij}}{m_i m_j} = \ln C - \alpha \ln d^{\mathrm{geo}}_{ij} \tag{5}$$

$$\ln \frac{T_{ij}}{m_i m_j} = \ln C - \beta \, d^{\mathrm{emb}}_{ij} \tag{6}$$

where $T_{ij}$ is the actual flow from the data. The gravity law of mobility is sensitive to $T_{ij} = 0$, or zero movement between locations. In our dataset, non-zero flows account for only 4.2% of all possible pairs of the 6,580 organizations for scientific mobility, compared with 76.4% of all possible pairs of the 744 airports for U.S. airport itineraries and 62.5% of all possible pairs of the 1,004 accommodations for the Korean accommodation reservation data. This value is comparable to other common applications of the gravity law, such as phone calls, commuting, and migration simini2012universal. We follow standard practice and exclude zero flows from our analysis.
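The log-linear fit described above can be sketched with ordinary least squares on a single predictor. The code below is an illustrative reimplementation under stated assumptions, not the authors' code: the helper names `fit_line` and `fit_gravity` and the synthetic flows are hypothetical, and zero flows are dropped before taking logarithms.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_gravity(flows, kind):
    """Fit the gravity law to rows of (flux, m_i, m_j, distance).

    kind="power" regresses on log distance, kind="exp" on raw distance.
    Returns (C, decay), where decay is the fitted exponent or rate.
    """
    rows = [r for r in flows if r[0] > 0]  # exclude zero flows
    y = [math.log(flux / (mi * mj)) for flux, mi, mj, _ in rows]
    if kind == "power":
        x = [math.log(d) for _, _, _, d in rows]
    elif kind == "exp":
        x = [d for _, _, _, d in rows]
    else:
        raise ValueError("kind must be 'power' or 'exp'")
    intercept, slope = fit_line(x, y)
    return math.exp(intercept), -slope


# Synthetic flows generated from a known power-law gravity model.
true_C, true_alpha = 0.5, 2.0
sizes_and_distances = [(10, 20, 1.0), (5, 8, 2.0), (30, 4, 4.0), (7, 7, 8.0)]
synthetic = [(true_C * mi * mj * d ** (-true_alpha), mi, mj, d)
             for mi, mj, d in sizes_and_distances]
synthetic.append((0.0, 3, 3, 5.0))  # a zero flow, ignored by the fit
C_hat, alpha_hat = fit_gravity(synthetic, "power")
```

On noiseless synthetic data the fit recovers the generating constant and exponent, which is a convenient sanity check before applying the same procedure to real flows.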

SemAxis

SemAxis and similar studies an2018semaxis, nakandala2017gendered, kozlowski2018geometry demonstrated that “semantic axes” can be found from an embedding space by defining the “poles” and the latent semantic relationship along the semantic axis can be extracted with simple arithmetic. In the case of natural language, the poles of the axis could be “good” and “bad”, “surprising” and “unsurprising”, or “masculine” and “feminine”. We can use SemAxis to leverage the semantic properties of the embedding vectors to operationalize abstract relationships between organizations.

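Let $S^{+} = \{v^{+}_1, \ldots, v^{+}_n\}$ and $S^{-} = \{v^{-}_1, \ldots, v^{-}_m\}$ be the sets of positive and negative pole organization vectors respectively. Then, the average vectors of each set can be calculated as $V^{+} = \frac{1}{n} \sum_{k=1}^{n} v^{+}_k$ and $V^{-} = \frac{1}{m} \sum_{k=1}^{m} v^{-}_k$. From these average vectors of each set of poles, the semantic axis is defined as $V_{\mathrm{axis}} = V^{+} - V^{-}$. Then, a score of organization $a$ is calculated as the cosine similarity of the organization’s vector $v_a$ with the axis,

$$\mathrm{score}(a) = \frac{v_a \cdot V_{\mathrm{axis}}}{\lVert v_a \rVert \, \lVert V_{\mathrm{axis}} \rVert}, \tag{7}$$

where a higher score for organization $a$ indicates that $a$ is more closely aligned to $V^{+}$ than $V^{-}$.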
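The axis construction amounts to a few lines of vector arithmetic. The following sketch assumes small toy vectors in place of real organization embeddings; the function names and the pole sets are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semaxis_score(vec, positive_pole, negative_pole):
    """Cosine similarity between `vec` and the axis (positive mean minus negative mean)."""
    v_plus = mean_vector(positive_pole)
    v_minus = mean_vector(negative_pole)
    axis = [p - m for p, m in zip(v_plus, v_minus)]
    return cosine_similarity(vec, axis)


# Toy two-dimensional poles; real embedding vectors are much higher-dimensional.
positive = [[1.0, 0.0], [0.8, 0.2]]
negative = [[0.0, 1.0], [0.2, 0.8]]
score_near_positive = semaxis_score([1.0, 0.0], positive, negative)
score_near_negative = semaxis_score([0.0, 1.0], positive, negative)
score_neutral = semaxis_score([1.0, 1.0], positive, negative)
```

Scores lie between -1 and 1; a vector equally aligned with both poles scores close to zero.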

We define two axes to capture geography and academic prestige, respectively. The poles of the geographic axis are defined as the mean vector of all vectors corresponding to organizations in California, and the mean vector of all vectors of organizations in Massachusetts. For the prestige axis, we define a subset of top-ranked universities according to either the Times World University Ranking or the mean normalized research impact sourced from the Leiden Ranking. The other end of the prestige axis is the geographically-matched (according to census region) set of universities ranked at the bottom of these rankings. For example, if 20 top-ranked universities are selected and six of them are in the Northeastern U.S., then the bottom twenty will be chosen to also include six from the Northeastern U.S. From the prestige axis, we derive a ranking of universities that we then compare to other formal university rankings using Spearman rank correlation.

Acknowledgments

We thank the Center for Science and Technology Studies at Leiden University for managing and making available the dataset of scientific mobility. We also thank the Goodchoice Company LTD. for making available the dataset of Korean accommodation reservation data. For their comments, we thank Guillaume Cabanac, Cassidy R. Sugimoto, Vincent Lariviére, Alessandro Flammini, Filippo Menczer, Lili Miao, Xiaoran Yan, and Inho Hong. This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-19-1-0391. Rodrigo Costas is partially funded by the South African DST-NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy (SciSTIP).

Author Contributions

All authors contributed extensively to the work presented in this paper. D. M. and J.Y. were involved in all stages of conceptualization, analysis, and writing, S.K. developed the theoretical framework, R.C. assembled input data, and W.J., S.M., and Y.A. contributed to conceptualization. All authors discussed the results and commented on the manuscript at all stages.

Data Availability

The U.S. airline itinerary dataset can be found at https://www.transtats.bts.gov/DataIndex.asp. The raw Korean accommodation reservation dataset, due to privacy concerns, cannot be shared publicly. Due to its proprietary nature, the global scientific mobility dataset, sourced from the Web of Science, cannot be provided; however, metadata and trained neural embeddings have been published at https://doi.org/10.6084/m9.figshare.13072790.v1

Code Availability

Code used in this analysis can be found at https://github.com/murrayds/sci-mobility-emb

Supporting Information

S1 Text

Mobility and science.

As scholars move, they bring their knowledge, skills, and social connections with them; collectively, the movements of researchers shape the structure and direction of the global scientific enterprise. For example, prestige-driven mobility between doctoral-granting and employing institutions is highly unequal clauset2015hierarchy, deville2014career, which affects the diffusion of ideas across academia morgan2018prestige. By placing researchers in new social settings, mobility can lead to the formation of new collaborative relationships rodrigues2016mobility, which in turn spurs the further diffusion of knowledge and innovations braunerhjelm2020labor, azoulay2011diffusion, kaiser2018innovation, armano2017innovation. Perhaps resulting from the selection effects of who gets to move, or the reconfiguring of social and epistemic networks, movement is associated with increased scientific impact sugimoto2017mostimpact, petersen2018multiscale, jonkers2013return, franzoni2014advantage. At the national level, the understanding of mobility has progressed beyond simplistic narratives of brain drain and brain gain, and instead adopts a new perspective of flows of talent meyer2001network, ioannidis2014braindrain, gaillard1998circulation. Under this flow model, a mobile researcher is viewed as contributing to both their origin and destination countries, a perspective that is evidenced by the strong science of open countries wagner2017open. Perhaps because of these individual and national benefits, policy-makers have come to recognize the importance of global mobility box2008competition, oecd2010innovation. Movement is a key mechanism that has clear impacts on the composition and direction of the global scientific workforce and our collective scientific understanding. Understanding the structure and dynamics of mobility is thus essential for understanding global science.

S2 Text

Modeling scientific mobility.

There are many ways of modeling scientific mobility from bibliographic data, the first consideration being the unit of analysis. Most studies of mobility have focused on country-level mobility, the flows of researchers across nations sugimoto2017mostimpact, scellato2015migrant, robinson-garcia2018indicators, franzoni2012foreign-born. Practically, country-level analyses benefit from higher reliability, such that idiosyncrasies and errors inherent to bibliographic databases are mitigated by this higher level of aggregation. Epistemically, country-level analysis is useful for national science governance bodies that aim to understand the status of their country in the global landscape and make informed policy decisions. Analyses at lower levels of aggregation are far less common. Regional-level scientific mobility, the flow of researchers between regions or cities within or across countries, has been only minimally studied vaccario2019mobility, possibly due to a lack of reliable long-term data and a lack of policy relevance to national-level lawmakers. Organization-level mobility has the potential to inform institutional policy and to help understand the composition of mobility within a single country or region, especially as it relates to organization performance, prestige, and inequality albarran2017topeconomic, deville2014career, morgan2018prestige, clauset2015hierarchy. However, affiliation disambiguation and noise in bibliometric data make large-scale organization-level analysis challenging. Here, we learn neural-network embeddings of scientific mobility at the level of organizations using a curated bibliographic database. These embeddings are robust to noise, and so are capable of representing clear structure even amid issues with organizational disambiguation. In doing so, embeddings also capture a more detailed picture of mobility than has been previously studied.

Another consideration when analyzing scientific mobility is what kinds of mobility to study. Typical understandings of mobility are directional: movement is always from one place to another. However, scientific mobility is more complicated. For example, scientists often hold multiple affiliations at a time markova2016synchronous, listing them as co-affiliations on a single paper, or even choosing a subset of affiliations to use for multiple simultaneous projects robinson2019mobility. Even clearly directional migration to another institution is complex: researchers may continue to publish with an old affiliation for projects that began before their move, and they may maintain social and organizational links to their old institution (e.g., collaborators, projects, graduate students) such that there is no clear break after migrating. There is also a whole range of short-term scientific mobility, such as visiting scholarships and short-term visits, that is only visible through intensive efforts such as manual extraction from CVs woolley2009cv, sandstrom2009cv, canibano2011temporary. Here, we focus on more long-term mobility that can be derived from bibliographic data. Due to the complexity of scientific mobility, we make the simplifying assumption that all scientific mobility is symmetric, or without direction, such that any move from organization $i$ to organization $j$ is equivalent to a move from $j$ to $i$. By assuming non-directional mobility, all mobility events are commensurate, meaning that they can be treated identically in our analysis; this allows us to represent the complexity of mobility without making decisions about the direction of a researcher's mobility or which affiliation is their main one. Moreover, this assumption has the practical advantage of matching the data format expected by the word2vec model, as well as the theoretical advantage of adhering to the symmetry assumption of the gravity model of mobility.

S3 Text

Building affiliation trajectories.

For each mobile researcher who has at least two distinct affiliations, we construct an affiliation trajectory based on the affiliations listed on their published papers indexed in the Web of Science database between 2008 and 2019. An author is considered mobile if they published with at least two distinct affiliations during the time period of study. Affiliation names were manually disambiguated, and each was mapped to a unique organization identifier. An affiliation trajectory for an individual researcher is a sequence of organizations in ascending order of year of publication. For example, if a researcher published papers with affiliation $A$ in year $t_0$, $B$ in $t_1$, $C$ in $t_2$, and $A$ again in $t_3$, then the affiliation trajectory is expressed as $(A, B, C, A)$.

In the case that an individual lists multiple affiliations in a single year, affiliations listed on publications published in that year are shuffled between each iteration of the word2vec training process (each epoch). For example, an author who published with affiliation $A$ in $t_0$, and affiliations $B$ and $C$ in $t_1$, could appear as one of $(A, B, C)$ or $(A, C, B)$ in each training iteration. This effectively removes the effect of order within a year, as the order cannot be meaningfully established based on co-affiliations in a single paper, or on different affiliations listed on separate papers, for which the date of publication may not be representative of the actual completion of the project.
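The within-year shuffling can be sketched as a function that rebuilds a trajectory from (year, organization) records at the start of every epoch. The record format, the function name `build_trajectory`, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_trajectory(records, rng):
    """Order organizations by year and shuffle those that share a year.

    `records` is an iterable of (year, organization_id) pairs for one author.
    """
    by_year = defaultdict(list)
    for year, org in records:
        by_year[year].append(org)
    trajectory = []
    for year in sorted(by_year):
        orgs = list(by_year[year])
        rng.shuffle(orgs)  # within-year order carries no information
        trajectory.extend(orgs)
    return trajectory


# Hypothetical author: one affiliation in 2010, two in 2011.
records = [(2010, "A"), (2011, "B"), (2011, "C")]
rng = random.Random(0)  # seeded only to make the sketch reproducible
epoch_trajectories = [build_trajectory(records, rng) for _ in range(5)]
```

Calling the function once per epoch means the model sees a fresh within-year ordering each time, while the order across years is always preserved.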

Other than restricting to only mobile researchers, we do not perform any filtering or reductions to affiliation trajectories. In the case that an author publishes with organization $A$ four times in $t_0$, and affiliation $B$ two times in $t_1$, then their trajectory will be $(A, A, A, A, B, B)$. Although mobile authors who publish more papers will have longer trajectories, word2vec will skip duplicate consecutive organization IDs, mitigating the impact of long repetitive trajectories.

S4 Text

Network-based personalized page rank distances.

We examine the gravity model on the Personalized Page Rank (PPR) jeh2003scaling as a network-based benchmark. We construct the co-occurrence network of organizations, in which nodes are organizations and each edge between organizations $i$ and $j$ represents a co-occurrence of $i$ and $j$ in the same affiliation trajectory, with weight given by the sum of the co-occurrences over all researchers. The Personalized Page Rank is a ranking algorithm for nodes based on a random walk process on networks. In one step, the walker visiting a node moves to a neighboring node chosen randomly with a probability proportional to the weight of the edge. Furthermore, with probability $r$, the walker is teleported back to the starting node. The rank of a node is determined by the probability that the walker visits the node in the stationary state. The stationary distribution of the random walker starting from node $i$, denoted by $p_i$, is given by

$$p_i = (1 - r) \, \tilde{W} p_i + r \, e_i, \tag{8}$$

where $e_i$ is a column vector of length $N$ with entries that are all zero except the $i$th entry, which equals one, $W$ is the weighted adjacency matrix, and $\tilde{W}$ is $W$ with each column normalized to sum to one. We used a fixed value of $r$ here.
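The random walk with restart described above can be approximated by power iteration. The sketch below assumes a symmetric weighted network stored as a dictionary of dictionaries in which every node has at least one edge; the restart probability of 0.15 is a placeholder chosen for illustration, not the value used in the analysis, and the function name is hypothetical.

```python
def personalized_pagerank(weights, start, restart, n_iter=200):
    """Approximate the stationary distribution of a random walk that
    returns to `start` with probability `restart` at every step.

    `weights[u][v]` is the weight of the edge between u and v; every node
    is assumed to have at least one edge.
    """
    nodes = list(weights)
    strength = {u: sum(weights[u].values()) for u in nodes}
    p = {u: 1.0 if u == start else 0.0 for u in nodes}
    for _ in range(n_iter):
        nxt = {u: (restart if u == start else 0.0) for u in nodes}
        for u in nodes:
            for v, w in weights[u].items():
                # move to neighbor v with probability proportional to edge weight
                nxt[v] += (1.0 - restart) * p[u] * w / strength[u]
        p = nxt
    return p


# Toy co-occurrence network: A-B has weight 2, A-C has weight 1.
toy_weights = {
    "A": {"B": 2.0, "C": 1.0},
    "B": {"A": 2.0},
    "C": {"A": 1.0},
}
ppr_from_a = personalized_pagerank(toy_weights, start="A", restart=0.15)
```

Running the function once per starting node yields one probability vector per organization, which can then be compared pairwise.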

We can think of $p_i$ as a representation vector of organization $i$, and calculate the distance between organizations $i$ and $j$ by measuring the distance between $p_i$ and $p_j$ to examine the gravity law. We consider two distance measures in this analysis. The first is the cosine distance used for our embedding method, $d^{\mathrm{cos}}_{ij} = 1 - \frac{p_i \cdot p_j}{\lVert p_i \rVert \, \lVert p_j \rVert}$. Also, if we think of $p_i$ as a discrete probability distribution, then we can consider the Jensen–Shannon divergence (JSD), which can be written as

$$d^{\mathrm{JSD}}_{ij} = \frac{1}{2} D_{\mathrm{KL}}(p_i \,\Vert\, m) + \frac{1}{2} D_{\mathrm{KL}}(p_j \,\Vert\, m), \tag{9}$$

$$D_{\mathrm{KL}}(p \,\Vert\, q) = \sum_{k} p(k) \log \frac{p(k)}{q(k)}, \tag{10}$$

where $m = \frac{1}{2}(p_i + p_j)$. We report the results with the cosine distance ($d^{\mathrm{cos}}_{ij}$, Fig. S11) and the Jensen–Shannon divergence ($d^{\mathrm{JSD}}_{ij}$, Fig. S12). In both cases, performance is below that of the model with geographic distance. Even though the PPR vectors are much longer than our embedding vectors, the embedding distance outperforms both.
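The Jensen–Shannon divergence between two discrete distributions follows directly from its definition as an average of two Kullback–Leibler divergences to the midpoint distribution. This is a minimal sketch using natural logarithms; the function names and toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), skipping zero-probability terms of p."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def jensen_shannon_divergence(p, q):
    """Average KL divergence of p and q to their midpoint distribution."""
    m = [0.5 * (pk + qk) for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy distributions over three nodes.
p_i = [0.7, 0.2, 0.1]
p_j = [0.1, 0.2, 0.7]
jsd_ij = jensen_shannon_divergence(p_i, p_j)
```

Because the midpoint is nonzero wherever either input is nonzero, the divergence is always finite, and with natural logarithms it is bounded above by ln 2.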

S5 Text

Organization disambiguation and metadata.

Affiliations were mapped to one of 8,661 organizations, disambiguated following the procedure originally designed for the Leiden Ranking of World Universities waltman2012leidenrankings. Organizational records were associated with a full name, a type indicating the sector (e.g., University, Government, Industry), and an identifier for the country and city of the organization. Sixteen different sector types were included in the analysis, which we aggregated to four high-level codes: University, Hospital, Government, and Other. Each record was also associated with a latitude and longitude. However, for many organizations, these geographic coordinates were missing or incorrect. We manually updated the coordinates of 2,267 organizations by searching the institution name and city on Google Maps; in cases where a precise location of the organization could not be identified, we used the coordinates returned when searching the name of the city. The data was further enriched with country-level information, including region, most widely-spoken language, and its language family (e.g., the language family of Spanish is Italic). State/province-level information was added using the reverse geocoding service LocationIQ, with each organization’s latitude and longitude as input. Regional census classifications were added for states in the United States. For each organization, we calculated size as the average number of unique authors (mobile and non-mobile) who published with that organization across each year of our dataset; in the case that authors publish with multiple affiliations in a single year, they are counted towards each.

As a result of our disambiguation procedure, some affiliations are mapped to two organizations, one specific, and one more general. For example, any author affiliated with “Indiana University Bloomington” will also be listed as being affiliated with the “Indiana University System”, a more general designation for all public universities in Indiana. However, a more general organization may not always occur alongside the more specific one. For example, a researcher affiliated with the smaller regional school “Indiana University South Bend” will be listed as affiliated with only the “Indiana University System”. We identify all specific organizations that always co-occur along with a more general one. For every career trajectory that includes one of these specific organizations, we remove all occurrences of the more general organization; trajectories containing only a general designation are not altered.
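The final filtering step can be sketched as follows, assuming that the mapping from each specific organization to its always-co-occurring general designation has already been identified. The function name, the mapping, and the short organization labels are hypothetical.

```python
def drop_redundant_general(trajectory, specific_to_general):
    """Remove a general organization from a trajectory whenever one of its
    specific organizations also appears; other trajectories are unchanged."""
    generals_to_drop = {
        specific_to_general[org] for org in trajectory if org in specific_to_general
    }
    return [org for org in trajectory if org not in generals_to_drop]


# Hypothetical mapping from a specific organization to its general designation.
specific_to_general = {"IU Bloomington": "IU System"}

with_specific = drop_redundant_general(
    ["IU Bloomington", "IU System", "Purdue"], specific_to_general
)
general_only = drop_redundant_general(["IU System", "Purdue"], specific_to_general)
```

A trajectory that contains only the general designation keeps it, which matches the case of researchers at smaller campuses that are recorded only under the system-level name.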

S6 Text

Author name disambiguation.

Author-name disambiguation, the problem of associating names on papers with individual authors, remains a difficulty in the use of bibliometric data dangelo2020disambiguation. Authors in our dataset have been disambiguated using a rule-based algorithm that makes use of author and paper metadata, such as physical addresses, co-authors, and journal, to score papers on the likelihood of belonging to an author cluster, that is, a cluster of publications believed to have been authored by the same individual caron2014disambiguation. We limit our analysis to the period of 2008 to 2019, as in 2008 the Web of Science began indexing additional author-level metadata such as full names and email addresses. The disambiguation algorithm is conservative, favoring splitting clusters over merging. Past studies have validated this data and shown that the disambiguated authors are comparable to ground-truth records such as those from ORCID and useful for a wide range of bibliometric studies sugimoto2017mostimpact, robinson2019mobility, chinchilla2018global, chinchilla2018travelban.

S7 Text

Reconstructing Times ranking with network measure.

The performance of the embedding ranking in reconstructing the Times ranking is comparable to that of network-derived measures such as degree strength (Spearman’s $\rho$, Fig. S19a) and eigenvector centrality (Spearman’s $\rho$, Fig. S19b). However, while both embedding- and network-based measures relate to university prestige, they are qualitatively and quantitatively different. The embedding ranking of U.S. universities is less correlated with degree strength (Spearman’s $\rho$, Fig. S20a) and eigenvector centrality (Spearman’s $\rho$) than with the Times ranking itself (Spearman’s $\rho$, Fig. S20b). The embedding ranking over-ranks large research-intensive universities such as North Carolina State University, University of Florida, and Texas A&M University, whereas the network-derived ranking over-ranks smaller, more specialized universities such as Brandeis University, Yeshiva University, and University of San Francisco. This suggests that the embedding encodes information on prestige hierarchy at least as well as a network representation, with some noticeable qualitative differences.

S8 Text

Speculation on variations of the convex-curve pattern.

The convex-curve pattern observed in Fig. 6 repeats across many countries, with variations. For example, the representative vector of Chinese organizations has a larger norm than that of the U.S. ( vs , Table S2), causing its curve to be shifted upwards with a larger peak vector norm; this may reflect a tendency for organizations in the U.S. to appear more frequently in different contexts than Chinese organizations. Other nations such as Poland, Iran, and Turkey show a linear relationship between an organization’s number of researchers and the vector norm, indicating that their largest organizations belong to very specific contexts (Fig. S24). The organization-level distribution of vector norms reveals deeper heterogeneity. The distribution of the vector norms for the U.S. is relatively skewed, suggesting their large norm is driven by a small and tight community of organizations (skew, Fig. S25). Germany and the U.K. have representative vector norms comparable to that of the U.S. ( and , respectively), with lower skewness (skew and skew), suggesting a tighter community of organizations. The vector norms of organizations in some countries are even more skewed, such as in Iran (, skew) and China (, skew), indicating a strong difference between their most- and least-connected organizations. For some countries, the vector norms of organizations are positively skewed, though seemingly for different reasons. For example, Austria has a balanced distribution of organization vector norms, suggesting a diverse range of organizations with most being well connected (, ); Russia, in contrast, has a number of organization vectors of moderate norms, but also several isolated organizations with large vector norms (, ).
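To make the quantities in this discussion concrete, here is a small sketch that computes L2 norms and the skewness of a set of norms. Two assumptions are made purely for illustration: the representative vector is taken to be the mean of a country's organization vectors (this section does not restate its definition), and skewness is the moment-based coefficient, which may differ from the estimator behind the reported values. All vectors are toy data.

```python
import math


def l2_norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def mean_vector(vectors):
    """Component-wise mean; an ASSUMED stand-in for a representative vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Toy example: vectors pointing in many directions average to a short vector,
# while aligned vectors keep a long mean vector.
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
aligned = [[1.0, 0.0], [1.0, 0.0]]
norm_diverse = l2_norm(mean_vector(diverse))
norm_aligned = l2_norm(mean_vector(aligned))

# Toy organization-level norms: one large outlier gives positive skewness.
org_norms = [1.0, 1.0, 1.0, 5.0]
skew = skewness(org_norms)
```

Under the mean-vector assumption, a smaller norm for the averaged vector is what one would expect when the underlying organization vectors point in more varied directions.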

Short | Full | Short | Full
Stanford | Stanford Univ | Northwestern | Northwestern Univ
Columbia | Columbia Univ | Ball State | Ball State Univ
Harvard | Harvard Univ | IU Bloomington | Indiana Univ, Bloomington
UCLA | Univ of California, Los Angeles | Stevens Institute | Stevens Institute of Technology
Cal State Long Beach | California State Univ, Long Beach | NJIT | New Jersey Institute of Technology
Wright State | Wright State Univ | NYU | New York Univ
U Toledo | Univ of Toledo | SUNY Albany | Univ at Albany, The State Univ of New York
Boston U | Boston Univ | NY Medical College | New York Medical College
Suffolk | Suffolk Univ | Miami University | Miami Univ
CUNY | City Univ of New York (CUNY) | IU Pennsylvania | Indiana Univ of Pennsylvania
U Arizona | Univ of Arizona | Baylor | Baylor College of Medicine
OSU | Ohio State Univ | UT Health Center | Univ of Texas Health Science Center
MIT | Massachusetts Institute of Technology | Bard College | Bard College
Princeton | Princeton Univ | Stonehill College | Stonehill College
GCU | Grand Canyon Univ | Carleton College | Carleton College
Northcentral | Northcentral Univ | Hanover College | Hanover College
UCSF | Univ of California, San Francisco | Queens College | Queens College
Fielding | Fielding Graduate Univ | DePauw | DePauw College
Pepperdine | Pepperdine Univ | Naval Academy | United States Naval Academy
Argosy | Argosy Univ | Cal State San Marcos | California State Univ San Marcos
Yale | Yale Univ | Broad Inst | Broad Institute
U Hartford | Univ of Hartford | Forsyth Inst | Forsyth Institute
FAU | Florida Atlantic Univ | U Alaska Museum | Univ of Alaska Museum of the North
U Miami | Univ of Miami | Lawrence Berkeley | Lawrence Berkeley Natl Laboratory
UWF | The Univ of West Florida | Allen Institute | Allen Institute for Brain Science
FIT | Florida Institute of Technology | RTI International | RTI International
Purdue | Purdue Univ, West Lafayette | Fermilab | Fermilab
Notre Dame | Univ of Notre Dame | State of NY | State of New York
Indiana State | Indiana State Univ | Mayo Clinic | Mayo Clinic
Saint Mary’s | Saint Mary’s College | Fish and Wildlife | Fish and Wildlife Research Institute
Tufts | Tufts Univ | EPA | United States Environmental Protection Agency
Mattel | Mattel Children’s Hospital | US Army | United States Army
Clark | Clark Univ | NSF | Natl Science Foundation
UMass Amherst | Univ of Massachusetts Amherst | US Navy | United States Navy
Montclair | Montclair State Univ | US Air Force | United States Air Force
Farleigh Dickinson | Fairleigh Dickinson Univ-Metro Campus | Ames Laboratory | Ames Laboratory
Rockefeller | Rockefeller Univ | Olin College | Olin College of Engineering
Adelphi | Adelphi Univ | Scrips Institute | Scripps Institute
Barnard | Barnard College | Idaho Natl Lab | Idaho Natl Laboratory
Saint John Fisher | Saint John Fisher College | Dana Faber | Dana-Farber Cancer Institute
U Penn | Univ of Pennsylvania | Dept of Agriculture | United States Department of Agriculture
Villanova | Villanova Univ | DOE | United States Department of Energy
Widener | Widener Univ-Main Campus | NIAMS | Natl Institute of Arthritis, Skin Diseases
Robert Morris | Robert Morris Univ | JMI Labs | JMI Laboratories
U Cincinnati | Univ of Cincinnati | Whitehead Inst | Whitehead Institute of Biomedical Research
Case Western | Case Western Reserve Univ | Wellesley | Wellesley Univ
Ashland | Ashland Univ | UT Health, San Antonio | Univ of Texas Health Science Center, San Antonio
Texas A&M | Texas A&M Univ-Commerce | UNT | Univ of North Texas
Texas Southern | Texas Southern Univ | UT Southwestern Med | Univ of Texas Southwestern Medical Center
Baylor | Univ of Mary Hardin-Baylor | UT El Paso | Univ of Texas, El Paso
U Washington | Univ of Washington - Seattle | USF | Univ of South Florida, Tampa
Washington State | Washington State Univ | Florida A&M | Florida Agricultural and Mechanical Univ
Seattle Pacific | Seattle Pacific Univ | Barry | Barry Univ
Cal State Fresno | California State Univ-Fresno | UMass Dartmouth | Univ of Massachusetts Dartmouth
Northern Arizona | Northern Arizona Univ | Worcester Poly | Worcester Polytechnic Institute
IUPUI | Indiana Univ - Purdue Univ Indianapolis | Umass Boston | Univ of Massachusetts Boston
U Dayton | Univ of Dayton | MGH Inst | MGH Institute of Health Professions
U Conn | Univ of Connecticut | Joseph W. Jones Center | Joseph W. Jones Ecological Research Center
ASU | Arizona State Univ | Vaccine Research Center | Vaccine Research Center, San Diego
U Florida | Univ of Florida | LA Ag Center | Louisiana Agricultural Center
Northern Illinois | Northern Illinois Univ | FL Fish and Wildlife | Florida Fish and Wildlife Conservation Commission
Concordia Chicago | Concordia Univ-Chicago | NHLBI | Natl Heart, Lung, and Blood Institute
U Chicago | Univ of Chicago | NY Dept. of Health | New York Department of Health
SIU Edwardsville | Southern Illinois Univ, Edwardsville | St Michaels | Saint Michaels College
SIU Carbondale | Southern Illinois Univ, Carbondale | |

Table S1: Full organization names
Country L2 Norm # Organizations
United States 2.39 1281
Germany 2.6 485
United Kingdom 2.61 514
Austria 2.64 74
France 2.83 688
Belgium 2.84 84
Switzerland 2.85 66
Spain 2.94 322
China 2.97 497
India 2.99 114
Poland 3.02 145
Canada 3.02 147
Italy 3.04 386
Russia 3.08 187
Norway 3.1 122
Netherlands 3.11 136
Sweden 3.16 75
Brazil 3.16 286
Finland 3.17 66
Denmark 3.21 54
Czech Republic 3.23 97
Greece 3.24 62
Australia 3.24 90
Turkey 3.28 99
South Korea 3.28 156
Israel 3.32 71
Portugal 3.33 57
Japan 3.35 465
Iran 3.57 68
Taiwan 3.67 72
Table S2: L2 norm of each country’s representative vector. Shown for the 30 countries with the most unique mobile and non-mobile researchers.
Figure S1: Publications over time. a. The number of papers published by mobile authors has been steadily increasing from 2008 to 2017, with a small decrease in 2018, which may be due to an artifact of the Web of Science indexing process. Lines correspond to publications by mobile authors, by authors with affiliations in at least two cities, at least two regions, and at least two countries. We did not find major changes in the publication patterns of mobile authors during this time period. b. Lines correspond to the proportion of publications classified as Biology and Health (black), Physics and Engineering (purple), Life and Earth Science (magenta), Social Science and Humanities (orange), and Math and Computer Science (yellow). The rate of publication in Biology and Health has leveled off since about 2013, whereas the rate of publication in other fields has steadily increased. c. While the absolute count of publications has increased, the percentage of mobile scholars, and those with affiliations in at least two cities, regions, or countries, as a proportion of all publications, has decreased over time. d. The proportion of authors’ publications across fields has largely remained steady. Biology and Health Science has comprised the majority of publications across nearly all years but has steadily declined in proportion. However, the proportion of Social Science and Humanities publications has been steadily increasing.
Figure S2: Extent and nature of mobility by country. a. The proportion of all mobile researchers contributed by each country. Over 30% of all mobile researchers have been affiliated with organizations in the U.S. during the period of study. b. Cumulative distribution of data shown in (a). The U.S., China, France, the U.K., and Germany account for about 70% of all mobile researchers. c. The proportion of each country’s researchers who are mobile. The dashed line indicates the proportion of all researchers in the data who are mobile. France, followed by Qatar and the U.S., has the highest proportion of mobile researchers. d. First two principal components of four variables: proportion of researchers in each country mobile across organizations, proportion mobile across cities, proportion mobile across regions, and proportion mobile across countries. The countries are roughly sorted in order of the number of mobile researchers and the fraction of international mobile researchers in the first and second principal components, which are indicated by PC1 and PC2, respectively. PC1 explains 88.3% of the total variance, whereas PC2 explains 9.5% of the total variance.

Figure S3: Reverse cumulative-distribution function of mobile researchers by geographic scale. a. Survival probability of mobile researchers with respect to the number of organizations in their affiliation trajectory. All mobile authors were affiliated with at least two organizations (i.e., survival probability of one) and about 35.0% were affiliated with three or more. b. About 68% of mobile authors listed at least two cities represented in their career trajectories. c. 45% of mobile authors have two or more regions represented in their career trajectories. d. Only 14% of mobile authors had two or more countries represented in their career trajectories.
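The survival probability described in this caption is the reverse cumulative distribution, P(X >= k). The sketch below shows one way to compute it from a list of per-author counts; the function name and the ten sample counts are hypothetical and do not come from the dataset.

```python
from collections import Counter


def survival(counts):
    """Return {k: P(X >= k)} for every observed value k."""
    n = len(counts)
    freq = Counter(counts)
    remaining = n  # number of observations with value >= current k
    result = {}
    for k in sorted(freq):
        result[k] = remaining / n
        remaining -= freq[k]
    return result


# Hypothetical number of distinct organizations per mobile author.
orgs_per_author = [2, 2, 2, 3, 2, 4, 3, 2, 2, 2]
surv = survival(orgs_per_author)
```

Because every mobile author has at least two affiliations by definition, the curve starts at one for k = 2, matching the description in panel (a).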
Figure S4: Larger dimensions and smaller window sizes improve embedding performance. The correlation, or amount of flux explained by the embedding distance, with varying skip-gram negative sampling hyperparameters. Window size refers to the size of the context window that defines the context in a trajectory. Smaller window sizes result in an embedding that explains more flux. Embedding dimensions refer to the size of the embedding vector. Larger vectors perform better, though there is little difference between 200 and 300 dimensions. All variants perform better on same-country pairs of organizations than on all organizations. All variants perform worse on different-country pairs of organizations. Embeddings with larger dimensions outperform mid-size embeddings for the different-country case.
Figure S5: Cosine distance is correlated with dot product similarity. We find a relatively high correlation between the embedding distance—one minus the cosine similarity—and the dot product similarity between organization vectors (). Color of each hex bin indicates the frequency of organization pairs. Black dots indicate the mean dot product similarity averaged over binned sets at the same embedding distance. Red line is line of best fit.
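The two quantities compared in Fig. S5 differ only in normalization: the embedding distance is one minus the cosine similarity, while the dot product also depends on the vector norms. A minimal sketch with toy two-dimensional vectors (not actual organization vectors):

```python
import math


def dot(u, v):
    """Dot product similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_distance(u, v):
    """Embedding distance as defined in the caption: 1 - cosine similarity."""
    return 1.0 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Same direction, different lengths: distance 0, but a large dot product.
u, v = [2.0, 0.0], [4.0, 0.0]
d_parallel = cosine_distance(u, v)
s_parallel = dot(u, v)

# Orthogonal vectors: distance 1 and dot product 0.
d_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Scaling either vector changes the dot product but leaves the cosine distance unchanged, which is why the two measures are correlated without being interchangeable.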
Figure S6: For geographic distance, the power-decay gravity model is better. Flux between organization pairs predicted by the gravity model with different distance decay functions, i.e., exponential decay function (a) and power-law decay function (b) using geographic distance. Boxplots show distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps with the line y = x; a perfect prediction yields all points on the black line. Shown for all pairs of organizations (a-b), domestic (c-d), and international only (e-f) mobility. The gravity model with the power-decay function outperforms that with an exponential decay function.
Figure S7: For embedding distance, the exponential-decay gravity model is better. Flux between organization pairs predicted by the gravity model with different distance decay functions, i.e., exponential decay function (a) and power-law decay function (b) using embedding distance. Boxplots show distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps with the line y = x; a perfect prediction yields all points on the black line. Shown for all pairs of organizations (a-b), domestic (c-d), and international only (e-f) mobility. The gravity model with the exponential decay function outperforms that with a power-decay function.
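For readers who want to see the two decay functions from Figs. S6 and S7 side by side, the sketch below uses a generic gravity form, flux = C * m_i * m_j * f(d). This form, the constant `C`, and the parameter names `alpha` and `beta` are assumptions for illustration; the fitted specification and parameter values are not given in this section.

```python
import math


def exponential_decay(beta):
    """f(d) = exp(-beta * d); beta is a hypothetical decay rate."""
    return lambda d: math.exp(-beta * d)


def power_decay(alpha):
    """f(d) = d ** (-alpha); alpha is a hypothetical exponent (requires d > 0)."""
    return lambda d: d ** (-alpha)


def gravity_flux(m_i, m_j, d, decay, C=1.0):
    """Generic gravity prediction: proportional to both masses, damped by f(d)."""
    return C * m_i * m_j * decay(d)


# Toy organization sizes and a toy distance.
flux_power = gravity_flux(10.0, 20.0, 2.0, power_decay(2.0))
flux_exp = gravity_flux(10.0, 20.0, 2.0, exponential_decay(1.0))
```

Taking logarithms makes both variants linear, in log d for the power law and in d for the exponential, so either can be fit by ordinary least squares on log flux.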
Figure S8: Embedding distance explains more variance for global, within-country, and across-country flux than geographic distance. a. Embedding distance explains more flux than geographic distance (b). Red line is the line of best fit. Black dots are mean flux across binned distances. Color of each hex bin indicates frequency of organization pairs. Results here are identical to those shown in Fig. 2. c-d. Embedding distance explains more variance when considering only within-country organization pairs. e-f. Embedding distance is more robust than geographic distance when considering only across-country organization pairs.
Figure S9: Little difference between gravity predictions fit on all or subsets of data. Predictions of flux between organization pairs made using embedding distance outperform those made using geographic distance. Boxplots show distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line. a-b. Predictions are made with parameters estimated from all pairs of organizations (as in Fig. 2e), showing only the subsets of predictions for organization pairs in the same country (a) and in different countries (b). c-d. Predictions made using parameters estimated from the subset of organizations in the same country (c) and in different countries (d); this is the same data as shown in Fig. 2f and Fig. 2h.
Figure S10: Gravity model with dot product similarity in the embedding space. Performance of dot product similarities in explaining and predicting mobility. Similarity scores are calculated as the pairwise dot product between organizational vectors. Dot product similarity performs better than geographic distance, though worse than cosine similarity, in explaining global (a), domestic (b), or international (c) mobility. Red line is the line of best fit. Black dots are mean flux across binned distances. Color indicates frequency of organization pairs within each hex bin. Similarly, PPR distance performs comparably to geographic distance in predicting global (d), domestic (e) and international (f) scientific mobility. Boxplots show distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line.
Figure S11: Personalized PageRank with cosine distance. Performance of personalized PageRank (PPR) scores in explaining and predicting mobility. Personalized PageRank is calculated for the underlying mobility network, and distance is measured as the cosine distance between PPR probability distribution vectors. PPR cosine distance performs roughly similarly to geographic distance in explaining global (a), domestic (b), or international (c) mobility. Red line is the line of best fit. Black dots indicate the mean flux across binned distances. Color of each hex bin indicates frequency of organization pairs. Similarly, PPR distance performs comparably to geographic distance in predicting global (d), domestic (e) and international (f) scientific mobility. Boxplots show distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line.
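The baseline in Fig. S11 can be illustrated with a small power-iteration sketch of personalized PageRank on a weighted graph, followed by the cosine distance between two PPR vectors. The restart probability of 0.15, the iteration count, and the three-node flow network are assumptions for illustration; the settings used for the figure are not stated in this section.

```python
import math


def personalized_pagerank(adj, source, restart=0.15, iters=200):
    """Power iteration for PPR. adj maps node -> {neighbor: weight}."""
    nodes = sorted(adj)
    p = {n: (1.0 if n == source else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        nxt[source] += restart  # restart mass always returns to the source
        for u in nodes:
            out = sum(adj[u].values())
            if out == 0:
                # Dangling node: send its walking mass back to the source.
                nxt[source] += (1.0 - restart) * p[u]
                continue
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / out
        p = nxt
    return [p[n] for n in nodes]


def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical symmetric flows between three organizations.
flows = {"A": {"B": 3.0}, "B": {"A": 3.0, "C": 1.0}, "C": {"B": 1.0}}
ppr_a = personalized_pagerank(flows, "A")
ppr_c = personalized_pagerank(flows, "C")
d_ac = cosine_distance(ppr_a, ppr_c)
```

Each PPR vector is a probability distribution over nodes, so the same vectors can be fed to other divergence measures without further normalization.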
Figure S12: Personalized PageRank with Jensen-Shannon Divergence. Performance of personalized PageRank (PPR) scores in explaining and predicting mobility. Personalized PageRank is calculated for the underlying mobility network, and distance is measured as the Jensen-Shannon Divergence (JSD) between PPR probability distribution vectors. PPR JSD performs roughly similarly to geographic distance in explaining global (a), domestic (b), or international (c) mobility. Overall, PPR JSD explains more variance in mobility than using cosine distance (Fig. S11), except for international mobility, for which cosine similarity outperforms JSD. Red line is the line of best fit. Black dots are mean flux across binned distances. Color of each hex bin indicates frequency of organization pairs. Similarly, PPR JSD performs comparably to geographic distance in predicting global (d), domestic (e) and international (f) scientific mobility. Boxplots show distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line.
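The Jensen-Shannon divergence used in Fig. S12 is the average Kullback-Leibler divergence of two distributions from their midpoint. The sketch below uses base-2 logarithms, which bounds the value between 0 and 1; the logarithm base used for the figure is not stated here, so that choice is an assumption, and the distributions are toy values.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Disjoint distributions reach the maximum of 1 bit; identical ones give 0.
jsd_disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
jsd_identical = jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5])
jsd_partial = jensen_shannon_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

Unlike the plain KL divergence, JSD is symmetric and stays finite when one distribution assigns zero probability to a node the other can reach.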
Figure S13: Visualization of global mobility network. The network demonstrates country-level structure, but not at the detail or the extent of the global UMAP projection (Fig. 3a). Each node corresponds to an organization, whereas weighted edges (not shown) correspond to the flow of mobile researchers between the two organizations. Nodes are colored by the country of the organization. Nodes are positioned using the Force Atlas layout algorithm.
Figure S14: UMAP Projection of organizations in New York. Each point corresponds to an organization and its size indicates the average annual number of mobile and non-mobile authors affiliated with that organization from 2008 to 2019. Color indicates the sector.
Figure S15: UMAP Projection of organizations in Pennsylvania. UMAP projection of the embedding space of organizations in Pennsylvania reveals clustering based on geography, sector, and academic prestige. Each point corresponds to an organization and its size indicates the average annual number of mobile and non-mobile authors affiliated with that organization from 2008 to 2019. Color indicates the sector.
Figure S16: UMAP Projection of organizations in Texas. Each point corresponds to an organization and its size indicates the average annual number of mobile and non-mobile authors affiliated with that organization from 2008 to 2019. Color indicates the sector.
Figure S17: UMAP Projection of organizations in California. Each point corresponds to an organization and its size indicates the average annual number of mobile and non-mobile authors affiliated with that organization from 2008 to 2019. Color indicates the sector.
Figure S18: SemRank hierarchy is robust. a. Spearman’s () between Times prestige rank and embedding rank derived using SemAxis, with poles defined using the top and bottom (geographically matched) ranked universities. Black points show Spearman correlation using all organizations; white points show correlation using only universities not aggregated in the poles. Including more universities improves performance, but quickly saturates after around five universities. b - f. Comparison between the Times and SemAxis ranks of universities, by the number of universities used to define the poles (n). White points are those top and bottom 20 universities aggregated to define the ends of the axis. The grey box corresponds to the top 20 and bottom 20 ranks. Spearman’s details the estimate from Spearman correlation between the two rankings using all universities, including those used to define the ends of each axis.
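As a reminder of how a SemAxis ranking is formed, the sketch below builds an axis as the difference between the mean vectors of two poles and scores each vector by its cosine similarity with that axis. The university names, the two-dimensional vectors, and the single-member poles are hypothetical; the figure uses pole sets drawn from the Times ranking.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semaxis_scores(vectors, top_pole, bottom_pole):
    """Score every vector by cosine similarity with the pole-difference axis."""
    top = mean_vector([vectors[name] for name in top_pole])
    bottom = mean_vector([vectors[name] for name in bottom_pole])
    axis = [a - b for a, b in zip(top, bottom)]
    return {name: cosine_similarity(v, axis) for name, v in vectors.items()}


# Hypothetical embedding vectors for five universities.
vectors = {
    "U1": [1.0, 0.0],
    "U2": [0.8, 0.3],
    "U3": [0.5, 0.5],
    "U4": [0.3, 0.8],
    "U5": [0.0, 1.0],
}
scores = semaxis_scores(vectors, top_pole=["U1"], bottom_pole=["U5"])

# Rank from the top pole to the bottom pole.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Enlarging the poles only changes the two averages that define the axis, which is the knob varied along the horizontal axis in panel (a).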
Figure S19: Network centrality is strongly correlated with Times ranking. Comparison between the ranking of organizations by their network-centrality rank and their rank in the 2018 Times Higher Education ranking of U.S. Universities. The Times rank is correlated with degree centrality rank (a) with Spearman’s , and is correlated with the eigenvector centrality rank (b) with Spearman’s .
Figure S20: Network centrality less correlated with Embedding rank. Comparison between the ranking of organizations by their network-centrality rank and the embedding rank derived with SemAxis with poles defined using the top five to geographically-matched bottom five universities ranked by the 2018 Times Higher Education ranking of U.S. Universities. Embedding rank is correlated with degree centrality rank (a) with Spearman’s , and is correlated with the eigenvector centrality rank (b) with Spearman’s .
Figure S21: Geography and prestige SemAxis by U.S. state. SemAxis projection along two axes, comparing California to Massachusetts universities (left to right), and between the top 20 and geographically-matched bottom 20 universities ranked by the 2018 Times Higher Education ranking of U.S. Universities (bottom to top). Points correspond to universities shown for California (a), Arizona (b), Washington (c), Massachusetts (d), Connecticut (e), New York (f), Texas (g), Pennsylvania (h), and Florida (i). Grey points correspond to all other U.S. universities. Full organization names listed in Table S1.
Figure S22: SemAxis reconstructs publication impact in non-university sectors. Comparison between the ranking of organizations in each non-university sector by their citation impact and the embedding rank. Citation impact is calculated as the mean-normalized citation score using papers published in the Web of Science database between 2008 and 2019. The embedding rank is derived by first projecting non-university organizations onto the SemAxis axis formed with poles defined using the top five to geographically-matched bottom five universities ranked by the 2018 Times Higher Education ranking of U.S. Universities. a. The correlation between the citation impact and SemAxis rankings as the size threshold for including an organization is varied. Size is calculated as the mean annualized number of unique authors publishing with that organization. Annotations show the number of organizations remaining at thresholds of 0, 50, and 100. b. Comparison of organizations using a size threshold of 10 for regional and liberal arts colleges, and 50 for research institutes and government organizations; these thresholds were chosen as points of stability in (a). The impact rank is correlated with the embedding rank for regional and liberal arts colleges with Spearman’s , research institutes with Spearman’s , and for government organizations with Spearman’s .
Figure S23: Factors relating to the L2 norm of vectors for U.S. universities. Correlation between the L2 norm of organization embedding vectors of U.S. universities and characteristics of U.S. universities. Dots correspond to organizations. The red line is the line of best fit with corresponding 99% confidence intervals. Red text is the regression estimate. The blue line is the loess regression line with 99% confidence intervals. Number of authors is the average annual count of unique mobile and non-mobile authors. Rankings are derived from the Times Ranking of World Universities, and the Leiden Rankings of Universities. Remaining variables come from the Carnegie Classification of Higher Education Institutions. The factors that best explain the L2 norm are the number of authors, the rank, the amount of Science and Engineering (S&E) funding, and the number of doctorates granted.
Figure S24: The concave curve repeats across most of the 30 countries with the most researchers. Size (L2 norm) of organization embedding vectors compared to their number of researchers for U.S. universities. Loess regression line is shown for each country with 99% confidence intervals. Countries shown are the 30 with the largest number of total unique mobile and non-mobile researchers.
Figure S25: Distribution of organization embedding vector norms by country. Histogram showing the distribution of L2 norm values of organization embedding vectors in each of the 30 countries with the largest number of total unique mobile and non-mobile researchers. Text in each panel shows the number of organizations in the country (n) and the Gini index of inequality of the distribution (g); a small Gini index indicates that the L2 norms of organizations are more balanced, whereas a high Gini value indicates that they are more unequal.
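The Gini index reported in each panel can be computed from the sorted values with a standard closed-form expression. The sketch below shows one common formulation; whether the figure uses this exact estimator (for example, with or without a small-sample correction) is not stated, so treat it as an assumption. The input values are toy norms.

```python
def gini(values):
    """Gini index of non-negative values: 0 = perfectly equal, near 1 = unequal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Weight each sorted value by its 1-based rank.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Equal norms give 0; concentrating everything in one organization gives a
# value that approaches 1 as the number of organizations grows.
g_equal = gini([2.0, 2.0, 2.0, 2.0])
g_unequal = gini([0.0, 0.0, 0.0, 1.0])
```

With n values the maximum of this estimator is (n - 1) / n, so panels with very few organizations have a slightly lower attainable ceiling.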