Changing institution is an integral part of academic life and a key career decision for a scientist, one that can play an important role in education, scientific productivity, and the generation of scientific knowledge. Scientific migration allows scientists to find environments where they are more effective in doing their research and contribute to the success of research institutions. Yet, our understanding of the factors influencing a relocation decision, such as the scientific profile of a scientist or the quality of their scientific environment, is limited. Furthermore, while social relationships are known to influence human activity in several contexts (cho2011friendship; wang2013quantifying), it is not clear how a scientist’s collaboration network contributes to their decision to relocate. Previous research on the migration patterns of scientists is limited to the analysis of large-scale surveys on country-level movements (azoulay2017mobility; appelt2015factors), or to the investigation of the effects of changing institution on scientific production and citations (deville2014career; sugimoto2017scientists).
In this work we investigate the other perspective, i.e., how the scientific profile of a scientist influences their decision to move, based on a large dataset of publications from the American Physical Society (APS), consisting of all the publications in APS journals from 1950 to 2009: 60,000 scientists, 3,500 institutions and 360,000 articles. Our main objective is to predict where a scientist will move in the next year. We approach this problem by dividing it into two subproblems, thus constructing a two-stage predictive model. We first predict, using data mining techniques, which researchers in the APS database will move (i.e., change institution) in the next year. At this step we describe a scientist’s profile as a multidimensional vector of variables describing three main aspects: the scientist’s recent scientific career, the quality of their scientific environment and the structure of their scientific collaboration network. From the constructed predictive model, we also identify the main factors influencing scientific migration. Secondly, for those researchers who are predicted to move, we predict which institution they will choose using a novel social-gravity model, an adaptation of the traditional gravity model of human mobility to include the factors identified at the first step.
Our experiments on the APS dataset reveal two crucial results: (i) a scientist’s xenophilia, i.e., their tendency to collaborate with scientists at external institutions, is strongly correlated with their decision to migrate; (ii) our approach performs well in both prediction phases, with the proposed social-gravity model producing an 85% reduction of the prediction error with respect to the state-of-the-art gravity model. (The data and the related code to reproduce the research will be made available in the camera-ready version of the paper.)
This work provides several contributions to the scientific community. First and foremost, we build a novel model of scientific migration that combines data mining with a customised model of human mobility. Furthermore, we provide insight, based on data, into the factors affecting the decision to change institution and those involved in choosing the next employer. Our predictive model can help institutions and governments understand scientific mobility and implement policies to attract the best scientists or prevent their departure, hence improving the quality of research. At the same time, this type of prediction can facilitate the construction of services that recommend new jobs to scientists based on their profile, or help search committees find suitable candidates for their research jobs.
The rest of the paper is organised as follows. In the next section we discuss related work on scientific migration. In Section 3 we formalise our problem by defining career trajectories and scientific profiles. Section 4 describes our data, the analysis performed, and the results of each of our prediction phases. We conclude the paper with a discussion in Section 5.
2. Related Work
Various aspects of scientific profiles have been analysed in recent years, with publication data being central in the process. One line of research is on collaboration networks, which has brought important insights and results in the field of complex network theory and its applications to the analysis of real-world networks (perra2012activity; colizza2006detecting; newman2001structure). A different aspect is the evaluation of the productivity of scientists, relevant for career progression in academia, with multiple performance indices being proposed over the years (sinatra2016quantifying; wang2013quantifying). Despite the importance of career moves to scientific productivity and education, little effort has been devoted in the literature to understanding how scientists make them. The studies proposed in the literature in this regard can be grouped into three strands of research.
A first strand focuses on large-scale surveys on country-level movements and reveals long-term cultural and economic priorities (auriol2010careers; vannoorden2012global; moed2013studying). Appelt et al. (appelt2015factors) use a gravity-based empirical framework to investigate the factors that influence the international mobility of scientists, finding that geographic distance, as well as socio-economic disparities and scientific proximity, negatively correlate with the mobility of scientists between two countries. Azoulay et al. (azoulay2017mobility) investigate the professional and personal determinants of the decision to relocate to a new institution, finding that scientists are more likely to move when they are highly productive and when their local collaborators are fewer and less accomplished than their distant collaborators, while they find it costly to disrupt the social networks of their children.
A second strand of research focuses on understanding the effect of a scientist’s relocation on their scientific impact. Analyzing the relocations and the scientific performance of scientists, Deville et al. (deville2014career) find that while moves from elite to lower-rank institutions lead to a moderate decrease in scientific performance, moves to elite institutions do not necessarily result in a subsequent performance gain. Sinatra et al. (sinatra2017quantifying) offer empirical evidence that scientific impact is randomly distributed within the sequence of papers published by an individual during their scientific career, implying that temporal changes in impact can be explained by temporal changes in productivity or luck. Sugimoto (sugimoto2017scientists) analyzes the migration traces of scientists extracted from Web of Science and reveals that, regardless of the nation of origin, scientists who relocate are more highly cited than their non-moving counterparts.
A third strand, in the context of labor mobility, has been fostered by the availability of massive datasets of individuals’ career paths and addresses the prediction of individuals’ next jobs (outside academia). Paparrizos et al. (paparrizos2011machine) build a system to recommend new jobs to people who are seeking a job, using all their past job transitions as well as their employee data. They train a machine learning model to show that job transitions can be accurately predicted, significantly improving over a baseline that always predicts the most frequent institution in the data. Recently, Li et al. (li2017nemo) proposed a system to predict next career moves based on profile context matching and career path mining from a real-world LinkedIn dataset. They show that their system can predict future career moves, revealing interesting insights into micro-level labor mobility.
Our work lies at the intersection of the aforementioned strands of research. We explore the characteristics of scientific performance which most affect the decision to relocate, focusing on aspects that have not yet been investigated in the literature, such as a scientist’s propensity to collaborate with external institutions and their relations with the peer environment. Moreover, we propose an algorithm to predict the next career move which is tailored for scientists, hence considering science-specific features and the distance between scientific institutions. From a methodological point of view, our work provides a novel solution to the next (scientific) career move problem, as we combine data mining predictive models with global generative migration models.
3.1. Career trajectory
A career trajectory indicates the time-ordered sequence of institutions the scientist worked at during a given time window. Formally, we define a scientist’s career trajectory as a time-ordered sequence of (timestamp, institution) tuples, where each timestamp is a year and each institution is the scientist’s institution (i.e., their affiliation) in that year. Two consecutive affiliations in a career trajectory indicate a move, i.e., that the scientist moved from one institution to another; formally, a move is a pair of consecutive tuples in the trajectory. For example, a trajectory whose first tuple is at an institution in Evanston, Illinois and whose second tuple, with timestamp 1973, is at an institution in Hamburg, Germany indicates that the scientist moved from Evanston to Hamburg in 1973.
Figure 1 shows an illustrative example of a 40-year-long career trajectory: the scientist, initially at Stanford University, moves to four other institutions during their career, each migration being detected by a change of the main affiliation in the scientist’s publications.
3.2. Scientific Profile
We define the scientific profile of a scientist during a time window as a multidimensional feature vector in which each element is a feature describing a specific aspect of the scientist’s scientific activity during a history window of a given number of years ending at a given time.
Three macro-aspects compose an individual’s scientific profile: (i) their scientific career, in terms of experience, publications and citations; (ii) their scientific environment, i.e., the level of production of the colleagues at the scientist’s current institution; and (iii) their scientific relationships, indicating the working relationships established with collaborators at external institutions over the years. Table 1 describes the features composing the scientific profile.
The scientific career includes features describing individual characteristics of the scientist over the history window. As proxies of scientific production, we consider the number of publications the scientist produced and the citations they received during the considered period. Moreover, we estimate the scientist’s experience as the number of years since their first publication, and define their scientific mobility using boolean values which represent whether or not they have changed institution within the history window.
A scientist’s production shapes, and is shaped by, the scientific environment in which they operate. For this reason we estimate the level of production of a scientist’s environment as the number of citations and the number of publications of their colleagues during the history window. A colleague is a scientist working at the same institution as the scientist at the end of the window. Moreover, we consider the differential of citations, i.e., the mean difference between the scientist’s citations and their colleagues’ citations, as well as the differential of publications, i.e., the mean difference between the scientist’s publications and the publications by their colleagues.
Scientific collaboration is a proven mechanism for promoting excellence in scientific research, as higher collaboration rates are linked to higher citation rates (appelt2015factors; abramo2017relationship; sugimoto2017scientists). For this reason we take into account a scientist’s collaborations by estimating the size of their collaboration circle using three features: the number of institutions the scientist collaborates with during the history window, the total number of distinct collaborators, and the scientist’s tendency to collaborate with scientists at external institutions (xenophilia), computed as the ratio of external collaborators to the total number of collaborators in the window.
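| aspect | feature | description |
|---|---|---|
| (i) career | publications | number of papers published within the history window |
| (i) career | citations | number of citations received within the history window |
| (i) career | experience | years since the first publication |
| (i) career | mobility | whether the scientist changed institution within the history window |
| (ii) environment | colleagues' citations | mean number of citations received by colleagues within the history window |
| (ii) environment | colleagues' publications | mean number of papers published by colleagues within the history window |
| (ii) environment | differential of citations | mean difference between the scientist's citations and colleagues' citations within the history window |
| (ii) environment | differential of publications | mean difference between the scientist's publications and colleagues' publications within the history window |
| (iii) relationships | institutions | number of institutions the scientist has collaborated with within the history window |
| (iii) relationships | collaborators | number of scientists the scientist collaborated with within the history window |
| (iii) relationships | xenophilia | ratio of external collaborators to total collaborators within the history window |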
3.3. Problem definition
We refer to next institution prediction as the task of predicting the institution of a scientist in the following year given their recent scientific profile up to the current year. Formally, given a set of scientists and a set of institutions, and given, for a scientist, a history window, a year and the corresponding scientific profile, next institution prediction consists of two phases:
Move prediction phase: predicting whether the scientist will move to a new institution in the following year;
Destination prediction phase: if the scientist moves, predicting the institution they move to.
We address these phases by the approach detailed in the following section.
4. Next institution prediction
Our dataset consists of all articles published in the American Physical Society (APS) journals from 1950 to 2009. For each article, the date of publication, the author names and the corresponding affiliations are stored. In addition, location information (latitude and longitude) is available for every affiliation that appears in the dataset. In total, we have over 60,000 scientists, 3,500 institutions and 360,000 articles.
4.2. Computation of career trajectories
For each scientist in the dataset, we construct a career trajectory as follows. First, we sort all of the scientist’s publications by time, from the least recent to the most recent, and we build the time-ordered sequence of their affiliations, pairing the year of publication of each paper with the scientist’s affiliation on that paper. Note that a scientist may specify more than one affiliation in a publication. We disambiguate these cases using the first affiliation reported by the scientist, as suggested in the literature (deville2014career). We then initialize the career trajectory and add the (year, affiliation) tuple of a publication to it if the institution associated with that publication is different from the institution associated with the previous publication. Algorithm 1 shows the pseudo-code of the algorithm to build a scientist’s career trajectory from their list of publications.
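The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's Algorithm 1: the input format (a list of `(year, affiliations)` records per scientist) and the function names are hypothetical, and we assume the trajectory starts with the tuple of the earliest publication.

```python
from typing import List, Tuple

# Hypothetical input record: (publication year, affiliations listed on the paper).
Publication = Tuple[int, List[str]]


def build_career_trajectory(publications: List[Publication]) -> List[Tuple[int, str]]:
    """Return the time-ordered list of (year, institution) tuples for one scientist."""
    trajectory: List[Tuple[int, str]] = []
    # Sort from the least recent to the most recent publication.
    for year, affiliations in sorted(publications, key=lambda p: p[0]):
        if not affiliations:
            continue  # assumption: papers without an affiliation are skipped
        institution = affiliations[0]  # the first affiliation disambiguates
        # Add a tuple only when the institution differs from the previous one.
        if not trajectory or trajectory[-1][1] != institution:
            trajectory.append((year, institution))
    return trajectory


def moves(trajectory: List[Tuple[int, str]]):
    """A move is a pair of consecutive tuples in the trajectory."""
    return list(zip(trajectory, trajectory[1:]))
```

For instance, a scientist with papers at institution "A" in 1970 and 1972 and at "B" in 1975 and 1980 would get a two-tuple trajectory and a single move dated 1975.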
4.3. Computation of scientific profile
Given a time and a history window, we compute from the dataset the career, environment and relationships features for the scientist’s activity in the preceding years in the following way. The number of publications by a scientist is given by the total number of papers in the dataset for which the scientist is an author and whose publication date falls within the window. We compute the number of citations as the sum of citations to all papers in the dataset for which the scientist is an author and for which the citing paper is published within the window. The experience of the scientist is computed as the difference between the given time and the time of the scientist’s first publication in the dataset. Finally, the mobility of the scientist relates to how recently they have changed institution: the feature takes one value if the scientist has moved within the window, a second value if they have not moved, and a third value if at the given time they are still at their first institution (i.e., the only institution they have been affiliated with so far in the dataset).
To compute the environmental features we define the colleagues (or peers) of a scientist as all the scientists who publish a paper during the history window and are affiliated with the scientist’s institution at the end of that window. For each colleague we compute their publications and citations as described above. The colleagues’ citations and publications features are then computed as the mean number of citations and the mean number of publications over all colleagues, respectively. The differential of citations is determined by taking the difference between the scientist’s citations and each peer’s citations individually, and then taking the mean of these differences. The differential of publications is determined in an identical way: it is the mean difference between the number of publications made by the scientist and by each peer.
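As a small worked illustration of the environment features, the sketch below computes the colleagues' means and the two differentials from already-counted citations and publications. The function name and the input layout (one `(citations, publications)` pair per peer) are hypothetical, and we assume the scientist has at least one peer.

```python
from statistics import mean
from typing import Dict, List, Tuple


def environment_features(own_citations: int,
                         own_publications: int,
                         peers: List[Tuple[int, int]]) -> Dict[str, float]:
    """peers holds one (citations, publications) pair per colleague; must be non-empty."""
    peer_citations = [c for c, _ in peers]
    peer_publications = [p for _, p in peers]
    return {
        "colleagues_citations": mean(peer_citations),
        "colleagues_publications": mean(peer_publications),
        # Mean of the per-peer differences, as described in the text.
        "differential_citations": mean(own_citations - c for c in peer_citations),
        "differential_publications": mean(own_publications - p for p in peer_publications),
    }
```

Note that the mean of the per-peer differences equals the scientist's own count minus the colleagues' mean, so each differential measures how far the scientist sits above or below the average of their environment.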
To compute the relationships features we define a collaborator as a scientist who is a co-author of the scientist on at least one paper published within the history window. The collaborators feature is then the number of distinct collaborators in the list of co-authors, and the institutions feature is the number of distinct affiliations. We compute xenophilia as the ratio of the number of collaborators at external institutions to the total number of collaborators.
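The relationships features can be sketched as follows. This is only an illustration under stated assumptions: each paper is represented as a hypothetical `{author: institution}` mapping restricted to the history window, a collaborator counts as external when none of their recorded affiliations is the scientist's own institution, and xenophilia is set to 0 for a scientist with no collaborators (the text does not specify this case).

```python
from typing import Dict, List


def relationship_features(papers: List[Dict[str, str]],
                          scientist: str,
                          home_institution: str) -> Dict[str, float]:
    """Compute collaborators, institutions and xenophilia for one scientist."""
    affiliations_by_coauthor: Dict[str, set] = {}
    for authors in papers:
        if scientist not in authors:
            continue  # only papers co-authored by the scientist matter
        for author, institution in authors.items():
            if author != scientist:
                affiliations_by_coauthor.setdefault(author, set()).add(institution)

    n_collaborators = len(affiliations_by_coauthor)
    # Distinct affiliations appearing among the co-authors.
    distinct_institutions = set().union(*affiliations_by_coauthor.values())
    n_external = sum(1 for insts in affiliations_by_coauthor.values()
                     if home_institution not in insts)
    # Assumption: xenophilia is 0 when there are no collaborators.
    xenophilia = n_external / n_collaborators if n_collaborators else 0.0
    return {
        "collaborators": n_collaborators,
        "institutions": len(distinct_institutions),
        "xenophilia": xenophilia,
    }
```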
4.4. Phase #1: Move Prediction
The move prediction phase aims at predicting whether a scientist will migrate the next year given their recent scientific profile. Given the set of scientists, for each scientist we compute the feature vector that describes their scientific profile over the history window. Since the features have different ranges of values and distributions, we standardize them by computing their quantiles with respect to the same feature of the other scientists in the set. By doing this, each element of the feature vector lies between 0 and 1 and represents how the scientist’s features compare to those of all other scientists.
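One simple way to realise this quantile standardisation is to replace each value by the share of scientists whose value for the same feature is not larger. The sketch below uses only the standard library; the function name and the handling of ties (tied values receive the same quantile) are our assumptions, and the paper may use a different quantile convention.

```python
from bisect import bisect_right
from typing import List, Sequence


def quantile_standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Map every feature value to its empirical quantile within its column."""
    n = len(rows)
    if n == 0:
        return []
    # One sorted copy of each feature column, used for rank lookups.
    sorted_columns = [sorted(column) for column in zip(*rows)]
    return [
        [bisect_right(sorted_columns[j], value) / n for j, value in enumerate(row)]
        for row in rows
    ]
```

With this convention every standardised value lies in the interval (0, 1], and the largest value of each feature maps to 1.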
We then assign to each feature vector a binary label, where 1 indicates that the scientist will migrate the next year, and 0 that they will not migrate. From the scientists’ feature vectors we construct a dataset of examples, each associated with the corresponding label. For examples with label 1, we use the move events of all scientists: all years in which a scientist’s affiliation is different from their affiliation in the following year. For examples with label 0, for each scientist we generate a random year between their first and last publications in the dataset, repeating the draw until the selected year is not one in which the scientist moved. We repeat this process multiple times so that the two classes in the resulting dataset are approximately balanced.
To investigate to what extent we can predict a future migration given a scientist’s history, for each value of the history window we train two predictors on the dataset and the associated labels: a logit (logistic regression) and a decision tree. (We use the Python package scikit-learn (sklearn).) We evaluate the predictors with 10-fold cross-validation and investigate the goodness of predictions, in terms of accuracy, recall, precision, F1-score and AUC, as the history window varies. Table 2 shows the goodness of prediction of the best tree and the best logit.
Table 2. Goodness of prediction of the best decision tree and the best logit, compared with a baseline classifier (dummy). The models are evaluated with 10-fold cross-validation, and the goodness of predictions is measured in terms of accuracy (ACC), recall, precision (prec), F1-score (F1) and Area Under the Curve (AUC).
We find that for and for . Both predictors perform better than a baseline classifier which generates predictions based upon the class distribution of the training set.
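The evaluation protocol for the move prediction phase can be sketched with scikit-learn as follows. This is an illustrative setup rather than the exact configuration used in the experiments: the hyperparameters (`max_iter`, `max_depth`), the random seed and the helper name `evaluate_models` are our assumptions, and the "stratified" dummy strategy is chosen because it generates predictions according to the class distribution of the training set.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Metrics reported in the text: accuracy, recall, precision, F1-score and AUC.
SCORING = ["accuracy", "recall", "precision", "f1", "roc_auc"]


def evaluate_models(X, y, folds=10, seed=0):
    """Return the mean cross-validated score of each model for each metric."""
    models = {
        "logit": LogisticRegression(max_iter=1000),
        # The depth limit is an arbitrary illustrative choice.
        "tree": DecisionTreeClassifier(max_depth=5, random_state=seed),
        # Baseline that samples labels from the training class distribution.
        "dummy": DummyClassifier(strategy="stratified", random_state=seed),
    }
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=folds, scoring=SCORING)
        results[name] = {m: float(scores[f"test_{m}"].mean()) for m in SCORING}
    return results
```

Here `X` would be the matrix of standardised scientific profiles and `y` the binary move labels; running the function once per history window size gives the kind of comparison summarised in Table 2 and Figure 2.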
Figure 2 shows how the AUC score changes with the size of the history window for the decision tree and the logit. We observe that: (i) the tree performs slightly better than the logit; (ii) both curves stabilise, suggesting that a longer history window adds no additional predictive power to either model; (iii) both curves show a slight decrease over part of the range. The decrease may be explained by a change in feature values due to, for example, the average length of a PhD or tenure track position being approximately 5 years. In other words, the last five years of a scientist’s career provide sufficient information to predict, with reasonable accuracy, whether they will move or not.
Figure 3 shows the feature importances resulting from the trained model, indicating that xenophilia (i.e., the ratio of external collaborators to total collaborators) is the strongest predictor of the probability to migrate. This means that scientists with a strong propensity to collaborate with other scientists in external institutions are more likely to migrate.
Figure 4 shows the standardised features of a scientist correctly predicted to move (red bars) and a scientist correctly predicted to stay (grey bars). We observe that the scientist who moves has high levels of xenophilia compared to the scientist who does not move. In contrast, the scientist who does not move scores highly against other scientists for features, such as mobility and collaborators, that are of little importance according to Figure 3.
4.5. Phase #2: Destination prediction
The destination prediction phase aims at predicting the destination institution of a scientist who will move. The task of estimating the probability that a scientist will relocate to a given institution can be interpreted as a classification problem with as many classes as institutions. The state-of-the-art model to estimate mobility and migration flows is the gravity model (RefWorks:134; jung2008; Pappalardo2016934), which is a multinomial logistic regression based on the distance and size (population) of locations. Specifically, the features of a traditional singly-constrained gravity model are the size of the potential destination, estimated here as the logarithm of the number of scientists affiliated with the destination institution, and the logarithm of the distance between the scientist’s current institution and the potential destination. We compare the performance of the traditional gravity model with a social-gravity model, an extended model that includes additional indicators of quality and of social interactions with the potential destination. The new social and quality features are: (i) the fraction of collaborators at the potential destination; (ii) the average number of papers published by the scientists at the potential destination; (iii) the average number of citations received by the scientists at the potential destination.
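To make the difference between the two models concrete, the sketch below scores candidate destinations with a multinomial logit (a softmax over linear scores). The function names are hypothetical and the weights passed in are placeholders chosen for illustration, not fitted values; in the paper the weights are learned by maximising the likelihood.

```python
import math
from typing import List, Sequence


def gravity_features(size: float, distance: float) -> List[float]:
    """Log size and log distance of a potential destination (both must be > 0)."""
    return [math.log(size), math.log(distance)]


def social_gravity_features(size: float, distance: float, collaborator_fraction: float,
                            mean_papers: float, mean_citations: float) -> List[float]:
    """Gravity features extended with the social and quality indicators."""
    return gravity_features(size, distance) + [collaborator_fraction, mean_papers, mean_citations]


def destination_probabilities(candidates: Sequence[Sequence[float]],
                              weights: Sequence[float]) -> List[float]:
    """Multinomial logit: softmax of the weighted feature sums of each candidate."""
    scores = [sum(w * x for w, x in zip(weights, features)) for features in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With weights of +1 on log size and -1 on log distance, the probability of a destination is proportional to its size divided by its distance, which recovers the classic gravity form; the social-gravity model simply adds weighted terms for the three extra features.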
We train the model by maximising its likelihood using stochastic gradient descent over the train set, which contains 200 scientists selected at random from the set of scientists. In order to reduce the computational cost of the optimisation, which can be quite high when the total number of destination locations is large, we consider an approximation of the likelihood computed over a subset of 100 potential destinations, as proposed in (li2017nemo). The subset of potential destinations is extracted for each move of each scientist. This subset always includes the true destination, while the other locations are randomly selected with a probability that increases linearly with their size and slowly decreases with their distance from the origin location. This ensures that the most relevant potential destinations, i.e., the larger ones and those closer to the origin, are included in the likelihood’s estimate. Numerical tests show that the optimal size of the subset of potential destinations is 100 locations, i.e., considering more than 100 potential destinations does not significantly improve the model’s performance.
The model’s performance is evaluated considering all the remaining scientists, i.e., those not included in the train set. For each scientist we compute the probabilities of relocating to each institution according to the model. We then sort all institutions in decreasing order of probability and define the rank of each institution so that the model’s top prediction has the largest probability and rank 1. We then consider the rank of the scientist’s actual destination, and we use it to compute two statistics: (1) the cumulative distribution function (CDF) of the ranks of all the moves of all evaluated scientists (Figure 5); and (2) the Mean Percentile Ranking (MPR) (li2017nemo), computed from the rank of each scientist’s actual destination.
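The two evaluation statistics can be sketched as follows. The rank and CDF computations follow the description in the text; the MPR formula is an assumption on our part (the mean of each rank divided by the number of candidate institutions), since the exact definition is not reproduced here, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def rank_of_true_destination(probabilities: Dict[str, float], true_destination: str) -> int:
    """Rank 1 is the institution with the largest predicted probability."""
    ordered = sorted(probabilities, key=probabilities.get, reverse=True)
    return ordered.index(true_destination) + 1


def cdf_at(ranks: Sequence[int], k: int) -> float:
    """Fraction of moves whose true destination is among the top-k predictions."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_percentile_ranking(ranks: Sequence[int], n_institutions: int) -> float:
    """Assumed definition: average of rank / number of institutions (lower is better)."""
    return sum(r / n_institutions for r in ranks) / len(ranks)
```

Under this assumed definition, a model that always ranks the true destination first among N institutions has an MPR of 1/N, while a model that ranks it last has an MPR of 1.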
Results show that including information on the collaboration network significantly improves the model’s performance. In particular, the social-gravity model that includes quality and social information has a CDF value of 0.93 at rank 10, i.e., for 93% of the evaluated scientists the real destination is among the model’s top 10 predictions. This is considerably better than the original gravity model without quality and social information, which has a CDF value of 0.41 at rank 10, i.e., for only 41% of the scientists the real destination is among the model’s top 10 predictions (see Figure 5). This result, that a social-gravity model which incorporates social information is superior, is further supported by the MPR, for which the social-gravity model achieves an error reduction of 85% with respect to the gravity model. These results are summarised in Table 3.
Figure 6 shows the international collaborations and the predicted destinations for the scientist whose relocations are shown in Figure 1. In Figure 6a we see that the institutions of the scientist’s collaborators, plotted as blue circles, are spread around the world, reflecting the scientist’s high xenophilia. In Figure 6b, the institutions are ranked according to the predictions of the social-gravity model. We observe that the scientist’s true destination (the red triangle) has rank 1, indicating that the social-gravity model correctly predicts the scientist’s next career move. This is in contrast to Figure 6c, where institutions are ranked according to the original gravity model and we observe that the highest ranked institution is not the true destination; rather, it is an institution in close proximity to the origin (the green triangle).
Are the migration decisions of scientists influenced by the place where they reside? We investigate this aspect by selecting the scientists residing in Europe and those residing in the United States of America. We find that, for both subsets and for both prediction phases, there is no significant change in the performance of the models. This means that there are no significant differences in the decision making of scientists originating in Europe and in the United States. In other words, our two-phase approach is applicable both to global scientific migration and to scientific migration from a single continent.
5. Conclusions and future works
In this paper, we proposed an efficient solution to the problem of predicting the future institution of a scientist. In the first move prediction phase, we used data mining to predict whether or not a scientist will migrate on the basis of the quality of their career, environment and collaborations. Given the good accuracy achieved in solving the first phase, we moved to the second, and more challenging, destination prediction phase. Experiments showed that our social-gravity model, obtained by injecting information about a scientist’s collaboration network into the traditional gravity model, achieves a prediction error which is up to 85% lower than state-of-the-art approaches.
Our results provide us with a deeper understanding of the factors influencing a scientist’s decision to relocate. We discovered that features associated with scientific collaboration are a key factor in the decision to move, highlighting the importance of social interactions in human activities. Previous studies in human mobility have demonstrated that information on individual human mobility can be used to improve link prediction in social networks (cho2011friendship; wang2011human). Our results suggest that the opposite is also true: information on individual social connections can be used to significantly increase the predictive power of migration models.
Our work paves the way for several lines of research. For example, it would be interesting to generalize our approach to other classes of high-skilled individuals, such as senior managers (iredale2016high) or soccer players (liu2016anatomy). As the relocation of those individuals strongly affects the success of both the origin and destination companies, predicting their relocation decisions can have wide economic consequences for the companies’ revenues and future profits. We also plan to exploit the proposed framework for the creation of a human migration model for the general population. In this context, we can use mobile phone data and social media data to describe individual relocations and social relationships, respectively, and construct a social-gravity model for human migration.
Luca Pappalardo and Alina Sîrbu were supported by the European Commission through the Horizon2020 European project ”SoBigData Research Infrastructure — Big Data and Social Mining Ecosystem” (grant agreement 654024). Filippo Simini is supported by EPSRC First Grant EP/P012906/1. We thank Daniele Fadda for his support on data visualisation.