Cities and countries in the global scientist mobility network

The global mobility and migration of scientists is an important modern phenomenon with economic and political implications. As scientists become ever more footloose, it is important to identify general patterns and regularities at a global scale and to understand how they affect a country’s scientific output. The analysis of mobility and brain circulation patterns at a global scale remains challenging because individual-level mobility data are difficult to obtain. In this work we trace intercity and international mobility through bibliographic records. We reconstruct the intercity and international mobility network of 3.7 million life scientists moving between 5 thousand cities and 189 countries. In this exploratory analysis we offer evidence that international scientist mobility is marked by national borders and show that international mobility boosts the scientific output of selected countries.


Introduction
Scientists are highly mobile professionals, especially in the early phase of their careers. The tendency to move has been observed in the past (Cardwell 1972; Mokyr 2016), but the size of the phenomenon has drastically increased over the years in a globalized market for high-skill labour (Culotta 2017; Geuna 2015; OECD 2017). Modern economies require a highly skilled labour force to maintain their competitive advantage and grow (Chambers et al. 1998; Solimano 2008; Ozden and Rapoport 2018; Zucker and Darby 2007). This economic relevance makes it essential to understand the structure and evolution of this kind of mobility at a global scale. However, individual-level mobility data are challenging to collect, which is the primary reason for the lack of high-resolution, large-scale investigations of the phenomenon. Despite the importance of understanding the global mobility of high-skill labour for education, migration and innovation policies, evidence and literature are scant (Fortunato et al. 2018). Previous research on the mobility of scientists has relied on large-scale surveys (Franzoni et al. 2012; Franzoni et al. 2014; Scellato et al. 2017; Franzoni et al. 2018; Petersen 2018) and, more recently, on massive bibliographic databases (Bohannon and Doran 2017; Deville et al. 2014; Graf and Kalthaus 2018).
There are other sources of mobility information (e.g. job search portals, social media). However, papers offer the most direct and high-frequency signal of scientific activity.
We contribute to the literature on scientific and high-skill labour mobility by constructing and analyzing a large-scale, global scientist mobility dataset of 3.7 million scientists working in 189 countries and 5,531 cities. Through an exploratory analysis, we address three questions about how cities and countries are affected by international scientist mobility. Specifically, we look at (1) how the centrality of cities in the global mobility network has evolved, (2) how national borders and cultural similarity constrain intercity and international mobility, and (3) which countries benefit most from the international exchange.
We take advantage of the fact that scientists, especially in some disciplines, publish regularly throughout their careers, and that the affiliations listed on publications can be geo-referenced. Taking inspiration from bibliographic approaches, we use MEDLINE, a large open-access publications repository primarily covering research in the life sciences. We reconstruct the mobility paths of scientists through their publication history in MEDLINE, using disambiguated author names (AUTHOR-ITY; Torvik and Smalheiser 2009), georeferenced affiliation records (MAPAFFIL; Torvik 2015) and journal impact scores (SCIMAGO; SCImago Journal & Country Rank [Portal]). With these data, we reconstruct individual-level publication histories with affiliation and impact scores. Moreover, we look not only at international mobility but also at intercity moves to capture within-country mobility.
International mobility of high-skill labour is associated with "brain drain", the idea that high-skill labour leaves its home country, to that country's detriment and to the benefit of the receiving country. Several authors (Saxenian 2005; Agrawal et al. 2006; Agrawal et al. 2011) have pointed out that there are positive spillover effects for the sending country, highlighting that global mobility is not a zero-sum game and suggesting that "brain circulation" is a more fitting term for the mobility of high-skill labour. In this work, we do not assess the causal link between scientist mobility and spillover effects, e.g., from diasporas or international collaboration. Instead, we take a high-level view of the international mobility of scientists, looking at the effect of national borders, the evolution of the centrality of cities and the benefits to countries, in order to characterize this data for future research. Moreover, we quantify and discuss the impact of international and intercity scientist mobility and how it relates to scientific output. We first address which "mobility communities" are present in the data and subsequently discuss which countries benefit most from international turnover. Note that we do not have information on the nationality of the authors; when talking about mobility, we therefore do not talk about migration, which would require this information.
The rest of the paper is structured as follows. We first introduce the data and the methodology used to extract individual career trajectories and the mobility network. We then describe which cities lie at its centre using the widely used PageRank centrality measure. To highlight how national borders constrain mobility, we use a community detection approach. Then we discuss which countries benefit most from the international mobility of scientists by estimating the growth in scientific output due to international turnover. Finally, we discuss the implications of these results and offer ideas for future research.

Data
For the analysis of scientist mobility we use four datasets: MEDLINE, AUTHOR-ITY, MAPAFFIL and SCIMAGO. MEDLINE provides open access to more than 26 million records of scientific publications, with most of the corpus covering research in the life sciences. The data go as far back as 1867 (earliest publication in the dataset) and are updated continuously. However, we focus on papers in the period from 1990 to 2009. We restrict our analysis to this period to have good coverage and to make use of existing high-quality disambiguations of scientists (AUTHOR-ITY) and affiliations (MAPAFFIL), which are restricted to this time interval. MAPAFFIL lists, for a large portion of MEDLINE papers, the disambiguated city corresponding to the affiliation of each author as listed on the paper (ca. 37,396,671 author-locations) and is freely available for download from www.nlm.nih.gov. AUTHOR-ITY contains the disambiguated names of 61,658,514 appearances of names on MEDLINE papers (author-name instances). These author-name instances have been mapped to 9,300,182 disambiguated authors. MAPAFFIL is a disambiguation of affiliations listed on MEDLINE papers. This dataset allows us to map each affiliation string to the city in which the affiliation is located. SCIMAGO is a publicly accessible dataset of annual journal impact scores (SCImago Journal & Country Rank [Portal]) and is freely accessible at https://scimagojr.com.
With the extracted publications, we can reconstruct the path of a given author over time, as witnessed by the affiliations on the papers the author publishes. In other words, we have a path for author i over several years, indicating the locations she passed through. It can happen, and does, that an author has multiple publications in the same year as well as multiple locations. Possible reasons for multiple locations are that the author had multiple affiliations, or that the publication took some time to appear and an earlier affiliation is listed. Here we define what a move is and how we extract it from the empirically observed publication sequences. To determine a move, and just as importantly a non-move, we determine the location of an author within a given time window before a year of interest t (i.e. the move year) and her location in the window after it.
More specifically, to transform a publication path into a single edge representing a move, we proceed as follows (see Fig. 1). We choose a candidate "move year" t of interest, which represents the year around which the decision to move happened. Next, we choose a number b of buffer years around t, defining two windows: before [t − b, t) and after [t, t + b). Given these two windows, we determine in which location any given author was before and after. If the locations differ, the author moved. Otherwise, she stayed.
To determine a unique starting position in the window [t − b, t), we choose the longest uninterrupted sequence of locations closest to t. Take, for example, the observed publication sequence illustrated in Fig. 1. Here we have the publication history {$B_{1998}$, $L_{1999}$, $L_{2001}$, $B_{2001}$, $B_{2002}$, $C_{2004}$, $C_{2006}$}, move year 2004 and a buffer of b = 5 years before and after. The uppercase letter indicates the city and the index the year. To determine the starting location, we take all publications in the interval [1999, 2004) and choose the location with the longest sequence closest to 2004. In this example, we observe 3 publications in B, but only 2 of these fall within the [1999, 2004) window, so we discard $B_{1998}$. We also observe 2 publications in L, one of them in the same year as a publication in B. According to the rule above, we choose B as the source since it is closest to 2004, even though both L and B have 2 observations. As the destination of the move we choose C since, in this case, it is the only observed location in the window [2004, 2009). We chose this method since it discards ambiguous affiliations in publication sequences with spurious affiliations (e.g. multiple affiliations in the same year where either of them appears only once).

Fig. 1 Creating the mobility network from MEDLINE publications. The scientific publications by a single author are illustrated as a sequence of green circles from top to bottom. Each publication has a time (in rows) and location (in columns) associated with it. We take a buffer time (i.e. 5 years) before and after a candidate move from Boston (B) to Chicago (C) in 2004. In this example, we identify Boston as the source, since it is the longest sequence within the window and closest to the end of the move year. Similarly, the destination is Chicago since it is the only observed city in the second window. Each move is tracked in a similar way and added to the mobility network by incrementing the edge weight accordingly.
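The windowing rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simplified reading of the rule in which the location with the most observations in a window is selected and ties are broken by proximity to the move year t. The function names `dominant_location` and `extract_move`, and the decision to drop windows that remain tied, are hypothetical choices made for illustration.

```python
from collections import defaultdict


def dominant_location(pubs, start, end, anchor):
    """Return the single location representing the window [start, end).

    pubs   -- iterable of (year, city) pairs for one author
    anchor -- the move year t; ties are broken by proximity to it
    Returns None if the author is not observed or the window is ambiguous.
    """
    years_by_city = defaultdict(list)
    for year, city in pubs:
        if start <= year < end:
            years_by_city[city].append(year)
    if not years_by_city:
        return None  # author not observed in this window

    def score(city):
        years = years_by_city[city]
        closest = min(abs(anchor - y) for y in years)
        # More observations rank first, then the smaller distance to t.
        return (len(years), -closest)

    ranked = sorted(years_by_city, key=score, reverse=True)
    # Assumption: a window whose two best candidates are still tied is dropped.
    if len(ranked) > 1 and score(ranked[0]) == score(ranked[1]):
        return None
    return ranked[0]


def extract_move(pubs, t, b):
    """Return (source, destination) around move year t, or None if undefined."""
    source = dominant_location(pubs, t - b, t, t)
    destination = dominant_location(pubs, t, t + b, t)
    if source is None or destination is None:
        return None
    return (source, destination)


# Publication history from the worked example (B, L and C are city labels).
history = [(1998, "B"), (1999, "L"), (2001, "L"), (2001, "B"),
           (2002, "B"), (2004, "C"), (2006, "C")]
move = extract_move(history, t=2004, b=5)
print(move)
```

A pair with identical source and destination corresponds to a non-move, so a caller would increment an edge weight only when the two differ.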
This definition allows us to carry out several robustness checks in generating the network. For example, we can increase the number of publications required in each location before and after to reduce the chance that a move was only temporary (e.g. visiting or double affiliations). The method requires precisely one source and one destination location, which jointly define a move. It would certainly be possible to include double affiliations, but then our definition of a "move" would no longer be unambiguous. For example, if an author has two locations before and two locations after, we would need a convention for treating this case, e.g. keeping all 4 possible links but assigning each a weight of 1/4 instead of 1. To reduce the number of assumptions and keep the method as simple as possible, we have opted for the proposed method, which yields only simple moves, i.e. one source and one target. Similarly, we can restrict the size of the windows, thus requiring that authors have fewer holes in their publication history; doing so, however, will drop any scientist not publishing at least once in each of the two periods. Note that a mobility network is a snapshot of aggregated author-level simple moves (i.e. one source and one target). Thus we consider only one move per author in each snapshot.
When analysing the impact of mobility on scientific output, we also rely on the SCIMAGO impact score of the journal in which each paper was published.

Global cities as hubs
Which cities are at the centre of the exchange of life scientists? How do different countries fare in this comparison? To answer these questions, we look at the topological centrality of cities in the international mobility network extracted as described before. Explicitly, we compute the PageRank centrality of cities in this weighted and directed network from 1998 to 2004. A bump plot, i.e., a plot showing the changes in ranking over time, for this measure is shown in Fig. 2.
The top 5 cities by centrality in the mobility network are Boston, New York, London, Paris and Bethesda, in that order. Except for Boston and New York overtaking London and Paris, the top 5 cities in the international mobility network did not change. Among the top 10 cities, there have been some changes in ranking, but overall the cities in this group have remained the same from 1998 to 2004. Note that 8 out of these top 10 locations are situated in the United States. The dominance of the US in the ranking suggests that the global mobility network is influenced in large part by US cities. However, looking at the top 40 cities, we see that the rest of the world is better represented, but that positions in the ranking change significantly over time. Among these cities, Beijing stands out by rising from lower ranks in 1998 to 11th place in 2004.
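To make the centrality measure concrete, the following is a minimal sketch of weighted PageRank computed by power iteration on a directed network of moves. The damping factor of 0.85, the convergence tolerance and the toy move counts are assumptions for illustration only; the text does not state which parameters or software were used.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Weighted PageRank by power iteration.

    edges -- dict mapping (source, destination) to a positive weight,
             here the number of scientists moving between two cities.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing moves is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical move counts between four cities (not taken from the data).
moves = {
    ("London", "Boston"): 30, ("Paris", "Boston"): 20,
    ("Boston", "New York"): 25, ("New York", "Boston"): 15,
    ("Boston", "London"): 10, ("Paris", "London"): 5,
}
ranks = pagerank(moves)
for city in sorted(ranks, key=ranks.get, reverse=True):
    print(city, round(ranks[city], 3))
```

Repeating the computation on the network of each year and ranking cities by score would yield the kind of year-by-year ranking that a bump plot displays.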

National border effects
Co-authorship networks have been found by Hoekman et al. (2010) and Chessa et al. (2013) to be influenced by national borders, resulting in collaborations being more likely within than across countries. In line with these findings, we test the hypothesis that countries have stronger within-country than cross-country mobility.

Fig. 3b Probability of leaving the country for selected countries and the global mean (1990 to 2004). Note: the "country" is the country from which the move originates, not necessarily the nationality of the author.

Figure 3a shows the pattern of cross-country mobility in 2004. Most scientists do not leave their country (as indicated by the main diagonal). Note also that certain countries have few exchanges with all other countries, as indicated by having only a few off-diagonal elements brighter than the rest. This means that while the network is dense (i.e. all major countries have at least one exchange), there are preferences. Note also that the probability of leaving the country has increased steadily year by year, as can be seen in Fig. 3b. The global probability of observing a move, i.e. that any given scientist moves abroad if we look at five years before and after, has never declined since 1990. The listed countries fall into two categories, below and above the global mean. The US, Japan and Italy are below the global average, indicating stronger within-country mobility. Moves originating from the US tend to be mostly within the US. The share of US-originating moves that leave the country rose from 5% in 1990 to 8.1% in 2004. However, compared to France (16.8%) and the global average (12%), it is low. Note that scientists based in the US do not leave the country as often as scientists in most other countries, but there is substantial domestic mobility.
The international mobility patterns in Fig. 3 suggest that international mobility varies by country and that there is more mobility within than across countries. The notion of "more within" and "less across" is made precise by the measure of modularity (Newman and Girvan 2004). At a high level, modularity is a quality score of how well a given partitioning of nodes (i.e. sets of cities) groups nodes that are well connected to each other but have few ties to members of other partitions. More specifically, modularity measures the fraction of links falling within a given partition minus the fraction of links we would expect in a random network (see Newman and Girvan (2004) for more detail). This null model thus represents a mobility network where scientists move without regard for geographic proximity or national borders. Conceptually, we carry out the analysis shown in Fig. 4, where we want to see whether the community structure in the topology of the mobility network (mobility layer) coincides with national borders (geography layer). We estimate the communities by maximizing the modularity of the partition following the Louvain algorithm (Blondel et al. 2008) as implemented by Traag (2017).

Fig. 4 To test that intercity mobility is marked by national borders we extract the communities from the mobility network (upper layer) and compare them with the geographical boundaries of the countries these cities are located in.

If the null hypothesis that scientists move without regard for national borders were correct, we should find that the community structure we obtain by maximizing the modularity does not coincide with any geographic or political boundaries. The spatial organization of the communities shown in Fig. 5, however, reveals that the communities are geographically clustered and respect national borders. A breakdown of countries as they fall within the various communities in 2004 is available in the Appendix (Table 4). For example, we find, especially in Europe, that national borders coincide with the spatial boundaries of the mobility communities. However, the picture changes when looking at North America. Here we also observe a national component in the form of Canada and Mexico being identified as separate communities. Within the US, however, the identified communities are less spatially segregated than in the rest of the world. Beyond the pure border effect, the community structure reveals some new patterns. We see that countries that share a language or, more generally, are culturally similar are more likely to fall within the same community. For example, three majority German-speaking countries, Germany, Austria and Switzerland, are identified as belonging to the same mobility community. Several former French colonies in North Africa are placed in the majority French community, suggesting that a scientific exchange persists. Belgium's cities, on the other hand, are split along the country's language divide (French, Dutch), mirroring findings on the same divide using mobile phone data (Sobolevsky et al. 2013). Even more striking is the placement of Spain and Portugal in different communities. The two countries share a border but not a language. However, as Fig. 5 shows, Portugal and Brazil have more exchange among themselves than Portugal has with Spain, even though one is across the ocean and the other a next-door neighbour.
Similarly, Spain and Mexico are placed in the same community; both countries share a colonial history and language, as do Portugal and Brazil. This result suggests that scientific mobility is influenced by language and possibly by cultural similarity.
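The modularity score itself can be evaluated for any given partition with a few lines of code. The sketch below uses the directed, weighted variant of modularity, Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j); whether the analysis used the directed or the undirected form is not stated in the text, so this choice is an assumption. The sketch only scores a partition and does not perform the Louvain optimization. City names and move counts are hypothetical.

```python
def directed_modularity(edges, community):
    """Modularity of a partition of a weighted, directed network.

    edges     -- dict mapping (source, destination) to a positive weight
    community -- dict mapping each node to its community label
    """
    m = sum(edges.values())
    out_strength, in_strength = {}, {}
    for (src, dst), w in edges.items():
        out_strength[src] = out_strength.get(src, 0.0) + w
        in_strength[dst] = in_strength.get(dst, 0.0) + w

    # Observed weight of moves that stay inside a community.
    within = sum(w for (src, dst), w in edges.items()
                 if community[src] == community[dst])

    # Weight expected inside communities if moves were rewired at random
    # while preserving each city's total outflow and inflow.
    expected = 0.0
    for i, k_out in out_strength.items():
        for j, k_in in in_strength.items():
            if community[i] == community[j]:
                expected += k_out * k_in / m
    return (within - expected) / m


# Hypothetical intercity moves: mostly domestic, few across the border.
moves = {
    ("Berlin", "Munich"): 10, ("Munich", "Berlin"): 8,
    ("Paris", "Lyon"): 9, ("Lyon", "Paris"): 7,
    ("Berlin", "Paris"): 2, ("Lyon", "Munich"): 1,
}
by_country = {"Berlin": "DE", "Munich": "DE", "Paris": "FR", "Lyon": "FR"}
mixed = {"Berlin": "x", "Paris": "x", "Munich": "y", "Lyon": "y"}

q_country = directed_modularity(moves, by_country)
q_mixed = directed_modularity(moves, mixed)
print(round(q_country, 3), round(q_mixed, 3))
```

In this toy network the partition along national borders scores higher than the partition that mixes cities from both countries, which is the pattern the border hypothesis predicts.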
We should note that community detection through modularity maximization may fail to separate communities which are "too small" due to the method's "resolution limit" (Fortunato and Barthelemy 2007). Ground truth communities, which are not of comparable size to the identified communities, may be lumped together with larger communities or split up. In practice, this could mean that we have lumped "small" communities together, which probably should be kept separate, for example, Greece, Cyprus and Jordan are placed in the same community. While Greece and Cyprus share a language, the inclusion of Jordan in this community is most likely because Jordan has had an exchange with the other two but was "erroneously" placed in the same community.

National gains
The ability of a country to be at the forefront of research and innovation is in part determined by its ability to attract bright and talented scientists, in addition to retaining the highly trained individuals already working for national institutions. For this reason, high-skill labour mobility and brain circulation are a significant concern at the country level. To identify the primary beneficiaries of international mobility, we estimate the contribution of the mobile scientist population to the growth of national scientific output. By doing so, we can compare scientific output across countries and identify the primary direct beneficiaries of international mobility.
Before we can discuss the impact of mobility on scientific output, we need to define how to measure scientific output. The number of publications alone is not a good proxy for scientific relevance. We therefore rely on the impact score of the journal in which a paper was published. Specifically, we use the number of citations per document in the preceding two years. We obtain this information from SCIMAGO. To compute the scientific production of a given location for a given period, we take the papers listing that city among the affiliations of any of the authors and obtain the impact factor of the journal in that year. If the authors work in different cities, we apportion this score equally among them. For example, consider a paper published in a journal with an impact of 6 and written by 3 authors, 2 of whom reside in city A and 1 in city B. We would then add 6/3 * 2 = 4 to the running total of city A and 6/3 * 1 = 2 to that of city B. For a given year y and period, i.e., before [y − 5, y) and after [y, y + 5), the total scientific output of a city is the sum of the fractional impacts associated with that city. At the country level, these impact scores are summed up.
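The fractional apportioning of journal impact can be sketched as follows. The data layout (a list of papers, each with a journal impact score and one city per author) and the function names `city_output` and `country_output` are assumptions made for illustration.

```python
from collections import defaultdict


def city_output(papers):
    """Sum fractional journal impact per city.

    papers -- iterable of (journal_impact, author_cities), where
              author_cities lists one city per author of the paper.
    """
    totals = defaultdict(float)
    for impact, author_cities in papers:
        if not author_cities:
            continue  # no geo-referenced affiliation, nothing to apportion
        share = impact / len(author_cities)  # equal share per author
        for city in author_cities:
            totals[city] += share
    return dict(totals)


def country_output(city_totals, country_of):
    """Aggregate city-level output to the country level."""
    totals = defaultdict(float)
    for city, value in city_totals.items():
        totals[country_of[city]] += value
    return dict(totals)


# The worked example from the text: impact 6, two authors in A, one in B.
papers = [(6.0, ["A", "A", "B"])]
per_city = city_output(papers)
print(per_city)
```

Applying `city_output` separately to the papers of the windows [y − 5, y) and [y, y + 5) gives the before and after totals used in the growth calculation.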
The total scientific output produced within a country can be accounted for in the following way. We distinguish the output produced by authors staying in their city (S), moving domestically (D), coming in from abroad (I) and leaving the country (L). The total output within a country for a given time period before, $A_0$, and after, $A_1$, is given by $A_0 \equiv S_0 + D_0 + L_0$ and $A_1 \equiv S_1 + D_1 + I_1$ respectively. Note that $A_0$ contains the production $L_0$ of those individuals who will leave the country in the second period, and $A_1$ the production $I_1$ of those who come in during the second period. Based on this breakdown, we can define indicators identifying the growth due to the four mobility types, i.e., S, D, I and L. Specifically, growth is the output after, $O_1$, net of the output before, $O_0$, divided by the output before, i.e., $g = (O_1 - O_0)/O_0$. The indicators are defined in detail in Table 1: $g_A$ is the overall growth of the country, $g_S$ the growth due to stationary scientists, $g_D$ the growth due to domestically mobile scientists and, most relevant for the brain circulation discussion, $g_I$ the gain due to international turnover. Moreover, the growth of a country $g_A$ can be expressed as a weighted sum of the individual growth rates $g_S$, $g_D$ and $g_I$ with weights $w_S$, $w_D$ and $w_I$ respectively (Eq. 1):

$$g_A = w_S\, g_S + w_D\, g_D + w_I\, g_I \tag{1}$$
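The weights are not spelled out in this section, so the following derivation is only a sketch. It assumes that the component growth rates in Table 1 take the form $g_S = (S_1 - S_0)/S_0$, $g_D = (D_1 - D_0)/D_0$ and $g_I = (I_1 - L_0)/L_0$, i.e. that international turnover compares the output of incoming scientists with that of leaving scientists. Under this assumption the weighted-sum decomposition of $g_A$ follows directly from the definitions of $A_0$ and $A_1$:

```latex
\begin{align}
g_A &= \frac{A_1 - A_0}{A_0}
     = \frac{(S_1 - S_0) + (D_1 - D_0) + (I_1 - L_0)}{A_0} \\
    &= \frac{S_0}{A_0}\,\frac{S_1 - S_0}{S_0}
     + \frac{D_0}{A_0}\,\frac{D_1 - D_0}{D_0}
     + \frac{L_0}{A_0}\,\frac{I_1 - L_0}{L_0} \\
    &= w_S\, g_S + w_D\, g_D + w_I\, g_I ,
\qquad
w_S = \frac{S_0}{A_0},\quad
w_D = \frac{D_0}{A_0},\quad
w_I = \frac{L_0}{A_0}.
\end{align}
```

With these assumed definitions the weights are the shares of each group in the initial output and sum to one, since $S_0 + D_0 + L_0 = A_0$.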
We also report these weights to understand the importance of the contribution of each growth component. Additionally, to indicate generational turnover, we report the mean age difference between incoming and leaving scientists (age). The results for the largest countries in the dataset for the intervals before [1999, 2004) and after [2004, 2009) are reported in Table 2. We find that the US has increased its overall output by 14%, but that the gain due to international exchange was as high as 61%. While overall only 6% of the scientific production comes from these scientists, it is clear that their contribution is a net benefit to the US research system. Similarly, the contribution to national scientific output is positive for most countries, with several exceptions such as Argentina, India and Israel, to name a few. In this comparison, China stands out, with an astonishing growth rate of 141% overall and 158% due to international exchanges. Overall, this result suggests that China is a prime example of a country gaining by participating in the international exchange of scientists. If we look at the age value in Table 2, we also note that the scientists moving to the US are younger than the scientists leaving, on average by 1.2 years. On the other hand, and in line with the idea that the increasing scientific output of China is due to returnees, we find that scientists moving to China are on average 1.16 years older. Given these two statistics, we argue that the US can "rejuvenate" its scientific labour force and that China can entice more senior expatriates to move back. As for the proportion of scientific output due to international mobility, i.e., $w_I$, we see that for several countries it is a significant proportion. For example, in the scientific output growth of China, India, Argentina, Russia and Switzerland, international turnover has a weight of more than 20% ($w_I > 0.2$), suggesting a strong exposure to international mobility.
From these statistics, a picture emerges that certain countries are more exposed than others, and the gain is not unequivocally positive. Note that this is only a direct measure of scientific output and disregards any other benefits to a country from participating in the international exchange of scientists such as diasporas and sustained international collaborations (Saxenian 2005;Agrawal et al. 2006;Agrawal et al. 2011). These measures do, however, offer a first-level approximation of the primary beneficiaries of international scientific mobility.

Discussion
The mobility of scientists and its impact on scientific production are still poorly understood, mainly due to the lack of harmonized international statistics. We contribute to filling this gap by providing a description of scientist mobility at the city and country level. The mobility network highlights the existence of a highly connected core of cities. However, the reach and mobility of scientists are constrained by geographical factors. These hubs are predominantly found in the US, which may benefit from high levels of domestic mobility and from low cultural and linguistic barriers to international mobility. This hypothesis is supported by the community structure of the mobility network within the US, which, unlike in many other countries, shows no clear spatial segregation. Moreover, we find that not all countries benefit equally from international exchange. In most cases, the direct benefit, without considering any indirect benefits such as international collaborations and diasporas, is ambiguous. In other words, gains in national scientific output, as highlighted by the output growth due to international turnover, do not provide a clear signal that international exchange is unequivocally beneficial to all participants. The list of countries negatively affected by international mobility is not limited to less developed and emerging countries, such as India, but also includes scientific powerhouses such as Japan and Sweden, for which the balance of international scientist turnover is negative. Among the listed countries, China stands out by having the most substantial overall growth by far (141%), while also benefiting from international exchange. As argued above, this is in part explained by the fact that, on average, incoming scientists are more senior than leaving scientists and that these leaving scientists are less prolific than the incoming scientists.
In other words, China benefited exceptionally from international mobility by sending scientists abroad and attracting them back. With the returning scientists, China has been able to build up its local innovation system substantially. These observations highlight the importance of investigating international mobility not purely as "brain drain" but rather as "brain circulation", as argued by Agrawal et al. (2006) and Saxenian (2005). The analysis of the community structure implied by the mobility network reveals that national borders do affect the mobility patterns we observe, a finding which mirrors the observed tendency to form collaborations within national borders (Cerina et al. 2014; Hoekman et al. 2010). Further analysis of the determinants of which scientists move abroad, and of the potential benefit of this turnover, can inform policies to enhance international collaborations, such as the European Research Area (ERA) initiative. This is an exploratory investigation of the international and intercity mobility network and provides only a high-level overview of the phenomenon. The macro patterns we have identified, however, raise several questions and possible future research directions regarding, e.g., the push and pull forces driving mobility and how emerging and developed economies have fared during the globalization and integration of research. The primary take-home message from this analysis is that scientist mobility has a strong spatial component and that individual-level career trajectories offer a window into this phenomenon.
In summary, this study makes three contributions. First, it introduces an approach to extract mobility networks from bibliographic data and augments it with quality indicators. Second, it characterizes the international flows of scientists, highlighting the importance of national barriers. Third, it quantifies the gains from mobility to countries. This study has several strengths. We reconstruct intercity mobility networks for specific time intervals, making them potentially useful for evaluating the impact of research policies. The dataset has extensive coverage of life scientists spanning multiple countries, career stages and productivity levels (i.e. not only star scientists). However, it also has some limitations. First, because we use PUBMED, there are linguistic and field biases: the data contain mostly English-language journals in the life sciences. More importantly, however, we lack biographical details, such as nationality and ethnicity, which would better inform migration policies. Our contribution, which is based on open access data, is meant to stimulate follow-on research to investigate the determinants of mobility, including the impact of reverse brain drain and of research policies to attract the best talent. Future research can shed more light on the role of collaboration networks, both as drivers of mobility and as vectors of knowledge diffusion. Moreover, more work is needed to understand the attractiveness of global cities and their role in the global knowledge economy. In this work, we have limited ourselves to a descriptive analysis of the mobility network, omitting causality claims. However, the richness of the dataset makes it potentially useful for determining causal relocation factors. The global nature and good temporal coverage mean that several natural experiments can be identified, which can help to isolate the determinants of mobility.
An example of this is the estimation of the impact of stem cell legislation in the US on stem cell scientist mobility (US states offer various degrees of support). Similarly, the effect of regional projects (e.g. opening a new research campus) aiming to improve scientific output or innovation can be quantitatively analyzed as part of the national or international research system. This dataset, in conjunction with natural language processing techniques and text mining, can also be used to follow the mobility and diffusion of new ideas and concepts in the life sciences. By estimating the relative importance of mobility and collaboration, research policies optimizing diffusion could be devised. Moreover, by exploiting the available information on collaborations and the data on mobility, we can estimate the propensity of scientists to remain in contact with their home country and city. This approach could be used to replicate findings on the benefits of international mobility and their spillovers, which go deeper into the issue of brain drain and circulation.