Computational Socioeconomics

05/15/2019
by   Jian Gao, et al.
University of Fribourg

Uncovering the structure of socioeconomic systems and estimating socioeconomic status in a timely manner are significant for economic development. Understanding socioeconomic processes provides the foundations to quantify global economic development, to map regional industrial structure, and to infer individual socioeconomic status. In this review, we present a brief manifesto for a new interdisciplinary research field named Computational Socioeconomics, followed by a detailed introduction to data resources, computational tools, data-driven methods, theoretical models and novel applications at multiple resolutions, including the quantification of global economic inequality and complexity, the mapping of regional industrial structure and urban perception, the estimation of individual socioeconomic status and demographics, and the real-time monitoring of emergent events. This review, together with the pioneering works we have highlighted, will draw increasing interdisciplinary attention and induce a methodological shift in future socioeconomic studies.

1 Introduction

Many branches of science have experienced a paradigm shift from qualitative to quantitative studies. Even physical science, the most representative of the quantitative sciences, went through a long period of qualitative exploration in its early stages. For example, more than two thousand years ago, Aristotle proposed the famous four elements theory, which claims that the four classical elements, namely earth, water, air and fire, are the material basis of the physical world. At almost the same time, some ancient Chinese philosophers proposed the Wu Xing theory (i.e., the Chinese five elements theory), a fivefold conceptual scheme that uses the proportions and movements of the five elements (i.e., metal, wood, water, fire and earth) to explain a wide array of phenomena, from cosmic cycles to the validity of a dynasty. For about two thousand years, the ancient Greek system, contributed by Aristotle and some others, represented the most advanced understanding of the world, and it is indeed one of the most influential theories in human history. Not until the end of the Middle Ages, thanks to quantitative analyses and experimental verification, were these ancient theories, such as Aristotle’s four elements theory and kinetic theory, progressively replaced by modern scientific theories such as atomic theory and Newton’s laws.

In contrast to physical science, which concentrates on the study of matter and its motion through space and time, social science investigates social structure based on the activities of and relations between human beings, and includes sociology, economics, politics, linguistics, jurisprudence, and many other branches. The path from qualitative to quantitative studies is more difficult for social science than for physical science. On the one hand, the objects studied in social science are much more complex than those studied in physical science. An individual person is one of the most important units of study in social science, playing a role analogous to that of an atom in physical science Ball2004 . However, human behaviors exhibit heterogeneity and burstiness: different people have very different behavioral patterns, and even the same person behaves very differently in different places and at different times Barabasi2010 . Therefore, apart from some success in analyzing the flow of human crowds Hughes2003 ; Helbing2010 , treating human beings as atoms will kill many interesting social phenomena. Some other objects of study, such as policies and legal provisions, are inherently difficult to characterize numerically. On the other hand, social science inevitably suffers from uncertainty and incompleteness. The factors affecting social development are countless, and thus no theory, however comprehensive it seems, can include all relevant factors and be self-contained. In addition, every single factor is unstable and not independent, being affected by other factors and the external environment. This intrinsic complexity makes it infeasible to quantitatively test and verify any social theory through controlled, repeatable experiments in a closed environment, while such experimental verification is indeed the methodological cornerstone that pushes forward physical science and other branches of natural science Popper2005 . At the same time, social science is not good at quantitative predictions of the future, with many predictions from experts and complicated theories being no better than wild guesses Silver2012 . Unfortunately, such incorrect predictions cannot overturn the corresponding social theories (very different from physical science), since the mistakes are attributed to unknown or undetected factors or emergent events Taleb2007 , instead of flaws in the theories themselves.

Up to now, along with the development of quantitative methods, social science has successfully learned how to be wise after the event. That is to say, we can always find some theoretical models (possibly together with some cosmetic changes) to provide qualitatively correct or even quantitatively accurate explanations after the event. However, these theories are usually powerless in predicting the future. Confronted with such straits, social scientists should not turn back to qualitative description, but insist on quantitative explanation and prediction, and evaluate the validity of a theory based on its explanatory power and its prediction accuracy before the event. In fact, social science has recently shown an increasingly high level of quantification and has become increasingly dependent on real data Lazer2009 ; Shah2015 . However, the traditional ways to obtain real data have many limitations. For example, survey data from questionnaires and self-reports usually contain a small number of samples and suffer from social desirability bias (i.e., subjects tend to give socially acceptable answers, instead of the real facts) Fisher1993 . Larger-scale and more precise data, such as data from an economic census, usually consume huge resources and lack timeliness. In many poor countries and regions, a population-scale economic census is not feasible. Fortunately, thanks to the digital wave that is sweeping across the whole world Mayer2013 , social scientists have an unprecedented opportunity to develop a quantitative methodology. Indeed, for the first time in history, data on the processes of social and economic development, as well as data on human activities, are being recorded by more and more sensing devices, online platforms and other data acquisition terminals. However, these data are not well structured and differ from the data normally handled in social science. Typical examples include satellite remote sensing data, mobile phone data, social media data, and so on. On the one hand, understanding and analyzing these data requires advanced techniques in data mining and machine learning, which is a considerable challenge for traditional social scientists. On the other hand, these data are larger in size, nearly real-time and of higher resolution, which can reduce the sparsity and bias of small-size data and shrink the invisible parts of the development process (e.g., two consecutive censuses are usually a few years apart, and the changes in between are not visible). Therefore, based on these large-scale novel data, we can in principle make great progress in perceiving socioeconomic situations, evaluating and amending known theories, inspiring and creating new theories, detecting abnormal events, predicting future trends, and so on.

The above-mentioned challenges and corresponding attempts have led to the emergence of a new scientific branch, which studies various phenomena in socioeconomic development by using quantitative methods that are based on large-scale real data, with particular attention to the economic development problems related to social processes and the social problems related to economic development. We name it Computational Socioeconomics; it is immature, but forward-looking and burgeoning. Computational socioeconomics can be considered as a new branch of socioeconomics resulting from a transformation of methodology, or as a new branch of computational social science that emphasizes socioeconomic problems.

In the above definition, three keywords are worth paying close attention to. The first one is “quantitative methods”, which emphasizes the usage of numerical values, rather than qualitative description, in characterizing problems and presenting results. In the 5th century BC, the ancient Greek doctor Hippocrates (who is often referred to as the “Father of Medicine”) proposed the four temperaments theory, which suggests that there are four fundamental personality types: sanguine, choleric, melancholic, and phlegmatic, and that the personality type of an individual is determined by the excess or lack of four body fluids: blood, yellow bile, black bile, and phlegm Merenda1987 . Such a qualitative theory, analogous to the four elements theory in its impact on physical science, ruled social psychology (in particular the studies on personality) for more than two thousand years. Despite some reasonable ingredients, the four temperaments theory stayed at the level of qualitative description, and thus failed to accumulate scientifically solid achievements over its long development. Only after modern psychologists obtained quantitative evaluations of the Big Five personality traits via standard scales did personality analysis become an important research domain that plays central roles in many issues of social psychology Gosling2003 . Such examples show the importance and necessity of developing quantitative methods. The second one is “real data”, which emphasizes that any theoretical model should respect real data and use its explanatory power and prediction accuracy on real data as the evaluation criteria for its validity. Economics shows a high level of quantification, with most theoretical models being precisely described by a group of elegant equations. Accordingly, given the values of the necessary parameters, many targeted economic variables are calculable. However, the majority of economic theories have cocooned themselves in a quantitative fantasyland consisting of ideal assumptions while largely ignoring real data. This eventually makes the classical economic theories beautiful rather than practical. In the short term, they cannot predict an upcoming economic crisis Battiston2016 (but they can always find graceful and reasonable theoretical explanations after the crisis Reinhart2009 ). In the long term, they have failed to provide effective strategies for economic development for more than a hundred developing countries around the world Lin2011 . The third one is “large-scale”, which emphasizes the importance of population-scale data (i.e., data that can directly reflect the entire population under study, instead of a small sample). A very small data set may not only bring statistical bias, but also result in completely wrong conclusions. For example, a theory widely accepted by the academic community, which has also been validated by various experiments on small-scale social networks, is that the interaction strength between two connected individuals (which can be measured by the frequency and duration of mobile communication or the number of comments, replies and mentions on a social platform, and so on) decays as the range of their link increases (the range of a link is defined as the shortest distance between its two endpoints after the removal of this link, and a large link range indicates that the two corresponding endpoints are located in two distant communities with few overlapping nodes) Granovetter1973 ; Onnela2007 . A very recent experiment on 11 population-scale social networks, however, shows that the interaction strengths through very long-range links are not weaker than those through short-range links Park2018 , which fundamentally challenges our traditional understanding of social network organization.
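The link range defined above (the shortest distance between the two endpoints of a link once that link is removed) can be computed with a breadth-first search. The following is a minimal sketch on a hypothetical toy graph; the function name `link_range` and the adjacency-dictionary representation are illustrative assumptions, not taken from the cited studies.

```python
from collections import deque

def link_range(adj, u, v):
    """Shortest-path distance between u and v after removing the link (u, v).

    adj maps each node to the set of its neighbours (undirected graph).
    Returns float('inf') when no alternative path exists.
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if {node, nxt} == {u, v}:
                continue  # pretend the link under study has been removed
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == v:
                    return dist[nxt]
                queue.append(nxt)
    return float("inf")

# Hypothetical toy network: a triangle (a, b, c) with a tail b - d - e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}

range_ab = link_range(adj, "a", "b")  # endpoints share the neighbour c
range_bd = link_range(adj, "b", "d")  # a bridge: no alternative path
```

A link inside a tightly knit group has a small range, whereas a bridge between otherwise disconnected parts has an infinite range under this definition.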

In comparison with routine methods in social science, the increasing diversity and volume of data lead to methodological changes in two aspects. Firstly, simple statistical tools are not suitable for analyzing unstructured data, such as remote sensing images, street views, social networks, textual content, and so on. Therefore, researchers are badly in need of artificial intelligence, in particular advanced techniques of data mining and machine learning, such as deep learning Lecun2015 . Secondly, with population-scale data, sampling is no longer a necessary method to estimate the statistical properties of the whole population. Instead, one can concentrate on a small-size subset sampled from the original data in hand, and add new dimensions of data. These new data dimensions are usually of high value, and they can be obtained in traditional ways such as manual labelling and questionnaire surveys. Using such a small sample as the training data, one can learn a model to infer the new dimensions of data from the original dimensions. Applying such a model to the whole data set, the new dimensions of data can in principle be obtained for all individuals appearing in the original data set. Such a method integrates some routine methods like sampling, labelling and surveying, while being much more powerful. For example, it is relatively easy to obtain population-scale data on mobile communication and mobility (all can be obtained from mobile phones). In contrast, it is very hard to know the household income of every family, since a poor country cannot support a population-scale economic census and such data are usually treated as official secrets that are not open to the public or to research institutions. Under these circumstances, we can obtain the household incomes of a certain number of families (what we need is just a tiny fraction of all families) via routine questionnaires. This much smaller data set can be used as training data, based on which we can apply machine learning techniques to build a model that predicts the household income of a family from the mobile phone data of the family members Blumenstock2016 . Although the inferred data are not perfect, they can be very close to the real data under a well-designed algorithm. Notice that a significant advantage is that the high-value data for almost every individual can be obtained at a very low cost. As shown in Figure 1, combining the accessible population-scale data, a small sample of high-value but hard-to-get data, and a properly selected or well-designed algorithm to infer the high-value data for individuals outside the sample is a novel and representative method in computational socioeconomics, showing the deep integration of social science and computer science methods.
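The workflow described above (survey a small sample, train a model on it, then infer the high-value variable for everyone) can be sketched as follows. This is only an illustration under strong assumptions: the data are synthetic, there is a single hypothetical phone-derived feature, and ordinary least squares stands in for the machine learning models used in the cited work.

```python
import random

random.seed(0)

# Synthetic population: one phone-derived feature per household (hypothetical).
population = [random.gauss(0.0, 1.0) for _ in range(10_000)]
# Hidden "true" income; in practice it is unknown outside the surveyed sample.
true_income = [2.0 + 1.5 * x + random.gauss(0.0, 0.1) for x in population]

# Step 1: survey a small random sample (2% of all households).
sample_ids = random.sample(range(len(population)), 200)
xs = [population[i] for i in sample_ids]
ys = [true_income[i] for i in sample_ids]

# Step 2: fit y = a + b * x by ordinary least squares on the sample only.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# Step 3: infer income for every household in the population-scale data.
predicted = [a + b * x for x in population]

# Evaluation is possible here only because the data are synthetic.
mean_t = sum(true_income) / len(true_income)
ss_res = sum((t - p) ** 2 for t, p in zip(true_income, predicted))
ss_tot = sum((t - mean_t) ** 2 for t in true_income)
r2 = 1.0 - ss_res / ss_tot
```

With real data, the relationship is rarely this clean, so the quality of the inferred values should be checked on surveyed households that were held out from training.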

Figure 1: Illustration of the relationship between the entire population, the easily accessible population-scale data and the small-size high-value data. The small-size data set contains some high-value data dimensions that do not appear in the original population-scale data.

In the long term, no matter whether computational socioeconomics becomes a mature branch of science with distinct borderlines or is completely integrated into the framework of traditional social science, the above-mentioned novel perspective and methodology, driven by big data and artificial intelligence, will become the mainstream in the future and change the landscape of scientific research in a profound and irrevocable way. Inspired by this positive judgment, we decided to present this review article. In addition, there are three technical reasons for us to write this review. Firstly, computational socioeconomics is an emerging research domain with research findings published in disparate journals and conference proceedings across many disciplines. Therefore, it is necessary to collect these results together. Secondly, we would like to sort and classify representative results according to the objects and data sets under study, so that it is easy for readers to see the landscapes of both methods and achievements. A proper taxonomy can largely reduce the difficulty of mastering the related knowledge and methods. Although the presented one is built just according to the current progress of this field, it will evolve into a more systematic and reasonable one along with further studies. Thirdly, in the nascent stage of computational socioeconomics, different research articles have used different expressions to describe essentially the same problems and methods, and thus it is valuable to unify the problem description and the symbolic system. In a word, we hope this review will become a handbook for researchers who are willing to contribute to the development of computational socioeconomics. Furthermore, the paradigm shift in methodology, as presented in this review, is relevant not only to socioeconomics, but also to most branches of social science and to many other qualitative disciplines beyond social science.

The remainder of this review article is organized as follows. The second section will discuss some important problems at the macroscopic scale, such as world economic development, the competitive powers of countries, the inequality problem, and so on. The third section will mainly concentrate on the urban scale and introduce some novel ways to solve problems related to regional economic development, such as how to precisely perceive regional socioeconomic status and how to choose suitable development paths and strategies for a city. The fourth section will focus on the individual level, discussing how to make use of unobtrusive data to estimate individual socioeconomic status, including income, employment situation, and even health condition. The fifth section will go a little beyond the scope of computational socioeconomics and discuss how the data frequently used in the previous sections can be utilized to benefit emergency management and disaster assistance. We cover this issue because emergency management is an increasingly important social problem and the reported methods are consistent with the methods introduced in the previous sections. In this review, many different data resources play important roles, among which the following three are the most important: remote sensing satellites, mobile phones and social media platforms. For each of the three data resources, there are certain representative analytics tools and methods, so it is very effective to sort the results according to the data resources and corresponding methods. Finally, in the last section, we will summarize representative progress, explore the development trends of computational socioeconomics, discuss the challenges and opportunities in this emerging field, and outline some potentially interesting and significant open issues.

2 Global development, inequality and complexity

2.1 World development and poverty mapping

Revealing the status of economic development is one of the long-standing problems in socioeconomics Kuznets1955 ; Gao2016 . Recently, data with improved quantity and quality have been used to map nations’ economic characteristics such as poverty, which comes with economic development and is a major cause of societal instability. Based on an international poverty line at USD 1.25 a day in 2008 Ravallion2009 , 1.2 billion people (21%) lived in poverty in 2012 Ravallion2013 . Reducing poverty is thus a key target of the Millennium Development Goals (MDGs). To approach this goal, the first step is to accurately map the spatial distribution of poverty Hulme2007 . New data and tools have been utilized to better reveal, explain and predict global poverty and economic inequality. In this section, we will briefly introduce studies that map poverty from satellite imagery, infer socioeconomic status from mobile phone (MP) data and fight against poverty with combined data.

2.1.1 Remote sensing observes poverty

Remote sensing (RS) is the acquisition of information by using sensor technologies to detect objects on earth, and it was originally used in earth science disciplines Paul1981 . In recent years, high-resolution RS data, for example nighttime lights (NTLs) satellite imagery, have been used to supply information about economic activity, especially in developing countries where traditional economic census data are insufficient Ghosh2013 . With a great potential for recording the presence of humanity on the surface of the earth, NTLs data can provide an unambiguous indication of the spatial distribution of economic development. Indeed, NTLs have been found to be a powerful predictor of ambient population density and economic activity. Nightsat Elvidge2007 is a concept for a satellite system capable of global observation, which can capture the location and density of lighted infrastructure within human settlements.

One of the pioneering works, by Elvidge et al. Elvidge1997 , suggested that NTLs data can be used as a proxy for socioeconomic development in developing countries. Lighted area has a high correlation with the gross domestic product (GDP) and electric power consumption; in particular, it is strongly correlated with GDP for 21 countries. Later, by combining lighted area with ancillary statistical information of a city, Doll et al. Doll2000 investigated the potential of NTLs data for quantitative estimation of global socioeconomic parameters. They found that the country-level total lighted area exhibits significantly high correlations with GDP and carbon dioxide (CO2) emissions. Sutton and Costanza Sutton2002 estimated the amount of light energy (LE) from satellite images with global coverage at a high spatial resolution. They found that LE is correlated with GDP at the national level and can serve as a more accurate indicator of economic activity, because LE is more spatially explicit and can be directly observed and easily updated almost in real time.

Together with census and survey data, NTLs data have been applied in mapping poverty. Ebener et al. Ebener2005 applied regression methods using NTLs imagery to model the distribution of wealth within 171 countries at the national level and 26 countries at the subnational level. They showed that NTLs data is correlated with GDP per capita and other socioeconomic indicators. Noor et al. Noor2008 computed asset-based poverty by applying the principal component analysis (PCA) to NTLs data and household survey data of 37 countries in Africa. They found that the mean brightness of NTLs data can offer a robust and inexpensive alternative to asset-based poverty indices derived from survey data, suggesting that it is possible to explore and track economic inequity at subnational levels by leveraging NTLs data. For Uganda, Rogers et al. Rogers2006 presented a discriminant analysis model to predict poverty after combining satellite imagery and household survey data. They estimated the poverty index by the likelihood of each pixel falling within a specified poverty class. They found that external and independent data have descriptive power for poverty mapping. These novel data sources are likely to outperform socioeconomic datasets that are internally correlated and exploited by small area methods.
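Asset-based poverty indices of the kind mentioned above are commonly built by taking the first principal component of household asset indicators. The sketch below illustrates that idea with standard-library Python only; the six households, the four asset columns and the use of power iteration are illustrative assumptions, and it is not a reproduction of the procedure in Noor2008.

```python
import math

# Hypothetical survey: rows are households, columns are asset indicators
# (radio, TV, bicycle, electricity); 1 means the household owns the asset.
households = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
]

n, m = len(households), len(households[0])
means = [sum(row[j] for row in households) / n for j in range(m)]
centered = [[row[j] - means[j] for j in range(m)] for row in households]

# Sample covariance matrix of the asset indicators.
cov = [
    [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
    for i in range(m)
]

# Power iteration for the leading eigenvector (first principal component).
loadings = [1.0] * m
for _ in range(500):
    nxt = [sum(cov[i][j] * loadings[j] for j in range(m)) for i in range(m)]
    norm = math.sqrt(sum(v * v for v in nxt))
    loadings = [v / norm for v in nxt]

# Wealth score: projection of each centered household onto the component.
scores = [sum(l * v for l, v in zip(loadings, row)) for row in centered]
```

In this toy example every asset receives a positive loading, so the household owning all four assets gets the highest score and the household owning none gets the lowest.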

A spatially disaggregated global poverty map of 233 countries was produced by Elvidge et al. Elvidge2009 . The poverty levels were estimated by dividing the LandScan population count data Dobson2000 by the brightness from NTLs data (see Figure 2). The produced poverty indices correlate very strongly with other widely accepted measures, suggesting that satellite imagery can enhance the knowledge of socioeconomic conditions around the world at a fine spatial resolution. Later, Ghosh et al. Ghosh2010 proposed a model to estimate the world-wide economic activity. In their model, a grid of nonagricultural economic activity was created according to the NTLs, while a grid of agricultural activity was created according to the LandScan population grid. Then, by integrating the two grids, a disaggregated map of total economic activity was produced, which can provide an alternative means for measuring global economic activity and predicting future socioeconomic trends.
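The poverty index described above divides a gridded population count by the NTL brightness of the same cell. A minimal sketch on a hypothetical 2 x 2 grid follows; the `dark_floor` parameter, which avoids division by zero in unlit cells, is an assumption of this example and not necessarily how Elvidge2009 treats such cells.

```python
def poverty_index(population, brightness, dark_floor=1.0):
    """Cell-by-cell ratio of population count to nighttime brightness.

    dark_floor is a hypothetical lower bound on brightness so that
    populated but unlit cells do not cause a division by zero.
    """
    return [
        [pop / max(light, dark_floor) for pop, light in zip(pop_row, light_row)]
        for pop_row, light_row in zip(population, brightness)
    ]

# Hypothetical grids: people per cell and brightness per cell.
population = [[1000, 200], [50, 800]]
brightness = [[50, 2], [0, 10]]

index = poverty_index(population, brightness)
```

A high value means many people share little light, which the index interprets as a sign of poverty.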

Figure 2: Percentage of population in poverty for subnational administrative units. The world poverty map was estimated based on satellite data-derived poverty index. Figure from Elvidge2009 .

To better estimate true income growth from NTLs, Henderson et al. Henderson2012 developed a statistical framework to estimate two parameters. One is a coefficient that maps NTLs growth into a proxy for GDP growth, and the other is an optimal weight to combine this proxy with national account data. After applying the method to countries with very low-quality national account data, Henderson et al. Henderson2012 demonstrated the key role of NTLs data in analyzing growth at the subnational and supranational levels. With the NTLs data, income data and population data of 748 regions across 54 countries in Africa, Mveyange Mveyange2015 estimated regional income inequality by calculating two standard measures of inequality, the Gini index and the mean log deviation (MLD) measure Khandker2009 . After presenting the empirical model, the author showed that the estimated inequality index has significant and positive correlations with income-based regional inequality indicators, suggesting that NTLs are good proxies for estimating regional inequality. These results are especially meaningful when reliable and consistent subnational income data are lacking.
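The two inequality measures named above have standard textbook definitions: the Gini index is the mean absolute difference between all pairs of values divided by twice the mean, and the MLD is the average of $\ln(\bar{x}/x_i)$. The sketch below computes both for a list of positive values; applying them to light per capita rather than income, and the sample numbers, are assumptions of this illustration.

```python
import math

def gini(values):
    """Gini index via the sorted-rank formula (0 means perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def mean_log_deviation(values):
    """MLD: mean of ln(mean / x_i); requires strictly positive values."""
    n = len(values)
    mean = sum(values) / n
    return sum(math.log(mean / x) for x in values) / n

# Hypothetical light-per-capita values for the sub-areas of one region.
light_per_capita = [1.0, 2.0, 3.0, 10.0]

region_gini = gini(light_per_capita)
region_mld = mean_log_deviation(light_per_capita)
```

Both measures are zero when every unit has the same value and grow as the distribution becomes more unequal.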

Cauwels et al. Cauwels2013 explored the dynamics and spatial distribution of global NTLs for 160 different countries. They found that the center of light moves eastwards about 60 km per year, and there is a tendency of global centralization of light. After introducing spatial light Gini coefficients, they found a universal pattern of human settlements across different countries. Ghosh et al. Ghosh2013 summarized literature that leveraged NTLs to develop a variety of alternative measures of human well-being. They introduced the application of NTLs to estimate various human well-being indicators (e.g., GDP, poverty, informal economic activity and remittances), develop the night light development index (NLDI), map the human ecological footprint, measure the electrification rates and estimate the ICT Development Index (IDI). Recently, Bennett et al. Bennett2017 summarized the methods to correlate NTLs with socioeconomic parameters including urbanization, economic activity and population. They highlighted the value of NTLs for detecting, estimating and monitoring socioeconomic dynamics.

NTLs data are successful in revealing economic activity; however, they are not effective for less developed areas, because satellite imagery of these areas is uniformly dark. To address this, Jean et al. Jean2016 applied deep learning algorithms to learn the relationship between NTLs and daytime satellite imagery. The former can predict the wealth distribution while the latter contains rich information about landscape features. They employed a multi-step transfer learning approach Pan2010 to train a convolutional neural network (CNN) Xie2016 . In particular, a linear chain transfer learning graph was constructed. First, they transferred knowledge from object recognition on ImageNet (Problem 1) Krizhevsky2017 , an object classification image dataset of over 14 million images from 1000 different categories, to the prediction of NTL intensity from daytime satellite imagery (Problem 2). They chose the model trained on ImageNet as the starting CNN model Chatfield2014 , and then constructed the fully convolutional model. Formally, given an unrolled $h \times w \times d$-dimensional input $x \in \mathbb{R}^{h \cdot w \cdot d}$, the fully connected layers perform a matrix-vector product,

$$\hat{x} = f(Wx + b), \qquad (1)$$

where $W \in \mathbb{R}^{k \times h \cdot w \cdot d}$ is a weight matrix, $b$ is a bias term, $f$ is a nonlinear function, and $\hat{x} \in \mathbb{R}^{k}$ is the output. Then, they transferred knowledge from Problem 2 to the prediction of poverty from daytime satellite imagery (Problem 3), for which the amount of training data is limited. The method is illustrated by Blumenstock Blumenstock2016 (see Figure 3), and technical details are presented in the earlier work by Xie et al. Xie2016 . The image features extracted from the daytime imagery can explain up to 75% of the variation in the average household asset across five African countries. Moreover, the method is able to reconstruct survey-based indicators of regional poverty with high accuracy. Using only publicly available data, the method has broad potential applications in tracking and targeting poverty in developing countries.
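The fully connected layer described above (a matrix-vector product with a weight matrix, plus a bias term, passed through a nonlinear function) can be written in a few lines. The sketch below uses plain Python lists; the choice of ReLU as the nonlinearity and all numerical values are assumptions made for illustration.

```python
def relu(z):
    """Element-wise rectified linear unit, used here as the nonlinearity."""
    return [max(0.0, v) for v in z]

def fully_connected(x, W, b, f=relu):
    """Apply f(Wx + b) to an unrolled input vector x.

    W is a list of k rows, each as long as x; b has k entries.
    """
    z = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]
    return f(z)

# Hypothetical 2 x 2 x 1 image patch, unrolled into a vector of length 4.
x = [1.0, 2.0, 3.0, 4.0]
W = [
    [1.0, 0.0, -1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
b = [0.5, -1.0]

x_hat = fully_connected(x, W, b)  # output vector with k = 2 entries
```

Because the weight matrix has a fixed number of columns, such a layer only accepts inputs of one size, which is one motivation for converting it into a convolutional form.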

Figure 3: Predicting poverty from satellite images using Convolutional Neural Network (CNN). Figure from Blumenstock2016 .

Other RS data and machine learning approaches can also be used in quantifying poverty-environment relationships. By applying the principal component analysis (PCA) and spatial models in the field of geostatistics, Sedda et al. Sedda2015 demonstrated the correlations between the normalized difference vegetation index (NDVI, a measure of vegetation greenness in RS Imran2014 ), intensity of poverty, and health for a large area of West Africa. They found that high NDVI is associated with low poverty and child mortality. Their results highlight the utility of satellite-based metrics for poverty analysis. With high-resolution daytime satellite imagery, the UN Global Pulse Lab Kampala built a proxy indicator for poverty based on counting household roofs. The research project entitled “Measuring Poverty with Machine Roof Counting” UNGP2019 developed image processing software to count the roofs and identify the type of roof that a house has. Watmough et al. Watmough2016 applied a random forests approach to study the relationships between welfare and geographic metrics for over 14,000 villages in India. They found that geographic metrics account for 61% and 57% of the variation in the lowest and highest welfare quintile, respectively. These methods help estimate socioeconomic status in less developed countries where household surveys remain lacking.
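The NDVI mentioned above has a standard definition: the difference between near-infrared and red reflectance divided by their sum, which yields values between -1 and 1. A minimal sketch follows; the reflectance values are made up, and returning 0.0 when both bands are zero is a convention assumed for this example.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from two reflectance values."""
    denom = nir + red
    if denom == 0:
        return 0.0  # assumed convention for pixels with no signal
    return (nir - red) / denom

# Hypothetical reflectance pairs (near-infrared, red) for three pixels.
pixels = [(0.50, 0.10), (0.20, 0.20), (0.05, 0.30)]
values = [ndvi(nir, red) for nir, red in pixels]
```

Dense green vegetation reflects strongly in the near-infrared and weakly in the red band, so it produces values close to 1, while bare soil and water give values near or below zero.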

2.1.2 Mobile phones reveal socioeconomic status

Mobile phones (MPs), serving as ubiquitous sensors, are increasingly common in developing economies. Compared to coarse-grained remote sensing, MPs are able to capture an enormous amount of information and provide cost-effective data at the individual level, such as the frequency and timing of communication events Onnela2007 ; Hong2009 ; Zhao2011b , traveling patterns Gonzalez2008 , histories of consumption and expenditure Blumenstock2010 , and so on. With MP logs that are related to housing, education, health, etc., socioeconomic status can be inferred by employing regression models and machine learning approaches at the aggregated subnational and national levels.

To explore the relationship between MP usage and wealth in developing countries, Blumenstock et al. Blumenstock2010 presented a novel method that contains three steps: (1) modeling the relationship between assets and expenditures using Demographic and Health Survey (DHS) data; (2) conducting a phone survey with a small subset of MP users to collect information on asset ownership; (3) obtaining call detail records (CDRs) for the individuals in the phone survey and creating a single dataset that uses call histories to predict annual expenditures. By analyzing the data from Rwanda, they found that household expenditures are positively correlated with MP usage, mainly with the number of international calls, the number of different districts contacted, and the average airtime credit purchase. Airtime credit is the money held in an MP account, ready to spend on texts, calls and data. These results suggest that the annual expenditures of MP users can be predicted using only their anonymous phone usage data. Blumenstock and Eagle Blumenstock2012 later found that MP usage in Rwanda is not uniform. They provided a quantitative description of the demographic and socioeconomic structure of MP usage; for example, phone owners are considerably richer and predominantly male. Moreover, Blumenstock et al. Blumenstock2016b showed that Rwandans use the MP network to transfer airtime credit to those affected by disasters. In particular, transfers tend to be sent to rich individuals and between pairs of individuals with a strong history of reciprocity.

Individual MP data can be aggregated to estimate socioeconomic status at the national level. By analyzing CDRs and airtime credit purchase histories, Gutierrez et al. Gutierrez2013 mapped the relative income of individuals, the diversity and inequality of income, and the socioeconomic segregation for fine-grained regions in Côte d’Ivoire. In particular, they quantified the variation in purchase amounts of each user by using the Coefficient of Variation (CV),

$$\mathrm{CV} = \frac{\sigma}{\mu}, \tag{2}$$

where $\sigma$ and $\mu$ are the standard deviation and the mean of the purchase amounts. They found that urban areas clearly stand out in diversity, showing the opportunity to obtain real-time and low-cost socioeconomic statistics. Also for Côte d’Ivoire, Smith et al. Smith2013 demonstrated how aggregated CDRs can be mined to derive proxies of socioeconomic indicators. They found strongly negative correlations between the communication activity within a region and the multidimensional poverty index (MPI) UN2010 , a survey-based indicator that measures a region’s actual poverty. Further, they derived a linear model to estimate the poverty level using the diversity of communication. Their work suggests CDRs as an invaluable source for poverty estimation, even without the knowledge of individual behavior.
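As a minimal illustration of the coefficient of variation used for airtime purchases (the standard deviation divided by the mean), the sketch below computes it for two hypothetical users. The purchase amounts are invented for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not specify which was used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(purchases):
    """Return sigma / mu for a sequence of purchase amounts.

    Assumption: sigma is the population standard deviation.
    """
    mu = mean(purchases)
    if mu == 0:
        raise ValueError("mean purchase amount is zero; CV is undefined")
    return pstdev(purchases) / mu


# Hypothetical airtime top-up histories (arbitrary currency units).
steady_user = [500, 500, 500, 500]
varied_user = [100, 2000, 250, 5000]

cv_steady = coefficient_of_variation(steady_user)
cv_varied = coefficient_of_variation(varied_user)
print(cv_steady, cv_varied)
```

A user who always buys the same amount has a CV of zero, while a user whose purchases vary widely has a CV above one in this toy example.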

MP data from Côte d’Ivoire has also been used to explore the relations between the national communication network and socioeconomic dynamics. Mao et al. Mao2015 introduced the CallRank indicator, the PageRank centrality Brin1998 calculated over the MP communication network, to quantify the relative importance of an area and tested the correlation between network features and socioeconomic indicators. They found that the outgoing call ratio consistently correlates with local socioeconomic statistics such as low poverty rate and high annual income. Moreover, the Gini index exhibits significant correlations with CallRank and other CDRs-based indicators. Further, to quantify the strength of the rich-club effect Zhou2004 ; Flammini2006 , they measured the weighted rich-club coefficient of the MP communication network,

$$\rho^{w}(r) = \frac{\phi^{w}(r)}{\phi^{w}_{\mathrm{null}}(r)}, \tag{3}$$

where $\phi^{w}(r) = W_{>r} / \sum_{l=1}^{E_{>r}} w_l^{\mathrm{rank}}$, and $\phi^{w}_{\mathrm{null}}(r)$ corresponds to the null model generated by randomizing the original MP network while preserving its degree distribution. Here, each node has a richness parameter $r$ as the average annual income of the region, $E$ is the total number of links, $E_{>r}$ is the number of links among the regions with richness larger than $r$, $W_{>r}$ is the sum of the weights attached to these links, and $w_l^{\mathrm{rank}} \ge w_{l+1}^{\mathrm{rank}}$ with $l = 1, \dots, E$ are the ranked weights of links on the network. If $\rho^{w}(r) > 1$, the network shows the rich-club effect in comparison with the null model. The extent to which $\rho^{w}(r)$ is larger than 1 indicates the strength of the rich-club effect. After analyzing the CDRs, Mao et al. Mao2015 found that rich areas form a rich club in MP communication, where rich areas communicate more frequently with each other.

By analyzing anonymized records of interactions on Rwanda’s MP network and the follow-up phone surveys of some individual subscribers, Blumenstock et al. Blumenstock2015 predicted the wealth of MP users. They demonstrated that the predicted attributes of individuals can accurately reconstruct the distribution of the entire nation’s wealth. Specifically, they used a two-step approach in feature engineering and model selection, where the first step generates a thousand metrics from the MP data, and the second step eliminates irrelevant metrics and selects a parsimonious model using the elastic net regularization Zou2005 . After applying this machine learning approach to analyze the survey data, they found that individual wealth can be well predicted and individuals in relative poverty can be accurately identified. Then, they generated out-of-sample predictions for 1.5 million MP users and produced the wealth map of Rwanda at a very high resolution (see Figure 4). Further, they found a strong correlation between the government “ground truth” data and the predicted wealth data after aggregating them to the district level. Their method is promising to map the distribution of wealth and other socioeconomic indicators for the full national population. Other works that leveraged MP data to infer socioeconomic status at the regional or urban levels will be introduced in the following sections.

Figure 4: High-resolution map of poverty and wealth predicted from mobile phone call records of 1.5 million users in Rwanda. Figure from Blumenstock2015 .

2.1.3 Combined data for better inference

Novel sources of data with a high spatial resolution have been used to provide an up-to-date indication of living conditions. For example, remote sensing (RS) data capture information about physical properties of the land, which are cost-effective but relatively coarse in urban areas. By contrast, call detail records (CDRs) from mobile phones (MPs) have high spatial resolution in urban areas but the resolution is usually insufficient in rural areas due to the sparsity of towers. Therefore, some recent works estimate socioeconomic status by combining data from different domains such as LandScan population Dobson2000 , RS and MPs.

While RS-only and CDRs-only models perform comparably in mapping poverty, Steele et al. Steele2017 demonstrated that their combination can produce better predictive maps of socioeconomic status in Bangladesh. Specifically, they employed hierarchical Bayesian geostatistical models (BGMs) Blangiardo2015 that combine RS data, CDRs and traditional survey-based data to map three commonly used indicators of living standards, namely, Wealth Index (WI), Progress out of Poverty Index (PPI) and reported household income (Income). The BGMs are built on the scale of the Voronoi polygons, which approximate the mobile tower coverage areas using Voronoi tessellation Okabe1992 . They applied BGMs to predict the poverty metrics (WI, PPI and Income) for each Voronoi polygon as a posterior distribution with completely modeled uncertainty around estimates. Then, they generated prediction maps with associated uncertainty using the posterior mean and standard deviation (see Figure 5). Their method using combined CDRs-RS data exhibits better predictive power for the observed data than the RS-only and CDRs-only methods. Similarly, Njuguna and McSharry Njuguna2017 built a linear model to predict MPI based on the combination of CDRs, RS and LandScan datasets in Rwanda. They extracted four meaningful features that proxy socioeconomic status from the combined dataset, specifically, nighttime lights (NTLs) per capita from RS data, mobile ownership per capita from CDRs, average daily call volume per phone from CDRs, and population density from LandScan data. They proposed a simple linear regression model using the four features to predict MPI, as

$$\mathrm{MPI} = \beta_0 + \sum_{i=1}^{4} \beta_i x_i, \tag{4}$$

where $x_i$ stands for the value of the corresponding feature and $\beta_i$ are the fitted regression coefficients. This model can explain 76% of the variance in MPI across 295 sectors in Rwanda. These results suggest that the combination of multiple data sources can yield socioeconomic estimates at a high spatial resolution.

Figure 5: Maps of predicted living standards based on call detail records (CDRs) and remote sensing (RS) data in Bangladesh. Mean wealth index (A) with uncertainty (D); mean likelihood being below $2.50/day (B) with uncertainty (E); and mean income (C) with uncertainty (F). The maps show the posterior mean and standard deviation from CDR-RS models for the WI and income data (A,C), and the RS model for the PPI (B). Red color indicates poorer areas in prediction maps, and higher error in uncertainty maps. Figure from Steele2017 .

Exhaust from digital and physical commodities can provide rich information about socioeconomic status, and thus proxy indicators can be built by leveraging these novel data sources. For example, United Nations Global Pulse launched the project entitled “Building Proxy Indicators of National Wellbeing with Postal Data” UNGlobalPulse2016 , which investigates the potential of using the international postal flow network to approximate indicators of countries’ socioeconomic profiles. The project collected 14 million electronic postal records of 187 countries from 2010 to 2014. The dataset covers 680,000 post offices and forms the world’s largest postal network. Results show that indicators gathered from the postal network correlate well with fourteen widely used socioeconomic indicators such as GDP and Human Development Index (HDI). This work demonstrates that structural features of world flow networks can be used to produce proxy indicators of socioeconomic status.

Meanwhile, Hristova et al. Hristova2016 examined how digital traces and the network structure can reveal the socioeconomic profiles of different countries. They measured the position of each country in six different global networks (trade, postal, migration, international flights, IP and digital communications) and built proxies for a number of socioeconomic indicators, including GDP per capita, HDI ranking and twelve other indicators. In particular, they applied the multilayer network model Kivel2014 to characterize the strength of these international ties, where six networks representing six types of international ties are considered as six layers of the multiplex network, with each pair of nodes possibly having one relationship in each layer. Formally, the multiplex network Battiston2014 is denoted as

$$\mathcal{M} = \{G^{1}(V^{1}, E^{1}), \dots, G^{M}(V^{M}, E^{M})\}, \tag{5}$$

where each layer $G^{\alpha}$ contains a set of edges $E^{\alpha}$ and a set of nodes $V^{\alpha}$, and $M$ is the total number of networks. The multiplex neighborhood of a node $i$ is defined as the union of its neighborhoods on each layer:

$$N_{\mathcal{M}}(i) = N^{1}(i) \cup N^{2}(i) \cup \dots \cup N^{M}(i), \tag{6}$$

where $N^{\alpha}(i)$ is the neighbourhood of node $i$ in layer $\alpha$. The global multiplex degree of node $i$ is defined as $k_{\mathrm{glob}}(i) = |N_{\mathcal{M}}(i)|$, and the weighted global multiplex degree is defined as

(7)

where $n$ is the total number of nodes. These network metrics have predictive power for several socioeconomic indicators. The global multiplex degree is the best-performing degree, with consistently high performance across all fourteen indicators. In particular, the global degree exhibits the strongest negative correlation with the HDI ranking in terms of Spearman’s rank correlation. These results show that a nation’s socioeconomic proxy indicators can be constructed based on different global networks after combining the data from multiple sources.
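The multiplex neighbourhood (the union of a node's neighbourhoods over all layers) and the global multiplex degree (the size of that union) can be sketched in a few lines. The layer names, the country labels and the treatment of every layer as an undirected edge list are assumptions made for illustration only; the weighted variant is not covered here.

```python
def multiplex_neighbourhood(layers, node):
    """Union of the neighbourhoods of `node` across all layers.

    `layers` maps a layer name to a list of undirected edges (u, v).
    """
    neighbours = set()
    for edges in layers.values():
        for u, v in edges:
            if u == node:
                neighbours.add(v)
            elif v == node:
                neighbours.add(u)
    return neighbours


def global_multiplex_degree(layers, node):
    """Number of distinct neighbours of `node` over all layers."""
    return len(multiplex_neighbourhood(layers, node))


# Hypothetical toy multiplex with three layers and four countries.
layers = {
    "trade": [("A", "B"), ("A", "C")],
    "postal": [("A", "B"), ("B", "C")],
    "flights": [("A", "D")],
}

degree_A = global_multiplex_degree(layers, "A")
degree_B = global_multiplex_degree(layers, "B")
print(degree_A, degree_B)
```

Country A is linked to B in two layers, but B is counted only once in the union, which is what distinguishes the global multiplex degree from a simple sum of per-layer degrees.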

2.2 Economic complexity and fitness of nations

Understanding how economies develop to prosperity is a long-standing challenge in economics. In the traditional literature, GDP, an aggregated monetary indicator, has been widely used to identify the stages of economic development of countries. Recently, a novel index named economic complexity has been proposed to account for the root of the gaps in economic development. In particular, the new stream of literature introduces a variety of non-monetary metrics based on international trade networks to quantitatively assess a country’s potential for future economic growth. In this section, we will briefly introduce recent works on the economic complexity index, the fitness index, and some variant indices, as well as their applications to predict world economic development.

2.2.1 Product space and economic complexity

Economic development has traditionally been measured by aggregated variables like GDP; however, such averages cannot capture the increasing diversity that is associated with economic development. A recent insight is that the mix and diversity of products and industries are highly suggestive of economic growth. Hausmann et al. Hausmann2007 introduced the level of sophistication (the income level of a country’s exports) to the characterization of products and demonstrated that it can predict subsequent economic growth. Specifically, they first construct an index called PRODY, which represents the income level associated with a product. The PRODY index for product $p$ is given by

$$\mathrm{PRODY}_p = \sum_{c} \frac{x_{cp}/X_c}{\sum_{c'} x_{c'p}/X_{c'}}\, Y_c, \tag{8}$$

where $x_{cp}$ is the total export of product $p$ by country $c$, $X_c = \sum_p x_{cp}$ is the total export of country $c$, and $Y_c$ is the GDP per capita of country $c$. Indeed, the PRODY index is a weighted average of the per capita GDPs of countries exporting a given product. Then, they construct the EXPY index, which represents the income level associated with a country’s export basket. The EXPY index for country $c$ is given by

$$\mathrm{EXPY}_c = \sum_{p} \frac{x_{cp}}{X_c}\, \mathrm{PRODY}_p. \tag{9}$$

Indeed, the EXPY index is a weighted average of the PRODY values of the products exported by the country, where the weights are the shares of the products in the total exports of the country. After analyzing the international trade data covering over 5,000 products and 124 countries, Hausmann et al. Hausmann2007 found that countries with high initial sophistication of export baskets (EXPY) tend to perform better in subsequent economic growth. These results suggest that countries have economically meaningful differences in the specialization patterns of their export baskets, and countries that export more sophisticated products are likely to grow more rapidly.
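To make the two sophistication indices concrete, the following sketch computes PRODY (a GDP-per-capita average weighted by each country's normalized export share of the product) and EXPY (the export-share-weighted average of PRODY over a country's basket). The country names, product names and all numbers are hypothetical, and the nested-dictionary data layout is an assumption chosen for readability.

```python
def prody(exports, gdp_per_capita):
    """Return {product: PRODY}; exports[c][p] is the export value of p by c."""
    totals = {c: sum(basket.values()) for c, basket in exports.items()}
    products = {p for basket in exports.values() for p in basket}
    result = {}
    for p in products:
        # Share of product p in each country's export basket.
        shares = {c: exports[c].get(p, 0.0) / totals[c] for c in exports}
        norm = sum(shares.values())
        result[p] = sum(shares[c] / norm * gdp_per_capita[c] for c in exports)
    return result


def expy(exports, prody_values):
    """Return {country: EXPY}, the basket-weighted average of PRODY."""
    result = {}
    for c, basket in exports.items():
        total = sum(basket.values())
        result[c] = sum(v / total * prody_values[p] for p, v in basket.items())
    return result


# Hypothetical two-country, two-product world.
exports = {
    "rich": {"chips": 80.0, "wheat": 20.0},
    "poor": {"wheat": 100.0},
}
gdp_per_capita = {"rich": 40000.0, "poor": 2000.0}

prody_values = prody(exports, gdp_per_capita)
expy_values = expy(exports, prody_values)
print(prody_values, expy_values)
```

In this toy world, the product exported only by the rich country inherits that country's income level, while the product that dominates the poor country's basket receives a much lower PRODY.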

Later, Hidalgo et al. Hidalgo2007 illuminated this viewpoint by analyzing the network of relatedness between products, named the product space, which is built on international trade data. Products are considered to have high relatedness if they have a high probability of being co-exported by many countries in international trade. Formally, the proximity between products $p$ and $p'$ is defined as

$$\phi_{pp'} = \min\{P(\mathrm{RCA}_{p} \mid \mathrm{RCA}_{p'}),\; P(\mathrm{RCA}_{p'} \mid \mathrm{RCA}_{p})\}, \tag{10}$$

where $P(\mathrm{RCA}_{p} \mid \mathrm{RCA}_{p'})$ is the conditional probability that a country is a significant exporter of product $p$ given that it is a significant exporter of product $p'$. The significant exporter of a product is identified by the revealed comparative advantage (RCA) Balassa1965 . The RCA value is defined as the ratio of the share of product $p$ in the export basket of country $c$ to the share of product $p$ in world trade. Specifically, the RCA of country $c$ in product $p$ is defined by

$$\mathrm{RCA}_{cp} = \frac{x_{cp} / \sum_{p'} x_{cp'}}{\sum_{c'} x_{c'p} / \sum_{c',p'} x_{c'p'}}, \tag{11}$$

where $x_{cp}$ is the total export of product $p$ by country $c$. If $\mathrm{RCA}_{cp} \ge 1$, country $c$ is a significant exporter of product $p$. Larger proximity $\phi_{pp'}$ means higher relatedness between products $p$ and $p'$. Based on the proximity measure, the product space is generated and visualized (see Figure 6). It can be seen that the product space has a core-periphery structure, with more-sophisticated products located in the core and less-sophisticated products occupying the periphery (see Refs. Borgatti2000 ; Holme2005 for the definition of core-periphery structure in networks). Richer and poorer countries tend to export products that are located in the core and periphery, respectively. More significantly, countries move through the “product space” by developing products that are related to what they currently have. These results help explain why economic development is a path-dependent process Neffke2011 and why not all countries face the same opportunities in development.
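The proximity construction can be sketched as follows: compute the RCA of every country-product pair, mark significant exporters with a threshold, and take the minimum of the two conditional co-export probabilities. The export figures are invented, and the threshold of 1 with a non-strict inequality is an assumption that follows the common convention.

```python
def rca_matrix(exports):
    """Return {(country, product): RCA}; exports[c][p] is an export value."""
    country_total = {c: sum(b.values()) for c, b in exports.items()}
    world_total = sum(country_total.values())
    product_total = {}
    for basket in exports.values():
        for p, v in basket.items():
            product_total[p] = product_total.get(p, 0.0) + v
    return {
        (c, p): (v / country_total[c]) / (product_total[p] / world_total)
        for c, basket in exports.items()
        for p, v in basket.items()
    }


def significant_exporters(exports, rca, product, threshold=1.0):
    """Countries whose RCA in `product` reaches the threshold."""
    return {c for c in exports if rca.get((c, product), 0.0) >= threshold}


def proximity(exports, p, q, threshold=1.0):
    """Minimum of the two conditional co-export probabilities."""
    rca = rca_matrix(exports)
    exp_p = significant_exporters(exports, rca, p, threshold)
    exp_q = significant_exporters(exports, rca, q, threshold)
    if not exp_p or not exp_q:
        return 0.0
    both = len(exp_p & exp_q)
    return min(both / len(exp_p), both / len(exp_q))


# Hypothetical export values for three countries and three products.
exports = {
    "A": {"cars": 60.0, "chips": 40.0},
    "B": {"cars": 50.0, "chips": 30.0, "wheat": 20.0},
    "C": {"wheat": 90.0, "cars": 10.0},
}

phi_cars_chips = proximity(exports, "cars", "chips")
phi_cars_wheat = proximity(exports, "cars", "wheat")
print(phi_cars_chips, phi_cars_wheat)
```

Taking the minimum makes the measure symmetric and avoids inflating the proximity of a product that only one country exports.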

Figure 6: The network representation of the product space built on the international trade data. Links are color coded with their proximity value. The sizes of the nodes are proportional to world trade, and their colors are chosen according to the product classification. Figure after Hidalgo2007 .

In particular, it is hard for poor countries to move toward new products with high sophistication since these countries tend to occupy the periphery of the product space with current exports of less-sophisticated products. Using the concept of product space to explore the international trade data, Abdon and Felipe Abdon2011 studied the opportunity for economic growth and structural transformation of Sub-Saharan Africa (SSA) countries. They found that the majority of SSA countries are trapped in the export of products that are unsophisticated, standard and poorly connected in the product space. This makes the structural transformation of the region particularly difficult, because the nearby products are in the periphery and the current capabilities are not enough to jump to more sophisticated products. To solve this problem, governments must implement policies and provide public inputs that give the private sector incentives to invest in more sophisticated activities.

Further, Hidalgo and Hausmann Hidalgo2009 quantified the economic complexity of nations based on international trade data and demonstrated its central role in a country’s economic development. In particular, they proposed the Method of Reflections (MR) to characterize the structure of the “country-product” bipartite network and showed that the variables produced by the MR method can be interpreted as indicators of economic complexity. Formally, the bipartite network can be represented by an adjacency matrix $M$, where $M_{cp} = 1$ if country $c$ is a significant exporter ($\mathrm{RCA}_{cp} \ge 1$) of product $p$, and $M_{cp} = 0$ otherwise. The economic complexity index (ECI) of country $c$ is then defined as

$$\mathrm{ECI}_c = \frac{K_c - \langle \vec{K} \rangle}{\mathrm{std}(\vec{K})}, \tag{12}$$

where $N_c$ is the number of countries, $\langle \cdot \rangle$ and $\mathrm{std}(\cdot)$ are functions of mean and standard deviation that operate on the elements of vector $\vec{K} = (K_1, \dots, K_{N_c})$, and $\vec{K}$ is the eigenvector associated with the second largest eigenvalue of the matrix

$$\tilde{M}_{cc'} = \sum_{p} \frac{M_{cp} M_{c'p}}{k_{c,0}\, k_{p,0}}. \tag{13}$$

Indeed, the matrix $\tilde{M}$ is defined through a set of linear iterative equations by connecting countries that have similar products, weighted by the inverse of the ubiquity of the product ($k_{p,0} = \sum_c M_{cp}$) and normalized by the diversity of the country ($k_{c,0} = \sum_p M_{cp}$). Formally, putting the equation $k_{p,N} = \frac{1}{k_{p,0}} \sum_c M_{cp} k_{c,N-1}$ (the average ubiquity of product) into the equation $k_{c,N} = \frac{1}{k_{c,0}} \sum_p M_{cp} k_{p,N-1}$ (the average diversity of country) can generate the equation

$$k_{c,N} = \sum_{c'} \tilde{M}_{cc'}\, k_{c',N-2}, \tag{14}$$

where $N$ is the number of iterations. The economic complexity of country $c$ is given by $k_{c,N}$, where $k_{c',N-2}$ is country $c'$’s complexity in the previous iteration step. For more mathematical details, readers are encouraged to read the book on economic complexity written by Hausmann et al. Hausmann2014 . Empirical results showed that countries’ ECIs are highly correlated with their income levels and are predictive of their future growth. Indeed, economic development is a process that requires acquiring more complex sets of capabilities to move towards new activities associated with higher levels of productivity. Therefore, efforts should focus on generating the conditions that allow complexity to emerge, so that sustained growth and prosperity in economic development will appear.
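The linear iteration behind the Method of Reflections can be sketched directly on a small binary country-product matrix: start from the diversity of countries and the ubiquity of products, then repeatedly replace each value with the average of the neighbouring values from the previous step. The toy matrix is hypothetical, and the sketch assumes every country exports at least one product and every product has at least one exporter. It illustrates the iteration only, not the eigenvector-based ECI.

```python
def method_of_reflections(M, iterations):
    """Return (k_c, k_p) after the given number of reflection steps.

    M is a list of rows (countries) of 0/1 entries (products).
    """
    n_c, n_p = len(M), len(M[0])
    k_c0 = [sum(row) for row in M]                                 # diversity
    k_p0 = [sum(M[c][p] for c in range(n_c)) for p in range(n_p)]  # ubiquity
    k_c, k_p = list(k_c0), list(k_p0)
    for _ in range(iterations):
        # Both updates use the values of the previous step.
        new_c = [sum(M[c][p] * k_p[p] for p in range(n_p)) / k_c0[c]
                 for c in range(n_c)]
        new_p = [sum(M[c][p] * k_c[c] for c in range(n_c)) / k_p0[p]
                 for p in range(n_p)]
        k_c, k_p = new_c, new_p
    return k_c, k_p


# Hypothetical nested (triangular) matrix: country 0 exports everything.
M = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]

k_c0, k_p0 = method_of_reflections(M, 0)
k_c1, k_p1 = method_of_reflections(M, 1)
print(k_c0, k_c1)
```

After one step, each country's value is the average ubiquity of its products: the most diversified country has the lowest average ubiquity, which is the negative diversification-ubiquity relationship that this literature reports.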

From a network perspective, uncovering the characteristics of the “country-product” bipartite network is very important for understanding economic development. Hausmann and Hidalgo Hausmann2011 proposed an analytic framework to account for the nature of the bipartite network structure. They found that countries differ in their product diversification and in the ubiquity of their exported products. Countries with more capabilities are able to produce less ubiquitous products. This logic explains the negative relationship between the diversification of countries and the average ubiquity of the products that they produce. Later, Bustos et al. Bustos2012 studied the presence and absence of industries in international and domestic economies. They found that “country-product” bipartite networks are significantly nested Patterson1986 , and the dynamics of nestedness can predict the evolution of industrial ecosystems (see Refs. Bascompte2003 ; Lin2018 for details on nestedness in networks). Moreover, the nestedness tends to be constant over time, making the pattern of industrial appearances predictable. Felipe et al. Felipe2012 applied MR to rank 5107 products and 124 countries in international trade. They found that countries’ export shares of products of different complexity vary with their income per capita. Specifically, the export shares of the most complex products increase with income, while the export shares of the less complex products decrease with income. Moreover, MR can distinguish products that require more complex capabilities from those that require simpler ones, and the complexity rankings of countries exhibit a high correlation with their technological capabilities.

2.2.2 Fitness index and economic dynamics

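The Fitness index employs a statistical approach to define a new set of metrics that quantify the fitness of countries and the complexity of products through coupled nonlinear maps. Based on the analysis of the “country-product” bipartite networks of international trade, Caldarelli et al. Caldarelli2012 proposed a new method based on a biased Markov chain process to rank countries in a more conceptually consistent way, where a two-parameter bias is used to account for the bipartite network structure. Formally, the Markov process is given by

$$w_c^{(n+1)}(\alpha, \beta) = \sum_{p} G_{cp}(\beta)\, w_p^{(n)}(\alpha, \beta), \qquad w_p^{(n+1)}(\alpha, \beta) = \sum_{c} G_{pc}(\alpha)\, w_c^{(n)}(\alpha, \beta), \tag{15}$$

where $w_c^{(n)}$ is the fitness of country $c$, $w_p^{(n)}$ is the complexity of product $p$, $n$ is the iteration step, and $G$ is the Markov transition matrix given by

$$G_{cp}(\beta) = \frac{M_{cp}\, k_c^{-\beta}}{\sum_{c'} M_{c'p}\, k_{c'}^{-\beta}}, \qquad G_{pc}(\alpha) = \frac{M_{cp}\, k_p^{-\alpha}}{\sum_{p'} M_{cp'}\, k_{p'}^{-\alpha}}, \tag{16}$$

where $\alpha$ and $\beta$ are free parameters, $M_{cp}$ is the binary “country-product” matrix, and $k_c = \sum_p M_{cp}$ and $k_p = \sum_c M_{cp}$ are the diversification of country $c$ and the ubiquity of product $p$. In a vectorial formalism, the fitness of countries is $\vec{w}_c^{(n+1)} = \hat{T}\, \vec{w}_c^{(n-1)}$, where the ergodic stochastic matrix $\hat{T}$ is defined as $\hat{T} = \hat{G}_{cp}(\beta)\, \hat{G}_{pc}(\alpha)$. The complexity of products is $\vec{w}_p^{(n+1)} = \hat{S}\, \vec{w}_p^{(n-1)}$, where the ergodic stochastic matrix is $\hat{S} = \hat{G}_{pc}(\alpha)\, \hat{G}_{cp}(\beta)$ (see Ref. Caldarelli2012 for mathematical details). After analyzing these equations, Caldarelli et al. Caldarelli2012 revealed a strongly nonlinear entanglement between the diversification of a country and the ubiquity of its products in determining the competitiveness of countries and the complexity of products. In particular, having more-sophisticated products in the portfolio contributes more to the competitiveness of a country than having many less-sophisticated products.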

Moving forward, Tacchella et al. Tacchella2012 developed a so-called Fitness-Complexity Method (FCM) using coupled nonlinear maps, whose fixed point can define new metrics for the fitness of countries and the complexity of products. In their iterative algorithm, the fitness of countries and the complexity of products interact in a nonlinear and self-consistent way. Specifically, the fitness of a country is proportional to the number of its products weighted by their complexity. In turn, the complexity of a product is inversely proportional to the number of countries exporting it weighted by the inverse of their fitness (similar methods have also been proposed for search engines Kleinberg1999 and online reputation systems Zhou2011 ). Formally, the coupling between the fitness $F_c$ of country $c$ and the complexity $Q_p$ of product $p$ is given by the nonlinear iterative scheme:

$$\tilde{F}_c^{(n)} = \sum_{p} M_{cp}\, Q_p^{(n-1)}, \qquad \tilde{Q}_p^{(n)} = \frac{1}{\sum_{c} M_{cp}\, \frac{1}{F_c^{(n-1)}}}, \tag{17}$$

where $M_{cp}$ is the binary “country-product” matrix, and $\tilde{F}_c^{(n)}$ and $\tilde{Q}_p^{(n)}$ are respectively normalized in each step by $F_c^{(n)} = \tilde{F}_c^{(n)} / \langle \tilde{F}_c^{(n)} \rangle_c$ and $Q_p^{(n)} = \tilde{Q}_p^{(n)} / \langle \tilde{Q}_p^{(n)} \rangle_p$, given the initial condition $\tilde{F}_c^{(0)} = 1$ and $\tilde{Q}_p^{(0)} = 1$. The nonlinear iteration goes until the stationary state is reached (see Ref. Pugliese2016 for the convergence property), in which $F_c$ reflects the fitness of countries and $Q_p$ reflects the complexity of products. Indeed, FCM is based on the idea that (i) a diversified country gives limited information on the complexity of products, and (ii) a poorly diversified country tends to have a specific product of a low level of sophistication. Therefore, a nonlinear iteration is needed to bound the complexity of products by the fitness of the less competitive countries exporting them. When applied to the international trade data, FCM performs better than MR in capturing the bipartite network structure, in defining an effective non-monetary metric for economic complexity, and in quantifying a country’s potential for growth.
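The coupled nonlinear maps of the Fitness-Complexity Method can be sketched as a short loop: fitness is the sum of the complexities of a country's products, complexity is the inverse of the summed inverse fitness of a product's exporters, and both are divided by their mean after every step. The toy matrix and the fixed number of iterations are assumptions for illustration. A production implementation would need a convergence criterion and care with fitness values that approach zero.

```python
def fitness_complexity(M, iterations=20):
    """Return (F, Q) after a fixed number of Fitness-Complexity steps.

    M is a binary matrix given as a list of rows (countries).
    """
    n_c, n_p = len(M), len(M[0])
    F = [1.0] * n_c
    Q = [1.0] * n_p
    for _ in range(iterations):
        # Both updates use the normalized values of the previous step.
        F_raw = [sum(M[c][p] * Q[p] for p in range(n_p)) for c in range(n_c)]
        Q_raw = [1.0 / sum(M[c][p] / F[c] for c in range(n_c))
                 for p in range(n_p)]
        mean_F = sum(F_raw) / n_c
        mean_Q = sum(Q_raw) / n_p
        F = [f / mean_F for f in F_raw]
        Q = [q / mean_Q for q in Q_raw]
    return F, Q


# Hypothetical nested matrix: country 0 exports all three products,
# country 2 exports only the most ubiquitous one.
M = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]

F, Q = fitness_complexity(M, iterations=20)
print(F, Q)
```

In this example the most diversified country obtains the highest fitness, and the product exported only by that country obtains the highest complexity, because no low-fitness exporter bounds it from above.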

Meanwhile, Cristelli et al. Cristelli2013 argued that nonlinear dependence is the fundamental element and the nonlinear approach is consistent with the structure of the unweighted “country-product” bipartite network. Moreover, they analyzed the case of including weights in the matrix through , where is the total export of product by country . After comparing MR and FCM in both economic and mathematical aspects, they found that FCM is more conceptually consistent and well-grounded from an economic point of view. Taking into account the triangular structure of the bipartite network, Tacchella et al. Tacchella2013 discussed how to define suitable non-monetary metrics for both the complexity of products and the diversification of countries. In particular, they argued the conceptual flaws of MR by using three toy models and demonstrated that FCM is able to grasp the level of competitiveness of a country by defining the simplest metrics that seem to be consistent with the triangular-like pattern.

This branch of studies has provided new perspectives to cast economic prediction into the conceptual scheme of forecasting the evolution of a dynamical system, for example, weather dynamics. Cristelli et al. Cristelli2015 compared the non-monetary metrics, in particular the fitness of countries, with monetary figures such as GDP per capita (GDPpc). They showed that FCM is able to quantify the hidden growth potential of countries. More interestingly, they demonstrated that the pattern of countries’ evolution in the Fitness-GDPpc plane is strongly heterogeneous, with two regimes of very different predictability (see Figure 7). Specifically, there is a strongly predictable area of economic development, named the laminar regime, while the predictability is low in the so-called chaotic regime. Two kinds of evolution patterns can be observed in the laminar regime, where emerging economies develop rapidly and developed economies enjoy stable growth. In the chaotic regime, the dynamics of countries are highly diverse and unstable, which makes economic development difficult to predict. In this case, tools like regressions are no longer appropriate for developing a predictive scheme.

Figure 7: The heterogeneous dynamics of countries in the Fitness-GDPpc plane. (a) A finer coarse graining of the dynamics highlights two regimes. One regime is the laminar region (right), where fitness is the driving force of the growth, and the evolution of countries in this region is highly predictable. The other regime is the chaotic region (left), where the issues are very close to the problems of predictability for dynamical systems and developing a predictive scheme using tools like regressions is no longer appropriate. (b) The continuous interpolation of the coarse grained dynamics. The predictability in the two regimes, laminar and chaotic, is better illustrated. Figure from Cristelli2015 .

To address this issue, Cristelli et al. Cristelli2015 defined a selective predictability scheme to assess the future evolution of countries, which resembles the method of analogues Lorenz1969 developed to predict the evolution of a dynamical system given the knowledge of the past but without the laws of motion. The framework provides insights into the regime-dependent economic predictability and opens new paths to economic forecasting. Recently, Tacchella et al. Tacchella2018 applied this scheme to predict the five-year GDP growth. In the Fitness-GDPpc plane, they repeatedly sampled analogues with a Gaussian kernel (centred on the present state of a country) and performed a bootstrap of previously observed evolution (weighted by the distance of the analogues’ starting points), resulting in the global distribution of possible outcomes. They further refined the forecast by taking into account the strong self-correlation of GDP growth. Specifically, the forecast based on the global distribution is combined, through a weighted average, with the forecast that extrapolates the past five-year growth. This scheme outperforms the International Monetary Fund (IMF) five-year GDPpc forecast IMF2016 by more than 25% in accuracy. Moreover, the method’s forecasting errors are predictable and not correlated with IMF errors, showing its complementarity to traditional approaches.

2.2.3 Variant indices and development analysis

Many recent studies have highlighted the importance of complexity and capabilities in economic development. The pioneering work by Hidalgo and Hausmann Hidalgo2009 introduced MR to extract the competitiveness of countries and the complexity of products from the “country-product” bipartite networks with the assumption that there are linear interactions between the two metrics. Tacchella et al. Tacchella2012 ; Tacchella2013 proposed FCM and emphasized the necessity of nonlinear coupling between the fitness of countries and the complexity of products. Mariani et al. Mariani2015 quantitatively compared the ability of MR and FCM to rank countries and products by their importance in networks. Based on the international trade data of 132 countries and 723 products, they found that FCM outperforms MR in ranking both products and countries. In particular, FCM captures the nestedness of the bipartite network and ranks nodes better by their importance.

Mariani et al. Mariani2015 proposed a modified FCM (MFCM for short), in which the nonlinear coupling is governed by a tunable parameter. By adjusting the parameter, one can find a better tradeoff between favoring countries with diversified exports and penalizing products with a large number of exporting countries. Formally, MFCM is defined by the equations

$$\tilde{F}_c^{(n)} = \sum_{p} M_{cp}\, Q_p^{(n-1)}, \qquad \tilde{Q}_p^{(n)} = \left[ \sum_{c} M_{cp} \left( F_c^{(n-1)} \right)^{-\gamma} \right]^{-1/\gamma}, \tag{18}$$

where $\gamma$ is the tunable parameter, and both quantities are normalized by their means after each step as in FCM. When $\gamma = 1$, MFCM degenerates to FCM. The correlation between the product complexity and the product ubiquity decreases with the increase of $\gamma$. Above a certain value of $\gamma$, the ranking of product complexity by MFCM is perfectly correlated with that by FCM; however, the ranking is volatile (very sensitive to noise). For this reason, MFCM with larger $\gamma$ can only be applied to high-quality data instead of noisy data. When the input data is reliable, MFCM is able to produce better rankings of countries and products.
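The role of the tunable exponent can be illustrated with a sketch in which product complexity is computed as a generalized mean of the exporters' fitness: the sum of fitness raised to the power of minus gamma, itself raised to the power of minus one over gamma. This functional form and the mean normalization after each step are assumptions based on the description that an exponent of one recovers FCM. The matrix is a hypothetical toy example.

```python
def generalized_fcm(M, gamma=1.0, iterations=20):
    """Fitness-complexity iteration with a tunable exponent gamma > 0.

    Assumed form: Q_p = (sum_c M_cp * F_c**(-gamma)) ** (-1/gamma).
    """
    n_c, n_p = len(M), len(M[0])
    F = [1.0] * n_c
    Q = [1.0] * n_p
    for _ in range(iterations):
        F_raw = [sum(M[c][p] * Q[p] for p in range(n_p)) for c in range(n_c)]
        Q_raw = [
            sum(M[c][p] * F[c] ** (-gamma) for c in range(n_c)) ** (-1.0 / gamma)
            for p in range(n_p)
        ]
        mean_F = sum(F_raw) / n_c
        mean_Q = sum(Q_raw) / n_p
        F = [f / mean_F for f in F_raw]
        Q = [q / mean_Q for q in Q_raw]
    return F, Q


# Hypothetical nested matrix (rows: countries, columns: products).
M = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]

F1, Q1 = generalized_fcm(M, gamma=1.0, iterations=1)
F2, Q2 = generalized_fcm(M, gamma=2.0, iterations=1)
print(Q1[0] / Q1[2], Q2[0] / Q2[2])
```

After a single step from uniform initial values, the ratio between the complexity of the most ubiquitous product and that of the rarest one is 1/3 for an exponent of one and about 0.58 for an exponent of two, so a larger exponent weakens the direct penalty on ubiquity. Very large exponents can overflow in floating point, so a practical implementation would need rescaling.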

Wu et al. Wu2016 showed some rigorous mathematical properties of the fitness-complexity metric for nested networks. They introduced a simpler variant of FCM, named Minimal Extremal Metric (MEM), where the complexity of a product is equal to the fitness of the least-fit country that exports it. Formally, MEM defines the fitness $F_c$ of country $c$ and the complexity $Q_p$ of product $p$ by

$$F_c = \sum_{p} M_{cp}\, Q_p, \qquad Q_p = \min_{c:\, M_{cp} = 1} F_c. \tag{19}$$

Obviously, in MEM, only the fitness of the least-fit country contributes to the product complexity $Q_p$. MEM is the special case of MFCM in the limit $\gamma \to \infty$. Results based on the analysis of the international trade data show that MEM can reproduce the nested structure of the “country-product” bipartite network but it is highly sensitive to noise in data.
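A minimal sketch of the extremal rule follows: fitness is still the sum of the complexities of a country's products, but the complexity of a product is simply the fitness of its least-fit exporter. Solving the coupled definitions by fixed-count iteration with mean normalization is an assumption borrowed from FCM, and the matrix is a hypothetical toy example.

```python
def minimal_extremal_metric(M, iterations=20):
    """Iterate F_c = sum_p M_cp Q_p and Q_p = min fitness among exporters.

    Assumption: values are divided by their mean after every step.
    """
    n_c, n_p = len(M), len(M[0])
    F = [1.0] * n_c
    Q = [1.0] * n_p
    for _ in range(iterations):
        F_raw = [sum(M[c][p] * Q[p] for p in range(n_p)) for c in range(n_c)]
        # Only the least-fit exporter of each product matters.
        Q_raw = [min(F[c] for c in range(n_c) if M[c][p])
                 for p in range(n_p)]
        mean_F = sum(F_raw) / n_c
        mean_Q = sum(Q_raw) / n_p
        F = [f / mean_F for f in F_raw]
        Q = [q / mean_Q for q in Q_raw]
    return F, Q


# Hypothetical nested matrix (rows: countries, columns: products).
M = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]

F_mem, Q_mem = minimal_extremal_metric(M, iterations=20)
print(F_mem, Q_mem)
```

Because a single exporter determines each complexity value, flipping one matrix entry (for example, a low-fitness country appearing as an exporter of a rare product) can change the result sharply, which is consistent with the reported sensitivity to noise.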

Morrison et al. Morrison2017 provided both theoretical and numerical evidence for the intrinsic instability in the nonlinear map employed by FCM. Using the preferential attachment model (see Refs. Barabasi1999 ; Foschi2014 ) and two real-world datasets (trade and patent), they showed that FCM is unstable to even small perturbations in the network, while MR does not suffer from this problem. That is because the nonlinear iterative approach in FCM amplifies the effects of countries with low fitness on the complexity of a product and highlights economies producing exclusive niche products, which are produced by very few countries but are not necessarily the most sophisticated. Adding a product exported by only a single country may lead to a global reorganization of the fitness landscape. Therefore, FCM has a serious problem when applied to dynamic economic systems with new products entering markets.

With new methodologies, increasing attention has been paid to better understanding economic development, innovation and industrialization. Based on the international trade data, Zaccaria et al. Zaccaria2014 built a hierarchically directed network by measuring the taxonomy of products through computing the excess frequency of co-occurrence of two products compared to the random binomial case. Formally, the taxonomy between products $p$ and $p'$ is defined by projecting the “country-product” matrix $M$ to a unipartite space as (similar to Zhou2007 )

$$B_{pp'} = \frac{1}{\max(u_p, u_{p'})} \sum_c \frac{M_{cp} M_{cp'}}{d_c}, \tag{20}$$

where $u_p = \sum_c M_{cp}$ is the ubiquity of product $p$ and $d_c = \sum_p M_{cp}$ is the diversification of country $c$.

The taxonomy network presents the temporal connections between products and suggests the most relevant products for the development of countries. Indeed, the structure of the taxonomy network is suggestive of the potential growth of countries. Later, Saracco et al. Saracco2015 proposed a dynamical network approach to model how countries’ innovation and competition shape the evolution of their export baskets. Their dynamical model can accurately reproduce the main features observed in the evolution of the “country-product” bipartite network. Moreover, their model suggests that countries can follow different paths in the “product space” Hidalgo2007 ; Zaccaria2014 to gradually diversify their export baskets.

Focusing on the time evolution of trade volume, average complexity and competitiveness, Zaccaria et al. Zaccaria2016 compared the exports of different sectors in the Netherlands. They found that high-tech related sectors have high average complexity but low competitiveness, while sectors heavily relying on raw materials, such as the Energy and Horticulture sectors, have low complexity but high competitiveness. Indeed, not only products but also services are important in explaining economic stability and predicting future growth. Stojkoski et al. Stojkoski2016 found that services have in general higher economic complexity than products. The sophistication and diversification of service exports can provide an additional route for economic growth in both developing and developed countries. Countries that are not able to diversify their service portfolios may face diminishing growth prospects.

Hartmann et al. Hartmann2017 found that countries exporting more complex products have lower levels of income inequality. In particular, economic complexity index (ECI) outperforms GDP in explaining income inequality. Based on the international trade data, they calculated the Product Complexity Index (PCI) using the method proposed by Hidalgo and Hausmann Hidalgo2009 . Further, they estimated the level of income inequality associated with products by introducing the Product Gini Index (PGI), which is a weighted average of the Gini coefficients of the countries that export a product (see Figure 8A for the PGIs of products in the product space). There is a strong and negative correlation between PCI and PGI, showing that sophisticated products tend to have low levels of inequality (see Figure 8B). Moreover, countries with high (low) level of ECI are more likely to specialize in high-PCI (low-PCI) products, suggesting that the productive structure of a country may condition its range of income inequality. Recently, Mealy et al. Mealy2019 interpreted economic complexity metrics by showing that ECI and PCI are equivalent to a spectral clustering algorithm, which divides a similarity network into two parts. Moreover, these measures are closely related to many dimensionality reduction methods such as correspondence analysis and diffusion maps. Their findings shed some new light on the empirical success of ECI and PCI in explaining specialization patterns of countries in economic growth.
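To make the idea of the Product Gini Index concrete, the following Python sketch computes a weighted average of country Gini coefficients for each product. The choice of weights (the share of the product in each exporter's total exports) and the omission of any comparative-advantage filter are assumptions made for illustration; the function name and the toy numbers are hypothetical.

```python
def product_gini_index(exports, gini):
    """Weighted average of exporters' Gini coefficients for each product.

    exports[c][p] is the export value of product p by country c;
    gini[c] is the Gini coefficient of country c.
    """
    n_c, n_p = len(exports), len(exports[0])
    totals = [sum(row) for row in exports]
    pgi = []
    for p in range(n_p):
        num = 0.0
        den = 0.0
        for c in range(n_c):
            if exports[c][p] > 0 and totals[c] > 0:
                # Weight: share of product p in country c's export basket
                # (an illustrative assumption).
                share = exports[c][p] / totals[c]
                num += share * gini[c]
                den += share
        pgi.append(num / den if den > 0 else float("nan"))
    return pgi


# Two countries, two products (made-up numbers).
exports = [[10.0, 0.0],
           [10.0, 30.0]]
gini = [0.3, 0.5]
pgi = product_gini_index(exports, gini)
```

A product exported only by the more unequal country inherits that country's Gini coefficient, while a product exported by both sits between the two values, closer to the country for which it matters more.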

Figure 8: The product space and income inequality. (A) In the product space, nodes are colored according to the Product Gini Index (PGI) as measured during 1995-2008. The sizes of nodes are proportional to the volume of the international trade during 2000-2008. The networks are based on a proximity matrix representing 775 SITC-4 product classes exported during 1963-2008. The link strength (proximity) is based on the conditional probability that the products are co-exported. (B) The relationship between the Product Complexity Index (PCI) and the Product Gini Index (PGI) during 2000-2008. Figure from Hartmann2017 .

Pugliese et al. Pugliese2017 analyzed the role of complexity in economic development and found that economies with differentiated products face a lower barrier in the transition towards industrialization. They extended the concept of poverty trap to include the two factors of economic complexity and GDPpc (see also Ref. Cristelli2015 ). They defined an index of development and industrialization, named Complex Index of Relative Development (CIRD), by the equation:

(21)

where is the fitness of country at time , is the GDPpc of country at time , and is a tunable parameter. The use of the CIRD index allows development to be studied as a one-dimensional process. In particular, is a threshold for countries to exit the poverty trap, and the increase of the input growth reaches its maximum at this critical point. The CIRD index facilitates our understanding of industrialization dynamics and is helpful for development analysis. Sbardella et al. Sbardella2017 analyzed the relationship between wage inequality and industrialization using fitness and GDPpc. They found that the movement of wage inequality along with industrialization follows a longitudinally persistent pattern. This finding is comparable to the theory proposed by Kuznets Kuznets1955 , who hypothesized that countries with an average level of development suffer the highest levels of wage inequality.

Along with the literature, some online platforms have been developed and launched to help understand the evolution of countries’ productive structures and economic development. For example, Simoes and Hidalgo Simoes2011 launched a data visualization site, named Observatory of Economic Complexity (OEC) (https://atlas.media.mit.edu). The OEC combines a number of international trade datasets and serves millions of interactive visualizations including imports and exports, origins and destinations, product space, economic complexity rankings based on MR, income inequality, and so on. Meanwhile, the GROWTHCOM Project launched a data platform (http://www.growthcom.eu), which provides visualization tools of the product network Zaccaria2014 and the countries’ trajectories in the fitness-GDPpc plane Cristelli2015 .

2.3 Spatial demography and culture evolution

High resolution and near real-time data from new sources like remote sensing (RS), mobile phone (MP) and social media (SM) are complementary to traditional data, which are costly and arrive with long time delays, in inferring population distributions and demographics. Moreover, these so-called socioeconomic big data, together with methods from interdisciplinary fields including statistical physics and computer science, have been used to predict international migration and quantify world culture evolution. In this subsection, we will briefly introduce some methods using new data sources to map world population, estimate international migration and study culture evolution.

2.3.1 World population distribution

Knowing the spatial distribution of population on earth is critical for many socioeconomic applications such as accurate environmental impact assessments, human health adaptive strategies and disease burden estimation Linard2012 . Developed countries have substantial resources to create accurate and contemporary population datasets with high spatial resolution Patel2017 . However, relevant data are often scarce, outdated and unreliable in low-income countries due to economic constraints. In addition, acquiring census data in a timely and accurate manner is very difficult due to the rapid change of population and some administrative challenges. As a result, our knowledge of population distribution in many areas of the world remains poor thus far. Fortunately, technologies developed during the past decades have opened new ways for us to estimate and map world population distribution in a more timely manner and at a relatively lower cost.

Some large-area gridded world population distribution datasets have been built based on multiple data resources. Tobler et al. Tobler1997 developed the first version of Gridded Population of the World (GPW) database by transforming population counts from census units to a grid. The Global Rural Urban Mapping Project (GRUMP) utilizes higher resolution inputs and renders outputs at a 30 arc-second resolution (approximately 1km). In addition to census data, spatial covariate datasets are also used to estimate populations. For example, the LandScan Global Population Project Dobson2000 produced the world-wide 1998 LandScan population database at a 30 arc-second resolution based on the land cover database derived from satellite imagery and urban area vector data Bhaduri2007 . Tatem et al. Tatem2007 produced a 100m gridded population map by combining land cover information and census data under the Malaria Atlas Project. Their semi-automated mapping approach produces more accurate population distributions at a spatial resolution of about 100m in East Africa.

Cheriyadat et al. Cheriyadat2007 generated human settlement maps based on high-resolution satellite imagery. Their algorithm employed gray level co-occurrence matrices Martinao2003 to generate texture and edge patterns from satellite imagery that are useful in urban land cover classification. Liao et al. Liao2010 presented a high-accuracy population mapping method that integrates genetic programming (GP) Kishore2001 and genetic algorithms (GA) Holland1992 with geographic information systems (GIS). Specifically, they applied GIS to identify relevant factors (e.g., land-cover types and transport infrastructure) and used GP and GA to transform census data to population grids. Deng et al. Deng2010 estimated small-area population by incorporating GIS, remote sensing (RS) and demographic data into a popular demographic model. They demonstrated that the derived spatial factors can significantly improve the accuracy of small-area population estimation.

Gaughan et al. Gaughan2013 constructed an accurate and high-resolution population distribution dataset for Southeast Asia. They modeled population distributions for 2010 and 2015 by combining satellite-derived settlement maps, land cover information, and ancillary datasets on infrastructure. Stevens et al. Stevens2015 presented a new semi-automated dasymetric modeling approach, where RS and geospatial data are combined to model the dasymetric weights and the random forest model is used to generate a gridded prediction of population density at about 100m resolution. Patel et al. Patel2015 presented a novel method to map multitemporal settlement and population from Landsat imagery using Google Earth Engine (GEE), which is an online environmental data monitoring platform that provides analysis capabilities on Landsat data by leveraging cloud computing services. They demonstrated that the integration of GEE-derived urban extents improves the quality of population mapping.

Spatial covariates derived from satellite imagery and land cover are typically static in nature and are not direct measures of people’s presence on earth Patel2017 . Thanks to the rapid adoption of Internet and mobile devices in developing countries, there is a great potential of using digital records to do population mapping. For example, call detail records (CDRs) can overcome many limitations of census-based data since MPs have a high penetration rate across the world. For urban areas, Pulselli et al. Pulselli2008 developed a technique to monitor population density in real time based on MP chatting, given that the intensity of activity in the area covered by an antenna is proportional to the number of MP users. Based on MP location data, Dan and He Dan2010 proposed a dynamic distribution model to estimate urban population density using an improved K-means clustering algorithm Lloyd1982 . Kang et al. Kang2012 discussed several fundamental issues on using CDRs to estimate population distributions. After analyzing the CDRs of nearly two million MP subscribers, they found that the number of calls rather than the total daily call volume serves as a good estimator of population distribution.

Figure 9: Comparison of predicted population density datasets with baseline data for mainland Portugal. (A) Population density derived from the national census. (B) Population estimated by the mobile phone method. (C) Population density estimated by the remote sensing method. (D-F) Close-ups around the capital city Lisbon. Figure from Deville2014 .

Recently, using both RS and MP data, Deville et al. Deville2014 produced spatially and temporally explicit estimations of population densities at national scales (see Figure 9). Based on over one billion CDRs from Portugal and France, they estimated the population density of an administrative unit using a two-step method that relies on the density $\sigma_{v_j}$ of MP users, where $v_j$ is the Voronoi polygon Okabe1992 associated with tower $j$. The nighttime density for unit $c_i$ is calculated by

$$\sigma_{c_i} = \frac{1}{A_{c_i}} \sum_{v_j} \sigma_{v_j} A_{(c_i \cap v_j)}, \tag{22}$$

where $A_{c_i}$ is the area of unit $c_i$, and $A_{(c_i \cap v_j)}$ is the intersection area of unit $c_i$ and the Voronoi polygon $v_j$. The density $\sigma_{c_i}$ is compared with the census-derived population density $\rho_{c_i}$ through

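$$\rho_{c_i} = \alpha \, \sigma_{c_i}^{\beta}, \tag{23}$$

where $\rho_{c_i}$ and $\sigma_{c_i}$ are the census-derived population density and the nighttime MP user density of unit $c_i$, $\alpha$ is the scale ratio and $\beta$ captures the superlinear effect of the MP user density. By transforming Eq. (23) to $\log \rho_{c_i} = \log \alpha + \beta \log \sigma_{c_i}$, the two parameters $\alpha$ and $\beta$ can be fitted by a linear regression on training data. Further, they combined the MP method with the RS method proposed by Stevens et al. Stevens2015 , who used the random forest model to generate gridded predictions of population density. Formally, the population density in pixel $i$ is estimated by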

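$$\hat{\rho}_i = \frac{P \, w_i}{\sum_j w_j}, \tag{24}$$

where $w_i$ is the weight assigned to pixel $i$, $P$ is the total population, and the sum runs over the pixels among which $P$ is distributed. Combining MP and RS data can produce population datasets with a high spatial and temporal resolution.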
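The three steps of this pipeline can be sketched in a few lines of Python. The sketch assumes that Eq. (22) is an area-weighted average of tower-level densities, that Eq. (23) is a power law fitted by ordinary least squares in log-log space, and that Eq. (24) redistributes a population total in proportion to pixel weights. All function names and numbers are illustrative, not taken from Deville2014 .

```python
import math


def unit_density(sigma_v, overlap_area, unit_area):
    """Area-weighted MP user density of one administrative unit.

    sigma_v[j] is the user density of Voronoi polygon j and
    overlap_area[j] is its intersection area with the unit.
    """
    return sum(s * a for s, a in zip(sigma_v, overlap_area)) / unit_area


def fit_power_law(sigma, rho):
    """Fit rho = alpha * sigma**beta by least squares on the logarithms."""
    xs = [math.log(s) for s in sigma]
    ys = [math.log(r) for r in rho]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = math.exp(my - beta * mx)
    return alpha, beta


def dasymetric(total_pop, weights):
    """Distribute a population total over pixels in proportion to weights."""
    total_w = sum(weights)
    return [total_pop * w / total_w for w in weights]


# Synthetic check: densities generated from a known power law.
sigma = [1.0, 2.0, 4.0, 8.0]
rho = [3.0 * s ** 0.8 for s in sigma]
alpha, beta = fit_power_law(sigma, rho)   # close to (3.0, 0.8)
pixels = dasymetric(100.0, [1.0, 1.0, 2.0])
```

In practice the pixel weights would come from a model such as the random forest mentioned above; here they are simply given as a list.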

Douglass et al. Douglass2015 created high-resolution maps of population distribution by combining telecommunications data, satellite imagery and census data in Milan, Italy. They fitted population and call data by applying an elementary model that is similar to Eq. (23). They found that the total out-call volume has the strongest correlation (about 0.68) with the grid-level population. Further, they employed a random forest regression to predict population using features of land cover measures, call activity measures and their combinations. They found that building land cover and calls made out at 10am are the top-two predictors that are sufficient to provide accurate predictions. Lulli et al. Lulli2016 proposed a function to capture similarities between individual call profiles (ICPs). The similarity of ICPs is captured by combining the Euclidean similarity and the Jaccard similarity. Then, they built a clustering algorithm to provide clusters of individuals based on the similarity between ICPs. Using an automatic classifier to label the clusters, their method can estimate the number of residents, commuters and visitors in a given region. At the urban scale, Khodabandelou et al. Khodabandelou2016 estimated population density by applying Eq. (22) and Eq. (23) based on the mobile network traffic metadata. Their method can estimate both static and dynamic populations across different cities.

Call records are powerful in mapping populations. However, they are usually not easy to obtain due to privacy concerns Kosinski2013 . For example, some highly sensitive traits and attributes can be inferred from digital records of human behavior Tsavli2015 . The increasingly available social media data presents alternative opportunities in estimating population distribution. Twitter has gained worldwide popularity, so geotagged tweets provide detailed depictions of human activity. Leetaru et al. Leetaru2013 explored over 1.5 billion tweets posted by over 70 million users. They found a high correlation (0.79) between geotagged tweets and the NASA City Lights imagery. The most accurate feature is the self-reported user location field, exhibiting a correlation of 0.72 with the geotagged baseline. Their work demonstrates the potential of geotagged tweets in world population mapping.

Very recently, volunteered geographic information (VGI) collected from the Internet (e.g., check-in data Yang2017 ) has been used to estimate population at a fine scale. Yao et al. Yao2017 presented a framework to map population distribution at the building level by integrating national census data with two geospatial data sources. One is the points-of-interest (POIs) provided by Baidu Map Services, and the other is the real-time Tencent user densities (RTUD). They employed the random forest algorithm Breiman2001 to analyze the two geospatial datasets and downscale the street-level population distribution to the grid level. Then, they proposed an iterative gravity model that can efficiently estimate the population density in each building and study area. Their method achieves a high correlation with the official census data.

The WorldPop collection recently brings together publications describing detailed and open-access spatial demographic datasets built using transparent approaches Tatem2017 . For the Latin America and the Caribbean region, Sorichetta et al. Sorichetta2015 opened an archive of high-resolution gridded population datasets for 2010, 2015 and 2020 based on the most recent official population count data for 28 countries. Gaughan et al. Gaughan2016 opened mainland China population maps for 1990, 2000 and 2010 after analyzing temporally-explicit census data using an ensemble prediction model. Lloyd et al. Lloyd2017 described the datasets and production methodology for the 3 and 30 arc-second resolution global gridded population data. The basis of the archive contains four tiled raster datasets and other layers.

2.3.2 International migration

International migration is one of the major causes of demographic, economic and political changes. The literature has suggested some determinants of migration such as family and personal networks Boyd1989 and revealed the impact of immigrants on the host country’s economy Friedberg1995 . There are some bottlenecks in studying migration such as data availability, data quality, data collection rules, and inconsistencies in measurement. For example, a person may be involved in multiple migrations during a given year, but most systems count the number of migrations instead of migrants, resulting in an overestimate of the number of immigrants. Moreover, “migration” as defined by different countries may differ substantially, which results in inconsistencies among international data. In addition, the choice of which country reports the data can lead to significantly different patterns of migration Beer2010 . Census and registered migration data are helpful for the estimation of international migrations. By combining census migration data and patient registration data, Raymer et al. Raymer2007 developed a log-linear model to estimate elderly migration flows in England and Wales. Their model extends the spatial interaction model (see Ref. Willekens1999 for details) by adding a third variable of interest, such as health status in migration data. Formally, the log-linear model with an offset is given by

(25)

where is the expected migration flow from origin to destination for level of the third variable, and are respectively related to the origin and destination’s characteristics, and is the auxiliary information on migration flows.

Cohen et al. Cohen2008 developed a generalized linear model (GLM) Mccullagh1989 to predict international migrants using only geographic and demographic variables. They found that the number of migrants per year depends on population of origin and its population density. De Beer et al. Beer2010 presented a methodology to estimate total immigration and emigration numbers for 19 European countries. Abel and Sander Abel2014 provided the spatial structure of international migration flows between 196 countries from 1990 to 2010 (see Figure 10). The bilateral migration flows are based on refugee statistics, population registers, and place-of-birth responses to census questions. They employed an iterative proportional fitting algorithm Deming1940 to estimate the global migration flows. They found that the percentage of 5-year flows has been relatively stable at about 0.6% of world population since 1995. Moreover, African migrants move predominantly within the African continent, Asian and Latin American migration flows are spatially focused, and long-distance flows usually go to higher income level countries with negligible return flows.
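Iterative proportional fitting, mentioned above, can be illustrated on a small origin-destination table. The sketch below is a generic two-dimensional version that rescales a seed table until its row and column sums match given margins; it is not the specific estimation procedure of Abel2014 , and the seed values and margins are made up.

```python
def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Rescale `seed` so that its row and column sums match the targets.

    The row targets and column targets are assumed to have the same total.
    """
    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Step 1: scale each row to its target total.
        for i, row in enumerate(table):
            s = sum(row)
            if s > 0:
                factor = row_targets[i] / s
                table[i] = [v * factor for v in row]
        # Step 2: scale each column to its target total.
        for j in range(len(col_targets)):
            s = sum(row[j] for row in table)
            if s > 0:
                factor = col_targets[j] / s
                for row in table:
                    row[j] *= factor
        # Columns now match exactly; stop once the rows match as well.
        err = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if err < tol:
            break
    return table


# Seed flows between two regions and the desired outflow/inflow totals.
seed = [[1.0, 2.0],
        [3.0, 4.0]]
flows = ipf(seed, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
```

Because each step only multiplies whole rows or columns, the fitted table keeps the interaction structure (cross-product ratios) of the seed while honoring the margins.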

Figure 10: Circular plot of migration flows within and between world regions during 2005 to 2010. Tick marks show the number of migrants (inflows and outflows) in millions. Only flows containing at least 170,000 migrants are shown. Figure from Abel2014 .

The increasingly available geolocated digital records from intelligent devices and online platforms offer the opportunity to better quantify migration flows. Using a large sample of Yahoo! e-mail data, Zagheni and Weber Zagheni2012 estimated the age and gender-specific migration rates. The location of a user is estimated as the country from which most of their messages were sent. The self-reported age and gender of users are then linked to their locations. They found that the estimated age profiles of migrants are consistent with the official data, and the mobility of females grows at a faster pace. Using similar Yahoo! data of over 100 million users, State et al. State2013 developed a statistical model to identify migrants and tourists. After generating a global mobility map, they found that the European Economic Area has high levels of pendularity, and pendular migrations occur between closely located countries. State et al. State2014 investigated international migration of professional workers by analyzing millions of geotagged career histories on LinkedIn. They found that the percentage of professional migrants to the US decreased from 2000 to 2012, while Asia has been a major professional migration destination during the past twelve years. Kikas et al. Kikas2015 extracted international migration from the Skype login events, showing that international migration can be estimated based on some social network features such as the percentages of international calls. Barchiesi et al. Barchiesi2015 extracted the location of users from geotagged photographs on Flickr and inferred their trajectories. The estimated number of visitors to the UK correlates with the official estimates for 28 countries.

Twitter provides a rich source of geotagged data to estimate international migration. Based on about one billion tweets, Hawelka et al. Hawelka2014 estimated the volume of international travelers according to the country of residence. They revealed spatially cohesive regions after analyzing the community structure of the Twitter-based international mobility network. By analyzing geotagged tweets produced by about 500,000 users, Zagheni et al. Zagheni2014 evaluated recent trends of migrations in OECD countries. They applied a difference-in-differences approach Bertrand2004 to reduce selection bias when inferring trends in out-migration rates. Their method can predict turning points in migration trends. Fagiolo and Mastrorillo Fagiolo2013 analyzed the topological structure of an international migration network (IMN) and its evolution from 1960 to 2000, where nodes are countries and links are the stock of migrants. They found that link weights follow a power-law distribution with a stable exponent at about 1.3. Moreover, IMN is highly clustered and disassortative, with a modular structure and the small-world property. In addition, most topological features of IMN can be explained by GLM, suggesting that socioeconomic, geographical and political factors are important in shaping the structure of migration networks.

International migration issues are prominent on economic and policy agendas. Fagiolo and Mastrorillo Fagiolo2014 studied how international migrations affect bilateral trade. They found that IMN and trade are strongly correlated with each other, and high centrality in IMN can increase the bilateral trade of countries. These results also indicate that the number of international immigrants can boost bilateral trade. Lee et al. Lee2014 suggested that research should focus on the growth of migration flows driven by humanitarian crises and on the connections between migration and inequality. The Global Migration Group GMG2016 has provided guidance to support the collection, tabulation, analysis, dissemination and use of international migration data to monitor the implementation of the Sustainable Development Goals.

2.3.3 Culture evolution

Culture is the essential character of human society, and it serves as a driving force for human development. Quantifying cultural evolution is a challenging task due to the lack of suitable data. Recently, the development of information technologies has made large-scale data available for culture evolution studies Bail2014 such as digitized books Michel2011 , baby names Berger2012 , languages Ronen2014 , recipes Zhu2013 , and biographies Yu2016b . Moreover, human languages, as an important part of culture, have also been studied using novel data resources (e.g., social media data Eisenstein2014 ) besides evolution models Larson2010 . Here, we will introduce applications of new data sources and methods in quantifying culture evolution.

Part of the evolution of human society is recorded by books. By analyzing a corpus of 5 million Google digitized books, Michel et al. Michel2011 observed cultural trends with over two billion culturomic trajectories. Focusing on linguistic and cultural phenomena reflected in the English language between 1800 and 2000, they provided insights about the size of English lexicon, collective memory, and evolution of grammar. In particular, the polarization of the states before the Civil War was revealed by the trajectories of using “the North”, “the South”, and finally “the enemy”. Zeng and Greenfield Zeng2015 analyzed massive culture-wide content using the Google Ngram Viewer. They found that cultural values shift along with specific ecological changes (urbanisation, wealth and formal education) in Chinese society. In particular, the frequencies of words related to adaptive individualistic values (indexed by words such as “choose”, “compete” and “get”) increased from 1970 to 2008. Bail Bail2014 summarized text extraction methods to classify different types of culture and map cultural environments from text-based data. These new tools were further combined with conventional qualitative methods to track cultural element evolution. Figuring out sales trajectories of books will help understand the cultural evolution. Yucesoy et al. Yucesoy2018 revealed a universal sales pattern of bestsellers, and further proposed a model that can explain the time evolution of book sales.

Human behavior massively reflects cultural information. Schich et al. Schich2014 reconstructed the aggregated mobility of over 150,000 intellectual individuals (see Figure 11) and then measured cultural interactions on a historical time scale. They developed quantitative methods to identify statistical regularities of individuals based on spatiotemporal birth and death information of notable individuals collected from Freebase.com (FB) and other sources (see Ref. Yu2016b for a similar dataset of globally famous biographies). They found that the distribution of distances between the birth and death locations of notable individuals remains unchanged over eight centuries. By employing network tools and complexity theory, they further identified the characteristic statistical patterns. In particular, Europe can be characterized by two different cultural regimes. One is a winner-takes-all regime, where massive centralization is toward centers, and the other is a fit-gets-richer regime, where many sub-centers compete in federal clusters. This work provides a macroscopic perspective of cultural history. Recently, Yang et al. Yang2016 explored cultural mapping based on user behavioral data collected from location-based social networks. From check-in messages and check-ins at a city’s POIs, they extracted three key cultural features, namely, language usage, daily activity patterns, and intercity crowd mobility patterns. Then, they proposed a cultural clustering method to capture cultural features and generate cultural maps that match traditional survey-based ones.

Figure 11: Interactions between culturally relevant locations over two millennia. (A) Notable individuals with birth and death locations. (B) Demographic life table for the Freebase.com (FB) dataset indicating death age frequency. (C) Birth-death scatter plot for locations in FB. (D) Illustration of birth-death flows of antiquarians in the 18th century. (E) Migration in Europe based on FB. Figure from Schich2014 .

Names can be used to study the underlying mechanism for cultural evolution. Hahn and Bentley Hahn2003 analyzed the 1000 most commonly used baby names in the US in each decade of the twentieth century. They found that the frequency distribution of baby names obeys a power law for over 100 years, and the distribution can be explained by a simple process where names are randomly copied. Bentley et al. Bentley2007 explained the steady turnover of modern baby names using a random-copying model. Female names in each decade have a higher turnover rate than male names, implying more innovation in naming girls. The random-copying model can characterize collective copying behavior in culture evolution. Berger et al. Berger2012 analyzed names given to babies born from 1882 to 2006 in the US. They found that the popularity of names is affected nonlinearly by similar names that became popular recently. Xi et al. Xi2014 found a sustained decline in the level of inequality among baby names over time. The reason behind this observation may be that people have more chances to know others’ names, and new names need to be more distinctive and novel. Further, they proposed a stochastic model in which social influence and individual preference determine individual choice of names. Recently, Barucca et al. Barucca2015 analyzed the correlations of newborns’ names in different states of the US from 1910 to 2012. They found a clear division of states into two homogeneous groups, where the states within each group have similar distributions of names. However, a transformation occurred at the end of the 20th century, where new clusters emerged in naming babies. Kim and Park Kim2005 investigated the distribution of family names in Korea, finding that the growth rates of smaller family names are higher. Lee et al. Lee2016 analyzed statistics of given names in Korea, Quebec, and the US. They found that the average popularities of given names show similar patterns of rise and fall at about one generation.
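A random-copying process of the kind discussed above can be simulated in a few lines. The sketch below is a generic neutral model: in every generation each newborn either copies the name of a randomly chosen member of the previous generation or, with a small probability, receives a brand-new name. The parameter names (`mu` for the innovation probability) and all values are illustrative assumptions, not the settings used in Hahn2003 or Bentley2007 .

```python
import random
from collections import Counter


def random_copying(n_individuals, n_generations, mu, seed=0):
    """Simulate a neutral random-copying model of name choice.

    Returns the list of names (integer labels) in the final generation.
    """
    rng = random.Random(seed)
    population = list(range(n_individuals))  # everyone starts with a unique name
    next_name = n_individuals                # label for the next invented name
    for _ in range(n_generations):
        new_population = []
        for _ in range(n_individuals):
            if rng.random() < mu:
                # Innovation: introduce a name never used before.
                new_population.append(next_name)
                next_name += 1
            else:
                # Copying: adopt the name of a random previous individual.
                new_population.append(rng.choice(population))
        population = new_population
    return population


final = random_copying(n_individuals=500, n_generations=200, mu=0.01)
popularity = Counter(final).most_common(10)  # the ten most popular names
```

Inspecting `popularity` for different values of `mu` gives a feel for how copying alone concentrates usage on a few names while innovation keeps the set of names turning over.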

Language evolution is an important aspect of culture evolution. From a modeling perspective, Nowak et al. Nowak2002 showed that certain evolutionary dynamics can describe both the cultural evolution of language and the biological evolution of universal grammar. Abrams and Strogatz Abrams2003 developed a simple model of language competition that explains historical data on the decline of some endangered languages. They derived a linguistic parameter that quantifies the threat of language extinction for the model. From an empirical perspective, Lieberman et al. Lieberman2007 quantified the evolving dynamics of language by analyzing the regularization of English verbs over the past 1,200 years. They explored how the rate of regularization depends on the frequency of word usage and found that the half-life of irregular verbs is proportional to the square root of their frequency. Based on a dataset of 107 million tweets, Eisenstein et al. Eisenstein2014 investigated the fundamental changes in the nature of written language. After employing a latent vector auto-regressive model to identify high-level patterns in the diffusion of linguistic change over the US, they found that language evolution in computer-mediated communication reproduces existing fault lines in spoken American English. Recently, Newberry et al. Newberry2017 quantified the strength of selection relative to stochastic drift in language evolution. After inferring selection towards the irregular forms of some past-tense verbs, they found that stochastic drift is stronger for rare words, suggesting that stochasticity plays an under-appreciated role in language evolution.
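As an illustration of the language competition model, the sketch below integrates the Abrams-Strogatz equation $dx/dt = c\,[(1-x)\,x^{a} s - x\,(1-x)^{a}(1-s)]$ with a simple Euler scheme, where $x$ is the fraction of speakers of one language and $s$ is its relative status. The exponent $a = 1.31$ is the value commonly quoted for this model; the rate constant, time step and initial condition are arbitrary illustrative choices.

```python
def abrams_strogatz(x0, s, a=1.31, c=1.0, dt=0.01, steps=5000):
    """Euler integration of the two-language competition model.

    x0 is the initial fraction of speakers of language X and
    s in (0, 1) is the relative status of X (1 - s for language Y).
    Returns the trajectory of x as a list.
    """
    x = x0
    trajectory = [x]
    for _ in range(steps):
        y = 1.0 - x
        # Gain from Y speakers switching to X minus loss of X speakers to Y.
        dx = c * (y * x ** a * s - x * y ** a * (1.0 - s))
        # Keep the fraction inside [0, 1] despite discretization error.
        x = min(1.0, max(0.0, x + dt * dx))
        trajectory.append(x)
    return trajectory


# Equal initial shares, but language X has the higher status.
traj = abrams_strogatz(x0=0.5, s=0.6)
```

With equal initial shares, the higher-status language takes over and the other declines towards extinction, which is the qualitative behavior used to describe endangered languages.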

Vocabulary growth in natural languages follows scaling laws. For example, the word frequency distribution follows Zipf’s law Zipf1949 in the relation $z(r) \sim r^{-\alpha}$, where $r$ denotes the rank of a word by its frequency $z(r)$, and $\alpha$ is the Zipf’s exponent. The number of distinct words follows Heaps’ law Heaps1978 as $N_t \sim t^{\lambda}$, where $N_t$ denotes the number of distinct words when the text length is $t$, and $\lambda$ is the Heaps’ exponent. Zipf’s law and Heaps’ law have been widely observed in the Indo-European language family and in keywords in journals Zhang2008. Indeed, these two laws are mathematically related: Zipf’s law leads to Heaps’ law Lu2010. After analyzing over 15 million words in books, Petersen et al. Petersen2012 found that only the more common words obey the classic Zipf’s law, and the annual growth fluctuation of word usage decreases with the corpus size. Based on the Google Ngram database of books, Gerlach and Altmann Gerlach2013 proposed a stochastic model for vocabulary growth that can generalize Zipf’s and Heaps’ laws to two scaling regimes. They found that the main historical change is the composition of specific words, where the list of core words is finite and decays exponentially in time with about 30 words per year for English. Pechenick et al. Pechenick2017 analyzed the English fiction corpus and found that the Zipf’s distribution has changed little from 1820 to 2000.
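As a minimal sketch of how the two exponents can be estimated from a tokenized text, the following Python code fits straight lines in log-log space to the rank-frequency curve and to the vocabulary growth curve. The function names and the use of ordinary least squares are illustrative assumptions, not the procedure of the cited works; least squares on log-log data is known to be a rough estimator for power laws.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var


def zipf_exponent(tokens):
    """Estimate alpha in z(r) ~ r^(-alpha) from a list of tokens."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -loglog_slope(ranks, freqs)


def heaps_curve(tokens):
    """Number of distinct tokens seen after each position in the text."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def heaps_exponent(tokens):
    """Estimate lambda in N_t ~ t^lambda from a list of tokens."""
    curve = heaps_curve(tokens)
    return loglog_slope(range(1, len(curve) + 1), curve)
```

Both estimators need at least two distinct ranks (or two text positions); on real corpora the fit is usually restricted to the range where the curve is approximately linear in log-log scale.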

Some languages, such as Chinese, Japanese and Korean, do not obey Zipf’s law or Heaps’ law. For these languages with very limited dictionary sizes (the number of characters is much smaller than the number of words), Lü et al. Lu2013b found that the character frequency distribution follows a power law with an exponent close to one, and that the number of distinct characters grows with the text length in three stages: it grows linearly at the beginning, then turns to a logarithmic form, and saturates in the end. After analyzing four Chinese texts, Deng et al. Deng2014 found that Zipf’s law holds perfectly for sufficiently short texts of Chinese characters. However, rank-frequency relations display a two-layer structure for long texts, with a Zipfian power-law regime for high-frequency characters in the first layer and an exponential-like regime for less frequent characters in the second layer. Yan and Minnhagen Yan2015 proposed a neutral model to predict frequency distributions of Chinese characters, where the maximum entropy prediction is used to describe a text written in Chinese. They demonstrated that the same Chinese texts written in words and in Chinese characters are both well predicted by their three characteristic values (the total number of words, the number of distinct words, and the number of repetitions of the most common word). Yan et al. Yan2013 further built a node-weighted network of Chinese characters, in which the weights of nodes are the frequencies of character usage, and the directed links correspond to the relations of direct components of characters. They developed a distributed node weight (DNW) strategy for learning Chinese characters and analyzed learning strategies using dynamical processes. Results showed that the DNW strategy can significantly improve the efficiency of learning major Chinese textbooks.

Language can be used to reveal linguistic and cultural borders. Bryden et al. Bryden2013 explored the interlink between language and social network structure based on Twitter data. They found that the hierarchy of communities on social networks can be characterized by their most significantly used words, and that the community of a user can be predicted from the words used in the user's tweets. Based on co-editing activities of Wikipedia, Samoilenko et al. Samoilenko2016 studied the linguistic neighbourhoods between language communities. They found that similar interests of Wikipedia editors between cultural communities can be explained by bilingualism, linguistic similarity of languages, and shared religion. Further, they proposed a method that can extract cultural borders from the co-editing activities. Mocanu et al. Mocanu2013 studied worldwide linguistic indicators and trends (see Figure 12) by analyzing a large-scale dataset of geotagged tweets. They found that Twitter penetration is highly heterogeneous and strongly correlated with GDP. Moreover, tweets can be used to study linguistic homogeneity at the country level, map language distributions in regions, and identify linguistically specific communities in urban areas.

Figure 12: Geographic distribution of languages based on Twitter data. (A) Raw Twitter signal. Each color represents a language. Densely populated areas are easily identified, showing that languages are separated among European countries. (B) Dominant language usage. The color indicates the fraction of users adopting the official language in tweets. Figure from Mocanu2013 .

Language can also reflect the capacity for cultural influence. Ronen et al. Ronen2014 proposed a quantitative measure of a language’s global influence based on the structure of three global language networks (GLNs). The GLNs are constructed by identifying significant links between languages with respect to the population of speakers expressed in three datasets. Formally, the correlation between languages $i$ and $j$ is given by

$$\phi_{ij} = \frac{M_{ij} N - M_i M_j}{\sqrt{M_i M_j (N - M_i)(N - M_j)}}, \tag{26}$$

where $M_{ij}$ is the number of multilingual users (or book translations) between languages $i$ and $j$, $M_i = \sum_j M_{ij}$, and $N$ is the total number of users (or book translations). The statistical significance of the correlation is given by the $t$ statistic,

$$t_{ij} = \frac{\phi_{ij} \sqrt{D - 2}}{\sqrt{1 - \phi_{ij}^2}}, \tag{27}$$

where $D - 2$ is the degree of freedom and $D = \max(M_i, M_j)$. Empirical results show that the position of a language in the GLNs contributes to the visibility of its speakers and the global popularity of the cultural content they produce. Gonçalves et al. Gonccalves2018 explored a large corpus of geotagged tweets and the Google Books datasets corresponding to books published in the US and the UK. After studying how the world-wide varieties of written English are evolving, they found that the past two centuries have resulted in a clear shift in vocabulary and spelling conventions from British to American. The result suggests that the capacity to culturally influence the rest of the world has gradually shifted from the UK to the US.
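The link-significance test used to build the GLNs can be sketched in a few lines of Python. The sketch assumes the standard phi coefficient for two binary variables, phi = (M_ij N - M_i M_j) / sqrt(M_i M_j (N - M_i)(N - M_j)), and a t statistic with D - 2 degrees of freedom where D = max(M_i, M_j); the function names and the example counts are hypothetical.

```python
import math


def phi_correlation(m_ij, m_i, m_j, n):
    """Phi coefficient between languages i and j.

    m_ij: co-occurrences (e.g. multilingual users) linking i and j
    m_i, m_j: total co-occurrences involving i and j, respectively
    n: total number of users (or book translations)
    """
    numerator = m_ij * n - m_i * m_j
    denominator = math.sqrt(m_i * m_j * (n - m_i) * (n - m_j))
    return numerator / denominator


def t_statistic(phi, m_i, m_j):
    """t statistic of a phi correlation with D - 2 degrees of freedom."""
    d = max(m_i, m_j)
    return phi * math.sqrt(d - 2) / math.sqrt(1.0 - phi ** 2)


# Hypothetical counts: 30 shared users, 50 and 60 users per language, 200 in total.
phi = phi_correlation(30, 50, 60, 200)
t = t_statistic(phi, 50, 60)
```

A link would then be kept only if the t statistic exceeds the critical value for the chosen significance level; when the co-occurrence count equals its expectation under independence (M_ij N = M_i M_j), the correlation is zero.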

Food is an integral part of cultures. Counihan and van Esterik Counihan2013 analyzed food-related activities and presented a cross-cultural study of personal identities and social groups. They introduced empirical and theoretical tools to understand food systems at multiple levels. Data on cuisines have been used to study food culture. Ahn et al. Ahn2011 explored cultural diversity by analyzing the variety of regional cuisines. After introducing a flavor network to capture the ingredient combinations in recipes, they found that Western cuisines tend to use ingredients that share flavor compounds, supporting the food pairing hypothesis that ingredients having similar flavor compounds may taste good together Blumenthal2008. By contrast, East Asian cuisines show a tendency to avoid food pairing. Zhu et al. Zhu2013 explored the similarity of regional cuisines in China based on online recipes. They found that geographical proximity plays a more crucial role than climate proximity in determining regional cuisine similarity. Further, they proposed an evolution model of Chinese cuisines that reproduces a tendency similar to that of the real dataset. Their work extends our understanding of the evolution of Chinese regional cuisines and cultures.

Food preference can reflect cultural diversity and cross-cultural relations. Based on server log data from a large recipe platform, Wagner et al. Wagner2014 explored the evolution of food preferences. They found that ingredients partly drive recipe preferences, and ingredient preference distributions show smaller regional differences than recipe preference distributions. Moreover, weekday preferences differ from weekend preferences. Abbar et al. Abbar2015 studied US-wide dietary choices by analyzing dining experiences tweeted by 0.21 million users. They found that the caloric values of tweeted foods have a high correlation (0.77) with the state-wide obesity rates. Moreover, users in higher-educated areas tweeted about foods with fewer calories. Based on tweeted food names and demographic variables, they built a model that can well predict county-wide obesity and diabetes statistics. Laufer et al. Laufer2015 explored cross-cultural relations based on 31 European food cultures recorded by Wikipedia. They mined cultural relations through the collective description and popularity of culinary practices within and across different Wikipedia language communities. They found that shared internal states (e.g., beliefs and values) are positively correlated with shared culinary practices, and neighbouring countries tend to have similar cultural practices.

3 Regional socioeconomic status and urban perception

3.1 Economic activity and socioeconomic status

High-resolution data and improved methods allow us to reveal economic activity and socioeconomic status at subnational, regional and urban scales. In this section, we will introduce recent works on predicting regional economic activity from nighttime lights (NTLs), mapping slums from very high resolution (VHR) imagery, inferring regional socioeconomic status from mobile phone (MP) data, and quantifying regional economic development based on social media (SM) data.

3.1.1 Nighttime lights reflect economic activity

Nighttime lights (NTLs) data have been widely used to infer socioeconomic status and predict income per capita at the regional and urban scales. Sutton et al. Sutton2007 proposed a regression model to estimate subnational GDP of India, China, Turkey, and the US. They estimated urban population using a log-log relationship between population and the areal extent of lit areas in NTLs imagery derived from the Defense Meteorological Satellite Program-Operational Line Scan System (DMSP-OLS) Elvidge2001 . They predicted GDP for the subnational administrative units by adding the estimated urban population into the regression model, with an assumption that urban population is the most critical factor for economic activity. They found that spatial disaggregation of estimates dramatically improved the aggregated national estimates of GDP based on NTLs.

Regional and urban socioeconomic status can be inferred from the changes in electric power consumption patterns reflected by NTLs that are derived from the DMSP-OLS data. Chand et al. Chand2009 studied the socioeconomic development of states and cities in India by looking at the spatial and temporal changes in electric power consumption. They found that the number of NTLs increased overall by up to 26% from 1993 to 2002, although it declined in some states. The increase in population correlates with both the increase in NTLs and the increase in electric power consumption. For the Republic of Kazakhstan, Propastin and Kappas Propastin2012 leveraged NTLs to monitor socioeconomic indicators (e.g., population, electricity consumption and GDP) at different spatial resolutions. Linear regression models were used to estimate population and electricity consumption at the settlement level. They revealed a strong correlation between NTLs and GDP. In particular, the regression model can explain 76% of the spatial variability in GDP among 17 provinces and 94% of the inter-annual variation in total GDP of Kazakhstan during 1994-1999.

Luminosity from NTLs satellite imagery has been used as a proxy for economic statistics. Chen and Nordhaus Chen2011 studied how much luminosity can contribute to the construction of GDP measures. They proposed an analytic method to quantify the relationship between luminosity from the DMSP-OLS data and GDP of North America. They found that luminosity is likely to add value as a proxy to estimate economic output for countries and regions. Due to a high measurement error of luminosity, however, the added value is limited for countries and regions with poor quality data. For more than 200 cities in China, Ma et al. Ma2012 compared three regression models to study the responses of stable NTLs from the DMSP-OLS data to changes in urbanization variables (e.g., population, GDP, electric power consumption, and built-up area). They found that NTLs can help estimate urbanization dynamics.

Mellander et al. Mellander2015 studied the relationship between NTLs and economic activity at a fine level. They used a geo-coded socioeconomic dataset consisting of spatially matched population and establishment counts in Sweden. After matching the dataset with light emissions, they used correlation analysis and geographically weighted regressions (GWR) to examine the relationship. The GWR model is given by

$$y_i = \beta_{i0} + \beta_{i1} x_i + \varepsilon_i, \tag{28}$$

where $i$ denotes location, $y_i$ is the dependent variable of NTLs, $x_i$ is a population-based or industry-based socioeconomic variable, $\beta_{i0}$ and $\beta_{i1}$ are location-specific coefficients, and $\varepsilon_i$ is the error term. They found that NTLs are a good proxy for population density and a weaker proxy for economic activity at the micro-level. The link between NTLs and economic activity is slightly overestimated for large urban areas but underestimated for rural areas. Moreover, economic activity has a stronger correlation with radiance light than with saturated light.
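To illustrate how a GWR differs from a global regression, the sketch below fits one weighted least-squares line per location, with weights that decay with distance from that location. The Gaussian kernel, the fixed bandwidth and the function name `gwr_fit` are assumptions made for illustration; the cited study may use a different kernel or an adaptive bandwidth.

```python
import math


def gwr_fit(coords, x, y, bandwidth):
    """Fit y = b0 + b1 * x separately at every location.

    coords: list of (u, v) coordinates, one per observation
    x, y: predictor and response values, aligned with coords
    bandwidth: scale of the Gaussian distance-decay kernel (assumed fixed)
    Returns a list of (intercept, slope) pairs, one per location.
    """
    coefficients = []
    for ui, vi in coords:
        # Gaussian kernel: nearby observations get weights close to 1.
        w = [
            math.exp(-0.5 * (math.hypot(ui - uj, vi - vj) / bandwidth) ** 2)
            for uj, vj in coords
        ]
        total = sum(w)
        x_bar = sum(wk * xk for wk, xk in zip(w, x)) / total
        y_bar = sum(wk * yk for wk, yk in zip(w, y)) / total
        s_xy = sum(wk * (xk - x_bar) * (yk - y_bar) for wk, xk, yk in zip(w, x, y))
        s_xx = sum(wk * (xk - x_bar) ** 2 for wk, xk in zip(w, x))
        slope = s_xy / s_xx
        coefficients.append((y_bar - slope * x_bar, slope))
    return coefficients
```

When the underlying relationship is the same everywhere, all local fits coincide with the global line; spatial variation in the returned slopes is what signals that the link between NTLs and the socioeconomic variable depends on location.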

NTLs satellite imagery has been used to study economic differences and GDP spatialization at the regional level in China. For example, Zhao et al. Zhao2011 examined the relations between China’s environmental change and economic development from 1996 to 2000. A proxy evaluator of ecosystems is net primary production (NPP), which is the amount of carbon and energy that enters ecosystems. A proxy evaluator of economic development is GDP, which is estimated based on NTLs. They found that changing patterns of NPP and GDP vary by region, and the relations are greatly affected by spatial location. Later, Zhao et al. Zhao2017 studied economic differences among diverse geomorphological types in South China based on NTLs from the Visible Infrared Imaging Radiometer Suite (VIIRS) Baugh2013. They found that the total NTLs exhibit a high correlation with both prefecture GDP and county GDP, suggesting the capability of NTLs in estimating economic development. Further, they proposed a GDP spatialization model and produced a pixel-level GDP map for South China (see Figure 13). Meanwhile, Dai et al. Dai2017 explored the suitability of using NTLs and regression methods to estimate GDP at the provincial and city levels in China. Empirical analysis suggested that the VIIRS data outperform the DMSP-OLS data, and the polynomial model outperforms the linear regression model.

Figure 13: The pixel-level (500 m × 500 m) GDP map for South China in 2014. The GDP map was produced by using the corrected NPP-VIIRS data and the regression model. Figure from Zhao2017.

3.1.2 Very high resolution imagery maps poverty

Indicators derived from both nighttime lights (NTLs) and very high resolution (VHR) imagery have been used to map poverty at fine scales. By analyzing stable NTLs, Wang et al. Wang2012 estimated poverty for 31 provinces in China. Specifically, they used principal component analysis (PCA) to develop a so-called Integrated Poverty Index (IPI) based on 17 socioeconomic indexes and computed the average light index (ALI) for each region based on the total number of lit pixels. They found a high correlation between IPI and ALI for the 31 provinces, suggesting the capacity of NTLs imagery to analyze poverty at the regional level. Using a supervised learning approach, Engstrom et al. Engstrom2017 linked features derived from VHR satellite images to survey-based poverty rates at the local level in Sri Lanka. They found that satellite-based features are highly predictive of poverty. Moreover, measures of built-up area and building density are strongly correlated with welfare in both urban and rural areas.

Slums, characterized by poor quality of basic services (e.g., water supply, electricity and sanitation), are common in low- and middle-income countries. Detecting and monitoring slum areas is valuable for implementing policies to improve living conditions. However, mapping the spatial distribution of urban slums is a challenging problem because there is no universal model of slums and slum conditions can take various forms. A meeting in 2008 Sliuzas2008 brought together the methodological expertise on slum monitoring and reviewed methods for slum identification based on VHR imagery. Using object-oriented classification of VHR images, Rhinane et al. Rhinane2011 developed a novel approach to detect slums in Casablanca, which integrates spectral, spatial and contextual information to map the urban land, achieving a high accuracy of 0.85 in extracting slum areas.

VHR images have been increasingly used to inventory the location and physical composition of slums. Shekhar Shekhar2012 applied object-oriented analysis Cheng2003 to VHR images to detect slums in Pune, India. First, they generated segments by automatically dividing images into coherent objects. Then, they used a feature extraction method to identify the characteristic features of object classes. Finally, they used contextual information to separate slum objects from non-slum built-up objects. Their approach achieves an overall accuracy of 87% in the classification of slums. Kohli et al. Kohli2012 developed an ontological framework to conceptualize slums based on input from 50 domain experts covering 16 different countries. They identified the morphology of the built environment at the environs, settlement and object levels. By including all potentially relevant indicators, their ontological framework provides a comprehensive basis for image-based classification of slums.

Recent studies have applied advanced image processing techniques to map slums from VHR images with minimal operator intervention. Kit et al. Kit2012 employed the concept of lacunarity to identify slums in Hyderabad, India. First, they produced a high-resolution binary image using two binarization methods, the principal component analysis (PCA)-based method and the line detection-based method Martinez2005. Then, they calculated lacunarity based on the binary image following Malhi and Román-Cuesta Malhi2008. Formally, the lacunarity of a subset of the original binary image is given by

$$\Lambda(r) = \frac{\sigma^2(r)}{\mu^2(r)} + 1, \tag{29}$$

where $\sigma^2(r)$ is the variance and $\mu(r)$ is the arithmetic mean of the number of filled pixels within all $r$-sized unique square subsets of the larger subset (see Ref. Kit2012 for details). The line detection algorithm performs better than the PCA-based method in providing suitable binary datasets for lacunarity analysis. The best method can reach an accuracy of 0.8333 in slum identification. Figure 14 shows the slum map of Hyderabad generated by the lacunarity-based slum detection algorithm.
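A minimal sketch of the lacunarity computation on a binary image is given below. It assumes the common gliding-box form, in which lacunarity equals the variance of the box mass divided by the squared mean, plus one, taken over every r-by-r window; the use of the population variance and the function name are assumptions for illustration.

```python
def lacunarity(binary, r):
    """Gliding-box lacunarity of a 2D binary image for box size r.

    binary: list of equal-length rows containing 0 (empty) or 1 (filled)
    r: side length of the square box, in pixels
    """
    rows, cols = len(binary), len(binary[0])
    masses = []
    for i in range(rows - r + 1):
        for j in range(cols - r + 1):
            # Box mass: number of filled pixels inside the r-by-r window.
            mass = sum(binary[i + di][j + dj] for di in range(r) for dj in range(r))
            masses.append(mass)
    mean = sum(masses) / len(masses)
    if mean == 0:
        raise ValueError("lacunarity is undefined for an empty image")
    variance = sum((m - mean) ** 2 for m in masses) / len(masses)
    return variance / mean ** 2 + 1.0


# A homogeneous texture has lacunarity 1; a clumped one has a larger value.
uniform = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
clumped = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

Low lacunarity indicates a dense, translation-invariant texture, which is the property the slum detection approach exploits in tightly packed settlements.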

Figure 14: Slum map in Hyderabad, India. The slum locations (red areas) are identified by the lacunarity-based slum detection algorithm. Two different subsets of the original satellite image together with georeferenced photographs are shown as ground truth: the Rasolpoora slum in the north and the Nagamiah Kunta slum in the east. Figure from Kit2012.

Kit et al. Kit2013 soon improved the lacunarity-based slum detection algorithm by combining two advanced image analysis methods (the Canny edge detection Canny1986 and the line-segment-detection (LSD) straight line detection Gioi2012) to reduce errors in slum identification. Their method identifies plausible and spatially explicit slum locations, which can be verified by a series of ground truthing visits. In particular, this method can capture the changing patterns of slum areas from 2003 to 2010 in Hyderabad, India. Gruebner et al. Gruebner2014 mapped urban slums in Dhaka, Bangladesh, from the visual interpretation of Quickbird data from 2006 to 2010. To avoid small and isolated slums, they filtered the 2006 slums in GIS and defined the changes of 2010 slums over the 2006 polygons to retain border consistency. Accordingly, they produced a slum distribution dataset for the Dhaka metropolitan area.

Engstrom et al. Engstrom2015 mapped slum areas in Accra, Ghana, by utilizing features extracted from the VHR Quickbird images acquired in 2002. They demonstrated that the satellite image-derived slum areas exhibit an overall accuracy of 94.3% when compared with the field-based slum map from the UN Habitat/Accra Metropolitan Assembly (UNAMA). However, the accuracy drops when compared with two census-derived slum maps. Moreover, they found a moderate correlation between the satellite image-derived classification of slums and the census-derived slum index, and the correlation increases after taking into account population density. Kohli et al. Kohli2016 studied the spatial uncertainties related to slum delineations, which are observed from VHR images in Ahmedabad (India), Nairobi (Kenya) and Cape Town (South Africa). They found that the slum identification and delineation for the three contexts are significantly different, suggesting the existential and extensional uncertainty of slums.

VHR imagery allows the monitoring of slums and the analysis of deprived areas. Kuffer et al. Kuffer2016b utilized the gray-level co-occurrence matrix (GLCM) variance to distinguish slum areas in VHR imagery. They showed that the GLCM variance combined with the normalized difference vegetation index (NDVI) can separate slum areas with an overall accuracy of 87%, 88% and 84% for Mumbai (India), Ahmedabad (India) and Kigali (Rwanda), respectively. The overall accuracy can be increased to 90% by adding spectral information to the GLCM within a random forest classifier Breiman2001. Wurm et al. Wurm2017 explored the capabilities of X-band Synthetic Aperture Radar (SAR) data to estimate the extent of poverty in slum areas using the Kennaugh element framework in image preprocessing Schmitt2015. Employing a random forest classifier, they tested different spatial image features at various window sizes to map slums. Results show that GLCM performs very well on slum mapping as it addresses a large spatial neighborhood of the pixels.

Recently, Kuffer et al. Kuffer2016 provided a literature review of slum mapping regarding four dimensions: contextual factors, physical slum characteristics, data and requirements, and slum extraction methods. They argued that the diversity and dynamics of slums have not been well captured due to the complex and diverse morphology of slums. Therefore, a more systematic exploration of physical slum characteristics is required (see Ref. Kuffer2016 for details). They demonstrated that texture-based methods show good robustness, while machine-learning algorithms exhibit the highest reported accuracy. Mahabir et al. Mahabir2018 suggested developing a more comprehensive framework by considering emerging sources of geospatial data (e.g., social media) and combining multiple emerging approaches in technology (e.g., geosensor networks).

3.1.3 Mobile phones track socioeconomic levels

Scientists have explored the relations between social structure and economic development. Woolcock Woolcock1998 provided a brief intellectual history of social capital and economic development. Adler and Kwon Adler2002 synthesized studies on social capital undertaken in various disciplines and developed a common conceptual framework. Granovetter Granovetter2005 suggested several underlying mechanisms on how social structure affects economic outcomes, for example, social networks influence the flow and the quality of information, and social networks are an important source of reward and punishment. Recently, empirical works have demonstrated that social network analysis of large-scale mobile phone (MP) data can be applied to monitor socioeconomic development.

Based on MP data and a socioeconomic metric from the national census, Eagle et al. Eagle2010 investigated the relation between the structure of the communication network and economic development at the population level in the UK (see Figure 15). The socioeconomic metric is the 2004 UK government’s index of multiple deprivation (IMD), which is a composite measure of relative prosperity of communities. They calculated the socioeconomic profile of a region by aggregating the population-weighted average of the IMD for each telephone exchange area. The communication network data covers over 90% of MP users during August 2005, based on which they calculated two diversity metrics of communication ties. The social diversity is defined as the Shannon entropy associated with individual $i$’s communication behavior, normalized by the logarithm of its number of contacts $k_i$. Formally,

$$D_{\mathrm{social}}(i) = \frac{-\sum_{j=1}^{k_i} p_{ij} \log p_{ij}}{\log k_i}, \tag{30}$$

where $p_{ij}$ is the proportion of individual $i$’s call volume that involves individual $j$. A region’s social diversity is then calculated by averaging the social diversities of individuals in that region. The spatial diversity is defined by replacing call volume with geographic distance. Formally,

$$D_{\mathrm{spatial}}(i) = \frac{-\sum_{a=1}^{A} p_{ia} \log p_{ia}}{\log A}, \tag{31}$$

where $A$ is the number of exchange areas, and $p_{ia}$ is the proportion of time that individual $i$ spent on communicating with area $a$. They found that the IMD socioeconomic rank is strongly correlated with both the social diversity and the spatial diversity. The strong correlation persists when using Burt’s measure of “structural holes” (see Ref. Burt1995 for details). Moreover, a composite diversity measure can exhibit an even stronger correlation with socioeconomic status (see Figure 15B). This work takes a significant step towards inferring regional socioeconomic status from MP data.
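The two diversity metrics share one computation: a Shannon entropy divided by the logarithm of the number of categories. A minimal Python sketch follows, assuming that normalization; the function names, the handling of individuals with fewer than two contacts, and the example volumes are illustrative choices rather than details taken from the cited study.

```python
import math


def normalized_entropy(volumes, n_categories=None):
    """Shannon entropy of a volume distribution, scaled to [0, 1].

    volumes: communication volume per contact (social diversity) or per
        exchange area (spatial diversity); zero entries are ignored
    n_categories: number used for normalization; defaults to len(volumes),
        i.e. the number of contacts. Pass the total number of exchange
        areas A when computing spatial diversity.
    """
    k = len(volumes) if n_categories is None else n_categories
    if k < 2:
        return 0.0  # assumed convention: a single contact has no diversity
    total = sum(volumes)
    entropy = -sum((v / total) * math.log(v / total) for v in volumes if v > 0)
    return entropy / math.log(k)


def regional_diversity(individual_diversities):
    """Average the diversities of the individuals living in one region."""
    return sum(individual_diversities) / len(individual_diversities)


# Hypothetical call volumes of three individuals in the same region.
region = [normalized_entropy(v) for v in ([5, 5, 5, 5], [97, 1, 1, 1], [10])]
```

An individual who splits calls evenly among all contacts scores 1, while one who concentrates nearly all calls on a single contact scores close to 0.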

Figure 15: Regional communication diversity and socioeconomic ranking for the UK. (A) Communication networks of regions and regional socioeconomic rank based on IMD. Saturation and width of links correspond to the volume of communications. High rank to low rank of the IMD is represented by light blue to dark blue. (B) The relation between social network diversity and socioeconomic rank. The network diversity was a composite of Shannon entropy and Burt’s measure of structural holes. The fractional polynomial fit to the data is shown in red. Figure from Eagle2010 .

The ubiquitous adoption of MPs in emerging economies provides a new way to track socioeconomic status. Based on call detail records (CDRs) of 0.22 million MP users in an advanced economy and 0.19 million MP users in a developing economy, Rubio et al. Rubio2010 studied human mobility patterns in regions of different socioeconomic levels. They found that individuals in the developing economy travel shorter distances on average, their social networks have smaller geographical sparsity, and these patterns do not change significantly from workweeks to weekends. Later, Frias-Martinez and Virseda Frias2012 explored the relations between behavioral features extracted from large-scale CDRs and socioeconomic indices from country-wide census data in a Latin American country. They found that socioeconomic levels are strongly correlated with expenses, reciprocity of communications, physical distance to contacts, mobility patterns, and some others. Moreover, a multivariate linear regression including MP usage variables can accurately predict census-based variables such as the socioeconomic level. These results suggest that MP-derived human mobility patterns can be used to predict socioeconomic indices at fine scales.

A body of literature has leveraged CDRs to estimate regional socioeconomic status. A widely used CDR dataset contains 2.5 billion calls and SMS exchanges from anonymous customers in Côte d’Ivoire. Smith-Clarke et al. Smith2014 estimated socioeconomic levels of regions in Côte d’Ivoire. They derived some socioeconomic-related features from the communication flows, including activity, gravity residual, network advantage, and introversion. They found that regions with higher call volumes from other regions are more likely to have a higher socioeconomic level. Further, they proposed a simple linear model that estimates socioeconomic status for 255 sub-prefectures in Côte d’Ivoire. Šćepanović et al. Scepanovic2015 extracted different spatial-temporal mobility patterns from the same CDRs in Côte d’Ivoire and used them to predict socioeconomic indices. They showed that the spatial variance of calling frequency can identify rural regions lacking electricity, the spatial variance of the probability density functions of the radius of gyration (see Ref. Gonzalez2008 for the definition) can identify a region’s wealth, and the number of a region’s migrant workers is negatively correlated with the multidimensional poverty index (MPI).

Recently, MP data have been combined with other data sources to study socioeconomic stratification. Leo et al. Leo2016 analyzed a coupled dataset of MP communications and bank transactions for over one million people in Mexico. They constructed a social network based on call/SMS interactions and estimated economic indicators based on bank transactions. After calculating the cumulative distributions of individual average monthly purchase (AMP) and debt (AMD), they found that both wealth and debt are unevenly distributed among people. Further, they studied the social stratification by categorizing users into nine socioeconomic classes using the cumulative AMP function. They observed that people are more densely connected to others of their own class. To quantify this observation, they calculated the “rich-club” coefficient Zhou2004,

$$\rho(P_{>}) = \frac{\phi(P_{>})}{\langle \phi_{\mathrm{rn}}(P_{>}) \rangle}, \tag{32}$$

where $\phi(P_{>}) = \frac{2 L_{P_{>}}}{N_{P_{>}} (N_{P_{>}} - 1)}$ and $\langle \phi_{\mathrm{rn}}(P_{>}) \rangle$ is the average density in random reference networks. Here, $L_{P_{>}}$ and $N_{P_{>}}$ are respectively the number of links and nodes remaining in the communication network after removing nodes with their AMP value smaller than a given threshold $P_{>}$ (see Ref. Leo2016 for details). They found that the rich-club coefficient grows rapidly with $P_{>}$, suggesting an assortative socioeconomic correlation.
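The idea behind the rich-club analysis can be sketched as follows: compute the link density among nodes whose AMP is at least a threshold, then divide it by the average density obtained after randomly reassigning AMP values to nodes. The reference model used here (permuting AMP values while keeping the network fixed), the number of shuffles and all function names are assumptions made for illustration; Ref. Leo2016 should be consulted for the exact null model.

```python
import random


def density_above(edges, amp, threshold):
    """Density 2L / (N (N - 1)) among nodes with AMP >= threshold.

    edges: list of undirected links (u, v), each listed once
    amp: dict mapping node -> average monthly purchase
    """
    kept = {node for node, value in amp.items() if value >= threshold}
    n = len(kept)
    if n < 2:
        return 0.0
    links = sum(1 for u, v in edges if u in kept and v in kept)
    return 2.0 * links / (n * (n - 1))


def rich_club_coefficient(edges, amp, threshold, n_shuffles=100, seed=0):
    """Observed density divided by the mean density under shuffled AMP values."""
    rng = random.Random(seed)
    observed = density_above(edges, amp, threshold)
    nodes = list(amp)
    values = [amp[node] for node in nodes]
    reference = []
    for _ in range(n_shuffles):
        rng.shuffle(values)  # assumed null model: AMP values permuted over nodes
        reference.append(density_above(edges, dict(zip(nodes, values)), threshold))
    baseline = sum(reference) / len(reference)
    return observed / baseline if baseline > 0 else float("nan")
```

Values well above 1 at high thresholds indicate that wealthy individuals are more densely connected to each other than expected by chance.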

Moreover, Leo et al. Leo2016 studied the spatio-socioeconomic correlations by calculating the average geodesic distance between any pair of socioeconomic classes,

$$\langle d_{\mathrm{geo}}(s_i, s_j) \rangle = \frac{1}{|E(s_i, s_j)|} \sum_{(u, v) \in E(s_i, s_j)} d_{\mathrm{geo}}(u, v), \tag{33}$$

where $|E(s_i, s_j)|$ is the number of links between nodes in classes $s_i$ and $s_j$, and $d_{\mathrm{geo}}(u, v)$ is the geodesic distance between zip locations of individuals $u$ and $v$. They found that the distance is always minimal between individuals of the same class, suggesting that individuals from the same socioeconomic class live closest to each other. In addition, there is a positive correlation between individuals’ socioeconomic levels and their typical commuting distances. After further exploring the same coupled dataset, Leo et al. Leo2016b found a strong correlation between identified socioeconomic classes and typical consumption patterns.

3.1.4 Social media reveals socioeconomic status

Social media (SM) data have many appealing advantages, including low acquisition cost, wide geographical coverage and real-time updates, which make it feasible to estimate socioeconomic status at regional and urban scales. For example, Twitter provides a huge number of tweets whose user locations are either directly tagged or can be mined from the content. Cheng et al. Cheng2010 proposed a probabilistic framework to automatically identify words related to locations in tweets and then infer a Twitter user’s location at the city level from the content. They showed that about one hundred tweets are enough for their method to infer a user’s location. This method can place on average 51% of users within 100 miles of their actual locations.

The contents of SM posts have been used to track socioeconomic well-being. Quercia et al. Quercia2012 studied the relations between sentiment expressed in tweets and census-based socioeconomic well-being of communities in London. Specifically, they calculated the word count sentiment score Kramer2010 by counting the number of positive and negative words. Formally,

$$s_i = \frac{p_i - \mu_p}{\sigma_p} - \frac{n_i - \mu_n}{\sigma_n}, \tag{34}$$

where $p_i$ ($n_i$) is the fraction of positive (negative) words for user $i$, $\mu_p$ ($\mu_n$) is the mean of $p_i$ ($n_i$) across all users, and $\sigma_p$ ($\sigma_n$) is the corresponding standard deviation. Then, the gross community happiness (GCH) of a community is calculated by averaging the sentiment scores of users in that community. The GCH is highly correlated with the community’s socioeconomic well-being, suggesting the effectiveness of using tweets to track community well-being.
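The sentiment score and its aggregation into GCH can be sketched as below. The sketch assumes the score is the standardized fraction of positive words minus the standardized fraction of negative words; the use of the population standard deviation, the function names and the toy inputs are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def sentiment_scores(pos_frac, neg_frac):
    """Standardized positive minus standardized negative word fraction per user."""
    mu_p, mu_n = mean(pos_frac), mean(neg_frac)
    sd_p, sd_n = pstdev(pos_frac), pstdev(neg_frac)  # population std, by assumption
    return [
        (p - mu_p) / sd_p - (n - mu_n) / sd_n
        for p, n in zip(pos_frac, neg_frac)
    ]


def gross_community_happiness(scores, communities):
    """Average the user-level sentiment scores within each community."""
    grouped = defaultdict(list)
    for score, community in zip(scores, communities):
        grouped[community].append(score)
    return {community: mean(values) for community, values in grouped.items()}


# Hypothetical fractions of positive and negative words for three users.
scores = sentiment_scores([0.1, 0.2, 0.3], [0.3, 0.2, 0.1])
gch = gross_community_happiness(scores, ["A", "A", "B"])
```

Because both fractions are standardized across all users, the scores are relative: a community's GCH is positive only if its users are more positive and less negative than the average user in the sample.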

Mahmud et al. Mahmud2012 inferred home locations of Twitter users at different granularities using an algorithm that ensembles statistical and heuristic classifiers Jimenez1998. The algorithm achieves a higher performance in predicting Twitter users’ locations compared with the state-of-the-art algorithms. Hasan et al. Hasan2013 analyzed human activity patterns based on tweets with location information. By finding the distributions of different activity categories over a city geography, they characterized aggregate activity patterns and determined the purpose-specific activity distribution maps. Moreover, the timing distribution of visiting different places depends on activity category. Hasan and Ukkusuri Hasan2014 further proposed a data-driven modeling approach based on topic models Blei2003 to infer urban activity patterns from geotagged tweets. Results demonstrated that their model can extract user-specific activity patterns and predict missing activities.

Using Twitter data generated during weekdays in Inner London, Lansley and Longley Lansley2016 applied an unsupervised learning algorithm to classify geo-tagged tweets into 20 distinctive and interpretable topic groupings. They found that users’ socioeconomic characteristics can be inferred from their behaviours on Twitter. In particular, users whose neighbourhoods are of higher socioeconomic levels tend to tweet optimistically and discuss business, networking and leisure. Huang and Wong Huang2016 explored to what extent Twitter data can be used to support the activity pattern analysis of users with different socioeconomic status. Activity patterns of Twitter users in Washington, D.C. were analyzed, and their socioeconomic levels were inferred by incorporating census data. Results showed that socioeconomic status remarkably affects users’ activity patterns. Moreover, the urban spatial structure is a key factor that affects the variation in activity patterns among users from different communities. In particular, the mid-income group, rather than the most affluent group, may have the shortest travel distances. In addition, affluent residents are more internationally oriented than mid-income and poor residents.

Figure 16: The spatial distributions of (A) the number of registered users in Weibo and (B) the values of GDP in the 282 prefecture-level cities of China in 2012. Figure from Liu2016 .

Liu et al. Liu2016 collected the registered location information of nearly 200 million Weibo users from 2009 to 2012 and explored the relationship between online activities and socioeconomic indices. Specifically, the online activity is estimated by the number of registered users (UN), and the socioeconomic indices are resident population (RP), GDP and GDP per capita. Figure 16 presents the spatial distributions of registered Weibo users (left) and the values of GDP (right). After calculating two correlation coefficients (Pearson coefficient Stigler1989 and Spearman’s rank coefficient Myers2010), they showed that UN is strongly correlated with socioeconomic indices. For example, UN and GDP are strongly correlated under both coefficients (see Liu2016 for the values). These results demonstrate that socioeconomic status can be inferred from online social activity at the city level. Of particular significance, they further proposed a method to detect a few abnormal cities, whose GDP is much higher than that of other cities with the same number of registered users. These GDP winners have a less diverse economic structure and are highly dependent on some specific resources. In fact, these cities’ economies experienced a huge loss after 2013 due to the market price fluctuation of non-renewable energy resources and rare earths.
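As an illustration of the two coefficients used above, the following sketch computes the Pearson coefficient and Spearman’s rank coefficient (Pearson applied to average ranks) in plain Python. The city values are hypothetical and the helper names are chosen here for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rank coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical cities: registered users (UN) and GDP in arbitrary units.
users = [120, 300, 450, 900, 5000]
gdp = [10, 18, 35, 40, 95]
rho_pearson = pearson(users, gdp)
rho_spearman = spearman(users, gdp)
```

Because the toy GDP values increase monotonically with the user numbers, the rank coefficient reaches its maximum while the Pearson coefficient stays below it, which is why heavy-tailed city data are often inspected with both.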

The structure of location-based social networks (LBSN) has been linked to socioeconomic development. Wang et al. Wang2019b estimated regional economic status based on the structures of information flow and talent mobility networks (see Figure 17). Specifically, the online information flow network is built on the following relations among about 433 million Weibo users (see also Ref. Liu2016 ), and the offline talent mobility network is built on the resumes of about 142 thousand anonymized Chinese job seekers with higher education (see Ref. Yang2018b for details). They calculated ten network structural features such as spatial and topological diversities and then linked them to regional economic indices. They found that structural features of both networks are relevant to economic status, while the talent mobility network exhibits a stronger predictive power for regional GDP. Further, they constructed a composite index of structural features, which can explain up to about 84% of the variance in regional GDP.

Figure 17: Networks of online information flow and offline talent mobility. Nodes represent provinces, where the size of node corresponds to the province’s GDP in natural logarithmic form in 2016, and the layout of node corresponds to the geographical location of the province’s capital city. (A) The online information flow network, where the link weight corresponds to the number of followings on Weibo. (B) The offline talent mobility network, where the link weight corresponds to the number of moved talents as recorded by their resumes. Figure from Wang2019b.

Based on data from the SM platform Gowalla Nguyen2012 with friendship information and geo-locations, Holzbauer et al. Holzbauer2016 studied the relations between regional economic status and quantitative measures of social ties in the US during 2009-2012. They found that cross-state long ties are strongly correlated with three economic measurements, namely, GDP, the number of patents, and the number of startups (see Holzbauer2016 for the correlation values), while short ties are much less predictive. This finding highlights the role of long ties in supporting regional innovation and economic development. Recently, Norbutas and Corten Norbutas2018 explored the relations between network structure and economic prosperity of 438 municipalities in the Netherlands by analyzing data of over 10 million users on the Dutch online social network Hyves. They found that network diversity in terms of geographical distance Scellato2010, rather than contacts’ topological diversity Eagle2010, exhibits a positive correlation with economic prosperity, while network density at the community level and network modularity Guimera2004 ; Newman2006 are negative predictors of economic status.

SM data have also been used to measure socioeconomic deprivation of regions (e.g., low level of economic status and lack of education) and quantify landscape values (e.g., the values shaped by the recreational and cultural services and benefits provided by landscapes). Venerandi et al. Venerandi2015 proposed a method to automatically mine deprivation from two datasets of urban elements in the physical environment at a fine level in the UK. The two datasets are respectively collected from Foursquare, a mobile social-networking application with check-ins, and OpenStreetMap, an openly accessible global map with geographical positions, names and categories. They defined the offering advantage to identify distinctive urban elements of each neighborhood (see Ref. Venerandi2015 for details) and built accurate classifiers of urban deprivation that can be verified by the census-based IMD. Later, van Zanten et al. Zanten2016 analyzed data from three online SM platforms (Panoramio, Flickr and Instagram). They found that data from these three platforms reveal similar patterns of landscape values. In particular, a significant portion of observed variation across different platforms can be explained by variables describing accessibility, population density, income, mountainous terrain, proximity to water, and so on.

3.2 Industrial structure and development path

Data from many new sources have been used to quantify economic structure and analyze industrial diversification, including large-scale social media data, labor market data, trade data, publicly listed firm data, and so on. In this subsection, we will briefly introduce the quantification of regional industrial structure, the role of relatedness in economic diversification, the collective learning effects, and the strategies for regional economic development.

3.2.1 Economic structure and relatedness

Economic development is not only a process of continuously improving the production of the same goods and the occupation of the same industries Lin2011, but also one that requires structural transformation toward new economic activities associated with higher levels of productivity Hausmann2007 ; Hidalgo2007. This implies that economic development and industrial diversification are path-dependent processes in which structural transformation plays an important role. Revealing the industrial structure of regions and quantifying the relatedness of industries are critical for understanding the development paths of regions and the evolution patterns of regional economic diversification.

Data from LBSNs have been used to reveal regional economic structure. Based on location information of Weibo users, Liu et al. Liu2016 proposed an effective method to uncover the city-level macro economic structure in China. They employed the linear least square method to model the relations between user number (UN) and GDP for 282 prefecture-level cities. They found that cities below the fitted line are likely to be driven by the tertiary industries, while cities above the fitted line tend to focus on the secondary industry. They quantified the deviation of each city from the fitted line with a measure defined from the value of the fitted line for that city and the city’s corresponding GDP in logarithmic form (see Liu2016 for the exact definition). Further, they used this measure to predict the macro-economic structure, in particular GDP, by employing support vector regression (SVR) Cortes1995 ; Smola2004. They found that the user number (UN) performs better in predicting GDP than some macroeconomic indices such as population and average GDP. This work shows the capacity of online social activity in revealing industrial structure and estimating economic status at the regional level.
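A fit of this kind can be sketched as follows. This is only an illustration under stated assumptions: the regression is run on logarithms of both variables, the deviation is taken as the fitted value minus the observed log GDP (the sign convention and exact form in Liu2016 may differ), and the city data are hypothetical.

```python
from math import log

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

# Hypothetical cities: (number of registered users, GDP).
cities = {"c1": (1e4, 2e3), "c2": (1e5, 1e4), "c3": (1e6, 2e5), "c4": (1e5, 4e4)}

log_un = [log(un) for un, _ in cities.values()]
log_gdp = [log(g) for _, g in cities.values()]
a, b = fit_line(log_un, log_gdp)

# Assumed deviation measure: fitted value minus observed log GDP.
# Negative values mark cities above the fitted line, positive values below it.
deviation = {name: (a + b * log(un)) - log(g)
             for name, (un, g) in cities.items()}
```

Here cities c2 and c4 have the same number of users but different GDP, so they fall on opposite sides of the fitted line and receive deviations of opposite sign.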

Based on China’s publicly listed firm data from 1990 to 2015, Gao et al. Gao2017 quantified the regional industrial structure of China by constructing a network of related industries, named industry space. First, they estimated the proximity between industries $\alpha$ and $\beta$ by calculating the cosine similarity. Formally, let $x_{i,\alpha,t}$ and $x_{i,\beta,t}$ be the number of firms in province $i$ operating respectively in industries $\alpha$ and $\beta$ at year $t$; the proximity is given by:

$$\phi_{\alpha,\beta,t} = \frac{\sum_i x_{i,\alpha,t}\, x_{i,\beta,t}}{\sqrt{\sum_i x_{i,\alpha,t}^2}\,\sqrt{\sum_i x_{i,\beta,t}^2}} \tag{35}$$

Then, based on the proximity $\phi_{\alpha,\beta,t}$, they built the industry space that highlights the relatedness between 70 industries at the sub-sectoral level. China’s industry space exhibits both a core-periphery structure and a dumbbell structure with a big tightly knit core of manufacturing industries and some small tightly knit cores of service- and information-related activities. Figure 18 presents the evolution of industrial structure of four provinces in China (Beijing, Hebei, Shanghai and Zhejiang) from 1995 to 2015, with black circles showing the industries of presence in the industry space. Specifically, the presence of industry $\alpha$ in province $i$ at year $t$ is identified by the revealed comparative advantage being over 1 (i.e., $\mathrm{RCA}_{i,\alpha,t} > 1$) Balassa1965. Beijing and Shanghai gradually occupied Internet and financial services, while Hebei and Zhejiang gradually occupied manufacturing industries. By analyzing the same data, Gao and Zhou Gao2018 found that provinces located along the coast tend to be industrially sophisticated with a high level of economic complexity. Moreover, the provinces’ ranks by their economic complexity are relatively stable during the considered period.
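The two building blocks of the industry space, cosine-similarity proximity and presence via revealed comparative advantage, can be sketched as below. The firm counts and names are hypothetical, and computing the Balassa index directly on firm counts is an assumption of this sketch.

```python
from math import sqrt

def proximity(firms, a, b):
    """Cosine similarity between the province-count vectors of industries a and b."""
    dot = sum(row[a] * row[b] for row in firms.values())
    norm_a = sqrt(sum(row[a] ** 2 for row in firms.values()))
    norm_b = sqrt(sum(row[b] ** 2 for row in firms.values()))
    return dot / (norm_a * norm_b)

def rca(firms, province, industry):
    """Balassa index: the industry's share in the province over its overall share."""
    total = sum(sum(row.values()) for row in firms.values())
    province_total = sum(firms[province].values())
    industry_total = sum(row[industry] for row in firms.values())
    return (firms[province][industry] / province_total) / (industry_total / total)

# Hypothetical firm counts per province and industry for a single year.
firms = {
    "P1": {"steel": 8, "autos": 6, "software": 1},
    "P2": {"steel": 4, "autos": 3, "software": 0},
    "P3": {"steel": 0, "autos": 1, "software": 9},
}

phi_steel_autos = proximity(firms, "steel", "autos")
phi_steel_software = proximity(firms, "steel", "software")
present = {(p, k): rca(firms, p, k) > 1 for p in firms for k in firms[p]}
```

In the toy data, steel and autos co-locate in the same provinces and are therefore much closer to each other than steel and software, while software is marked as present only in P3.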

Figure 18: Evolution of China’s regional industrial structure from 1995 to 2015. The industry spaces of four provinces are illustrated, including Beijing, Hebei, Shanghai and Zhejiang. Black circles highlight industries that are present in the industry space of the corresponding province. Figure from Gao2017 .

Data on industrial enterprises from national statistical bureaus can also be used to portray the production space as a representation of industrial structure. Based on data of four-digit manufacturing sectors from China’s State Statistical Bureau covering the period of 1999-2007, Guo and He Guo2017 calculated the inter-sector relatedness and produced the production space consisting of 424 manufacturing sectors. The production space in 1999 has a core-periphery structure with a major core of electric apparatus, electronic and telecommunications equipment, and a small sub-core cluster consisting of food products, chemical and non-metallic mineral products. The small sub-core cluster developed into an important and dense core of the production space in 2007. They further found that China’s regions underwent substantial structural change from 1999 to 2007 with different magnitudes, and that industrial evolution has a strong tendency toward path dependence, where regional development is rooted in and subject to the preexisting economic structure.

Economic relatedness contributes significantly to regional industrial diversification. By analyzing Italian trade data during 1995-2003, Boschma et al. Boschma2009 presented strong evidence in support of the fact that related variety contributes to regional economic growth. They grouped products into related variety sets based on the industrial classification and then calculated the related variety index of product by

(36)

where is the total exports of region , and is the entropy within the related variety set . Formally, the entropy of region is given by

(37)

They found that a region benefits from extra-regional knowledge originating from related sectors that are already present in that region. Later, Boschma and Frenken Boschma2012b demonstrated that technological relatedness affects the process of knowledge spillovers, which benefits regions with different but technologically related activities. As a result, new industries are likely to emerge from related industries. However, the process occurs primarily at the regional level as knowledge spillovers are geographically bounded. Using trade data of 50 Spanish provinces during 1995-2007, Boschma et al. Boschma2012 further investigated whether related variety affects regional growth. They calculated two measures of relatedness between industries: the related variety index Boschma2009 and the proximity index Hidalgo2007. They found that Spanish provinces with a variety of related industries exhibit higher rates of economic growth.
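For readers unfamiliar with entropy-based variety measures, the sketch below computes related variety in its commonly used form: the within-group entropy of product shares, weighted by each group’s share. This standard decomposition is an assumption here and is not guaranteed to match the exact notation of Eqs. (36) and (37); the export values and the grouping are hypothetical.

```python
from math import log2

def related_variety(exports, group_of):
    """Weighted sum of within-group entropies: sum_g P_g * H_g.

    With p the share of a product and P_g the share of its group, each
    product contributes P_g * (p / P_g) * log2(P_g / p) = p * log2(P_g / p).
    """
    total = sum(exports.values())
    shares = {k: v / total for k, v in exports.items() if v > 0}
    group_share = {}
    for product, p in shares.items():
        g = group_of[product]
        group_share[g] = group_share.get(g, 0.0) + p
    return sum(p * log2(group_share[group_of[product]] / p)
               for product, p in shares.items())

# Hypothetical regional exports by product, and the related set of each product.
exports = {"a1": 25, "a2": 25, "b1": 50}
group_of = {"a1": "A", "a2": "A", "b1": "B"}
rv = related_variety(exports, group_of)
```

A region whose exports are spread over several products within the same related set scores higher than a region with one product per set, whose related variety is zero.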

By analyzing the US patent data during 1977-1999, Castaldi et al. Castaldi2015 showed that related variety can enhance the innovation of a new technology, while unrelated variety can enhance technological breakthroughs. Boschma et al. Boschma2015 investigated technological relatedness at the city level and technological change in 366 US cities by analyzing the US patent data during 1981-2010. They found that the level of relatedness with existing technologies increases the entry probability of a new technology in a city. Balland et al. Balland2015 discussed the co-evolutionary dynamics between proximity and knowledge ties. They found that proximities might gradually increase due to past knowledge ties. In particular, the co-evolutionary dynamics can be captured by the processes of learning (cognitive proximity), decoupling (social proximity), agglomeration (geographical proximity), integration (organizational proximity) and institutionalization (institutional proximity). Acemoglu et al. Acemoglu2016 measured the strength of technological flows between technology subcategories using data of 1.8 million US patents and their citation properties during 1975-2004. They found that related pre-existing technological developments have a strong predictive power for future innovations.

A body of literature has demonstrated that relatedness plays an important role in economic development. Indeed, recent empirical evidence has generalized the principle of relatedness Hidalgo2018, which describes the probability that an economy develops or loses an economic activity as a function of the density of its related activities in that economy. Jun et al. Jun2017 studied the role of relatedness in the evolution of bilateral trade. They found that product relatedness, importer relatedness and exporter relatedness can increase a country’s exports of a product. Boschma Boschma2017b provided a valuable future research agenda regarding relatedness as a driver of regional diversification and suggested focusing on the role of economic and institutional agents. Davids and Frenken Davids2018 recently showed that the type of knowledge being mobilized and produced determines the relative importance of proximity dimensions. They proposed a framework that combines the proximity dimensions with different types of knowledge in the innovation process.

3.2.2 Collective learning in economic development

In addition to related varieties, geographic knowledge also plays a crucial role in regional economic development. Boschma et al. Boschma2013 demonstrated that capabilities that enable the development of new industries are regionally specific, supporting the hypothesis that knowledge decays strongly with distance in its diffusion process Keller2002. Due to the localized nature of knowledge diffusion, neighboring regions should share more similar knowledge and exhibit a geographically correlated pattern in production structure and economic growth Bahar2014. Scientists have revealed the role of geographic neighbors and highlighted it as an alternative channel for development. Indeed, recent literature has focused on the effects of collective learning (the learning that takes place at the scale of teams, organizations, regions, and nations) by highlighting two learning channels, namely, inter-industry learning (from related industries) and inter-regional learning (from neighboring regions) Lawson1999 ; Gao2017.

The effects of geographic knowledge spillovers on firm survival and industry development have been studied based on data at regional, firm and plant levels. Acs et al. Acs2007 analyzed annual data of 11 million establishments in the US private sectors during 1989-1998. By incorporating knowledge spillovers through a geographical variation model, they investigated the relations between regional human capital stocks and new-firm survival. They found that knowledge spillovers lead to higher rates of new-firm survival. Holmes Holmes2011 studied the geographic expansion of Wal-Mart stores in the US by analyzing store-level data on sales. He found that locations of new Wal-Mart stores tend to be close to regions where Wal-Mart already had a high density of stores. Broekel and Boschma Broekel2012 analyzed data from 59 organizations in the Dutch aviation industry. They uncovered that geographical proximity serves as a driver of network formation and is a stimulus for firms’ innovative performance after controlling for the effects of other proximities.

After analyzing a dataset summarizing individual work history, Jara-Figueroa et al. Jara2018 found that the growth and survival of new firms in a location increase when they hire workers with location-specific and industry-specific knowledge instead of occupation-specific knowledge. Moreover, industry-specific knowledge plays a more important role for pioneer than for non-pioneer firms. Using network clustering techniques, Alabdulkareem et al. Alabdulkareem2018 analyzed a dataset detailing the importance of 161 workplace skills for 672 occupations in the US. They found that skills exhibit a polarization into two clusters: the social-cognitive skills of high-wage occupations and the sensory-physical skills of low-wage occupations. Moreover, workers in occupations relying heavily on one skill cluster are likely to move to other occupations within the same skill cluster, suggesting that the polarized skill network constrains the career mobility of workers.

Based on the international trade data during 1962-2000, Bahar et al. Bahar2014 studied the effects of neighboring countries on the evolution of a country’s exporting basket. They measured the similarity in countries’ export structure by defining an export similarity index (ESI) through the Pearson correlation coefficient. Formally, the ESI between countries $c$ and $c'$ is given by

$$S_{c,c'} = \frac{\sum_p (r_{c,p} - \bar{r}_c)(r_{c',p} - \bar{r}_{c'})}{\sqrt{\sum_p (r_{c,p} - \bar{r}_c)^2 \, \sum_p (r_{c',p} - \bar{r}_{c'})^2}} \tag{38}$$

where $r_{c,p} = \ln(R_{c,p} + \varepsilon)$ with a small positive constant $\varepsilon$, and $\bar{r}_c$ is the average value of $r_{c,p}$ over all products for country $c$. Here, $R_{c,p}$ is the revealed comparative advantage (RCA) Balassa1965 of country $c$ and product $p$, which is calculated by Eq. (11). They found that neighboring countries have significantly larger ESI values than non-neighbors, and that ESI is negatively correlated with geographical distance. After using regressions to discount the effects of product relatedness, they further found that a country’s probability to export a new product increases significantly (on average, 65% larger) if it has neighboring countries that are already successful exporters of that product.
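The index can be sketched in plain Python as a Pearson correlation between log-transformed RCA vectors. The log transform with an offset, the offset value `EPS = 0.1`, and the toy RCA vectors are assumptions of this illustration.

```python
from math import log, sqrt

EPS = 0.1  # assumed offset so that the logarithm is defined when RCA = 0

def esi(rca_c, rca_d, eps=EPS):
    """Pearson correlation between log(RCA + eps) vectors of two countries."""
    r_c = [log(v + eps) for v in rca_c]
    r_d = [log(v + eps) for v in rca_d]
    m_c, m_d = sum(r_c) / len(r_c), sum(r_d) / len(r_d)
    num = sum((a - m_c) * (b - m_d) for a, b in zip(r_c, r_d))
    den = sqrt(sum((a - m_c) ** 2 for a in r_c)
               * sum((b - m_d) ** 2 for b in r_d))
    return num / den

# Hypothetical RCA values over the same four products.
country_a = [2.0, 1.5, 0.2, 0.0]
country_b = [1.8, 1.2, 0.3, 0.1]   # export basket similar to country_a
country_c = [0.0, 0.2, 1.5, 2.0]   # specializes in the opposite products

esi_ab = esi(country_a, country_b)
esi_ac = esi(country_a, country_c)
```

Countries with similar export baskets obtain an ESI close to 1, while countries specialized in opposite products obtain a negative value.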

Based on the survey data of 295 firms in 8 European regions, Broekel and Boschma Broekel2016 studied the geographical and cognitive structure of knowledge links. They found that firms’ knowledge exchanges differ in their cognitive and geographical dimensions. In particular, connecting with technologically related and similar organizations as well as organizations at various geographical levels (regional and non-regional) can enhance the innovations of firms. By analyzing the data of US state-level exports during 2000-2012, Boschma et al. Boschma2017 found that a state in the US has a higher probability (about 58%) of developing a new industry if it has a neighbouring state specialized in that industry. Further, they tested whether neighboring regions have more similar export patterns by including the ESI given by Eq. (38) in the regression model. They found that the ESI between a pair of states rises by 0.43 standard deviations if the two states share a border in the US.

In short, the previous literature has demonstrated two collective learning channels in regional economic development: inter-industry learning, which involves learning from related industries, and inter-regional learning, which involves learning from neighboring regions. Using publicly listed firm data describing the evolution of China’s economy between 1990 and 2015, Gao et al. Gao2017 formalized these two collective learning effects. For inter-industry learning, they calculated the density of active related industries ($\omega$) by counting the related industries that are already present (i.e., with revealed comparative advantage over 1) in that province, weighted by their proximity (see also Ref. Boschma2013). The density for industry $\alpha$ in province $i$ at year $t$ is given by

$$\omega_{i,\alpha,t} = \frac{\sum_{\beta} \phi_{\alpha,\beta,t}\, U_{i,\beta,t}}{\sum_{\beta} \phi_{\alpha,\beta,t}} \tag{39}$$

where $\phi_{\alpha,\beta,t}$ is the proximity between industries $\alpha$ and $\beta$ given by Eq. (35), and the binary variable $U_{i,\beta,t} = 1$ if province $i$ has advantage in industry $\beta$ at year $t$ (i.e., $\mathrm{RCA}_{i,\beta,t} > 1$), and $U_{i,\beta,t} = 0$ otherwise. They found that the probability for a province to develop a new industry in the next five years increases with $\omega$ (see Figure 19A), supporting the inter-industry learning effect. For inter-regional learning, they calculated the density of active neighboring provinces ($\Omega$) by counting the neighboring provinces that have developed advantage in an industry, weighted by the inverse of their distance. The density for province $i$ in industry $\alpha$ at year $t$ is given by

$$\Omega_{i,\alpha,t} = \frac{\sum_{j \neq i} U_{j,\alpha,t} / D_{i,j}}{\sum_{j \neq i} 1 / D_{i,j}} \tag{40}$$

where $D_{i,j}$ is the geographic distance between two provinces $i$ and $j$. They found that the probability that a province will develop a new industry in the next five years increases with $\Omega$ (see Figure 19B), supporting the inter-regional learning effect.
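Both densities are weighted shares and can be sketched with two short functions. The weighting scheme (proximity weights for related industries, inverse-distance weights for neighboring provinces), the omission of the self-pair from the toy dictionaries, and all numbers below are assumptions of this illustration rather than the authors’ code.

```python
def related_density(phi_row, present):
    """Proximity-weighted share of related industries already present.

    phi_row: proximity of the target industry to every other industry.
    present: 1 if the province has advantage in that industry, else 0.
    """
    total = sum(phi_row.values())
    return sum(phi_row[b] * present[b] for b in phi_row) / total

def neighbor_density(dist_row, has_industry):
    """Inverse-distance-weighted share of other provinces with the industry.

    dist_row: geographic distance from the target province to every other one.
    has_industry: 1 if that province has advantage in the industry, else 0.
    """
    total = sum(1.0 / d for d in dist_row.values())
    return sum(has_industry[j] / dist_row[j] for j in dist_row) / total

# Hypothetical target: industry "autos" in province "P1".
phi_autos = {"steel": 0.9, "software": 0.1}      # proximities to autos
present_in_p1 = {"steel": 1, "software": 0}      # advantage of P1 by industry
dist_from_p1 = {"P2": 100.0, "P3": 400.0}        # distances in km
autos_elsewhere = {"P2": 1, "P3": 0}             # who already has autos

omega = related_density(phi_autos, present_in_p1)
big_omega = neighbor_density(dist_from_p1, autos_elsewhere)
```

In this example both densities are high because the closely related industry (steel) is already present in P1 and the nearby province (P2) is already active in autos.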

Figure 19: Quantifying the collective learning effects. (A) and (B) are the corresponding marginal probability distributions of new industries present in the next five years, given the density ($\omega$) of active related industries and the density ($\Omega$) of active neighboring provinces, respectively. (C) Joint probability of a new industry developing revealed comparative advantage in a province in the next five years, with the two densities on the horizontal and vertical axes. The color marks the joint probability of new industries present after dividing the two densities into bins. Figure after Gao2017.

Furthermore, Gao et al. Gao2017 explored the interaction between inter-regional and inter-industry learning effects. They calculated the joint probability that a new industry will emerge in a province as a function of both densities, that of active related industries ($\omega$) and that of active neighboring provinces ($\Omega$). They found that the probability for a province to develop a new industry in a five-year period increases with both $\omega$ and $\Omega$ (see Figure 19C). After using a probit model to check the robustness of the results, they demonstrated that the inter-industry and inter-regional learning effects are jointly significant. Interestingly, the regression coefficient of the two densities’ interaction term is negative and significant, suggesting the presence of diminishing returns (see Ref. Gao2017 for details). The observation means that, when one learning channel is sufficiently active (inter-industry or inter-regional), the marginal contribution of the other one is reduced. In other words, the two collective learning channels are substitutes for economic development. These empirical findings have been tested and generalized for countries at various stages of development based on different types of data. For example, Gao et al. Gao2017c analyzed over 300 million Brazilian labor records and found evidence in support of the collective learning effects in Brazilian regional economic development.

As geographic knowledge diffusion requires direct forms of human interaction Arrow1969, the construction of high-speed rails (HSRs) is likely to facilitate market integration and knowledge spillovers. Using data of China’s HSRs, Zheng and Kahn Zheng2013 demonstrated that bullet trains help improve the quality of life of the urban population, as HSR entry allows individuals to access the megacity without living within its boundaries. To explore the impact of HSRs on regional economic activities, Li et al. Li2016b developed the geographically network weighted regression that incorporates the changes in network-based travel time from HSRs. They found that HSRs have significantly changed the spatial redistribution of economic activities in regions of China. Later, based on data of prefectural-level cities in China during 1990-2013, Ke et al. Ke2017 explored how HSRs affect the economic growth of cities. They found that the local economic gains are greater for cities connected by HSRs. Meanwhile, Qin Qin2017 found a mild impact of HSR upgrades on economic growth in China’s prefecture-level cities, while the peripheral regions along the upgraded HSRs (e.g., counties close to high-speed rail stations) experienced an investment-driven reduction (3-5%) in GDP and GDP per capita after 2007.

The effects of modern transportation (e.g., HSRs and flights) on economic development and knowledge spillovers have also been studied in developed countries. For the European Union, Cheng et al. Cheng2015 explored the contribution of HSRs in promoting economic integration. They found that local economic development is not necessarily led by transport improvements alone, especially when this involves cross-border links. By analyzing the northwest European HSRs and the UK’s first HSR, Vickerman Vickerman2018 found that transport infrastructure by itself is unlikely to have a transformative effect on the economy, but it can contribute to such an effect when coupled with policy interventions such as policies related to complementary planning and policies towards labour markets. Ahlfeldt and Feddersen Ahlfeldt2018 analyzed the economic impact of the German HSR and found that HSR has a causal effect (on average about 8.5%) on GDP growth in the regions of intermediate stops. Moreover, the strength of spillovers halves every 30 minutes of travelling time and diminishes to zero after about 200 minutes. Besides HSRs, a reduction in travel cost brought by cheaper flights can also facilitate knowledge spillovers reflected by scientific collaborations. Catalini et al. Catalini2016 analyzed a scientist-level dataset covering all US chemistry faculty members during 1991-2013. They found that scientific collaborations increase by 50% after Southwest Airlines opens a new route, showing that face-to-face interactions can enhance scientific collaborations.

To address endogeneity concerns about inter-regional learning, Gao et al. Gao2017 applied the difference-in-differences (DID) analysis Bertrand2004 and used the introduction of HSRs as an adequate instrument. The underlying intuition is that HSR entry reduces the barriers to inter-regional learning but should not affect inter-industry learning. Specifically, they used the DID analysis to test whether provinces connected by HSRs increased their industrial similarity and experienced a boost in the productivity of shared industries. The industrial similarity between a pair of provinces and at year is measured by

(41)

where and . They found that the industrial similarity () decays strongly with the geographic distance (), and HSRs entry significantly increases the industrial similarity between provinces connected by HSRs. Moreover, the labor productivity (measured by the revenue per worker) increases in the provinces connected by HSRs, supporting the hypothesis that HSRs entry promotes inter-regional learning.
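The logic of the DID analysis can be illustrated with the textbook two-group, two-period estimator on group means. This is a simplified sketch: the study itself relies on regressions, and the similarity values below are hypothetical.

```python
def did_estimate(rows):
    """2x2 difference-in-differences on group means.

    rows: iterable of (treated, post, outcome), where `treated` marks province
    pairs connected by HSR and `post` marks observations after HSR entry.
    """
    cells = {}
    for treated, post, y in rows:
        cells.setdefault((treated, post), []).append(y)
    m = {k: sum(v) / len(v) for k, v in cells.items()}
    change_treated = m[(True, True)] - m[(True, False)]
    change_control = m[(False, True)] - m[(False, False)]
    return change_treated - change_control

# Hypothetical industrial similarity of province pairs before and after HSR entry.
rows = [
    (True, False, 0.30), (True, False, 0.34),
    (True, True, 0.45), (True, True, 0.49),
    (False, False, 0.28), (False, False, 0.32),
    (False, True, 0.33), (False, True, 0.37),
]
effect = did_estimate(rows)
```

Subtracting the change observed in unconnected pairs removes the common time trend, so the remaining difference is attributed to HSR entry under the usual parallel-trends assumption.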

3.2.3 Development paths and strategies

Regional industrial diversification has been suggested to be a strongly path-dependent process, where economic relatedness plays a significant role. Based on plant-level data of 70 Swedish regions during 1969-2002, Neffke et al. Neffke2011 identified related industries using the revealed relatedness (RR) measure Neffke2008. They found that the probability that an industry will enter (exit) a region increases (decreases) with the number of related industries already present in that region. Neffke et al. Neffke2012 further studied the effects of technological relatedness on plant survival in Sweden during 1970-2004. They found that the plant survival rates are increased by the presence of technologically related local industries. Further, Neffke and Henning Neffke2013 investigated how the skill relatedness of industries affects the diversification of firms by calculating the RR measure based on the labor flow data covering about 4.5 million workers in 400 industries in Sweden during 2004-2007. They found that firms tend to diversify into industries that require skills strongly related to the firms’ existing industries. These works suggest the predictive power of skill relatedness for firm diversification.

Some literature has also explored the role of relatedness in regional development, industrial structural change and firm survival in China He2016b. Howell et al. Howell2018 analyzed the data of over 13 million entrepreneurial firms in China during 1998-2007. They found that local related variety has a stronger positive effect than other types of agglomeration on new firm survival. Moreover, the intensity and location of governmental support affect post-entry performance and survival of firms. He et al. He2017 analyzed the annual survey of industrial firms in China during 1998-2005. They found that private enterprises rely more on market-oriented institutions, while firms with local governmental support and industrial linkages are more likely to survive. He et al. He2018 later analyzed firm-level data of manufacturing industries in China during 1998-2008. They found that regions tend to develop new industries that are technologically related to the existing portfolio. These results demonstrate that regional industrial development is a path-dependent process in which new industries tend to be related to pre-existing ones.

Recently, Gao Gao2017e investigated how to maximize the learning from related industries (i.e., inter-industry learning) and neighboring regions (i.e., inter-regional learning) by leveraging the Brazilian labor data. He used a simple variant of the threshold model Watts2002 to simulate the diversification of industries on real networks. In the threshold model, a region or an industry will be activated if over half of its neighbors are already active Gao2015b. For inter-industry learning, simulations are based on the Brazilian industry space Gao2017c, and the set of initial industries is selected according to a tunable balancing index of core and periphery industries. Gao Gao2017e found an optimal strategy that results in a good tradeoff between core and periphery industries in the initial activation. For inter-regional learning, simulations are based on the adjacency network of regions, augmented with one spatial link added between each pair of regions Gao2015b, and the set of initial industries is randomly selected. The lengths of spatial links are determined by a tunable balancing index of nearby and distant regions. The result suggests an optimal strategy that strikes a balance between nearby and distant regions in establishing new spatial connections. These findings demonstrate that there are optimal strategies for both channels that can maximize the learning effects in industrial diversification.
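The activation rule of the threshold model can be sketched as a synchronous cascade on a small network. This is a generic illustration of the rule stated above (activation when more than half of the neighbors are active); the network, the seeds and the function name are hypothetical, and the tunable seed-selection and link-length strategies of the study are not modeled.

```python
def threshold_cascade(neighbors, seeds, threshold=0.5):
    """Run a threshold cascade until no further node can be activated.

    A node becomes active once more than `threshold` of its neighbors are
    active. Updates are synchronous: each round uses the previous round's state.
    """
    active = set(seeds)
    while True:
        newly = {
            node for node, nbrs in neighbors.items()
            if node not in active and nbrs
            and sum(n in active for n in nbrs) / len(nbrs) > threshold
        }
        if not newly:
            return active
        active |= newly

# Hypothetical network of industries (or regions) as an adjacency list.
neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

spread_from_core = threshold_cascade(neighbors, {"a", "b"})
spread_from_periphery = threshold_cascade(neighbors, {"e"})
```

Seeding two well-connected nodes eventually activates the whole toy network, whereas seeding the single peripheral node activates nothing else, which mirrors why the choice of initial industries matters.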

Figure 20: Strategic diffusion in networks, and the diversification of products. (A) A wheel network. (B) Time needed to activate all nodes in the wheel network as a function of the time when the hub is targeted (i.e., the fraction of active nodes). The red line indicates the optimal time based on the strategic diffusion. (C) Box-plot presenting the average degree of newly developed products in the product space as a function of the fraction of developed products. The red line shows the fit for the empirical values, and the purple line shows the optimal mix of greedy and high degree strategies obtained via numerical simulations. The black line shows the null model baseline which uses the random strategy. Figure after Alshamsi2018 .

As suggested by many empirical studies, countries and regions are likely to develop economic activities that are closely related to what they have already developed, yielding the principle of relatedness Hidalgo2018 . In other words, the probability of developing a new industry in a region increases with the density of the region's developed industries that are related to that industry. As the product space Hidalgo2007 and the industry space Gao2017 ; Gao2017e have a core-periphery structure, the difficulty and opportunity of developing products and industries differ across locations in the space. Alshamsi et al. Alshamsi2018 explored the optimal diversification strategies in the product space (see Figure 20). They showed that the high-degree strategy, i.e., always targeting the potential products with the highest degree (e.g., products in the core), will result in a long activation time, while the low-degree strategy, i.e., always targeting the potential products with the lowest degree (e.g., products in the periphery), will miss the opportunity for rapid development. In order to minimize the total time needed to develop all products, they proposed a method named strategic diffusion to identify the products that are optimal to target at each time step. The optimal strategy targets core products during a narrow and specific time window, which comes earlier than previously thought (e.g., the time given by the greedy strategy). They analyzed international trade data and demonstrated that countries' strategies to diversify their products are close to the optimal ones. The time at which countries target core products, however, is later than the one suggested by the model, showing that countries can still reduce the total time of developing all products.
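The notion of density used in the principle of relatedness can be sketched numerically. The example below assumes the commonly used definition in which the density around a product is the proximity-weighted share of its related products that are already developed. The proximity matrix and the set of developed products are hypothetical, and the function name `relatedness_density` is introduced only for illustration.

```python
import numpy as np


def relatedness_density(proximity, developed):
    """Proximity-weighted share of already developed neighbors, per product.

    proximity: symmetric (n x n) array of pairwise proximities.
    developed: length-n array of 0/1 flags marking developed products.
    """
    weights = np.asarray(proximity, dtype=float).copy()
    np.fill_diagonal(weights, 0.0)          # ignore self-proximity
    developed = np.asarray(developed, dtype=float)
    total = weights.sum(axis=1)             # total proximity to all others
    related = weights @ developed           # proximity to developed products
    # Products with no neighbors get density 0 instead of a division error.
    return np.divide(related, total, out=np.zeros_like(total), where=total > 0)


# Hypothetical product space with four products; 0 and 1 are developed.
proximity = np.array([
    [0.0, 0.8, 0.2, 0.0],
    [0.8, 0.0, 0.5, 0.1],
    [0.2, 0.5, 0.0, 0.4],
    [0.0, 0.1, 0.4, 0.0],
])
developed = np.array([1, 1, 0, 0])

density = relatedness_density(proximity, developed)
```

In this toy case product 2 has a higher density than product 3, so under the principle of relatedness it would be the more likely next entry.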

The path-dependent process of regional diversification suggests that regions have more opportunities to develop industries that are highly related to their pre-existing ones Boschma2017b , while strategic diffusion suggests that countries can optimize their development paths by targeting highly connected but somewhat unrelated activities at a certain time Alshamsi2018 . The development of unrelated economic activities is particularly significant for the catching-up growth of developing economies, as it is usually hard for them to jump from periphery to core areas in the product and industry space; that is, the structure of the space conditions development Hidalgo2007 . Regarding this point, Zhu et al. Zhu2017 explored the development paths of regions in the heterogeneous industry space built on the export data of Chinese firms during 2002-2011. They studied whether developing regions can catch up by breaking path-dependent trajectories and jumping farther into core areas of the uneven industry space. They demonstrated that developing regions can make a farther jump to new industries in a path-breaking way, and that the reliance on technological relatedness can be transcended by internal innovations and extra-regional linkages. These findings suggest that less developed economies should pay more attention to improving other factors (such as infrastructure and education, government support and extra-regional linkages) to promote their jumping capability in catching-up growth.

Processes of unrelated diversification are also important for economic development, and economies can benefit from entering unrelated activities. Boschma et al. Boschma2017c argued that a theory of regional diversification should also account for the processes of unrelated diversification. They suggested paying attention to the role of agency in institutional entrepreneurship and to enabling factors at different spatial scales. In particular, they discussed four regional diversification trajectories, including two types of related diversification (replication and exaptation) and two types of unrelated diversification (transplantation and saltation). Pinheiro et al. Pinheiro2018 identified the periods in which countries entered unrelated products by analyzing the diversification paths of 93 countries in product exports during 1965-2014. They found that countries tend to enter unrelated products when they have high levels of human capital and when they are at an intermediate level of economic development. Moreover, countries that entered more unrelated products experienced a significant increase in economic growth, showing the gain from targeting unrelated activities at a specific development stage. Together, these results call for more intelligent strategies for economic diversification that balance related and unrelated activities in development.

3.3 Urban scalings and perception

The availability of large-scale and quantitative data from socioeconomic systems and image databases has enhanced our perception of the urban landscape and surrounding environment. In this subsection, we will summarize empirical observations and theoretical explanations of scaling laws of urban population with urban metrics (e.g., crime rate, employment, innovation and economic activity). Then, we will review recent applications of novel data on inferring the function of urban areas. Next, we introduce crowdsourcing methods and computer vision techniques to measure livability, safety and inequality, to infer the status of urban life, and to quantify the changes of urban streetscapes. Finally, we will introduce recent progress on urban computing for better development in urban areas.

3.3.1 Scaling laws for cities

Empirical observations in economics suggest that Zipf's law holds for cities in most countries. That is, the number of cities with populations greater than $S$ is proportional to $1/S$. Formally, $P(\mathrm{Size} > S) = a / S^{\zeta}$, with $\zeta \simeq 1$ and $a$ being a constant. Gabaix Gabaix1999 provided a simple explanation for the emergence of such Zipf's law, showing that a power-law exponent of $\zeta = 1$ follows necessarily from the most natural conditions on the Markov chain. Using the maximum likelihood estimation (MLE) method, Clauset et al. Clauset2009 estimated the power-law exponent for the populations of US cities in 2000, from which a rank-size slope can be derived. Later, Small et al. Small2011 tested Zipf's law based on a unique proxy for anthropogenic development, specifically, the temporally stable nighttime lights (NTLs) from the DMSP-OLS. Their estimates suggest that Zipf's law holds for the spatial extent of anthropogenic development at global scales.
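The MLE step can be sketched for a continuous power law. The example assumes that the lower cutoff `x_min` is known and uses the standard closed-form estimator for the density exponent $\alpha$, from which the tail exponent follows as $\alpha - 1$ (Zipf's law corresponds to a tail exponent of 1). The city sizes are synthetic draws rather than census data, and the function name `fit_power_law_mle` is hypothetical.

```python
import numpy as np


def fit_power_law_mle(values, x_min):
    """Closed-form MLE for a continuous power law p(x) ~ x^(-alpha), x >= x_min.

    Returns the estimated density exponent alpha and its asymptotic
    standard error (alpha - 1) / sqrt(n).
    """
    tail = np.asarray(values, dtype=float)
    tail = tail[tail >= x_min]
    n = tail.size
    alpha = 1.0 + n / np.log(tail / x_min).sum()
    std_err = (alpha - 1.0) / np.sqrt(n)
    return alpha, std_err


# Synthetic city sizes drawn by inverse-transform sampling (an assumption,
# not real data): density exponent 2, i.e., tail exponent 1.
rng = np.random.default_rng(0)
x_min = 1e4
true_alpha = 2.0
u = rng.random(50_000)
sizes = x_min * (1.0 - u) ** (-1.0 / (true_alpha - 1.0))

alpha_hat, alpha_err = fit_power_law_mle(sizes, x_min)
tail_exponent_hat = alpha_hat - 1.0  # compare with the Zipf value of 1
```

In practice the cutoff is itself estimated from the data, which this sketch deliberately leaves out.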

Urban scaling laws provide a quantitative connection between urbanization and economic development, which is common to all cities around the world. By analyzing datasets from urban systems in the US, Germany and China, Bettencourt et al. Bettencourt2007 found that many diverse urban variables fit power-law functions of population size with scaling exponents falling into distinct universality classes. Using total population $N(t)$ to estimate the size of the city at time $t$, the power-law scaling takes the form

$$Y(t) = Y_0\, N(t)^{\beta} \qquad (42)$$

where $Y(t)$ denotes a certain metric on social activities or material resources at time $t$, and $Y_0$ is the normalization constant. They found a pervasive property of urban organization with exponents falling into three categories (see Figure 21 for the results summarized by Arcaute et al. Arcaute2015 ): $\beta > 1$ (superlinear), $\beta \approx 1$ (linear), and $\beta < 1$ (sublinear). In particular, they showed that $\beta \approx 1$ is usually associated with individual human needs such as housing, employment and household electrical consumption, $\beta > 1$ is associated with social currencies such as information, innovation and wealth, and $\beta < 1$ is associated with infrastructure such as road surface, gasoline stations and length of electrical cables.
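A scaling exponent of this kind is often estimated by ordinary least squares on log-transformed data. The sketch below assumes the power-law form $Y = Y_0 N^{\beta}$ with multiplicative noise. The city populations and the urban metric are synthetic, and the tolerance used to label a fit as linear is an arbitrary choice made for illustration, not a value from Bettencourt2007.

```python
import numpy as np


def fit_scaling_exponent(population, metric):
    """Fit log Y = log Y0 + beta * log N by ordinary least squares."""
    log_n = np.log(np.asarray(population, dtype=float))
    log_y = np.log(np.asarray(metric, dtype=float))
    beta, log_y0 = np.polyfit(log_n, log_y, deg=1)  # slope first, then intercept
    return beta, np.exp(log_y0)


def classify_regime(beta, tolerance=0.05):
    """Label the exponent; `tolerance` is a hypothetical cut-off around 1."""
    if beta > 1.0 + tolerance:
        return "superlinear"
    if beta < 1.0 - tolerance:
        return "sublinear"
    return "linear"


# Synthetic cities (an assumption): true beta = 1.15, Y0 = 2, log-normal noise.
rng = np.random.default_rng(1)
population = 10 ** rng.uniform(4, 7, size=300)
noise = rng.normal(0.0, 0.1, size=300)
metric = 2.0 * population ** 1.15 * np.exp(noise)

beta_hat, y0_hat = fit_scaling_exponent(population, metric)
regime = classify_regime(beta_hat)
```

Log-log regression is only one of several estimation choices, and the fitted exponent can be sensitive to how city boundaries and noise are treated.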

Figure 21: Scaling exponents for urban metrics versus city size. Scaling exponents found for China, Germany and USA are shown with 95% confidence interval for different urban indicators. Scaling exponents are colour-coded according to their regime: sublinear in blue, linear in green, and superlinear in red. Figure from Arcaute2015 with data coming from Bettencourt2007 .

Urban scaling laws have been widely observed in emissions and infrastructure. Louf and Barthélemy Louf2014b found that CO$_2$ emissions scale superlinearly with population size $N$ in the US in 2012 and in the OECD countries in 2008. Oliveira et al. Oliveira2014 analyzed data on CO$_2$ emissions in the US during 1999-2008 and found a superlinear scaling across all cities. Delong and Burger Delong2015 found that energy use scales superlinearly with $N$ in Sweden, England and Wales (E&W), the US, and the world. Samaniego and Moses Samaniego2008 analyzed the structure of road networks in 425 US cities. They found that road network capacity per capita is independent of city size measured both by population and by spatial extent of the urban area. Batty Batty2013 analyzed the road network of cities in E&W and found a superlinear scaling of road accessibility with $N$. Louf et al. Louf2014c analyzed data on about 140 subways and over 50 railway networks across the world. They found that the length of subway networks scales superlinearly while the yearly ridership of railway networks scales linearly with the number of stations. For the UK and urban California, Masucci et al. Masucci2015 found that both the total length and the area of street networks scale almost linearly with $N$, and that the urban scalings persist in space and time.

A body of literature has demonstrated the urban scaling of crime in cities. Alves et al. Alves2013 analyzed data on homicides in Brazilian cities and found that the number of homicides scales superlinearly with population size $N$. They further proposed an approach to unveil relations between crime and urban metrics using the distance between the actual homicide number and the number expected from the scaling law. Banerjee et al. Banerjee2015 analyzed data on US cities and found that crime scales superlinearly with $N$. They gave the explanation that the number of police officers scales sublinearly while the number of generated crimes scales linearly. After analyzing monthly police crime reports in E&W, Hanley et al. Hanley2016 found four types of scaling behaviors based on population density, distinguished by how the scaling exponents at low and high population density compare: non-urban scaling, accelerated scaling, inhibited scaling and collapsed scaling. Oliveira et al. Oliveira2017 analyzed disaggregated criminal data from the US and UK. They found that the crime concentration does not scale with the city size, and that the crime distribution in a city follows a power-law distribution with an exponent depending on the crime type.
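The distance between the actual and the expected number of homicides can be sketched as a logarithmic residual from the scaling law. The example assumes that the distance is measured as the base-10 logarithm of the ratio between the observed and expected values. The scaling parameters and the two cities are hypothetical, and the function name `scaling_residuals` is introduced only for illustration.

```python
import numpy as np


def scaling_residuals(population, observed, beta, y0):
    """log10 of observed / expected, with expected = y0 * N ** beta.

    Positive values mark cities above the scaling law, negative values
    mark cities below it.
    """
    expected = y0 * np.asarray(population, dtype=float) ** beta
    return np.log10(np.asarray(observed, dtype=float) / expected)


# Hypothetical scaling law and two hypothetical cities of equal size:
# the expected number of homicides is about 10 for both.
beta, y0 = 1.2, 1e-5
population = np.array([1e5, 1e5])
observed = np.array([20.0, 5.0])

residuals = scaling_residuals(population, observed, beta, y0)
```

Because the residual removes the effect of city size, it can then be compared against other urban metrics without the population acting as a confounder.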

In most of the aforementioned studies, the word "city" refers to a larger agglomeration around the central city, which is a socioeconomic unit rather than an administrative one. In fact, there are alternative definitions of city boundaries. Arcaute et al. Arcaute2015 developed a framework to produce a system of cities by clustering small statistical units. They found that the scaling exponent $\beta$ shows mild deviations from linearity in E&W, suggesting that economic intricacies are not fully grasped by the urban population $N$. Van Raan et al. VanRaan2016b analyzed the urban scalings in the Netherlands and found a superlinear scaling of GDP with $N$ for major cities. After considering three separate modalities, they found that municipalities perform better than urban agglomerations and urban areas with the same population, showing that cities with a municipal reorganization are likely to perform better. Bettencourt and Lobo Bettencourt2016 applied new harmonized definitions of functional urban areas to examine scalings, finding that pooling together cities from different urban systems can better identify scaling behaviors in European cities.

Social ties of cities also exhibit scaling behaviors. Pan et al. Pan2013 found that the density of social ties scales superlinearly with urban population density, with the social-tie density expressed as a function of population density that involves a parameter unique to each city. In particular, the scaling exponent falls into a narrow band, with different values reported for the AIDS/HIV prevalence in US cities and for the total GDP per square kilometer in European cities. Moreover, the superlinear scaling is driven by increasing density, and the diffusion rate along social ties can accurately reproduce urban scalings. After analyzing mobile phone data of 31 Spanish cities, Louail et al. Louail2014 found that the number of activity centers scales sublinearly with population size $N$. Schläpfer et al. Schlapfer2014 analyzed the nationwide communication records in Portugal and the UK. They found that the total number of contacts and communication activities scale superlinearly with $N$. Recently, Leitão et al. Leitao2016 studied the existence of nonlinear scaling by developing a statistical framework to account for fluctuations. They found that the estimated scaling exponent depends not only on the fluctuations contained in the datasets but also on the assumptions of models and the heavy-tailed distribution of city sizes.

Several explanations have been proposed for the origin of urban scalings. Arbesman et al. Arbesman2009 explained the observed superlinear scaling in the relations between population size and innovation by a network model, where the number of long-distance ties associated with a city is proportional to its population and these ties provide the potential for innovation. The model yields a reasonable range of the scaling exponent, suggesting socially distant ties as a powerful force behind the superlinear scaling. Later, Gomez-Lievano et al. Gomez2012 built a statistical framework to explore how urban scaling laws emerge and relate to Zipf's law. Using data on homicides in cities of three countries, they derived the conditional probability density $P(Y|N)$ for the number of homicides $Y$ in a city with population $N$ by exploiting Bayes' rule

$$P(Y|N) = \frac{P(N|Y)\,P(Y)}{P(N)} \qquad (43)$$

where $P(Y)$ is the distribution of homicides in cities, $P(N|Y)$ is the conditional probability for the populations of cities with a given number of homicides, and $P(N)$ is the distribution of city populations. After studying the statistical properties of $P(Y)$ and $P(N|Y)$, they found that scaling laws emerge as the expectation value of $P(Y|N)$, which is a function of $N$. Moreover, the knowledge of these distributions can be used to predict the Zipf's exponent from the statistics of urban metrics.
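The use of Bayes' rule here can be illustrated with a small discrete example. The sketch assumes three possible homicide counts and two population bins, and all probabilities are made-up numbers rather than estimates from Gomez2012. It computes the conditional distribution of the count given the population bin and the resulting conditional expectation.

```python
import numpy as np

# Hypothetical inputs: homicide counts Y in {0, 1, 2} (rows) and two
# population bins N in {small, large} (columns).
y_values = np.array([0, 1, 2])
p_y = np.array([0.5, 0.3, 0.2])           # P(Y)
p_n_given_y = np.array([                  # P(N | Y); each row sums to 1
    [0.8, 0.2],
    [0.5, 0.5],
    [0.2, 0.8],
])

joint = p_n_given_y * p_y[:, None]        # P(N | Y) P(Y) = P(Y, N)
p_n = joint.sum(axis=0)                   # P(N), marginal over Y
p_y_given_n = joint / p_n                 # Bayes' rule; each column sums to 1

# Conditional expectation E[Y | N] for the small and the large bin.
expected_y = y_values @ p_y_given_n
```

The expected count is larger in the large-population bin, which is the sense in which an average scaling relation can be read off the conditional statistics.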

To better understand the origin of urban scalings, Bettencourt Bettencourt2013 developed a framework to estimate scaling exponents without modeling infrastructure. In a city with land area $A$ and population $N$, the strength of local interactions between people in an area $a_0$ is denoted as $g$. The basic ideas behind the model are summarized as follows. First, the number of local interactions per person is given by $a_0 \ell N / A$, where $N/A$ is the population density, and $\ell$ is the length of travel. Then, a city's total social output is given by

$$Y = G\,\frac{N^2}{A} \qquad (44)$$

where $G \equiv \bar{g}\, a_0 \ell$, $\bar{g}$ is the average social output per interaction, and $N$ is the population size. Next, the total cost to mix the city is $T = \epsilon L N$, where $\epsilon$ is a force per unit time, $L$ is the typical travel distance across the city, and $\epsilon L$ is the cost per person. The cost should be covered by each individual,