
One-shot Transfer Learning for Population Mapping

Fine-grained population distribution data is of great importance for many applications, e.g., urban planning, traffic scheduling, epidemic modeling, and risk control. However, due to the limitations of data collection, including infrastructure density, user privacy, and business security, such fine-grained data is hard to collect, and usually only coarse-grained data is available. Thus, obtaining fine-grained population distribution from coarse-grained distribution becomes an important problem. To tackle this problem, existing methods mainly rely on sufficient fine-grained ground truth for training, which is often unavailable for the majority of cities. That limits the applications of these methods and makes it necessary to transfer knowledge from data-sufficient source cities to data-scarce target cities. In the knowledge transfer scenario, we employ a single reference fine-grained ground truth in the target city, which is easy to obtain via remote sensing or questionnaire, to inform the large-scale urban structure and support the knowledge transfer in the target city. With this approach, we transform the fine-grained population mapping problem into a one-shot transfer learning problem. In this paper, we propose a novel one-shot transfer learning framework, PSRNet, to transfer spatial-temporal knowledge across cities from the view of network structure, the view of data, and the view of optimization. Experiments on real-life datasets of 4 cities demonstrate that PSRNet has significant advantages over 8 state-of-the-art baselines by reducing RMSE and MAE by more than 25 (





1. Introduction

(a) Coarse-grained Population (CITY1).
(b) Fine-grained Population (CITY1 4).
(c) Coarse-grained Population (CITY2).
(d) Fine-grained Population (CITY2 4).
Figure 1. Coarse-grained and fine-grained population distribution in CITY1 and CITY2. Lighter places have higher population.

Fine-grained urban population distribution, e.g., the real-time population in grids in the city, is of great importance for many applications, including urban planning, epidemic modeling, and transportation management. For example, with large-scale data recording the dynamic fine-grained population distribution, governments can make timely and effective policies on infrastructure construction for commuting and on public health during a pandemic. However, due to the limitations of current data collection systems, along with the issues of user privacy and business security, such fine-grained population distribution is difficult to obtain and is rarely made public. Usually, only coarse-grained population distribution is obtainable. Thus, as shown in Figure 1, inferring fine-grained urban population distribution from coarse-grained data, also known as the fine-grained population mapping problem (deville2014dynamic; stevens2015disaggregating), becomes an important task.

Although existing neural network-based population mapping methods (zong2019deepdpm; liang2019urbanfm) have promising performance, they always rely on sufficient fine-grained population distribution as ground truth to supervise their training, which severely restricts their applications since sufficient fine-grained population data are usually not obtainable. Therefore, a method that can infer fine-grained population distribution without sufficient fine-grained ground truth would significantly broaden the applications of population mapping in data-scarce cities. Fortunately, we can obtain prior knowledge of these cities from several sources. First, we can extract universal and transferable knowledge about the relationship between coarse-grained and fine-grained population from a data-sufficient source city to support the population mapping in a data-scarce target city. Second, although sufficient fine-grained population distribution data is unobtainable, a single static reference fine-grained ground truth is still easy to obtain by remote sensing (stevens2015disaggregating) or questionnaire. Third, the POI (Point of Interest) distribution characterizes the large-scale urban structures and region functions. Its strong relationship with population and crowd flow makes it informative.

Thus, we transform the population inference problem into a one-shot transfer learning problem, since there is only one target domain’s fine-grained ground truth in our scenario. We need to extract transferable knowledge from the data-sufficient source domain (city) and transfer it into data-scarce target domain (city) with the support of only one reference static population distribution sample and auxiliary data (i.e., POI distribution). To solve this one-shot transfer learning problem, there are still several challenges:


  • First, the spatial-temporal correlations between the coarse and fine-grained population distributions are complicated. In the data-sufficient source domain, we can train a model with both coarse and fine-grained population data. However, the correlations between the two distributions are spatially affected by the urban structure and change across locations. Thus, effectively learning transferable spatial-temporal correlations is challenging.

  • Second, it is hard to utilize the target domain's scarce reference data to guide knowledge transfer. The single reference fine-grained population ground truth and the POI distribution in the target domain could guide the knowledge transfer from the source domain. However, our experiments show that straightforward methods, such as using the single reference fine-grained distribution as the ground truth to fine-tune a model that is pre-trained in a single source domain or meta-pre-trained in multiple source domains, fail to adapt the model well to the target domain. Considering the differences in large-scale structures between the source and target domains, a more effective method is required for domain adaptation.

To address these challenges, we propose a novel Population Super-Resolution Network (PSRNet), which follows a two-stage procedure: pre-training for knowledge extraction in the source domain, and fine-tuning for knowledge transfer from the source to the target domain. To extract and transfer spatial-temporal knowledge, PSRNet consists of three components:

  • Spatial-Temporal Network (STNet) for model-based knowledge transfer, used to extract the transferable spatial-temporal correlations between coarse-grained and fine-grained population distribution.

  • Population Generation Network (PGNet) for data-based knowledge transfer, designed to transfer the relationship between POI and gridded crowd flow. It will augment the single fine-grained ground truth in target domains.

  • Pixel-level Adversarial Domain Adaptation mechanism (PADA) as optimization-based knowledge transfer, which could mitigate the domain shift in fine-tuning stage.

In the model-based transfer network STNet, we design a dense connection-based population mapping network to extract spatial correlations from the coarse-grained population for fine-grained population mapping. Furthermore, we design a temporal module to enhance the transferability of this population mapping network by capturing the temporal correlations of spatial features and progressively merging them into different layers of STNet.

In the data-based transfer network PGNet, we design a generative adversarial-based model to learn the transferable correlations between POI distribution and gridded crowd flow from the source domain and generate fine-grained population distribution in the target domain. Concretely, we utilize the dynamic representation from a time-enabled long short-term memory network (LSTM) as weights to reorganize the static urban POI map and generate the sequential gridded crowd flow with a residual convolution-based network. Then, we combine the single reference fine-grained ground truth with the generated crowd flow to synthesize multiple fine-grained population distribution samples in the target domain, providing more ground truth for fine-tuning.

Finally, we enable the knowledge transfer from the optimization view. We develop a pixel-level adversarial domain adaptation framework (PADA) to adapt our model to target domains by mitigating the domain shift between different domains in the fine-tuning stage. When employing PADA for fine-tuning, in addition to STNet's regular population mapping, a pixel-level discriminator is simultaneously trained to distinguish the domain of STNet's feature maps, whereas STNet is also optimized to confuse the discriminator. With this adversarial mechanism, we can adapt STNet while ensuring that its feature extraction is universal for the source and target domains. That mitigates the domain shift and improves the performance of the transfer.

Our contributions are summarized as follows.


  • We present the first attempt in one-shot transfer learning for fine-grained population mapping, which is of great importance to deploy population mapping on data-scarce cities. Concretely, we develop a novel framework with three-view knowledge transfer mechanisms to infer fine-grained population distribution with scarce data in the target domain.

  • We design a model-based transfer network STNet to transfer the spatial-temporal correlations between coarse-grained and fine-grained population distribution through its parameters. Besides, we develop a data-based transfer model PGNet to synthesize fine-grained ground truth in target domains and transfer the correlation between POI distribution and gridded crowd flow. Finally, based on the aforementioned components, we design a pixel-level adversarial domain adaptation fine-tuning framework PADA to reduce the domain shift in spatial-temporal knowledge transfer between source and target domains during fine-tuning optimization.

  • We conduct extensive experiments on real-life datasets of four cities to evaluate the performance of our proposed model, including the knowledge transfer across cities and granularities. Results on four metrics across the upscaling tasks demonstrate that our model has significant advantages over state-of-the-art baselines.

2. Preliminaries

In this section, we first introduce the notations and then formally define the fine-grained population mapping problem in the one-shot transfer learning scenario. Following previous works (stevens2015disaggregating; liang2019urbanfm; zong2019deepdpm), we use gridded population distribution to formulate the fine-grained population mapping problem.

Definition 1 (Gridded Population Distribution).

By partitioning an area into a grid map, the gridded population distribution in a single time slot is defined as a tensor obtained by accumulating the users visiting each grid. The sequential population distributions over consecutive time slots are defined in the same way for the source domain and the target domain.

Under above settings, population could be mapped into grid maps of different granularities (e.g., or ). Coarse-grained population distribution, with grid size is denoted by . Fine-grained population distribution e.g., with grid size is denoted by . We note that both coarse-grained and fine-grained population are relative and task-specified, which will be introduced before each comparison in Experiments 4. In this research, time slot always contains minutes. The fine-grained population mapping task needs to recover the fine-grained distribution from coarse-grained distribution, which is formally defined as below:

Problem 1 (Fine-grained Population Mapping).

Given the coarse-grained population distribution sequence of consecutive time slots (e.g., from 06:00PM to 09:00PM), estimate the fine-grained population distribution of the newest time slot (e.g., at 09:00PM). We note that the upscale factor determines how many sub-grids the population in each coarse grid needs to be partitioned into.

While fine-grained population data is difficult to obtain for the majority of cities in practice, we attempt to transfer knowledge from data-sufficient city to data-scarce city with the support of single reference static fine-grained population and POI distribution in target city. The single reference static fine-grained population is available via remote sensing (stevens2015disaggregating) or questionnaire. It is critical to indicate large-scale urban structure and patterns of population in the data-scarce target domain, which is denoted by . Moreover, POIs distribution characterizes the function of regions. It is also considered as a reliable and informative proxy of human activity (yuan2012discovering; Xu2016ContextawareRP; Dong2019PredictingNS; shao2021deepflowgen) in target domain. With partitioning an area into a grid map, the number of POIs of each category is defined as a tensor by accumulating POIs in each grid into categories.

By introducing knowledge transfer, single fine-grained population distribution and POI distribution for target city, we transform the fine-grained population mapping task into a one-shot transfer learning problem, since there is only one fine-grained population distribution as ground truth in target domain. This problem is formally defined as follows:

Problem 2 (One-shot Transfer Learning For Fine-grained Population Mapping).


  • Sufficient coarse and fine-grained population distribution , of source domain , where is the number of samples.

  • Sufficient coarse-grained population distribution , one static reference fine-grained population distribution sample in target domain .

  • Fine-grained POI distributions (static) , in source and target domains.

Estimate the fine-grained population distribution in target domain .

3. Methods

Figure 2. The framework of PSRNet.

To solve the fine-grained population mapping problem in the one-shot transfer learning scenario, we propose a novel model PSRNet, whose basic procedure is presented in Figure 2. PSRNet consists of three components: STNet for model-based knowledge transfer, PGNet for data-based transfer, and pixel-level adversarial domain adaptation (PADA) for optimization-based knowledge transfer. First, STNet is designed to complete the population mapping task by modeling the complicated spatial-temporal correlations between coarse and fine-grained population distributions; it is further enhanced by the temporal modeling network TNet. Second, PGNet is designed to generate gridded crowd flow and synthesize fine-grained population distribution ground truth by capturing the spatial-temporal correlations between gridded crowd flow and POI distribution via a generative adversarial network (GAN). Finally, combining STNet and PGNet, we propose PADA for optimization-based knowledge transfer, which encourages the model to transfer the spatial-temporal knowledge while mitigating the domain shift between different domains. In this way, PSRNet achieves one-shot transfer learning for the fine-grained population mapping problem.

The training procedure of PSRNet is described as follows:

  • Pre-training: For STNet, we use the sufficient data in the source domain to train it to infer fine-grained population distribution from sequential coarse-grained population distribution. We also train PGNet to synthesize fine-grained gridded crowd flow from the POI distribution in the source domain.

  • Fine-tuning: First, we employ PGNet, which is pre-trained in source domain, to generate gridded crowd flow with POI distribution in target domain. Second, we combine the single reference fine-grained population distribution and the generated gridded crowd flow to synthesize fine-grained population distribution in target domain. Third, we employ the synthetic fine-grained population distribution as ground truth to support the PADA mechanism to adapt the STNet into target domain.

3.1. STNet: Spatial-Temporal Correlation Modeling

Figure 3. The framework of STNet in PSRNet.

STNet is designed to extract universal spatial-temporal correlations from the coarse-grained population input and produce the fine-grained distribution . As Figure 3 shows, it consists of two components: the first part is the backbone network SNet, which is for the spatial modeling of single input of coarse-grained population in the th time slot; the second part is the temporal enhancement network TNet, which is designed to enhance SNet by modeling the temporal correlation of the sequential coarse-grained population input . Now, we discuss the details of these two networks.

3.1.1. SNet for Spatial Modeling

We first introduce the backbone network SNet for spatial modeling. As Figure 3 shows, SNet can be divided into three parts: 1) the feature extraction unit for input preprocessing and preliminary feature extraction from the coarse-grained population; 2) stacked conv-blocks for advanced spatial feature extraction; and 3) the upsampling components to produce the fine-grained population map based on the feature map from the previous feature extractors. We design two types of feature extractors. As shown at the bottom of Figure 3, the preliminary feature extraction unit is made up of two convolution units, each activated by a ReLU function. Here, the filter size is chosen to expand the receptive field of feature pre-processing. Following the preliminary feature extraction unit, we stack several densely connected conv-blocks as the advanced feature extractor to extract and fuse features again. The detailed design of the conv-block unit is presented at the bottom of Figure 3. It consists of a two-layer convolution unit activated by a ReLU function, with a batch-norm layer after the first convolution layer. Based on this basic unit, we apply the dense connection to construct the conv-block.

After merging all the output features from the two levels of feature extractors with a convolution layer, we build an up-sampling unit to upscale the feature map. Each up-sampling unit upscales the feature map by a fixed factor. One up-sampling unit consists of three layers: a convolution layer with a batch-norm layer, a pixel-shuffle layer (shi2016real) for rearranging and up-scaling, and a ReLU function for non-linear activation. Stacking several up-sampling units, or using a pixel-shuffle layer with a higher scale, achieves a larger up-sampling factor. Different from the general image super-resolution task, the fine-grained population mapping task exhibits a specific value constraint: the population of an area equals the total population of its sub-areas. Therefore, we finally apply N²-Normalization (liang2019urbanfm) to refine the fine-grained population.
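The value constraint above can be illustrated with a small sketch. It assumes the normalization works as in UrbanFM: within each block of n x n sub-grids, the raw network outputs are turned into shares that sum to one and are then multiplied by the coarse-grained population of that block. The function name `n2_normalize`, the list-of-lists layout, and the uniform fallback for empty blocks are illustrative assumptions, not details given in the text.

```python
def n2_normalize(raw_fine, coarse, n):
    """Rescale each n x n block of raw_fine so that it sums to the matching coarse cell."""
    height, width = len(coarse), len(coarse[0])
    fine = [[0.0] * (width * n) for _ in range(height * n)]
    for i in range(height):
        for j in range(width):
            # Coordinates of the n * n sub-grids that belong to coarse cell (i, j).
            cells = [(i * n + di, j * n + dj) for di in range(n) for dj in range(n)]
            block_sum = sum(raw_fine[r][c] for r, c in cells)
            for r, c in cells:
                # Assumption: fall back to a uniform split when the block carries no mass.
                share = raw_fine[r][c] / block_sum if block_sum > 0 else 1.0 / (n * n)
                fine[r][c] = coarse[i][j] * share
    return fine


coarse = [[10.0, 4.0]]
raw_fine = [[1.0, 3.0, 0.0, 0.0],
            [2.0, 4.0, 0.0, 0.0]]
fine = n2_normalize(raw_fine, coarse, 2)
```

After this step every block of sub-grids adds up to its coarse cell, so the fine-grained map stays consistent with the coarse-grained input regardless of the raw network output.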

3.1.2. TNet for Temporal Enhancement

While SNet is designed for a single coarse-grained population input, a simple way to extend it to temporal modeling is to concatenate consecutive time slots in the channel dimension and feed them as the sequential input. However, simple concatenation in the channel dimension has limited capacity to capture long-term correlation, as verified in our ablation study (Section 4). Due to the regularity and daily periodicity of dynamic population distribution, we need to consider long-term effects in the temporal modeling. Thus, we design TNet to capture this important long-term temporal correlation, as shown at the top of Figure 3.

Figure 4. The framework of PGNet in PSRNet.

To model these long-term effects and avoid the limitation of simple concatenation in the channel dimension, we first utilize a 3D convolution layer with a temporal stride to merge the coarse-grained population input in adjacent time slots. The stride sets the time window over which features are merged and is an important factor in trading off performance against model complexity. Then, we use a feature extraction unit with the same structure as in SNet to process each merged feature independently. The number of merged features is determined by the length of the input sequence and the stride. This shared feature extraction unit processes the merged features and produces the same number of new features. Finally, we progressively merge these features into the conv-blocks. For example, the first feature map is fed into the first conv-block and the second feature map is fed into the second conv-block. In this way, the temporal features of different periods are merged into different layers of the spatial modeling network SNet, so that each merging operation only needs to process a few features. It can also be regarded as an interpretable weighting scheme: more important temporal features are fed into earlier positions of the whole structure.
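The windowing and progressive merging described above can be sketched as follows. This is only an illustration under stated assumptions: a plain average stands in for the learned 3D convolution, the temporal kernel is assumed to equal the stride so that windows do not overlap, and the k-th window is paired with the k-th conv-block. The function names are hypothetical.

```python
def merge_time_windows(sequence, stride):
    """Group consecutive time slots into non-overlapping windows and average each one.

    The average is a stand-in for the learned 3D convolution (assumption).
    """
    if stride <= 0 or len(sequence) % stride != 0:
        raise ValueError("sequence length must be a positive multiple of stride")
    merged = []
    for start in range(0, len(sequence), stride):
        window = sequence[start:start + stride]
        rows, cols = len(window[0]), len(window[0][0])
        merged.append([[sum(m[r][c] for m in window) / stride for c in range(cols)]
                       for r in range(rows)])
    return merged


def assign_to_conv_blocks(merged):
    """Pair the k-th merged feature with the k-th conv-block (0-indexed)."""
    return {block_index: feature for block_index, feature in enumerate(merged)}


sequence = [[[float(t)]] for t in range(6)]  # six 1x1 maps holding 0.0 .. 5.0
merged = merge_time_windows(sequence, stride=2)
schedule = assign_to_conv_blocks(merged)
```

With six time slots and a stride of two, three merged features are produced, so three conv-blocks each receive one temporal feature instead of a single layer receiving all six slots at once.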

In summary, we design a dense connection-based SNet to complete fine-grained population mapping task from the spatial view and then design TNet to enhance the SNet from the temporal view to further improve the results.

3.2. PGNet: Static Fine-grained Population Distribution Synthesis

3.2.1. Overview

To address the lack of fine-grained population ground truth in the target domain, we propose a generative model, PGNet, to generate gridded crowd flow from the POI distribution. Combined with the single reference fine-grained population, PGNet can synthesize multiple fine-grained population distributions.

Population distribution, crowd flow, and POI distribution are highly associated so it’s reasonable to infer one of them based on others. Firstly, the static reference fine-grained population distribution is strongly associated with other time slots’ distribution of the same domain since they share an identical large-scale urban structure. Therefore, the static reference distribution is critical to synthesize the fine-grained population of other time slots. Secondly, crowd flow is defined as the difference between population distribution in consecutive time slots. Thus, with the sequence of crowd flow and static population distribution, the distribution of more time slots could be inferred by accumulating crowd flow onto static distribution iteratively. Thirdly, POI distribution (PoI2020) characterizes the function of regions and it has been regarded as a reliable proxy for human activities in many previous works (yuan2012discovering; Xu2016ContextawareRP; Dong2019PredictingNS; shao2021deepflowgen). Thus it’s strongly associated with crowd flow or population.

Given the close relationship between POI, population, and crowd flow, our PGNet learns transferable universal knowledge of the correlations between POI distribution and crowd flow in the source domain and generates fine-grained crowd flow in the target domain. Then, we synthesize the fine-grained population of more time slots by accumulating the generated crowd flow onto the single fine-grained reference population distribution. These synthetic population distributions contain the knowledge of both the target domain's static reference distribution and the target domain's POI distribution. Finally, these synthetic population distributions are employed as the ground truth in fine-tuning. The whole generative framework of PGNet is presented in Figure 4. It contains a generator, which produces dynamic fine-grained population distribution from time-weighted POI density, and a discriminator, which produces learning signals for the generator by distinguishing whether the input distribution is synthetic or real.

3.2.2. Detailed Designs

We first introduce the design of the generator, shown on the left of Figure 4. As the destinations of human movement between regions, POIs describe urban function and urban structure to a large extent (yuan2012discovering). Here, we generate the fine-grained crowd flow from a dynamically weighted fine-grained POI density map. First, to generate the fine-grained crowd flow at a given time of day, we use a learnable time embedding table to convert the time into a dense feature vector. We then feed these time vectors into an LSTM to produce the sequential representation of that time. On the other hand, given the fine-grained POI distribution, we utilize a convolution layer and res-blocks to pre-process it and obtain its feature map. Then, we multiply the time embedding onto each pixel of this feature map to obtain a time-aware feature map. Moreover, we stack res-blocks and a convolution layer to further process it and generate the fine-grained crowd flow map for that time. Finally, we generate the population distributions of more time slots by adding the generated crowd flow onto the static reference fine-grained distribution and employ an N²-Normalization layer to regularize these population maps. The components of the generator are summarized as follows:


We first define fine-grained gridded crowd flow as the difference in fine-grained population distribution between two consecutive time slots. The generator comprises an embedding layer, a vector extension operation, two feature extraction modules built from convolution layers and res-blocks, the category-aware POI distribution as input, and the N²-Normalization layer. It produces the estimated fine-grained crowd flow for each time slot, which is accumulated onto the static reference fine-grained population distribution to obtain the synthetic fine-grained population distributions in the forward and backward directions. The resulting sequence of fine-grained population distributions is the final output of PGNet's generator.
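The forward and backward accumulation can be sketched as below. The sketch assumes the flow at step t is the population at step t+1 minus the population at step t, so that forward synthesis adds flows to the reference and backward synthesis subtracts them. The helper names and the ordering of `backward_flows` are illustrative choices, and the final normalization layer is omitted.

```python
def add_maps(a, b, sign=1.0):
    """Element-wise a + sign * b for two equally sized grids."""
    return [[x + sign * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def synthesize_sequence(reference, forward_flows, backward_flows):
    """Return maps ordered in time: backward results, the reference, forward results.

    forward_flows[k] is the flow leading from step k to step k + 1 after the reference.
    backward_flows[k] is the flow leading from step -(k + 1) to step -k before it.
    """
    forward, current = [], reference
    for flow in forward_flows:
        current = add_maps(current, flow)         # X[t + 1] = X[t] + F[t]
        forward.append(current)
    backward, current = [], reference
    for flow in backward_flows:
        current = add_maps(current, flow, -1.0)   # X[t - 1] = X[t] - F[t - 1]
        backward.append(current)
    return backward[::-1] + [reference] + forward


reference = [[5.0, 5.0]]
forward_flows = [[[1.0, -1.0]], [[2.0, 0.0]]]
backward_flows = [[[0.5, 0.5]]]
sequence = synthesize_sequence(reference, forward_flows, backward_flows)
```

One static reference map therefore yields one synthetic map per generated flow, which is how a single sample is expanded into several ground-truth samples for fine-tuning.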

To guide the learning of the generator, we build a discriminator to generate the learning signals. The discriminator consists of two major components: a convolution-based feature extractor, which consists of a convolution unit and res-blocks, and a classification module with a linear layer activated by a sigmoid function. The training of PGNet follows the standard GAN training procedure with a combined loss of GAN loss and MSE loss, with the weighting coefficient as a hyper-parameter.
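A minimal sketch of such a combined objective is given below. It assumes the standard binary cross-entropy form of the GAN loss and assumes the weighting coefficient multiplies the MSE term; the text does not state which term carries the weight, and the names `generator_loss`, `discriminator_loss`, and `weight` are hypothetical.

```python
import math


def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for one discriminator score in (0, 1)."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def mse(generated, real):
    """Mean squared error between two equally sized grids."""
    flat_g = [v for row in generated for v in row]
    flat_r = [v for row in real for v in row]
    return sum((g - r) ** 2 for g, r in zip(flat_g, flat_r)) / len(flat_g)


def generator_loss(d_score_on_fake, generated, real, weight):
    # Adversarial term: the generator wants the discriminator to output 1 on its samples.
    # Assumption: the hyper-parameter `weight` scales the MSE term.
    return bce(d_score_on_fake, 1.0) + weight * mse(generated, real)


def discriminator_loss(d_score_on_real, d_score_on_fake):
    # The discriminator wants 1 on real maps and 0 on synthetic maps.
    return bce(d_score_on_real, 1.0) + bce(d_score_on_fake, 0.0)


example = generator_loss(0.5, [[1.0, 2.0]], [[1.0, 4.0]], weight=0.1)
```

The MSE term anchors the generated flow to the observed flow in the source domain, while the adversarial term pushes the generated maps toward realistic spatial patterns.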

3.3. Pixel-level Adversarial Domain Adaptation

In cooperation with the model-based transfer by STNet and the data-based transfer by PGNet, we introduce the optimization-based transfer through an adversarial domain adaptation training mechanism, which is presented in Figure 2. Adversarial domain adaptation (Ganin2015UnsupervisedDA; Tzeng2017AdversarialDD) is an effective transfer learning algorithm. We adopt its basic structure while adapting it to our problem. It contains three components:

  • TNet - Feature Extractor: We first utilize the TNet as a universal Feature Extractor to extract feature maps in the source and target domains. It is optimized to support the regression of the Predictor. Meanwhile, it is also optimized against the Domain Classifier to confuse its classification task.

  • SNet - Predictor: With pre-trained PGNet, we obtain the synthetic fine-grained population data in target city as ground truth. With the synthetic multiple fine-grained population distributions as ground truth, Predictor and Feature Extractor are optimized to complete the prediction task with MSE loss.

  • Domain Classifier: The Domain Classifier accepts the concatenation of feature maps from the source and target domains. The classifier is highly similar to PGNet, but we remove the last layers and employ a multi-layer perceptron (MLP) to directly classify the domain of each pixel, which contains multiple channels. The Domain Classifier is optimized to classify the domain of the input feature maps, working in parallel with the Predictor.

We repeat the optimization until convergence. Finally, we get a well-trained STNet to complete the fine-grained population mapping task on the target domain by learning the universal spatial-temporal knowledge between cities from the model-based, data-based, and optimization-based transfer views.
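The interplay of the three components can be sketched with scalar objectives. This follows the common adversarial domain adaptation formulation in which the feature extractor minimizes the task loss minus a scaled domain loss; the exact objective used by PADA is not given in the text, so the trade-off factor `alpha`, the label convention (0 for source, 1 for target), and the function names are assumptions.

```python
import math


def pixel_domain_loss(pixel_probs, domain_label, eps=1e-12):
    """Mean binary cross-entropy over all pixels; label 0 = source, 1 = target."""
    total, count = 0.0, 0
    for row in pixel_probs:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)
            total += -(domain_label * math.log(p) + (1 - domain_label) * math.log(1.0 - p))
            count += 1
    return total / count


def classifier_objective(source_probs, target_probs):
    # The Domain Classifier minimizes this: it should tell the two domains apart per pixel.
    return pixel_domain_loss(source_probs, 0) + pixel_domain_loss(target_probs, 1)


def extractor_objective(task_mse, source_probs, target_probs, alpha):
    # Assumption: the Feature Extractor minimizes the mapping error while pushing the
    # classifier's loss up, so that its features look alike in both domains.
    return task_mse - alpha * classifier_objective(source_probs, target_probs)


undecided = [[0.5, 0.5]]
confused = classifier_objective(undecided, undecided)
separated = classifier_objective([[0.1, 0.1]], [[0.9, 0.9]])
objective = extractor_objective(1.0, undecided, undecided, alpha=0.5)
```

When the classifier can no longer separate the domains, its per-pixel outputs sit near 0.5 and its loss is high, which is the state the feature extractor is driven toward.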

4. Experiments

4.1. Dataset

We employ real-life datasets from four cities, which are represented by CITY1 - CITY4, to evaluate the performance of models. These datasets are collected from mobile devices by a popular mobile localization service provider in China. The records are dense at the population level and thus close to the real population distribution. The data cover these cities over the period from 2018.08 to 2018.09 and record the locations whenever users request localization services in the applications. To obtain the fine-grained gridded population distribution, each location record is projected into a chessboard-like grid at the finest granularity, while each timestamp is projected into fixed-length time windows. Records are aggregated by counting the population value of each grid in each time window. We note that the raw data with anonymous individual information is not available and we could only access the aggregated population data.
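The aggregation step can be sketched as follows. The record layout `(minute, lat, lon)`, the bounding box, the grid size, and the window length are all hypothetical values chosen for illustration; the sketch also counts records, whereas the original aggregation may count distinct users.

```python
from collections import Counter


def grid_population(records, lat_min, lat_max, lon_min, lon_max, rows, cols, window_minutes):
    """Count records per (time window, row, col); each record is (minute, lat, lon)."""
    counts = Counter()
    for minute, lat, lon in records:
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue  # drop records outside the study area
        row = int((lat - lat_min) / (lat_max - lat_min) * rows)
        col = int((lon - lon_min) / (lon_max - lon_min) * cols)
        counts[(minute // window_minutes, row, col)] += 1
    return counts


records = [(5, 0.1, 0.1), (20, 0.1, 0.2), (40, 0.9, 0.9), (45, 2.0, 0.5)]
counts = grid_population(records, 0.0, 1.0, 0.0, 1.0, rows=2, cols=2, window_minutes=30)
```

Each key of `counts` identifies one grid cell in one time window, so stacking the cells of a window gives one gridded population map.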

We also collect POI data in these cities from the public website to support the experiments. For each city, we collect about million POI instances and calculate the category-based POI distribution map, which would be used in PSRNet. These POIs are classified into categories, including food, hotel, culture, sports, shopping, factory, recreation, institution, medical care, scenic spot, education, landmark, residence, travel & transport, business affairs, and life service.

4.2. Baselines

To evaluate the performance of our model, we compare it with eight state-of-the-art baselines, including traditional methods (Bicubic and LightGBM), super-resolution based methods for the fine-grained population mapping task (DeepDPM and UrbanFM), and advanced methods for image and video super-resolution (RCAN, DBPN, RRN, and RBPN).


  • Bicubic (Gonzlez1981DigitalIP): A widely used up-sampling method for image processing, we use it to up-sample the coarse-grained population distribution.

  • LightGBM (ke2017lightgbm): It is a gradient boost regression tree-based ensemble semi-automated machine learning method, which is highly efficient and effective.

  • DeepDPM (zong2019deepdpm): It contains a CNN-based spatial mapping network and an RNN-based temporal smoothness network to extract spatial-temporal features and achieves promising results for population mapping.

  • UrbanFM (liang2019urbanfm): With its ResNet-based super-resolution network and N²-Normalization layer, UrbanFM achieves state-of-the-art performance on the fine-grained urban flow inference task.

  • RCAN (zhang2018rcan): It improves its performance of image super-resolution by considering the residual connection and channel attention in the model.

  • DBPN (DBPN2018): With back-projection units and dense connection, it repetitively up-samples and down-samples the feature maps and concatenates them for high-resolution image reconstruction.

  • RRN (isobe2020RRN): By processing the feature map with its residual module recurrently, RRN is capable of completing the video super-resolution task.

  • RBPN (RBPN2019): As the state-of-the-art method for video super-resolution, it is an enhanced version of DBPN by considering the temporal correlation with a multiple projection mechanism.

| Method | CITY2 (2) RMSE | CITY2 (2) MAE | CITY2 (2) MAPE | CITY2 (2) Corr | CITY3 (2) RMSE | CITY3 (2) MAE | CITY3 (2) MAPE | CITY3 (2) Corr | CITY2 (4) RMSE | CITY2 (4) MAE | CITY2 (4) MAPE | CITY2 (4) Corr | CITY3 (4) RMSE | CITY3 (4) MAE | CITY3 (4) MAPE | CITY3 (4) Corr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bicubic | 19.463 | 12.973 | 0.325 | 0.874 | 37.504 | 20.454 | 0.457 | 0.850 | 25.879 | 17.849 | 0.447 | 0.759 | 50.437 | 28.978 | 0.648 | 0.703 |
| LightGBM (ke2017lightgbm) | 19.507 | 12.755 | 0.320 | 0.883 | 37.453 | 20.021 | 0.448 | 0.851 | 26.519 | 17.849 | 0.447 | 0.768 | 50.460 | 28.654 | 0.641 | 0.707 |
| DeepDPM (zong2019deepdpm) | 19.904 | 13.737 | 0.344 | 0.877 | 32.135 | 17.234 | 0.385 | 0.900 | 25.913 | 16.932 | 0.424 | 0.797 | 41.501 | 21.184 | 0.474 | 0.842 |
| UrbanFM (liang2019urbanfm) | 16.546 | 10.640 | 0.267 | 0.917 | 19.808 | 10.594 | 0.237 | 0.958 | 20.900 | 13.499 | 0.338 | 0.866 | 27.722 | 13.654 | 0.305 | 0.925 |
| RCAN (zhang2018rcan) | 17.380 | 10.880 | 0.273 | 0.911 | 33.602 | 19.251 | 0.431 | 0.930 | 21.688 | 14.352 | 0.360 | 0.853 | 28.752 | 16.576 | 0.371 | 0.927 |
| DBPN (DBPN2018) | 17.745 | 11.404 | 0.286 | 0.908 | 20.438 | 11.355 | 0.254 | 0.956 | 25.223 | 16.128 | 0.404 | 0.818 | 29.920 | 16.965 | 0.379 | 0.910 |
| RRN (isobe2020RRN) | 17.836 | 11.502 | 0.288 | 0.905 | 34.074 | 19.506 | 0.436 | 0.900 | 38.977 | 25.185 | 0.631 | 0.696 | 50.008 | 31.832 | 0.712 | 0.798 |
| RBPN (RBPN2019) | 17.909 | 12.142 | 0.304 | 0.901 | 23.096 | 14.587 | 0.326 | 0.946 | 22.857 | 15.612 | 0.391 | 0.835 | 31.628 | 19.196 | 0.429 | 0.902 |
| OurBest | 14.157 | 8.397 | 0.210 | 0.942 | 16.174 | 8.247 | 0.184 | 0.972 | 16.746 | 10.214 | 0.256 | 0.917 | 20.861 | 10.280 | 0.230 | 0.952 |
| Improv. | 14.4% | 21.1% | 21.1% | 7.4% | 18.3% | 22.2% | 22.2% | 8.0% | 19.9% | 24.3% | 24.3% | 31.7% | 24.7% | 24.7% | 24.7% | 19.3% |

Table 1. Results of our model and baselines. Bold denotes the best results (lowest error, highest Corr). Underline denotes the second-best results.

4.3. Experimental Settings

In pre-processing, for a task with a given up-sampling factor, we take a sequence of fine-grained population distributions from a dataset and sum the population of adjacent grids to obtain the corresponding sequence of coarse-grained population maps. We then use the coarse-grained distribution to infer the fine-grained distribution. The fine-grained POI distribution and the single reference fine-grained population distribution always have the fine-grained shape.
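The aggregation step described above can be sketched as block-wise sum pooling. This is a minimal illustration, not the authors' pre-processing code: the function name `coarsen`, the `(time, height, width)` array layout, and the integer `factor` argument are assumptions introduced here.

```python
import numpy as np


def coarsen(fine_seq, factor):
    """Sum adjacent grid cells of a (time, height, width) population sequence.

    Hypothetical helper: each coarse cell is the sum of the factor x factor
    fine-grained cells it covers, so the total population is preserved.
    """
    fine_seq = np.asarray(fine_seq)
    t, h, w = fine_seq.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    # Split each spatial axis into (coarse index, offset within the block).
    blocks = fine_seq.reshape(t, h // factor, factor, w // factor, factor)
    # Sum over the two within-block offsets.
    return blocks.sum(axis=(2, 4))
```

Because the pooling is a sum rather than a mean, the coarse map and the fine map describe the same total population, which is the property a mapping model has to respect when it redistributes a coarse cell over its fine cells.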

For source cities, we randomly select time slots as the training set and use the remaining and time slots as the validation set and test set. For target cities, we use week as the test set to evaluate performance on the fine-grained population mapping task in the transfer learning scenario, while its first time slots are used as the reference fine-grained population distribution in the target domain. This setting can therefore be regarded as a one-shot transfer learning scenario.

The default settings of STNet are length of population sequence , time stride , number of layers , base input/output channel , time channel . For the generator of PGNet, we use base channel . For the discriminator of PGNet, we use number of layers , spatial stride , base channel .

We evaluate the model with metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Pearson Correlation Coefficient (Corr). Based on these metrics, we calculate the error between estimated fine-grained population distributions and their ground truth.
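For reference, the four metrics can be computed as in the sketch below. It uses the standard textbook definitions. The function name `evaluate` and the small `eps` guard against empty grid cells in MAPE are assumptions made here, since the paper does not state how zero-population cells are handled.

```python
import numpy as np


def evaluate(pred, truth, eps=1e-8):
    """Return RMSE, MAE, MAPE, and Pearson correlation over all grid cells."""
    pred = np.asarray(pred, dtype=float).ravel()
    truth = np.asarray(truth, dtype=float).ravel()
    err = pred - truth
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # Assumption: clamp the denominator so empty cells do not divide by zero.
    mape = float(np.mean(np.abs(err) / np.maximum(np.abs(truth), eps)))
    corr = float(np.corrcoef(pred, truth)[0, 1])
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "Corr": corr}
```

MAPE is returned as a fraction rather than a percentage, which matches the scale of the MAPE values reported in the tables.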

We conducted experiments on Ubuntu 18.04.3 LTS system with 4x NVIDIA GTX 2080Ti using Python 3.6.10 and PyTorch 1.6.0. Our models, experiment code, and datasets are available via


4.4. One-shot Transfer across Cities

To verify the effectiveness of our PSRNet, we compare our model with the baselines on the two fine-grained population mapping tasks reported in Table 1, denoted (2) and (4), in the cross-cities scenario.

For all baselines, we first pre-train their models in CITY1 with sufficient data and then employ MAML (finn2017MAML) on all cities except the target city as an additional meta-pre-training step. We then use the single reference fine-grained population distribution to fine-tune them on the target city (CITY2 or CITY3) and obtain the final mapping results.
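The meta-pre-train and fine-tune protocol applied to the baselines can be illustrated on a toy problem. The sketch below is not the experimental code: it uses the first-order approximation of MAML on a linear regression model, and the function names, learning rates, and the `(support_x, support_y, query_x, query_y)` task format are all assumptions chosen for brevity. Each task plays the role of one source city, and `fine_tune` plays the role of adapting to the single reference sample of the target city.

```python
import numpy as np


def mse_and_grad(w, x, y):
    """Mean squared error of the linear model x @ w and its gradient in w."""
    residual = x @ w - y
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * x.T @ residual / len(y)
    return loss, grad


def fomaml_pretrain(tasks, dim, inner_lr=0.05, outer_lr=0.05, steps=200):
    """First-order MAML over tasks given as (xs, ys, xq, yq) tuples."""
    w = np.zeros(dim)
    for _ in range(steps):
        meta_grad = np.zeros(dim)
        for xs, ys, xq, yq in tasks:
            # Inner step: adapt the shared initialization on the support set.
            _, g_support = mse_and_grad(w, xs, ys)
            w_adapted = w - inner_lr * g_support
            # First-order approximation: use the query gradient at the
            # adapted weights and ignore second-order terms.
            _, g_query = mse_and_grad(w_adapted, xq, yq)
            meta_grad += g_query
        w = w - outer_lr * meta_grad / len(tasks)
    return w


def fine_tune(w, x, y, lr=0.05, steps=5):
    """A few plain gradient steps on the target-domain reference data."""
    for _ in range(steps):
        _, g = mse_and_grad(w, x, y)
        w = w - lr * g
    return w
```

The design point this illustrates is that meta-pre-training optimizes the initialization for its loss after adaptation, not for its loss before adaptation, which is why it needs several source tasks to be effective.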

To train our proposed PSRNet in this subsection, we first employ CITY1 to pre-train STNet and PGNet. Second, we employ the pre-trained PGNet to synthesize distributions in more time slots from the single reference fine-grained population distribution. Finally, we employ PADA (Section 3.3) to fine-tune the pre-trained STNet with the synthetic population distribution. PSRNet is then able to generate the fine-grained population distribution. We note that PSRNet is compared not only with the architectures of the baselines but also with MAML (finn2017MAML), the meta-learning mechanism used for knowledge transfer across cities. We also note that PSRNet uses only a single source city (CITY1) for pre-training, while all cities except the target city are used to support the baselines’ meta-learning. This gives the baselines an advantage in this comparison.

Table 1 shows the performance of all baselines and our PSRNet, where Improv. indicates the percentage reduction in RMSE, MAE, and MAPE and the increase in Corr of our method compared with the best baseline. From these results, we find that PSRNet outperforms all baselines on all metrics. Although UrbanFM, DBPN, RRN, and RBPN reach comparable results in some scenarios, none of them outperforms PSRNet in any task. The second-best RMSE values, all obtained by UrbanFM, are 16.546 and 19.808 in the (2) tasks and 20.900 and 27.722 in the (4) tasks when CITY2 and CITY3 are the target cities, respectively. Compared with these results, PSRNet reduces RMSE by 14.4%, 18.3%, 19.9%, and 24.7%, while the other baselines achieve comparable performance only in a few metrics or tasks and fail to sustain it across the remaining tasks. UrbanFM is the only consistently comparable baseline, since it is specially designed for urban scenarios, yet it is still less competitive than PSRNet.
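As a worked check of the Improv. row for RMSE, the relative reduction can be recomputed from the values in Table 1. The helper name `relative_reduction` is introduced here for illustration. The inputs are the best-baseline (UrbanFM) and OurBest RMSE entries copied from the table.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# RMSE values copied from Table 1: (best baseline, OurBest) per task.
rmse_pairs = {
    "CITY2 (2)": (16.546, 14.157),
    "CITY3 (2)": (19.808, 16.174),
    "CITY2 (4)": (20.900, 16.746),
    "CITY3 (4)": (27.722, 20.861),
}

rmse_improvement = {
    task: relative_reduction(baseline, ours)
    for task, (baseline, ours) in rmse_pairs.items()
}
```

The resulting values agree with the RMSE entries of the Improv. row (14.4%, 18.3%, 19.9%, and 24.7%) to one decimal place.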

Figure 5. Visualization of results in CITY2: (a) ground truth, (b) our model, (c) UrbanFM, (d) RBPN.

In addition to the numerical evaluation of the models’ performance, we also compare them by visualization. Figure 5 shows the ground truth map and the fine-grained population maps predicted by PSRNet, UrbanFM, and RBPN at 09:00 AM in CITY2.

In these figures, lighter places denote greater population values and darker places denote smaller ones. The smaller white rectangle marks a region of fine-grained grids mapped from the corresponding coarse-grained grids, and the bigger rectangle is a zoomed view of the smaller one. We note that the river crosses the city from the bottom to the upper-right corner of the image and that the selected region lies on the bank of this river.

In Figure 5, PSRNet’s distribution has the shapes and textures closest to those of the ground truth distribution. For example, compared with PSRNet, UrbanFM’s predicted fine-grained population changes abruptly at the margins between coarse-grained pixels, so the zoomed area shows an obviously unnatural vertical dividing line. In addition, the map of RBPN is much lighter and rougher than that of PSRNet in the zoomed region. These results show that, compared with RBPN, PSRNet can capture the pattern of the river and produce more reasonable fine-grained distributions.

In summary, the numerical comparison and the visualization show that our proposed PSRNet has significant advantages over state-of-the-art baselines on fine-grained population mapping tasks in the cross-cities transfer learning scenario.

4.5. One-shot Transfer across Granularities

In this subsection, we study an extremely data-scarce scenario in which no data from outside the target city is obtainable. In this scenario, only the coarse-grained distribution, the single reference fine-grained population distribution, and the POI distribution of the target city are available. This situation poses a new challenge: we cannot transfer knowledge from other cities. Inspired by the self-supervised zero-shot super-resolution method (shocher2018zero), we design a novel cross-granularities self-supervision task in which we use the coarser-grained distribution (down-sampled from the coarse-grained distribution) to infer the coarse-grained distribution, as a form of universal knowledge extraction.
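The construction of the self-supervision task can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the names `sum_pool` and `cross_granularity_pairs` are hypothetical, and down-sampling is assumed to be block-wise summation with a factor of 2, in line with the 2km, 1km, and 500m grid sizes listed in Table 2.

```python
import numpy as np


def sum_pool(grid, factor):
    """Sum population over non-overlapping factor x factor blocks of a 2D map."""
    h, w = grid.shape
    return grid.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))


def cross_granularity_pairs(coarse_seq, factor=2):
    """Build (input, target) training pairs from coarse-grained maps only.

    The input is the coarser-grained map (e.g. 2km cells) obtained by
    down-sampling, and the target is the original coarse map (e.g. 1km cells).
    """
    return [(sum_pool(coarse, factor), coarse) for coarse in coarse_seq]
```

A model trained on these pairs learns a coarser-to-coarse mapping without any fine-grained label. At inference time, the same model is given the coarse map as input to estimate the fine-grained map.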

Given the significant diversity between cities’ large-scale structures, it is natural to assume that the domain shift between the populations of the same city at different granularities is smaller than that between the populations of different cities at the same granularity. Concretely, instead of transferring knowledge from a CITY1 task to a CITY2 task, in this subsection we test the feasibility of transferring knowledge between two granularities of CITY2, from the coarser-to-coarse task to the coarse-to-fine task. Therefore, we design a novel cross-granularities knowledge transfer task to test the capability of each model, in which we pre-train all baselines to infer the coarse-grained population from the coarser-grained population and use the single static reference distribution as ground truth to fine-tune these models. Finally, all baselines take the coarse-grained population as input to estimate the fine-grained distribution. For PSRNet, we adapt our proposed training procedure (Section 3.3) to this scenario by pre-training both STNet and PGNet to generate the coarse-grained population distribution.

| Method | CITY2 RMSE | CITY2 MAE | CITY2 MAPE | CITY2 Corr | CITY3 RMSE | CITY3 MAE | CITY3 MAPE | CITY3 Corr |
|---|---|---|---|---|---|---|---|---|
| Granularity | (2km→1km)→(1km→500m) | | | | (2km→1km)→(1km→500m) | | | |
| Bicubic | 19.463 | 12.973 | 0.325 | 0.874 | 37.504 | 20.454 | 0.457 | 0.850 |
| LightGBM | 22.205 | 14.033 | 0.352 | 0.852 | 41.810 | 21.799 | 0.488 | 0.839 |
| DeepDPM | 19.698 | 13.314 | 0.334 | 0.873 | 34.333 | 18.854 | 0.422 | 0.879 |
| UrbanFM | 17.677 | 11.219 | 0.281 | 0.906 | 21.222 | 11.647 | 0.260 | 0.953 |
| RCAN | 18.698 | 12.693 | 0.318 | 0.889 | 30.956 | 19.144 | 0.428 | 0.900 |
| DBPN | 18.402 | 12.102 | 0.303 | 0.898 | 24.980 | 14.237 | 0.318 | 0.936 |
| RRN | 19.195 | 13.181 | 0.330 | 0.888 | 31.045 | 17.803 | 0.398 | 0.902 |
| RBPN | 21.382 | 14.458 | 0.362 | 0.859 | 35.185 | 21.869 | 0.489 | 0.881 |
| PSRNet | 13.073 | 7.950 | 0.199 | 0.950 | 16.164 | 7.886 | 0.176 | 0.972 |
| Improv. | 26.0% | 29.1% | 29.1% | 4.8% | 23.8% | 32.3% | 32.3% | 2.0% |

Table 2. Performance of our model and baselines on the cross-granularities knowledge transfer task. (2km→1km)→(1km→500m) means that we pre-train PSRNet in the source domain, in which we infer the 1km population from the 2km population, while in the target domain we infer the 500m population from the 1km population.

Table 2 shows the performance of PSRNet and the baselines in the cross-granularities knowledge transfer scenario. According to Table 2, PSRNet achieves the best performance in both cities on all metrics, while UrbanFM always reaches the second-best results. We note that PSRNet’s RMSE in CITY2 and CITY3 is 13.073 and 16.164 in the cross-granularities scenario, lower than its RMSE on the corresponding task in the cross-cities scenario, which suggests that the domain shift in the cross-granularities scenario is smaller than in the cross-cities scenario. Therefore, the knowledge of the same city’s coarse-grained distribution is more transferable than that of other cities’ fine-grained distribution, which supports the aforementioned assumption. Furthermore, it also suggests that PSRNet has the potential to generate finer-grained population distributions as long as we employ a finer-grained static reference distribution to fine-tune PSRNet. Unfortunately, we are not able to validate such results without enough ground truth at finer granularity.

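| Method | CITY2 (2) RMSE | CITY2 (2) MAE | CITY2 (2) MAPE | CITY2 (2) Corr | CITY3 (2) RMSE | CITY3 (2) MAE | CITY3 (2) MAPE | CITY3 (2) Corr | CITY2 (4) RMSE | CITY2 (4) MAE | CITY2 (4) MAPE | CITY2 (4) Corr | CITY3 (4) RMSE | CITY3 (4) MAE | CITY3 (4) MAPE | CITY3 (4) Corr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UrbanFM | 16.546 | 10.640 | 0.267 | 0.917 | 19.808 | 10.594 | 0.237 | 0.958 | 20.900 | 13.499 | 0.338 | 0.866 | 27.722 | 13.654 | 0.305 | 0.925 |
| SNet | 15.395 | 9.426 | 0.236 | 0.930 | 18.589 | 9.201 | 0.206 | 0.963 | 19.444 | 12.105 | 0.303 | 0.887 | 26.157 | 12.616 | 0.341 | 0.934 |
| SNet+PGNet | 15.021 | 9.389 | 0.235 | 0.933 | 17.740 | 9.158 | 0.205 | 0.966 | 17.880 | 11.467 | 0.287 | 0.903 | 22.489 | 11.350 | 0.308 | 0.945 |
| STNet | 15.050 | 9.304 | 0.233 | 0.934 | 18.264 | 9.149 | 0.205 | 0.966 | 18.318 | 11.187 | 0.280 | 0.902 | 25.531 | 12.332 | 0.313 | 0.938 |
| Meta-STNet | 14.973 | 9.270 | 0.232 | 0.934 | 17.968 | 8.957 | 0.200 | 0.967 | 18.561 | 11.425 | 0.286 | 0.899 | 25.861 | 12.429 | 0.313 | 0.936 |
| STNet+PGNet | 14.157 | 8.397 | 0.210 | 0.942 | 16.420 | 8.451 | 0.189 | 0.971 | 16.795 | 10.336 | 0.259 | 0.916 | 21.020 | 10.392 | 0.268 | 0.952 |
| PSRNet | 14.714 | 9.066 | 0.227 | 0.937 | 16.174 | 8.247 | 0.184 | 0.972 | 16.746 | 10.214 | 0.256 | 0.917 | 20.861 | 10.280 | 0.263 | 0.952 |

Table 3. Performance of different variants of our model in the cross-cities knowledge transfer task.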

4.6. Ablation Study

In this section, we conduct an ablation study in cross-cities scenario to evaluate the effectiveness of each proposed component of PSRNet. To investigate their effect, we compare different variants of our method. To evaluate SNet’s capability of spatial-temporal knowledge extraction, UrbanFM is also introduced in this comparison. All the variants are introduced as follows:

  • SNet is the backbone population mapping network of PSRNet, which accepts coarse-grained population maps as multiple input channels.

  • +TNet denotes that we introduce TNet to enhance the temporal modeling of sequential input. SNet+TNet is simplified as STNet.

  • +PGNet denotes that we employ PGNet to generate synthetic fine-grained population maps in the target domain. These generated maps are then used to fine-tune the pre-trained population mapping network.

  • +MAML means that we employ the meta-learning algorithm MAML (finn2017MAML) to run an additional meta-training stage for the population mapping models, using the data of the three cities other than the target city.

  • +PADA means that we employ pixel-level adversarial domain adaptation (Section 3) to fine-tune the model in the target domain. The complete version of our proposed method is STNet+PGNet+PADA (denoted as PSRNet).

The results in Table 3 lead to several conclusions:

  • Our SNet could outperform UrbanFM, which shows that SNet could capture spatial patterns in population maps by its more advanced architecture.

  • In STNet, SNet is combined with TNet, which effectively exploits temporal information by feeding the feature maps of different time slots into different layers. Each layer of STNet therefore needs to process less information. The fact that STNet outperforms SNet indicates that STNet is better at capturing transferable features in all scenarios.

  • By comparing the performance of STNet and MAML+STNet (Meta-STNet in Table 3), we find that although MAML slightly improves STNet’s performance in the (2) tasks, it fails to sustain this improvement in the (4) tasks. This shows that the meta-learning method fails to transfer knowledge across cities stably.

  • SNet+PGNet outperforms SNet, while STNet+PGNet outperforms STNet in all scenarios. That shows our PGNet could always improve the performance by providing more synthetic ground truth in target city and transferring the correlation between POI distribution and gridded crowd flow.

  • Compared with STNet+PGNet, PSRNet (STNet+PGNet+PADA) performs better in the (4) task in CITY2 and in all tasks in CITY3. Although STNet+PGNet performs slightly better in the (2) task in CITY2, PSRNet is still comparable.

In summary, the model-based, data-based, and optimization-based transfer learning components of PSRNet each bring a performance gain in the one-shot transfer learning setting of the fine-grained population mapping task. This supports the soundness of our design.

5. Related Works

5.1. Image and Video Super-Resolution

With the application of deep learning, the research community (wang2019deep; dong2014learning; DBPN2018; zhang2018rcan; li2019fast) has made significant progress on the image super-resolution task. SRCNN (dong2014learning) utilizes several convolution layers to build the first end-to-end framework for image super-resolution. Following the basic idea of first building deep convolutional networks to extract features and then up-sampling to obtain the high-resolution image, many follow-up works have been proposed with more advanced network designs (DBPN2018; zhang2018rcan), specific loss functions (ledig2017photo), and so on. Taking the temporal correlation between multiple frames into account extends image super-resolution to the video super-resolution task. While some works (liao2015video) model the spatial-temporal correlation via motion compensation between different frames, using optical flow or learning, others (RBPN2019; li2019fast) directly learn the spatial-temporal dependency with sequential structures such as recurrent neural networks. Different from these existing works on the general image/video super-resolution task, we focus on the fine-grained population mapping task and propose effective methods to transfer spatial-temporal knowledge and improve performance in cities without fine-grained data.

5.2. Fine-Grained Population Mapping

By applying the successful practice of image super-resolution to fine-grained population mapping, DeepDPM (zong2019deepdpm) and UrbanFM (liang2019urbanfm) are the works most closely related to ours. DeepDPM (zong2019deepdpm) first utilizes SRCNN (dong2014learning) with a stacking structure to up-sample the static population distribution and then utilizes an LSTM to refine the population variation in the temporal dimension. To infer the fine-grained crowd flow, UrbanFM (liang2019urbanfm) proposes a ResNet-based network structure that applies recent practice from image super-resolution and also considers the effects of external factors such as holidays. While they achieve promising performance in cities with enough data, they require a large amount of fine-grained data to train the whole model, which is not available for most cities. Different from them, our work considers the transferred fine-grained population mapping task and proposes to transfer spatial-temporal knowledge from the data and model views to improve the mapping performance in cities without fine-grained data.

5.3. Transfer Learning Among Cities

Knowledge transfer between cities is an important topic in urban computing. Wei et al. (Wei2016TransferKB) propose FLORAL, which learns semantically related dictionaries and transfers dictionaries and instances to predict air quality in different cities. Wang et al. (Wang2019CrossCityTL) propose to use side information from public check-ins to align regions from different cities and enable explicit knowledge transfer in the crowd flow prediction task. Yao et al. (Yao2019LearningFM) apply MAML optimization methods to enable multi-city crowd flow prediction. These existing works focus on the multivariate time series prediction problem and are not directly applicable to our mapping task. Furthermore, different from these works, which transfer knowledge from only a single view, our framework enables spatial-temporal knowledge transfer from the model, data, and optimization views.

6. Conclusion

In this paper, we investigated the fine-grained population mapping problem in the transfer learning scenario. We transformed this problem into a one-shot transfer learning problem for the population mapping task. We proposed a novel model that transfers spatial-temporal knowledge from the model view, the data view, and the optimization view. We designed a sequential population mapping network to capture the complicated correlation between the populations of different granularities. Furthermore, we proposed a generative model to synthesize multiple fine-grained population samples in the target domain with the POI distribution. Finally, we utilized adversarial adaptation methods to fine-tune the pre-trained model and transfer the universal spatial-temporal knowledge.

This work was supported in part by the National Key Research and Development Program of China under grant 2018YFB1800804, the National Natural Science Foundation of China under U1936217, 61971267, 61972223, 61941117, 61861136003, the Beijing Natural Science Foundation under L182038, the Beijing National Research Center for Information Science and Technology under 20031887521, and the research fund of Tsinghua University - Tencent Joint Laboratory for Internet Innovation Technology.