Fine-Grained Urban Flow Inference

The ubiquitous deployment of monitoring devices in urban flow monitoring systems induces a significant cost for maintenance and operation. A technique is required to reduce the number of deployed devices, while preventing the degeneration of data accuracy and granularity. In this paper, we present an approach for inferring the real-time and fine-grained crowd flows throughout a city based on coarse-grained observations. This task exhibits two challenges: the spatial correlations between coarse- and fine-grained urban flows, and the complexities of external impacts. To tackle these issues, we develop a model entitled UrbanFM which consists of two major parts: 1) an inference network to generate fine-grained flow distributions from coarse-grained inputs that uses a feature extraction module and a novel distributional upsampling module; 2) a general fusion subnet to further boost the performance by considering the influence of different external factors. This structure provides outstanding effectiveness and efficiency for small scale upsampling. However, the single-pass upsampling used by UrbanFM is insufficient at higher upscaling rates. Therefore, we further present UrbanPy, a cascading model for progressive inference of fine-grained urban flows by decomposing the original tasks into multiple subtasks. Compared to UrbanFM, such an enhanced structure demonstrates favorable performance for larger-scale inference tasks.



1 Introduction

Fine-grained urban flow monitoring systems are a crucial component of the information infrastructure of smart cities, providing a foundation for urban planning and various other applications such as traffic management. To obtain data at a spatially fine level of granularity, a system requires large numbers of sensing devices to be deployed in order to cover a citywide landscape. For example, thousands of piezoelectric sensors and loop detectors are deployed on road segments in a city to monitor fine-grained vehicle traffic flow volumes in real time. With a large number of devices deployed, a high cost is incurred due to long-term operation (e.g., electricity and communication cost) and maintenance (e.g., on-site maintenance and warranty). A recent study showed that in Anyang, Korea, the annual operation and device maintenance costs for their smart city projects reached 100K USD and 400K USD respectively in 2015 [19]. With the rapid development of smart cities on a worldwide scale, the cost of manpower and energy will become a prohibitive factor for further smartening the Earth. To reduce such expense, a novel technology is required that reduces the number of deployed sensors while, most importantly, keeping the original data granularity unchanged. Therefore, how to approximate the original fine-grained information from available coarse-grained data (obtained from fewer sensors) becomes an urgent problem.

Take monitoring traffic on a university campus as an example. We can reduce the number of interior loop detectors and keep sensors only at the entrances to save cost. However, we still desire to recover the fine-grained flow distribution within the campus given only the coarse-grained information. In this paper, our goal is to infer the real-time and spatially fine-grained flows from observed coarse-grained data on a citywide scale with many other regions (as shown in Figure 1). This Fine-grained Urban Flow Inference (FUFI) problem, however, is very challenging due to the following reasons:

Fig. 1: Traffic flows at two levels of granularities in Beijing, where each grid denotes a region.

  • Spatial Correlations. Fine-grained flow maps have spatial and structural correlations with their coarse-grained counterparts. Essentially, the flow volume in a coarse-grained superregion (e.g., the campus), is distributed among constituent subregions (e.g., libraries, sports center) at the fine-grained level. This implies a crucial structural constraint (i.e., spatial hierarchy [39]): the sum of the flow volumes among subregions strictly equals that of the corresponding superregion, as shown in Figure 1. Furthermore, the flow in one region can be affected by the flows in the nearby regions, which will impact the inference for the fine-grained flow distributions over subregions. Methods failing to capture these considerations would exhibit poor performance.

  • External Factors. The distribution of the flows in a given region is affected by various external factors, such as local weather, time of day and special events. To understand such impacts, we present a real-world study in an area of Beijing as shown in Figure 2(a). On weekdays, (b) shows more flows occurring at 10 a.m. in the office area and attractions than at 8 p.m., when residences experience much higher flow density than the other areas (see (e)); on weekends, however, (c) depicts that people tend to be present in a park in the morning. All of these reflect the common sense that people go to work in the morning, go to attractions for relaxation at the weekend, and return home at night. In addition, (d) shows that people tend to move to indoor areas instead of the outdoor park during storms. These observations demonstrate that regions with different semantics present different flow distributions in the presence of different external factors. Moreover, these external factors can compound and thus influence the actual distribution in complicated ways.

Fig. 2: Impacts of external factors on regional flow distributions. (a) We obtain Points of Interest (POIs) for different regions, and then categorize regions with different semantics according to the POI information. (b)-(e) depict the average flow distribution under various external conditions.

Inspired by techniques from the domain of image recovery, in our preliminary work we attack the FUFI problem by designing a neural network-based model entitled Urban Flow Magnifier (UrbanFM) [21], which resolves the above challenges with an innovative network structure. First, we extract features from coarse-grained inputs using Convolutional Neural Networks (CNNs) and perform upsampling based on high-level features. But in contrast to image processing, where the direct output is the target fine-grained image, we instead change the learning objective to inferring the distributions of the fine-grained flows, which capture how the flow in each superregion is distributed to its corresponding subregions. To this end, we present a distributional upsampling module with a novel, parameter-free layer entitled $N^2$-Normalization (where $N$ is the upscaling factor), which provides superior performance over the image super-resolution baselines by exploiting the underlying structure of the FUFI problem. Moreover, we employ an external factor fusion subnet to capture the complexity of external impacts and produce a feature map that embeds the different impacts on different locations. Benefiting from the dedicated network architecture, UrbanFM outperforms all six baselines we have studied, including heuristics and state-of-the-art methods, across all three evaluation metrics we have considered.

Nevertheless, one limitation of our preliminary work is that it was evaluated only for 4x upsampling. When the required upsampling scale becomes larger (e.g., 8x), UrbanFM can encounter difficulties as it performs upsampling in a single forward pass. In other words, the upsampling is conducted by consecutively stacking multiple sub-pixel upsampling layers. Such a simple strategy complicates the tasks of feature extraction for frontal layers when the upsampled space becomes much larger. Inspired by the concept of Pyramid structure [15], in this paper, we present an enhanced model named Urban Pyramid Network (UrbanPy) which inherits key advantages from UrbanFM while performing progressive upsampling instead of single-pass. This model employs multiple key innovations to address the following deficiencies of its predecessor.


  • Pyramid Architecture. In contrast to UrbanFM, UrbanPy decomposes the overall upsampling objective into multiple subprocesses. For objective decomposition, UrbanPy employs a pyramid architecture consisting of multiple components, where each component functions as an atomic upsampler for a small scale (e.g., 2x). Such decomposition allows the network to divide a difficult task into much easier subtasks such that each component can solve its own subtask more effectively. Each component consists of two subnets and processes the upsampling task through a propose-and-correct paradigm, where the proposal network proposes a prototype based on the previous output and the correction network learns to correct the prototype. Following the spirit of UrbanFM, both the proposal and correction subnets focus on modeling the distributions over corresponding subregions in the upsampled map.

  • Local Structure. UrbanFM utilizes classic convolution layers for feature extraction, where the kernel weights are shared globally without considering the local characteristics of each superregion in which the individual distribution applies. However, human flows can be highly correlated to the geographic nature of the location. For example, a park tends to have smaller flow volume compared to the street next to it. Extracting the region-specific features can be difficult when kernel weights are shared. To this end, we embed geographic knowledge (e.g., road network, POIs) to each location and employ a non-shared convolution layer. This results in each superregion enjoying a customized submodel for flow inference while tailoring to the local geographic structure.

  • Distributional Loss. Though UrbanFM explicitly implants the spatial hierarchy into its model architecture, the training loss used is mean square error (MSE). This can introduce inconsistency between the distributional nature that the network is expected to capture and the training objective. To bridge this gap, we explicitly train the model according to discrepancies in the distribution space. In particular, we compare the inferred distribution with the ground truth distribution using KL-divergence in every local superregion. This helps to preserve the local nature in the loss function, in contrast to the simple MSE function.

Our key contributions are summarized as follows.

  • We formalize the problem of Fine-grained Urban Flow Inference, which is critical for modern urban information infrastructure construction. We show that the essence of this problem is to uncover the distributions of super-regions over their associative sub-regions.

  • We present UrbanFM, which exploits the problem structure and shows superior performance versus the baseline methods. Moreover, we identify the limitation of this preliminary work for large scale upsampling and present the improved method UrbanPy, which incorporates multiple key innovations over UrbanFM.

  • We conducted extensive experiments using real-world datasets, including city-scale (i.e., Beijing) and district-scale (i.e. HappyValley, a theme park). Empirical results demonstrate the superiority of our methods compared to multiple state-of-the-art approaches.

Outline. We first formalize the FUFI problem in Section 2. Then we present in Section 3 our preliminary work (i.e., UrbanFM) that addresses the problem by resolving the two challenges mentioned above. Furthermore, we discuss the limitations of UrbanFM and present the detailed techniques employed by UrbanPy in Section 4. Extensive experiments are presented in Section 5 to demonstrate the effectiveness of our method, followed by a discussion of related work in Section 6. Section 7 concludes the paper.

2 Problem Formulation

This section first defines some notation and then formulates the problem of Fine-grained Urban Flow Inference (FUFI).

Definition 1 (Region) As shown in Figure 1, we partition an area of interest (e.g., a city) evenly into an $I \times J$ grid map based on longitude and latitude, where a grid element denotes a region [37]. Partitioning the city into smaller regions (i.e., using larger $I$ and $J$) allows one to obtain flow data with more detail, which produces a more fine-grained flow map.

Definition 2 (Flow Map) Let $\mathbf{X} \in \mathbb{R}_{+}^{I \times J}$ represent a flow map at a particular time, where each entry $x_{i,j} \in \mathbb{R}_{+}$ denotes the flow volume of the flow agents (e.g., vehicles, people) in region $(i, j)$.

Definition 3 (Superregion & Subregion) In our FUFI problem, a coarse-grained grid map indicates the data granularity we can observe upon sensor reduction. It is obtained by combining nearby grids within an $N$-by-$N$ range of a fine-grained grid map using a scaling factor $N$. Figure 1 illustrates an example: each coarse-grained grid in Figure 1(a) is composed of $N \times N$ smaller grids from Figure 1(b). We define the aggregated larger grid as a superregion, and its constituent smaller regions as subregions. Note that in this setting, superregions do not share subregions. Hence, the relationship between superregions and the corresponding subregions induces a special structural constraint in FUFI.

Definition 4 (Structural Constraint) The flow volume $x^{c}_{i,j}$ in a superregion of the coarse-grained grid map and the flows $x^{f}_{i',j'}$ in the corresponding subregions of the fine-grained counterpart obey the following equation:

$$x^{c}_{i,j} = \sum_{i',j'} x^{f}_{i',j'} \quad \text{s.t.} \quad \lceil i'/N \rceil = i,\ \lceil j'/N \rceil = j. \tag{1}$$

For simplicity, $i = 1, \dots, I$ and $j = 1, \dots, J$ in our paper unless otherwise specified.

Problem Statement (Fine-grained Urban Flow Inference)

Given an upscaling factor $N \in \mathbb{Z}^{+}$ and a coarse-grained flow map $\mathbf{X}^{c} \in \mathbb{R}_{+}^{I \times J}$, infer the fine-grained counterpart $\mathbf{X}^{f} \in \mathbb{R}_{+}^{NI \times NJ}$ as accurately as possible subject to the structural constraints.
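
The structural constraint can be checked numerically. The following is a minimal NumPy sketch on random toy data; the helper name `sum_pool` and the map sizes are illustrative assumptions, not part of the original formulation. It aggregates a fine-grained map into its coarse-grained counterpart by summing every superregion's block of subregions.

```python
import numpy as np

def sum_pool(fine: np.ndarray, n: int) -> np.ndarray:
    """Aggregate an (n*I, n*J) fine-grained map into an (I, J) coarse-grained map."""
    rows, cols = fine.shape
    if rows % n or cols % n:
        raise ValueError("map dimensions must be divisible by n")
    # Split each axis into (superregion index, offset inside the superregion)
    blocks = fine.reshape(rows // n, n, cols // n, n)
    # Summing over the two offset axes adds up all subregions of a superregion
    return blocks.sum(axis=(1, 3))

rng = np.random.default_rng(0)
fine_map = rng.integers(0, 50, size=(8, 8)).astype(float)  # toy fine-grained flows
coarse_map = sum_pool(fine_map, n=4)                        # 2 x 2 superregions
```

Each entry of `coarse_map` equals the sum of the corresponding 4-by-4 block of `fine_map`, which is the constraint any inferred fine-grained map is expected to satisfy.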

3 Urban Flow Magnifier

Fig. 3: The UrbanFM framework for 4× upscaling ($N=4$). ⊕ denotes addition and ⊙ denotes the Hadamard product. Note that our framework allows other integer upscaling factors and is not limited to powers of 2.

Figure 3 depicts the framework of UrbanFM, which consists of two main components for conducting structurally constrained inference of fine-grained flows and capturing complex external influence on the flows, respectively. The inference network takes the coarse-grained flow map as input, and then extracts high-level features across the whole area by leveraging deep residual networks [9]. Taking the extracted features as a priori knowledge, the distributional upsampling module outputs a flow distribution over the subregions of each superregion by introducing a dedicated $N^2$-Normalization layer. Finally, the Hadamard product of the inferred distribution with the original coarse-grained flow map gives the fine-grained flow map as the network output. In an external factor fusion branch, we leverage embeddings and a dense network to extract pixel-wise external features at both coarse and fine granularity. The integration of external and flow features enables UrbanFM to perform fine-grained flow inference more effectively. In this section, we describe the designs of the two components, as well as the optimization scheme used in network training.

3.1 Inference Network

The inference network aims to produce a map of fine-grained flow distributions over subregions from a coarse-grained input. We follow the general procedure in image super resolution (SR) methods, which is composed of two phases: 1) feature extraction; 2) inference upon upsampled features.

3.1.1 Feature Extraction

In the input stage, we use a convolutional layer to extract low-level features from the given coarse-grained flow map $\mathbf{X}^{c}$, and perform the first stage of fusion if external features are provided. Then a stack of residual blocks with identical layout takes the (fused) low-level feature maps as input and constructs high-level feature maps. The residual block layout, as shown on the top right of Figure 3, follows the guideline in Ledig et al. [18]: it contains two convolutional layers followed by a batch normalization layer, with an intermediate ReLU [8] function to introduce non-linearity.

Since we utilize a fully convolutional architecture, the receptive field grows larger as we stack the network deeper. In other words, each pixel of the high-level feature map will be able to capture distant or even citywide dependencies. Moreover, we use another convolutional layer followed by batch normalization to guarantee non-trivial feature extraction. Finally, drawing from the intuition that the output flow distribution exhibits region-to-region dependencies on the original coarse-grained input $\mathbf{X}^{c}$, we employ a skip connection to introduce identity mapping [10] between the low-level features and high-level features, building an information highway that skips over the residual blocks to allow efficient gradient back-propagation.

3.1.2 Distributional Upsampling

In the second phase, the extracted features first go through a series of sub-pixel blocks to perform the $N\times$ upscaling operation, which produces a hidden feature $\mathbf{H}^{f}$. The sub-pixel block, as illustrated in Figure 3, leverages a convolutional layer followed by batch normalization to extract features. Then it uses a PixelShuffle layer [24] to rearrange and upsample the feature maps to twice their spatial size and applies a ReLU activation at the end. After processing each sub-pixel block, the output feature maps are 2 times larger spatially with the number of channels unchanged. A convolutional layer is applied post-upsampling, which maps $\mathbf{H}^{f}$ to a tensor $\mathbf{H}^{f}_{o}$ with $F_o$ channels; $F_o = 1$ in our case for simplicity. In SR tasks, $\mathbf{H}^{f}_{o}$ is usually the final output for the recovered image with super-resolution. However, the structural constraint that is essential to FUFI has not yet been considered.
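
To make the rearrangement step concrete, the following is a minimal NumPy sketch of a pixel-shuffle operation for a single sample. It assumes the channel ordering used by common deep learning libraries, in which input channel `c*r*r + i*r + j` is written to output channel `c` at spatial offset `(i, j)` inside each `r`-by-`r` cell; the function name and toy tensor are illustrative.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) tensor into a (C, H*r, W*r) tensor."""
    channels, height, width = x.shape
    if channels % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    out_channels = channels // (r * r)
    # Separate the channel axis into (output channel, row offset, column offset)
    x = x.reshape(out_channels, r, r, height, width)
    # Interleave the offsets with the spatial axes: (C, H, r, W, r)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(out_channels, height * r, width * r)

features = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)  # toy input, C*r*r = 4
upsampled = pixel_shuffle(features, r=2)                       # shape (1, 4, 4)
```

The spatial size doubles while the four input channels collapse into one, which is why the preceding convolution in a sub-pixel block has to expand the channel count by a factor of `r*r` for the number of channels to remain unchanged after the block.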

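In order to impose the structural constraint on the network, one straightforward approach is to add a structural loss $\mathcal{L}_s$ as a regularization term to the loss function, which penalizes the discrepancy between the flow $x^{c}_{i,j}$ of each superregion and the sum of the inferred flows over its subregions.

However, simply applying $\mathcal{L}_s$ does not improve the model performance, as we demonstrate in Section 5. Instead, we design an $N^2$-Normalization layer, which outputs a distribution over every patch of $N$-by-$N$ subregions of an associated superregion. To achieve this, we reformulate Equation 1 as follows:

$$\sum_{i',j'} \frac{x^{f}_{i',j'}}{x^{c}_{i,j}} = 1 \quad \text{s.t.} \quad \lceil i'/N \rceil = i,\ \lceil j'/N \rceil = j,$$

where $x^{c}_{i,j}$ is the flow of superregion $(i,j)$ and $x^{f}_{i',j'}$ are the flows of its subregions. The flow volume in each subregion is now expressed as a fraction of that in the superregion, i.e., $x^{f}_{i',j'} / x^{c}_{i,j}$, and we can treat the fraction as a probability. This allows us to interpret the network output in a meaningful way: the value in each subregion pixel states how likely the overall superregion flow will be allocated to the subregion $(i', j')$. With this reformulation, we shift our focus from directly generating the fine-grained flow $\mathbf{X}^{f}$ to generating the flow distributions $\mathbf{H}^{f}_{\pi}$. This essentially changes the network learning target and thus diverges from the traditional SR literature. To this end, we present the $N^2$-Normalization layer, which maps the upsampled output $\mathbf{H}^{f}_{o}$ to $\mathbf{H}^{f}_{\pi}$ such that

$$h^{\pi}_{\alpha,\beta} = \frac{h^{o}_{\alpha,\beta}}{\sum_{\alpha',\beta'} h^{o}_{\alpha',\beta'}} \quad \text{s.t.} \quad \lceil \alpha'/N \rceil = \lceil \alpha/N \rceil,\ \lceil \beta'/N \rceil = \lceil \beta/N \rceil,$$

where $h^{o}_{\alpha,\beta}$ and $h^{\pi}_{\alpha,\beta}$ denote the entries of $\mathbf{H}^{f}_{o}$ and $\mathbf{H}^{f}_{\pi}$ at fine-grained position $(\alpha, \beta)$.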
The $N^2$-Normalization layer introduces no extra parameters to the network. Moreover, it can be easily implemented within a few lines of code (see Algorithm 1). Also, the operations can be fully parallelized and automatically differentiated at runtime. Remarkably, this reformulation relieves the network of handling varying output scales and enables it to focus on producing a probability within the constraint.

Input: x, scale_factor, ε
Output: out
// x: an input feature map
// scale_factor: the upscaling factor
// ε: a small number for numerical stability
// out: the structural distributions
sum = SumPooling(x, scale_factor);
sum = NearestNeighborUpsampling(sum, scale_factor);
out = x / (sum + ε) // element-wise division
Algorithm 1 $N^2$-Normalization
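
Algorithm 1 can be sketched in NumPy as follows. This is a single-channel illustration under stated assumptions rather than the authors' implementation: the function names, the default `eps`, and the random stand-in for the network output are hypothetical, and the input is assumed to be non-negative. The second function applies the Hadamard product of the resulting distribution map with the nearest-neighbor upsampled coarse-grained map, as described for Figure 3.

```python
import numpy as np

def n2_normalize(x: np.ndarray, scale_factor: int, eps: float = 1e-8) -> np.ndarray:
    """Turn a non-negative (N*I, N*J) map into one distribution per superregion."""
    rows, cols = x.shape
    n = scale_factor
    # Sum pooling: total of every n-by-n patch, shape (I, J)
    patch_sum = x.reshape(rows // n, n, cols // n, n).sum(axis=(1, 3))
    # Nearest-neighbor upsampling of the patch totals back to (N*I, N*J)
    patch_sum_up = np.repeat(np.repeat(patch_sum, n, axis=0), n, axis=1)
    # Element-wise division; eps guards against empty patches
    return x / (patch_sum_up + eps)

def infer_fine_map(coarse: np.ndarray, features: np.ndarray, n: int) -> np.ndarray:
    """Hadamard product of the upsampled coarse map and the distribution map."""
    dist = n2_normalize(features, n)
    coarse_up = np.repeat(np.repeat(coarse, n, axis=0), n, axis=1)
    return coarse_up * dist

rng = np.random.default_rng(0)
coarse = rng.uniform(1.0, 100.0, size=(3, 3))    # toy coarse-grained flows
features = rng.uniform(0.1, 1.0, size=(12, 12))  # stand-in for the network output
fine = infer_fine_map(coarse, features, n=4)
recovered = fine.reshape(3, 4, 3, 4).sum(axis=(1, 3))  # re-aggregate to coarse scale
```

Whatever values the feature map takes, re-aggregating `fine` reproduces `coarse` up to the effect of `eps`, so the structural constraint holds by construction rather than being learned.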

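Finally, we upscale $\mathbf{X}^{c}$ using nearest-neighbor upsampling [22] with scaling factor $N$ as the initial interpolation, which gives $\mathbf{X}^{c}_{up}$, and then generate the fine-grained inference by

$$\tilde{\mathbf{X}}^{f} = \mathbf{X}^{c}_{up} \odot \mathbf{H}^{f}_{\pi},$$

where $\mathbf{H}^{f}_{\pi}$ is the distribution map produced by the $N^2$-Normalization layer and $\odot$ denotes the Hadamard product.
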
3.2 External Factor Fusion

External factors, such as weather, can have a complicated and vital influence on the flow distribution over the subregions. For instance, even if the total population in a city remains stable over time, during stormy weather people tend to move from outdoor regions to indoor regions. When different external factors entangle, the actual impact on the flow becomes more difficult to capture. Therefore, we design a subnet to handle external factors all at once.

In particular, we first separate the available external factors into two groups, continuous features and categorical features. Continuous features, including temperature and wind speed, are directly concatenated to form a vector $\mathbf{e}_{con}$. As shown in Figure 3, categorical features include the day of the week, the time of day and the kind of weather (e.g., sunny, rainy). Inspired by previous studies [20], we transform the categorical attributes into low-dimensional vectors by feeding them into separate embedding layers, and then concatenate those embeddings to construct the categorical vector $\mathbf{e}_{cat}$. Then, the concatenation of $\mathbf{e}_{con}$ and $\mathbf{e}_{cat}$ gives the final external embedding, with $\mathbf{e} = [\mathbf{e}_{con}; \mathbf{e}_{cat}]$.
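
As a rough illustration of how the external embedding is assembled, the sketch below looks up one embedding per categorical attribute and concatenates the results with the continuous features. The vocabulary sizes, embedding widths, attribute names and the randomly initialized tables are all hypothetical; in a trained model the tables would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables: (number of categories, embedding width)
embedding_tables = {
    "day_of_week": rng.normal(size=(7, 2)),
    "hour_of_day": rng.normal(size=(24, 3)),
    "weather": rng.normal(size=(10, 3)),
}

def external_embedding(continuous: np.ndarray, categorical: dict) -> np.ndarray:
    """Concatenate continuous features with looked-up categorical embeddings."""
    looked_up = [embedding_tables[name][index] for name, index in categorical.items()]
    e_cat = np.concatenate(looked_up)           # categorical vector
    return np.concatenate([continuous, e_cat])  # final external embedding

e = external_embedding(
    continuous=np.array([0.35, 0.10]),  # e.g. scaled temperature and wind speed
    categorical={"day_of_week": 2, "hour_of_day": 10, "weather": 4},
)
```

The resulting vector has one entry per continuous feature plus the sum of the embedding widths, here 2 + 2 + 3 + 3 = 10.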

Once we obtain the concatenation vector $\mathbf{e}$, we feed it into a feature extraction module whose structure is depicted in Figure 3. By using dense layers, the different external factors are compounded to construct a hidden representation, which models their complicated interaction. The module provides two outputs: the coarse-grained feature maps $\mathbf{H}^{c}_{e}$ and the fine-grained feature maps $\mathbf{H}^{f}_{e}$, where $\mathbf{H}^{f}_{e}$ is obtained by passing $\mathbf{H}^{c}_{e}$ through sub-pixel blocks similar to the ones in the inference network. Intuitively, $\mathbf{H}^{c}_{e}$ (respectively $\mathbf{H}^{f}_{e}$) is the spatial encoding for $\mathbf{e}$ in the coarse-grained (fine-grained) setting, modeling how each superregion (subregion) individually responds to the external factors. Therefore we concatenate $\mathbf{H}^{c}_{e}$ with $\mathbf{X}^{c}$, and $\mathbf{H}^{f}_{e}$ with $\mathbf{H}^{f}$ in the inference network. The early fusion of $\mathbf{H}^{c}_{e}$ and $\mathbf{X}^{c}$ allows the network to learn to extract a high-level feature describing not only the citywide flow, but also the external factors. In addition, the fine-grained $\mathbf{H}^{f}_{e}$ carries the external information all the way to the rear of the inference network, playing a similar role to an information highway, and thus prevents the information from being lost in the deep network.

3.3 Loss Function

UrbanFM provides an end-to-end mapping from coarse-grained input to fine-grained output, which is differentiable everywhere. Therefore, we can train the network through automatic back-propagation, by providing training pairs $(\mathbf{X}^{c}, \mathbf{X}^{f})$ and calculating the empirical loss between $\mathbf{X}^{f}$ and $\tilde{\mathbf{X}}^{f}$, where $\mathbf{X}^{f}$ is the ground truth and $\tilde{\mathbf{X}}^{f}$ is the outcome inferred by our network. As pixel-wise Mean Square Error (MSE) is a widely used cost function in many SR tasks, we employ the same in this work as follows:

$$\mathcal{L}(\Theta) = \big\| \mathbf{X}^{f} - \tilde{\mathbf{X}}^{f} \big\|_{F}^{2},$$

where $\Theta$ denotes the set of parameters in UrbanFM.

4 Urban Pyramid Network

Fig. 4: The UrbanPy framework, shown for three successive upscaling scales. We employ a cascading strategy to progressively upsample the coarse-grained inputs. At each level, we first prepare features from the input flow map and external factors. Then a propose-and-correct component is used to cooperatively produce two views of the target distribution map. We aggregate and renormalize the two views using $N^2$-Normalization to give the map of mixture distributions, which then serves as an input for the next level. The foremost distribution map is initialized with an all-ones matrix. Note that we omit nearest-neighbor upsampling and the Hadamard product to avoid redundant presentation of elements shown in Figure 3.

The design of UrbanFM has followed two key principles: first, reconstruct a fine-grained flow map according to high-level features; second, embed the structural constraint in the model design. Maintaining those two principles, we now present UrbanPy which advances the UrbanFM framework by resolving three limitations: 1) a single upsampling process, 2) non-distinguishable features and 3) inconsistent MSE loss. The overall architecture is depicted in Figure 4.

4.1 Pyramid Architecture

Taking a coarse-grained flow map $\mathbf{X}^{c}$ as input, our model decomposes the high-scale $N\times$ upsampling task into $K$ consecutive upsampling subtasks with upsampling factors $N_1, \dots, N_K$ respectively, where $N = \prod_{k=1}^{K} N_k$. (For simplicity, the upsampling rates for the width and height are described as being the same here, but they need not be in the general case.) In accordance with the decomposition, UrbanPy employs $K$ components of similar structure, where the $k$-th component upsamples from scale $\prod_{l<k} N_l$ to scale $\prod_{l \le k} N_l$. The common component structure includes two modules: 1) feature extraction and 2) propose-and-correct. The feature extraction module abstracts a higher-level representation of the original input and transforms the feature maps from a lower scale to a higher scale. Then the processed representations are fed into a proposal network and a correction network, which collaboratively infer the flow distributions at the $k$-th scale. We describe the two modules further in the following.
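
The decomposition can be illustrated with a small NumPy sketch. It assumes three atomic 2x components (an overall 8x task) and replaces each learned component with a hypothetical placeholder, `uniform_component`, that spreads every flow value evenly over its subregions. All function names are illustrative, and as a simplification each level here consumes the previous level's flow map directly instead of passing distribution maps forward.

```python
import numpy as np

def cumulative_scales(factors):
    """Overall scale reached after each component, e.g. [2, 2, 2] -> [2, 4, 8]."""
    return [int(s) for s in np.cumprod(factors)]

def uniform_component(flow: np.ndarray, n: int) -> np.ndarray:
    """Placeholder for a learned component: split each flow evenly over n*n subregions."""
    up = np.repeat(np.repeat(flow, n, axis=0), n, axis=1)
    return up / (n * n)

def progressive_upsample(coarse, factors, component=uniform_component):
    """Apply one component per factor and keep the output of every level."""
    outputs = []
    flow = coarse
    for n in factors:
        flow = component(flow, n)
        outputs.append(flow)
    return outputs

coarse = np.arange(1.0, 5.0).reshape(2, 2)  # toy 2 x 2 coarse-grained map
factors = [2, 2, 2]
scales = cumulative_scales(factors)
levels = progressive_upsample(coarse, factors)
```

Each level doubles the spatial resolution, and because every placeholder step only redistributes flow within a superregion, the total flow is preserved at every intermediate scale.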

4.1.1 Feature Extraction

The input to the feature extraction component is a feature tensor $\mathbf{H}$ generated by the previous feature extraction component. Given $\mathbf{H}$, we first use a number of residual blocks of the same layout to construct a high-level representation without changing the spatial dimensions. In order to meet the upscaled dimension at the current level, we employ a sub-pixel block to enlarge the spatial dimension of this representation and produce upsampled feature maps. The two resulting feature tensors thus provide two different views of the inputs, and are used as separate features for the following proposal network and correction network, respectively. For simplicity, we set for

Highway Connection. Larger upscaling rates require stacking more feature extraction components, leading to a deeper network architecture. Although we have utilized residual blocks to facilitate gradient passage during back-propagation, the existence of other layers (e.g., upsampling) can affect the dynamics such that the deeper network becomes harder to train. Therefore, we add highway connections (denoted as blue arrows in Figure 4) across components such that previous features can be directly reused in the deeper layers. Specifically, given the list of representations produced at previous levels, we upsample them to the scale of the current level and then aggregate them with the current representation by addition. This process resembles a moving average with equal decaying factors.

External Factor Fusion. We employ the same external factor fusion module used in UrbanFM for extracting the external features. At each level, we use a sub-pixel block to upsample the external feature maps from the previous scale to the current scale. We use the same weights in each upsampler to reduce the number of parameters, as we observed no obvious advantage of using distinct weights in our experiments.

4.1.2 Propose and Correct

Fig. 5: Structure of the correction net, the proposal net and the non-shared convolution layer. (a) and (c) use black dots to denote concatenation, and include the extra geographic feature for obtaining local structure. (b) uses different colors to denote different kernel weights; similar to UrbanFM, it outputs a feature map before $N^2$-Normalization.

Inspired by the idea of a Laplacian pyramid, where the difference between a blurred image and its original image is modeled, we apply a similar idea as a propose-and-correct architecture such that the proposal network aims to generate a prototype distribution map, and the correction network is responsible for correcting the prototype.

Specifically, given the representations from the feature extraction module, the proposal network produces a prototype distribution map by incorporating the flow distributions produced at the previous level (see the dotted lines in Figure 4). Note that these connections give the whole network a cascade structure, which improves the consistency between different levels, as there are obvious correlations between flows at different granularities. The interior structure of the proposal network is depicted in Figure 5(a), where we stack a number of residual blocks to strengthen the capacity of the proposal network, followed by an $N^2$-Normalization layer to enforce the hierarchical structure of the output distributions at the current scale. The dotted elements are described in Section 4.2.

Taking its two input representations, the correction network employs a structure similar to that of the proposal network (i.e., convolutional layers followed by $N^2$-Normalization) to generate a correction distribution map. However, the correction network is designed to be lightweight. One reason is that adjusting the distribution based on the proposal network is a simpler task. Moreover, the correction network can take advantage of the weights of the upsampling block from the feature extraction branch, whose output also serves as direct input to the correction network. Therefore, we design the correction network with just one convolutional layer to transform the upsampled feature maps.

Once we obtain the prototype $\mathbf{P}$ and the correction $\mathbf{C}$, we add these two distribution maps in an element-wise manner, and then renormalize the result to generate the final distribution map $\mathbf{D}$. That is,

$$\mathbf{D} = N^2\text{-Normalization}(\mathbf{P} + \mathbf{C}),$$

where the normalization is applied at the current scale. Akin to UrbanFM, the generated distribution map is Hadamard-multiplied by the interpolated coarse-grained input to give the inferred fine-grained flow prediction at the current scale.

4.1.3 A Probabilistic View

Note that the UrbanPy architecture is different from Laplacian pyramid models [16, 15, 33]. The latter add the interpolated image to the predicted residual image to obtain the final inference. However, in the FUFI problem we are concerned with modeling flow distributions, and distributions are not closed under addition (i.e., $p + q \notin \mathcal{P}$ where $p, q \in \mathcal{P}$ and $\mathcal{P}$ is the set of distributions). Instead, UrbanPy infers the final distribution map as a symmetric mixture density of the prototype and the correction, where each super-region is represented by a mixture probability distribution.
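
The mixture interpretation can be verified on toy data. In the following NumPy sketch, the helper `normalize_patches` and the random prototype and correction maps are illustrative assumptions. Adding two distribution maps makes every patch sum to 2, and renormalizing the sum yields the equal-weight mixture of the two.

```python
import numpy as np

def normalize_patches(x: np.ndarray, n: int, eps: float = 1e-12) -> np.ndarray:
    """Make every n-by-n patch of a non-negative map sum to one."""
    rows, cols = x.shape
    totals = x.reshape(rows // n, n, cols // n, n).sum(axis=(1, 3))
    totals_up = np.repeat(np.repeat(totals, n, axis=0), n, axis=1)
    return x / (totals_up + eps)

rng = np.random.default_rng(1)
n = 2
prototype = normalize_patches(rng.uniform(0.1, 1.0, size=(4, 4)), n)
correction = normalize_patches(rng.uniform(0.1, 1.0, size=(4, 4)), n)

summed = prototype + correction         # no longer a distribution per patch
mixture = normalize_patches(summed, n)  # renormalized: equal-weight mixture
patch_totals = summed.reshape(2, 2, 2, 2).sum(axis=(1, 3))
```

Here `patch_totals` is 2 everywhere, while `mixture` coincides with `0.5 * (prototype + correction)`, i.e. a symmetric mixture in which both views carry the same weight.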

4.2 Local Structure

Each super-region in the coarse-grained map can cover a very large area. For instance, each grid in the 8-by-8 granularity covers an area of about 6.25 km$^2$ in Beijing. The geographic properties (building layout, road plan) of a grid can vary significantly from one grid to another. To capture such local specificity, we include the geographic features as additional knowledge and employ a non-shared convolution layer to allow customized feature extraction for each super-region.

4.2.1 Geographic Embeddings

For each level $k$, we obtain level-specific geographic features including POIs and the road network. For POIs, we obtain a set of POI density maps of the city across different categories (e.g., education and entertainment), which results in a (raw) feature tensor with one channel per POI category. Likewise, in terms of the road network structure, we obtain the tier-1, tier-2 and tier-3 road density for each region and construct a road network feature tensor. Then the geographic feature is given by the concatenation of both along the last dimension. This feature, however, is very sparse. To mitigate the sparsity, we perform feature compression by pre-training a Stacked Denoising Auto-encoder [32] and then use the hidden code as compressed knowledge of the raw features. Eventually, each grid can be represented by an individual compact embedding.

4.2.2 Non-shared Convolution

The UrbanFM model uses classic convolution layers in which the kernel weights are shared globally, which can be insufficient to capture the peculiarity of each superregion. Therefore, besides incorporating geographic features, we use separate weights to produce a more local representation of each super-region before applying $N^2$-Normalization, as shown in Figure 5.

First, we define the classic discrete convolution operation according to the formulation from [35] as follows (footnote 2: to keep consistent, we use an even kernel width here; an odd kernel width simply modifies ). Let be a discrete function. Let and let be a discrete filter of size . Then the is defined as:


To generalize the classic convolution, we introduce a set of kernels where and define the local non-shared convolution as:


The kernel width is thus equal to the convolution stride. Specifically, in each level , we set = to force each kernel to focus on a specific super-region, as shown in Figure 5(c). To reduce the parameter cost of customization, a bottleneck layer [9] is deployed in advance to compress the channels of the feature maps provided by previous layers to 2 for simplicity.
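To make the idea of unshared weights concrete, here is a minimal single-channel sketch of a locally connected convolution whose stride equals the kernel width, so that every non-overlapping block has its own kernel. The function name `non_shared_conv`, the one-output-per-block shape, and the toy kernels are assumptions for illustration; the exact formulation used by UrbanPy is not recoverable from the text here.

```python
def non_shared_conv(feature_map, kernels, k):
    """Apply a separate k x k kernel to each non-overlapping k x k block.

    feature_map: H x W nested list (single channel, for brevity).
    kernels: (H // k) x (W // k) grid of k x k kernels, one per block.
    The stride equals the kernel width, so blocks never overlap.
    """
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for bi in range(h // k):
        row = []
        for bj in range(w // k):
            kern = kernels[bi][bj]  # weights owned by this block only
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += kern[di][dj] * feature_map[bi * k + di][bj * k + dj]
            row.append(acc)
        out.append(row)
    return out

x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
ones = [[1, 1], [1, 1]]
zeros = [[0, 0], [0, 0]]
# Each 2 x 2 block gets its own kernel (toy values).
kernels = [[ones, zeros],
           [zeros, ones]]
y = non_shared_conv(x, kernels, k=2)
print(y)  # [[14.0, 0.0], [0.0, 54.0]]
```

A shared convolution would reuse one kernel for all four blocks; here the parameter count grows with the number of blocks, which is why the text compresses the channels with a bottleneck layer first.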

4.3 Distributional Loss

UrbanFM measures the MSE between the ground-truth flow and the predicted flow values and uses it as the loss to optimize the network parameters. Such a loss is straightforward but ignores the underlying distributional structure of the problem. To resolve this inconsistency, we step deeper into the problem by directly measuring the divergence between the ground-truth distributions and the predicted flow distributions.

In particular, in level , we first prepare the ground truth =, where indicates nearest neighbour upsampling by scale . Once we obtain the inferred distribution map that contains different distributions, the total distributional loss between the two sets of distributions is computed via KL-divergence:

The distributional loss exploits the essence of the model and is defined over each superregion-to-subregion distribution. Using alone can train the network to convergence, but its asymmetry can produce unstable gradients, which slows down the training process [1]. Therefore, we combine the MSE loss and the distributional loss at each level and aggregate over all levels to constitute the overall loss function:


where is the coefficient to control the scale of the two losses. In experiments, we set =1e- by default to balance the magnitude of the gradients from the two losses, giving stable multi-task training according to [3].
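The combined objective can be sketched as follows. This is a hedged illustration, not the authors' implementation: the direction of the KL term (ground truth relative to prediction), the placement of the coefficient on the distributional term, the value `lam=0.01`, and all function names are assumptions.

```python
import math

def mse(pred, truth):
    """Mean squared error over flattened flow values."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)

def kl_divergence(truth_dist, pred_dist, eps=1e-12):
    """KL(truth || pred) for one super-region; eps guards against log(0)."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(truth_dist, pred_dist) if t > 0)

def level_loss(pred_flow, true_flow, pred_dists, true_dists, lam):
    """MSE plus lam times the summed KL over all super-regions of one level."""
    dist = sum(kl_divergence(t, p) for t, p in zip(true_dists, pred_dists))
    return mse(pred_flow, true_flow) + lam * dist

def total_loss(levels, lam):
    """Aggregate the per-level losses over all pyramid levels."""
    return sum(level_loss(*lvl, lam) for lvl in levels)

# One level with a single super-region of coarse flow 10 (toy values).
coarse = 10.0
true_dist = [0.5, 0.3, 0.1, 0.1]
pred_dist = [0.4, 0.4, 0.1, 0.1]
true_flow = [coarse * d for d in true_dist]
pred_flow = [coarse * d for d in pred_dist]

loss = total_loss([(pred_flow, true_flow, [pred_dist], [true_dist])], lam=0.01)
perfect = total_loss([(true_flow, true_flow, [true_dist], [true_dist])], lam=0.01)
print(loss, perfect)
```

Because the MSE term acts on flow volumes and the KL term on normalized distributions, their magnitudes differ, which is the reason a balancing coefficient is needed.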

5 Experiments

Our experiments aim to quantitatively and qualitatively examine the capacity of the two presented models in a citywide scenario. Therefore, we conduct extensive experiments using taxi flows in Beijing to comprehensively test the models from different aspects. Different from the preliminary evaluations that focus only on upsampling, we conduct experiments involving four different scales. In addition to the citywide setting, we conduct further experiments in a theme park, namely Happy Valley, to show the adaptability of our models to a relatively small area.

5.1 Experimental Settings

5.1.1 Datasets

Table I details the two datasets we use, namely TaxiBJ and HappyValley, where each dataset contains two sub-datasets: urban flows and external factors. Since fine-grained flow data are available as ground truth, we obtain the coarse-grained flows by aggregating subregion flows from the fine-grained counterparts. As our empirical evaluations span multiple scales, we obtain data at each granularity separately. When conducting experiments for upscaling, we aggregate the subregions in a area to generate the flows of the corresponding superregion.
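The aggregation step amounts to sum pooling over non-overlapping blocks. A minimal sketch, assuming a square block of side `n` (the upscaling factor) and a toy 4-by-4 fine-grained map; the function name `aggregate` is hypothetical:

```python
def aggregate(fine, n):
    """Sum each non-overlapping n x n block of subregions into one superregion."""
    h, w = len(fine), len(fine[0])
    assert h % n == 0 and w % n == 0, "map size must be divisible by n"
    return [[sum(fine[i * n + di][j * n + dj]
                 for di in range(n) for dj in range(n))
             for j in range(w // n)]
            for i in range(h // n)]

fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
coarse = aggregate(fine, 2)
print(coarse)  # [[14, 22], [46, 54]]
```

Summation (rather than averaging) preserves the total flow volume, so each superregion value equals the sum of its subregion values.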


  • TaxiBJ (see our GitHub): This dataset records the taxi flows traveling throughout Beijing. Figure 1 gives an example in which the studied area is split into 32×32 grids, where each grid reports the coarse-grained flow information every 30 minutes within four different periods, P1 to P4 (detailed in Table I). In our experiments, we partition the data into non-overlapping training, validation and test sets by a ratio of 2:1:1 for each period. For example, in P1 (7/1/2013-10/31/2013), we use the first two months of data as the training set, the next month as the validation set, and the last month as the test set. With this dataset, we construct coarse-grained data of four different granularities (i.e., 64×64, 32×32, 16×16 and 8×8) as the coarse-grained inputs, targeting upsampling factors of 2, 4, 8 and 16, respectively. The data partition for each granularity is the same.

  • HappyValley: We obtain this dataset by crawling an open data source that provides hourly gridded crowd flow observations for a theme park named Beijing Happy Valley, with a total 5 area coverage, from 1/1/2018 to 10/31/2018. As shown in Figure 6, we partition this area with 25×50 uniform grids in the coarse-grained setting and target a fine granularity of 50×100, i.e., an upscaling factor of 2. Note that in this dataset, one special external factor is the ticket price, including day price and night price, obtained from the official HappyValley account on WeChat. Given the smaller area, crowd flows exhibit large variance across samples at the 1-hour sampling rate. Thus, we use a ratio of 8:1:1 to split the training, validation, and test sets to provide more training data.
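The two partition ratios can be illustrated with a small helper. This is a sketch under assumptions: the samples are taken to be in time order and split contiguously (the text states this by month for TaxiBJ but does not say how the HappyValley split is ordered), and `chronological_split` is a hypothetical name.

```python
def chronological_split(samples, ratio):
    """Split time-ordered samples into train/validation/test by an integer ratio."""
    total = sum(ratio)
    n = len(samples)
    train_end = n * ratio[0] // total
    val_end = train_end + n * ratio[1] // total
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]

# Toy stand-ins for time-ordered flow maps.
months = ["Jul", "Aug", "Sep", "Oct"]
train, val, test = chronological_split(months, (2, 1, 1))
print(train, val, test)  # ['Jul', 'Aug'] ['Sep'] ['Oct']

hours = list(range(10))
tr, va, te = chronological_split(hours, (8, 1, 1))
print(len(tr), len(va), len(te))  # 8 1 1
```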

| Dataset | TaxiBJ | HappyValley |
|---|---|---|
| Time span | P1: 7/1/2013-10/31/2013; P2: 2/1/2014-6/30/2014; P3: 3/1/2015-6/30/2015; P4: 11/1/2015-3/31/2016 | 1/1/2018-10/31/2018 |
| Time interval | 30 minutes | 1 hour |
| Coarse-grained size | 32×32 | 25×50 |
| Fine-grained size | 128×128 | 50×100 |
| Upscaling factor (N) | 4 | 2 |
| External factors (meteorology, time and event) | | |
| Weather (e.g., Sunny) | 16 types | 8 types |
| Temperature/℃ | [-24.6, 41.0] | [-15.0, 39.0] |
| Wind speed/mph | [0, 48.6] | [0.1, 15.5] |
| # Holidays | 41 | 33 |
| Ticket price/¥ | / | [29.9, 260] |
| Geographic features | | |
| Road network density | | / |
| Point of Interest density | | / |
TABLE I: Dataset Description.
Methods Upscale P1-RMSE P1-MAE P1-MAPE P2-RMSE P2-MAE P2-MAPE P3-RMSE P3-MAE P3-MAPE P4-RMSE P4-MAE P4-MAPE
MEAN 2 16.899 8.931 2.935 21.557 11.373 3.477 22.111 11.876 3.633 15.369 8.218 2.766
HA 2 3.494 1.723 0.306 3.932 1.933 0.305 4.072 2.002 0.299 3.063 1.547 0.287
SRCNN 2 3.216 1.793 0.433 3.500 1.998 0.468 3.587 2.034 0.446 2.861 1.643 0.422
VDSR 2 3.203 1.750 0.387 3.523 1.894 0.325 3.575 1.965 0.369 2.776 1.533 0.337
ESPCN 2 3.170 1.789 0.433 3.472 1.969 0.432 3.564 2.014 0.419 2.813 1.608 0.398
SRResNet 2 3.101 1.742 0.430 3.383 1.840 0.351 3.481 1.889 0.336 2.732 1.518 0.347
UrbanFM 2 3.015 1.553 0.265 3.344 1.729 0.260 3.415 1.783 0.262 2.675 1.397 0.248
DeepSD 2 3.216 1.793 0.433 3.500 1.998 0.468 3.587 2.034 0.446 2.861 1.643 0.422
LapSRN 2 3.202 1.763 0.396 3.468 1.900 0.370 3.584 1.959 0.360 2.797 1.545 0.338
UrbanPy 2 3.093 1.578 0.268 3.420 1.758 0.264 3.584 1.843 0.268 2.759 1.428 0.252
MEAN 4 20.918 12.019 4.469 20.918 12.019 5.364 27.442 16.029 5.612 19.049 11.070 4.192
HA 4 4.741 2.251 0.336 5.381 2.551 0.334 5.594 2.674 0.328 4.125 2.023 0.323
SRCNN 4 4.297 2.491 0.714 4.612 2.681 0.689 4.815 2.829 0.727 3.838 2.289 0.665
VDSR 4 4.159 2.213 0.467 4.586 2.498 0.486 4.730 2.548 0.461 3.654 1.978 0.411
ESPCN 4 4.206 2.497 0.732 4.569 2.727 0.732 4.744 2.862 0.773 3.728 2.228 0.711
SRResNet 4 4.164 2.457 0.713 4.524 2.660 0.688 4.690 2.775 0.717 3.667 2.189 0.637
UrbanFM 4 3.991 2.036 0.331 4.374 2.256 0.322 4.539 2.348 0.323 3.526 1.831 0.310
DeepSD 4 4.156 2.368 0.614 4.554 2.612 0.621 4.692 2.739 0.682 3.877 2.297 0.652
LapSRN 4 3.997 2.040 0.339 4.353 2.235 0.324 4.539 2.343 0.330 3.531 1.841 0.315
UrbanPy 4 3.949 1.997 0.330 4.359 2.227 0.323 4.519 2.319 0.326 3.514 1.821 0.314
MEAN 8 22.565 13.205 5.221 28.903 16.871 6.305 29.677 17.617 6.587 20.606 12.168 4.882
HA 8 5.629 2.682 0.442 6.429 3.058 0.443 6.717 3.211 0.431 4.959 2.433 0.423
SRCNN 8 6.103 3.433 1.027 6.569 3.708 0.971 6.959 4.012 1.086 5.518 3.181 0.935
VDSR 8 5.178 2.681 0.580 5.482 2.821 0.489 5.878 3.069 0.543 4.623 2.416 0.481
ESPCN 8 4.854 2.664 0.664 5.291 2.854 0.580 5.529 2.981 0.570 4.311 2.368 0.547
SRResNet 8 4.783 2.554 0.579 5.215 2.807 0.572 5.492 2.935 0.551 4.298 2.297 0.487
UrbanFM 8 4.748 2.373 0.377 5.224 2.647 0.366 5.488 2.776 0.358 4.195 2.130 0.342
DeepSD 8 5.412 3.056 0.831 5.680 3.175 0.756 6.023 3.397 0.804 4.733 2.452 0.450
LapSRN 8 4.772 2.355 0.401 5.197 2.578 0.375 5.456 2.721 0.383 4.209 2.141 0.369
UrbanPy 8 4.572 2.237 0.352 5.003 2.476 0.339 5.259 2.610 0.342 4.071 2.052 0.336
MEAN 16 23.157 13.622 5.540 29.695 17.390 6.722 30.521 18.165 7.035 21.193 12.554 5.191
HA 16 6.307 2.920 0.459 7.289 3.358 0.461 7.619 3.534 0.448 5.597 2.658 0.442
SRCNN 16 9.987 5.699 1.779 9.198 5.148 1.647 10.912 6.226 1.822 8.181 4.618 1.490
VDSR 16 6.313 2.994 0.551 6.780 3.216 0.468 7.074 3.429 0.517 5.758 2.867 0.572
ESPCN 16 5.373 2.914 0.762 5.956 3.211 0.750 6.256 3.422 0.806 4.892 2.699 0.715
SRResNet 16 5.381 2.672 0.504 5.959 3.056 0.594 6.295 3.189 0.559 5.066 2.596 0.551
UrbanFM 16 5.344 2.573 0.383 5.968 2.898 0.371 6.241 3.033 0.366 4.839 2.362 0.368
DeepSD 16 5.983 3.223 0.811 6.392 3.380 0.725 6.818 3.644 0.787 5.400 2.671 0.441
LapSRN 16 5.244 2.449 0.357 5.820 2.729 0.339 6.104 2.879 0.343 4.860 2.321 0.346
UrbanPy 16 5.204 2.434 0.357 5.750 2.719 0.352 6.061 2.868 0.346 4.778 2.290 0.350
TABLE II: Quantitative results for model comparison. We conducted inference experiments for four different scales, all targeting the same endpoint at 128×128 resolution. For each scale, the results for single-pass and progressive upscaling are presented separately. Across all methods, we use underlined bold, bold and underline to indicate the best, second-best and third-best performance, respectively. This helps identify how performance changes as the upscaling factor grows.

5.1.2 Evaluation Metrics

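We use three common metrics for urban flow data to evaluate the model performance from different facets. Specifically, Root Mean Square Error (RMSE) is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{z}\sum_{s=1}^{z}\left(x_s - \hat{x}_s\right)^2},$$

where $z$ is the total number of samples, $\hat{x}_s$ is the $s$-th inferred value and $x_s$ is the corresponding ground truth. Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) are defined as $\mathrm{MAE} = \frac{1}{z}\sum_{s=1}^{z}\left|x_s - \hat{x}_s\right|$ and $\mathrm{MAPE} = \frac{1}{z}\sum_{s=1}^{z}\left|x_s - \hat{x}_s\right| / \left|x_s\right|$. In general, RMSE favors spiky distributions, while MAE and MAPE focus more on the smoothness of the outcome. Smaller metric scores indicate better model performance.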
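For reference, the three metrics can be computed as in the sketch below. The handling of zero ground-truth values in MAPE (they are skipped) is an assumption, since the text does not say how zero-flow cells are treated; the sample values are made up.

```python
import math

def rmse(pred, truth):
    """Root mean square error over paired values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def mae(pred, truth):
    """Mean absolute error over paired values."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def mape(pred, truth):
    """Mean absolute percentage error; zero ground-truth entries are skipped (assumption)."""
    pairs = [(p, t) for p, t in zip(pred, truth) if t != 0]
    return sum(abs(p - t) / abs(t) for p, t in pairs) / len(pairs)

truth = [10.0, 20.0, 40.0]
pred = [12.0, 18.0, 40.0]
print(rmse(pred, truth))  # sqrt(8/3), about 1.633
print(mae(pred, truth))   # 4/3, about 1.333
print(mape(pred, truth))  # about 0.1
```

The squared term in RMSE makes it more sensitive to a few large errors than MAE, which is one way to read the remark about spiky distributions.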

Fig. 6: Visualization of crowd flows in HappyValley.

5.1.3 Baselines

We compare our proposed models with eight baselines that belong to the following three classes: (1) heuristics, (2) single-pass upsampling and (3) progressive upsampling. The heuristic methods are designed based on intuition or empirical knowledge. In the single-pass category, we include four state-of-the-art methods for single image super-resolution, the domain that inspired the design of UrbanFM. In the progressive upsampling branch, we include two methods with different progressive strategies, stacking and cascading, where one is the state of the art in statistical upsampling for climate data and the other is for image super-resolution. We detail them as follows:

Heuristic methods:


  • Mean Partition (Mean): We evenly distribute the flow volume from each superregion in a coarse-grained flow map to its N×N subregions, where N is the upscaling factor.

  • Historical Average (HA): Similar to distributional upsampling, HA treats the value over each subregion as a fraction of the value in the respective superregion, where the fraction is computed by averaging over all training data.
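The two heuristics can be sketched as below. One detail is an assumption: the HA fraction is computed here from flows accumulated over the whole training set, which is one reading of "averaging all training data"; the uniform fallback for empty superregions and the function names are also hypothetical.

```python
def mean_partition(coarse, n):
    """Spread each superregion's flow evenly over its n x n subregions."""
    return [[coarse[i // n][j // n] / (n * n)
             for j in range(len(coarse[0]) * n)]
            for i in range(len(coarse) * n)]

def historical_fractions(train_fine_maps, n):
    """Per-subregion share of its superregion, accumulated over the training maps."""
    h, w = len(train_fine_maps[0]), len(train_fine_maps[0][0])
    total = [[sum(m[i][j] for m in train_fine_maps) for j in range(w)]
             for i in range(h)]
    fractions = [[0.0] * w for _ in range(h)]
    for bi in range(h // n):
        for bj in range(w // n):
            cells = [(bi * n + di, bj * n + dj)
                     for di in range(n) for dj in range(n)]
            block = sum(total[i][j] for i, j in cells)
            for i, j in cells:
                # Fall back to a uniform share if the superregion never had flow.
                fractions[i][j] = total[i][j] / block if block > 0 else 1.0 / (n * n)
    return fractions

def historical_average(coarse, fractions, n):
    """Distribute each superregion's flow according to the historical fractions."""
    return [[coarse[i // n][j // n] * fractions[i][j]
             for j in range(len(fractions[0]))]
            for i in range(len(fractions))]

train = [[[2, 2], [4, 0]],
         [[2, 2], [4, 0]]]          # two toy training maps, one superregion
fractions = historical_fractions(train, n=2)
coarse = [[8.0]]
print(mean_partition(coarse, 2))                 # [[2.0, 2.0], [2.0, 2.0]]
print(historical_average(coarse, fractions, 2))  # [[2.0, 2.0], [4.0, 0.0]]
```

Both heuristics preserve the superregion total; they differ only in how that total is distributed.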

Single-pass methods:


  • SRCNN [4]: SRCNN was the first successful application of convolutional neural networks (CNNs) to SR problems. It consists of three layers: patch extraction, non-linear mapping, and reconstruction. Filters of spatial sizes 9×9, 1×1, and 5×5 were used, respectively. The number of filters in the two convolutional layers is 64 and 32, respectively. In SRCNN, the low-resolution input is upscaled to the high-resolution space using a single filter (commonly bicubic interpolation) before reconstruction.

  • ESPCN [24]: Bicubic interpolation used in SRCNN is a special case of the deconvolutional layer. To overcome the low efficiency of such a deconvolutional layer, the Efficient Sub-Pixel Convolutional Neural Network (ESPCN) employs a sub-pixel convolutional layer that aggregates the feature maps from LR space and builds the SR image in a single step.

  • VDSR [13]: Since both SRCNN and ESPCN follow a three-stage architecture, they have several drawbacks such as slow convergence and limited representation ability. Inspired by VGG-net, Kim et al. present a super-resolution method using very deep neural networks with depth up to 20. This study suggests that a large depth is necessary for the task of SR.

  • SRResNet [18]: SRResNet enhances VDSR by using the residual architecture presented by He et al. [9]. The residual architecture allows one to stack a much larger number of network layers and underlies many benchmark methods in computer vision tasks.

Progressive methods:


  • DeepSD [31]: DeepSD is the state-of-the-art method for statistical upscaling (i.e., super-resolution) of meteorological data. It employs the stacking strategy by independently training multiple SRCNNs, each targeting one intermediate level. It performs further upsampling by simply stacking those pretrained SRCNNs. This method, however, is slow, as it needs to perform interpolation first and then extract features on the large feature maps. Another related work [40] employs the same technique for a different task. (Footnote 5: As stacking UrbanFMs gives similar or worse results than LapSRN at large scales, we show the results for DeepSD and LapSRN.)

  • LapSRN [15]: LapSRN is named after its Laplacian pyramid structure. At each level, it predicts a residual image and adds it to the interpolated output from the previous level to construct the current prediction. The training of LapSRN employs a cascading strategy such that the whole model is trained end to end.

5.1.4 Variants

We study the following variants of UrbanFM to evaluate the roles of different components.


  • UrbanFM-ne: We simply remove the external factor fusion subnet from our method, which can help reveal the significance of this component.

  • UrbanFM-sl: Upon removing the external subnet, we further replace the distributional upsampling module by using sub-pixel blocks and to consider the structural constraint in this variant.

Study on the variants of UrbanPy involves choosing the different depth of the proposal network , different filter depth and filter size for each feature extraction module at each level. We denote the variants by -- and omit the name as shown in Table IV for a more succinct presentation.

5.1.5 Training Details & Hyperparameters

Our models, as well as the baselines, are implemented entirely in PyTorch and run on one TITAN V GPU. We leverage Adam, an algorithm for stochastic gradient descent, to perform network training with learning rate and use a batch size of 16 for the single-pass methods. We also apply a staircase-like schedule by halving the learning rate every 20 epochs, which allows a smoother search near the convergence point. In the external subnet, there are 128 hidden units in the first dense layer with dropout rate 0.3, and hidden units in the second dense layer. We embed DayOfWeek to , HourOfDay to and weather condition to . Besides, for VDSR and SRResNet, we use the default settings in their papers. Since SRCNN and ESPCN perform poorly with default settings, we test different hyperparameters for them and finally use 384 as the number of filters in their two convolutional layers.

For progressive methods, as training is typically much slower, we double the learning rate as well as the batch size. We stack pre-trained SRCNNs with 384 filters to construct DeepSD and use the default hyperparameters for LapSRN. For UrbanPy, we use =8 for tasks to endow the network with more power and =4 for other scales unless otherwise specified. We save the best-performing model according to the validation results and stop training early if the best model has not changed after 50 epochs.
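The staircase schedule and the early-stopping rule described above can be sketched in plain Python. The base learning rate `1e-3` is a placeholder (the actual value is not recoverable from the text), and the names `staircase_lr` and `EarlyStopper` are hypothetical.

```python
def staircase_lr(base_lr, epoch, step=20, factor=0.5):
    """Halve the learning rate every `step` epochs."""
    return base_lr * factor ** (epoch // step)

class EarlyStopper:
    """Stop when the best validation score has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = -1

    def update(self, epoch, val_score):
        """Record the score for this epoch and return True if training should stop."""
        if val_score < self.best:
            self.best, self.best_epoch = val_score, epoch
        return epoch - self.best_epoch >= self.patience

base_lr = 1e-3  # placeholder value, not taken from the paper
print([staircase_lr(base_lr, e) for e in (0, 19, 20, 45)])

# Toy validation curve: improves until epoch 3, then stalls.
stopper = EarlyStopper(patience=5)
scores = [5.0, 4.0, 3.0, 2.0] + [2.5] * 10
stop_epoch = next(e for e, s in enumerate(scores) if stopper.update(e, s))
print(stop_epoch)  # 8
```

In PyTorch, the same schedule is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`.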

Methods RMSE MAE MAPE
UrbanFM-SRResNet 5.6e-4 2.2e-4 2.2e-4
UrbanPy-LapSRN 4.0e-4 2.2e-4 4.5e-3
UrbanPy-UrbanFM 3.5e-2 6.1e-3 3.9e-2
TABLE III: P-value of Wilcoxon signed-rank test.

5.2 Results on TaxiBJ

5.2.1 Model Comparison

This subsection compares the model effectiveness against the baselines. We report the results of UrbanFM with - being 16-64 and UrbanPy with -- being 4-64-4 as our default settings. Further experiments on variants regarding different - and -- will be discussed later. Table II presents the overall performance of all methods on the TaxiBJ dataset for tasks with 2×, 4×, 8× and 16× upscaling. Due to space limitations, the key tests of significance regarding the results of this table are shown in Table III.

We summarize the table with several key observations.

  1. Single-pass. Comparing UrbanFM with the heuristic and single-pass baselines, we can see that UrbanFM consistently outperforms all of them on all metrics in all 16 groups of experiments. Take the strongest baseline, SRResNet, as an example. Averaged across all experiments, UrbanFM improves on it by 2%, 9% and 37% on RMSE, MAE, and MAPE, respectively. Accordingly, the first row in Table III validates the significance of this result. Though the backbone structures are similar, the advantage of UrbanFM over SRResNet underlines the effectiveness of the proposed distributional upsampling component and the usefulness of the features extracted by the external factor fusion module.

  2. Progressive. In the category of progressive upsampling, LapSRN is the stronger baseline, which shows the benefit of the cascading strategy in our task, as it allows a thoroughly trained network. Nevertheless, LapSRN is beaten by UrbanPy on almost all metrics across all experiments. Specifically, the average improvements shown by UrbanPy are 2%, 3%, and 10% on the three metrics. This improvement comes not only from the design principles inherited from UrbanFM, but also from the unique structure we design for UrbanPy.

  3. Single-pass vs Progressive. Comparing progressive methods against single-pass methods, we can see that progressive networks generally demonstrate improvements at larger upsampling scales (typically at 4 and larger). Examples are DeepSD versus its SRCNN counterparts and the cascading methods versus all the single-pass baselines. In particular, UrbanPy outperforms all single-pass baselines in this regard. This can be attributed to the fact that progressive methods allow the model to conduct upsampling in a smoother way instead of abruptly enlarging the output scale by a large factor. It is also worth noting that UrbanFM remains very competitive at upscaling compared to the progressive baselines. This emphasizes that the proposed -Normalization and the external factor fusion can provide significant enhancement even without smoothing the upsampling task.

Test of Significance

The Wilcoxon test is an alternative to the paired t-test when samples are from a non-normal distribution. Given two methods [A-B], we test the alternative hypothesis "The error produced by A is significantly smaller than that of B" across the 16 experimental settings (four scales for four periods) and report the p-values under the null hypothesis in Table III. The first and second rows test our method against the best baseline in the respective category. The third row compares UrbanPy against UrbanFM. All tests are significant at level 0.05.
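As an illustration of the test, the sketch below computes an exact one-sided Wilcoxon signed-rank p-value in plain Python. It uses only the four P1 RMSE pairs of UrbanFM and SRResNet from Table II, so its p-value is not comparable to the 16-setting values in Table III. The function name is hypothetical, and the exact null distribution used here assumes no tied or zero differences.

```python
def wilcoxon_signed_rank_less(a, b):
    """One-sided exact test of H1: errors in `a` tend to be smaller than in `b`.

    Returns (W+, p), where W+ is the sum of ranks of positive differences a - b
    and p = P(W+ <= observed) under the null. Assumes no ties or zero differences.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    # Rank the absolute differences from smallest (rank 1) to largest (rank n).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # counts[s] = number of subsets of {1..n} whose ranks sum to s.
    counts = [0] * (n * (n + 1) // 2 + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(len(counts) - 1, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = sum(counts[: w_plus + 1]) / 2 ** n
    return w_plus, p_value

# P1 RMSE at upscaling factors 2, 4, 8 and 16 (from Table II).
urbanfm = [3.015, 3.991, 4.748, 5.344]
srresnet = [3.101, 4.164, 4.783, 5.381]
w_plus, p_value = wilcoxon_signed_rank_less(urbanfm, srresnet)
print(w_plus, p_value)  # 0 0.0625
```

With only four pairs, the smallest attainable p-value is 1/16, which shows why the test needs all 16 settings to reach the reported significance. In practice, `scipy.stats.wilcoxon(a, b, alternative="less")` offers the same test with tie handling.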

(a) 16-64 setting
(b) 16-128 setting
Fig. 7: Study on Distributional Upsampling. Performance comparison on whether or not applying the structural constraint.

5.2.2 Studies on Effectiveness

Study on Distributional Upsampling

To examine the effectiveness of the distributional upsampling module, we compare SRResNet with UrbanFM-ne (using distributional upsampling but no external factors) and UrbanFM-sl (using structural loss instead of distributional upsampling), as shown in Figure 7. In both - settings, UrbanFM-sl, regularized by the structural loss, performs very close to SRResNet, which is not constrained at all. Under the 16-64 setting, UrbanFM-sl achieves a slightly smaller error than SRResNet, whereas under the 16-128 setting the opposite holds. In contrast, UrbanFM-ne consistently outperforms the others on all three metrics. This result verifies the superiority of the distributional upsampling module over the structural loss for imposing the structural constraint.

Study on External Factor Fusion

External impacts, though complicated, can help the network make better inferences when they are properly modeled and integrated, especially in more difficult situations with a smaller data budget. We therefore study the effectiveness of external factors by randomly subsampling the original training set according to four ratios (i.e., 10%, 30%, 50% and 100%), which correspond to four difficulty levels: hard, semi-hard, medium and easy.

As shown in Figure 8, the gap between UrbanFM and UrbanFM-ne becomes larger as we reduce the amount of training data, indicating that external factor fusion plays a more important role in providing prior knowledge when data are scarce. As the training size grows, the weight of this prior knowledge decreases, since there is overlapping information between observed traffic flows and external factors. Thus, the network may learn to capture some external impacts when given enough data. Moreover, this trend also occurs between UrbanFM and UrbanFM-sl, which illustrates that the -Normalization layer provides a strong structural prior to facilitate network training.

(a) Results on RMSE
(b) Results on MAE
Fig. 8: Study on External Factor Fusion. We reveal the contribution of the fusion network by varying the amount of available training data.
Methods Settings #Params RMSE MAE MAPE
UrbanFM 16-64 (base) 1.6M 4.107 2.118 0.322
SRResNet 20-64 1.8M 4.317 2.586 0.725
UrbanFM 20-64 1.9M 4.094 2.101 0.321
SRResNet 16-128 6.0M 4.301 2.588 0.740
UrbanFM 16-128 6.2M 4.069 2.092 0.316
SRResNet 16-256 24.2M 4.178 2.418 0.614
UrbanFM 16-256 24.4M 4.068 2.087 0.316
(a) Various - configurations of UrbanFM at
Method Settings #Params RMSE MAE MAPE
UrbanPy 4-64-4 (base) 4.7M 4.572 2.237 0.352
4-32-4 3.0M 4.731 2.326 0.379
4-128-4 11.6M 4.547 2.222 0.348
2-64-4 4.3M 4.619 2.265 0.359
8-64-4 5.6M 4.581 2.240 0.351
4-64-1 3.9M 4.576 2.239 0.350
4-64-8 5.8M 4.571 2.233 0.349
(b) Various -- configurations of UrbanPy at
TABLE IV: Study on Configurations. For both UrbanFM and UrbanPy, we are interested in the key configurations that control the network capacity and the parameter count: the filter size , the number of residual blocks and , and the number of layers of the proposal network.
Fig. 9: Ablation study on UrbanPy. We conduct three experiments on N=8 and N=16 tasks and report the mean and std of RMSE and MAE on the test sets. The configurations involve [Local Structure, Distributional Loss]. We use T to denote the presence of an element and F otherwise.

Ablation study on UrbanPy

UrbanPy embraces three enhancements over UrbanFM. To reveal the individual contribution of each element, we conduct ablation studies and plot the results in Figure 9. UrbanPy with the pyramid architecture alone (i.e., [F,F]) already outperforms UrbanFM by a large margin in all four settings. This result underscores the importance of the progressive structure and demonstrates its leading role in the improvement of the inference task. Comparing different variants of UrbanPy, we can see that the local structure (i.e., [T,F]) contributes a bit more than the distributional loss (i.e., [F,T]). It also provides more robustness, as witnessed in Figure 9c. This is not unexpected, as with the distributional loss the model needs to find a balance between RMSE and the KL-divergence and thus can vary in some situations. Nevertheless, the combination of both elements (i.e., [T,T]) achieves the best performance and good stability in all settings.

Study on Configurations

Table IV(a) compares the average performance over P1 to P4. Across all hyperparameter settings, UrbanFM consistently outperforms SRResNet, improving RMSE, MAE and MAPE by at least 2.6%, 13.7%, and 48.6%, respectively. Besides, this experiment reveals that adding more ResBlocks (larger ) or increasing the number of filters (larger ) can improve the model performance. However, these also increase the training time and memory footprint. Considering the tradeoff between training cost and performance, we use 16-64 as the default for UrbanFM.
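The "at least" figures are the minimum relative improvements over the matched configurations in Table IV(a). The short check below recomputes them from the rounded table entries; the dictionary layout and helper name are illustrative. Recomputing from rounded entries gives about 48.5% for MAPE, marginally below the reported 48.6%, which is plausibly a rounding effect.

```python
# (RMSE, MAE, MAPE) per configuration, copied from Table IV(a).
srresnet = {"20-64": (4.317, 2.586, 0.725),
            "16-128": (4.301, 2.588, 0.740),
            "16-256": (4.178, 2.418, 0.614)}
urbanfm = {"20-64": (4.094, 2.101, 0.321),
           "16-128": (4.069, 2.092, 0.316),
           "16-256": (4.068, 2.087, 0.316)}

def relative_improvement(baseline, ours):
    """Fraction by which `ours` reduces the error of `baseline`."""
    return (baseline - ours) / baseline

# Smallest improvement across configurations, for each of the three metrics.
min_gain = [min(relative_improvement(srresnet[k][m], urbanfm[k][m])
                for k in srresnet)
            for m in range(3)]
print([round(100 * g, 1) for g in min_gain])  # [2.6, 13.7, 48.5]
```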

In Table IV(b), we compare different -- combinations with 4-64-4 being the base setting. We observe the following. 1) Increasing improves the network performance, but the number of network parameters also grows quickly. As =128 outperforms =64 only marginally, we use the latter setting as default. 2) Different from the effect of , increasing can hamper the network performance. This is unlikely to be due to overfitting, as the parameter size remains smaller than that of 4-128-4. Instead, the network can introduce too many zero inputs at the high-level layers, as the receptive field at the middle level can already cover the whole coarse-grained input area when is too large. Therefore, we choose a moderate depth for the residual blocks. 3) The proposal network is not sensitive to the change of , as shown in the last two rows. We use =4 since we find it more stable for network training.

Fig. 10: Robustness along the change of data budget. We train the models by feeding different amounts of the training data of P1 for large upscaling tasks including and .

Study on Robustness to Training Data Budget

To examine model performance when less training data are available for larger-scale inference tasks, we depict the comparative results with the strongest baseline, LapSRN, in Figure 10. As it illustrates, all models improve as the training data become larger. At N=8, UrbanFM remains very competitive and is even more data efficient than LapSRN, which enjoys the progressive structure. At N=16, LapSRN achieves lower error than UrbanFM, as the structural advantages start to overcome the benefits brought by -Normalization and external features when the task becomes too difficult. Therefore, it is no surprise that UrbanPy outperforms LapSRN at all data budgets for both metrics, as it combines the advances of UrbanFM with the benefits of progressive upsampling.

(a) Model comparison.
(b) Variant comparison.
(c) Model comparison.
(d) Variant comparison.
Fig. 11: Convergence speed of various methods. Figures (a) and (b) show the convergence speed for single-pass methods using 4 upscaling factors. In contrast, we plot logarithmic scores in (c) and (d) for progressive methods with for a clearer illustration. Note that we double the training batch size for progressive methods and employ early stopping according to the best validation scores, as their training is generally slower. We use the same notation for variants as in Figure 9.

5.2.3 Study on Training Efficiency

Figure 11 plots the RMSE on the validation set during the training phase using P1-100%. Figures 11(a) and 11(b) show that UrbanFM converges much more smoothly and faster than the single-pass baselines and its variants. Specifically, Figure 11(b) suggests that this efficiency improvement can be mainly attributed to the -Normalization layer, since UrbanFM-sl converges much more slowly and fluctuates drastically even though it is constrained by the structural loss, compared with UrbanFM and UrbanFM-ne. This also suggests that learning the spatial correlation is a non-trivial task. Moreover, UrbanFM-ne behaves similarly to UrbanFM, as external factor fusion affects the training speed only subtly when training data are abundant, as suggested by the previous experiments.

The convergence curves for progressive methods are depicted in Figure 11(c). It illustrates that DeepSD cannot converge to the same level as the cascading methods do, due to the inconsistency between the stacked components, while the cascading structure allows training a more coherent network. UrbanPy converges as fast as LapSRN and can keep improving for longer, as the proposal network is a more powerful component than the simple bilinear interpolation function used by LapSRN. Figure 11(d) gives a more detailed plot that focuses on UrbanPy and its variants. The [F,F] setting converges more smoothly than the others but stops earlier and fails to improve further. [F,T] shows large fluctuations during training, as the KL-divergence can reduce stability. Nevertheless, [T,T] shows both a smooth convergence curve and continuous improvement, which explains why combining both local structure and distributional loss can outperform the state-of-the-art methods.

5.2.4 Visualization

1) Inference error. Figure 12 displays the inference error of our methods and four other baselines for a sample at the 4× task, where a brighter pixel indicates a larger error. In contrast with the baseline methods, both UrbanFM and UrbanPy achieve higher fidelity overall and in detail, which corresponds to the quantitative results in Table II. For instance, areas A and B are "hard areas" to infer, as A (Sanyuan bridge, the main entrance to downtown) and B (Sihui bridge, a huge flyover) are two of the top congestion points in Beijing. Traffic flow at these locations usually fluctuates drastically and quickly, resulting in higher inference errors. Nonetheless, our methods still produce better results in these areas. Another observation is that the SR methods (SRCNN, VDSR, and SRResNet) tend to generate blurry images compared to structural methods (HA and our methods). For instance, even though there is zero flow in area C, SR methods still generate error pixels as they overlap the predicted patches. This suggests that the FUFI problem does differ from the ordinary SR problem and requires specific designs.

2) External influence. Figure 13(a)-(d) shows that the inferred distribution over subregions varies as external factors change. To stay succinct, we present the results of UrbanFM only, as UrbanPy produces similar visualizations regarding external factors. On weekdays, at 10 a.m., people had already flowed to the office area to start their work (b); at 9 p.m., many people had returned home after a hard-working day (c). On weekends, most people stayed home at 10 a.m., but some industrious researchers remained working in the university labs. This result shows that our methods indeed capture the external influence and learn to adjust the inference accordingly.

Fig. 12: Visualization for inference errors among different methods. Best view in color.
Fig. 13: Case study on a superregion near Peking University. See our github for further dynamic analysis on this area.

5.2.5 Results on HappyValley

Table V shows model performance on the HappyValley dataset. Note that in this experiment we do not include DeepSD, since this task contains only 2× upscaling, in which case DeepSD degrades to SRCNN. One important trait of the HappyValley dataset is that it contains more spikes in the fine-grained flow distribution, which results in a much larger RMSE than in the TaxiBJ task. Nonetheless, in the single-pass branch UrbanFM remains the best method, outperforming the best baseline by 3.5%, 7.8%, and 22%; UrbanFM-ne still holds the runner-up position. Moreover, it is not unexpected to see that LapSRN is worse than UrbanFM, as the former shows no progressive superiority over UrbanFM in this task. We now move on to the progressive branch. Though the [F,F] variants perform worse than UrbanFM, as the compositional architecture can complicate training when the task is as simple as 2× upscaling, the full UrbanPy models can provide better scores than their single-pass counterpart, which validates the usefulness of the two components even when no structural advantage can be exploited.

To summarize, these experiments show that our methods not only work in the large-scale scenario but are also adaptable to smaller areas, which concludes our empirical studies.

Methods      Settings  Params  RMSE   MAE    MRE
MEAN         x         x       9.206  2.269  0.799
HA           x         x       8.379  1.811  0.549
SRCNN        768       7.4M    8.291  2.175  0.816
ESPCN        768       7.4M    8.156  2.155  0.805
VDSR         16-64     0.6M    8.490  2.128  0.756
SRResNet     16-128    5.5M    8.318  1.941  0.679
UrbanFM-sl   16-128    5.5M    8.312  1.939  0.677
UrbanFM-ne   16-128    5.5M    8.138  1.816  0.537
UrbanFM      16-128    5.6M    8.030  1.790  0.531
LapSRN       10-128    3.2M    8.249  1.832  0.547
UrbanPy-FF   8-64-4    1.2M    8.280  1.879  0.587
UrbanPy-TT   8-64-4    1.3M    8.028  1.749  0.523
UrbanPy-FF   8-128-4   4.4M    8.184  1.900  0.618
UrbanPy-TT   8-128-4   4.4M    8.332  1.732  0.508
TABLE V: Results comparison on HappyValley. We evaluate the task with 2× upscaling for this area. All models are selected based on the best validation performance, and test results are presented.


While our methods demonstrate leading performance on both small-scale (UrbanFM) and large-scale (UrbanPy) urban flow inference tasks, the current structure assumes a regular partition of the urban area. For a non-regular partition, we would need a graph that represents locations as nodes and the connections between them (e.g., road networks) as edges. Besides, UrbanPy learns more slowly than its single-pass counterpart (typically 4 times slower, as can be seen in Figure 11), because the training dynamics become more complicated with the pyramid structure, which is also noted in [16, 33]. Nevertheless, this is a trade-off between training efficiency and inference performance. We suggest that when the required upsampling scale is large, UrbanPy is the more favorable choice; if training time is the key concern or the scale is small, UrbanFM should be preferred.

6 Related Work

6.1 Image Super-Resolution

Single image super-resolution (SISR), which aims to recover a high-resolution (HR) image from a single low-resolution (LR) image, has gained increasing research attention for decades. This task finds direct applications in many areas such as face recognition [7], fine-grained crowdsourcing [29] and HDTV [23]. Over the years, the computer vision community has devoted considerable effort to developing SISR algorithms, which can be largely categorized into two groups: single-pass and progressive methods.

6.1.1 Single-pass methods

Single-pass methods process coarse-grained images in one or multiple consecutive upsampling steps. Early upsampling techniques exploited interpolation methods such as bicubic interpolation and Lanczos resampling [5]. Also, several studies utilized statistical image priors [27, 28] to achieve better performances. Advanced works aimed at learning the non-linear mapping between LR and HR images with neighbor embedding [2] and sparse coding [34, 30]. However, these approaches are still inadequate to reconstruct realistic and fine-grained textures of images.

Recently, a series of deep learning models has achieved great success in SISR, as they do not require human-engineered features and show state-of-the-art performance. Since Dong et al. [4] first proposed an end-to-end mapping between low-resolution (LR) and high-resolution (HR) images represented as a CNN, various CNN-based architectures have been studied for SR. Among them, Shi et al. [24] introduced an efficient sub-pixel convolutional layer, which is capable of recovering HR images with very little additional computational cost compared with the deconvolutional layer at the training phase. Inspired by VGG-net for ImageNet classification [25], a very deep CNN was applied to SISR in [13]. However, training a very deep network for SR is difficult due to the slow convergence rate. Kim et al. [13] showed that residual learning speeds up the training phase and verified that increasing the network depth can contribute to a significant improvement in SR accuracy.
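To make the sub-pixel idea concrete, the following is a minimal NumPy sketch of the channel-to-space rearrangement that such a layer relies on. It covers only the rearrangement step; the convolution that produces the input features is omitted. The function name `pixel_shuffle` and the channel ordering (channel index c·r² + i·r + j maps to spatial offset (i, j)) are assumptions that follow a common convention, not details taken from [24].

```python
import numpy as np


def pixel_shuffle(features: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C * r**2, H, W) array into (C, H * r, W * r)."""
    channels, height, width = features.shape
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r**2")
    c_out = channels // (r * r)
    # Split the channel axis into (c_out, r, r), then interleave the two
    # offset axes with the spatial axes.
    x = features.reshape(c_out, r, r, height, width)
    x = x.transpose(0, 3, 1, 4, 2)  # (c_out, H, r, W, r)
    return x.reshape(c_out, height * r, width * r)


# Toy input: 4 channels on a 2 x 2 grid become 1 channel on a 4 x 4 grid.
coarse_features = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
fine_map = pixel_shuffle(coarse_features, r=2)
print(fine_map.shape)
```

All convolutions run at the coarse resolution and only this inexpensive reshaping produces the fine grid, which is where the low additional cost comes from.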

The general process of SISR methods (i.e., feature extraction followed by SR image recovery) inspires our solution for FUFI. However, these approaches are not directly suitable for the FUFI problem, since flow data present a very specific hierarchical structure that natural images do not have. As such, the related arts cannot be simply applied to our application in terms of efficiency and effectiveness.

6.1.2 Progressive methods

Though single-pass methods perform well at small-scale upsampling (typically 2× and 4×), they encounter difficulties when dealing with large-scale super-resolution tasks (e.g., 8×) [15]. This can be attributed to their abrupt upsampling based on low-level features and their use of only one supervision signal at the output end. To tackle this problem, several recent works [15, 16, 33] proposed progressive models based on the Laplacian pyramid, in which the network learns the upsampled residuals and performs upsampling by aggregating the residuals with interpolated images. This inspires the cascading design of our UrbanPy architecture.
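The progressive scheme described above can be sketched as a loop of 2× steps, each adding a residual to an interpolated image. This is a simplified illustration under stated assumptions: nearest-neighbour interpolation stands in for the interpolation used in the cited works, the residual branch is a placeholder callable rather than a trained network, and all function names are hypothetical.

```python
from typing import Callable

import numpy as np


def upsample_nearest(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbour interpolation, a stand-in for a learned or bicubic upsampler."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)


def progressive_upsample(
    image: np.ndarray,
    num_levels: int,
    predict_residual: Callable[[np.ndarray], np.ndarray],
) -> list:
    """Apply `num_levels` 2x steps; each adds a residual to the interpolated image."""
    outputs = []
    current = image
    for _ in range(num_levels):
        interpolated = upsample_nearest(current, 2)
        current = interpolated + predict_residual(interpolated)
        outputs.append(current)  # each level can receive its own supervision signal
    return outputs


def zero_residual(x: np.ndarray) -> np.ndarray:
    """Placeholder for a trained residual branch; applies no correction."""
    return np.zeros_like(x)


coarse = np.array([[1.0, 2.0], [3.0, 4.0]])
pyramid = progressive_upsample(coarse, num_levels=3, predict_residual=zero_residual)
print([level.shape for level in pyramid])
```

Keeping every intermediate output is what allows a loss to be attached at each scale, in contrast to the single supervision signal of a one-shot 8× upsampler.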

Apart from super-resolving classical images, few studies have utilized super-resolution methods to solve real-world problems in the urban area. In particular, two very recent works [31, 40] employ the same strategy of stacking SRCNN [4] for two different tasks: Vandal et al. [31] aimed at statistical downscaling of climate and earth system simulations based on observational and topographical data; likewise, Zong et al. [40] addressed the task of inferring fine-grained population density by treating population heat maps as images.

Different from the related arts that directly model pixel values, we instead model the distributions over the superregions and their fine-grained counterparts, which allows us to capture the essence of the FUFI problem. Moreover, we also include the external features, which are unique to the urban scenario.
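The contrast with pixel-value modeling can be illustrated with a small NumPy sketch. It assumes that the fine-grained flows inside each superregion are obtained by splitting the superregion's flow according to a normalized distribution over its n × n subregions, so that the subregion flows sum back to the coarse observation. The function name, the uniform fallback for empty blocks, and the random scores standing in for network outputs are illustrative assumptions, not the exact module used by the models.

```python
import numpy as np


def distributional_upsample(coarse_flow: np.ndarray, scores: np.ndarray, n: int) -> np.ndarray:
    """Split each superregion's flow across its n x n subregions.

    coarse_flow: (H, W) flow volume per superregion.
    scores:      (H * n, W * n) non-negative scores per subregion.
    """
    scores = np.asarray(scores, dtype=float)
    h, w = coarse_flow.shape
    blocks = scores.reshape(h, n, w, n)  # axes: superregion row, row offset, superregion column, column offset
    block_sums = blocks.sum(axis=(1, 3), keepdims=True)
    # Fall back to a uniform split where a block has no positive score.
    safe_sums = np.where(block_sums > 0, block_sums, 1.0)
    uniform = np.full_like(blocks, 1.0 / (n * n))
    distribution = np.where(block_sums > 0, blocks / safe_sums, uniform)
    fine = distribution * coarse_flow[:, None, :, None]
    return fine.reshape(h * n, w * n)


rng = np.random.default_rng(0)
coarse_flow = np.array([[8.0, 0.0], [4.0, 12.0]])
scores = rng.random((4, 4))  # stand-in for learned, non-negative network outputs
fine_flow = distributional_upsample(coarse_flow, scores, n=2)

# Summing each 2 x 2 block recovers the coarse observation.
recovered = fine_flow.reshape(2, 2, 2, 2).sum(axis=(1, 3))
print(np.allclose(recovered, coarse_flow))
```

Predicting a distribution rather than raw values means the output is consistent with the coarse input by construction, which a generic image super-resolution network does not guarantee.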

6.2 Urban Flows Analysis

Due to the wide applications of traffic analysis and the increasing demand for real-time public safety monitoring, urban flow analysis has recently attracted the attention of a large number of researchers [39]. Zheng et al. [39] first transformed public traffic trajectories into other data formats, such as graphs and tensors, to which more data mining and machine learning techniques can be applied. To our knowledge, several previous works [26, 6] forecast millions, or even billions, of individual mobility traces rather than aggregated flows in a region.

Recently, researchers have started to focus on city-scale traffic flow prediction [11]. Inspired by deep learning techniques that power many applications in modern society [17], a novel deep neural network was developed by Zhang et al. [36] to simultaneously model spatial dependencies (both near and distant) and temporal dynamics of various scales (i.e., closeness, period and trend) for citywide crowd flow prediction. Following this work, Zhang et al. [37] further proposed a deep spatio-temporal residual network to collectively predict the inflow and outflow of crowds in every city grid. Apart from the above applications, very recently Liang et al. [21] presented UrbanFM, which is, to the best of our knowledge, the first work to solve the novel FUFI problem in the urban scenario. In this paper, we further extend the capability of UrbanFM to solve larger-scale inference tasks by presenting the UrbanPy framework.

7 Conclusion

In this paper, we have formalized the fine-grained urban flow inference problem and presented two versions of deep neural network-based methods to solve it. The preliminary version (i.e., UrbanFM) focuses on addressing the two specific challenges of the problem by embedding the hierarchical structure in the model and generating a comprehensive representation for external factors. Building upon the key components of UrbanFM, we present a more advanced version named UrbanPy that employs the progressive upsampling strategy, which resolves the defects of UrbanFM when tackling larger-scale inference tasks. We have conducted extensive experiments, both qualitative and quantitative, to study the actual performance of the models using the TaxiBJ and HappyValley datasets. The empirical studies and visualizations support the advantages of both UrbanFM and UrbanPy in terms of efficiency and effectiveness. Code is also published for the community.

We have also discussed the limitation of the current work, which is mainly due to the learning dynamics of the pyramid structure and remains an open problem. In future work, we are interested in improving the learning efficiency of the UrbanPy framework, either through a curriculum strategy [33] or by exploring different network structures [38].


  • [1] C. M. Bishop (2006) Pattern recognition and machine learning. Springer Science+ Business Media. Cited by: §4.3.
  • [2] H. Chang, D. Yeung, and Y. Xiong (2004) Super-resolution through neighbor embedding. In CVPR, Cited by: §6.1.1.
  • [3] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 793–802. Cited by: §4.3.
  • [4] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307. Cited by: 1st item, §6.1.1, §6.1.2.
  • [5] C. E. Duchon (1979) Lanczos filtering in one and two dimensions. Journal of Applied Meteorology 18 (8), pp. 1016–1022. Cited by: §6.1.1.
  • [6] Z. Fan, X. Song, R. Shibasaki, and R. Adachi (2015) CityMomentum: an online approach for crowd behavior prediction at a citywide level. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 559–569. Cited by: §6.2.
  • [7] B. K. Gunturk, A. U. Batur, Y. Altunbasak, M. H. Hayes, and R. M. Mersereau (2003) Eigenface-domain super-resolution for face recognition. IEEE Transactions on Image Processing 12 (5), pp. 597–606. Cited by: §6.1.
  • [8] R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung (2000) Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405 (6789), pp. 947. Cited by: §3.1.1.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3, §4.2.2, 4th item.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In ECCV, pp. 630–645. Cited by: §3.1.1.
  • [11] M. X. Hoang, Y. Zheng, and A. K. Singh (2016) FCCF: forecasting citywide crowd flows based on big data. In Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 6. Cited by: §6.2.
  • [12] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.1.
  • [13] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, pp. 1646–1654. Cited by: 3rd item, §6.1.1.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.5.
  • [15] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate superresolution. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 5. Cited by: §1, §4.1.3, 2nd item, §6.1.2.
  • [16] W. Lai, J. Huang, N. Ahuja, and M. Yang (2018) Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.1.3, §5.2.5, §6.1.2.
  • [17] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §6.2.
  • [18] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network.. In CVPR, Vol. 2, pp. 4. Cited by: §3.1.1, 4th item.
  • [19] S. K. Lee, H. R. Kwon, H. Cho, J. Kim, and D. Lee (2016) International case studies of smart cities: anyang, republic of korea. Cited by: §1.
  • [20] Y. Liang, S. Ke, J. Zhang, X. Yi, and Y. Zheng (2018) GeoMAN: multi-level attention networks for geo-sensory time series prediction. In Proceedings of the International Joint Conference on Artificial Intelligence. Cited by: §3.2.
  • [21] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum, and Y. Zheng (2019) UrbanFM: Inferring fine-grained urban flows. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 3132–3142. Cited by: Fine-Grained Urban Flow Inference, §1, §6.2.
  • [22] Cited by: §3.1.2.
  • [23] S. C. Park, M. K. Park, and M. G. Kang (2003) Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine 20 (3), pp. 21–36. Cited by: §6.1.
  • [24] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In ICCV, pp. 1874–1883. Cited by: §3.1.2, 2nd item, §6.1.1.
  • [25] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §6.1.1.
  • [26] X. Song, Q. Zhang, Y. Sekimoto, and R. Shibasaki (2014) Prediction of human emergency behavior and their mobility following large-scale disaster. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 5–14. Cited by: §6.2.
  • [27] J. Sun, Z. Xu, and H. Shum (2008) Image super-resolution using gradient profile prior. In CVPR, pp. 1–8. Cited by: §6.1.1.
  • [28] Y. Tai, S. Liu, M. S. Brown, and S. Lin (2010) Super resolution using edge prior and single image detail synthesis. In CVPR, pp. 2400–2407. Cited by: §6.1.1.
  • [29] M. W. Thornton, P. M. Atkinson, and D. Holland (2006) Sub-pixel mapping of rural land cover objects from fine spatial resolution satellite sensor imagery using super-resolution pixel-swapping. International Journal of Remote Sensing 27 (3), pp. 473–491. Cited by: §6.1.
  • [30] R. Timofte, V. De Smet, and L. Van Gool (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pp. 111–126. Cited by: §6.1.1.
  • [31] T. Vandal, E. Kodra, S. Ganguly, A. Michaelis, R. Nemani, and A. R. Ganguly (2017) Deepsd: generating high resolution climate change projections through single image super-resolution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1663–1672. Cited by: 1st item, §6.1.2.
  • [32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §4.2.1.
  • [33] Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers (2018) A fully progressive approach to single-image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 864–873. Cited by: §4.1.3, §5.2.5, §6.1.2, §7.
  • [34] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19 (11), pp. 2861–2873. Cited by: §6.1.1.
  • [35] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §4.2.2.
  • [36] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi (2016) DNN-based prediction model for spatio-temporal data. In Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 92. Cited by: §6.2.
  • [37] J. Zhang, Y. Zheng, and D. Qi (2017) Deep spatio-temporal residual networks for citywide crowd flows prediction. In The AAAI Conference on Artificial Intelligence, pp. 1655–1661. Cited by: §2, §6.2.
  • [38] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §7.
  • [39] Y. Zheng, L. Capra, O. Wolfson, and H. Yang (2014) Urban computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology 5 (3), pp. 38. Cited by: 1st item, §6.2.
  • [40] Z. Zong, J. Feng, K. Liu, H. Shi, and Y. Li (2019) DeepDPM: dynamic population mapping via deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1294–1301. Cited by: 1st item, §6.1.2.