CloudLSTM: A Recurrent Neural Model for Spatiotemporal Point-cloud Stream Forecasting

This paper introduces CloudLSTM, a new branch of recurrent neural network models tailored to forecasting over data streams generated by geospatial point-cloud sources. We design a Dynamic Convolution (D-Conv) operator as the core component of CloudLSTMs, which allows performing convolution operations directly over point-clouds and extracts local spatial features from sets of neighboring points that surround different elements of the input. This maintains the permutation invariance of sequence-to-sequence learning frameworks, while enabling learnable neighboring correlations at each time step -- an important aspect in spatiotemporal predictive learning. The D-Conv operator resolves the grid-structural data requirements of existing spatiotemporal forecasting models (e.g. ConvLSTM) and can be easily plugged into traditional LSTM architectures with sequence-to-sequence learning and attention mechanisms. As a case study, we perform antenna-level forecasting of the data traffic generated by mobile services, demonstrating that the proposed CloudLSTM achieves state-of-the-art performance with measurement datasets collected in operational metropolitan-scale mobile network deployments.




1 Introduction

Point-cloud stream forecasting seeks to predict the future values and/or locations of data streams generated by a geospatial point cloud, given sequences of historical observations shi2018machine. Example data sources include mobile network antennas that serve the traffic generated by ubiquitous mobile services at city scale zhang2019deep, sensors that monitor the air quality of a target region cheng2018neural, or moving crowds that produce individual trajectories. Unlike traditional spatiotemporal forecasting on grid-structural data (e.g., precipitation nowcasting xingjian2015convolutional, or video frame prediction wang2018predrnn++), point-cloud stream forecasting needs to operate on geometrically scattered sets of points, which are irregular and unordered, and encapsulate complex spatial correlations. While vanilla Long Short-term Memories (LSTMs) have modest abilities to exploit spatial features xingjian2015convolutional, convolution-based recurrent neural network (RNN) models (e.g., ConvLSTM xingjian2015convolutional and PredRNN++ wang2018predrnn++) are limited to modeling grid-structural data, and are therefore inappropriate for handling scattered point-clouds.

Figure 1: Different approaches to geospatial data stream forecasting: predicting over input data streams that are inherently grid-structured, e.g., video frames using ConvLSTMs (top); mapping of point-cloud input to a grid, e.g., mobile network traffic collected at different antennas in a city, to enable forecasting using existing neural network structures (middle); forecasting directly over point-cloud data streams using historical information (as above, but without pre-processing), as proposed in this paper (bottom).

Leveraging the location information embedded in such irregular data sources, so as to learn important spatiotemporal features, is challenging. Existing approaches that tackle the point-cloud stream forecasting problem can be categorized into two classes, both of which bear significant shortcomings: (i) methods that transform point-clouds into data structures amenable to processing with mature solutions (e.g., grids Patr1907:Multi, see Fig. 1); and (ii) models that ignore the exact locations of each data source and inherent spatial correlations (e.g., liang2018geoman). The transformations required by the former not only add data preprocessing overheads, but also introduce spatial displacements, which distort relevant correlations among points Patr1907:Multi. On the other hand, the latter are largely location-invariant, while recent literature suggests spatial correlations should be revisited over time, to suit series prediction tasks shi2017deep. In essence, overlooking dynamic spatial correlations leads to modest forecasting performance.

Contributions. In this paper, we introduce Convolutional Point-cloud LSTMs (CloudLSTMs), a new branch of recurrent neural network models tailored to geospatial point-cloud stream forecasting. The CloudLSTM builds upon a Dynamic Convolution (D-Conv) operator, which takes raw point-cloud streams (both data time series and spatial coordinates) as input, and performs dynamic convolution over these, to learn spatiotemporal features over time, irrespective of the topology and permutations of the point-cloud. This eliminates the data preprocessing overheads mentioned above and circumvents the negative effects of spatial displacement. The proposed CloudLSTM takes into account the locations of each data source and performs dynamic positioning at each time step, to conduct a deformable convolution operation over point-clouds dai2017deformable. This allows revising the spatial and temporal correlations, and the configuration of the data points over time, and guarantees that the location-variant property is met at different steps. Importantly, the D-Conv operator is flexible, as it can be easily plugged into existing neural network models for different purposes, such as RNNs, LSTMs, sequence-to-sequence (Seq2seq) learning sutskever2014sequence, and attention mechanisms luong2015effective.

We perform antenna-level forecasting of data traffic generated by mobile services zhang2018long ; bega2019deepcog as a case study, experimenting with metropolitan-scale mobile traffic measurements collected in two European cities for 38 popular mobile apps. This represents an important application of geospatial point-cloud stream forecasting. We combine our CloudLSTM with Seq2seq learning and an attention mechanism, then undertake a comprehensive evaluation on both datasets. The results obtained demonstrate that our architecture can deliver precise long-term mobile traffic forecasting, outperforming eight different baseline neural network models in terms of four performance metrics, without any data preprocessing requirements. To the best of our knowledge, the proposed CloudLSTM is the first dedicated neural architecture for spatiotemporal forecasting that operates directly on point-cloud streams.

2 Related Work

Spatiotemporal Forecasting. Convolution-based RNN architectures have been widely employed for spatiotemporal forecasting, as they simultaneously capture spatial and temporal dynamics of the input. Shi et al. incorporate convolution into LSTMs, building a ConvLSTM for precipitation nowcasting xingjian2015convolutional. This approach exploits spatial information, which in turn leads to higher prediction accuracy. The ConvLSTM is improved by constructing a subnetwork to predict state-to-state connections, thereby guaranteeing location-variance and flexibility of the model shi2017deep. PredRNN wang2017predrnn and PredRNN++ wang2018predrnn++ evolve the ConvLSTM by constructing spatiotemporal cells and adding gradient highway units. These improve long-term forecasting performance and mitigate the gradient vanishing problem encountered in recurrent architectures. Although these solutions work well for spatiotemporal forecasting, they cannot be applied directly to point-cloud streams, as they require point-cloud-to-grid preprocessing Patr1907:Multi.

Feature Learning on Point-clouds. Deep neural networks for feature learning on point-cloud data are advancing rapidly. PointNet performs feature learning and maintains input permutation invariance qi2017pointnet. PointNet++ upgrades this structure by hierarchically partitioning point-clouds and performing feature extraction on local regions qi2017pointnet++. VoxelNet employs voxel feature encoding to limit inter-point interactions within a voxel zhou2018voxelnet. This effectively projects the points of the cloud onto sub-grids, which enables feature learning. Li et al. generalize the convolution operation on point-clouds and employ $\mathcal{X}$-transformations to learn the weights and permutations for the features li2018pointcnn. Through this, the proposed PointCNN leverages spatial-local correlations of point clouds, irrespective of the order of the input. Notably, although these architectures can learn spatial features of point-clouds, they are designed to work with static data, and thus have limited ability to discover temporal dependencies.

3 Convolutional Point-cloud LSTM

Next, we describe in detail the concept and properties of forecasting over point-cloud streams. We then introduce the D-Conv operator, which is at the core of our proposed CloudLSTM architecture. Finally, we present CloudLSTM and its variants, and explain how to combine CloudLSTM with Seq2seq learning and attention mechanisms, to achieve precise forecasting over point-cloud streams.

3.1 Forecasting over Point-cloud Streams

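We formally define a point-cloud containing a set of $N$ points, as $\mathcal{S} = \{p_1, p_2, \dots, p_N\}$. Each point $p_n \in \mathcal{S}$ contains two sets of features, i.e., $p_n = \{\nu_n, \varsigma_n\}$, where $\nu_n = \{v_n^1, \dots, v_n^H\}$ are value features (e.g., mobile traffic measurements, air quality indexes, etc.) of $p_n$, and $\varsigma_n = \{c_n^1, \dots, c_n^L\}$ are its $L$-dimensional coordinates. At each time step $t$, we may obtain $U$ different channels of $\mathcal{S}$ by conducting different measurements (these resemble the RGB channels in images), denoted by $\mathcal{S}_t^U = \{\mathcal{S}_t^1, \dots, \mathcal{S}_t^U\}$. We can then formulate the $J$-step point-cloud stream forecasting problem, given $M$ observations, as:

$$\hat{\mathcal{S}}_{t+1}^{U}, \dots, \hat{\mathcal{S}}_{t+J}^{U} = \underset{\mathcal{S}_{t+1}^{U}, \dots, \mathcal{S}_{t+J}^{U}}{\arg\max}\; p\left(\mathcal{S}_{t+1}^{U}, \dots, \mathcal{S}_{t+J}^{U} \mid \mathcal{S}_{t-M+1}^{U}, \dots, \mathcal{S}_{t}^{U}\right) \tag{1}$$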

Note that, in some cases, each point’s coordinates may be unchanged, since the data sources are deployed at fixed locations. An ideal point-cloud stream forecasting model should embrace five key properties, similar to other point-cloud applications and spatiotemporal forecasting problems qi2017pointnet ; shi2017deep :
(i) Order invariance: A point cloud is usually arranged without a specific order. Permutations of the input points should not affect the output of the forecasting qi2017pointnet .
(ii) Information intactness: The output of the model should have exactly the same number of points as the input, without losing any information, i.e., $N_{out} = N_{in}$.
(iii) Interaction among points: Points in $\mathcal{S}$ are not isolated, thus the model should be able to capture local dependencies among neighboring points and allow interactions qi2017pointnet.
(iv) Robustness to transformations: The model should be robust to correlation-preserving transformation operations on point-clouds, e.g., scaling and shifting qi2017pointnet .
(v) Location variance: The spatial correlations among points may change over time. Such dynamic correlations should be revised and learnable during training shi2017deep .

In what follows, we introduce the Dynamic Convolution (D-Conv) operator as the core module of the CloudLSTM, and explain how D-Conv satisfies the aforementioned properties.

3.2 Dynamic Convolution over Point Cloud

The dynamic convolution operator (D-Conv) absorbs the concept of ordinary convolution over grids, which takes $U_1$ channels of 2D tensors as input, and outputs $U_2$ channels of 2D tensors of smaller size (if without padding). Similarly, the D-Conv takes $U_1$ channels of a point-cloud $\mathcal{S}$, and outputs $U_2$ channels of a point-cloud, but with the same number of elements as the input, to ensure the information intactness property (ii) discussed previously. For simplicity, we denote the $i$-th channel of the input set as $\mathcal{S}_{in}^i$ and the $j$-th channel of the output as $\mathcal{S}_{out}^j$. Both $\mathcal{S}_{in}$ and $\mathcal{S}_{out}$ are 3D tensors, of shape $(N, (H+L), U_1)$ and $(N, (H+L), U_2)$ respectively.

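We also define $\mathcal{Q}_n^K$ as a subset of points in $\mathcal{S}_{in}^i$, which includes the $K$ nearest points with respect to $p_n$ in the Euclidean space, i.e., $\mathcal{Q}_n^K = \{p_n^1, \dots, p_n^k, \dots, p_n^K\}$, where $p_n^k$ is the $k$-th nearest point to $p_n$ in the set $\mathcal{S}_{in}^i$. Note that $p_n$ itself is included in $\mathcal{Q}_n^K$ as an anchor point, i.e., $p_n \equiv p_n^1$. Recall that each $p_n \in \mathcal{S}$ contains $H$ value features and $L$ coordinate features, i.e., $p_n = \{\nu_n, \varsigma_n\}$, where $\nu_n = \{v_n^1, \dots, v_n^H\}$ and $\varsigma_n = \{c_n^1, \dots, c_n^L\}$. Similar to the vanilla convolution operator, for each $p_n$ in $\mathcal{S}_{in}^i$, the D-Conv sums the element-wise product over all features and points in $\mathcal{Q}_n^K$, to obtain the values and coordinates of a point $p_n'$ in $\mathcal{S}_{out}^j$. The mathematical expression of the D-Conv is thus:

$$
\begin{aligned}
{v'}_n^{h'} &= \sum_{i=1}^{U_1} \sum_{p_n^k \in \mathcal{Q}_n^K} \left( \sum_{h=1}^{H} w_{i,j}^{h,\,h',\,k}\, v^{h}(p_n^k) + \sum_{l=1}^{L} w_{i,j}^{H+l,\,h',\,k}\, c^{l}(p_n^k) \right) + b_j, \\
{c'}_n^{l'} &= \sigma\left( \sum_{i=1}^{U_1} \sum_{p_n^k \in \mathcal{Q}_n^K} \left( \sum_{h=1}^{H} w_{i,j}^{h,\,H+l',\,k}\, v^{h}(p_n^k) + \sum_{l=1}^{L} w_{i,j}^{H+l,\,H+l',\,k}\, c^{l}(p_n^k) \right) + b_j \right).
\end{aligned} \tag{2}
$$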

In the above, we define learnable weights $\mathcal{W}$ as 5D tensors with shape $(U_1, K, (H+L), (H+L), U_2)$. The weights are shared across different anchor points in the input map. Each element $w_{i,j}^{m,m',k} \in \mathcal{W}$ is a scalar weight for the $i$-th input channel, $j$-th output channel, $k$-th nearest neighbor of each point corresponding to the $m$-th value and coordinate features for each input point, and $m'$-th value and coordinate features for output points. Similar to the convolution operator, we define $b_j$ as a bias for the $j$-th output map. In the above, $v^{h}(p_n^k)$ and ${v'}_n^{h'}$ are the $h$ ($h'$)-th value features of the input/output point set. Likewise, $c^{l}(p_n^k)$ and ${c'}_n^{l'}$ are the $l$ ($l'$)-th coordinate features of the input/output. $\sigma(\cdot)$ is the sigmoid function, which limits the range of predicted coordinates to $(0, 1)$, to avoid outliers. Before feeding them to the model, the coordinates of raw point-clouds are normalized to $(0, 1)$ by $\varsigma = (\varsigma - \varsigma_{min}) / (\varsigma_{max} - \varsigma_{min})$, on each dimension. This improves the transformation robustness of the operator.
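The operator described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single input and a single output channel, stores each point as a list of H value features followed by L coordinates, and the names `normalize_coords`, `neighbours` and `d_conv` are hypothetical.

```python
import math
import random


def normalize_coords(points, H):
    """Min-max scale the coordinate features (columns H and beyond) to [0, 1]."""
    out = [list(p) for p in points]
    for d in range(H, len(points[0])):
        lo = min(p[d] for p in points)
        hi = max(p[d] for p in points)
        span = (hi - lo) or 1.0  # avoid division by zero on a degenerate axis
        for p in out:
            p[d] = (p[d] - lo) / span
    return out


def neighbours(points, n, K, H):
    """Indices of the K points closest to point n in coordinate space, anchor first."""
    def dist2(m):
        return sum((a - b) ** 2 for a, b in zip(points[n][H:], points[m][H:]))
    return sorted(range(len(points)), key=lambda m: (dist2(m), m != n))[:K]


def d_conv(points, W, b, K, H, squash_coords=True):
    """Single-channel D-Conv sketch (assumed layout, see lead-in).

    points: N points, each a list of F = H + L features (values, then coordinates)
    W[k][m][m_out]: weight for the k-th nearest neighbour, input feature m,
                    output feature m_out
    b: scalar bias of the single output channel
    """
    F = len(points[0])
    out = []
    for n in range(len(points)):
        nbrs = neighbours(points, n, K, H)
        new_point = []
        for m_out in range(F):
            s = b + sum(W[k][m][m_out] * points[idx][m]
                        for k, idx in enumerate(nbrs) for m in range(F))
            if squash_coords and m_out >= H:
                s = 1.0 / (1.0 + math.exp(-s))  # sigmoid keeps coordinates in (0, 1)
            new_point.append(s)
        out.append(new_point)
    return out


# Toy usage: 5 points with H = 1 value feature and L = 2 coordinates, K = 3.
H, K = 1, 3
raw = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [3.0, 0.0, 2.0],
       [4.0, 3.0, 3.0], [5.0, 5.0, 1.0]]
cloud = normalize_coords(raw, H)
rng = random.Random(0)
W = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(K)]
out = d_conv(cloud, W, b=0.1, K=K, H=H)
```

Because weights are indexed by neighbour rank rather than by input position, shuffling the input list only shuffles the output rows in the same way. Setting `squash_coords=False` gives a variant without the inner sigmoid.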

Figure 2: Illustration of the D-Conv operator, with a single input channel and $K$ neighbors. For every $p_n$, D-Conv weights its neighboring set $\mathcal{Q}_n^K$ to produce values and coordinate features for $p_n'$. Here, each $w^k$ is a set of weights with index $k$ (i.e., $k$-th nearest neighbor) in Eq. 2, shared across different $p_n$.

We provide a graphical illustration of D-Conv in Fig. 2. For each point $p_n$, the D-Conv operator weights its $K$ nearest neighbors across all features, to produce the values and coordinates in the next layer. Since the permutation of the input neither affects the neighboring information nor the ranking of their distances for any $\mathcal{Q}_n^K$, D-Conv is a symmetric function whose output does not depend on the input order. This means that property (i) discussed in Sec. 3.1 is satisfied. Further, D-Conv is performed on every point in set $\mathcal{S}_{in}$ and produces exactly the same number of features and points for its output; property (ii) is therefore naturally fulfilled. In addition, operating over a neighboring point set, irrespective of its layout, allows the operator to capture local dependencies and improves the robustness to global transformations (e.g., shifting and scaling). The normalization over the coordinate features further improves the robustness to those transformations. This meets the desired properties (iii) and (iv). More importantly, D-Conv learns the layout and topology of the point-cloud for the next layer, which changes the neighboring set $\mathcal{Q}_n^K$ for each point at output $\mathcal{S}_{out}$. This enables the “location-variance” (property (v)), allowing the model to perform dynamic positioning tailored to each channel and time step. This is essential in spatiotemporal forecasting neural models, as spatial correlations change over time shi2017deep.

D-Conv can be efficiently implemented using simple 2D convolution, by reshaping the input map and weight tensor, which can be parallelized easily in existing deep learning frameworks. We detail this in the appendix.

Complexity Analysis. We study the complexity of D-Conv by separating the operation into two steps: (i) finding the neighboring set $\mathcal{Q}_n^K$ for each point $p_n$, and (ii) performing the weighting computation in Eq. 2. We discuss the complexity of each step separately. For simplicity and without loss of generality, we assume the number of input and output channels are both 1. For step (i), the complexity of computing a point-wise Euclidean distance matrix is $O(L \cdot N^2)$, while finding the $K$ nearest neighbors for one point has complexity $O(N \log K)$, if using heapsort schaffer1993analysis. As such, (i) has complexity $O(L \cdot N^2 + N^2 \log K)$. For step (ii), it is easy to see from Eq. 2 that the complexity of computing one feature of the output $p_n'$ is $O((H+L) \cdot K)$. Since each point has $(H+L)$ features and the output point set $\mathcal{S}_{out}$ has $N$ points, the overall complexity of step (ii) becomes $O(N \cdot K \cdot (H+L)^2)$. This is equivalent to the complexity of a vanilla convolution operator, where both the input and output have $(H+L)$ channels, and the input map and kernel have $N$ and $K$ elements, respectively. This implies that, compared to the convolution operator whose inputs, outputs, and filters have the same size, D-Conv introduces extra complexity by searching the $K$ nearest neighbors for each point.
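The two steps of the analysis can be made concrete with a short sketch. It assumes a brute-force neighbour search with a bounded heap (via the standard-library `heapq.nsmallest`) and simply counts the multiplications of the single-channel weighting step; the helper names `k_nearest` and `weighting_mults` are hypothetical.

```python
import heapq


def k_nearest(coords, n, K):
    """Step (i): indices of the K points nearest to point n, anchor first.

    Each query evaluates N distances over L dimensions and keeps the K
    smallest candidates in a bounded heap.
    """
    def dist2(m):
        return sum((a - b) ** 2 for a, b in zip(coords[n], coords[m]))
    return heapq.nsmallest(K, range(len(coords)), key=lambda m: (dist2(m), m != n))


def weighting_mults(N, K, H, L):
    """Step (ii): multiplications for one input and one output channel.

    N output points, each with (H + L) features, each feature summing
    K * (H + L) weighted inputs.
    """
    return N * K * (H + L) ** 2


coords = [(0, 0), (1, 0), (0, 3), (5, 5)]
nearest_to_first = k_nearest(coords, 0, 2)
nearest_to_last = k_nearest(coords, 3, 3)
# 260 points with 1 value feature and 2 coordinates, 3 neighbours per point.
cost = weighting_mults(N=260, K=3, H=1, L=2)
```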

Relations with PointCNN li2018pointcnn and Deformable Convolution dai2017deformable. The D-Conv operator builds upon the PointCNN li2018pointcnn and deformable convolution neural network (DefCNN) on grids dai2017deformable, but introduces several variations tailored to point-cloud structural data. PointCNN employs the $\mathcal{X}$-transformation over point clouds, to learn the weight and permutation on a local point set using multilayer perceptrons (MLPs), which introduces extra complexity. This operator guarantees the order invariance property, but leads to information loss, since it performs aggregation over points. In our D-Conv operator, the permutation is maintained by aligning the weights with the ranking of distances between point $p_n$ and its neighbors $p_n^k$. Since the distance ranking is unrelated to the order of the inputs, the order invariance is ensured in a parameter-free manner without extra complexity and loss of information.

Further, the D-Conv operator can be viewed as the DefCNN dai2017deformable over point-clouds, with the differences that (i) DefCNN deforms weighted filters, while D-Conv deforms the input maps; and (ii) DefCNN employs bilinear interpolation over input maps with a set of continuous offsets, while D-Conv instead selects neighboring points for its operations. Both DefCNN and D-Conv have transformation modeling flexibility, allowing adaptive receptive fields on convolution.

3.3 The CloudLSTM Architecture

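The D-Conv operator can be plugged straightforwardly into LSTMs, to learn both spatial and temporal correlations over point-clouds. We formulate the Convolutional Point-cloud LSTM (CloudLSTM) as:

$$
\begin{aligned}
i_t &= \sigma\left(W_{si} \circledcirc \mathcal{S}_t^U + W_{hi} \circledcirc H_{t-1} + b_i\right), \\
f_t &= \sigma\left(W_{sf} \circledcirc \mathcal{S}_t^U + W_{hf} \circledcirc H_{t-1} + b_f\right), \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{sc} \circledast \mathcal{S}_t^U + W_{hc} \circledast H_{t-1} + b_c\right), \\
o_t &= \sigma\left(W_{so} \circledcirc \mathcal{S}_t^U + W_{ho} \circledcirc H_{t-1} + b_o\right), \\
H_t &= o_t \odot \tanh\left(C_t\right).
\end{aligned} \tag{3}
$$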

Similar to ConvLSTM xingjian2015convolutional, $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, respectively. $C_t$ denotes the memory cell and $H_t$ is the hidden state. Note that $i_t$, $f_t$, $o_t$, $C_t$, and $H_t$ are all point-cloud representations. $W$ and $b$ represent learnable weight and bias tensors. In Eq. 3, ‘$\odot$’ denotes the element-wise product, ‘$\circledast$’ is the D-Conv operator formalized in Eq. 2, and ‘$\circledcirc$’ a simplified D-Conv that removes the sigmoid function in Eq. 2. The latter only operates over the gates computation, as the sigmoid functions are already involved in outer calculations (first, second, and fourth expressions in Eq. 3). We show the structure of a basic CloudLSTM cell in the left subplot of Fig. 3.
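The gate structure described above (sigmoid input, forget and output gates, a memory cell and a hidden state) can be sketched as a single recurrent step. This is an illustration under assumptions rather than the authors' code: point clouds are N x F nested lists, the two operators are passed in as callables, and the toy operator used in the example is a hypothetical stand-in for D-Conv, not D-Conv itself. How the bias and the inner sigmoid of the full operator combine across the two terms is also an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def elementwise(fn, *clouds):
    """Apply fn feature by feature across point clouds stored as N x F lists."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*clouds)]


def cloud_lstm_step(S_t, H_prev, C_prev, params, dconv, dconv_gate):
    """One recurrent step.

    dconv(cloud, W, b): full operator, used for the candidate memory
    dconv_gate(cloud, W, b): simplified operator without inner sigmoid, used for gates
    params: dict holding W_s*, W_h* and b_* for the gates i, f, o and candidate c
    """
    def pre_activation(op, gate):
        from_input = op(S_t, params["W_s" + gate], params["b_" + gate])
        from_hidden = op(H_prev, params["W_h" + gate], 0.0)
        return elementwise(lambda a, b: a + b, from_input, from_hidden)

    i_t = elementwise(sigmoid, pre_activation(dconv_gate, "i"))
    f_t = elementwise(sigmoid, pre_activation(dconv_gate, "f"))
    o_t = elementwise(sigmoid, pre_activation(dconv_gate, "o"))
    cand = elementwise(math.tanh, pre_activation(dconv, "c"))
    C_t = elementwise(lambda f, c, i, g: f * c + i * g, f_t, C_prev, i_t, cand)
    H_t = elementwise(lambda o, c: o * math.tanh(c), o_t, C_t)
    return H_t, C_t


def toy_op(cloud, w, b):
    """Hypothetical stand-in for D-Conv: a scalar affine map on every feature."""
    return [[w * x + b for x in row] for row in cloud]


names = [p + g for g in "ifoc" for p in ("W_s", "W_h", "b_")]
params = {name: 0.0 for name in names}
S = [[1.0, 0.2, 0.3], [2.0, 0.6, 0.9]]
H0 = [[0.0] * 3 for _ in S]
C0 = [[1.0] * 3 for _ in S]
H1, C1 = cloud_lstm_step(S, H0, C0, params, toy_op, toy_op)
```

With all parameters at zero every gate equals 0.5 and the candidate is zero, so the memory cell is halved, which makes the data flow easy to check by hand.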

Figure 3: The inner structure of the CloudLSTM cell (left) and the overall Seq2seq CloudLSTM architecture (right). The value and coordinate features of each input are shown separately, while these features are unified for gates.

We combine our CloudLSTM with Seq2seq learning sutskever2014sequence and the soft attention mechanism luong2015effective, to perform forecasting, given that these neural models have been proven to be effective in spatiotemporal modelling on grid-structural data (e.g., xingjian2015convolutional ; zhang2018attention). We show the overall Seq2seq CloudLSTM in the right subplot of Fig. 3. The architecture incorporates an encoder and a decoder, which are different stacks of CloudLSTMs. The encoder encodes the historical information into a tensor, while the decoder decodes the tensor into predictions. The states of the encoder and decoder are connected using the soft attention mechanism via a context vector luong2015effective. Before feeding the point-cloud to the model and generating the final forecasting, the data is processed by Point Cloud Convolutional (CloudCNN) layers, which perform the D-Conv operations. Their function is similar to the word embedding layer in natural language processing tasks mikolov2013distributed, which helps translate the raw point cloud into tensors and vice versa. In this study, we employ a two-stack encoder-decoder architecture, and configure 36 channels for each CloudLSTM cell, as we found that further increasing the number of stacks and channels does not improve the performance significantly.
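The soft attention step that links encoder and decoder states through a context vector can be illustrated as follows. The sketch assumes dot-product scoring and states flattened into plain vectors; the text does not state which scoring function is used, so both choices, and the name `soft_attention`, are assumptions.

```python
import math


def soft_attention(decoder_state, encoder_states):
    """Return (weights, context) using dot-product scores followed by a softmax."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # The context vector is the attention-weighted average of the encoder states.
    context = [sum(w * enc[j] for w, enc in zip(weights, encoder_states))
               for j in range(len(decoder_state))]
    return weights, context


encoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_w, uniform_ctx = soft_attention([0.0, 0.0], encoder)
sharp_w, sharp_ctx = soft_attention([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A zero decoder state scores every encoder step equally, so the context is their plain average; a decoder state aligned with one encoder step concentrates nearly all weight on it.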

Beyond CloudLSTM, we also explore plugging the D-Conv into vanilla RNN and Convolutional GRU, which leads to a new Convolutional Point-cloud RNN (CloudRNN) and Convolutional Point-cloud GRU (CloudGRU), as formulated by the following equations respectively:

CloudRNN: (4)
CloudGRU: (5)
The CloudRNN and CloudGRU share a similar Seq2seq architecture with CloudLSTM, except that they do not employ the attention mechanism. We compare their performance in the following section.

4 Experiments

To evaluate the performance of our architectures, we employ antenna-level mobile traffic forecasting as a case study and experiment with two large-scale mobile traffic datasets. We use the proposed CloudLSTM to forecast future mobile traffic consumption at scatter-distributed antennas in the regions of interest. We provide a comprehensive comparison with 8 baseline deep learning models, over four performance metrics. All models considered in this study are implemented using the open-source Python libraries TensorFlow tensorflow2015-whitepaper and TensorLayer tensorlayer. We train all architectures on a computing cluster with two NVIDIA Tesla K40M GPUs. We optimize all models by minimizing the mean square error (MSE) between predictions and ground truth, using the Adam optimizer kingma2015adam.

Next, we first introduce the dataset employed in this study, then discuss the baseline models used for comparison, the experimental settings, and the performance metrics employed for evaluation. Finally, we report on the experimental results and provide visualizations that reveal further insights.

4.1 Dataset and Preprocessing

We conduct experiments using large-scale multi-service datasets collected by a major operator in two large European metropolitan areas during 85 consecutive days. The data consists of the volume of traffic generated by devices associated with each of the 792 and 260 antennas in the two target areas, respectively. The antennas are non-uniformly distributed over the urban regions, thus they can be viewed as 2D point clouds over space. Their locations are fixed across the measurement period.

At each antenna, the traffic volume is expressed in Megabytes and aggregated over 5-min intervals, which leads to 24,482 traffic snapshots. These snapshots are gathered independently for each of 38 different mobile services, selected among the most popular apps, including video streaming, gaming, messaging, cloud services, social networking, etc. Due to data protection and confidentiality constraints, we do not disclose the identity of the mobile operator, and we do not provide information about the exact location of the data collection equipment, or the names of the mobile services considered. The data collection procedure was conducted under the supervision of the competent national privacy agency and in compliance with regulations. The dataset is fully anonymized, as it only comprises service traffic aggregated at the antenna level, without unveiling personal information. Due to a confidentiality agreement with the mobile traffic data owner, the raw data cannot be made public.

Before being fed to the models, the traffic measurements for each mobile service are transformed into different input channels of the point-cloud $\mathcal{S}$. All coordinate features are normalized to the $(0, 1)$ range. In addition, for the baseline models that require grid-structural input (i.e., CNN, 3D-CNN, ConvLSTM and PredRNN++), the point clouds are transformed into grids Patr1907:Multi using the Hungarian algorithm kuhn1955hungarian, as required. The ratio of training, validation, and testing sets is ::.
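Mapping scattered antennas onto grid cells for the grid-based baselines is an assignment problem. The sketch below illustrates the objective with an exhaustive search, which is only viable for a handful of points; the Hungarian algorithm reaches the same optimum in polynomial time. The squared-distance cost and the name `assign_to_grid` are assumptions made for illustration.

```python
from itertools import permutations


def assign_to_grid(points, grid_cells):
    """Assign each point to a distinct grid cell, minimising total squared displacement.

    Brute force over all assignments: a stand-in for the Hungarian algorithm
    that is practical only for very small inputs.
    """
    def cost(p, g):
        return (p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2

    best = min(
        permutations(range(len(grid_cells)), len(points)),
        key=lambda perm: sum(cost(p, grid_cells[c]) for p, c in zip(points, perm)),
    )
    return list(best)


antennas = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8), (0.8, 0.9)]
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
mapping = assign_to_grid(antennas, grid)  # mapping[n] is the cell index of antenna n
```

Every antenna is moved to its cell centre, which is the spatial displacement that operating directly on the point-cloud avoids.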

4.2 Benchmarks and Performance Metrics

We compare the performance of our proposed CloudLSTM with a set of baseline models, as follows. MLP Goodfellow-et-al-2016, CNN krizhevsky2012imagenet, and 3D-CNN ji20133d are frequently used as benchmarks in mobile traffic forecasting (e.g., bega2019deepcog ; zhang2018long). DefCNN learns the shape of the convolutional filters and has similarities with the D-Conv operator proposed in this study dai2017deformable. PointCNN li2018pointcnn performs convolution over point-clouds and has been employed for point-cloud classification and segmentation. LSTM is an advanced RNN frequently employed for time series forecasting hochreiter1997long. While ConvLSTM xingjian2015convolutional can be viewed as a baseline model for spatiotemporal predictive learning, the PredRNN++ is the state-of-the-art architecture for spatiotemporal forecasting on grid-structural data and achieves the best performance in many applications wang2018predrnn++. Beyond these models, we also compare the CloudLSTM with two of its variations, i.e., CloudRNN and CloudGRU, which were introduced in Sec. 3.3.

We quantify the accuracy of the proposed CloudLSTM, in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Since the mobile traffic snapshots can be viewed as “urban images” liu2015urban , we also select Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) hore2010image to quantify the fidelity of the forecasts and their similarity with the ground truth, as suggested by relevant recent work zhang2017zipnet .
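Three of the four metrics have short closed forms, sketched below for flat lists of traffic values. The peak value used for PSNR is an assumption here (the maximum of the ground truth unless given explicitly), and SSIM is left out because it needs windowed local statistics.

```python
import math


def mae(pred, true):
    """Mean Absolute Error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root Mean Square Error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def psnr(pred, true, peak=None):
    """Peak Signal-to-Noise Ratio in dB; peak defaults to max(true) (an assumption)."""
    peak = max(true) if peak is None else peak
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    if mse == 0:
        return float("inf")
    return 20 * math.log10(peak) - 10 * math.log10(mse)


pred = [1.0, 2.0, 3.0, 4.0]
true = [1.0, 2.0, 3.0, 8.0]
scores = (mae(pred, true), rmse(pred, true), psnr(pred, true))
```

Lower MAE and RMSE and higher PSNR indicate a forecast closer to the ground truth.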

We employ all neural networks to forecast city-scale future mobile traffic consumption for up to 30 mins, given consecutive 30-min measurements sampled every 5 minutes. That is, all models take as input 6 snapshots ($M = 6$) and forecast the following 6 traffic volume snapshots ($J = 6$). For RNN-based models, i.e., LSTM, ConvLSTM, PredRNN++, CloudLSTM, CloudRNN, and CloudGRU, we extend the number of prediction steps to $J = 36$ (3 hours), to evaluate their long-term performance.
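The input and target layout described above can be sketched as a sliding window over the sequence of snapshots. A stride of one snapshot between consecutive instances is an assumption, as is the helper name `make_windows`.

```python
def make_windows(snapshots, M=6, J=6):
    """Yield (history, target) pairs: M consecutive snapshots and the next J."""
    for t in range(M, len(snapshots) - J + 1):
        yield snapshots[t - M:t], snapshots[t:t + J]


# Integers stand in for traffic snapshots taken every 5 minutes.
pairs = list(make_windows(list(range(20))))
long_pairs = list(make_windows(list(range(60)), M=6, J=36))
```

A sequence of T snapshots yields T - M - J + 1 instances, so extending the horizon to 36 steps shortens the usable set accordingly.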

4.3 Result and Visualization

We perform 6-step forecasting for 4,888 instances across the test set, and report in Table 1 the mean and standard deviation (std) of each metric. We also investigate the effect of different numbers of neighboring points considered (i.e., $K$), as well as the influence of the attention mechanism.

| Model | City 1 MAE | City 1 RMSE | City 1 PSNR | City 1 SSIM | City 2 MAE | City 2 RMSE | City 2 PSNR | City 2 SSIM |
|---|---|---|---|---|---|---|---|---|
| MLP Goodfellow-et-al-2016 | 4.79±0.54 | 9.94±2.56 | 49.56±2.13 | 0.27±0.12 | 4.59±0.59 | 9.44±2.45 | 50.30±2.28 | 0.33±0.14 |
| CNN krizhevsky2012imagenet | 6.00±0.61 | 11.02±2.10 | 48.93±1.60 | 0.25±0.12 | 5.31±0.50 | 10.06±2.05 | 49.96±1.86 | 0.32±0.14 |
| 3D-CNN ji20133d | 5.15±0.60 | 10.25±2.46 | 49.34±2.05 | 0.30±0.13 | 5.32±0.48 | 10.18±2.02 | 49.82±1.77 | 0.35±0.16 |
| DefCNN dai2017deformable | 6.76±0.81 | 11.72±2.57 | 48.43±1.82 | 0.16±0.08 | 5.31±0.51 | 9.99±2.12 | 49.84±1.87 | 0.32±0.14 |
| PointCNN li2018pointcnn | 5.01±0.46 | 9.98±2.33 | 49.77±1.94 | 0.30±0.12 | 5.13±0.46 | 9.60±2.36 | 50.07±1.49 | 0.34±0.13 |
| LSTM hochreiter1997long | 4.23±0.66 | 9.61±3.18 | 50.08±3.21 | 0.30±0.12 | 4.36±1.64 | 9.22±3.03 | 50.75±3.26 | 0.37±0.13 |
| ConvLSTM xingjian2015convolutional | 4.19±1.66 | 9.68±3.19 | 49.99±3.21 | 0.31±0.11 | 4.28±1.63 | 9.19±3.02 | 50.68±3.24 | 0.38±0.13 |
| PredRNN++ wang2018predrnn++ | 4.16±1.65 | 9.64±3.18 | 49.99±3.20 | 0.31±0.11 | 4.25±1.60 | 9.16±3.02 | 50.68±3.24 | 0.38±0.13 |
| CloudRNN () | 4.15±1.68 | 9.31±3.20 | 50.30±3.22 | 0.30±0.12 | 4.16±1.66 | 8.86±3.05 | 50.96±3.24 | 0.38±0.14 |
| CloudGRU () | 3.98±1.64 | 9.25±3.17 | 50.28±3.19 | 0.34±0.11 | 4.09±1.61 | 8.78±3.02 | 50.98±3.25 | 0.41±0.13 |
| CloudLSTM () | 3.87±1.68 | 9.17±3.19 | 50.31±3.21 | 0.33±0.11 | 4.08±1.57 | 8.84±3.02 | 50.95±3.24 | 0.41±0.13 |
| CloudLSTM () | 3.87±1.68 | 9.17±3.19 | 50.31±3.21 | 0.34±0.11 | 4.04±1.64 | 8.81±3.03 | 50.98±3.25 | 0.40±0.13 |
| CloudLSTM () | 3.83±1.68 | 9.13±3.19 | 50.33±3.21 | 0.35±0.11 | 4.01±1.59 | 8.78±3.02 | 51.00±3.24 | 0.41±0.13 |
| CloudLSTM () | 3.82±1.64 | 9.11±3.19 | 50.34±3.21 | 0.35±0.12 | 3.99±1.62 | 8.77±3.01 | 51.10±3.23 | 0.42±0.12 |

Table 1: The mean±std of MAE, RMSE, PSNR, and SSIM across all models considered, evaluated on two datasets collected in different cities.

Observe that RNN-based architectures in general obtain superior performance, compared to CNN-based models and the MLP. In particular, our proposed CloudLSTMs, CloudRNNs, and CloudGRUs outperform all the architectures considered in this study, on both datasets, achieving lower MAE and RMSE, and higher PSNR and SSIM. This suggests that the D-Conv operator is more effective in feature learning over geospatial point-clouds, as compared to vanilla convolution used in other models. Turning attention to our approaches, by considering the same number of neighbours $K$, CloudLSTM performs better than CloudGRU, which in turn outperforms CloudRNN. The forecasting performance of the CloudLSTM seems fairly insensitive to the number of neighbors ($K$). It is therefore worth using a small $K$ in practice, to reduce model complexity, as this does not compromise the accuracy significantly. Lastly, we observe that the attention mechanism only contributes marginally to forecasting performance. This is due to the nature of the spatiotemporal forecasting task: uncertainty grows significantly over time in multi-step prediction, thus the dependency on states in the encoder also degenerates over time.

Long-term Forecasting Performance. We extend the prediction horizon to up to $J = 36$ time steps (i.e., 3 hours) for all RNN-based architectures, and show their MAE evolution with respect to this horizon in Fig. 4. Note that the input length remains unchanged, i.e., 6 time steps. In city 1, observe that the MAE does not grow significantly with the prediction step for most models, as the curves flatten. This means that these models are reliable in terms of long-term forecasting. As for city 2, we note that a low $K$ may lead to poorer long-term performance for CloudLSTM, though the difference is not significant before step 20. This provides a guideline on choosing $K$ for the forecast length required.

Figure 4: MAE evolution with respect to the prediction horizon achieved by RNN-based models on both cities.

Visualization. We complete the evaluation by visualizing the hidden features of the CloudLSTM, to give insights into the knowledge learned by the model. To this end, in Fig. 5 we show an example of the scatter distributions of the hidden state $H_t$ of CloudLSTM and Attention CloudLSTM at both stacks. The first 6 columns show the $H_t$ for encoders, while the rest are for decoders. The input data snapshots are samples selected from City 2 (260 antennas/points). Recall that each $H_t$ has 1 value feature and 2 coordinate features for each point, therefore each scatter subplot in Fig. 5 shows the value features (volume represented by different colors) and coordinate features (different locations), averaged over all channels. Observe that in most subplots, points with higher values (warm colors) tend to aggregate into clusters and have higher densities. These clusters exhibit gradual changes from higher to lower values, leading to comet-shaped assemblages. This implies that points with high values also come with tighter spatial correlations, thus CloudLSTMs learn to aggregate them. This pattern becomes more obvious in stack 2, as features are extracted at a higher level, exhibiting more direct spatial correlations with respect to the output.

Figure 5: The scatter distributions of the value and coordinate features of the hidden state $H_t$ for CloudLSTM and Attention CloudLSTM. Values and coordinates are averaged over all channels.

5 Conclusion

We introduce CloudLSTM, a dedicated neural model for spatiotemporal forecasting tailored to point-cloud data streams. The CloudLSTM builds upon the D-Conv operator, which performs convolution over point-clouds to learn spatial features while maintaining permutation invariance. The D-Conv simultaneously predicts the values and coordinates of each point, thereby adapting to changing spatial correlations of the data at each time step. D-Conv is flexible, as it can be easily combined with various RNN models (i.e., RNN, GRU, and LSTM), Seq2seq learning and attention mechanisms. We employ antenna-level mobile traffic forecasting as a case study, where we show that our proposed CloudLSTM achieves state-of-the-art performance on large-scale datasets collected in two major European cities. We believe the CloudLSTM gives a new perspective on point-cloud stream modelling, and it can be easily extended to higher dimension point-clouds, without requiring changes to the model.


  • [1] Xingjian Shi and Dit-Yan Yeung. Machine learning for spatiotemporal sequence forecasting: A survey. arXiv preprint arXiv:1808.06865, 2018.
  • [2] Chaoyun Zhang, Paul Patras, and Hamed Haddadi. Deep learning in mobile and wireless networking: A survey. IEEE Communications Surveys & Tutorials, 2019.
  • [3] Weiyu Cheng, Yanyan Shen, Yanmin Zhu, and Linpeng Huang. A neural attention model for urban air quality inference: Learning the weights of monitoring stations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [4] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
  • [5] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In International Conference on Machine Learning, pages 5110–5119, 2018.
  • [6] Chaoyun Zhang, Marco Fiore, and Paul Patras. Multi-Service mobile traffic forecasting via convolutional long Short-Term memories. In 2019 IEEE International Symposium on Measurements & Networking (M&N) (IEEE M&N 2019), Jul 2019.
  • [7] Yuxuan Liang, Songyu Ke, Junbo Zhang, Xiuwen Yi, and Yu Zheng. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In IJCAI, pages 3428–3434, 2018.
  • [8] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In Advances in Neural Information Processing Systems, pages 5617–5627, 2017.
  • [9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [10] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • [11] Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.
  • [12] Chaoyun Zhang and Paul Patras. Long-term mobile traffic forecasting using deep spatio-temporal neural networks. In Proc. ACM MobiHoc, 2018.
  • [13] Dario Bega, Marco Gramaglia, Marco Fiore, Albert Banchs, and Xavier Costa-Perez. DeepCog: Cognitive Network Management in Sliced 5G Networks with Deep Learning. In Proc. IEEE INFOCOM, 2019.
  • [14] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Advances in Neural Information Processing Systems, pages 879–888, 2017.
  • [15] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [16] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
  • [17] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
  • [18] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
  • [19] Russel Schaffer and Robert Sedgewick. The analysis of heapsort. Journal of Algorithms, 15(1):76–100, 1993.
  • [20] Liang Zhang, Guangming Zhu, Lin Mei, Peiyi Shen, Syed Afaq Ali Shah, and Mohammed Bennamoun. Attention in convolutional LSTM for gesture recognition. In Advances in Neural Information Processing Systems, pages 1953–1962, 2018.
  • [21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [22] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [23] Hao Dong, Akara Supratak, Luo Mai, Fangde Liu, Axel Oehmichen, Simiao Yu, and Yike Guo. TensorLayer: A versatile library for efficient deep learning development. In Proc. ACM Multimedia, pages 1201–1204, 2017.
  • [24] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
  • [25] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.
  • [26] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • [27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [28] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2013.
  • [29] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [30] Liang Liu, Wangyang Wei, Dong Zhao, and Huadong Ma. Urban resolution: New metric for measuring the quality of urban sensing. IEEE Transactions on Mobile Computing, 14(12):2560–2575, 2015.
  • [31] Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In Proc. IEEE International Conference on Pattern Recognition (ICPR), pages 2366–2369, 2010.
  • [32] Chaoyun Zhang, Xi Ouyang, and Paul Patras. ZipNet-GAN: Inferring fine-grained mobile traffic patterns via a generative adversarial neural network. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 363–375. ACM, 2017.


Appendix A D-Conv Implementation

The D-Conv can be efficiently implemented using a standard 2D convolution operator, by data shape transformation. We assume a batch size of 1 for simplicity. Recall that the input and output of D-Conv are 3D tensors. For each point in the input, we find the set of its top nearest neighbors. Combining these, we transform the input into a 4D tensor. To perform D-Conv over this tensor, we split the operator into the following steps:

Algorithm 1 Efficient algorithm for D-Conv implementation using the 2D convolution operator
1: Inputs:
2:      The 4D input tensor.
3: Initialise:
4:      The weight tensor.
5: Reshape the input map into the layout required by the 2D convolution operator
6: Reshape the weight tensor into the corresponding 2D convolution kernel
7: Perform 2D convolution with step 1 without padding. The output becomes a 3D tensor
8: Reshape the output map to the output shape of D-Conv
9: Apply the sigmoid function to the coordinate features of the output

This allows the D-Conv to be translated into a standard convolution operation, which is highly optimized by existing deep learning frameworks.
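The reshaping idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tensor layouts, the names `neigh`, `w`, `num_values` and the helper functions are assumptions made for illustration, and neighbors are searched on channel-averaged coordinates purely for simplicity. Since NumPy has no 2D convolution operator, the 1 x K convolution with step 1 and no padding is written as the tensor contraction it reduces to, and is checked against a direct contraction over neighbors, features and channels.

```python
import numpy as np

def nearest_neighbors(coords, k):
    """Indices of the k nearest points (Euclidean, self included) for each point."""
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(dist2, axis=1)[:, :k]  # shape (N, K)

def dconv_direct(neigh, w):
    """Reference result: contract over neighbors, input features and input channels."""
    # Assumed layouts: neigh (N, K, F, U1), w (U1, K, F, F, U2) -> output (N, F, U2)
    return np.einsum("nkfu,ukfgv->ngv", neigh, w)

def dconv_via_conv2d(neigh, w):
    """Same result after merging features and channels, as a 1 x K convolution would."""
    n, k, f, u1 = neigh.shape
    u2 = w.shape[-1]
    x = neigh.reshape(n, k, f * u1)                                  # (N, K, F*U1)
    kernel = w.transpose(1, 2, 0, 3, 4).reshape(k, f * u1, f * u2)   # (K, F*U1, F*U2)
    # A 1 x K kernel slid over an (N, K) map without padding yields one output per point.
    out = np.tensordot(x, kernel, axes=([1, 2], [0, 1]))             # (N, F*U2)
    return out.reshape(n, f, u2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dconv(points, w, num_values, k):
    """points: (N, F, U1); the last F - num_values features are assumed to be coordinates."""
    coords = points[:, num_values:, :].mean(axis=-1)   # (N, L), channel-averaged (assumption)
    neigh = points[nearest_neighbors(coords, k)]       # (N, K, F, U1)
    out = dconv_via_conv2d(neigh, w)
    out[:, num_values:, :] = sigmoid(out[:, num_values:, :])  # squash coordinates to (0, 1)
    return out

rng = np.random.default_rng(0)
points = rng.random((5, 3, 2))            # N=5 points, F=1 value + 2 coordinates, U1=2
w = rng.standard_normal((2, 3, 3, 3, 4))  # U1=2, K=3, F=3, F=3, U2=4
out = dconv(points, w, num_values=1, k=3)  # shape (5, 3, 4)
```

In a deep learning framework, the contraction in `dconv_via_conv2d` would be replaced by the framework's own 2D convolution call on the reshaped tensors.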

Appendix B Proof of Transformation Invariance

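We show that the normalization of the coordinate features enables transformation invariance with shifting and scaling. The shifting and scaling of a point can be represented as the multiplication of its coordinates by a positive scaling coefficient, followed by the addition of an offset. By normalizing the coordinates, both the scaling coefficient and the offset cancel out, so that the normalized coordinates of the transformed point are equal to those of the original point.
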
This implies that, by using normalization, the model is invariant to shifting and scaling transformations.
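The argument can be written out as a short worked derivation. This is a sketch under the assumption that the coordinates are min-max normalized independently on each dimension; the symbols $p$, $A$, $B$ and $\varsigma$ are illustrative choices, not necessarily the notation of the original equations.

```latex
\begin{align}
p' &= A\,p + B, \qquad A > 0, \\
\varsigma(p') &= \frac{p' - p'_{\min}}{p'_{\max} - p'_{\min}}
  = \frac{(A\,p + B) - (A\,p_{\min} + B)}{(A\,p_{\max} + B) - (A\,p_{\min} + B)}
  = \frac{A\,(p - p_{\min})}{A\,(p_{\max} - p_{\min})}
  = \varsigma(p).
\end{align}
```

The condition $A > 0$ guarantees that the minimum and maximum of the original coordinates map to the minimum and maximum of the transformed ones, which is what allows $p'_{\min}$ and $p'_{\max}$ to be substituted in the second step.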

Appendix C Soft Attention Mechanism

We combine our proposed CloudLSTM with the attention mechanism introduced in [11]. Given the states of the encoder and of the decoder, the context tensor for a given state is computed as a weighted sum of the encoder states, where the weights are obtained by normalizing a score function over all encoder states. The score function can be selected among many alternatives. In this paper, we choose a score function that applies a convolution to the concatenation of the encoder and decoder states, with learnable weights. The decoder state and the context tensor are then concatenated into a new tensor for the following operations.
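A minimal NumPy sketch of this kind of soft attention follows. It is an illustration under stated assumptions, not the paper's exact formulation: the state layout `(N, F, U)`, the names `w_a` and `v_a`, the use of a per-point linear map in place of the convolution, the `tanh` nonlinearity, and the averaging of the score over all points are all hypothetical choices.

```python
import numpy as np

def soft_attention(enc_states, dec_state, w_a, v_a):
    """enc_states: (J, N, F, U); dec_state: (N, F, U); w_a: (2U, A); v_a: (A,)."""
    scores = []
    for enc in enc_states:
        joint = np.concatenate([enc, dec_state], axis=-1)  # (N, F, 2U)
        hidden = np.tanh(joint @ w_a)                      # (N, F, A), linear map stands in for the convolution
        scores.append((hidden @ v_a).mean())               # one scalar score per encoder state (assumption)
    scores = np.array(scores)
    weights = np.exp(scores - scores.max())                # numerically stable softmax
    weights /= weights.sum()
    context = np.tensordot(weights, enc_states, axes=(0, 0))  # weighted sum of encoder states, (N, F, U)
    # The decoder state and the context tensor are concatenated for the following operations.
    return np.concatenate([dec_state, context], axis=-1), weights

rng = np.random.default_rng(0)
enc_states = rng.standard_normal((6, 5, 3, 4))  # J=6 encoder states, N=5 points, F=3, U=4
dec_state = rng.standard_normal((5, 3, 4))
w_a = rng.standard_normal((8, 7))
v_a = rng.standard_normal(7)
attended, weights = soft_attention(enc_states, dec_state, w_a, v_a)
```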

Appendix D Models Configuration

We show in Table 2 the detailed configuration along with the number of parameters for each model considered in this study.

Model Configuration
MLP Five hidden layers, 500 hidden units for each layer
CNN Eleven 2D convolutional layers, each applies 108 channels and filters, with batch normalization and ReLU functions.

3D-CNN Eleven 3D convolutional layers, each applies 108 channels and filters, with batch normalization and ReLU functions.
DefCNN Eleven 2D convolutional layers, each applies 108 channels and filters, with batch normalization and ReLU functions. Offsets are predicted by separate convolutional layers
PointCNN Eight X-Conv layers
LSTM 2-stack Seq2seq LSTM, with 500 hidden units
ConvLSTM 2-stack Seq2seq ConvLSTM, with 36 channels and filters
PredRNN++ 2-stack Seq2seq PredRNN++, with 36 channels and filters
CloudRNN 2-stack Seq2seq CloudRNN, with 36 channels and
CloudGRU 2-stack Seq2seq CloudGRU, with 36 channels and
CloudLSTM () 2-stack Seq2seq CloudLSTM, with 36 channels and
CloudLSTM () 2-stack Seq2seq CloudLSTM, with 36 channels and
CloudLSTM () 2-stack Seq2seq CloudLSTM, with 36 channels and
Attention CloudLSTM 2-stack Seq2seq CloudLSTM, with 36 channels, and soft attention mechanism
Table 2: The configuration of all models considered in this study.

Appendix E Loss Function and Performance Metrics

We optimize all architectures using the MSE loss function, i.e., the squared difference between the mobile traffic volume forecast for each service, at each antenna and each time instant, and its corresponding ground truth, averaged over all services, antennas and time instants.

We employ MAE, RMSE, PSNR and SSIM to evaluate the performance of our models. These metrics rely on the average and maximum traffic recorded for all services, at all antennas and time instants of the test set, as well as on the variance and covariance of the forecast and the ground truth. Two coefficients are employed to stabilize the SSIM fraction in the presence of weak denominators. Following standard practice, these are set as a function of the dynamic range of float type data.
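For reference, the four metrics can be sketched in NumPy using their commonly used textbook definitions. This is an assumption-laden illustration rather than the exact formulas of this paper: SSIM is computed globally over the whole array, and the arguments `peak`, `dynamic_range`, `k1` and `k2` (with the defaults 0.01 and 0.03 that are common in the image-quality literature) are hypothetical parameters, not the values used in our experiments.

```python
import numpy as np

def mae(pred, truth):
    """Mean absolute error over all services, antennas and time instants."""
    return np.abs(pred - truth).mean()

def rmse(pred, truth):
    """Root mean squared error."""
    return np.sqrt(((pred - truth) ** 2).mean())

def psnr(pred, truth, peak):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum traffic value."""
    mse = ((pred - truth) ** 2).mean()
    return 20.0 * np.log10(peak) - 10.0 * np.log10(mse)

def ssim(pred, truth, dynamic_range, k1=0.01, k2=0.03):
    """Global (single-window) structural similarity index."""
    c1 = (k1 * dynamic_range) ** 2  # stabilizes the mean term
    c2 = (k2 * dynamic_range) ** 2  # stabilizes the variance term
    mu_p, mu_t = pred.mean(), truth.mean()
    var_p, var_t = pred.var(), truth.var()
    cov = ((pred - mu_p) * (truth - mu_t)).mean()
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return num / den

truth = np.linspace(0.0, 10.0, 24).reshape(2, 3, 4)  # toy (time, antenna, service) array
pred = truth + 1.0                                   # forecast with a constant error of 1
```

Higher PSNR and SSIM indicate better forecasts, whereas lower MAE and RMSE are better.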

Appendix F Dataset Statistics

F.1 Data Collection

Figure 6: Fraction of the total traffic consumed by each mobile service (left) and each service category (right) in the considered set.

The measurement data is collected via traditional flow-level deep packet inspection at the packet gateway (P-GW). Proprietary traffic classifiers are used to associate flows to specific services. Due to data protection and confidentiality constraints, we do not disclose the name of the operator, the target metropolitan regions, or the detailed operation of the classifiers. For similar reasons, we cannot name the exact mobile services studied.

As a final remark on data collection, we stress that all measurements were carried out under the supervision of the competent national privacy agency and in compliance with applicable regulations. In addition, the dataset we employ for our study only provides mobile service traffic information accumulated at the antenna level, and does not contain personal information about individual subscribers. This implies that the dataset is fully anonymized and its use for our purposes does not raise privacy concerns.

F.2 Service Usage Overview

As already mentioned, the set of services considered in our analysis comprises 38 different services. An overview of the fraction of the total traffic consumed by each service and each category in both cities throughout the duration of the measurement campaign is shown in Fig. 6. The left plot confirms the power law previously observed in the demands generated by individual mobile services. Also, streaming is the dominant type of traffic, with five services ranking among the top ten. This is confirmed in the right plot, where streaming accounts for almost half of the total traffic consumption. Web, cloud, social media, and chat services also consume large fractions of the total mobile traffic, between 8% and 17%, whereas gaming only accounts for 0.5% of the demand.

Appendix G Service-wise Evaluation

Finally, we dive deeper into the performance of the proposed Attention CloudLSTMs, by evaluating the forecasting accuracy for each individual service, averaged over 36 steps. To this end, we present the MAE evaluation on a service basis (left) and category basis (right) in Fig. 7. Observe that the Attention CloudLSTMs obtain similar performance over both cities at the service and category level. Analyzing these results jointly with Fig. 6, we see that services with higher traffic volume on average (e.g., streaming and cloud) also yield higher prediction errors. This is because their traffic evolution exhibits more frequent fluctuations, which introduces higher uncertainty, making the traffic series more difficult to predict.
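The per-service aggregation can be sketched as follows. The array layout `(steps, antennas, services)` and the function name `service_mae` are assumptions made for illustration; the toy data below is synthetic and unrelated to the measurement datasets.

```python
import numpy as np

def service_mae(pred, truth):
    """MAE per service, averaged over prediction steps and antennas."""
    # Assumed layout: (steps, antennas, services)
    return np.abs(pred - truth).mean(axis=(0, 1))

truth = np.zeros((36, 4, 3))                    # 36 steps, 4 antennas, 3 services (toy sizes)
pred = truth + np.array([1.0, 2.0, 3.0])        # constant error of 1, 2 and 3 per service
per_service = service_mae(pred, truth)          # one MAE value per service
```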

Figure 7: Service-level MAE evaluation on both cities for the Attention CloudLSTMs, averaged over 36 prediction steps.