1 Introduction
Pointcloud stream forecasting seeks to predict the future values and/or locations of data streams generated by a geospatial pointcloud, given sequences of historical observations shi2018machine. Example data sources include mobile network antennas that serve the traffic generated by ubiquitous mobile services at city scale zhang2019deep, sensors that monitor the air quality of a target region cheng2018neural, or moving crowds that produce individual trajectories. Unlike traditional spatiotemporal forecasting on grid-structural data (e.g., precipitation nowcasting xingjian2015convolutional, or video frame prediction wang2018predrnn++), pointcloud stream forecasting needs to operate on geometrically scattered sets of points, which are irregular and unordered, and which encapsulate complex spatial correlations. While vanilla Long Short-Term Memories (LSTMs) have modest abilities to exploit spatial features xingjian2015convolutional, convolution-based recurrent neural network (RNN) models (e.g., ConvLSTM xingjian2015convolutional and PredRNN++ wang2018predrnn++) are limited to modeling grid-structural data, and are therefore inappropriate for handling scattered pointclouds.
Leveraging the location information embedded in such irregular data sources, so as to learn important spatiotemporal features, is in fact challenging. Existing approaches that tackle the pointcloud stream forecasting problem can be categorized into two classes, both of which bear significant shortcomings: (i) methods that transform pointclouds into data structures amenable to processing with mature solutions (e.g., grids Patr1907:Multi, see Fig. 1); and (ii) models that ignore the exact locations of each data source and the inherent spatial correlations (e.g., liang2018geoman). The transformations required by the former not only add data preprocessing overheads, but also introduce spatial displacements, which distort relevant correlations among points Patr1907:Multi. The latter, on the other hand, are largely location-invariant, while recent literature suggests spatial correlations should be revisited over time, to suit series prediction tasks shi2017deep. In essence, overlooking dynamic spatial correlations leads to modest forecasting performance.
Contributions. In this paper, we introduce Convolutional Pointcloud LSTMs (CloudLSTMs), a new branch of recurrent neural network models tailored to geospatial pointcloud stream forecasting. The CloudLSTM builds upon a Dynamic Convolution (Conv) operator, which takes raw pointcloud streams (both data time series and spatial coordinates) as input, and performs dynamic convolution over these, to learn spatiotemporal features over time, irrespective of the topology and permutations of the pointcloud. This eliminates the data preprocessing overheads mentioned above and circumvents the negative effects of spatial displacement. The proposed CloudLSTM takes into account the locations of each data source and performs dynamic positioning at each time step, to conduct a deformable convolution operation over pointclouds dai2017deformable. This allows revising the spatial and temporal correlations, and the configuration of the data points, over time, and guarantees that the location-variance property is met at different steps. Importantly, the Conv operator is flexible, as it can be easily plugged into existing neural network models for different purposes, such as RNNs, LSTMs, sequence-to-sequence (Seq2seq) learning sutskever2014sequence, and attention mechanisms luong2015effective.
We perform antenna-level forecasting of the data traffic generated by mobile services zhang2018long ; bega2019deepcog as a case study, experimenting with metropolitan-scale mobile traffic measurements collected in two European cities for 38 popular mobile apps. This represents an important application of geospatial pointcloud stream forecasting. We combine our CloudLSTM with Seq2seq learning and an attention mechanism, then undertake a comprehensive evaluation on both datasets. The results obtained demonstrate that our architecture delivers precise long-term mobile traffic forecasting, outperforming eight different baseline neural network models in terms of four performance metrics, without any data preprocessing requirements. To the best of our knowledge, the proposed CloudLSTM is the first dedicated neural architecture for spatiotemporal forecasting that operates directly on pointcloud streams.
2 Related Work
Spatiotemporal Forecasting. Convolution-based RNN architectures have been widely employed for spatiotemporal forecasting, as they simultaneously capture the spatial and temporal dynamics of the input. Shi et al. incorporate convolution into LSTMs, building a ConvLSTM for precipitation nowcasting xingjian2015convolutional. This approach exploits spatial information, which in turn leads to higher prediction accuracy. The ConvLSTM is improved by constructing a subnetwork to predict state-to-state connections, thereby guaranteeing location variance and flexibility of the model shi2017deep. PredRNN wang2017predrnn and PredRNN++ wang2018predrnn++ evolve the ConvLSTM by constructing spatiotemporal cells and adding gradient highway units. These improve long-term forecasting performance and mitigate the gradient vanishing problem encountered in recurrent architectures. Although these solutions work well for spatiotemporal forecasting, they cannot be applied directly to pointcloud streams, as they require pointcloud-to-grid preprocessing Patr1907:Multi.
Feature Learning on Pointclouds. Deep neural networks for feature learning on pointcloud data are advancing rapidly. PointNet performs feature learning while maintaining input permutation invariance qi2017pointnet. PointNet++ upgrades this structure by hierarchically partitioning pointclouds and performing feature extraction on local regions qi2017pointnet++. VoxelNet employs voxel feature encoding to limit inter-point interactions within a voxel zhou2018voxelnet. This effectively projects cloud points onto subgrids, which enables feature learning. Li et al. generalize the convolution operation to pointclouds and employ transformations to learn the weights and permutations of the features li2018pointcnn. Through this, the proposed PointCNN leverages the spatially-local correlations of pointclouds, irrespective of the order of the input. Notably, although these architectures can learn the spatial features of pointclouds, they are designed to work with static data, and thus have limited ability to discover temporal dependencies.
3 Convolutional Pointcloud LSTM
Next, we describe in detail the concept and properties of forecasting over pointcloud streams. We then introduce the Conv operator, which is at the core of our proposed CloudLSTM architecture. Finally, we present the CloudLSTM and its variants, and explain how to combine the CloudLSTM with Seq2seq learning and attention mechanisms, to achieve precise forecasting over pointcloud streams.
3.1 Forecasting over Pointcloud Streams
We formally define a pointcloud containing a set of $N$ points as $\mathcal{S} = \{p_1, p_2, \ldots, p_N\}$. Each point $p_i$ contains two sets of features, i.e., $p_i = \{\nu_i, \varsigma_i\}$, where $\nu_i$ are value features (e.g., mobile traffic measurements, air quality indexes, etc.) of $p_i$, and $\varsigma_i$ are its $d$-dimensional coordinates. At each time step $t$, we may obtain $U$ different channels of $\mathcal{S}$ by conducting different measurements (these resemble the RGB channels in images), denoted by $\mathcal{S}_t$. We can then formulate the $J$-step pointcloud stream forecasting problem, given $M$ historical observations, as:
$$\widehat{\mathcal{S}}_{t+1}, \ldots, \widehat{\mathcal{S}}_{t+J} = \underset{\mathcal{S}_{t+1}, \ldots, \mathcal{S}_{t+J}}{\arg\max}\; p\left(\mathcal{S}_{t+1}, \ldots, \mathcal{S}_{t+J} \,\middle|\, \mathcal{S}_{t-M+1}, \ldots, \mathcal{S}_{t}\right). \qquad (1)$$
Note that, in some cases, each point’s coordinates may be unchanged, since the data sources are deployed at fixed locations. An ideal pointcloud stream forecasting model should embrace five key properties, similar to other pointcloud applications and spatiotemporal forecasting problems qi2017pointnet ; shi2017deep :
(i) Order invariance: A pointcloud is usually arranged without a specific order. Permutations of the input points should not affect the forecasting output qi2017pointnet.
(ii) Information intactness: The output of the model should have exactly the same number of points as the input, without losing any information, i.e., the output pointcloud retains the cardinality $N$ of the input.
(iii) Interaction among points: Points in a pointcloud are not isolated; the model should therefore be able to capture local dependencies among neighboring points and allow interactions among them qi2017pointnet.
(iv) Robustness to transformations: The model should be robust to correlationpreserving transformation operations on pointclouds, e.g., scaling and shifting qi2017pointnet .
(v) Location variance: The spatial correlations among points may change over time. Such dynamic correlations should be revised and learnable during training shi2017deep .
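As a concrete illustration of property (i), the following NumPy sketch (hypothetical helper names; not the paper's code) checks numerically that a k-NN-based symmetric aggregation is unaffected by permuting the input points:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_aggregate(points, values, k=3):
    """Symmetric aggregation: average the values of each point's k nearest
    neighbours (the point itself included). The output follows the input
    order, but the aggregate for a given physical point is order-invariant."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N) distances
    idx = np.argsort(d, axis=1)[:, :k]          # indices of the k nearest points
    return values[idx].mean(axis=1)             # (N,) neighbourhood averages

points = rng.random((10, 2))   # 10 points with 2-D coordinates
values = rng.random(10)        # one value feature per point

out = knn_aggregate(points, values)

perm = rng.permutation(10)     # shuffle the point cloud
out_perm = knn_aggregate(points[perm], values[perm])

# Aggregated features match after undoing the permutation: order invariance holds.
assert np.allclose(out[perm], out_perm)
```

The same argument carries over to the Conv operator below: neighborhoods and distance rankings are intrinsic to the geometry, not to the storage order of the points.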
In what follows, we introduce the Dynamic Convolution (Conv) operator as the core module of the CloudLSTM, and explain how Conv satisfies the aforementioned properties.
3.2 Dynamic Convolution over Point Cloud
The dynamic convolution operator (Conv) absorbs the concept of ordinary convolution over grids, which takes $U$ channels of 2D tensors as input and outputs $V$ channels of 2D tensors of smaller size (if without padding). Similarly, the Conv takes $U$ channels of a pointcloud $\mathcal{S}$ as input, and outputs $V$ channels of a pointcloud, but with the same number of elements as the input, to ensure the information intactness property (ii) discussed previously. For simplicity, we denote the $u$-th channel of the input set as $\mathcal{S}^u$ and the $v$-th channel of the output as $\mathcal{S}'^{\,v}$. Input and output are thus 3D tensors, of shape $(U, N, H+d)$ and $(V, N, H+d)$ respectively, where $H$ is the number of value features and $d$ the number of coordinate features per point.
We also define $\mathcal{Q}_i^K$ as a subset of points in $\mathcal{S}^u$, which includes the $K$ nearest points with respect to $p_i$ in the Euclidean space, i.e., $\mathcal{Q}_i^K = \{p_i^1, p_i^2, \ldots, p_i^K\}$, where $p_i^k$ is the $k$-th nearest point to $p_i$ in the set. Note that $p_i$ itself is included in $\mathcal{Q}_i^K$ as an anchor point, i.e., $p_i^1 = p_i$. Recall that each $p_i$ contains $H$ value features and $d$ coordinate features, i.e., $p_i = \{\nu_i, \varsigma_i\}$, where $\nu_i \in \mathbb{R}^H$ and $\varsigma_i \in \mathbb{R}^d$. Similar to the vanilla convolution operator, for each $p_i$ in $\mathcal{S}^u$, the Conv sums the element-wise product over all features and points in $\mathcal{Q}_i^K$, to obtain the values and coordinates of a point in the output. The mathematical expression of the Conv is thus:
$$\nu'^{\,v}_{i,n} = b^v_n + \sum_{u=1}^{U} \sum_{k=1}^{K} \sum_{m=1}^{H+d} w^{u,v}_{k,m,n}\, x^{u}_{i,k,m}, \qquad n = 1, \ldots, H,$$
$$\varsigma'^{\,v}_{i,n} = \sigma\!\left(b^v_{H+n} + \sum_{u=1}^{U} \sum_{k=1}^{K} \sum_{m=1}^{H+d} w^{u,v}_{k,m,H+n}\, x^{u}_{i,k,m}\right), \qquad n = 1, \ldots, d, \qquad (2)$$
where $x^{u}_{i,k,m}$ denotes the $m$-th feature (the $H$ values followed by the $d$ coordinates) of the $k$-th nearest neighbor of $p_i$ in input channel $u$.
In the above, we define the learnable weights $\mathcal{W}$ as 5D tensors with shape $(U, V, K, H+d, H+d)$. The weights are shared across different anchor points in the input map. Each element $w^{u,v}_{k,m,n}$ is a scalar weight for the $u$-th input channel, the $v$-th output channel, and the $k$-th nearest neighbor of each point, corresponding to the $m$-th value/coordinate feature of the input points and the $n$-th value/coordinate feature of the output points. Similar to the convolution operator, we define $b^v$ as a bias for the $v$-th output map. Above, $\nu_i$ and $\nu'^{\,v}_i$ are the value features of the input/output point sets, while $\varsigma_i$ and $\varsigma'^{\,v}_i$ are the coordinate features of the input/output.
$\sigma$ is the sigmoid function, which limits the range of predicted coordinates to $(0, 1)$, to avoid outliers. Before feeding them to the model, the coordinates of raw pointclouds are normalized to $(0, 1)$ on each dimension. This improves the transformation robustness of the operator.
We provide a graphical illustration of Conv in Fig. 2. For each point $p_i$, the Conv operator weights its $K$ nearest neighbors across all features, to produce the values and coordinates in the next layer. Since a permutation of the input affects neither the neighboring information nor the ranking of the distances for any $p_i$, Conv is a symmetric function whose output does not depend on the input order. This means that property (i) discussed in Sec. 3.1 is satisfied. Further, Conv is performed on every point in the set and produces exactly the same number of features and points at its output; property (ii) is therefore naturally fulfilled. In addition, operating over a neighboring point set, irrespective of its layout, allows the operator to capture local dependencies and improves robustness to global transformations (e.g., shifting and scaling). The normalization of the coordinate features further improves robustness to those transformations. This enables the operator to meet the desired properties (iii) and (iv). More importantly, Conv learns the layout and topology of the pointcloud for the next layer, which changes the neighboring set of each point at the output. This enables location variance (property (v)), allowing the model to perform dynamic positioning tailored to each channel and time step. This is essential in spatiotemporal forecasting neural models, as spatial correlations change over time shi2017deep.
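A minimal single-channel sketch of this dynamic convolution may help fix ideas. The NumPy example below (all names hypothetical; a simplification of Eq. 2 with $U = V = 1$) gathers each point's K nearest neighbors, weights all of their features, and squashes the predicted coordinates with a sigmoid:

```python
import numpy as np

rng = np.random.default_rng(1)
N, H, d, K = 20, 2, 2, 4   # points, value features, coordinate dims, neighbours

def dconv(values, coords, W, b):
    """Dynamic convolution sketch for one input and one output channel:
    weight each point's K nearest neighbours across all H+d features, then
    apply a sigmoid to the coordinate outputs to keep them in (0, 1)."""
    feats = np.concatenate([values, coords], axis=1)          # (N, H+d)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    nbrs = np.argsort(dist, axis=1)[:, :K]                    # (N, K); the anchor point is its own 1st neighbour
    gathered = feats[nbrs]                                    # (N, K, H+d)
    out = np.einsum('nkm,kmf->nf', gathered, W) + b           # weighted sum over neighbours and features
    new_values = out[:, :H]
    new_coords = 1.0 / (1.0 + np.exp(-out[:, H:]))            # sigmoid keeps coordinates in (0, 1)
    return new_values, new_coords

values = rng.random((N, H))
coords = rng.random((N, d))                                   # already normalised to (0, 1)
W = rng.standard_normal((K, H + d, H + d)) * 0.1              # shared across anchor points
b = rng.standard_normal(H + d) * 0.1

new_values, new_coords = dconv(values, coords, W, b)
```

Note how the output has exactly N points (property ii) and its coordinates remain in the unit box, so the operator can be stacked.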
Conv can be efficiently implemented using simple 2D convolution, by reshaping the input map and weight tensor, which can be parallelized easily in existing deep learning frameworks. We detail this in the appendix.
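The reshaping idea can be sketched as follows: gathering each point's neighbors and flattening the neighbor axis into the feature axis turns the whole operator into a single dense product, which a framework can equally express as a 2D convolution over the gathered input map. The example below (hypothetical names, NumPy only) verifies that the flattened form matches the direct sum of Eq. 2 for one input and one output channel:

```python
import numpy as np

rng = np.random.default_rng(2)
N, H, d, K = 12, 3, 2, 5
F = H + d

feats = rng.random((N, F))
nbrs = rng.integers(0, N, size=(N, K))        # stand-in neighbour indices (normally from a k-NN search)
W = rng.standard_normal((K, F, F))

# Direct form: weight each neighbour/feature pair per anchor point (cf. Eq. 2).
direct = np.einsum('nkm,kmf->nf', feats[nbrs], W)

# Reshaped form: flatten the neighbour axis into the feature axis, so the whole
# operator becomes one dense product -- an "im2col"-style layout that a deep
# learning framework can execute as an ordinary (2D) convolution.
flat = feats[nbrs].reshape(N, K * F)          # (N, K*F) gathered input map
W_flat = W.reshape(K * F, F)                  # (K*F, F) kernel
reshaped = flat @ W_flat

assert np.allclose(direct, reshaped)
```

Because the heavy lifting reduces to one matrix product after the gather, the operator parallelizes exactly like a standard convolution layer.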
Complexity Analysis. We study the complexity of Conv by separating the operation into two steps: (i) finding the neighboring set $\mathcal{Q}_i^K$ for each point $p_i$, and (ii) performing the weighting computation in Eq. 2. We discuss the complexity of each step separately. For simplicity, and without loss of generality, we assume the numbers of input and output channels are both 1. For step (i), the complexity of computing the point-wise Euclidean distance matrix is $\mathcal{O}(N^2)$, while finding the $K$ nearest neighbors for one point has complexity $\mathcal{O}(N \log N)$ if using heapsort schaffer1993analysis. As such, step (i) has complexity $\mathcal{O}(N^2 \log N)$. For step (ii), it is easy to see from Eq. 2 that the complexity of computing one feature of the output is $\mathcal{O}(K(H+d))$. Since each point has $H+d$ features and the output point set has $N$ points, the overall complexity of step (ii) becomes $\mathcal{O}(NK(H+d)^2)$. This is equivalent to the complexity of a vanilla convolution operator where both the input and output have $H+d$ channels, and the input map and kernel have $N$ and $K$ elements, respectively. This implies that, compared to a convolution operator whose inputs, outputs, and filters have the same size, Conv introduces extra complexity only through the search for the $K$ nearest neighbors of each point.
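Step (i) can be sketched in NumPy; note that a partial selection (`argpartition`) avoids the full per-row sort when only the K smallest distances are needed (hypothetical names, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, K = 200, 2, 6

coords = rng.random((N, d))

# Pairwise Euclidean distance matrix: O(N^2 * d) time, O(N^2) memory.
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

# A full sort per row costs O(N log N); a partial selection of the K smallest
# entries (argpartition) costs only O(N) per row on average.
nbrs = np.argpartition(dist, K - 1, axis=1)[:, :K]

# Every point is its own nearest neighbour (distance 0), so each row of the
# neighbour set must contain the anchor point itself.
assert all(i in nbrs[i] for i in range(N))
```

For very large N, spatial indices (e.g., k-d trees) would avoid materializing the full distance matrix altogether; the quadratic form above mirrors the analysis in the text.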
Relations with PointCNN li2018pointcnn and Deformable Convolution dai2017deformable. The Conv operator builds upon PointCNN li2018pointcnn and the deformable convolutional neural network (DefCNN) on grids dai2017deformable, but introduces several variations tailored to pointcloud-structured data. PointCNN employs the X-transformation over pointclouds, to learn the weighting and permutation of a local point set using multi-layer perceptrons (MLPs), which introduces extra complexity. This operator guarantees the order invariance property, but leads to information loss, since it performs aggregation over points. In our Conv operator, the permutation is instead maintained by aligning the weights with the ranking of the distances between each point $p_i$ and its neighbors. Since the distance ranking is unrelated to the order of the inputs, order invariance is ensured in a parameter-free manner, without extra complexity or loss of information.
Further, the Conv operator can be viewed as DefCNN dai2017deformable over pointclouds, with the differences that (i) DefCNN deforms weighted filters, while Conv deforms the input maps; and (ii) DefCNN employs bilinear interpolation over input maps with a set of continuous offsets, while Conv instead selects neighboring points for its operations. Both DefCNN and Conv have transformation modeling flexibility, allowing adaptive receptive fields for the convolution.
3.3 The CloudLSTM Architecture
The Conv operator can be plugged straightforwardly into LSTMs, to learn both spatial and temporal correlations over pointclouds. We formulate the Convolutional Pointcloud LSTM (CloudLSTM) as:
$$\begin{aligned} i_t &= \sigma(\mathcal{W}_{xi} \,\tilde{\star}\, \mathcal{S}_t + \mathcal{W}_{hi} \,\tilde{\star}\, \mathcal{H}_{t-1} + b_i),\\ f_t &= \sigma(\mathcal{W}_{xf} \,\tilde{\star}\, \mathcal{S}_t + \mathcal{W}_{hf} \,\tilde{\star}\, \mathcal{H}_{t-1} + b_f),\\ \mathcal{C}_t &= f_t \odot \mathcal{C}_{t-1} + i_t \odot \tanh(\mathcal{W}_{xc} \star \mathcal{S}_t + \mathcal{W}_{hc} \star \mathcal{H}_{t-1} + b_c),\\ o_t &= \sigma(\mathcal{W}_{xo} \,\tilde{\star}\, \mathcal{S}_t + \mathcal{W}_{ho} \,\tilde{\star}\, \mathcal{H}_{t-1} + b_o),\\ \mathcal{H}_t &= o_t \odot \tanh(\mathcal{C}_t). \end{aligned} \qquad (3)$$
Similar to ConvLSTM xingjian2015convolutional, $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates respectively. $\mathcal{C}_t$ denotes the memory cell and $\mathcal{H}_t$ the hidden state. Note that $i_t$, $f_t$, $o_t$, $\mathcal{C}_t$, and $\mathcal{H}_t$ are all pointcloud representations. $\mathcal{W}$ and $b$ represent learnable weight and bias tensors. In Eq. 3, ‘$\odot$’ denotes the element-wise product, ‘$\star$’ is the Conv operator formalized in Eq. 2, and ‘$\tilde{\star}$’ a simplified Conv that removes the sigmoid function in Eq. 2. The latter operates only within the gate computations, as sigmoid functions are already involved in the outer calculations (first, second, and fourth expressions in Eq. 3). We show the structure of a basic CloudLSTM cell in the left subplot of Fig. 3.
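The gating arithmetic of Eq. 3 can be illustrated with a small NumPy sketch. Here a plain per-point linear map stands in for the Conv operator (all names are hypothetical), so the focus is purely on the gate structure:

```python
import numpy as np

rng = np.random.default_rng(4)
N, F = 16, 4           # points and features per point (values + coordinates)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def op(W, X):
    """Stand-in for the Conv operator: any map from a point set to a point
    set of the same size would do; here, a plain per-point linear layer."""
    return X @ W

def cloudlstm_step(S_t, H_prev, C_prev, Wx, Wh, b):
    """One CloudLSTM step following the gate structure of Eq. 3: the sigmoid
    gates i, f, o use the simplified operator; the cell candidate the full one."""
    i = sigmoid(op(Wx['i'], S_t) + op(Wh['i'], H_prev) + b['i'])
    f = sigmoid(op(Wx['f'], S_t) + op(Wh['f'], H_prev) + b['f'])
    C = f * C_prev + i * np.tanh(op(Wx['c'], S_t) + op(Wh['c'], H_prev) + b['c'])
    o = sigmoid(op(Wx['o'], S_t) + op(Wh['o'], H_prev) + b['o'])
    H = o * np.tanh(C)
    return H, C

Wx = {g: rng.standard_normal((F, F)) * 0.1 for g in 'ifco'}
Wh = {g: rng.standard_normal((F, F)) * 0.1 for g in 'ifco'}
b  = {g: np.zeros(F) for g in 'ifco'}

S_t = rng.random((N, F))
H, C = cloudlstm_step(S_t, np.zeros((N, F)), np.zeros((N, F)), Wx, Wh, b)
```

Swapping `op` for the dynamic convolution recovers the CloudLSTM cell; swapping it for a grid convolution recovers the ConvLSTM.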
We combine our CloudLSTM with Seq2seq learning sutskever2014sequence and the soft attention mechanism luong2015effective to perform forecasting, given that these neural models have proven effective in spatiotemporal modeling on grid-structural data (e.g., xingjian2015convolutional ; zhang2018attention ). We show the overall Seq2seq CloudLSTM in the right subplot of Fig. 3. The architecture incorporates an encoder and a decoder, which are different stacks of CloudLSTMs. The encoder encodes the historical information into a tensor, while the decoder decodes this tensor into predictions. The states of the encoder and decoder are connected using the soft attention mechanism via a context vector luong2015effective. Before the pointcloud is fed to the model, and before the final forecast is generated, the data is processed by Pointcloud Convolutional (CloudCNN) layers, which perform the Conv operations. Their function is similar to that of the word embedding layer in natural language processing tasks mikolov2013distributed, helping translate the raw pointcloud into tensors and vice versa. In this study, we employ a two-stack encoder-decoder architecture and configure 36 channels for each CloudLSTM cell, as we found that further increasing the number of stacks and channels does not improve performance significantly.
Beyond the CloudLSTM, we also explore plugging the Conv operator into the vanilla RNN and the Convolutional GRU, which leads to the Convolutional Pointcloud RNN (CloudRNN) and the Convolutional Pointcloud GRU (CloudGRU). These follow the standard RNN and GRU update equations, with the matrix products replaced by the Conv operator.
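The encoder-decoder operation described above can be sketched end to end. In the following minimal example (the cell and all names are hypothetical stand-ins for CloudLSTM stacks, with attention omitted), the encoder folds M observed snapshots into a state and the decoder autoregressively rolls out J predictions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, F, M, J = 10, 3, 6, 6     # points, features, input steps, forecast steps

W = rng.standard_normal((F, F)) * 0.1

def cell(x, h):
    """Dummy recurrent cell standing in for a CloudLSTM stack."""
    return np.tanh(x @ W + h)

def seq2seq_forecast(history):
    """Encoder folds M observed snapshots into a state; the decoder then rolls
    forward autoregressively, feeding each prediction back as the next input."""
    h = np.zeros((N, F))
    for snapshot in history:          # encoder: M steps
        h = cell(snapshot, h)
    preds, x = [], history[-1]
    for _ in range(J):                # decoder: J steps
        h = cell(x, h)
        x = h                         # feed the prediction back in
        preds.append(x)
    return np.stack(preds)            # (J, N, F)

history = rng.random((M, N, F))
forecast = seq2seq_forecast(history)
```

In the full architecture, the soft attention mechanism would additionally let each decoder step weight all encoder states via a context vector, rather than relying on the final state alone.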
4 Experiments
To evaluate the performance of our architectures, we employ antenna-level mobile traffic forecasting as a case study and experiment with two large-scale mobile traffic datasets. We use the proposed CloudLSTM to forecast future mobile traffic consumption at antennas scattered across the regions of interest. We provide a comprehensive comparison with 8 baseline deep learning models, over four performance metrics. All models considered in this study are implemented using the open-source Python libraries TensorFlow tensorflow2015whitepaper and TensorLayer tensorlayer. We train all architectures on a computing cluster with two NVIDIA Tesla K40M GPUs, and optimize all models by minimizing the mean square error (MSE) between predictions and ground truth, using the Adam optimizer kingma2015adam.
Next, we introduce the datasets employed in this study, then discuss the baseline models used for comparison, the experimental settings, and the performance metrics employed for evaluation. Finally, we report on the experimental results and provide visualizations that reveal further insights.
4.1 Dataset and Preprocessing
We conduct experiments using large-scale multi-service datasets collected by a major operator in two large European metropolitan areas during 85 consecutive days. The data consists of the volume of traffic generated by devices associated with each of the 792 and, respectively, 260 antennas in the two target areas. The antennas are non-uniformly distributed over the urban regions, thus they can be viewed as 2D pointclouds over space. Their locations are fixed throughout the measurement period.
At each antenna, the traffic volume is expressed in Megabytes and aggregated over 5-minute intervals, which leads to 24,482 traffic snapshots. These snapshots are gathered independently for each of 38 different mobile services, selected among the most popular apps, including video streaming, gaming, messaging, cloud services, social networking, etc. Due to data protection and confidentiality constraints, we do not disclose the identity of the mobile operator, nor do we provide information about the exact location of the data collection equipment or the names of the mobile services considered. The data collection procedure was conducted under the supervision of the competent national privacy agency, in compliance with applicable regulations. The dataset is fully anonymized, as it only comprises service traffic aggregated at the antenna level, without unveiling personal information. (Due to a confidentiality agreement with the mobile traffic data owner, the raw data cannot be made public.)
Before being fed to the models, the traffic measurements for each mobile service are transformed into different input channels of the pointcloud. All coordinate features are normalized to the $(0, 1)$ range. In addition, for the baseline models that require grid-structural input (i.e., CNN, 3D-CNN, ConvLSTM and PredRNN++), the pointclouds are transformed into grids Patr1907:Multi using the Hungarian algorithm kuhn1955hungarian, as required. The data is split into training, validation, and testing sets.
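The coordinate normalization step can be sketched as follows (the coordinate values are synthetic; per-dimension min-max scaling is one straightforward way to map each dimension onto the unit interval):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical antenna coordinates (e.g., metres in a local map projection).
coords = rng.uniform(-5000, 5000, size=(260, 2))

# Min-max normalise each coordinate dimension independently before feeding
# pointclouds to the model; this aids robustness to shifts and scaling.
mins, maxs = coords.min(axis=0), coords.max(axis=0)
norm = (coords - mins) / (maxs - mins)

assert norm.min() >= 0.0 and norm.max() <= 1.0
```

Since the normalization is affine and per-dimension, distance rankings within each dimension are preserved, which is what the Conv neighbor search relies on.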
4.2 Benchmarks and Performance Metrics
We compare the performance of our proposed CloudLSTM with a set of baseline models, as follows. MLP Goodfellowetal2016, CNN krizhevsky2012imagenet, and 3D-CNN ji20133d are frequently used as benchmarks in mobile traffic forecasting (e.g., bega2019deepcog ; zhang2018long ). DefCNN learns the shape of the convolutional filters and bears similarities with the Conv operator proposed in this study dai2017deformable. PointCNN li2018pointcnn performs convolution over pointclouds and has been employed for pointcloud classification and segmentation. The LSTM is an advanced RNN frequently employed for time series forecasting hochreiter1997long. While ConvLSTM xingjian2015convolutional can be viewed as a baseline model for spatiotemporal predictive learning, PredRNN++ is the state-of-the-art architecture for spatiotemporal forecasting on grid-structural data and achieves the best performance in many applications wang2018predrnn++. Beyond these models, we also compare the CloudLSTM with two of its variants, i.e., the CloudRNN and CloudGRU introduced in Sec. 3.3.
We quantify the accuracy of the proposed CloudLSTM in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Since mobile traffic snapshots can be viewed as “urban images” liu2015urban, we also select the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) hore2010image to quantify the fidelity of the forecasts and their similarity with the ground truth, as suggested by relevant recent work zhang2017zipnet.
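The first three metrics can be computed directly; a minimal NumPy sketch (synthetic data, hypothetical names; SSIM is omitted, as it is typically taken from an image-processing library such as scikit-image) is:

```python
import numpy as np

rng = np.random.default_rng(7)

truth = rng.random((6, 260)) * 100          # J snapshots x N antennas (MB)
pred = truth + rng.normal(0, 1, truth.shape)  # a hypothetical forecast

# Mean Absolute Error and Root Mean Square Error over all snapshots/antennas.
mae = np.abs(pred - truth).mean()
rmse = np.sqrt(((pred - truth) ** 2).mean())

# PSNR treats each snapshot as an "urban image": higher is better.
peak = truth.max()
psnr = 20 * np.log10(peak) - 10 * np.log10(((pred - truth) ** 2).mean())
```

By Jensen's inequality, RMSE is always at least as large as MAE, so reporting both indicates how heavy-tailed the error distribution is.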
We employ all neural networks to forecast city-scale future mobile traffic consumption up to 30 minutes ahead, given 30 minutes of consecutive measurements sampled every 5 minutes. That is, all models take as input M = 6 snapshots and forecast the following J = 6 traffic volume snapshots. For RNN-based models, i.e., LSTM, ConvLSTM, PredRNN++, CloudLSTM, CloudRNN, and CloudGRU, we extend the number of prediction steps to J = 36 (i.e., 3 hours), to evaluate their long-term performance.
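The evaluation setup above amounts to sliding a window over the traffic stream; a minimal sketch (hypothetical names, synthetic data) of how (input, target) pairs may be built is:

```python
import numpy as np

rng = np.random.default_rng(8)
T, N, M, J = 100, 50, 6, 6       # timeline length, antennas, input/output steps

series = rng.random((T, N))      # toy traffic volume stream

def make_samples(series, M, J):
    """Slice a stream into (input, target) windows: M observed snapshots
    predict the following J snapshots."""
    X, Y = [], []
    for t in range(len(series) - M - J + 1):
        X.append(series[t:t + M])
        Y.append(series[t + M:t + M + J])
    return np.stack(X), np.stack(Y)

X, Y = make_samples(series, M, J)
# A stream of length T yields T - M - J + 1 overlapping windows.
assert X.shape == (89, 6, 50) and Y.shape == (89, 6, 50)
```

For the long-term evaluation, only J changes (to 36); the input window M stays fixed at 6.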
4.3 Result and Visualization
We perform 6-step forecasting for 4,888 instances across the test set, and report in Table 1 the mean and standard deviation (std) of each metric. We also investigate the effect of the number of neighboring points K considered, as well as the influence of the attention mechanism.

Table 1: Mean ± std of MAE, RMSE, PSNR, and SSIM over the test set, for both cities.

Model                               | City 1: MAE  RMSE        PSNR        SSIM       | City 2: MAE  RMSE        PSNR        SSIM
MLP Goodfellowetal2016              | 4.79±0.54    9.94±2.56   49.56±2.13  0.27±0.12  | 4.59±0.59    9.44±2.45   50.30±2.28  0.33±0.14
CNN krizhevsky2012imagenet          | 6.00±0.61    11.02±2.10  48.93±1.60  0.25±0.12  | 5.31±0.50    10.06±2.05  49.96±1.86  0.32±0.14
3D-CNN ji20133d                     | 5.15±0.60    10.25±2.46  49.34±2.05  0.30±0.13  | 5.32±0.48    10.18±2.02  49.82±1.77  0.35±0.16
DefCNN dai2017deformable            | 6.76±0.81    11.72±2.57  48.43±1.82  0.16±0.08  | 5.31±0.51    9.99±2.12   49.84±1.87  0.32±0.14
PointCNN li2018pointcnn             | 5.01±0.46    9.98±2.33   49.77±1.94  0.30±0.12  | 5.13±0.46    9.60±2.36   50.07±1.49  0.34±0.13
LSTM hochreiter1997long             | 4.23±0.66    9.61±3.18   50.08±3.21  0.30±0.12  | 4.36±1.64    9.22±3.03   50.75±3.26  0.37±0.13
ConvLSTM xingjian2015convolutional  | 4.19±1.66    9.68±3.19   49.99±3.21  0.31±0.11  | 4.28±1.63    9.19±3.02   50.68±3.24  0.38±0.13
PredRNN++ wang2018predrnn++         | 4.16±1.65    9.64±3.18   49.99±3.20  0.31±0.11  | 4.25±1.60    9.16±3.02   50.68±3.24  0.38±0.13
CloudRNN ()                         | 4.15±1.68    9.31±3.20   50.30±3.22  0.30±0.12  | 4.16±1.66    8.86±3.05   50.96±3.24  0.38±0.14
CloudGRU ()                         | 3.98±1.64    9.25±3.17   50.28±3.19  0.34±0.11  | 4.09±1.61    8.78±3.02   50.98±3.25  0.41±0.13
CloudLSTM ()                        | 3.87±1.68    9.17±3.19   50.31±3.21  0.33±0.11  | 4.08±1.57    8.84±3.02   50.95±3.24  0.41±0.13
CloudLSTM ()                        | 3.87±1.68    9.17±3.19   50.31±3.21  0.34±0.11  | 4.04±1.64    8.81±3.03   50.98±3.25  0.40±0.13
CloudLSTM ()                        | 3.83±1.68    9.13±3.19   50.33±3.21  0.35±0.11  | 4.01±1.59    8.78±3.02   51.00±3.24  0.41±0.13
Attention CloudLSTM ()              | 3.82±1.64    9.11±3.19   50.34±3.21  0.35±0.12  | 3.99±1.62    8.77±3.01   51.10±3.23  0.42±0.12
Observe that RNN-based architectures in general attain superior performance compared to the CNN-based models and the MLP. In particular, our proposed CloudLSTMs, CloudRNNs, and CloudGRUs outperform all the other architectures considered in this study, on both datasets, achieving lower MAE and RMSE, and higher PSNR and SSIM. This suggests that the Conv operator is more effective in feature learning over geospatial pointclouds than the vanilla convolution used in other models. Turning attention to our approaches, for the same number of neighbors K, the CloudLSTM performs better than the CloudGRU, which in turn outperforms the CloudRNN. The forecasting performance of the CloudLSTM appears fairly insensitive to the number of neighbors K. It is therefore worth using a small K in practice, to reduce model complexity, as this does not compromise accuracy significantly. Lastly, we observe that the attention mechanism only contributes marginally to forecasting performance. This is due to the nature of the spatiotemporal forecasting task: uncertainty grows significantly over time in multi-step prediction, thus the dependency on the encoder states also degenerates over time.
Long-term Forecasting Performance. We extend the prediction horizon up to J = 36 time steps (i.e., 3 hours) for all RNN-based architectures, and show the evolution of their MAE with respect to this horizon in Fig. 4. Note that the input length remains unchanged, i.e., 6 time steps. In city 1, observe that the MAE does not grow significantly with the prediction step for most models, as the curves flatten. This means that these models are reliable for long-term forecasting. As for city 2, we note that low values of K may lead to poorer long-term performance for the CloudLSTM, though the effect is not significant before step 20. This provides a guideline for choosing K depending on the required forecast length.
Visualization. We complete the evaluation by visualizing the hidden features of the CloudLSTM, to give insight into the knowledge learned by the model. To this end, in Fig. 5 we show an example of the scatter distributions of the hidden states of the CloudLSTM and the Attention CloudLSTM at both stacks. The first 6 columns show the hidden states of the encoders, while the rest correspond to the decoders. The input data snapshots are samples selected from City 2 (260 antennas/points). Recall that each hidden state has one value feature and two coordinate features per point; each scatter subplot in Fig. 5 therefore shows the value features (volume represented by different colors) and coordinate features (different locations), averaged over all channels. Observe that, in most subplots, points with higher values (warm colors) tend to aggregate into clusters and have higher densities. These clusters exhibit gradual changes from higher to lower values, leading to comet-shaped assemblages. This implies that points with high values also come with tighter spatial correlations, which CloudLSTMs learn to aggregate. The pattern becomes more obvious in stack 2, as features are extracted at a higher level, exhibiting more direct spatial correlations with respect to the output.
5 Conclusion
We introduce CloudLSTM, a dedicated neural model for spatiotemporal forecasting tailored to pointcloud data streams. The CloudLSTM builds upon the Conv operator, which performs convolution over pointclouds to learn spatial features while maintaining permutation invariance. The Conv simultaneously predicts the values and coordinates of each point, thereby adapting to the changing spatial correlations of the data at each time step. The Conv is flexible, as it can be easily combined with various RNN models (i.e., RNN, GRU, and LSTM), Seq2seq learning, and attention mechanisms. We employ antenna-level mobile traffic forecasting as a case study, where we show that our proposed CloudLSTM achieves state-of-the-art performance on large-scale datasets collected in two major European cities. We believe the CloudLSTM offers a new perspective on pointcloud stream modeling, and that it can be easily extended to higher-dimensional pointclouds without requiring changes to the model.
References
 [1] Xingjian Shi and Dit-Yan Yeung. Machine learning for spatiotemporal sequence forecasting: A survey. arXiv preprint arXiv:1808.06865, 2018.
 [2] Chaoyun Zhang, Paul Patras, and Hamed Haddadi. Deep learning in mobile and wireless networking: A survey. IEEE Communications Surveys & Tutorials, 2019.

 [3] Weiyu Cheng, Yanyan Shen, Yanmin Zhu, and Linpeng Huang. A neural attention model for urban air quality inference: Learning the weights of monitoring stations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [4] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
 [5] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and S Yu Philip. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In International Conference on Machine Learning, pages 5110–5119, 2018.
 [6] Chaoyun Zhang, Marco Fiore, and Paul Patras. Multi-service mobile traffic forecasting via convolutional long short-term memories. In 2019 IEEE International Symposium on Measurements & Networking (M&N), Jul 2019.
 [7] Yuxuan Liang, Songyu Ke, Junbo Zhang, Xiuwen Yi, and Yu Zheng. GeoMAN: Multilevel attention networks for geosensory time series prediction. In IJCAI, pages 3428–3434, 2018.
 [8] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In Advances in Neural Information Processing Systems, pages 5617–5627, 2017.

 [9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
 [10] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

 [11] Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.
 [12] Chaoyun Zhang and Paul Patras. Long-term mobile traffic forecasting using deep spatio-temporal neural networks. In Proc. ACM MobiHoc, 2018.
 [13] Dario Bega, Marco Gramaglia, Marco Fiore, Albert Banchs, and Xavier CostaPerez. DeepCog: Cognitive Network Management in Sliced 5G Networks with Deep Learning. In Proc. IEEE INFOCOM, 2019.
 [14] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and S Yu Philip. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Advances in Neural Information Processing Systems, pages 879–888, 2017.

 [15] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
 [16] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
 [17] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
 [18] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
 [19] Russel Schaffer and Robert Sedgewick. The analysis of heapsort. Journal of Algorithms, 15(1):76–100, 1993.
 [20] Liang Zhang, Guangming Zhu, Lin Mei, Peiyi Shen, Syed Afaq Ali Shah, and Mohammed Bennamoun. Attention in convolutional LSTM for gesture recognition. In Advances in Neural Information Processing Systems, pages 1953–1962, 2018.
 [21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [22] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [23] Hao Dong, Akara Supratak, Luo Mai, Fangde Liu, Axel Oehmichen, Simiao Yu, and Yike Guo. TensorLayer: A versatile library for efficient deep learning development. In Proc. ACM Multimedia, pages 1201–1204, 2017.
 [24] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
 [25] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(12):83–97, 1955.
 [26] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 [27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [28] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2013.
 [29] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [30] Liang Liu, Wangyang Wei, Dong Zhao, and Huadong Ma. Urban resolution: New metric for measuring the quality of urban sensing. IEEE Transactions on Mobile Computing, 14(12):2560–2575, 2015.
 [31] Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In Proc. IEEE International Conference on Pattern Recognition (ICPR), pages 2366–2369, 2010.
 [32] Chaoyun Zhang, Xi Ouyang, and Paul Patras. ZipNet-GAN: Inferring fine-grained mobile traffic patterns via a generative adversarial neural network. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 363–375. ACM, 2017.
Appendix
Appendix A Conv Implementation
The Conv operator can be efficiently implemented using a standard 2D convolution, via a data shape transformation. We assume a batch size of 1 for simplicity. Recall that the input and output of Conv are 3D tensors. For each point in the input, we find the set of its top nearest neighbors. Combining these, we transform the input into a 4D tensor. To perform Conv over this tensor, we split the operator into the following steps:
This translates the Conv into a standard convolution operation, which is highly optimized in existing deep learning frameworks.
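As a sketch of this transformation (all names and shapes here are illustrative assumptions, not the paper's implementation): gathering each point's neighbor features into a 4D tensor reduces the point-cloud convolution to a 1×K convolution over the neighbor axis, which is just a flattened matrix product.

```python
import numpy as np

def dconv_as_conv2d(X, neighbors, W):
    """Sketch: a point-cloud convolution expressed as a standard 2D conv.

    X         : (N, F) features of N points (batch size 1 assumed).
    neighbors : (N, K) indices of each point's K nearest neighbors.
    W         : (K, F, C) weights, i.e. a 1xK conv kernel with C output channels.
    """
    N, F = X.shape
    K = neighbors.shape[1]
    # Step 1: gather neighbor features -> 4D-style tensor of shape (N, K, F).
    G = X[neighbors]                                    # (N, K, F)
    # Step 2: a 1xK convolution with 'VALID' padding collapses the K and F
    # axes per point, which is exactly a matmul after flattening.
    out = G.reshape(N, K * F) @ W.reshape(K * F, -1)    # (N, C)
    return out
```

Because the gather plus reshape leaves only a dense product, any framework's optimized `conv2d` (with a 1×K kernel) computes the same result in one call.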
Appendix B Proof of Transformation Invariance
We show that normalizing the coordinate features makes the model invariant to shifting and scaling transformations. Shifting and scaling a point $p$ can be represented as:
(6) $p' = \lambda p + \beta,$
where $\lambda > 0$ is a scaling coefficient and $\beta$ an offset. Denoting by $\mu$ and $\sigma$ the mean and standard deviation of the coordinates (so that $\mu' = \lambda\mu + \beta$ and $\sigma' = \lambda\sigma$), normalization yields:
(7) $\dfrac{p' - \mu'}{\sigma'} = \dfrac{(\lambda p + \beta) - (\lambda\mu + \beta)}{\lambda\sigma} = \dfrac{p - \mu}{\sigma}.$
This implies that, through normalization, the model is invariant to shifting and scaling transformations.
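The derivation above can also be checked numerically; a minimal sketch, assuming z-score normalization of the coordinate values (the scale and offset below are arbitrary illustrative choices):

```python
import numpy as np

def normalize(p):
    # z-score normalization of a coordinate array
    return (p - p.mean()) / p.std()

p = np.array([0.3, 1.7, 2.2, 5.0])
lam, beta = 4.2, -1.3           # positive scaling coefficient and offset
p_transformed = lam * p + beta  # Eq. (6): shifting and scaling

# Eq. (7): the normalized coordinates are unchanged by the transformation.
assert np.allclose(normalize(p), normalize(p_transformed))
```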
Appendix C Soft Attention Mechanism
We combine our proposed CloudLSTM with the attention mechanism introduced in [11]. We denote the $i$-th state of the encoder by $h_i$ and the $j$-th state of the decoder by $s_j$. The context tensor $c_j$ for decoder state $s_j$, computed over the encoder states, can be represented as:
(8) $c_j = \sum_{i}\mathrm{softmax}\big(\mathrm{score}(s_j, h_i)\big)\,h_i,$
where $\mathrm{score}(\cdot,\cdot)$ is a score function, which can be selected among many alternatives. In this paper, we choose $\mathrm{score}(s_j, h_i) = v \circledast \tanh\big(W \circledast [s_j; h_i]\big)$. Here $[\cdot\,;\cdot]$ is the concatenation operator and $\circledast$ is the convolution function. Both $v$ and $W$ are learnable weights. The decoder state $s_j$ and context tensor $c_j$ are concatenated into a new tensor for the following operations.
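A minimal dense (non-convolutional) sketch of concat-style soft attention in the spirit of [11]; all names, shapes, and the replacement of convolutions by matrix products are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention_context(s_j, H, W, v):
    """s_j : (F,) decoder state; H : (T, F) encoder states;
    W : (2F, A) and v : (A,) play the role of the learnable weights
    (the paper's convolutions are replaced by dense products here)."""
    T, F = H.shape
    # score(s_j, h_i) = v . tanh(W [s_j; h_i]) for every encoder state
    concat = np.concatenate([np.repeat(s_j[None, :], T, axis=0), H], axis=1)
    scores = np.tanh(concat @ W) @ v            # (T,)
    alpha = softmax(scores)                     # attention weights, sum to 1
    context = alpha @ H                         # weighted sum of encoder states
    # context and decoder state are concatenated for the following operations
    return np.concatenate([context, s_j])
```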
Appendix D Models Configuration
We show in Table 2 the detailed configuration along with the number of parameters for each model considered in this study.
Model  Configuration

MLP  Five hidden layers, 500 hidden units per layer
CNN  Eleven 2D convolutional layers, each with 108 channels and filters, batch normalization and ReLU activations
3D-CNN  Eleven 3D convolutional layers, each with 108 channels and filters, batch normalization and ReLU activations
DefCNN  Eleven 2D convolutional layers, each with 108 channels and filters, batch normalization and ReLU activations; offsets are predicted by separate convolutional layers
PointCNN  Eight Conv layers
LSTM  2-stack Seq2seq LSTM, with 500 hidden units
ConvLSTM  2-stack Seq2seq ConvLSTM, with 36 channels and filters
PredRNN++  2-stack Seq2seq PredRNN++, with 36 channels and filters
CloudRNN  2-stack Seq2seq CloudRNN, with 36 channels
CloudGRU  2-stack Seq2seq CloudGRU, with 36 channels
CloudLSTM ()  2-stack Seq2seq CloudLSTM, with 36 channels
CloudLSTM ()  2-stack Seq2seq CloudLSTM, with 36 channels
CloudLSTM ()  2-stack Seq2seq CloudLSTM, with 36 channels
Attention CloudLSTM  2-stack Seq2seq CloudLSTM, with 36 channels, and soft attention mechanism
Appendix E Loss Function and Performance Metrics
We optimize all architectures using the MSE loss function:
(9) $\mathrm{MSE} = \dfrac{1}{TSN}\sum_{t=1}^{T}\sum_{k=1}^{S}\sum_{i=1}^{N}\big(\hat{v}^{k}_{t}(i) - v^{k}_{t}(i)\big)^{2}.$
Here $\hat{v}^{k}_{t}(i)$ is the mobile traffic volume forecast for the $k$-th service at antenna $i$ at time $t$, and $v^{k}_{t}(i)$ is its corresponding ground truth; $T$, $S$ and $N$ denote the number of time steps, services and antennas. We employ MAE, RMSE, PSNR and SSIM to evaluate the performance of our models. These are defined as:
(10) $\mathrm{MAE} = \dfrac{1}{TSN}\sum_{t=1}^{T}\sum_{k=1}^{S}\sum_{i=1}^{N}\big|\hat{v}^{k}_{t}(i) - v^{k}_{t}(i)\big|,$
(11) $\mathrm{RMSE} = \sqrt{\mathrm{MSE}},$
(12) $\mathrm{PSNR} = 20\log_{10}\dfrac{v_{\max}}{\mathrm{RMSE}},$
(13) $\mathrm{SSIM} = \dfrac{(2\mu_{\hat{v}}\mu_{v} + c_1)(2\sigma_{\hat{v}v} + c_2)}{(\mu_{\hat{v}}^{2} + \mu_{v}^{2} + c_1)(\sigma_{\hat{v}}^{2} + \sigma_{v}^{2} + c_2)},$
where $\mu$ and $v_{\max}$ are the average and maximum traffic recorded for all services, at all antennas and time instants of the test set; $\sigma^{2}$ and $\sigma_{\hat{v}v}$ denote the variance and covariance, respectively. Coefficients $c_1$ and $c_2$ are employed to stabilize the fraction in the presence of weak denominators. Following standard practice, we set $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$, where $L$ is the dynamic range of float type data, and $k_1 = 0.01$, $k_2 = 0.03$.
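A compact sketch of these metrics; the array names and the choice of dynamic range below are assumptions for illustration:

```python
import numpy as np

def forecast_metrics(pred, truth, k1=0.01, k2=0.03):
    """Sketch of MAE, RMSE, PSNR and a global SSIM; `pred`/`truth` are
    traffic-volume arrays of any shared shape, e.g. (time, service, antenna)."""
    err = pred - truth
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    # PSNR uses the maximum traffic recorded in the test set as peak value.
    psnr = 20 * np.log10(truth.max() / rmse)
    # SSIM with c1, c2 stabilising weak denominators; the dynamic range L
    # is taken here as the value range of the data (an assumption).
    L = truth.max() - truth.min()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_p, mu_t = pred.mean(), truth.mean()
    var_p, var_t = pred.var(), truth.var()
    cov = ((pred - mu_p) * (truth - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return mae, rmse, psnr, ssim
```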
Appendix F Dataset Statistics
f.1 Data Collection
The measurement data is collected via traditional flow-level deep packet inspection at the packet gateway (PGW). Proprietary traffic classifiers are used to associate flows to specific services. Due to data protection and confidentiality constraints, we do not disclose the name of the operator, the target metropolitan regions, or the detailed operation of the classifiers. For similar reasons, we cannot name the exact mobile services studied.
As a final remark on data collection, we stress that all measurements were carried out under the supervision of the competent national privacy agency and in compliance with applicable regulations. In addition, the dataset we employ for our study only provides mobile service traffic information accumulated at the antenna level, and does not contain personal information about individual subscribers. This implies that the dataset is fully anonymized and its use for our purposes does not raise privacy concerns.
f.2 Service Usage Overview
As already mentioned, our analysis considers 38 different services. An overview of the fraction of total traffic consumed by each service and each category in both cities, throughout the duration of the measurement campaign, is given in Fig. 6. The left plot confirms the power law previously observed in the demand generated by individual mobile services. Streaming is the dominant type of traffic, with five services ranking among the top ten. This is confirmed in the right plot, where streaming accounts for almost half of the total traffic consumption. Web, cloud, social media, and chat services also consume large fractions of the total mobile traffic, between 8% and 17%, whereas gaming only accounts for 0.5% of the demand.
Appendix G Servicewise Evaluation
Finally, we dive deeper into the performance of the proposed Attention CloudLSTM, by evaluating the forecasting accuracy for each individual service, averaged over 36 steps. To this end, we present the MAE evaluation on a service basis (left) and category basis (right) in Fig. 7. Observe that the Attention CloudLSTM obtains similar performance over both cities at the service and category level. Analyzed jointly with Fig. 6, we see that services with higher traffic volume on average (e.g., streaming and cloud) also yield higher prediction errors. This is because their traffic evolution exhibits more frequent fluctuations, which introduces higher uncertainty, making the traffic series more difficult to predict.