I Introduction
Time series prediction plays an important role in many fields, such as sensor network monitoring [1], energy and smart grid management, economics and finance [2], and disease propagation analysis [3]. In each of these scenarios, we can make long-term predictions using a large amount of past time series data, namely long sequence time-series forecasting (LSTF). Multidimensional time series prediction is an important part of it: multiple kinds of data are generated at the same time, and the historical values of these data are used to predict their future values. This paper mainly discusses different types of time series data collected at the same place and time. However, existing models cannot learn the hidden relationships between different dimensions well. Historically, vector autoregression (VAR) is arguably the most widely used model for multivariate time series
[4][5][6] due to its simplicity. In recent years, various VAR models have made significant progress, including the elliptical VAR [7] model for heavy-tailed time series and the structured VAR [8] model for better explaining the dependence between high-dimensional variables. However, the model capacity of VAR grows linearly with the temporal window size and quadratically with the number of variables, which means the model easily overfits when dealing with long time series. To alleviate this problem, [9] proposed reducing the original high-dimensional signal to a low-dimensional latent representation and then applying VAR with a variety of regularizations for prediction. The time series prediction problem can also be regarded as a standard regression problem with time-varying parameters, so there is a large body of work applying regression models with different loss functions and regularization terms to time series prediction tasks. For example, linear support vector regression (SVR) learns a hyperplane based on a regression loss and controls the threshold of the prediction error with the hyperparameter ε. Lastly,
[10] used LASSO models to encourage sparsity in the model parameters so that interesting patterns among different input signals become manifest. These linear methods are efficient in practice for multivariate time series prediction thanks to high-quality off-the-shelf solvers in the machine learning community. However, like VAR, these models cannot capture the complex nonlinear relationships between multiple dimensions, resulting in low performance.
Gaussian Processes (GP) are a nonparametric method for modeling distributions over a continuous domain of functions. GP can be applied to the multivariate time series prediction task, as suggested in [11]. For example, [12] proposes a fully Bayesian method with GP priors for nonlinear state space models, which can capture complex dynamic phenomena. However, the power of GP comes at the cost of high computational complexity: because the kernel matrix must be inverted, predicting multivariate time series with this model has cubic complexity.
The convolutional layer used in the LSTNet model [13] cannot solve this problem well. As for LSTMa, a model originally designed for text translation [14], its recurrent neural network module cannot handle multidimensional dependence well when transferred to time series prediction. There is another serious problem with LSTM-based models: [15] points out that the accuracy of LSTM prediction decreases greatly once the prediction length exceeds a certain threshold, so such models do not perform well on long time series. At present, there are two main challenges in multidimensional long time series prediction. First, the model should be able to handle increasingly long prediction sequences. Second, since there are dependencies between different dimensions in multidimensional sequence data, the model needs to capture these dependencies. Recently emerged Transformer models capture long-range dependency better than recurrent neural network (RNN) models. The self-attention mechanism reduces the maximum signal traveling path and avoids the recurrent structure, so the Transformer has greater potential than RNNs for the LSTF problem. However, the Transformer's quadratic computation and memory consumption on length-L inputs/outputs makes long time series prediction costly. To address this, Reformer [16] uses locality-sensitive hashing self-attention, but it only works on extremely long sequences. Linformer claims linear complexity, but its projection matrix cannot be fixed for real-world long sequence inputs, which risks degradation back to quadratic cost. Transformer-XL [17] and Compressive Transformer [18] use auxiliary hidden states to capture long-range dependency. These efforts mainly target the quadratic computation of self-attention and thus neglect the memory bottleneck of stacking layers for long inputs, which limits model scalability in receiving long sequence inputs.
[15] proposed the ProbSparse self-attention mechanism and self-attention distilling, which greatly reduce the time and space complexity of the model. One pressing demand is to extend forecasts far into the future, which is quite meaningful for long-term planning and early warning. However, this model only addresses the first challenge and still cannot model the dependencies between different dimensions.
Recently, graph convolutional networks (GCN) have shown superior performance in capturing spatial dependencies compared to convolutional neural networks (CNN). When a GCN updates a node's information, the new representation is obtained by a weighted sum of the node's own information and that of its neighbors, as shown in Fig. 1. In spatiotemporal graph neural network models, the GCN generally models sensors scattered across different spaces, with each sensor corresponding to a graph node, so the GCN can aggregate this spatial information using a known adjacency matrix. It occurred to us that, since a GCN can model different sensors, the sensors could be replaced with the individual dimensions of multidimensional data. This raises the question of the adjacency matrix: in the sensor scenario it is constructed from distance, and if two sensors are close enough, the corresponding entry of the adjacency matrix is set to 1. Inspired by the work of Wu et al. [19], we know that the adjacency matrix can be learned by the model itself. So, since we cannot compute the dependencies between different dimensions in advance, we let the model learn these dependencies on its own: the nodes that used to correspond to sensors at different locations now each correspond to one of the dimensions. Our work addresses the above multidimensional prediction problem. We find that almost all multidimensional time series prediction models fail to take the dependence between different dimensions into account; we solve this problem by introducing a self-adaptive graph convolution network, and we conduct extensive experiments. The contributions of this paper are summarized as follows:

We introduce an adaptive GCN into the multidimensional time series prediction model so that the model can handle the dependence between different dimensions.

We treat the adaptive GCN as a standalone framework and fuse it with existing multidimensional time series models.

We tested the fused models on several widely recognized data sets and achieved good results (about a 10% improvement).
II Method
In this section, we first give a mathematical definition of the problem we are trying to solve. Next, we introduce the two components of the framework, the graph convolution layer and the attention layer, of which the GCN is the main focus. Finally, we summarize the overall framework.
II-A Problem Definition
The multidimensional time series data prediction problem is defined on a graph G = (V, E), where V is a set of N nodes and E is a set of edges. The adjacency matrix is A ∈ R^{N×N}. If (v_i, v_j) ∈ E, then A_{ij} is one; otherwise it is zero. For a more precise representation of the dependencies between nodes, A_{ij} can take values in [0, 1]: the closer this value is to 1, the greater the dependence between nodes i and j. At each time step t, the graph has a dynamic feature matrix X^{(t)} ∈ R^{N×D}. Given H steps of past graph signals, our problem is to learn a function f that predicts the next T steps of graph signals as
[X^{(t+1)}, …, X^{(t+T)}] = f(X^{(t−H+1)}, …, X^{(t)}), (1)
where X^{(t)} ∈ R^{N×D} and f: R^{H×N×D} → R^{T×N×D}.
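To make the shapes in Eq. (1) concrete, the following minimal sketch shows the tensor layout of the learning problem. The trivial forecaster `f` (repeat the last observation) and all variable names are our own illustration, not part of any model in this paper:

```python
import torch

N, D = 12, 1       # number of dimensions (graph nodes) and features per node
H, T = 96, 24      # history length and forecast horizon

# H past graph signals, one (N, D) feature matrix per time step
history = torch.randn(H, N, D)

def f(history: torch.Tensor) -> torch.Tensor:
    """Placeholder forecaster: repeat the last observation T times."""
    return history[-1].repeat(T, 1, 1)

forecast = f(history)
assert forecast.shape == (T, N, D)
```

Any model discussed below (VAR, LSTM, Informer, or our adaptive-GCN variant) is one particular choice of `f` with this same input/output contract.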
II-B Graph Convolution Layer
At present, the graph convolutional network is a very important model that can extract node features from given graph structure information. Graph convolutional networks come in two flavors, spectral and spatial. Kipf et al. [20] proposed a first-order approximation of the Chebyshev spectral filter [21]. From the spatial point of view, the operation smooths a node's signal by gathering and transforming the information of its neighborhood. The advantage of this method is that multidimensional data can be modeled, and the prediction quality improved, through these aggregation and transfer mechanisms.
Let Ã = A + I_N denote the adjacency matrix with self-loops. By adding the identity matrix, the graph convolution network retains each node's own features during aggregation. Let X ∈ R^{N×D} denote the input signal, Z ∈ R^{N×M} the output, and W ∈ R^{D×M} the model parameter matrix. In [20], the graph convolution layer is defined as
Z = D̃^{−1/2} Ã D̃^{−1/2} X W, (2)
where D̃ is the degree matrix of Ã.
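A minimal PyTorch sketch of the layer in Eq. (2), under the definitions above (add self-loops, symmetrically normalize, then apply the weight matrix); the function name is ours:

```python
import torch

def gcn_layer(A: torch.Tensor, X: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Z = D̃^{-1/2} Ã D̃^{-1/2} X W with Ã = A + I (Kipf & Welling)."""
    A_tilde = A + torch.eye(A.size(0))           # Ã: add self-loops
    d = A_tilde.sum(dim=1)                       # node degrees of Ã
    D_inv_sqrt = torch.diag(d.pow(-0.5))         # D̃^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # normalized adjacency
    return A_hat @ X @ W

# two connected nodes, 4 input features, 8 output features
A = torch.tensor([[0., 1.], [1., 0.]])
Z = gcn_layer(A, torch.randn(2, 4), torch.randn(4, 8))
assert Z.shape == (2, 8)
```

The self-loops guarantee every degree is at least one, so the normalization never divides by zero.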
[22] proposed a diffusion convolution layer and integrated it into a recurrent neural network to make the model more effective at predicting spatiotemporal data. We summarize the diffusion convolution layer as
Z = Σ_{k=0}^{K} P^k X W_k, (3)
where P is the transition matrix of the diffusion process and K is the number of diffusion steps.
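The sum in Eq. (3) can be evaluated step by step, reusing the running power P^k instead of recomputing it. A short sketch (our own helper, with K implied by the number of weight matrices):

```python
import torch

def diffusion_conv(P: torch.Tensor, X: torch.Tensor,
                   Ws: list) -> torch.Tensor:
    """Z = sum_{k=0}^{K} P^k X W_k, with K = len(Ws) - 1."""
    Z = torch.zeros(X.size(0), Ws[0].size(1))
    P_k = torch.eye(P.size(0))        # P^0 = I
    for W in Ws:
        Z = Z + P_k @ X @ W           # add the k-th diffusion term
        P_k = P_k @ P                 # advance to P^{k+1}
    return Z
```

With K = 0 and W_0 = I the layer reduces to the identity, which is a convenient sanity check; in [22], P is the row-normalized adjacency matrix.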
II-C Self-adaptive Adjacency Matrix
Wu et al. [19] proposed the Graph WaveNet model for spatiotemporal data prediction, in which they introduced an adaptive adjacency matrix into the graph convolution network so that the model does not need an adjacency matrix based on spatial relations. The model can also incorporate a distance-based adjacency matrix: the adaptive and distance-based adjacency matrices are operated on separately and the results are summed. In multidimensional prediction problems there are dependencies between the data of different dimensions, just as there are dependencies between different locations in spatiotemporal data. Therefore, since the adaptive adjacency matrix can extract the dependencies between spatial positions, it can also extract the dependencies between dimensions. In air quality forecasts, for example, visibility should be strongly correlated with wind speed, since fog is only present when wind speeds are low; the entry of the adjacency matrix where these two dimensions intersect should then be larger than those of unrelated dimensions. In spatiotemporal data we can define node dependence in the adjacency matrix according to distance, but in the multidimensional prediction problem it is difficult to quantify the relationship between dimensions, so we use the adaptive adjacency matrix and let the model learn this dependence by itself. The adaptive adjacency matrix is constructed by multiplying two learnable node embedding matrices E1, E2 ∈ R^{N×c}:
Ã_adp = SoftMax(ReLU(E1 E2^T)). (4)
Combining the adaptive adjacency matrix with the graph convolution of Eq. (3), the layer can be defined as
Z = Σ_{k=0}^{K} (P^k X W_{k1} + Ã_adp^k X W_{k2}). (5)
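A compact PyTorch module sketching Eqs. (4) and (5) with only the adaptive branch (the distance-based term P^k X W_{k1} is dropped, since in our setting no spatial adjacency is available); the class and parameter names are our own illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGCN(nn.Module):
    """Graph convolution with a learned adjacency, after Graph WaveNet.

    The node embeddings E1, E2 are free parameters; the adjacency of
    Eq. (4) is softmax(relu(E1 @ E2.T)), so the model discovers the
    dependencies between dimensions on its own during training.
    """
    def __init__(self, num_nodes: int, emb_dim: int,
                 in_dim: int, out_dim: int, K: int = 2):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(num_nodes, emb_dim))
        self.E2 = nn.Parameter(torch.randn(num_nodes, emb_dim))
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(in_dim, out_dim)) for _ in range(K + 1)])

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        A_adp = F.softmax(F.relu(self.E1 @ self.E2.t()), dim=1)  # Eq. (4)
        Z = torch.zeros(X.size(0), self.weights[0].size(1))
        A_k = torch.eye(A_adp.size(0))        # Ã_adp^0 = I
        for W in self.weights:                # adaptive terms of Eq. (5)
            Z = Z + A_k @ X @ W
            A_k = A_k @ A_adp
        return Z
```

Because E1 and E2 are ordinary parameters, the adjacency is updated by the same gradient descent that trains the rest of the network.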
II-D Temporal Convolution Layer
The adaptive GCN framework can be integrated into almost all multidimensional time series prediction models; here, one such model is taken as an example. We adopt the Transformer-based Informer model [15] to capture the temporal trends in each dimension. Informer improves on the Transformer by reducing time and space complexity through the ProbSparse self-attention mechanism, and enables the model to receive long input sequences through the self-attention distilling operation. The model also generates the entire output sequence in one forward pass, avoiding cumulative error. Compared with RNN-based approaches, Informer processes sequence data non-recursively, which alleviates the gradient explosion problem. Informer prevents the model from seeing the predicted targets by setting the last few positions of the decoder's input sequence to zero.
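The decoder-input construction just described can be sketched as follows: the last `label_len` observed steps act as a start token, and the `pred_len` future positions are zero placeholders. This is a simplified illustration of the idea (the real Informer also concatenates time-stamp embeddings); the helper name is ours:

```python
import torch

def informer_decoder_input(history: torch.Tensor, label_len: int,
                           pred_len: int) -> torch.Tensor:
    """Build the decoder input: known start tokens + zero placeholders.

    history: (L, D) past sequence. The last label_len steps are reused
    as the decoder's start token; the pred_len future positions are
    masked with zeros so the model cannot peek at the targets.
    """
    start_token = history[-label_len:]
    placeholder = torch.zeros(pred_len, history.size(1))
    return torch.cat([start_token, placeholder], dim=0)

dec_in = informer_decoder_input(torch.randn(96, 7), label_len=48, pred_len=24)
assert dec_in.shape == (72, 7)
```

Because the whole (label_len + pred_len) block is fed to the decoder at once, the forecast is produced in a single pass rather than step by step.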
III Experiments
Datasets
We conducted experiments on four data sets, two of which are public data sets and the other two real-world data sets.
Weather (acquired at https://www.ncei.noaa.gov/data/localclimatologicaldata/): This data set includes local climatological data from nearly 1,600 U.S. sites, collected at hourly intervals from 2010 to 2013. Each record has 12 dimensions, of which 11 are input features and the remaining one is the prediction target.
ECL (Electricity Consuming Load) (acquired at https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014): This data set records the electricity consumption of 321 customers [23]. We set the feature MT320 as the target value.
ETT (Electricity Transformer Temperature) (posted on GitHub at https://github.com/zhouhaoyi/ETDataset): The data were collected over two years from two different counties in China. Data sets with intervals of 1 hour and 15 minutes were created, respectively. Each record contains six power load features and a target value, "oil temperature".
Experimental Details
Baselines: Since we mainly improve the Informer model's handling of multidimensional data, we generally compare each original model with the same model incorporating the adaptive GCN. We choose the most advanced time series prediction models, including Autoformer [24], Informer [15], Transformer [25], Reformer [16] and SCINet [26].
Hyperparameter tuning: Taking Informer as an example, the encoder contains a 3-layer stack and a 1-layer stack, and the decoder contains 2 layers. For the optimizer we chose Adam, and the learning rate is halved after each epoch. The number of training epochs was set to 6 and the patience to 3, meaning that training stops if the loss fails to decrease for three consecutive epochs. The training batch size is 32. In the experiments we generally do not change the parameters of the original model, and only compare its results with those obtained after adding the adaptive GCN, so as to avoid interference from other factors.
Setup: The input of each data set is zero-mean normalized. To verify the effectiveness of GCN-Informer on long time series prediction, we keep the prediction lengths consistent with Informer's experiments: 24, 48, 168 and 336. Other models use even longer prediction lengths; for example, the minimum prediction length in the Autoformer article is 96, and our prediction lengths in that comparison are consistent with it. A prediction length of 720 could not be tested due to graphics card memory limits. Metrics: We use the MSE and MAE criteria to evaluate the models, computed as MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² and MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|. Platform: All experiments were run on a GTX 1660 Super 6 GB GPU.
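The two evaluation criteria are straightforward to compute; a minimal sketch (function names are ours):

```python
import torch

def mse(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
    """Mean squared error: (1/n) * sum((y - y_hat)^2)."""
    return torch.mean((y_true - y_pred) ** 2).item()

def mae(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
    """Mean absolute error: (1/n) * sum(|y - y_hat|)."""
    return torch.mean(torch.abs(y_true - y_pred)).item()

# e.g. targets [0, 0] vs predictions [1, 2]: MSE = 2.5, MAE = 1.5
```

Both are averaged over all predicted steps and dimensions, so lower values are better on every data set.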
dataset | WTH | ECL | ETTh1 | ETTm1
Model | Informer | AdpInformer | Informer | AdpInformer | Informer | AdpInformer | Informer | AdpInformer
Metric | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE
24 | 0.326 0.380 | 0.311 0.363 | 0.239 0.347 | 0.237 0.338 | 0.604 0.573 | 0.599 0.570 | 0.344 0.393 | 0.358 0.406
48 | 0.414 0.441 | 0.389 0.435 | 0.265 0.363 | 0.253 0.350 | 0.692 0.630 | 0.650 0.597 | 0.535 0.505 | 0.513 0.500
168 | 0.581 0.544 | 0.548 0.532 | 0.285 0.377 | 0.314 0.390 | 0.930 0.751 | 0.933 0.750 | 0.644 0.583 | 0.637 0.570
336 | 0.630 0.590 | 0.572 0.561 | 0.309 0.397 | 0.305 0.381 | 1.120 0.847 | 1.108 0.845 | 0.931 0.734 | 0.884 0.706
dataset | WTH | ECL | ETTh1 | ETTm1
Model | Autoformer | AdpAutoformer | Autoformer | AdpAutoformer | Autoformer | AdpAutoformer | Autoformer | AdpAutoformer
Metric | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE
96 | 0.590 0.552 | 0.492 0.503 | 0.201 0.316 | 0.217 0.307 | 0.448 0.452 | 0.441 0.455 | 0.542 0.492 | 0.422 0.438
192 | 0.612 0.567 | 0.552 0.543 | 0.217 0.329 | 0.209 0.321 | 0.501 0.485 | 0.462 0.465 | 0.603 0.517 | 0.490 0.483
336 | 0.636 0.580 | 0.578 0.558 | 0.319 0.413 | 0.215 0.328 | 0.515 0.493 | 0.495 0.484 | 0.609 0.524 | 0.576 0.523
720 | 0.664 0.598 | 0.615 0.583 | out of memory | out of memory | 0.552 0.536 | 0.613 0.555 | out of memory | out of memory
Results and Analysis: Table I shows the comparison between the original Informer model and the improved GCN-Informer model on multidimensional long time series prediction. We reproduced the prediction lengths of the Informer paper's experiments as closely as possible: 24, 48, 168 and 336. Due to insufficient GPU memory, a prediction length of 720 was not tested. Both MSE and MAE are means over 6 repetitions, as are the reported times. The best results are in bold. Taking the WTH data set as an example, our model showed an 11% improvement in MSE and a 9.7% improvement in MAE at a prediction length of 24; at a prediction length of 336, MSE improved by 10.36% and MAE by 6.0%. Thus, even as the prediction length grows, our model retains a clear advantage. Since we add a stacked GCN module to Informer, training is about 2 minutes slower than the plain Informer model; this is because the adaptive graph convolution network requires additional matrix operations as well as the iterative update of the adjacency matrix. In the Autoformer model, currently the most accurate of these models, adding the adaptive GCN also improved accuracy.
IV Conclusions
In this paper, we proposed a new framework named ADPGCN. Our model adaptively captures the hidden dependencies between different dimensions and updates the adjacency matrix with these dependencies, so that the GCN module can aggregate them and ultimately improve accuracy. In addition, the coupling between the added ADPGCN and the Informer model is very low, so the framework can also be added to other multidimensional time series prediction models; we demonstrated this by incorporating it into other models. In future work, we will explore whether the ADPGCN framework can be used to learn other types of dependencies.
Acknowledgment
References
 [1] S. Papadimitriou and P. Yu, “Optimal multiscale patterns in time series streams,” in Proceedings of the 2006 ACM SIGMOD international conference on Management of data, 2006, pp. 647–658.
 [2] Y. Zhu and D. Shasha, “Statstream: Statistical monitoring of thousands of data streams in real time,” in VLDB’02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 2002, pp. 358–369.
 [3] Y. Matsubara, Y. Sakurai, W. G. Van Panhuis, and C. Faloutsos, “Funnel: automatic mining of spatially coevolving epidemics,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 105–114.
 [4] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 [5] J. D. Hamilton, Time series analysis. Princeton university press, 2020.
 [6] H. Lütkepohl, New introduction to multiple time series analysis. Springer Science & Business Media, 2005.
 [7] H. Qiu, S. Xu, F. Han, H. Liu, and B. Caffo, “Robust estimation of transition matrices in high dimensional heavy-tailed vector autoregressive processes,” in International Conference on Machine Learning. PMLR, 2015, pp. 1843–1851.
 [8] I. Melnyk and A. Banerjee, “Estimating structured vector autoregressive models,” in International Conference on Machine Learning. PMLR, 2016, pp. 830–839.
 [9] H.-F. Yu, N. Rao, and I. S. Dhillon, “Temporal regularized matrix factorization for high-dimensional time series prediction,” Advances in Neural Information Processing Systems, vol. 29, 2016.
 [10] J. Li and W. Chen, “Forecasting macroeconomic time series: Lasso-based approaches and their forecast combinations with dynamic factor models,” International Journal of Forecasting, vol. 30, no. 4, pp. 996–1015, 2014.
 [11] S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain, “Gaussian processes for timeseries modelling,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 371, no. 1984, p. 20110550, 2013.
 [12] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen, “Bayesian inference and learning in Gaussian process state-space models with particle MCMC,” Advances in Neural Information Processing Systems, vol. 26, 2013.
 [13] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, “Modeling long- and short-term temporal patterns with deep neural networks,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 95–104.
 [14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
 [15] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of AAAI, 2021.
 [16] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient transformer,” arXiv preprint arXiv:2001.04451, 2020.
 [17] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
 [18] J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Compressive transformers for longrange sequence modelling,” arXiv preprint arXiv:1911.05507, 2019.
 [19] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang, “Graph WaveNet for deep spatial-temporal graph modeling,” arXiv preprint arXiv:1906.00121, 2019.
 [20] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [21] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” Advances in neural information processing systems, vol. 29, 2016.
 [22] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” arXiv preprint arXiv:1707.01926, 2017.
 [23] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.X. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting,” Advances in Neural Information Processing Systems, vol. 32, 2019.
 [24] J. Xu, J. Wang, M. Long et al., “Autoformer: Decomposition transformers with autocorrelation for longterm series forecasting,” Advances in Neural Information Processing Systems, vol. 34, 2021.
 [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
 [26] M. Liu, A. Zeng, Z. Xu, Q. Lai, and Q. Xu, “Time series is a special sequence: Forecasting with sample convolution and interaction,” arXiv preprint arXiv:2106.09305, 2021.