I Introduction
Accurate and timely traffic flow prediction is essential for traffic management and allows travelers to make better-informed travel decisions. In some applications, traffic flow must be predicted not only accurately but also several steps ahead. For example, to manage a congested road and develop contingency plans, traffic dispatch may need to estimate traffic conditions at least 30 minutes in advance. However, most traffic flow prediction approaches were developed as single-step prediction methods. The multi-step prediction problem is significantly more difficult than its single-step variant and is known to suffer from predictions that degrade the farther they extend into the future. Therefore, it is essential to develop approaches that achieve accurate multi-step traffic flow prediction.
Multi-step time series prediction tasks are defined as tasks of predicting the next $H$ values $[y_{t+1}, \dots, y_{t+H}]$ given a historical time series $[y_{t-N+1}, \dots, y_t]$, where $N$ and $H$ denote the past and future horizons. Generally, there are three main strategies for multi-step time series prediction: recursive, direct, and multi-output [1]. In the recursive strategy, a one-step model is first trained to fit the following function:
$$y_{t+1} = f(y_{t-N+1}, \dots, y_t) \qquad (1)$$
The learned model, $\hat{f}$, predicts a multi-step time-series trajectory by repeatedly passing its prediction at one timestep as input to the next timestep. In the simple case where both the history and the predictions are length-1 scalars, given $y_t$, the model predicts $\hat{y}_{t+1}$ as $\hat{f}(y_t)$, $\hat{y}_{t+2}$ as $\hat{f}(\hat{y}_{t+1})$, and so on. (This simple case is used throughout this paper to present recursive prediction, but the ideas discussed apply to the general case.) Due to accumulating errors and a shifting input distribution, model predictions farther in the future increasingly drift from the ground-truth trajectories [2]. Moreover, there is a mismatch between what the model is optimized for, i.e., single-step error, and what it is used for, i.e., multi-step prediction, which gives rise to optimistic error estimates during training [3]. These weaknesses are present whenever the true single-step model is not identified during training [4], which is almost always the case in nonlinear problems.

Recent work showed that a learned model can be tuned to learn corrections to the drift patterns seen in the training data [2]. This is an iterative training process in which the training set is repeatedly augmented with additional data points of the form $(\hat{y}_t, y_{t+1})$, where $\hat{y}_t$ represents a prediction of the model when applied to the training set. When the model is applied recursively to the training set points, it generates prediction trajectories of some length. The intuition is that by augmenting the training set with samples from these predicted trajectories, coupled with the next-step ground-truth values, the model can learn to correct the drift patterns in its predicted trajectories.
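The recursive rollout and the augmentation idea described above can be sketched as follows. This is a minimal illustration for the length-1 scalar case, not the implementation used in [2] or in this paper: the function names `recursive_forecast` and `dad_pairs` and the toy one-step model are assumptions introduced here.

```python
from typing import Callable, List, Tuple


def recursive_forecast(model: Callable[[float], float],
                       y_t: float, horizon: int) -> List[float]:
    """Roll a one-step model forward, feeding each prediction back as input."""
    predictions = []
    current = y_t
    for _ in range(horizon):
        current = model(current)
        predictions.append(current)
    return predictions


def dad_pairs(model: Callable[[float], float],
              series: List[float], horizon: int) -> List[Tuple[float, float]]:
    """Pair each recycled prediction with the next-step ground truth."""
    pairs = []
    for t in range(len(series) - 1):
        pred = series[t]
        for k in range(1, horizon + 1):
            if t + k + 1 >= len(series):
                break  # no ground truth left to pair with
            pred = model(pred)  # estimate of series[t + k]
            pairs.append((pred, series[t + k + 1]))
    return pairs


# Hypothetical one-step model, used only to exercise the functions.
toy_model = lambda y: 0.5 * y + 1.0
trajectory = recursive_forecast(toy_model, 8.0, 3)
extra_training_pairs = dad_pairs(toy_model, [8.0, 5.0, 3.0, 2.0], 2)
```

Each pair returned by `dad_pairs` has a model prediction as its input and a true observation as its target, so a model refit on these pairs sees the kind of drifted inputs it will meet at test time.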
Alternatively, one can do without the recursive process by learning a separate model to directly predict each timestep in the future, i.e., the direct strategy, which is given as
$$y_{t+h} = f_h(y_{t-N+1}, \dots, y_t), \quad h \in \{1, \dots, H\} \qquad (2)$$
where $H$ is the prediction horizon. In this strategy, multi-step predictions are obtained by concatenating the $H$ predictions. Unlike the recursive strategy, the direct strategy does not suffer from accumulating errors, since it does not use any predicted values for subsequent predictions. However, this strategy has two major weaknesses. First, since each model is learned independently, dependencies between two distant horizons are not modeled. Second, this strategy requires large computational resources, i.e., time and space, since the number of models grows with the size of the prediction horizon.
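As a rough sketch of the direct strategy, the code below builds one training set per future step and concatenates the outputs of the per-step models. The helper names `make_direct_datasets` and `direct_forecast` are hypothetical, and the per-step models are assumed to be plain callables that map a history window to a single value.

```python
from typing import Callable, Dict, List, Sequence, Tuple

History = List[float]


def make_direct_datasets(series: List[float], history_len: int,
                         horizon: int) -> Dict[int, List[Tuple[History, float]]]:
    """Return, for each step h, the (history window, value h steps ahead) samples."""
    datasets: Dict[int, List[Tuple[History, float]]] = {
        h: [] for h in range(1, horizon + 1)
    }
    last_start = len(series) - history_len - horizon
    for start in range(last_start + 1):
        history = series[start:start + history_len]
        for h in range(1, horizon + 1):
            target = series[start + history_len + h - 1]
            datasets[h].append((history, target))
    return datasets


def direct_forecast(models: Sequence[Callable[[History], float]],
                    history: History) -> List[float]:
    """Each model sees only observed history; outputs are concatenated."""
    return [model(history) for model in models]


datasets = make_direct_datasets([1.0, 2.0, 3.0, 4.0, 5.0], history_len=2, horizon=2)
```

Because every model receives the same observed window, no prediction is ever fed back as input, which is why errors do not accumulate, and also why one model has to be trained and stored per future step.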
The third strategy is the multi-output strategy. This strategy is defined as the problem of finding a model that predicts the future $[y_{t+1}, \dots, y_{t+H}]$ given the historical data $[y_{t-N+1}, \dots, y_t]$. The strategy requires a model that is able to produce multi-step predictions simultaneously, as depicted in Figure 1. This way, each prediction uses the actual observations rather than the predicted ones. Therefore, accumulated errors are not of concern in this strategy. Moreover, this strategy can learn the dependency between inputs and outputs as well as among outputs. Hence, the strategy involves more complex models than the recursive one does, which directly translates to a slower training process and a need for more training data to avoid overfitting. While in some cases the direct and multi-output strategies can avoid some of the pitfalls of the recursive strategy, these models still suffer from degrading performance at the farther timesteps. Intuitively, there is a higher uncertainty associated with the farther future that makes it more difficult to forecast. Moreover, direct models can suffer from higher variance [5]. Researchers have attempted to analyze theoretically and empirically the differences between recursive and direct/multi-output approaches and to understand which would be more appropriate for a given problem [6], but the results of this effort so far have been inconclusive. In practice, all approaches continue to suffer from increasingly drifting predictions for farther timesteps.

Recently, an approach to counter the drifts in trajectories of multi-step predictions, called Data as Demonstrator (DaD), was proposed in [2], specifically in the context of recursive prediction models. The underlying idea is to use the drift patterns seen when a trained model is applied to the training data to tune that model so that it can compensate for these drifts. Another way to look at it is as a data augmentation technique that alleviates the mismatch between training and testing distributions. Inspired by this, two approaches to enhance multi-step prediction accuracy are introduced in this paper. In the context of recursive models, a timestep-augmented model that implicitly learns to associate a different corrective action with different future timesteps is proposed. The model is an extension of the approach proposed in [2]. It is also related to the Rectify method proposed in [7], where a direct model is trained to correct the predictions of a recursive model at each timestep in the prediction trajectory. In the second approach, a data augmentation method that enhances multi-step prediction accuracy in multi-output models is proposed. Here, a conditional generative adversarial network (CGAN) is used to learn a generator model that can mimic the historical patterns corresponding to the future patterns seen in the training data. Subsequently, this model is used to generate new history-future pairs that are aggregated with the original training data.
The main contributions in this work are summarized as follows:

An extension to the algorithm presented in [2], in which the model input is augmented with information about the current prediction timestep. This extension is called Conditional-DaD (CDaD).

A novel approach that uses a CGAN to generate new history-future pairs of data, which are aggregated with the original training data.

Comprehensive traffic flow prediction experiments involving recursive, direct, and multi-output strategies. For the recursive strategy, the vanilla approach, DaD, and CDaD are evaluated. The vanilla direct strategy and a modification of it that borrows from the recursive strategy, namely the Hybrid method, are also presented. Finally, the proposed CGAN method is compared with a noise-augmented training strategy as well as the vanilla multi-output strategy.
The rest of this paper is structured as follows. Section II introduces the two proposed approaches. Section III describes the experimental setup and the data sets used to evaluate the proposed models. In Section IV, the experimental results are presented and discussed. Finally, the paper is concluded with some observations in Section V.
II Methodology
In this section, two methods to improve multi-step time-series prediction are introduced: Conditional-DaD (CDaD), an approach to improving recursive multi-step prediction, and a conditional-GAN-based data augmentation approach to improving multi-output multi-step prediction.
II-A Conditional-DaD
One weakness of the approach presented in [2], and of recursive prediction generally, is that it does not take into consideration the number of steps predicted by the model so far. The amount of correction the model needs to add differs from one timestep to another along a multi-step prediction trajectory, since the deviation from the ground truth is less acute in early steps. In particular, the required correction is affected by the number of timesteps in which the model has used its own prediction as input to the next timestep. Therefore, the model stands to benefit from having information about the current timestep along the prediction trajectory.
Based on this, an extension to DaD called CDaD is proposed, in which the input is augmented with a representation of the current timestep. For a length-1 scalar history, $y_t$, the single-step prediction model is modified to accept an augmented vector, $[y_t, \phi(k)]$, where $k$ is the prediction timestep and $\phi(k)$ is a representation of $k$. In the presentation and experiments, a single fixed representation is used. An illustration of this is shown in Fig. 2.

A CDaD model learns a single mapping that is a function of the number of predicted values that have been recycled as input (and also a function of the time-series history). This arrangement allows the model to output different predictions for the same input, depending on the current timestep along the prediction trajectory, and hence allows the model to learn different corrections for different timesteps. This setup differs from the parameterized recursive prediction approach, where a different set of parameters is learned for each timestep in the future.
Training a CDaD model follows a process similar to the meta-algorithm proposed in [2], the difference being the addition of the timestep representation. Algorithm 1 describes this process. In short, the timestep-augmented training data is first generated by forward-passing the original training data through a base model. Next, a CDaD model is trained iteratively, and a new augmented training data set is generated every epoch by passing the original data through the previous CDaD model. Finally, the model that performs best on the validation data set among all trained models is selected.
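The training loop described above can be sketched as follows for the length-1 scalar case. This is an illustrative outline under stated assumptions, not a reproduction of Algorithm 1: `cdad_pairs`, `train_cdad`, `fit`, and `val_error` are hypothetical names, the timestep is passed to the model as a raw integer `k` counting the predictions recycled so far, the base model is assumed to accept (and possibly ignore) `k`, and each iteration regenerates the augmented set rather than accumulating earlier ones.

```python
from typing import Callable, List, Tuple

Model = Callable[[float, int], float]      # (input value, timestep k) -> next value
Sample = Tuple[Tuple[float, int], float]   # ((input value, k), target)


def cdad_pairs(model: Model, series: List[float], horizon: int) -> List[Sample]:
    """Build timestep-augmented samples; k counts the predictions recycled so far."""
    samples = []
    for t in range(len(series) - 1):
        samples.append(((series[t], 0), series[t + 1]))  # ground-truth input, k = 0
        pred = series[t]
        for k in range(1, horizon):
            if t + k + 1 >= len(series):
                break  # no ground truth left to pair with
            pred = model(pred, k - 1)  # estimate of series[t + k]
            samples.append(((pred, k), series[t + k + 1]))
    return samples


def train_cdad(fit: Callable[[List[Sample]], Model], base_model: Model,
               train_series: List[float], val_series: List[float],
               horizon: int, n_iters: int,
               val_error: Callable[[Model, List[float], int], float]) -> Model:
    """Refit on regenerated samples each iteration; keep the best model on validation."""
    model = base_model
    best_model = base_model
    best_error = val_error(base_model, val_series, horizon)
    for _ in range(n_iters):
        samples = cdad_pairs(model, train_series, horizon)
        model = fit(samples)
        error = val_error(model, val_series, horizon)
        if error < best_error:
            best_model, best_error = model, error
    return best_model
```

Here `fit` stands for whatever routine trains the predictor on `((value, k), target)` samples, and `val_error` for a multi-step error measured on held-out data; both are left to the caller.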
II-B Conditional-GAN Data Augmentation
In some applications, recursive models do not perform well compared to other approaches such as multi-output models [5]. Nonetheless, the performance of multi-output models also degrades as the prediction timestep increases. In this section, a CGAN-based data augmentation approach to improve multi-output multi-step time series prediction is introduced.
The multi-output strategy requires a model that is able to produce multi-step predictions simultaneously. This way, each prediction uses the actual observations rather than the predicted ones. Therefore, accumulated errors are not of concern in this strategy. Moreover, this strategy can learn the dependency between inputs and outputs as well as among outputs. Hence, the strategy involves more complex models than the recursive one does, which directly translates to a slower training process and a need for more training data to avoid overfitting.
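For concreteness, the history-future pairs consumed by a multi-output model can be produced with a sliding window, as in the sketch below. The helper name `make_multioutput_pairs` and its parameters are assumptions made for this example.

```python
from typing import List, Tuple


def make_multioutput_pairs(series: List[float], history_len: int,
                           horizon: int) -> List[Tuple[List[float], List[float]]]:
    """Slide a window over the series and split it into (history, future) pairs."""
    pairs = []
    for start in range(len(series) - history_len - horizon + 1):
        split = start + history_len
        pairs.append((series[start:split], series[split:split + horizon]))
    return pairs


pairs = make_multioutput_pairs([1.0, 2.0, 3.0, 4.0, 5.0], history_len=2, horizon=2)
```

Each target is a full vector of future values, so a single model is trained to emit the whole horizon at once from observed inputs only.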
Similar to the recursive strategy, data augmentation can be applied to improve multi-output multi-step time series prediction. One simple way to augment the data is to contaminate the features with noise and pair them with the actual labels. This method can improve multi-step prediction performance if the noise is carefully chosen. A poor choice of noise, however, may significantly degrade prediction performance. In this work, an alternative is proposed: a Generative Adversarial Network (GAN) [8] is used to augment the data by learning a distribution over the input conditioned on the output.
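The noise-based baseline mentioned above can be sketched as below: zero-mean Gaussian noise is added to the history while the future labels are left untouched. The function name `noise_augment`, the `copies` and `seed` parameters, and the choice to keep the original pairs at the front of the result are assumptions of this example.

```python
import random
from typing import List, Tuple

Pair = Tuple[List[float], List[float]]  # (history, future)


def noise_augment(pairs: List[Pair], variance: float,
                  copies: int = 1, seed: int = 0) -> List[Pair]:
    """Append noisy copies of each history, paired with the unchanged future."""
    rng = random.Random(seed)
    std = variance ** 0.5
    augmented = list(pairs)  # original samples are kept as they are
    for _ in range(copies):
        for history, future in pairs:
            noisy_history = [x + rng.gauss(0.0, std) for x in history]
            augmented.append((noisy_history, future))
    return augmented
```

The `variance` argument is the knob the text warns about: too little noise adds nothing new, while too much can swamp the signal in normalized data.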
GAN is a framework for estimating a distribution in an adversarial manner. It simultaneously trains two models, namely a generative model $G$ and a discriminative model $D$. The discriminative model is trained to maximize the probability of assigning the correct labels to both samples from the training data and samples from the generative model. Simultaneously, the generative model is trained to minimize $\log(1 - D(G(z)))$, where $z$ is a random sample from an input noise distribution $p_z(z)$. $D$ and $G$ thus play a two-player minimax game with the following value function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (3)$$
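To make the value function concrete, the sketch below estimates it from a batch of discriminator outputs, following the standard GAN objective of [8]. The function name `gan_value` and the convention that the discriminator outputs the probability of a sample being real are assumptions of this example.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from generated data outputs 0.5 everywhere.
value_at_chance = gan_value([0.5, 0.5], [0.5, 0.5])
```

When the discriminator outputs 0.5 for every sample, the estimate equals $-\log 4$, which is the value of the game at the equilibrium described in [8].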
An extension of GAN, namely conditional generative adversarial nets (CGAN), is proposed in [9]. In this extension, both the generative and the discriminative model are conditioned on some extra information, which can be any kind of auxiliary information such as class labels. CGAN has been applied to discrete labels [10], text [11], and images [12]. In this work, a CGAN is trained to generate inputs (historical data) given the actual labels (future data). In both the generative and discriminative models, the actual labels are concatenated with noise. This idea is illustrated in Figure 3, which shows the label (future data) and the generated inputs (past data). The pairs of generated inputs and actual labels are then added to the original training data for multi-output time series prediction.
Using this method, an arbitrarily large amount of data can be generated to enhance predictor performance. In addition, the generated data require no special treatment in the training process: training proceeds as standard multi-output training without any iterative procedure.
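Once a conditional generator has been trained, the augmentation step itself is simple, as the following sketch suggests. Training the CGAN is not shown; `generator` is a hypothetical callable standing in for the trained generative model, taking a noise vector and a future window and returning a synthetic history. The function name `cgan_augment`, the standard-normal noise, and the uniform sampling of futures from the training set are assumptions of this example.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[List[float], List[float]]  # (history, future)
Generator = Callable[[List[float], List[float]], List[float]]  # (noise, future) -> history


def cgan_augment(pairs: List[Pair], generator: Generator,
                 noise_dim: int, n_new: int, seed: int = 0) -> List[Pair]:
    """Append generated (history, actual future) pairs to the original training pairs."""
    rng = random.Random(seed)
    augmented = list(pairs)
    for _ in range(n_new):
        _, future = rng.choice(pairs)  # condition on a future seen in training
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        augmented.append((generator(noise, future), future))
    return augmented
```

The returned list can be passed to the same multi-output training routine as the original data, which is what makes this approach free of iterative retraining.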
III Datasets and Experimental Settings
To test the performance of the proposed methods, we conduct experiments using a traffic flow data set. The data set was downloaded from the Caltrans Performance Measurement System (PeMS) [13]. The original traffic flow was sampled every 30 seconds. These data were aggregated into 5-min intervals by PeMS. The Highway Capacity Manual 2010 [14] recommends aggregating the data further into 15-min intervals. We collected the traffic flow data of a freeway from January 1, 2011 to December 31, 2012. We use the data from January 1, 2011 to August 31, 2011 for training, the data from September 1, 2011 to December 31, 2011 for validation, and the rest for testing.
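The preprocessing described above might look like the sketch below. It assumes that flow is a vehicle count, so three consecutive 5-min values are summed to obtain a 15-min value, and that each record is a `(date, value)` tuple; the function names and the record layout are hypothetical, while the two cut-off dates are the ones stated in the text.

```python
from datetime import date
from typing import List, Tuple

Record = Tuple[date, float]


def aggregate_flow(flow_5min: List[float], factor: int = 3) -> List[float]:
    """Sum consecutive groups of `factor` samples; a trailing partial group is dropped."""
    usable = len(flow_5min) - len(flow_5min) % factor
    return [sum(flow_5min[i:i + factor]) for i in range(0, usable, factor)]


def chronological_split(records: List[Record], train_end: date,
                        val_end: date) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split records by date so that validation and test data follow the training data."""
    train = [r for r in records if r[0] <= train_end]
    val = [r for r in records if train_end < r[0] <= val_end]
    test = [r for r in records if r[0] > val_end]
    return train, val, test


TRAIN_END = date(2011, 8, 31)   # last training day
VAL_END = date(2011, 12, 31)    # last validation day
```

Splitting by date rather than at random keeps future observations out of the training set, which matters for any forecasting evaluation.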
Three sets of experiments are conducted using the data set. In the first set, the vanilla recursive strategy, DaD, and CDaD are applied to the data set. The main goal of these experiments is to investigate which strategy performs better in the recursive setting. Next, the vanilla direct strategy and the Hybrid approach are applied to the data set. Subsequently, the proposed CGAN data augmentation is applied and compared with the vanilla multi-output strategy and the noise data augmentation model. The number of time steps for the multi-step prediction is chosen to be 8. Furthermore, the performances of the best models from each strategy are compared and analyzed. The performance in each experiment is evaluated using the Mean Squared Error (MSE) and the Mean Absolute Error (MAE). To quantify the gains of the proposed methods, the percentage improvements of the errors with respect to the baselines are computed.
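The evaluation metrics can be written out as below. MSE and MAE follow their usual definitions; the percentage improvement is assumed here to be the relative error reduction with respect to the baseline, which the text does not spell out, and the helper names are hypothetical.

```python
from typing import Callable, List, Sequence


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def percent_improvement(baseline_error: float, new_error: float) -> float:
    """Relative error reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


def per_step_errors(actual: List[List[float]], predicted: List[List[float]],
                    metric: Callable[[Sequence[float], Sequence[float]], float]) -> List[float]:
    """Apply a metric separately at each step of the prediction horizon."""
    horizon = len(actual[0])
    return [
        metric([a[h] for a in actual], [p[h] for p in predicted])
        for h in range(horizon)
    ]
```

`per_step_errors` gives one error value per future step, which is the kind of breakdown needed to see how accuracy changes along the horizon.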
IV Results and Analysis
In the first set of experiments, three recursive approaches are applied to the data set. Each approach uses the same base predictor, a deep neural network (DNN). For a fair comparison, the DNN configuration, in terms of the number of hidden layers, the number of hidden units, the activation function, and the training tricks used, is kept identical across all approaches. The main difference is that in the CDaD approach the input is augmented with the timestep information, which means there are extra weights associated with this input.
The base DNN is configured to have 2 hidden layers of 150 hidden units each, with ReLU as the activation function. Since this is a regression problem, a linear activation is used in the output layer with MSE as the loss function. Furthermore, the data are min-max normalized between 0 and 1. Moreover, to keep the model from overfitting too easily, dropout with a rate of 0.1 is applied to each layer. In addition, the Adam [15] algorithm is used for the gradient-based optimization.

The performance of the recursive approaches is summarized in Table I. The table shows that, with respect to the vanilla recursive approach, the DaD and CDaD approaches improve the overall MSE and MAE. This shows that reusing the prediction results as input data, together with the original training data, leads to improved performance. Furthermore, augmenting the model with the time step information, i.e., the CDaD approach, improves performance further, more than doubling the MSE improvement (22.92% versus 8.16%) and raising the MAE improvement almost 1.5 times (27.89% versus 19.64%), as can be seen in the table. These performances are achieved after 25 and 29 iterations in the CDaD and DaD approaches, respectively.
Table I: Performance of the recursive approaches
Model      | MSE    | % Improv. | MAE    | % Improv.
Recursive  | 0.0101 | -         | 0.0781 | -
DaD        | 0.0092 | 8.16      | 0.0627 | 19.64
CDaD       | 0.0078 | 22.92     | 0.0563 | 27.89
Figure 4 depicts the error at each time step. It shows that, in the early steps, both the DaD and CDaD approaches perform significantly better than the vanilla recursive approach. However, at time step 8, the MSE of the DaD approach is worse than that of the vanilla recursive approach, while the CDaD approach maintains its advantage all the way through the last step. This is not the case for the MAE, where both the DaD and CDaD approaches maintain their advantage at all time steps.
The traffic flow predictions of the recursive approaches can be seen in Figure 5, which presents the predictions at time steps 1 and 8. At time step 1, all the approaches produce similar predictions, which are very close to the actual traffic flow. However, at time step 8, only the CDaD approach produces an acceptable prediction. The timestep information provides an extra input dimension that tells the model how far along the prediction trajectory it is, which helps it learn better corrections. It can be concluded that adding this extra dimension is worth the effort.
In the second set of experiments, two direct approaches, namely the vanilla direct and Hybrid approaches, are tested. The same base predictors as in the previous set of experiments are used. The number of models trained in these approaches depends on the size of the future horizon: since the number of time steps is 8, each approach trains 8 models. The main difference between the vanilla direct and Hybrid approaches is the size of the input. The number of inputs in the Hybrid approach increases as the time step increases, while the number of inputs in the vanilla direct approach is fixed.
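One way to read the Hybrid approach described above is sketched below: the model for each step receives the observed history extended with the predictions of all earlier steps, so its input grows by one value per step. This is an interpretation of the description, not the authors' code; the function name `hybrid_forecast` and the callable per-step models are assumptions.

```python
from typing import Callable, List, Sequence


def hybrid_forecast(models: Sequence[Callable[[List[float]], float]],
                    history: List[float]) -> List[float]:
    """Step h is predicted from the history plus the h - 1 earlier predictions."""
    predictions: List[float] = []
    for model in models:
        # The input is rebuilt each step, so it grows by one value per step.
        predictions.append(model(list(history) + predictions))
    return predictions


# Hypothetical per-step models that just sum their inputs, to show the growing input.
toy_models = [sum, sum, sum]
toy_predictions = hybrid_forecast(toy_models, [1.0, 2.0])
```

Because later models consume earlier predictions, an error made at an early step is carried into every subsequent input.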
Table II: Performance of the direct approaches
Model   | MSE    | % Improv. | MAE    | % Improv.
Direct  | 0.0090 | -         | 0.0715 | -
Hybrid  | 0.0082 | 8.68      | 0.0674 | 5.63
The performance of the direct approaches is summarized in Table II. The table shows that the vanilla direct approach has smaller errors than the vanilla recursive approach. This is expected, since the accumulating error problem does not exist in the vanilla direct approach: each timestep prediction is handled independently by its own model. Furthermore, the Hybrid approach improves on the vanilla direct approach by 8.68% in MSE and 5.63% in MAE. However, compared with the best recursive approach, CDaD still performs better. This may be attributed to the errors that accumulate in the Hybrid approach when previous predictions are used as inputs to subsequent models.
From Figure 6, it can be seen that the errors in the first step are identical. This is expected because the models in this step are practically the same, since no previous predictions are used yet. Overall, the Hybrid approach improves the prediction errors at all time steps. However, a closer look suggests that the Hybrid model's performance degrades as the time step increases. Indeed, at the later steps, the accumulated errors dominate the input of the model. In this type of approach, the performance at each step depends heavily on that of the previous model.
Figure 7 shows that the Hybrid approach initially produces acceptable predictions, which is not the case in the last step. This is consistent with the errors shown in Figure 6, where almost no improvement in MSE and MAE is achieved in the last step. In comparison with CDaD, this method is computationally inefficient since it requires several models for multi-step prediction. With 8 times less computational effort, the CDaD approach performs considerably better than the Hybrid method.
The last set of experiments involves three multi-output approaches: the vanilla multi-output, noise-augmented, and CGAN approaches. The base DNN is configured to have 2 hidden layers of 150 hidden units each, with ReLU as the activation function. The three approaches use identical models, including the output layer size. In the noise-augmented approach, the data are contaminated with Gaussian noise with a mean of 0 and a variance of 0.1. After several trials, this variance was found to produce the best performance on the validation data set. Meanwhile, the discriminative and generative models of the CGAN use DNNs with a configuration similar to that of the base DNN. An important aspect of training the CGAN is the choice of learning rates for the discriminative and generative models. Usually, the discriminative model is configured to learn faster than the generative model. This way the discriminative loss stays low, which keeps the discriminator ahead in discriminating new, unfamiliar representations from the generative model. The evolution of the losses in the CGAN is depicted in Figure 8. It can be seen that the losses of both the discriminative and the generative models converge. Furthermore, the discriminator accuracy converges to 50%, which means the discriminator is not able to distinguish the data generated by the generative model from the actual data. Therefore, it can be concluded that the generative model represents a distribution that mimics the training data.
The overall performances of the multi-output traffic flow prediction approaches are summarized in Table III. Among the vanilla approaches, the multi-output approach obtains the lowest MSE (0.0089, against 0.0090 for the direct and 0.0101 for the recursive approach). This is attributed to the fact that, in the multi-output setting, the accumulating error problem does not exist and the dependencies between time steps are modeled. In the noise-augmented approach, a poor choice of noise may significantly degrade prediction performance. In this experiment, however, the noise was chosen well, as is evident from the improvement in prediction performance. Furthermore, the best improvement is achieved when the original data is augmented with data generated by the generative model. Therefore, CGAN can be seen as an intelligent way to perform data augmentation.
Table III: Performance of the multi-output approaches
Model  | MSE    | % Improv. | MAE    | % Improv.
Multi  | 0.0089 | -         | 0.0718 | -
Noise  | 0.0082 | 8.13      | 0.0671 | 6.57
CGAN   | 0.0072 | 18.47     | 0.0576 | 19.71
Figure 9 depicts the MSE and MAE of the multi-output approaches at all time steps. Both the noise-augmented and CGAN approaches consistently produce improved traffic flow predictions across all time steps. Furthermore, in Figure 10, it can be seen that the performances of the vanilla multi-output and noise-augmented approaches are poor in the first time step. Indeed, learning several timesteps simultaneously is more difficult than learning a single step, as is done in the recursive and direct approaches. However, the proposed CGAN approach is able to significantly improve both the early time step predictions and the overall performance.
Finally, the comparison of the CDaD, Hybrid, and CGAN approaches is depicted in Figure 11. This figure shows that the CGAN approach is superior in terms of MSE at all time steps. However, in terms of MAE, the CDaD approach is better than the CGAN approach at the later steps. Indeed, based on the prediction plots in Figure 5 and Figure 10, it can be seen that the CDaD approach produces better traffic flow predictions than the CGAN approach. In [16], it is suggested that MAE is a more natural measure and that MSE can often be misleading and is not a good indicator of average model performance, because it is a function of two characteristics of a set of errors rather than of one. In addition, [17] demonstrates that the MSE is more appropriate for representing model performance when the error distribution is expected to be Gaussian.
V Conclusions
This paper proposed two methods to improve multi-step traffic flow prediction: the CDaD and CGAN approaches. The first approach is developed for the recursive strategy and is inspired by previous work [2]. This approach augments the input with information about the current time step and follows a training process similar to the meta-algorithm proposed in [2]. The second approach is developed for the multi-output strategy and utilizes the ability of a GAN to mimic the distribution of a data set. The CGAN model is developed to generate historical data conditioned on the future data. This way, the original data set can be enriched with an arbitrary number of historical-future pairs of data for training purposes.
The experiments show that the proposed approaches are able to improve multi-step traffic predictions relative to their vanilla counterparts. Moreover, in terms of MSE, the CGAN approach performs better than all of the other approaches. However, in the later steps, the MAE of CDaD is lower than that of all the other approaches tested. Compared to CDaD, training with the CGAN approach is simpler, since it does not require iterative training once the new data are generated. However, there are applications where it is more efficient to use a recursive model, such as video sequence prediction, and in such applications recursive prediction can benefit from the improvement offered by the CDaD approach.
References

[1] G. Bontempi, S. B. Taieb, and Y.-A. Le Borgne, “Machine learning strategies for time series forecasting,” in European Business Intelligence Summer School, pp. 62–77, Springer, 2012.
[2] A. Venkatraman, M. Hebert, and J. A. Bagnell, “Improving multi-step prediction of learned time series models,” in AAAI, pp. 3024–3030, 2015.
[3] S. B. Taieb, Machine learning strategies for multi-step-ahead time series forecasting. PhD thesis, Université Libre de Bruxelles, Belgium, 2014.
[4] S. B. Taieb and A. F. Atiya, “A bias and variance analysis for multistep-ahead time series forecasting,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 1, pp. 62–76, 2016.
[5] M. Marcellino, J. H. Stock, and M. W. Watson, “A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series,” Journal of Econometrics, vol. 135, no. 1, pp. 499–526, 2006.
[6] S. B. Taieb, G. Bontempi, A. F. Atiya, and A. Sorjamaa, “A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition,” Expert Systems with Applications, vol. 39, no. 8, pp. 7067–7083, 2012.
[7] S. B. Taieb, R. J. Hyndman, et al., “Recursive and direct multi-step forecasting: the best of both worlds,” Monash University, Department of Econometrics and Business Statistics, Tech. Rep., 2012.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[9] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[10] E. L. Denton, S. Chintala, R. Fergus, et al., “Deep generative image models using a Laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems, pp. 1486–1494, 2015.
[11] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.
[12] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon, “Pixel-level domain transfer,” in European Conference on Computer Vision, pp. 517–532, Springer, 2016.
[13] California Department of Transportation, “Caltrans Performance Measurement System.” http://pems.dot.ca.gov/, 2016. [Online; accessed June 2016].
[14] Highway Capacity Manual, “Volumes 1-4 (2010),” Transportation Research Board, 2010.
[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[16] C. J. Willmott and K. Matsuura, “Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance,” Climate Research, vol. 30, no. 1, pp. 79–82, 2005.
[17] T. Chai and R. R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)?,” Geoscientific Model Development Discussions, vol. 7, pp. 1525–1534, 2014.