Introduction
With the accelerated pace of modern life, more and more cars come into use. The increased use of cars brings convenience to the public on the one hand, but on the other hand produces many social problems such as traffic congestion, traffic accidents and environmental pollution. Manual traffic management is no longer viable, and intelligent transportation systems (ITS) have emerged to make traffic management and traveler information services more efficient. The benefits of ITS cannot be realized without the ability to anticipate short-term traffic conditions. As an important aspect of traffic conditions, traffic flows can give insights into traffic conditions (Yu et al., 2003). Therefore, short-term traffic flow forecasting has become one of the most important and fundamental problems in ITS. Short-term traffic flow forecasting determines the traffic flows in the next time interval, usually in the range of five minutes to half an hour, using historical data (Abdulhai et al., 2002; Sun and Zhang, 2007). A good short-term traffic flow forecasting model can indicate the traffic condition in the near future and in turn make traffic management more effective. In recent years, ITS, and especially short-term traffic flow forecasting, has attracted great interest from researchers. Our work also focuses on short-term traffic flow forecasting.
The items detected by ITS generally include traffic flow, volume and occupancy, etc. Among these items, traffic flow is considered the typical metric of the traffic condition on a certain link (Chen and Chen, 2007). Traffic flow measures the number of vehicles passing through in a defined time interval, and a lower traffic flow means heavier traffic congestion. Traditional traffic flow forecasting predicts the future flow of a certain link using only the historical data of the same link, which is also called single-link traffic flow forecasting. Obviously, single-link forecasting approaches ignore the relations between the measured link and its adjacent links. In fact, each link is closely related to other links in the whole transportation system, especially its adjacent links. In this paper, we put forward multi-link forecasting models which take the relations between adjacent links into account. Extensive experiments on real-world data show that the multi-link approaches are superior to the single-link approaches.
In the past decade, a series of traffic flow forecasting approaches have been proposed, such as time series based approaches (Moorthy and Ratcliffe, 1988; Lee and Fambro, 1999; William and Hoel, 2003), nonparametric methods (Davis and Nihan, 1991), local regression models (Davis, 1990; Smith and Demetsky, 1997), neural network approaches (Hall and Mars, 1998), the Kalman filtering model (Okutani and Stephanedes, 1984), the Markov chain model (Yu et al., 2003) and so on. Among all these approaches, neural-network-based forecasting approaches are considered relatively effective methods due to their well-established models. Typical neural-network-based forecasting methods mainly include the back propagation (BP) neural network (Smith and Demetsky, 1994), the radial basis function (RBF) neural network (Wang and Xiao, 2003; Park et al., 1998; Ulbricht, 1994), the time delayed neural network (Abdulhai et al., 1999), resource allocating networks (Chen and Grant, 2001), etc. In this paper, we select BP neural networks for the corresponding neural-network-based experiments. The competitive results further verify the superiority of the proposed neural-network-based approaches.

Gaussian process regression (GPR) is a classic regression algorithm based on Bayesian theory. A Gaussian process is a generalization of the Gaussian probability distribution, and each process is specified by its mean function and covariance function
(Rasmussen and Williams, 2006). Owing to its easy implementation, few parameters and strong interpretability, GPR is widely studied in machine learning. Furthermore, theoretical and practical developments over the last decade have shown that the Gaussian process is a serious competitor for supervised learning applications (Rasmussen and Williams, 2006). However, there are few applications of GPR in traffic flow forecasting. In this paper, we give a brief analysis of GPR and apply it to traffic flow forecasting. Through extensive tests of GPR on real-world data sets, we point out the potential of GPR for traffic flow forecasting.

Graphical models are common in both statistics and computer science, and are considered an intersection of the two fields. In statistics applications, there are often large-scale models with thousands or even millions of variables involved. Machine learning applications, such as biological information retrieval and language processing, face the same problems. Graphical lasso (GL) provides a general methodology for solving such problems (Jordan, 2004). By using L1 regularization, GL builds a sparse graphical model making use of the sparse inverse covariance matrix. In this paper, we provide a detailed discussion of the GL algorithm in theory and apply it to multi-link traffic flow forecasting. With the further information extracted by GL, combined with BP neural networks, we construct a new multi-link single-task traffic flow prediction model, which we refer to as GL_NN.
Parts of our work have been presented recently at international conferences (Gao and Sun, 2010; Gao et al., 2011). In this paper, we combine and extend them to give a more systematic analysis. The remainder of this paper is organized as follows. First, we introduce the four prediction models based on NNs. Next, we introduce GPR and GL, respectively, which are closely related to our work. Then, all the corresponding experiments and discussions are presented in the experiments section. Finally, conclusions are given in the last section.
Prediction Models with Neural Networks
Due to their excellent ability to handle complex problems and their self-learning, self-organizing and self-adaptive characteristics, neural networks (NNs) usually perform well in machine learning problems. On the other hand, multi-task learning (MTL) is widely applied in computational intelligence and has shown competitive performance. The main difference between MTL and single-task learning (STL) is that, with the same inputs, MTL has multiple outputs while STL has only one output at a time. As to the multiple tasks in MTL, there is only one main task and the others are extra tasks assisting the learning of the main task. More details about MTL and STL can be found in Caruana (1997). In this paper, we further combine the single-link and multi-link models with single-task learning and multi-task learning to construct four prediction models: single-link single-task learning (SSTL), single-link multi-task learning (SMTL), multi-link single-task learning (MSTL) and multi-link multi-task learning (MMTL).
Single-Link Model
Traditional traffic flow prediction models are single-link models, which predict the future flow of one certain road link using only the historical data of the same link. Combining the single-link model with single-task learning and multi-task learning, we construct two models, SSTL and SMTL. The main difference between the two models lies in the number of outputs, which is also the difference between STL and MTL approaches in a narrow sense. Following the settings in Caruana (1997), we set the number of outputs to 3 (one main task and two extra tasks) for our MTL approaches. For one link, we use the first 5 historical traffic flows to predict the next one; that is, the number of inputs is 5.
Take link Ba in Fig. 1 as an example. Record the traffic flow of road link Ba at time interval t as Ba(t); the corresponding five historical traffic flows are then Ba(t−5), Ba(t−4), …, Ba(t−1). In the single-link model, we predict Ba(t) using Ba(t−5), …, Ba(t−1). Based on NNs, the five historical flows serve as the five inputs. In SSTL, Ba(t) is the one and only output, while in SMTL there are three outputs, of which Ba(t) is the main task and the other two are extra tasks that assist the prediction of the main task. Note that the selection of the extra tasks is not specified; here we follow the settings in previous experiments (Jin and Sun, 2008). Diagrams of the two single-link models SSTL and SMTL are shown in Fig. 2a and Fig. 2b.
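As a concrete illustration, the single-link sliding-window construction described above can be sketched as follows (a minimal sketch: the helper name `sliding_window` and the flow values are hypothetical, not taken from the experiments):

```python
import numpy as np

def sliding_window(flow, n_lags=5):
    """Build (inputs, target) pairs from one link's flow series:
    each row of X holds 5 consecutive historical flows and y is the next flow,
    i.e. X[i] predicts y[i] as in the SSTL model."""
    X = np.array([flow[i:i + n_lags] for i in range(len(flow) - n_lags)])
    y = np.array(flow[n_lags:])
    return X, y

# hypothetical 15-minute flow values for link Ba (vehs/h)
ba = [520, 540, 610, 580, 600, 630, 615, 590]
X, y = sliding_window(ba)
print(X.shape, y.shape)  # (3, 5) (3,)
```

In the SMTL case, the vector y would simply be replaced by a three-column target matrix (one main task plus two extra tasks).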
Multi-Link Model
Obviously, the single-link prediction model is inefficient because it predicts only one future flow of only one link at a time. Worse still, it does not exploit the relevant information between adjacent links to improve the prediction. In fact, since vehicles always come from one link and go to other links in the whole transportation system, the traffic flows of all links, and especially of adjacent links, are relevant. Therefore, taking the relevance between adjacent links into account, we combine the multi-link model with single-task learning and multi-task learning to construct the multi-link single-task learning model (MSTL) and the multi-link multi-task learning model (MMTL). The multi-link model can simultaneously predict the traffic flows of multiple links using the historical flows of all links at the same junction. Using historical data from multiple links at different junctions to predict the flow on one of them is a special case of the multi-link model; for example, GL_NN is such a special case.
Again take the map in Fig. 1 as an example. We can see that junction B connects three links Ba, Bb and Bc. In multi-link models, we simultaneously predict the future flows of the three links using all the historical data of the three links. Therefore, similarly to the analysis for the single-link model, there are 3 × 5 = 15 inputs in the multi-link models, namely Ba(t−5), …, Ba(t−1), Bb(t−5), …, Bb(t−1) and Bc(t−5), …, Bc(t−1). Combining with single-task learning and multi-task learning, there are three outputs Ba(t), Bb(t) and Bc(t) in MSTL, while there are 3 × 3 = 9 outputs in MMTL (three for each of the three links, as in SMTL). Diagrams of the two multi-link models are shown in Fig. 3a and Fig. 3b. For the sake of clarity, in Fig. 3b, we draw the three outputs corresponding to the three links Ba, Bb and Bc each in a box. Through the four diagrams in Fig. 2a, Fig. 2b, Fig. 3a and Fig. 3b, we can get a better understanding of the single-link and multi-link models.
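The multi-link input construction above (3 × 5 = 15 stacked inputs and, as in MSTL, 3 targets) can be sketched in the same spirit; the helper name and the flow series below are hypothetical:

```python
import numpy as np

def multilink_samples(flows, n_lags=5):
    """Stack the lagged flows of several links into one input vector.
    `flows` maps link name -> flow series; each sample concatenates the
    5 historical flows of every link (3 * 5 = 15 inputs for junction B)
    and the targets are the next flows of all links (3 outputs, as in MSTL)."""
    links = sorted(flows)
    T = min(len(v) for v in flows.values())
    X, Y = [], []
    for t in range(n_lags, T):
        X.append(np.concatenate([np.asarray(flows[l][t - n_lags:t]) for l in links]))
        Y.append([flows[l][t] for l in links])
    return np.array(X), np.array(Y)

# hypothetical flows for the three links at junction B
flows = {"Ba": np.arange(8.0), "Bb": np.arange(8.0) + 100, "Bc": np.arange(8.0) + 200}
X, Y = multilink_samples(flows)
print(X.shape, Y.shape)  # (3, 15) (3, 3)
```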
Gaussian Process Regression
Gaussian process regression (GPR) is an important Bayesian machine learning approach which places a prior distribution over the function space, and all inferences in GPR take place in that function space. In supervised learning applications, it targets the conditional distribution of the targets given the inputs, not the distribution of the inputs. Moreover, a Gaussian process is a generalization of the Gaussian probability distribution: the random variables of a Gaussian probability distribution are scalars or vectors (in the multivariate case), while in the Gaussian process case the random quantities are functions. There are several ways to interpret GPR. Here we give a brief derivation of the theoretical basis of GPR from the function-space view. More details can be found in
(Rasmussen and Williams, 2006).

Suppose that we have a training set $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$, where $n$ is the number of observations, $\mathbf{x}_i$ denotes the $i$-th $D$-dimensional input variable and $y_i$ is the corresponding target, which is a real value in the regression case. Aggregating the inputs as column vectors for all cases, we get a $D \times n$ design matrix $X$. Similarly, collecting the targets gives a vector $\mathbf{y}$. Then the training set can be written as $\mathcal{D} = (X, \mathbf{y})$. In the same way, we represent a test set as $X_*$, which is a $D \times n_*$ matrix.
A Gaussian process is specified by its mean function and covariance function (Rasmussen and Williams, 2006). Define the mean function and covariance function of a Gaussian process as
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\big] \tag{1}$$
Then, the Gaussian process can be written as
$$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big) \tag{2}$$
Generally, for simplicity of notation and computation, the mean function is set to zero. In addition, a Gaussian process is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. From this definition we obtain a consistency principle: if the process specifies $(\mathbf{y}_1, \mathbf{y}_2) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, then it must also specify $\mathbf{y}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11})$, where $\Sigma_{11}$ is the relevant submatrix of $\Sigma$. This consistency principle, also known as the marginalization property, plays an important role in the inference of the GPR algorithm.

Since prediction with noise-free observations is a special case of prediction with noisy observations, we carry out the inference of GPR for the noisy case. In the noisy case, the relation between the observed target value and the function value is
$$y = f(\mathbf{x}) + \varepsilon \tag{3}$$
where $\varepsilon$ is additive independent identically distributed Gaussian noise with variance $\sigma_n^2$. In the GPR inference, the covariance function should be predefined. We take the squared exponential (SE) covariance function as an example:

$$k(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2 \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_p - \mathbf{x}_q)^{\top} \Lambda^{-1} (\mathbf{x}_p - \mathbf{x}_q)\Big) \tag{4}$$
where $\Lambda$ is a diagonal matrix with $\ell_1^2, \ldots, \ell_D^2$ being the diagonal elements, $D$ is the dimension of the input space, and $\sigma_f^2$ is the signal variance which controls the global variation. From formula (4), we note that the covariance between the outputs can be written as a function of the inputs. Following the independence assumption and with the specified covariance function, we can easily get the prior of the training targets $\mathbf{y}$. That is
$$\mathbf{y} \sim \mathcal{N}\big(\mathbf{0},\; K(X, X) + \sigma_n^2 I\big) \tag{5}$$
It is easy to see that formula (5) reduces to the prior of the noise-free case when the noise term is removed. According to the independence assumption, the noise term $\sigma_n^2 I$ is a diagonal matrix.
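As an illustration of how the prior covariance in formula (5) is assembled, the SE covariance of formula (4) can be evaluated for a batch of inputs as follows (a minimal numpy sketch; the function name and sample inputs are our own):

```python
import numpy as np

def se_kernel(Xp, Xq, lengthscales, sigma_f):
    """Squared-exponential covariance of formula (4):
    k(x_p, x_q) = sigma_f^2 * exp(-0.5 * (x_p - x_q)^T Lambda^{-1} (x_p - x_q)),
    with Lambda = diag(lengthscales**2)."""
    inv_l2 = 1.0 / np.asarray(lengthscales) ** 2
    d = Xp[:, None, :] - Xq[None, :, :]                # pairwise differences
    sq = np.einsum("ijk,k,ijk->ij", d, inv_l2, d)      # scaled squared distances
    return sigma_f ** 2 * np.exp(-0.5 * sq)

X = np.random.default_rng(0).normal(size=(4, 3))       # 4 inputs, D = 3
K = se_kernel(X, X, lengthscales=np.ones(3), sigma_f=1.0)
print(K.shape)  # (4, 4)
```

Adding the noise term sigma_n**2 * np.eye(len(X)) to se_kernel(X, X, ...) gives the prior covariance of formula (5).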
With this prior, we can further obtain the joint prior distribution of the observed target values $\mathbf{y}$ and the function values $\mathbf{f}_*$ at the test samples:

$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \tag{6}$$
To get the posterior distribution over functions, we need to reject those functions of the prior that disagree with the observations. In probabilistic terms, this operation can be done easily. According to the properties of the marginal and conditional distributions of a joint Gaussian (Rasmussen and Williams, 2006), we get
$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\big(\bar{\mathbf{f}}_*, \operatorname{cov}(\mathbf{f}_*)\big), \quad \bar{\mathbf{f}}_* = K(X_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} \mathbf{y}, \quad \operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*) \tag{7}$$
Formula (7) gives the distribution of the function values $\mathbf{f}_*$. In practical applications, the mean function value $\bar{\mathbf{f}}_*$ is evaluated as the output of GPR. In fact, the GPR algorithm simultaneously outputs the variance of the prediction values, which is considered a distinctive capability of GPR compared with other regression algorithms. A faster and more stable algorithm that uses the Cholesky decomposition to compute the inverse covariance matrix in formula (7) can also be found in Rasmussen and Williams (2006).
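A minimal sketch of the posterior computation in formula (7), using the Cholesky-based procedure mentioned above (the toy kernel, data and function names are our own illustration, not the paper's implementation):

```python
import numpy as np

def gpr_predict(X, y, Xs, kernel, sigma_n):
    """Posterior mean and variance of formula (7); the Cholesky factor of
    K + sigma_n^2 I replaces the explicit matrix inverse for stability."""
    K = kernel(X, X) + sigma_n ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # (K + sn^2 I)^{-1} y
    Ks = kernel(X, Xs)                                       # K(X, X*)
    mean = Ks.T @ alpha                                      # posterior mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(kernel(Xs, Xs)) - np.sum(v ** 2, axis=0)   # posterior variance
    return mean, var

# 1-D toy example with a unit SE kernel (hypothetical data)
k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
X_tr = np.array([0.0, 1.0, 2.0])
y_tr = np.sin(X_tr)
mean, var = gpr_predict(X_tr, y_tr, X_tr, k, sigma_n=1e-3)
```

With almost noise-free observations the posterior mean reproduces the training targets, and the predictive variance collapses towards zero at the training inputs.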
Graphical Lasso
Graphical lasso (GL) is an algorithm that constructs a sparse graphical model by applying the lasso penalty to the inverse covariance matrix. The basic model assumption is that the observations have a multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ (Friedman et al., 2008). The key to building a sparse graphical model is to make the inverse covariance matrix $\Theta = \Sigma^{-1}$ as sparse as possible. If the $(i, j)$-th component of $\Theta$ is zero, then there is no link between variables $i$ and $j$ in the sparse graphical model; otherwise, there is a link between the two variables. In recent years, a series of approaches have been proposed to solve this problem. They can be classified into two types: approximate approaches and exact approaches. The approximate approaches estimate the sparse graphical model by fitting a lasso model to each variable, using the others as predictors (Meinshausen and Bühlmann, 2006). The exact approaches maximize the L1-penalized log-likelihood. There are several ways to solve the exact problem, for example interior point optimization methods (Dahl et al., 2008) and the blockwise coordinate descent (BCD) algorithm (Friedman et al., 2008). Among them, the BCD-based GL algorithm is regarded as a relatively efficient method (Meinshausen and Bühlmann, 2006). Below, for completeness, we introduce the GL algorithm; more detailed information can be found in the related references.

Problem Setup
Assume that we are given $N$ observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$ independently drawn from a $p$-variate Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\Sigma$. Let $S$ denote the empirical covariance matrix. Thus we have
$$S = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{\top} \tag{8}$$
where $\mathbf{x}_i$ denotes the $i$-th observation. According to the independence assumption, we can easily obtain the likelihood of the given data set:
$$L(\boldsymbol{\mu}, \Sigma) = \prod_{i=1}^{N} \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu})\Big) \tag{9}$$
Then, the log-likelihood can be written as
$$\ell(\boldsymbol{\mu}, \Sigma) = -\frac{Np}{2} \log(2\pi) - \frac{N}{2} \log|\Sigma| - \frac{1}{2} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \tag{10}$$
Because the GL algorithm has to solve the problem of maximizing the L1-penalized log-likelihood, we make a few transformations of formula (10). Removing the constant term in formula (10), combining it with formula (8) and dropping the positive factor $N/2$, we get
$$\ell(\Theta) = \log|\Theta| - \operatorname{tr}(S\Theta), \qquad \Theta = \Sigma^{-1} \tag{11}$$
Therefore, the exact problem that the GL algorithm solves can be written as
$$\max_{\Theta \succ 0}\; \log|\Theta| - \operatorname{tr}(S\Theta) - \rho \|\Theta\|_1 \tag{12}$$
where $\|\Theta\|_1$ is the L1-norm of the matrix $\Theta$, i.e., the sum of the absolute values of its elements, and $\rho$ is the penalty parameter which controls the extent of penalization (Banerjee et al., 2008).
Formula Transformations
Focusing on formula (12), a series of transformations is carried out to obtain an equivalent form that can be solved easily. First, we rewrite the problem as
$$\max_{\Theta \succ 0}\; \min_{\|U\|_{\infty} \le \rho}\; \log|\Theta| - \operatorname{tr}\big(\Theta (S + U)\big) \tag{13}$$
where $\|U\|_{\infty}$ denotes the maximum absolute value of the elements of the symmetric matrix $U$ (Banerjee et al., 2008). Exchanging the max and the min, formula (13) is transformed as follows.
$$\min_{\|U\|_{\infty} \le \rho}\; \max_{\Theta \succ 0}\; \log|\Theta| - \operatorname{tr}\big(\Theta (S + U)\big) \tag{14}$$
Computing the derivative of formula (14) with respect to $\Theta$, we obtain $\Theta = (S + U)^{-1}$. Replacing $\Theta$ in formula (14) with $(S + U)^{-1}$, it follows that the dual problem of formula (13) becomes
$$\min_{\|U\|_{\infty} \le \rho}\; -\log|S + U| - p \tag{15}$$
where $p$ is the dimension of the matrix $S$, and the relation between the primal and the dual variables is $\Theta = (S + U)^{-1}$. For neatness, set $W = S + U$. Then the dual of the primal maximum L1-penalized log-likelihood problem is
$$\max_{W}\; \big\{\log|W| : \|W - S\|_{\infty} \le \rho\big\} \tag{16}$$
According to the series of transformations above, we find that the dual problem (16) finally estimates the covariance matrix $W = \hat{\Sigma}$, while the primal problem (12) estimates the inverse covariance matrix $\Theta$. Moreover, we also observe that the diagonal elements of $W$ and $S$ satisfy
$$W_{ii} = S_{ii} + \rho \tag{17}$$
which holds for all $i = 1, \ldots, p$.
Block Coordinate Descent (BCD) Algorithm
Let $W$ be the estimate of $\Sigma$. The block coordinate descent (BCD) algorithm solves problem (16) by optimizing cyclically over each row and column of $W$ until the given convergence condition is achieved. More details about the BCD algorithm can be found in Banerjee et al. (2008). For the GL approach discussed here, the BCD algorithm serves as the launching point.
Divide $W$ and $S$ into blocks as
$$W = \begin{bmatrix} W_{11} & \mathbf{w}_{12} \\ \mathbf{w}_{12}^{\top} & w_{22} \end{bmatrix}, \qquad S = \begin{bmatrix} S_{11} & \mathbf{s}_{12} \\ \mathbf{s}_{12}^{\top} & s_{22} \end{bmatrix} \tag{18}$$
The BCD algorithm updates $\mathbf{w}_{12}$ by solving the quadratic program
$$\mathbf{w}_{12} = \arg\min_{\mathbf{y}}\; \big\{\mathbf{y}^{\top} W_{11}^{-1} \mathbf{y} : \|\mathbf{y} - \mathbf{s}_{12}\|_{\infty} \le \rho\big\} \tag{19}$$
which is solved by an interior point procedure. Permuting the rows and columns so that the target column is always the last one, BCD solves a problem of the form (19) for each column, and updates the estimate of $W$ after all columns have been processed. This process is repeated until convergence.
The dual problem of formula (19), shown as follows, is also derived in Banerjee et al. (2008).
$$\min_{\boldsymbol{\beta}}\; \tfrac{1}{2} \big\|W_{11}^{1/2} \boldsymbol{\beta} - \mathbf{b}\big\|^2 + \rho \|\boldsymbol{\beta}\|_1 \tag{20}$$
where $\mathbf{b} = W_{11}^{-1/2} \mathbf{s}_{12}$. If $\boldsymbol{\beta}$ solves formula (20), then the solution of formula (19) is $\mathbf{w}_{12} = W_{11} \boldsymbol{\beta}$. It can easily be seen that formula (20) resembles a lasso regression problem, which is the launching point of the GL approach. A verification of the equivalence between the solutions of formulas (12) and (20) is given in Friedman et al. (2008).

Algorithm Description and Realization
According to the lasso problem (20) obtained via the BCD algorithm, the GL approach solves and updates this problem recursively. The details of the GL algorithm can be described as follows.
1. Set $W = S + \rho I$, where $I$ is the identity matrix. The diagonal of $W$ then remains unchanged in all the following steps.
2. For each row and column of $W$, solve the lasso problem (20) and obtain the solution $\boldsymbol{\beta}$.
3. Compute $\mathbf{w}_{12}$ by $\mathbf{w}_{12} = W_{11} \boldsymbol{\beta}$, and replace the corresponding row and column of $W$ with $\mathbf{w}_{12}$.
4. Repeat steps 2 and 3 until convergence.
5. Compute the inverse matrix of $W$, which is the required inverse covariance matrix $\Theta$.
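The five steps above can be sketched as follows (a simplified, unoptimized illustration in which the inner lasso (20) is solved by plain coordinate descent; all names and the synthetic data are our own, and this is not the exact implementation of Friedman et al. (2008)):

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator used by the lasso coordinate updates."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def graphical_lasso(S, rho, n_sweeps=100, tol=1e-5):
    """Sketch of the GL steps above: start from W = S + rho*I, cyclically
    refresh each column of W via the lasso problem (20), then invert W."""
    p = S.shape[0]
    W = S + rho * np.eye(p)          # step 1: diagonal is fixed from now on
    B = np.zeros((p, p))             # lasso coefficients, one column per variable
    for _ in range(n_sweeps):
        W_old = W.copy()
        for j in range(p):           # step 2: one lasso per row/column
            idx = np.arange(p) != j
            W11, s12 = W[np.ix_(idx, idx)], S[idx, j]
            beta = B[idx, j]
            for _ in range(50):      # inner coordinate descent on (20)
                for k in range(p - 1):
                    r = s12[k] - W11[k] @ beta + W11[k, k] * beta[k]
                    beta[k] = soft(r, rho) / W11[k, k]
            B[idx, j] = beta
            w12 = W11 @ beta         # step 3: refresh the j-th row/column
            W[idx, j] = W[j, idx] = w12
        if np.abs(W - W_old).mean() < tol:   # step 4: convergence check
            break
    return W, np.linalg.inv(W)       # step 5: precision matrix Theta

rng = np.random.default_rng(0)
S = np.cov(rng.normal(size=(50, 4)), rowvar=False)
W, Theta = graphical_lasso(S, rho=0.1)
```

The diagonal of W stays at S_ii + rho throughout, matching relation (17), and step 5 recovers the precision matrix by inverting W.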
Friedman et al. (2008) also give a relatively cheap method to execute the last step of the GL algorithm above. From the achieved sparse matrix $\Theta$, GL builds the desired sparse undirected graphical model. Each row (or column) of $\Theta$ represents a node in the graphical model, and each node corresponds to one variable of the multivariate data; hence $p$-dimensional data yields $p$ nodes in the graphical model. Whether there is a link between two nodes is determined by whether the corresponding component of $\Theta$ is zero. If the component is zero, there is no link in the graphical model; that is, the two variables are conditionally independent given the other variables. In the next section, combining with the multi-link traffic flow prediction model, we give an instance of building the sparse graphical model by GL.
Experiments
Data Description
The data sets used in this paper are vehicle flow rates recorded every 15 minutes, which were gathered along many road links by the UTC/SCOOT system of Beijing Traffic Management (Sun and Zhang, 2007). The unit of the data is normalized as vehicles per hour (vehs/h). For short-term traffic flow forecasting, we carry out a one-step prediction and take 15 minutes as the prediction horizon. That is, each time we predict the traffic flow rate of the next 15-minute interval.
From the urban traffic map, we select a portion including 31 road links, shown in Fig. 1. Each circle node in the figure represents a road junction which joins several road links. The arrows show the directions of traffic flows from the upstream junctions to the corresponding downstream junctions. Paths without arrows denote links with no traffic flow records. The raw data were taken from March 1 to March 31, 2002, 31 days in total. Owing to malfunctions of the traffic flow detectors, we discarded the days with missing data. The remaining data cover 25 days and comprise 2400 sample points in total. We divide the data into two parts, the first 2112 samples as training data and the rest as test data.
Model Building with GL
In the multi-link traffic flow prediction case, we can use certain historical traffic flows of all the links in the whole map. By building a sparse graphical model with GL, we extract the informative historical traffic flows provided by all the links. Based on the data set described above, we take 6 consecutive traffic flows of each link to build the sparse graphical model. Because there are 31 links, we obtain a 186 × 186 inverse covariance matrix. For traffic flow forecasting, the first 5 historical traffic flows of the 31 links are all used to predict the 6th traffic flow of one link. For a predicted link, when building the graphical model, we need the 6 traffic flows of the predicted link and the first 5 historical traffic flows of the other 30 links. Therefore, for a certain link, there are at most 186 − 30 = 156 nodes in the graphical model.
Still take link Ba as an example. In the single-link prediction model, we predict the traffic flow Ba(n) using the consecutive historical traffic flows Ba(n−5), …, Ba(n−1) on link Ba. In the multi-link prediction model, we consider the historical traffic flows of all adjacent links or of all the links in the whole traffic map. The latter seems more comprehensive, but it also brings much more computation. Fortunately, this problem can be solved easily by the GL approach. With the sparse graphical model built by GL, we can extract the historical traffic flows most relevant to the predicted flow. In the modeling of link Ba, we just consider the components of the corresponding column (or row) of the inverse covariance matrix. If a component is zero, it means there is no relevance, or very little relevance, between the two variables, and hence no link between the corresponding variables in the graphical model. For example, suppose one variable represents the predicted traffic flow and another represents some historical traffic flow; if there is no link between the two variables, that historical traffic flow contributes nothing to the prediction. Fig. 4 gives the sparse graphical model of link Ba built by GL.
By comparing Fig. 1 and Fig. 4 for link Ba, we can see that the prediction of link Ba is relevant not only to the three traffic flows Ba(t−3), Ba(t−2) and Ba(t−1) on link Ba itself, but also to the five traffic flows Eb(t−1), Fe(t−2), Fe(t−1), Hl(t−1) and Ib(t−1) of links Eb, Fe, Hl and Ib, respectively. Only 8 variables are considered relevant to the prediction of Ba(t), which is much fewer than the 155 variables considered in the general multi-link model. Therefore, GL further extracts the relevant information on top of our previous multi-link prediction model.
Experimental Settings
In the design of the NNs, a three-layer BP neural network is selected. On the one hand, a three-layer NN can approximate arbitrary bounded and continuous functions (Duda et al., 2001); on the other hand, more layers would make the network more complex. Besides, BP NNs are well known for their good self-learning capability. The numbers of input and output units are determined by the dimension of the experimental data. For example, in the single-link models based on NNs, as we use 5 historical traffic flows to predict the traffic flow of the next time interval, the number of input units is 5, while the number of output units is 1 in the SSTL model and 3 in the SMTL model. The multi-link case can be inferred from the representation given in the multi-link model section. In the GL_NN case, the number of input units depends on the dimension extracted by the GL algorithm, and the number of output units is the same as in the SSTL model. Obviously, different links will have different numbers of input units in the GL_NN case. For all approaches based on NNs, the number of hidden units is computed by the empirical formula shown below.
$$n_h = \sqrt{n_i + n_o} + a \tag{21}$$
where $n_h$, $n_i$ and $n_o$ respectively denote the numbers of hidden-, input- and output-layer units, and $a$ is a constant that can be chosen between 1 and 10 (Zhang and Sun, 2010). To obtain a relatively optimal construction of the NN, we try values of $a$ from 1 to 10 in steps of 1 and finally choose the one with the best performance. As for the transfer functions of the NN, the sigmoid function is used between the input layer and the hidden layer, and the purelin function between the hidden layer and the output layer. The trainlm function is selected as the training function, because it is based on the Levenberg-Marquardt algorithm and converges rapidly with high prediction accuracy.
In the realization of GPR, we need to specify the covariance function and find a method to optimize the parameters. In our experiments, we choose the following squared exponential (SE) covariance function.
$$k(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2 \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_p - \mathbf{x}_q)^{\top} \Lambda^{-1} (\mathbf{x}_p - \mathbf{x}_q)\Big) \tag{22}$$
where $\Lambda$ is a diagonal matrix with diagonal elements $\ell_1^2, \ldots, \ell_D^2$, $D$ is the dimension of the input space, and $\sigma_f^2$ is the signal variance which controls the global variation. Therefore, together with the noise variance $\sigma_n^2$ involved in the GPR algorithm, there are $D + 2$ parameters in total. Following the suggestion in Rasmussen and Williams (2006), we initially set the length scales $\ell_d$ and $\sigma_f$ to 1, and $\sigma_n$ to 0.1. Then, we optimize these parameters by maximizing the marginal likelihood
$$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^{\top} \big(K + \sigma_n^2 I\big)^{-1} \mathbf{y} - \frac{1}{2} \log\big|K + \sigma_n^2 I\big| - \frac{n}{2} \log 2\pi \tag{23}$$
Equivalently, we use a gradient descent algorithm to minimize the negative log marginal likelihood and thereby obtain the optimal parameters.
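The quantity being optimized can be sketched as follows: a Cholesky-based evaluation of the negative log marginal likelihood of formula (23), cross-checked against the direct inverse/determinant form (toy data, hypothetical names):

```python
import numpy as np

def nlml(K, y, sigma_n):
    """Negative log marginal likelihood of formula (23), evaluated through a
    Cholesky factor: log|K_y| = 2 * sum(log(diag(L))) with K_y = K + sigma_n^2 I."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma_n ** 2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = rng.normal(size=5)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # unit SE kernel matrix
y = rng.normal(size=5)
val = nlml(K, y, sigma_n=0.1)

# direct evaluation of -log p(y|X) via explicit inverse and determinant
Ky = K + 0.01 * np.eye(5)
direct = (0.5 * y @ np.linalg.inv(Ky) @ y
          + 0.5 * np.log(np.linalg.det(Ky)) + 2.5 * np.log(2 * np.pi))
```

A gradient descent routine would minimize nlml with respect to the kernel parameters and the noise standard deviation.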
Similarly, there are two parameters to be specified in GL. One is the penalty parameter $\rho$, and the other is the lower limit value which determines the relevance between two variables. For the selection of the penalty parameter, we follow the suggestion in Section 2.3 of Banerjee et al. (2008); it is stated there that, with this choice, the error rate of estimating the graphical model can be controlled. Since the GL algorithm builds the sparse graphical model according to the inverse covariance matrix, we need a lower limit value to screen out the nonzero components of the inverse covariance matrix that represent effective information. In our experiments with GL, we set a component of the inverse covariance matrix to zero if its absolute value is less than 5 × 10⁻⁴. That is, we consider that there is little relevance between two variables when the corresponding component of the inverse covariance matrix is that small.
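The screening rule just described can be sketched as follows (the precision matrix, the threshold usage and the variable names are purely illustrative):

```python
import numpy as np

def relevant_predictors(Theta, target, names, threshold=5e-4):
    """Screen the row of the estimated inverse covariance matrix that
    corresponds to the predicted flow: components whose absolute value is
    below the threshold are treated as zero, i.e. those variables are taken
    to be conditionally independent of the target and are dropped."""
    row = np.abs(Theta[target])
    return [names[j] for j in range(len(names))
            if j != target and row[j] >= threshold]

# hypothetical 4-variable precision matrix
Theta = np.array([[ 2.0, 0.3, 1e-5, 0.0],
                  [ 0.3, 1.5, 0.0,  0.2],
                  [1e-5, 0.0, 1.0,  0.0],
                  [ 0.0, 0.2, 0.0,  1.2]])
names = ["Ba(t)", "Ba(t-1)", "Eb(t-1)", "Fe(t-1)"]
print(relevant_predictors(Theta, 0, names))  # ['Ba(t-1)']
```

Only the surviving variables are then fed into the BP network, which is what makes the GL_NN input dimension link-dependent.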
Results
To examine the rationality of the multi-link model, we first compute the correlation coefficients of adjacent links. There are 10 junctions in the traffic map of Fig. 1, yielding 36 correlation coefficients in total, which we list in Table 1.
Junction  cor_coef  cor_coef  cor_coef  cor_coef  cor_coef  cor_coef 
B  0.9512  0.9370  0.9608  
C  0.9320  0.8731  0.8836  0.9446  0.9510  0.9365 
D  0.7998  0.9073  0.9218  0.7405  0.7493  0.9398 
E  0.7869  
F  0.9420  0.8875  0.9637  0.9019  0.9359  0.9289 
G  0.9510  
H  0.9423  0.9597  0.9473  
I  0.8735  0.8987  0.9543  
J  0.9238  
K  0.8708  0.7280  0.7876  0.6662  0.7683  0.5988 
From Table 1, we can see that the minimum, maximum and mean of the 36 correlation coefficients are 0.5988, 0.9637 and 0.8790, respectively. Almost all of the 36 correlation coefficients are larger than 0.8, which indicates high correlation between the variables. Therefore, for the real-world data sets used in our experiments, it is reasonable and meaningful to test the proposed multi-link approaches.
In this paper, the approaches considered for traffic flow forecasting are SSTL, SMTL, MSTL, MMTL, GPR and GL_NN. We test them on the 31 real-world traffic flow data sets described above, which are collected from the 31 road links of Fig. 1. To get a complete evaluation of all the proposed approaches, the historical average, denoted Hist_Avg, is adopted as a baseline for comparison. We adopt the root mean square error (RMSE) and the mean absolute relative error (MARE) to evaluate the prediction performance of the different approaches. RMSE and MARE are formulated as follows.
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \big(y_i - \hat{y}_i\big)^2} \tag{24}$$

and

$$\mathrm{MARE} = \frac{1}{m} \sum_{i=1}^{m} \frac{|y_i - \hat{y}_i|}{y_i} \tag{25}$$
where $\hat{y}_i$ is the prediction of $y_i$, and $m$ is the number of test samples.
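The two error measures of formulas (24) and (25) are straightforward to implement (a minimal sketch; note that the result tables report MARE as a percentage):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error, formula (24)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mare(y, y_hat):
    """Mean absolute relative error, formula (25), as a fraction
    (multiply by 100 to obtain the percentages reported in the tables)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat) / y)

y_true, y_pred = [100.0, 200.0], [110.0, 190.0]
# RMSE = 10.0, MARE = 0.075 (i.e. 7.5%)
```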
In Table 2 and Table 3, we present the MARE and RMSE results, respectively, on the 31 road links for all the compared approaches.
MARE  GPR  SSTL  SMTL  MSTL  MMTL  GL_NN  Hist_Avg 
Ba  11.55  12.83  11.14  10.61  11.43  11.28  12.94 
Bb  7.72  7.98  8.02  7.57  7.78  7.68  9.01 
Bc  9.34  9.95  9.40  8.38  8.23  8.26  9.90 
Ce  9.82  10.30  10.23  9.34  9.53  9.52  9.04 
Cf  8.92  9.03  8.95  9.08  8.32  7.58  8.76 
Cg  11.86  12.53  12.00  11.45  11.67  13.37  14.70 
Ch  10.56  10.18  10.61  10.19  10.12  9.47  9.74 
Da  20.85  20.51  18.89  18.44  17.16  20.02  23.79 
Db  24.61  25.02  24.73  25.06  24.68  26.26  21.11 
Dc  13.31  13.95  13.84  16.79  14.59  12.20  12.97 
Dd  12.94  13.17  13.50  12.52  11.80  9.94  12.39 
Eb  10.83  12.74  11.13  12.92  12.74  10.24  14.34 
Ed  15.51  16.35  15.36  16.66  16.16  14.21  30.04 
Fe  7.74  8.38  7.50  7.94  10.07  7.54  9.81 
Ff  11.29  11.68  11.81  10.65  11.32  12.50  13.97 
Fg  10.08  11.20  9.88  9.69  11.10  9.34  11.14 
Fh  8.52  10.02  9.31  9.11  9.39  8.69  9.81 
Gb  13.37  14.10  15.30  13.52  12.80  12.80  14.52 
Gd  10.36  11.11  10.63  10.22  9.96  8.84  11.83 
Hi  11.64  11.79  11.58  12.34  11.27  12.12  15.48 
Hk  13.44  14.03  14.28  12.78  12.96  13.33  16.60 
Hl  9.29  9.79  10.07  9.15  9.53  8.00  10.71 
Ia  16.00  16.85  16.66  15.88  15.90  18.20  20.86 
Ib  9.77  9.51  9.39  8.71  8.60  8.04  8.53 
Id  8.07  8.46  8.37  8.37  8.61  6.92  8.14 
Jh  7.95  8.02  7.89  8.29  8.69  8.26  8.47 
Jf  9.23  9.43  9.66  7.88  7.61  6.92  10.66 
Ka  9.24  9.50  9.23  10.00  10.24  8.51  10.86 
Kb  10.31  10.48  10.73  10.30  10.08  10.69  11.38 
Kc  26.01  28.51  31.39  27.63  31.14  25.00  27.54 
Kd  11.80  11.81  11.54  11.06  11.26  10.31  14.52 
Table 3. RMSE of all the compared approaches on the 31 links.

Link  GPR  SSTL  SMTL  MSTL  MMTL  GL_NN  Hist_Avg
Ba  142.76  148.99  147.16  150.81  147.79  139.71  174.76
Bb  67.80  72.15  71.86  73.60  72.59  70.46  85.79
Bc  96.83  104.11  103.56  98.80  97.65  91.89  123.11
Ce  51.95  55.65  55.31  54.73  53.57  52.70  52.53
Cf  91.34  89.31  88.87  86.79  84.58  81.62  105.06
Cg  50.87  50.32  50.56  49.51  49.19  53.27  69.12
Ch  67.35  65.95  66.01  63.48  63.13  64.02  67.20
Da  112.40  77.44  79.05  82.28  77.15  95.47  132.70
Db  50.81  53.29  53.24  54.60  53.49  63.75  60.73
Dc  78.49  85.93  85.88  88.32  87.69  73.39  81.15
Dd  62.93  62.08  61.60  68.61  65.07  55.99  80.07
Eb  154.78  166.89  162.26  168.14  165.58  150.17  212.08
Ed  195.81  191.85  196.43  208.95  199.36  179.67  340.79
Fe  116.60  116.80  115.69  122.73  119.94  112.40  160.03
Ff  87.47  84.62  84.74  83.88  83.23  103.15  106.47
Fg  85.10  95.16  92.85  93.12  92.40  87.67  108.79
Fh  151.53  151.51  149.71  141.46  136.23  144.00  171.69
Gb  85.59  85.25  84.77  83.64  83.34  103.03  102.68
Gd  157.37  152.95  151.42  153.39  155.08  144.28  191.14
Hi  90.29  89.54  88.50  87.23  87.10  95.11  128.12
Hk  149.57  137.16  140.78  131.72  131.61  158.27  175.22
Hl  130.24  132.59  129.23  130.04  129.67  108.92  144.20
Ia  83.22  86.54  86.10  88.60  88.13  100.65  118.84
Ib  140.21  136.05  135.40  132.83  129.45  124.16  136.44
Id  122.52  134.42  134.45  135.06  133.13  113.36  125.31
Jh  119.73  118.34  116.65  148.23  148.88  130.23  136.04
Jf  137.57  159.15  160.30  120.33  119.46  108.42  171.17
Ka  81.38  77.31  77.07  75.72  76.45  75.60  96.70
Kb  146.35  142.16  141.15  134.27  130.85  159.13  160.19
Kc  371.17  384.20  382.86  385.35  378.47  365.17  410.77
Kd  172.14  168.21  167.22  163.50  161.21  159.61  218.94
In order to obtain a comprehensive comparison, we compare the experimental results from two views: the global and the local. Table 2 and Table 3 locally give the MAREs and the RMSEs of the 31 links for all the compared approaches. Fig. 5 globally shows, for each compared approach, the sum of the errors over the 31 links.
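Assuming the standard definitions of the two criteria, mean absolute relative error (as a percentage) and root mean squared error, they can be computed as follows; the flow values below are toy data for illustration only:

```python
import numpy as np

def mare(y_true, y_pred):
    """Mean absolute relative error, as a percentage."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / y_true))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy check with three hypothetical flow observations.
actual    = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]
print(mare(actual, predicted), rmse(actual, predicted))
```

MARE weights each error by the actual flow, so it penalizes relative deviations, while RMSE is dominated by the links with the largest absolute flows, which is why the two tables can rank the approaches differently on the same link.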
Besides these, we further compare all the proposed methods using a paired t-test. The resulting p-values for all pairs of compared approaches are listed in Table 4.

Table 4. t-test results (p-values) of all the compared approaches.

approaches  SSTL  SMTL    MSTL    MMTL    GPR     GL_NN   Hist_Avg
SSTL        -     0.2074  0.0036  0.0226  0       0       0.0394
SMTL        -     -       0.1231  0.1792  0.0838  0.0052  0.0302
MSTL        -     -       -       0.6961  0.9214  0.0615  0.0047
MMTL        -     -       -       -       0.7299  0.0612  0.0095
GPR         -     -       -       -       -       0.0183  0.0035
GL_NN       -     -       -       -       -       -       0.0005
Hist_Avg    -     -       -       -       -       -       -
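A paired t-test of this kind can be computed from the per-link errors of two approaches; the sketch below uses SciPy, with hypothetical error values rather than the figures from Table 2:

```python
import numpy as np
from scipy import stats

# Hypothetical per-link MAREs of two approaches over the same seven links.
errors_a = np.array([11.5, 7.7, 9.3, 9.8, 8.9, 11.9, 10.6])
errors_b = np.array([12.8, 8.0, 9.9, 10.3, 9.0, 12.5, 10.2])

# Paired t-test: the samples are paired by link, so ttest_rel is the
# appropriate variant; a small p-value means the mean errors of the two
# approaches differ significantly.
t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The pairing matters: an unpaired test would treat the large per-link variation (e.g. Kc versus Bb) as noise and lose most of the statistical power.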
Firstly, we compare all the proposed approaches with the baseline Hist_Avg. According to the experimental results shown in Table 2, Table 3 and Fig. 5, out of the 31 data sets, the numbers of data sets on which SSTL, SMTL, MSTL, MMTL, GPR and GL_NN perform better than Hist_Avg are 19, 21, 22, 22, 24 and 29, respectively, which means that all the proposed traffic flow forecasting approaches are superior to Hist_Avg.

Secondly, we compare the proposed approaches with each other, e.g., single-link approaches against multi-link approaches and single-task approaches against multi-task approaches. GPR is by nature a single-link single-task prediction approach, so when evaluating its performance we compare it with SSTL. In terms of MARE, GPR outperforms SSTL on 28 data sets according to Table 2, and the t-test results shown in Table 4 indicate that GPR is significantly better than SSTL. In the GPR column of Table 3, we marked in bold the entries on which GPR is better than SSTL: of the 31 links in total, there are 14 links on which GPR outperforms SSTL, and another 5 links (italics in the GPR column of Table 3) on which the error difference between GPR and SSTL is less than 1. In Fig. 5, GPR is globally better than SSTL. Therefore, we can still conclude that GPR outperforms SSTL in traffic flow forecasting.

According to the RMSE results in Table 3, in the columns corresponding to SSTL, SMTL, MSTL and MMTL, we marked in bold the two best of the four approaches for each link. The numbers of boldface entries are 11, 13, 17 and 21, respectively. We can therefore conclude that multi-link approaches perform better than single-link approaches, and that multi-task learning approaches are better than single-task learning approaches. This is also why MMTL performs best among the four approaches. GL_NN constitutionally belongs to the multi-link single-task prediction approaches.
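As a minimal sketch of the multi-link multi-task idea (not the authors' exact architecture), a neural network with a shared hidden layer and one output unit per link can predict the flows of several adjacent links jointly; scikit-learn's MLPRegressor supports such multi-output regression. All data below are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic flows of 3 adjacent links over 203 time intervals.
flows = 100.0 * rng.random((203, 3))

# Multi-link input: the three most recent flows of all 3 links;
# multi-task output: the next-interval flow of all 3 links at once.
X = np.hstack([flows[0:200], flows[1:201], flows[2:202]])
Y = flows[3:203]

# A shared hidden layer with one output unit per link realizes
# multi-task learning: all links share one internal representation.
net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
net.fit(X, Y)
print(net.predict(X[:2]).shape)  # one prediction per link
```

Dropping the extra input columns recovers a single-link model, and training one separate network per output recovers single-task learning, which is how the four combinations SSTL, SMTL, MSTL and MMTL arise.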
The difference between GL_NN and MMTL is that GL_NN extracts only the most relevant information rather than using all adjacent links. From Table 2, GL_NN performs better than MMTL on 22 data sets. In the GL_NN column of Table 3, we marked in bold the entries on which GL_NN is better than MMTL; it is easy to find that GL_NN is better than MMTL on 21 of the 31 links. The t-test result for GL_NN and MMTL in Table 4 also shows that GL_NN outperforms MMTL to some extent. These results fully verify the superiority of GL in extracting relevant information by building a sparse graphical model. Overall, GL_NN performs best on traffic flow forecasting among all the proposed approaches.
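A rough sketch of how graphical lasso can select relevant links follows; the data, penalty and threshold below are illustrative assumptions, and scikit-learn's GraphicalLasso stands in for the authors' implementation:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Synthetic flows of 5 links: links 0 and 1 share a common component,
# links 2-4 are independent noise.
base = rng.standard_normal((500, 1))
flows = np.hstack([
    base + 0.3 * rng.standard_normal((500, 1)),
    base + 0.3 * rng.standard_normal((500, 1)),
    rng.standard_normal((500, 3)),
])

# Graphical lasso estimates a sparse precision (inverse covariance)
# matrix; a zero off-diagonal entry means two links are conditionally
# independent given all the others.
model = GraphicalLasso(alpha=0.2).fit(flows)
precision = model.precision_

# For link 0, keep only the links with a nonzero partial correlation
# as inputs to the downstream neural network.
relevant = np.nonzero(np.abs(precision[0]) > 1e-6)[0]
print(relevant)
```

The sparsity is the point: instead of feeding every link's history into the network as MMTL does, only the links that remain conditionally dependent on the target link survive as inputs.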
Discussions on GPR
As a single-link single-task approach, GPR performs better than SSTL on traffic flow forecasting. In this section, we give a further illustration of the potential capability of GPR. The GPR algorithm actually outputs two terms: the mean and the variance of the target to be predicted. Strictly speaking, GPR gives the distribution of the targets rather than exact values; when computing the prediction errors, we use the mean as the prediction value. Take link Kd as an example: Fig. 6 shows the practical prediction results. The star curve represents the actual values, the dot curve represents the predicted values, and the shaded area is the fluctuating range of the targets predicted by GPR. As we can see, with the shaded part, more of the actual targets can be contained in the prediction scope.
Reducing the noise variance used in Fig. 6, we obtain a new prediction figure for link Kd, shown as Fig. 7. From Fig. 7, we can see that the fluctuating range becomes larger when the noise variance is reduced; the shaded area in Fig. 7 can even contain all the actual targets. This is the potential capability of GPR: it can give more precise predictive distributions when its parameters are tuned appropriately. Therefore, in applications where only the output range rather than a precise value needs to be predicted, GPR is an approach well worth considering.
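The two outputs of GPR and the role of the noise term in the width of the shaded band can be sketched with scikit-learn; the flow data are synthetic and the kernel choice and parameters are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic one-day flow profile at 15-minute intervals: a smooth
# daily trend plus observation noise.
X = np.linspace(0.0, 24.0, 96).reshape(-1, 1)
y = 300.0 + 200.0 * np.sin(X.ravel() * 2 * np.pi / 24) \
    + 20.0 * rng.standard_normal(96)

# RBF models the smooth trend; WhiteKernel models observation noise.
# The fitted noise level controls the width of the predictive band,
# which plays the role of the shaded area in Figs. 6 and 7.
kernel = RBF(length_scale=3.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# GPR outputs both terms: the predictive mean and standard deviation.
mean, std = gpr.predict(X, return_std=True)

# Fraction of actual flows falling inside the mean +/- 2*std band.
coverage = float(np.mean(np.abs(y - mean) <= 2.0 * std))
print(f"2-sigma band coverage: {coverage:.2f}")
```

Computing such a coverage fraction is one concrete way to compare the bands of Fig. 6 and Fig. 7: a wider band traps more of the actual targets, at the cost of a less informative prediction interval.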
Conclusions
Due to the disadvantage of the traditional single-link traffic flow forecasting model, we propose the multi-link model, which predicts traffic flows using historical data from all the adjacent links. By combining the single-link and multi-link models with single-task and multi-task learning, we obtain four basic traffic flow forecasting approaches: SSTL, SMTL, MSTL and MMTL. Graphical lasso (GL) is an effective approach for extracting the relevant information among the variables of complex problems by building a sparse graphical model. We use GL to extract the most informative historical flows from all the links in the whole transportation system, and then construct a BP neural network with the extracted data to predict traffic flows. We refer to the approach combining GL with NN as GL_NN. GL_NN is also a multi-link traffic flow forecasting approach, but it is more efficient than MMTL, and its test on real-world traffic flow forecasting shows competitive results. In addition, we apply GPR to traffic flow forecasting and discuss its potential. The experimental results reveal the superiority of GL_NN over the other proposed approaches, and further verify that multi-link approaches outperform single-link approaches and that multi-task learning approaches outperform single-task learning approaches in traffic flow forecasting.
In the future, three interesting aspects can be considered. Firstly, the potential of the multi-link model in traffic flow forecasting should be further studied. Secondly, GL can be combined with approaches other than NNs. Thirdly, considering that in practical applications a much lower traffic flow estimate for one road can attract vehicles from adjacent roads and thus cause subsequent traffic difficulties, the traffic flow prediction methods discussed in this paper can be further enhanced by investigating microcosmic prediction errors and taking precautionary actions against such underestimations.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Project 61075005, and the Fundamental Research Funds for the Central Universities.
References
Abdulhai, B., Porwal, H., and Recher, W., 1999. Short-term freeway flow prediction using genetically-optimized time-delay-based neural networks, in: Proceedings of the 78th Annual Meeting of the Transportation Research Board, Washington D.C., USA.
Abdulhai, B., Porwal, H., and Recher, W., 2002. Short-term traffic flow prediction using neuro-genetic algorithms, Journal of Intelligent Transportation Systems, vol. 7, 3–41.
Banerjee, O., Ghaoui, L., and Aspremont, A., 2008. Model selection through sparse maximum likelihood estimation, Journal of Machine Learning Research, vol. 9, 485–516.
Caruana, R., 1997. Multitask learning, Machine Learning, vol. 28(1), 41–75.
Chen, H. and Grant-Muller, S., 2001. Use of sequential learning for short-term traffic flow forecasting, Transportation Research Part C: Emerging Technologies, vol. 9(5), 319–336.
Chen, L. and Chen, C., 2007. Ensemble learning approach for freeway short-term traffic flow prediction, in: Proceedings of the IEEE International Conference on System of Systems Engineering, 1–6.
Dahl, J., Vandenberghe, L., and Roychowdhury, V., 2008. Covariance selection for non-chordal graphs via chordal embedding, Optimization Methods and Software, vol. 23(4), 501–520.
Davis, G., 1990. Adaptive forecasting of freeway traffic congestion, Transportation Research Record, vol. 1287, 29–33.
Davis, G. and Nihan, N., 1991. Nonparametric regression and short-term freeway traffic forecasting, Journal of Transportation Engineering, vol. 117(2), 178–188.
Duda, R., Hart, P., and Stork, D., 2001. Pattern Classification, John Wiley and Sons, New York.
Friedman, J., Hastie, T., and Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso, Biostatistics, vol. 9(3), 432–441.
Gao, Y. and Sun, S., 2010. Multi-link traffic flow forecasting using neural networks, in: Proceedings of the Sixth International Conference on Natural Computation (ICNC), 398–401.
Gao, Y., Sun, S., and Shi, D., 2011. Network-scale traffic modeling and forecasting with graphical lasso, in: Proceedings of the Eighth International Symposium on Neural Networks (ISNN), 151–158.
Hall, J. and Mars, P., 1998. The limitations of artificial neural networks for traffic prediction, in: Proceedings of the Third IEEE Symposium on Computers and Communications, 8–12.
Jin, F. and Sun, S., 2008. Neural network multitask learning for traffic flow forecasting, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 1898–1902.
Jordan, M., 2004. Graphical models, Statistical Science, vol. 19(1), 140–155.
Lee, S. and Fambro, D., 1999. Application of subset autoregressive integrated moving average model for short-term freeway traffic volume forecasting, Transportation Research Record, vol. 1678, 179–188.
Meinshausen, N. and Bühlmann, P., 2006. High-dimensional graphs and variable selection with the lasso, The Annals of Statistics, vol. 34, 1436–1462.
Moorthy, C. and Ratcliffe, B., 1988. Short term traffic forecasting using time series methods, Transportation Planning and Technology, vol. 12(1), 45–56.
Okutani, I. and Stephanedes, Y., 1984. Dynamic prediction of traffic volume through Kalman filter theory, Transportation Research Part B, vol. 18(1), 1–11.
Park, B., Messer, C., and Urbanik II, T., 1998. Short-term freeway traffic volume forecasting using radial basis function neural network, Transportation Research Record, vol. 1651, 39–47.
Rasmussen, C. and Williams, C., 2006. Gaussian Processes for Machine Learning, The MIT Press.
Smith, B. and Demetsky, M., 1994. Short-term traffic flow prediction: neural network approach, Transportation Research Record, vol. 1453, 98–104.
Smith, B. and Demetsky, M., 1997. Traffic flow forecasting: comparison of modeling approaches, Journal of Transportation Engineering, vol. 123(4), 261–266.
Sun, S. and Zhang, C., 2007. The selective random subspace predictor for traffic flow forecasting, IEEE Transactions on Intelligent Transportation Systems, vol. 8(2), 367–373.
Ulbricht, C., 1994. Multi-recurrent networks for traffic forecasting, in: Proceedings of the Twelfth National Conference on Artificial Intelligence, vol. 1, 883–888.
Wang, X. and Xiao, J., 2003. A radial basis function neural network approach to traffic flow forecasting, in: Proceedings of IEEE Intelligent Transportation Systems, vol. 1, 614–617.
Williams, B. and Hoel, L., 2003. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process, Journal of Transportation Engineering, vol. 129(6), 664–672.
Yu, G., Hu, J., Zhang, C., Zhuang, L., and Song, J., 2003. Short-term traffic flow forecasting based on Markov chain model, in: Proceedings of the IEEE Intelligent Vehicles Symposium, 208–212.
Zhang, Q. and Sun, S., 2010. Multiple-view multiple-learner active learning, Pattern Recognition, vol. 43, 3113–3119.