Real World Application Scenario
In this report, we present and discuss a potential application that estimates the traffic density at traffic lights and junctions using public cameras and adapts the traffic lights accordingly to optimize traffic flow.
Even if traffic flows only slowly, streets could handle the flow much more efficiently as long as it keeps moving; this means either more traffic in the same time or the same traffic in a shorter time. The key is that all cars move at a constant speed without much braking and accelerating. An intelligent traffic system could therefore detect the number of cars at every position, estimate the velocity of the cars in a later stage, and ultimately adapt the traffic lights accordingly to reach the optimal outcome. As described in [Wang, Vrancken, and Soares2009], top-down traffic control, which is the common approach, is completely centralized, and its control schemes are developed off-line.
The problem with this top-down control, based on specific scenarios triggered according to predefined patterns, is that it hardly fits well in practice [Wang, Vrancken, and Soares2009].
Just think of extraordinary situations such as changing weather, accidents, or other unplanned traffic fluctuations. Efficiency can be gained by adapting the traffic lights locally, taking the local traffic situation into account.
We identified several reasons why there is a need for this application in Singapore. For us, the crucial points are the following:
This application will help everyone who moves around Singapore frequently, so it is universally beneficial.
It reduces the time and cost of traffic congestion.
The higher traffic efficiency and fewer traffic jams also have a positive impact on the climate (by reducing greenhouse gas emissions such as CO2).
It is useful for future integration with autonomous vehicle technology, since it paves the way for efficient "fleet management".
The required infrastructure (cameras on top of traffic lights) is publicly available in Singapore and could easily be used.
For our system to be employable, certain requirements must be fulfilled. In the following we list the necessities for our intelligent traffic control:
Real Time: Receiving a camera image must lead to an instantaneous estimation and to the required traffic light adaptation.
Fail Safety: Since a malfunctioning traffic light system is highly dangerous, the system must be absolutely failsafe.
Superior Rules: It is still necessary to introduce overriding rules to avoid misbehavior, e.g. the starvation of cars that would otherwise never get a green light.
Work under different conditions: Our software must be versatile and should work in different situations (changing lighting, weather and traffic conditions) as well as at different places.
Streaming Data: We constantly receive data from the cameras. Therefore, we must be able to perform stream processing (incremental updates).
The first and principal interface between the application and the involved human drivers is quite obvious: the system gives the same outputs as a normal traffic light, and people follow this regulation as they did before. The system thus assists drivers without requiring them to pay attention to any additional signs. The second part of the interface involves pedestrians. What happens if people want to cross the road? If pedestrian lights are provided, we simply add them as an additional input to our pipeline. If not, there are two possibilities: perceive pedestrians with the camera as well and include them in our traffic decisions, or ignore them. Both have valid justifications, and the decision depends on the individual circumstances (e.g. compare a motorway to a residential play street).
For the latter case we therefore need to expand our decision policy.
We think that our application is not too critical in this respect, which is also one reason why we picked this application specifically.
The application does not displace jobs, because it simply improves existing traffic algorithms. The camera images are already publicly available, so training on them presents no privacy violation.
The main concern with this application is the possibility of exploitation for malicious intent. Consider, e.g., a scenario where a party wants to use this application commercially and to privilege cars whose owners have paid large amounts of money, leading to inequity. Hence, it is likely better to let the authorities be in charge of this application.
Also, making the system invulnerable to hacking attacks and being cautious about expanding the service to more critical activities would be among our important objectives.
We divide our application in two main topics:
Traffic Density Estimation and
Decision Making based on the Estimation.
The first part receives the live camera image of every lane facing the junction and, using this information, determines the traffic density on each of these lanes.
Based on this estimate, the task of the second part is to set the optimal traffic light state, while also respecting all of the requirements specified in the Requirements section.
In the next sections we present both parts; our main focus, however, is on the first one, for which machine learning algorithms are indispensable. We present our own pipeline, show how we approached this problem, and discuss how the results differ from our expectations. For the second part, we discuss existing approaches and their suitability.
As briefly described above, in the practical application the estimation part would receive a live image stream of every lane intersecting the junction. In our case, it was very difficult to find an appropriate training set in general, and specifically for Singapore. Therefore we decided to use the live camera data from Singapore (LTA, https://data.gov.sg/dataset/traffic-images).
We wrote a script to download images from all of the cameras over a weekend and selected the three cameras that seemed most suitable for our use case: those with varying traffic density over the day and a clearly visible road (i.e. unobstructed by trees, etc.). We then randomly partitioned the dataset into 90% for training and 10% for validation. We used images taken during both day and night. Three sample images showing different lighting and density conditions can be found in figure 1. Overall we had 4582 images available.
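A minimal sketch of the random 90/10 partition described above (file names are placeholders, not the actual dataset layout):

```python
import random

def split_dataset(image_paths, train_frac=0.9, seed=42):
    """Randomly partition image paths into training and validation sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)   # deterministic shuffle
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]

# Example with placeholder file names for the 4582 available images:
train, val = split_dataset([f"cam1/img_{i:04d}.jpg" for i in range(4582)])
print(len(train), len(val))  # 4123 459
```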
We decided to define 5 classes to categorize the images; they can be found in Table 1. We counted motorcycles as half cars.
| Class | Description | Approx. Count |
|---|---|---|
| Empty | Almost empty street | 0-8 cars |
| Low | Only a few cars | 9-20 cars |
| Medium | Slightly filled street | 21-50 cars |
| High | Filled street or blocked lane | 51-100 cars |
| Traffic Jam | Traffic almost not moving | >100 cars |
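The labeling rule can be sketched as a simple threshold function; the upper bounds for the Medium and High classes are our reading of the table and are assumptions:

```python
def density_class(car_count):
    """Map an (approximate) car count to one of the five density classes."""
    if car_count <= 8:
        return "Empty"
    if car_count <= 20:
        return "Low"
    if car_count <= 50:       # assumed upper bound for Medium
        return "Medium"
    if car_count <= 100:      # assumed upper bound for High
        return "High"
    return "Traffic Jam"

# Motorcycles count as half cars, e.g. 3 cars + 2 motorcycles:
print(density_class(3 + 2 * 0.5))  # Empty
```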
Among all the proposed advanced machine learning topics, the Convolutional Neural Network (CNN) was the most suitable approach for us.
Beyond the choice of this model, there are many other design decisions. We considered the following possibilities:
Feeding the machine learning pipeline with the raw image or with some extracted features (SIFT, SURF, etc.).
Preprocessing the image (cutting off unimportant parts or not).
Grayscale or colored image.
Resolution of the image.
Structure of the underlying neural network (activation functions, number of layers, etc.); see section ML Model in our Case for a more specific analysis.
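Several of these preprocessing choices can be sketched in a few lines. The crop fraction and target resolution below are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def preprocess(img, crop_top=0.4, size=(128, 128), grayscale=True):
    """Illustrative preprocessing: crop away the top of the frame (sky,
    buildings), optionally convert to grayscale, and downsample."""
    h, w = img.shape[:2]
    img = img[int(h * crop_top):, :]          # cut off unimportant upper part
    if grayscale and img.ndim == 3:
        img = img.mean(axis=2)                # naive RGB -> gray
    # nearest-neighbour downsampling via index arrays
    ys = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
    return img[np.ix_(ys, xs)]

frame = np.random.rand(480, 640, 3)           # stand-in for a camera frame
print(preprocess(frame).shape)                # (128, 128)
```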
For us, the most suitable of the alternative models would have been the Recurrent Neural Network (RNN). Due to its structure, in which connections form a directed graph along a sequence, an RNN can use its internal state as a memory. This allows it to process sequences of inputs (https://en.wikipedia.org/wiki/Recurrent_neural_network).
This would even make it possible to feed multiple sequential images into the pipeline and hence to estimate the velocity.
However, we did not attempt this because the images of the Singapore live dataset arrive at 20-second intervals, which we considered too long for RNNs to be used effectively. Thus, we chose to focus on traffic density estimation in this project.
Technical Details and Insights
Convolutional neural networks, or CNNs for short, are a special form of neural networks. They are especially well suited for processing inputs that have a grid-like topology. According to [Goodfellow, Bengio, and Courville2016], they have been tremendously successful in practical applications.
The same book defines CNNs simply by the following quote:
Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
In this discussion we will mainly focus on the distinctions and the advantages this topology induces in comparison to general neural networks.
Mathematically, a convolution is commonly denoted as

s(t) = (x ∗ w)(t) = ∫ x(a) w(t − a) da   (1)

In the case of CNNs, the first argument in equation 1 is usually referred to as the input, the second argument as the kernel, and the output is sometimes called the feature map.
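In the discrete, two-dimensional case used for images, the feature map is computed by sliding the (flipped) kernel over the image; the values below are illustrative only:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: the kernel is flipped (true convolution)
    and slid over the image; each output is a weighted sum of a patch."""
    kh, kw = kernel.shape
    k = np.flipud(np.fliplr(kernel))            # flip for convolution
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * k)
    return out

# A tiny image with a vertical edge and a horizontal-gradient kernel:
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])
kernel = np.array([[1., 0., -1.]])
print(conv2d(img, kernel))  # every entry is 1.0: constant edge response
```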
There are three key ideas explaining why convolution can help improve a machine learning system, in our case CNNs compared to typical neural networks (typical NNs):
Sparse Interactions,
Parameter Sharing, and
Equivariant Representations.
Moreover, convolution makes it possible to process variable-sized inputs.
Compared to typical NNs, where every input unit interacts with every output unit, CNNs use sparse interactions. The reason for this is that the kernel is chosen to be smaller than the input. This means we need to store fewer parameters which reduces requirements in both memory and calculations. According to [Goodfellow, Bengio, and Courville2016] the improvements in efficiency are usually quite large. The difference between conventional NNs and CNNs is illustrated in figure 2.
It is important to mention that neurons in deeper layers may still indirectly interact with a larger portion of the input. This enables the network to capture complicated interactions between the simple building blocks and hence to detect more complicated structures in the input. For clarification, refer to figure 3.
Parameter Sharing describes the use of the same parameter for multiple model functions. It is often also denoted as tied weights, since the weight applied to one input value is simply tied to a weight applied somewhere else. In more detail, this means that each kernel is used at every position of the input (except perhaps the boundaries); therefore we learn only one set of parameters, which further reduces the storage requirements. According to [Goodfellow, Bengio, and Courville2016],
Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency.
As a result of parameter sharing, we obtain another important property: equivariance to translation. This property ensures that a pixel shift of the input image simply produces the same shift in the output. If g(I) denotes the shifted version of an image I, it makes no difference whether we apply the convolution to the shifted image g(I), or apply the convolution to the original image I and then shift the result.
The third stage of each layer, after performing the convolutions and applying the activation functions, is often called pooling. A pooling function replaces the output at a certain location with a summary statistic of nearby outputs (typically the maximum value, the average, the norm of a rectangular neighborhood, or some other weighted average, e.g. based on the distance from the center). Pooling helps to make the output almost invariant to small translations, and for many tasks it is essential to make the network applicable to inputs of varying size.
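A minimal max-pooling sketch illustrating the summary-statistic idea (window and stride chosen for illustration):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """2x2 max pooling: replace each neighbourhood with its maximum,
    making the output robust to small translations of the input."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])
print(max_pool(fmap))  # [[4. 2.] [2. 8.]]
```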
For the description of these properties we followed the book by [Goodfellow, Bengio, and Courville2016], which we highly recommend.
Advantages and Exploitation of Model
We model the problem as the task of classifying images into 5 traffic density classes. Based on their properties, CNNs are well suited for this, especially given the definition from above that CNNs are
especially well suited for the processing of inputs which have known grid-like topology. [Goodfellow, Bengio, and Courville2016]
The size of the convolution kernel is also a big advantage of CNNs. An image might have thousands or millions of pixels, but we can detect small and meaningful features (e.g. edges, corners) with kernels consisting of only tens or hundreds of pixels.
Parameter sharing is likewise a very useful property in our case, since it significantly reduces the number of parameters, which would otherwise be huge when feeding a whole image into a typical NN.
The translational equivariance can also be very helpful due to the general reasons mentioned above.
Pooling, finally, has no negative effects in our case: the exact location of the structures crucial for detecting cars is not fixed, since the cars are moving anyway. So the typical big disadvantage of pooling, degraded performance in situations where the exact location matters, is not an issue here.
ML Model in our Case
State-of-the-art models have empirically demonstrated good performance on general image classification tasks [Szegedy et al.2016]. However, training these large models from scratch on our dataset is slow and prone to overfitting. Instead, we explore the use of transfer learning [Yosinski et al.2014]: we make use of the pre-trained weights of an InceptionV3 model by removing its penultimate layer and training a new softmax layer on top of it to produce predictions for our task. This lets InceptionV3 extract general image features for us and allows us to train new models very quickly, since we only have to train the additional layers.
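The mechanics of this setup can be illustrated framework-free: a frozen feature extractor (here a stand-in random projection instead of the real pre-trained InceptionV3; all shapes and synthetic data are assumptions for the demo) plus a newly trained softmax head, which is the only part that receives gradient updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen pre-trained extractor; in the real pipeline this
# is InceptionV3 without its top layer.
W_frozen = rng.normal(size=(64, 16))
def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0) / 8.0   # frozen, never updated

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Synthetic "images" whose labels are a linear function of the features,
# so the new head alone can fit them.
X = rng.normal(size=(200, 64))
F = extract_features(X)                  # computed once: extractor is frozen
y = (F @ rng.normal(size=(16, 5))).argmax(axis=1)

W_head = np.zeros((16, 5))               # the only trainable parameters
for _ in range(500):                     # plain gradient descent on the head
    P = softmax(F @ W_head)
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0       # gradient of softmax cross-entropy
    W_head -= 0.5 * F.T @ G / len(y)

acc = (softmax(F @ W_head).argmax(axis=1) == y).mean()
print(f"training accuracy of the new softmax head: {acc:.2f}")
```

Because the extractor's features are computed once and reused, each training step only touches the small head, which is why transfer learning trains so quickly.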
With the traffic density estimation designed, we now show that basic approaches exist for using these estimates to solve the traffic light control problem.
One possibility was shown by [Gao et al.2017], who proposed a deep reinforcement learning algorithm that extracts all useful machine-crafted features from raw real-time traffic data, with the goal of learning the optimal policy for adapting the traffic lights. Impressively, they were able to reduce vehicle delay by up to 47% compared to the well-known longest-queue-first algorithm, and even by up to 86% compared to fixed-time control. The key to this approach is the formulation of traffic signal control as a reinforcement learning problem, in which the goal of the agent is to reduce the vehicles' staying time in the long run. The agent is rewarded at each time step for choosing actions that decrease the time vehicles spend waiting at the intersection.
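A simplified illustration of this reward idea (the exact formulation in the paper differs; the queue values here are made up):

```python
def reward(prev_wait_times, curr_wait_times):
    """Reward = reduction in total waiting time at the intersection
    between two control steps: positive when the chosen signal phase
    reduced congestion, negative when queues grew."""
    return sum(prev_wait_times) - sum(curr_wait_times)

# Per-lane waiting times (seconds) before and after a signal decision:
print(reward([30, 12, 0, 5], [25, 8, 0, 6]))  # 8
```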
Another possibility, admittedly not quite state of the art, is to use genetic algorithms, as proposed by [Singh, Tripathi, and Arora2009]. This paper presents a strategy that grants appropriate green time extensions in order to minimize a fitness function, which consists of a linear combination of the performance indexes of all four lanes used in their example. This approach achieves a performance increase of 21.9%, which is not as good as the reinforcement learning policy from the previous paragraph.
Training and Ways of Finding the Solution
With the traffic algorithm available, we also require proper simulation environments.
The first simulator we tried was the Aimsun Next traffic modeling software (https://www.aimsun.com/aimsun-next/). The tool allows whole cities to be imported and simulated, which we considered too massive in scale for our use case.
A better alternative is the popular open-source simulator Simulation of Urban MObility (SUMO, http://sumo.dlr.de/userdoc/). SUMO is a lightweight program with a Python API; it can be configured using simple XML files and controlled from the terminal. This allows us not only to verify the results but also to actively use the simulation during training, e.g. when a reinforcement learning approach is used as presented in [Gao et al.2017]. It is also possible to visualize the results in a GUI; an example image can be seen in figure 4.
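SUMO scenarios are assembled from such XML files; a minimal configuration might look like the following sketch (file names are placeholders; see the SUMO user documentation for the full schema):

```xml
<configuration>
    <input>
        <!-- road network and vehicle routes, generated beforehand -->
        <net-file value="junction.net.xml"/>
        <route-files value="traffic.rou.xml"/>
    </input>
    <time>
        <begin value="0"/>
        <end value="3600"/>   <!-- simulate one hour -->
    </time>
</configuration>
```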
Tests and Experiments
We use self-labeled traffic images from 3 cameras of different junctions and angles as seen in figure 1. Each image is labeled with a density level from empty to traffic jam.
Traffic density estimation can be modeled as a multi-class classification problem and be solved by CNN classifiers. We identified the following approaches for making use of CNNs and investigated their effectiveness for this problem:
Basic CNN and
Transfer Learning on InceptionV3.
We trained the two classifiers on the dataset and evaluated them on several metrics: accuracy, F1 score (extended to the multi-class scenario by averaging the independently computed per-class F1 scores), and top 2 accuracy (the frequency with which an example is correctly labeled by one of the two highest-ranked predictions). All metrics were measured by training a model on the training set and evaluating on a separate validation set, averaged over 10 runs. All results shown in Table 2 were obtained with a cross-validated classifier whose hyperparameters were selected based on accuracy; the cross-validation was performed using a simple grid search.
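The two non-standard metrics can be computed as follows (the tiny prediction matrix is made-up demo data):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Average of per-class F1 scores, each computed one-vs-rest."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def top2_accuracy(y_true, probs):
    """Fraction of examples whose true class is among the two
    highest-probability predictions."""
    top2 = np.argsort(probs, axis=1)[:, -2:]
    return float(np.mean([t in row for t, row in zip(y_true, top2)]))

y_true = np.array([0, 1, 2, 2])
probs = np.array([[.7, .2, .1],
                  [.1, .3, .6],
                  [.2, .3, .5],
                  [.4, .5, .1]])
y_pred = probs.argmax(axis=1)
print(macro_f1(y_true, y_pred, 3), top2_accuracy(y_true, probs))  # 0.5 0.75
```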
| Classifier | Accuracy | F1 | Top 2 Accuracy |
|---|---|---|---|
| Classifier | Training Time (CPU) | Training Time (GPU) |
|---|---|---|
In Table 3 you can find the training times with and without a GPU.
A simple CNN provided overall better performance than transfer learning on InceptionV3. This is likely because we froze all of InceptionV3's weights, and the higher-level features learned from the larger dataset cannot be transferred to our dataset [Yosinski et al.2014]. In the future, we could explore freezing only the bottom layers.
Although the CNN provides better overall accuracy than transfer learning, it takes significantly longer to train without a GPU (all time measurements are for 50 epochs of training with no extra preprocessing; GPU measurements were conducted on an Nvidia GTX 1080 Ti). Therefore, transfer learning is a viable approach if computation power is limited, while the CNN is the preferred approach when GPUs are available.
Other Possibilities Tested
Due to the uneven distribution of classes in our self-labeled dataset (see Table 4), we also explored the following measures to handle class imbalance (CI) and compared them through experiments:
Ratio-weighted losses: We scale the cross entropy loss contributed by each example according to its class ratio, using the following formulation:

w_c = N / (K · N_c),   L = (1/N) Σᵢ w_{y_i} · ℓ_CE(x_i, y_i)

where N is the total number of examples, K the number of classes, N_c the number of examples of class c, and ℓ_CE the per-example cross entropy loss.
This increases the cost of misclassifying a minority class, forcing the learner to prioritize the correct classification of minority classes [Eigen and Fergus2014].
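One common formulation of such ratio weighting, which may differ from the exact scheme we used, assigns each class the inverse of its relative frequency:

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency weights w_c = N / (K * N_c): the rarer a class,
    the larger its weight, so every class contributes roughly equally."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return len(labels) / (n_classes * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Cross entropy with each example's loss scaled by its class weight."""
    p = probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * -np.log(p)))

labels = np.array([0, 0, 0, 1])              # class 0 dominates 3:1
w = class_weights(labels, 2)                 # -> [2/3, 2.0]
probs = np.full((4, 2), 0.5)                 # a maximally unsure classifier
loss = weighted_cross_entropy(probs, labels, w)
print(w, loss)
```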
Real-time data augmentation: By performing basic image transformations on existing data, we can generate new examples on-the-fly during training, increasing the variety of examples seen by the classifier. As can be seen in Table 5, this leads to better accuracy for the minority classes, as we effectively obtain more training examples for them [Wong et al.2016].
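A sketch of on-the-fly augmentation; the flip/shift/brightness transformations and their parameters are illustrative assumptions, not the exact set used in our experiments:

```python
import numpy as np

def augment(img, rng):
    """Generate a randomly transformed copy of a training image."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # horizontal flip
    dx, dy = rng.integers(-4, 5, size=2)        # small translation
    out = np.roll(out, (dy, dx), axis=(0, 1))
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness jitter
    return out

rng = np.random.default_rng(1)
img = np.random.rand(32, 32)                    # stand-in grayscale image
batch = np.stack([augment(img, rng) for _ in range(8)])  # 8 fresh variants
print(batch.shape)  # (8, 32, 32)
```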
| Method | Accuracy | F1 | Top 2 Accuracy |
|---|---|---|---|
| Basic CNN with CI measures | | | |
With CI measures, accuracy increased slightly, but F1 scores increased significantly (for class imbalance scenarios, the F1 score is a better predictor of performance than accuracy, since it accounts for precision and recall). Therefore, the CI measures have brought significant improvements. Notably, the top 2 accuracy stays about the same, which indicates that class imbalance does not affect the top 2 predictions.
Since the traffic images contain two opposite traffic lanes, we also propose the use of image masking to remove the parts of the images that do not belong to the lane of interest (a visualization of a masked preprocessed image can be found at https://youtu.be/KA4SbJVX0mc).
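The masking itself is an element-wise multiplication with a binary mask; in practice the mask would be a hand-drawn polygon per camera, while the rectangular mask below is a simplified stand-in:

```python
import numpy as np

def mask_lane(img, lane_mask):
    """Zero out everything outside the lane of interest."""
    return img * lane_mask

img = np.random.rand(4, 6)       # stand-in grayscale image
mask = np.zeros((4, 6))
mask[:, :3] = 1.0                # keep only the left half (observed lane)
masked = mask_lane(img, mask)
print(masked[:, 3:].sum())       # 0.0  (opposite lane removed)
```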
| Method | Accuracy | F1 | Top 2 Accuracy |
|---|---|---|---|
| Basic CNN with masking | | | |
Other Tools, Online Resources
We made use of the data-labelling tool Labelbox (https://www.labelbox.io/). It provides a user-friendly web interface that allowed us to collaboratively label the entire dataset.
For the experiments and implementation, besides Keras, we used OpenCV for general image processing (e.g. masking the non-relevant parts of the images and resizing them). matplotlib was also used for general data visualization.
Our most important online resource was the Singaporean live camera dataset mentioned above. We wrote all scripts for downloading, processing, and classifying these images ourselves.
Manpower for the project was managed by assigning each person to a task they were suited for. In the brainstorming phase, everyone was given time to develop their own ideas, and we chose among all the ideas by majority vote.
Moreover, unpleasant tasks (such as labeling the data) were divided equally among the members.
Looking at the final results in Table 6, we are happy with these numbers; it was quite a long way to get to this point with only 4582 images available. An accuracy of 74.3% and an F1 score of 81.3% are already quite satisfying, and in practice, with a frame rate of a few images per second, averaging the classification over a certain time interval (like a "low-pass filter") will very likely produce good results. The 94% top 2 accuracy is also very significant for us, because we labeled the images intuitively; hence the tendency is almost as important as the exact classification. E.g., if a low density is classified as empty, that is still a useful insight. The high top 2 accuracy confirms this.
Of course, before bringing the application to market, further practical tests would be required. But we are very confident that its potential is high, while it is also not too hard to guarantee the general requirements in this case (e.g. real-time behavior).
A big improvement would be to use a larger training set, which was hardly possible for us because of the limited time.
To increase the accuracy as well as the F1 scores, it would also help to label more precisely, i.e. to really count the number of cars. But as mentioned above, classifying the right tendency is already very helpful for our application.
- [Eigen and Fergus2014] Eigen, D., and Fergus, R. 2014. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. CoRR abs/1411.4734.
- [Gao et al.2017] Gao, J.; Shen, Y.; Liu, J.; Ito, M.; and Shiratori, N. 2017. Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network. CoRR abs/1705.02755.
- [Goodfellow, Bengio, and Courville2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
- [Singh, Tripathi, and Arora2009] Singh, L.; Tripathi, S.; and Arora, H. 2009. Time optimization for traffic signal control using genetic algorithm. 2.
- [Szegedy et al.2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
- [Wang, Vrancken, and Soares2009] Wang, Y.; Vrancken, J.; and Soares, M. 2009. Intelligent network traffic control by integrating top-down and bottom-up control.
- [Wong et al.2016] Wong, S. C.; Gatt, A.; Stamatescu, V.; and McDonnell, M. D. 2016. Understanding data augmentation for classification: When to warp? 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA) 1–6.
- [Yosinski et al.2014] Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, 3320–3328.