Traffic Density Estimation using a Convolutional Neural Network

09/05/2018 ∙ by Julian Nubert, et al. ∙ 0

The goal of this project is to introduce and present a machine learning application that aims to improve the quality of life of people in Singapore. In particular, we investigate the use of machine learning solutions to tackle the problem of traffic congestion in Singapore. In layman's terms, we seek to make traffic in Singapore (or any other city) flow more smoothly. To accomplish this aim, we present an end-to-end system comprising (1) a traffic density estimation algorithm at traffic lights/junctions and (2) a suitable traffic signal control algorithm that makes use of the density information for better traffic control. Traffic density can be estimated from traffic junction images using various machine learning techniques (combined with computer vision tools). After research into various advanced machine learning methods, we decided on convolutional neural networks (CNNs). We conducted experiments on our algorithms using the publicly available traffic camera dataset published by the Land Transport Authority (LTA) to demonstrate the feasibility of this approach. With these traffic density estimates, different traffic algorithms can be applied to minimize congestion at traffic junctions in general.




Real World Application Scenario

In this report, we present and discuss a potential application which estimates the traffic density at traffic lights/junctions using public cameras to adapt the traffic lights accordingly to get the best result.


Even when traffic is flowing very slowly, streets could handle the traffic flow much more efficiently; this means either more traffic in the same time or the same traffic in a shorter time. The key is that all cars move at a constant speed without much braking and accelerating. Therefore, an intelligent traffic system could detect the number of cars at every position, estimate the velocity of the cars in a later stage, and ultimately adapt the traffic lights accordingly to get the optimal outcome. As described in [Wang, Vrancken, and Soares2009], the top-down traffic control in general use is completely centralized and its control schemes are developed off-line.

The problem of this top-down control, which is based on specific scenarios triggered according to some patterns, is that it hardly fits well in practice. [Wang, Vrancken, and Soares2009]

Just think about extraordinary situations such as changing weather, accidents or other unplanned traffic fluctuations. Efficiency can be gained by locally adapting the traffic lights, considering the local traffic situation.


We identified several reasons why there is a need for this application in Singapore. For us, the crucial points are the following:

  • This application will help everyone who moves around Singapore frequently, so it is universally beneficial.

  • It reduces the time and cost of traffic congestion.

  • The higher efficiency in traffic and fewer traffic jams also have a positive impact on the climate (by reducing greenhouse gas emissions such as CO2).

  • It is useful for future integration with autonomous vehicle technology since it paves the way for efficient "fleet management".

  • The required infrastructure (cameras on top of traffic lights) is publicly available for Singapore and could be easily used.


For our system to be deployable, certain requirements must be fulfilled. In the following we list the necessities for our intelligent traffic control:

  1. Real Time: Receiving a camera image must lead to an immediate estimation and to the required traffic light adaptation.

  2. Fail Safeness: Since a malfunctioning traffic light system is highly dangerous, it must be absolutely failsafe.

  3. Superior Rules: It is still necessary to introduce some rules to avoid wrongdoing, e.g. to avoid starving of cars.

  4. Work under different conditions: Our software must be versatile and should work in different situations (changing lighting, weather and traffic conditions) as well as at different places.

  5. Streaming Data: We are constantly receiving data by the cameras. Therefore, we must be able to perform Stream Processing (incrementally).

Human-Application Interface

The first and designated interface between the application and the human drivers involved is quite obvious. The system gives the same outputs as a normal traffic light, and people follow these signals as they did previously. The system therefore helps humans without requiring them to pay attention to any additional signs. The second part of the interface involves pedestrians. What happens if people want to cross the road? If pedestrian lights are provided, we simply add an additional input to our pipeline. If not, there are two possibilities: try to perceive pedestrians using the camera as well and include them in our traffic decisions, or just ignore them. Both have valid reasons, and the decision depends on the individual circumstances (e.g. compare a motorway to a play street).

If pedestrians are perceived by the camera, we need to expand our decision policy accordingly.

Ethical Implications

We think that our application is not too critical in this respect, which is also a reason for us to pick this specifically.

The application does not displace jobs because it simply improves existing traffic algorithms. Camera images are already available publicly, and training on them presents no privacy violation.

The potential concern with this application is the possibility of exploitation for malicious intent. E.g. consider a scenario where a party wants to use this application commercially and to privilege some cars who have paid large amounts of money, leading to inequity. Hence, it is likely better to let the authorities be in charge of this application.

Robustness against hacking attacks would also be one of our important objectives, in particular if the service were expanded to more critical activities.

Algorithmic Structure

We divide our application into two main parts:

  1. Traffic Density Estimation and

  2. Decision Making based on the Estimation.

The first part receives the live camera image of every lane facing towards the traffic junction. Using this information, it then deals with determining the traffic density on each of the lanes.

Using this information, the task of the second part is then to set up the optimal traffic light state considering also all of the requirements specified in the Requirements section.

In the next sections we will present both parts; however, our main focus will be on the first one. For this one, there’s no way of getting around machine learning algorithms. Therefore, we present our own pipeline, show how we approached this problem and will discuss how the results differ from our expectations. For the second part, we will discuss existing approaches and their suitability.



As briefly described above, in the practical application the estimation part would receive a live image stream of every lane leading into the junction. In our case, it was very difficult to find an appropriate training set in general and specifically for Singapore. Therefore we decided to use the live traffic camera data published by the LTA in Singapore.

We wrote a script to download images from all of the cameras over a weekend and selected three cameras which seemed to be the most suitable for our use case; we chose those that show varying traffic density over the days and contain a clearly visible road (i.e. unobstructed by trees, etc.). We then randomly partitioned the resulting dataset into 90% for training and 10% for validation. We decided to use images taken during both daytime and nighttime. Three sample images with different lighting and density conditions are shown in figure 1. Overall we had 4582 images available.
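The random 90%/10% partition described above can be sketched as follows. This is a minimal illustration only: the file names, the fixed seed and the helper `split_dataset` are assumptions made for the example and are not taken from the original scripts.

```python
import random


def split_dataset(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of `items` and split it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the downloaded camera images.
image_paths = [f"img_{i:04d}.jpg" for i in range(4582)]
train_paths, val_paths = split_dataset(image_paths)
print(len(train_paths), len(val_paths))
```

Shuffling before cutting matters here because the images were downloaded in chronological order; without it, the validation set would contain only the last hours of the weekend.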

Figure 1: Image from Camera1/Camera2/Camera3 at Night/Day/Day with High/Low/Traffic Jam Density

We decided to define 5 classes to categorize the images. They can be found in Table 1. We counted motorcycles as half cars.

Class        Meaning                         Definition
Empty        Almost empty street             0-8 cars
Low          Only a few cars                 9-20 cars
Medium       Slightly filled street          50 cars
High         Filled street or blocked lane   100 cars
Traffic Jam  Traffic almost not moving       100 cars
Table 1: Definition of Traffic Density


Amongst all the proposed advanced machine learning topics, Convolutional Neural Network was the most suitable approach for us.

Beyond the choice of this model, there are many further design choices. We considered the following options:

  • Feeding the machine learning pipeline with the raw image or with some extracted features (SIFT, SURF, etc.).

  • Preprocess the image (cut off unimportant parts or not).

  • Grayscale or colored image.

  • Resolution of the image.

  • Structure of the underlying Neural Network (activation functions, number of layers, etc.), see section ML Model in our Case for a more specific analysis.

Alternative Model

For us, the most suitable of the alternative models would have been the Recurrent Neural Network (RNN). Because its connections form a directed graph along a sequence, an RNN can use its internal state as a memory. This allows it to process sequences of inputs.

This would make it possible to feed multiple sequential images into the pipeline and hence to estimate the velocity of the cars.

However, we did not attempt this because the images in the Singapore live dataset (see footnote 1) are taken at 20-second intervals, which we considered to be too long for RNNs to be used effectively. Thus, we chose to focus on traffic density estimation in this project.

Technical Details and Insights

Convolutional neural networks, or CNNs for short, are a special form of Neural Networks. They are especially well suited for the processing of inputs which have a grid-like topology. According to [Goodfellow, Bengio, and Courville2016] they have been tremendously successful in practical applications.

The same book defines CNNs simply by the following quote:

Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

In this discussion we will mainly focus on the distinctions and the advantages this topology induces in comparison to general neural networks.


Normally a mathematical convolution is denoted as:

$$ s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da \qquad (1) $$

In the case of CNNs, the first argument in equation 1 (here $x$) is usually referred to as the input, whereas the second argument (here $w$) is referred to as the kernel. The output is sometimes called the feature map.
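To make the roles of input, kernel and feature map concrete, here is a minimal sketch of a discrete one-dimensional convolution in plain Python. The function name `convolve1d` and the example values are assumptions chosen for illustration; only positions where the kernel fully overlaps the input are computed.

```python
def convolve1d(signal, kernel):
    """Discrete convolution, restricted to positions where the kernel fully overlaps the signal."""
    n, k = len(signal), len(kernel)
    flipped = kernel[::-1]  # convolution flips the kernel before sliding it over the input
    return [
        sum(signal[t + i] * flipped[i] for i in range(k))
        for t in range(n - k + 1)
    ]


x = [1, 2, 3, 4, 5]  # input
w = [1, 0, -1]       # kernel
feature_map = convolve1d(x, w)
print(feature_map)  # → [2, 2, 2]
```

The kernel used here responds to the local slope of the input, which is constant for this ramp, so every entry of the feature map is the same.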


There are three important ideas that explain why convolution can help improve a machine learning system, in our case CNNs compared to typical neural networks (typical NNs):

  1. Sparse Interactions,

  2. Parameter Sharing and

  3. Equivariant Representations.

Moreover, the convolution provides a good possibility to process variable sized inputs.


Compared to typical NNs, where every input unit interacts with every output unit, CNNs use sparse interactions. The reason for this is that the kernel is chosen to be smaller than the input. This means we need to store fewer parameters which reduces requirements in both memory and calculations. According to [Goodfellow, Bengio, and Courville2016] the improvements in efficiency are usually quite large. The difference between conventional NNs and CNNs is illustrated in figure 2.

Figure 2: Units affecting the output unit; Left: Formed by normal matrix multiplication; Right: Formed by convolution, see [Goodfellow, Bengio, and Courville2016]

It is important to mention that neurons in deeper layers may still indirectly interact with a larger portion of the input. This enables the network to capture complicated interactions between the simple building blocks and hence to detect more complicated structures in the input. For clarification refer to figure 3.

Figure 3: Illustration of Deeper Interaction of Neurons, see [Goodfellow, Bengio, and Courville2016]

Parameter Sharing describes the usage of the same parameter for multiple model functions. It is also often referred to as tied weights, since the weight value applied to one input value is tied to the weight value applied somewhere else. In more detail, this means that each kernel is used at every position of the input (except maybe the boundaries). Therefore we learn only one set of parameters instead of a separate set for every location. This further reduces the storage requirements. According to [Goodfellow, Bengio, and Courville2016],

Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency.
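The effect of sparse interactions and parameter sharing on the number of weights can be illustrated with a short calculation. The layer sizes below (a 64x64 RGB input, 16 filters of size 3x3) are assumed example values and are not taken from the experiments in this report.

```python
def dense_params(n_inputs, n_outputs):
    """Number of weights of a fully connected layer (biases ignored)."""
    return n_inputs * n_outputs


def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights of a convolutional layer with shared kernels (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


# Assumed example sizes for illustration only.
height, width, channels = 64, 64, 3
n_filters = 16

# A dense layer producing an output of the same spatial size with 16 channels.
dense = dense_params(height * width * channels, height * width * n_filters)
# A convolutional layer with 16 filters of size 3x3.
conv = conv_params(3, 3, channels, n_filters)

print(dense, conv)  # → 805306368 432
```

Under these assumptions the convolutional layer stores 432 weights where the dense layer would need roughly 800 million.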

As a result of parameter sharing, we observe another important property: equivariance to translation. This property means that shifting the input shifts the output in the same way. If we define I' = g(I) as the shifted image, it makes no difference whether we apply the convolution to the shifted image I' or apply the convolution to the original image I and then shift the result.
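Translation equivariance can be checked numerically on a small example. The sketch below is an illustration under simplifying assumptions: it works on a single image row, wraps around at the borders so that the property holds exactly, and applies the kernel without flipping it (cross-correlation, which is what most deep learning libraries implement under the name convolution). The helper names are hypothetical.

```python
def circular_correlate(signal, kernel):
    """Apply `kernel` at every position of `signal`, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[i] * signal[(t + i) % n] for i in range(len(kernel)))
        for t in range(n)
    ]


def shift(signal, steps):
    """Circularly shift `signal` to the right by `steps` positions."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps]


image_row = [0, 0, 5, 9, 5, 0, 0, 0]
kernel = [1, -1]

shift_then_conv = circular_correlate(shift(image_row, 2), kernel)
conv_then_shift = shift(circular_correlate(image_row, kernel), 2)
print(shift_then_conv == conv_then_shift)  # → True
```

Both orders of operation give the same feature map, which is exactly the statement that convolution commutes with translation.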


The third stage of each layer, after performing the convolutions and applying the activation functions, is often called pooling. Here, a pooling function replaces the output at a certain location with a summary statistic of nearby outputs. Typically, pooling extracts the maximum value, the average, the L2 norm of a rectangular neighborhood, or a weighted average (e.g. based on the distance from the center). Pooling helps to make the output approximately invariant to small translations, and for many tasks it is essential for making the network applicable to inputs of varying size.
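A minimal sketch of max pooling on a one-dimensional feature map shows how a small translation of the input can leave the pooled output unchanged. The window size, stride and example values are assumptions for illustration; real CNNs pool over two-dimensional neighborhoods.

```python
def max_pool_1d(values, window=2, stride=2):
    """Replace each window of `values` by its maximum."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, stride)
    ]


features = [0, 9, 0, 0, 0, 4, 0, 0]
shifted = [9, 0, 0, 0, 4, 0, 0, 0]  # same features, moved one position to the left

pooled = max_pool_1d(features)
pooled_shifted = max_pool_1d(shifted)
print(pooled, pooled_shifted)  # → [9, 0, 4, 0] [9, 0, 4, 0]
```

The invariance is only approximate: a shift that moves a peak across a window boundary does change the pooled output.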

For describing the mentioned properties we used the book of [Goodfellow, Bengio, and Courville2016], which we highly recommend.

Advantages and Exploitation of Model

We model the problem as a task of classifying images into 5 traffic density classes. CNNs are suitable for this because of their properties described above, in particular because they are "especially well suited for the processing of inputs which have known grid-like topology" [Goodfellow, Bengio, and Courville2016].

The small size of the convolution kernel is also a big advantage of using CNNs. An image might have thousands or millions of pixels, but we can detect small and meaningful features (e.g. edges, corners) with kernels that consist of only tens or hundreds of pixels.

Parameter sharing is also a very useful property in our case, since it significantly reduces the number of parameters, which would be very large for a typical NN that takes a whole image as input.

The translational equivariance can also be very helpful for the general reasons mentioned above.

Pooling has no negative effects in our case, since the exact location of the crucial structures (for detecting the cars) is not fixed: the cars are moving anyway. So one of the typical big disadvantages of pooling, namely degraded performance in situations in which the exact location is important, is not an issue.

ML Model in our Case

State-of-the-art models have been empirically demonstrated to perform well on general image classification tasks [Szegedy et al.2016]. However, training these large models from scratch on our dataset is slow and prone to overfitting. Instead, we can explore the use of transfer learning [Yosinski et al.2014]. We make use of pre-trained weights from an InceptionV3 model by removing its penultimate layer and training a new softmax layer on top of it to produce predictions for our task. This allows us to use InceptionV3 to extract general image features and to train new models very quickly, since we only have to train the additional layers.

Decision Making

With the traffic estimation approaches designed, we demonstrate that there exist basic approaches for using these estimates to solve the traffic algorithm problem.

Possible Approaches

Reinforcement Learning

One of those possibilities was shown by [Gao et al.2017], who proposed a deep reinforcement learning algorithm which extracts all useful machine-crafted features from raw real-time traffic data. The goal was to learn the optimal policy for adapting the traffic lights. Impressively, they were able to reduce the vehicle delay by up to 47% compared to the well-known longest-queue-first algorithm and even by up to 86% compared to fixed-time control. The key behind this approach is the formulation of the traffic signal control problem as a reinforcement learning problem. In this case the goal of the agent is to reduce the vehicle staying time in the long run. The agent receives a reward at each time step for choosing actions that decrease the time vehicles stay at the intersection.

Genetic Algorithms

Another possibility, which is admittedly not really state of the art, is to use Genetic Algorithms as proposed by [Singh, Tripathi, and Arora2009]. This paper presents a strategy which gives appropriate green time extensions to minimize a fitness function. In this case the fitness function consists of a linear combination of the performance indexes of all four lanes used in the example. In that paper, the approach reaches a performance increase of 21.9%, which is not as good as the reinforcement learning policy from the previous section.

Training and Ways of Finding the Solution

With the traffic algorithm readily available, we would also require proper simulation environments.


The first simulation we tried was the Aimsun Next traffic modeling software. The tool allows entire large cities to be imported and simulated, which we considered to be too massive in scale for our use case.


A better alternative is the popular open-source simulator Simulation of Urban MObility (SUMO). SUMO is easy to use: it has a Python API, can be configured using simple XML files and can be controlled from the terminal. This allows us not only to verify the results but also to actively use the simulation during training, e.g. when a Reinforcement Learning approach is used as presented in [Gao et al.2017]. It is also possible to visualize the results in a GUI; an example image can be seen in figure 4.

Figure 4: GUI of the SUMO simulation [Gao et al.2017]

Tests and Experiments


We use self-labeled traffic images from 3 cameras of different junctions and angles as seen in figure 1. Each image is labeled with a density level from empty to traffic jam.


We ran experiments using models built upon the Keras library with the TensorFlow backend. Keras provides us with an easy API for building deep learning models, which allowed us to focus more on the experiments.


Traffic density estimation can be modeled as a multi-class classification problem and be solved by CNN classifiers. We identified the following approaches for making use of CNNs and investigated their effectiveness for this problem:

  1. Basic CNN and

  2. Transfer Learning on InceptionV3.


We trained the two classifiers on the dataset and evaluated them on three metrics: accuracy, F1 score and top 2 accuracy. All metrics were measured by training a model on the training set and evaluating it on a separate validation set; we report the average over 10 runs. We extend the F1 score to the multi-class scenario by taking the average of the F1 scores computed independently for each class, where each per-class F1 score is the harmonic mean of precision and recall for that class. Top 2 accuracy refers to the frequency with which the correct label is among the two highest-ranked predictions. All results shown in Table 2 are evaluated on a cross-validated classifier with hyperparameters selected based on accuracy. The cross-validation was performed using a simple grid search.
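The two less common metrics, the class-averaged F1 score and the top 2 accuracy, can be sketched in plain Python as follows. This is an illustrative implementation of the definitions given above, not the evaluation code used for the experiments; the function names and the toy predictions are assumptions.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Average of the per-class F1 scores (harmonic mean of precision and recall)."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / n_classes


def top_k_accuracy(y_true, class_scores, k=2):
    """Fraction of examples whose true class is among the k highest-scored classes."""
    hits = 0
    for t, scores in zip(y_true, class_scores):
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        hits += t in top_k
    return hits / len(y_true)


# Toy example with 3 classes and 4 images (made-up scores).
y_true = [0, 1, 2, 2]
class_scores = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]
y_pred = [max(range(3), key=lambda c: s[c]) for s in class_scores]

print(round(macro_f1(y_true, y_pred, 3), 4), top_k_accuracy(y_true, class_scores))
```

In this toy example the third image is misclassified and its true class is not among the two best guesses either, so it counts against both metrics.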

Classifier         Accuracy   F1      Top 2 Accuracy
Basic CNN          71.35      71.26   93.23
Transfer Learning  66.38      59.21   88.43
Table 2: Classifier accuracy results (in %)

Classifier         Training Time (min)   Training Time with GPU (min)
Basic CNN          40.8                  1.1
Transfer Learning  1.2                   0.65
Table 3: Time Efficiency Results

In Table 3 you can find the time the training took us with and without GPUs.

A simple CNN provided better overall performance than transfer learning on InceptionV3. This is likely because we froze all of the InceptionV3 weights, and the higher-level features learned on the larger dataset do not transfer well to our dataset [Yosinski et al.2014]. In the future, we could explore freezing the bottom layers only.

Although the basic CNN provides better overall accuracy than transfer learning, it takes significantly longer to train without a GPU. (All time measurements are for 50 epochs of training with no extra preprocessing; GPU measurements were conducted on an Nvidia GTX 1080 Ti.) Therefore, transfer learning is a viable approach if computation power is limited, while the basic CNN is the preferred approach when GPUs are available.

Other Possibilities Tested

Class Count
Empty 1679
Low 1306
Medium 556
High 554
Traffic Jam 488
Table 4: Number of traffic images per class

Uneven Distribution:

Due to uneven distributions of classes in our self-labeled dataset (see Table 4), we also explore the following measures to handle class imbalance (CI) and compare them through experiments:

  1. Ratio-weighted losses: We scale the cross-entropy loss contributed by each example according to its class ratio using the following formulation:


    This increases the cost of misclassification of a minority class, forcing the learner to prioritize the correct classification of minority classes [Eigen and Fergus2014].

  2. Real-time data augmentation: By performing basic image transformations on existing data, we are able to generate new examples on the fly during training to increase the variety of examples seen by the classifier. As can be seen in Table 5, this leads to better accuracy for the minority classes, as we are able to obtain more training examples for them [Wong et al.2016].
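As an illustration of the first measure, ratio-weighted losses, the sketch below derives one weight per class from the class counts in Table 4 and applies it to the cross-entropy loss of a single example. The inverse-frequency weighting used here is one common choice and is an assumption; it is not necessarily the exact formulation used in the experiments, and the function names are hypothetical.

```python
import math

# Class counts taken from Table 4.
class_counts = {"Empty": 1679, "Low": 1306, "Medium": 556, "High": 554, "Traffic Jam": 488}


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


def weighted_cross_entropy(probabilities, true_class, weights):
    """Cross-entropy of one example, scaled by the weight of its true class."""
    return -weights[true_class] * math.log(probabilities[true_class])


weights = inverse_frequency_weights(class_counts)

# Assumed predicted probabilities for a single image.
probabilities = {"Empty": 0.5, "Low": 0.2, "Medium": 0.1, "High": 0.1, "Traffic Jam": 0.1}
loss_if_jam = weighted_cross_entropy(probabilities, "Traffic Jam", weights)
loss_if_high = weighted_cross_entropy(probabilities, "High", weights)
print(loss_if_jam > loss_if_high)  # → True
```

With this weighting, an equally confident mistake costs more on a "Traffic Jam" image than on a "High" image, because "Traffic Jam" is the rarer class.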

Method                                            Accuracy   F1      Top 2 Accuracy
Basic CNN with class imbalance measures applied   73.0       80.52   93.56
Basic CNN                                         71.35      71.26   93.23
Table 5: Class imbalance results (in %)

With CI measures, accuracy increased slightly but F1 scores increased significantly. (The F1 score is a better indicator of performance than accuracy in class imbalance scenarios since it accounts for precision and recall.) Therefore, the CI measures yield a clear improvement. Notably, the top 2 accuracy is about the same, which suggests that class imbalance does not affect the top 2 predictions.

Image Preprocessing:

Since the traffic images contain 2 opposite traffic lanes, we also propose the use of image masking to remove the parts of the images that do not belong to the traffic lane of interest.

Method                                    Accuracy   F1     Top 2 Accuracy
Basic CNN with CI measures and masking    74.3       81.3   94
Table 6: Image masking results (in %)

As seen in Table 6, the use of masking provided a 1-2% increase in accuracy, which is not very significant. This suggests that the CNN model was able to identify the non-relevant parts of the image even without masking. An example of the masking can be seen in Figure 5.

Figure 5: Original and Corresponding Masked Images
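The idea behind the masking step can be sketched in a few lines: every pixel outside the lane of interest is set to zero before the image is passed to the classifier. The example below uses a tiny grayscale image represented as nested lists and a hand-written boolean mask; both are assumptions for illustration, whereas the experiments used OpenCV on the real camera images.

```python
def apply_mask(image, mask):
    """Keep pixels where `mask` is True and set all other pixels to 0."""
    return [
        [pixel if keep else 0 for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Tiny 3x4 grayscale "image"; the left two columns stand for the lane of interest.
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]
mask = [[True, True, False, False] for _ in image]

masked = apply_mask(image, mask)
print(masked)  # → [[10, 20, 0, 0], [50, 60, 0, 0], [90, 100, 0, 0]]
```

Since each camera is fixed, a mask of this kind only has to be defined once per camera.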

Other Tools, Online Resources

We made use of a data-labelling tool, Labelbox. It provides a user-friendly web interface which allowed us to collaboratively label the entire dataset.

For the experiments and implementation, besides Keras, we used OpenCV for general image processing (e.g. masking the non-relevant parts of the images and resizing the images). Matplotlib was also used for general data visualization.

Our most important online resource was the Singaporean live camera dataset (see footnote 1). We wrote all scripts for downloading, processing and classifying these images by ourselves.


Manpower for the project was managed by assigning each person to a task that they were suited for. In the brainstorming phase, everyone was given time to develop their own ideas and to choose among all the ideas by a majority vote decision.

Moreover, unpleasant tasks (such as the labeling of the data) were divided equally among the members.


Reaching Requirements

Looking at the final results in Table 6, we are happy to have reached these numbers. It was quite a long way to get to this point with only 4582 images available. An accuracy of 74.3% and an F1 score of 81.3% are already quite satisfying, and in practice, with a frame rate of a few images per second, averaging the classification over a certain time interval (like a "low pass filter") will very likely produce good results. The 94% top 2 accuracy is also very significant for us because we labeled the images intuitively; hence the tendency is almost as important as the exact classification. E.g. if a low density is classified as empty, it is still a good insight. The high top 2 accuracy confirms this.

Of course, before bringing the application to market, further practical tests would be required. But we are very confident that its potential is high, and that it is not too hard to guarantee the general requirements for this case (e.g. real-time behavior).
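The temporal smoothing mentioned above can be sketched with a sliding window over the most recent per-frame predictions. The example uses a majority vote as a simple stand-in for averaging; the class name `MajorityVoteFilter`, the window size and the sequence of predictions are assumptions for illustration.

```python
from collections import Counter, deque


class MajorityVoteFilter:
    """Smooth a stream of class labels by voting over the most recent frames."""

    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)  # old frames drop out automatically

    def update(self, label):
        """Add the newest prediction and return the most frequent label in the window."""
        self.window.append(label)
        return Counter(self.window).most_common(1)[0][0]


# Made-up per-frame predictions containing two isolated misclassifications.
frame_predictions = ["Low", "Low", "Empty", "Low", "Low", "High", "Low"]

smoother = MajorityVoteFilter(window_size=5)
smoothed = [smoother.update(label) for label in frame_predictions]
print(smoothed)
```

The isolated "Empty" and "High" frames are outvoted by their neighbors, so the smoothed output stays at "Low" throughout. A larger window suppresses more noise but reacts more slowly to real changes in density.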

Future Improvements

A big improvement would be to use a larger training set, which was hardly possible for us because of the limited time.

To increase the accuracy as well as the F1 scores, it would also be helpful to label more precisely, i.e. to really count the number of cars. But as mentioned in the previous section, classifying the right tendency can already be very helpful for our application.