In this work, we propose a novel deep network for traffic sign classification that achieves outstanding performance on GTSRB, surpassing all previous methods. Our deep network consists of spatial transformer layers and a modified version of the inception module specifically designed to capture local and global features together. Adopting these features allows our network to classify intraclass samples precisely, even under deformations. The use of spatial transformer layers makes the network more robust to deformations of the input images such as translation, rotation and scaling. Unlike existing approaches that rely on hand-crafted features, multiple deep networks with huge numbers of parameters, and data augmentation, our method addresses the concerns of exploding parameter counts and augmentation. We achieve state-of-the-art performance of 99.81% on the GTSRB dataset.
Drivers often miss traffic signs because of obstacles or lapses in attention. Automating traffic sign classification would help reduce accidents. Traditional computer vision and machine learning methods were widely used for traffic sign classification [18, 3], but they were soon replaced by deep learning based classifiers. Recently, deep convolutional networks have surpassed traditional learning methods in traffic sign classification. With the rapid advances in deep learning architectures and the feasibility of high-performance implementations on graphics processing units (GPUs), it is worthwhile to revisit the traffic sign classification problem from an efficient deep learning perspective. Classifying traffic signs is not a simple task: images are subject to adverse variations in illumination, orientation, vehicle speed etc. Normally a wide-angle camera is mounted on top of a vehicle to capture traffic signs and other related visual features for advanced driver assistance systems (ADAS). These images are distorted by several external factors, including vehicle speed, sunlight, rain etc. Sample images from the GTSRB dataset are shown in Fig. 1.
In this work, we have developed a new classification system on top of existing deep learning methods. We aim to address the problems of traditional hand-crafted data augmentation and to reduce the overwhelming number of parameters that the system needs to learn. With fewer parameters, less computation and memory are required. Instead of using a single filter size in a convolutional layer, we use filters of multiple sizes and concatenate their responses to obtain more abstract representations in one layer. Hand-crafted data augmentation tuned to a specific dataset also has disadvantages when the learned parameters are meant for general-purpose use. Hence, learning the parameters for data modification inside the network is a better way to improve classification accuracy. A modified inception module suitable for traffic sign classification is proposed for use with GoogLeNet, and to do away with traditional data augmentation we have incorporated spatial transformer network layers into the proposed network. In terms of accuracy and number of parameters, this method surpasses the state-of-the-art methods for traffic sign classification.
Traffic sign classification has become a mature area with the increasing focus on autonomous driving research. Notable research exists on the detection and classification of traffic signs for advanced driver assistance systems. Most of these works attempt to address the challenges of real-life conditions such as scaling, rotation and blurring. Since it is not possible to discuss all of them, we give an overview of some relevant works. Most are based on computer vision and machine learning algorithms that use data from several camera sensors mounted on the car roof at different angles. Some researchers explore detection based on colour features, for example converting the colour space from RGB to HSV and then applying colour thresholding for detection, with classification by a support vector machine. In the colour thresholding approach, morphological operations such as connected component analysis are used for accurate localisation. Bahlmann et al. used colour, shape and motion information together with Haar wavelet based features for the detection and classification of traffic signs. Le et al. addressed the problem of weather variation by using SVM based colour classification on blocks of pixels. The German Traffic Sign Recognition Benchmark (GTSRB) is one of the reliable datasets for testing and validating traffic sign classification and detection algorithms. In the GTSRB competition, the top-performing algorithm exceeded the best human classification accuracy. Using a committee of neural networks, Ciresan et al. achieved the then-highest performance of 99.46%, which surpassed the best human performance of 98.84%. Their committee is composed of 25 networks, each with 3 convolutional and 2 fully connected layers, trained with traditional data augmentation and jittering. The main disadvantages of this committee are the multiple networks, the huge number of parameters (around 90 million) and the dataset-dependent handcrafted augmentations. Sermanet et al. proposed a multi-scale convolutional network with 2 feature stages, which achieved 98.31% accuracy on this dataset. Our previous work [16] also achieved significant accuracy.
Traffic sign classification is affected by contrast variation and by rotational and translational changes. It is possible to counteract the spatial transformations that an image undergoes due to the varying speed of the vehicle's camera by applying multiple transformations to the input image. However, such handcrafted transformations are not always effective and vary with the scenario. In this work, a spatial transformer network capable of automatically generating a transformation of the input image is used, along with a modified version of GoogLeNet, to make classification more robust and accurate.
Due to a moving camera, an image undergoes deformations such as blurring, translation, rotation, scaling and skew. For classification, feature maps (including the input image batch) are passed through spatial transformer layers. Spatial transformer modules are differentiable and hence can be trained with the backpropagation algorithm. A spatial transformer layer consists of three parts: a localisation network, a grid generator and a sampling unit. Figure 2 shows the spatial transformer network with its components. This layer can be inserted at any point in a CNN, and it is cheap to use because of its very low computational overhead. Using this layer with a CNN obviates handcrafted data augmentation such as translation, rotation etc. and allows the network to learn to actively transform its feature maps. The localisation network can be a fully connected or a convolutional neural network, with one mandatory final regression layer that generates the transformation parameters. The dimension of the parameter vector depends on the parameterized transformation type. To compute the parameters, the localisation network may take the input image or an input feature map, described by its width, height and number of channels. The localisation network can handle multiple channels. The localisation network (Fig. 3) may also contain any number of convolutional and fully connected layers, as the application requires. Using the parameters produced by the localisation network, the grid generator creates a set of points known as the sampling grid. The sampling grid and the input feature map (or input image) are used by the sampler to generate the transformed output map. Each pixel of the output feature map is computed using a sampling kernel centred at a particular location in the input feature map.
For an input feature map pixel and the learned 2D affine transformation parameters, the output feature map pixel is computed as follows:
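In the usual spatial transformer notation (assumed here, since the symbols are not spelled out in this text), $U$ is the input feature map, $f_{\mathrm{loc}}$ the localisation network, $\theta$ the six affine parameters, $(x_i^{t}, y_i^{t})$ the coordinates of pixel $i$ on the regular output grid, and $(x_i^{s}, y_i^{s})$ the location in the input feature map from which that pixel is sampled. Under these assumptions, a sketch of the two relations is:

```latex
\begin{equation}
\theta = f_{\mathrm{loc}}(U) \tag{1}
\end{equation}

\begin{equation}
\begin{pmatrix} x_i^{s} \\ y_i^{s} \end{pmatrix}
= A_{\theta} \begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix}
= \begin{pmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{pmatrix}
\begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix} \tag{2}
\end{equation}
```

In this standard form the grid maps output coordinates back to input sampling locations, so every output pixel receives a value.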
Using the transformation defined in (2), we can perform operations such as translation, rotation, scaling, skew and cropping on the input feature map. The parameters are computed using equation (1). Notably, this transformation requires only 6 parameters to be learned by the localisation network.
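To make the roles of the grid generator and the sampler concrete, the following is a minimal pure-Python sketch, not the implementation used in this work. It assumes a single-channel image stored as a list of rows, coordinates normalised to [-1, 1] with the corner pixels at -1 and 1, a bilinear sampling kernel, and zero values outside the image. The function names `affine_grid`, `bilinear_sample` and `spatial_transform` are hypothetical.

```python
import math


def affine_grid(theta, out_h, out_w):
    """Grid generator: map every output pixel to a sampling location.

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]], i.e. the six
    parameters a localisation network would regress. Coordinates are
    normalised to [-1, 1] (an assumed convention).
    """
    grid = []
    for i in range(out_h):
        y_t = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x_t = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sampler: read the image at each grid location with a bilinear kernel.

    Neighbours that fall outside the image contribute zero.
    """
    h, w = len(image), len(image[0])
    out = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            # Convert normalised coordinates to pixel coordinates.
            x = (x_s + 1.0) * (w - 1) / 2.0
            y = (y_s + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(x), math.floor(y)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < w and 0 <= yy < h:
                        weight = (1.0 - abs(x - xx)) * (1.0 - abs(y - yy))
                        value += weight * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out


def spatial_transform(image, theta, out_h, out_w):
    """Apply the affine transform theta to image and resample it."""
    return bilinear_sample(image, affine_grid(theta, out_h, out_w))
```

With `theta = [[1, 0, 0], [0, 1, 0]]` the sketch reproduces the input, and `theta = [[-1, 0, 0], [0, 1, 0]]` mirrors it horizontally. In a trained network these six numbers would come from the localisation network rather than being set by hand.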
A modified version of GoogLeNet with batch normalization is used as the parent network for the classification task. GoogLeNet is based on the Inception architecture, in which several Inception modules are stacked upon each other to produce the final output. Within an inception module, convolutional filters of various sizes are used to capture features at different levels of abstraction. Higher levels of abstraction are captured with larger filters and lower levels with smaller filters. Processing visual information at different scales and aggregating the results yields an efficient level of abstraction. Because directly applying more convolutional filters to the image data and concatenating their outputs is computationally expensive, the final Inception model applies dimensionality reduction filters before the abstraction-level filters. Convolutional filters are used for this dimensionality reduction. Besides being very successful at dimensionality reduction, these filters also serve to introduce rectified linear activation. The Inception architecture is efficient in terms of computational complexity with respect to the number of units at each stage. For our classification task, a modified version of the Inception module is used. Local abstract features play an important role in traffic sign classification. Signs belonging to the same group differ only slightly from each other in local structure, which makes them hard to distinguish. An extra convolutional reduction kernel is added, with max pooling on top of it, to capture discriminative local structure from the very beginning. Signs belonging to different groups differ in global abstraction, which can be captured using a convolutional reduction kernel. Improved performance is observed with this architecture over the normal Inception module. Figure 4 shows the Google inception module and Fig. 5 shows our proposed inception module.
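The saving from dimensionality reduction can be checked with simple arithmetic. The sketch below counts the weights of one convolutional branch with and without a 1x1 reduction in front of it, and the number of channels after concatenating the branches. The channel counts in the usage note are those of the first inception module of the original GoogLeNet, used purely for illustration; they are not the values of the proposed module, and the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, k, bias=True):
    """Number of learnable parameters in a k x k convolution."""
    return in_ch * out_ch * k * k + (out_ch if bias else 0)


def branch_params(in_ch, out_ch, k, reduce_ch=None):
    """Parameters of one branch, optionally preceded by a 1x1 reduction."""
    if reduce_ch is None:
        return conv_params(in_ch, out_ch, k)
    # A 1x1 convolution shrinks the channel count before the k x k filters.
    return conv_params(in_ch, reduce_ch, 1) + conv_params(reduce_ch, out_ch, k)


def inception_output_channels(n1x1, n3x3, n5x5, pool_proj):
    """Branch outputs are concatenated along the channel axis."""
    return n1x1 + n3x3 + n5x5 + pool_proj
```

For a 192-channel input, `branch_params(192, 32, 5)` gives 153,632 parameters for a direct 5x5 branch, while `branch_params(192, 32, 5, reduce_ch=16)` gives 15,920, roughly a tenfold reduction, and `inception_output_channels(64, 128, 32, 32)` gives 256 output channels.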
|output size||#1x1||#3x3 reduce||#3x3||#5x5 reduce||#5x5||#3x3 reduce pool||#3x3||params|
We extensively evaluate our proposed deep networks on GTSRB (German Traffic Sign Recognition Benchmark), using both our modified network and the original GoogLeNet. GTSRB is the standard evaluation benchmark for traffic sign recognition and classification. German traffic signs have significant similarities with those of other European countries and with Indian conventions, which makes the dataset suitable to explore.
The proposed network was trained and tested using the machine learning library Torch and two NVIDIA Tesla K40c GPUs. The stn package was used to implement the spatial transformer network.
The GTSRB dataset contains 51,839 images in 43 classes for training and testing. We have selected 39,209 images for training and the remaining 12,630 for testing. The dataset includes images with deformation due to viewpoint variation, occlusion by obstacles such as trees and buildings, natural degradation and weather conditions. We have resized all input images to a fixed size using cubic interpolation.
For training, we have used SGD with a momentum of 0.9, a weight decay of 0.0918, a minibatch size of 20 images and a learning rate of 0.00032. Dropout (40%) was used for the fully connected layer. We observed that the learning rate is the primary influence on the training process. For activation, the Parametric Rectified Linear Unit (PReLU) is used instead of the parameter-free ReLU, for better accuracy. The parameters of PReLU are learned during training of the network. The network weights are initialized using the MSRA method, which has proved useful for networks based on PReLU activation units.
A detailed description of the proposed network is given in Table I. The numbers of filters used for dimensionality reduction before the 3x3 and 5x5 convolutions are referred to as "#3x3 reduce" and "#5x5 reduce". Similarly, "#3x3 reduce pool" refers to the number of filters used before the 3x3 convolution and max pooling. In addition, the inception module entries in Table I refer to our proposed inception module. Notably, apart from the inception module, the primary network of the proposed method also differs slightly from the original GoogLeNet. The proposed network is 21 layers deep when counting only the layers with parameters, excluding pooling layers and spatial transformer layers (which have their own parameters). Since a spatial transformer (ST) layer is a feature transformer with its own network parameters, it does not affect the feature learning process of the rest of the network. If pooling layers are included, our module has a depth of 3. Including pooling and ST layers, the network is 39 layers deep. We have used four spatial transformer layers (networks), two of them before two convolutional layers and the other two before modified inception modules. The network configurations of the spatial transformer layers ST1, ST2 and ST3 differ, and detailed information is given in Table II.
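The activation, initialisation and optimiser choices described above can be sketched in a few lines of scalar Python. This is an illustration under stated assumptions rather than the Torch code used in this work: the update rule is one common formulation of SGD with momentum and L2 weight decay (libraries differ in details such as dampening), the default hyperparameters are the ones quoted in the text, and all function names are hypothetical.

```python
import math
import random


def prelu(x, a):
    """PReLU: identity for positive inputs, learnable slope a otherwise."""
    return x if x > 0 else a * x


def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to the slope a."""
    return 0.0 if x > 0 else x


def msra_std(fan_in, a=0.0):
    """MSRA (He et al.) standard deviation: sqrt(2 / ((1 + a^2) * fan_in)).

    fan_in is k * k * in_channels for a k x k convolution; a is the
    initial PReLU slope (a = 0 recovers the plain ReLU case).
    """
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))


def msra_init(fan_in, n_weights, a=0.0, rng=None):
    """Draw n_weights values from a zero-mean Gaussian with the MSRA std."""
    rng = rng or random.Random(0)
    std = msra_std(fan_in, a)
    return [rng.gauss(0.0, std) for _ in range(n_weights)]


def sgd_momentum_step(w, grad, velocity,
                      lr=0.00032, momentum=0.9, weight_decay=0.0918):
    """One SGD step with momentum and L2 weight decay on a scalar weight."""
    g = grad + weight_decay * w          # weight decay adds wd * w to the gradient
    velocity = momentum * velocity + g   # accumulate the momentum buffer
    return w - lr * velocity, velocity
```

Because the slope `a` is an ordinary parameter, it would be updated by the same `sgd_momentum_step` rule as the weights, using `prelu_grad_a` in the chain rule; whether weight decay should also apply to the slopes is a design choice this text does not specify.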
|ST2||128, 5x5/2||yes||192, 5x5/2||no||192||192|
|ST3||128, 3x3/2||no||192, 3x3/1||yes||192||192|
|Committee of CNNs ||99.47||99.93||99.72||99.89||99.07||99.22|
|Multi-Scale CNN ||98.61||99.87||94.44||97.18||98.3||98.63|
|Random Forest ||95.95||99.13||87.50||99.27||92.08||98.73|
This method has several advantages over existing state-of-the-art methods in terms of performance, scalability and memory requirements. The recent high-performing Committee of CNNs uses 25 networks, each with 3 convolutional layers and 2 fully connected layers, along with manual data augmentation. They modify each image of the original dataset using translation, rotation etc. to obtain five modified versions of that image. The Committee of CNNs ends up with around 90 million parameters in total, whereas our method has around 10.5 million parameters.
Overall accuracy comparisons with different high performing approaches are shown in Table IV. We have also reported the accuracy of our deep networks with the Google inception module, which is slightly lower than the accuracy obtained using our modified inception module.
The GTSRB dataset is composed of 6 main high-level groups. Classifying images that belong to different groups is easier than classifying images within the same group. We report the accuracy obtained for each group in Table III, together with comparisons with other state-of-the-art methods.
This paper proposes a deep convolutional network with fewer parameters and lower memory requirements than existing methods. The presented network does not need data jittering or handcrafted data augmentation. Our main contributions are the development of a modified inception module and a deep network using spatial transformer layers for traffic sign classification.
I would like to express my deep gratitude to my professors at IIT Guwahati and IIIT Bangalore for their valuable suggestions during this work, and to NVIDIA Corporation for donating the GPUs used in this work.
McCall, J. C., Trivedi, M. M. Video-based lane estimation and tracking for driver assistance: survey, system, and evaluation. IEEE transactions on intelligent transportation systems, 7(1), 20-37. (2006).