Automatic License Plate Recognition (ALPR) has been a frequent topic of research [1, 2, 3] due to its many practical applications, such as automatic toll collection, traffic law enforcement, private spaces access control and road traffic monitoring.
ALPR systems typically have three stages: License Plate (LP) detection, character segmentation and character recognition. The earlier stages require higher accuracy or almost perfection, since failing to detect the LP would probably lead to a failure in the next stages as well. Many approaches search first for the vehicle and then for its LP in order to reduce processing time and eliminate false positives.
Although ALPR has been frequently addressed in the literature, many studies and solutions are still not robust enough in real-world scenarios. These solutions commonly depend on certain constraints, such as specific cameras or viewing angles, simple backgrounds, good lighting conditions, search in a fixed region, and certain types of vehicles (they would not detect LPs from vehicles such as motorcycles, trucks or buses).
To the best of our knowledge, the SSIG dataset is the largest public dataset of Brazilian LPs. This dataset contains less than training examples and has several constraints, such as: it uses a static camera always mounted in the same position, all images have very similar and relatively simple backgrounds, there are no motorcycles, and there are only a few cases where the LPs are not well aligned.
When recording the UFPR-ALPR dataset, we sought to eliminate many of the constraints found in ALPR applications by using three different non-static cameras to capture , images from different types of vehicles (cars, motorcycles, buses, trucks, among others) with complex backgrounds and under different lighting conditions. The vehicles are in different positions and at different distances from the camera. Furthermore, in some cases, the vehicle is not fully visible in the image. To the best of our knowledge, there are no public datasets for ALPR with annotations of cars, motorcycles, LPs and characters. Therefore, we can point out two main challenges in our dataset. First, car and motorcycle LPs usually have different aspect ratios, which prevents ALPR approaches from using this constraint to filter false positives. Second, car and motorcycle LPs have different layouts and positions.
As great advances in object detection were achieved through YOLO-inspired models [9, 10], we decided to fine-tune them for ALPR. YOLOv2 is a state-of-the-art real-time object detector that uses a model with convolutional layers and maxpooling layers. On the other hand, Fast-YOLO is a model focused on a speed/accuracy trade-off that uses fewer convolutional layers ( instead of ) and fewer filters in those layers. Therefore, Fast-YOLO is much faster but less accurate than YOLOv2.
In this work, we propose a new robust real-time ALPR system based on the YOLO object detection Convolutional Neural Network (CNN). Since we are processing video frames, we also exploit temporal redundancy: we process each frame independently and then combine the results to create a more robust prediction for each vehicle.
The proposed system outperforms previous results and two commercial systems in the SSIG dataset and also in our proposed UFPR-ALPR dataset. The main contributions of this paper can be summarized as follows:
A new real-time end-to-end ALPR system using the state-of-the-art YOLO object detection CNNs (the entire ALPR system, i.e., the architectures and weights, is publicly available for academic purposes);
A robust two-stage approach for character segmentation and recognition, mainly due to simple data augmentation tricks for training data such as inverted LPs and flipped characters;
A public dataset for ALPR with , fully annotated images (over , LP characters) focused on usual and different real-world scenarios, showing that our proposed ALPR system yields outstanding results in both scenarios;
A comparative evaluation among the proposed approach, previous works in the literature and two commercial systems in the UFPR-ALPR dataset.
This paper is organized as follows. We briefly review related work in Section II. The UFPR-ALPR dataset is introduced in Section III. Section IV presents the proposed ALPR system using object detection CNNs. We report and discuss the results of our experiments in Section V. Conclusions and future work are given in Section VI.
II Related Work
In this section, we briefly review several recent works that use Deep Learning (DL) approaches in the context of ALPR. For relevant studies using conventional image processing techniques, please refer to [13, 14, 15, 16, 1, 17, 2, 18, 19]. More specifically, we discuss works related to each ALPR stage, and separately discuss studies that do not fit into the other subsections. This section concludes with final remarks.
LP Detection: Many authors have addressed the LP detection stage with object detection CNNs. Montazzolli and Jung used a single CNN arranged in a cascaded manner to detect both car frontal views and their LPs, achieving high recall and precision rates. Hsu et al. customized CNNs exclusively for LP detection and demonstrated that the modified versions perform better. Rafique et al. applied Support Vector Machines (SVM) and Region-based CNN (RCNN) for LP detection, noting that RCNN are best suited for real-time systems.
Li and Chen trained a CNN based on characters cropped from general text to perform a character-based LP detection, achieving higher recall and precision rates than previous approaches. Bulan et al. first extract a set of candidate LP regions using a weak Sparse Network of Winnows (SNoW) classifier and then filter them using a strong CNN, significantly improving the baseline method.
Character Segmentation: ALPR systems based on DL techniques usually address character segmentation and recognition together. Montazzolli and Jung propose a CNN to segment and recognize the characters within a cropped LP. They segmented more than of the characters correctly, outperforming the baseline by a large margin.
Bulan et al. achieved very high accuracy in LP recognition by jointly performing character segmentation and recognition using Hidden Markov Models (HMM), where the most likely LP was determined by applying the Viterbi algorithm.
Character Recognition: Menotti et al. proposed the use of random CNNs to extract features for character recognition, achieving significantly better performance than using image pixels or learning the filter weights with back-propagation. Li and Chen proposed to treat character recognition as a sequence labelling problem. A Recurrent Neural Network (RNN) with Connectionist Temporal Classification (CTC) is employed to label the sequential data, recognizing the whole LP without character-level segmentation.
Although Svoboda et al. did not perform the character recognition itself, they achieved high-quality LP deblurring reconstructions using a text deblurring CNN, which can be very useful in character recognition.
Miscellaneous: Masood et al. presented an end-to-end ALPR system using a sequence of deep CNNs. As this is a commercial system, little information is given about the CNNs used. Li et al. propose a unified CNN that can locate LPs and recognize them simultaneously in a single forward pass. In addition, the model size is greatly reduced by sharing many of its convolutional features.
Final Remarks: Many papers only address part of the ALPR pipeline (e.g., LP detection) or perform their experiments on datasets that do not represent real-world scenarios, making it difficult to accurately evaluate the presented methods. In addition, most of the approaches are not capable of recognizing LPs in real time, which rules them out for some applications. In this sense, we employ the YOLO object detection CNNs in each stage to create a robust and efficient end-to-end ALPR system. In addition, we perform data augmentation for character recognition, since this stage is the bottleneck in some ALPR systems.
III The UFPR-ALPR Dataset
The dataset contains , images taken from inside a vehicle driving through regular traffic in an urban environment. These images were obtained from videos with duration of second and frame rate of FPS. Thus, the dataset is divided into vehicles, each with images with only one visible LP in the foreground. It is noteworthy that no stabilization method was used. Fig. 1 shows the diversity of the dataset.
The images were acquired with three different cameras and are available in the PNG format with size of , , pixels. The cameras used were: GoPro Hero4 Silver, Huawei P9 Lite and iPhone 7 Plus. Images obtained with different cameras do not necessarily have the same quality, although they have the same resolution and frame rate. This is due to different camera specifications, such as autofocus, bit rate, focal length and optical image stabilization.
There are minor variations in the camera position due to repeated mountings of the camera and also to simulate a real condition, where the camera is not always placed in exactly the same position.
We collected , images with each camera, divided as follows: of cars with gray LPs, of cars with red LPs and of motorcycles with gray LPs. In Brazil, LPs have size and color variations depending on the type of the vehicle and its category. Car LPs have a size of cm cm, while motorcycle LPs have cm cm. Private vehicles have gray LPs, while buses, taxis and other transportation vehicles have red LPs. There are other color variations for specific categories such as official or older cars. Fig. 2 shows some of the different types of LPs found in the dataset.
The dataset is split as follows: for training, for testing and for validation, using the same protocol division proposed by Gonçalves et al. in the SSIG dataset. The dataset distribution was made so that each split has the same number of images obtained with each camera, taking into account the type and position of the vehicle, the color and the characters of the vehicle’s LP, and the distance of the vehicle from the camera (based on the height of the LP in pixels), such that each split is as representative as possible.
The heat maps of the distribution of the vehicles and LPs for the image frame in both SSIG and UFPR-ALPR datasets are shown in Fig. 3. As can be seen, the vehicles and LPs are much better distributed in our dataset.
In Brazil, each state uses particular starting letters for its LPs, which results in a specific range. In Paraná (where the dataset was collected), LPs range from AAA-0001 to BEZ-9999. Therefore, the letters A and B have many more examples than the others, as shown in Fig. 4.
Every image has the following annotations available in a text file: the camera with which the image was taken; the vehicle’s position and information such as type (car or motorcycle), manufacturer, model and year; and the identification and position of the LP, as well as the position of its characters. Fig. 1 shows the bounding boxes of different types of vehicles and LPs.
IV Proposed ALPR Approach
This section describes the proposed approach and is divided into four subsections, one for each of the ALPR stages (i.e., vehicle and LP detection, character segmentation and character recognition) and one for temporal redundancy. Fig. 5 illustrates the ALPR pipeline, explained throughout this section.
We use specific CNNs for each ALPR stage. Thus, we can tune the parameters separately in order to improve the performance for each task. The models used are: Fast-YOLO, YOLOv2 and CR-NET, an architecture inspired by Fast-YOLO for character segmentation and recognition.
IV-A Vehicle and LP Detection
We evaluated both Fast-YOLO and YOLOv2 models at this stage to be able to handle simpler (i.e., SSIG) and more realistic (i.e., UFPR-ALPR) data. For simpler scenarios, Fast-YOLO should be able to detect the vehicles and their LPs correctly in much less time. However, for more realistic scenarios it might not be deep enough to perform these tasks.
In order to use both YOLO models (for training YOLOv2 and Fast-YOLO we used convolutional weights pre-trained on ImageNet, available at https://pjreddie.com/darknet/yolo/), we need to change the number of filters in the last convolutional layer to match the number of classes. YOLO uses $A$ anchor boxes to predict bounding boxes (we use $A$ = ), each with four coordinates $(x, y, w, h)$, a confidence and $C$ class probabilities, so the number of filters is given by

$$\text{filters} = (C + 5) \times A. \qquad (1)$$
In a dataset such as the SSIG dataset, we intend to detect only one class in both vehicle and LP detection (first the car and then its LP), so the number of filters in each task has been reduced to . On the other hand, the UFPR-ALPR dataset includes images from cars and motorcycles (two classes), so the number of filters in the vehicle detection task must be . In our tests, the results were better when using two classes (instead of just one class called ‘vehicle’). The Fast-YOLO architecture used in both tasks is shown in Table I. The same changes were made in the YOLOv2 model architecture (not shown due to lack of space).
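The filter count described above (four coordinates, one confidence and one probability per class, for every anchor box) can be checked with a few lines of Python. This is only a minimal sketch: the function name is hypothetical, and the default of 5 anchor boxes is an assumption taken from the standard YOLOv2 configuration, not a value confirmed by the text.

```python
def last_layer_filters(num_classes: int, num_anchors: int = 5) -> int:
    """Filters in the last convolutional layer of a YOLOv2-style detector.

    Each anchor box predicts 4 coordinates, 1 confidence score and one
    probability per class, i.e. (num_classes + 5) values per anchor.
    num_anchors=5 is an assumption (the default YOLOv2 setting).
    """
    if num_classes < 1 or num_anchors < 1:
        raise ValueError("num_classes and num_anchors must be positive")
    return (num_classes + 5) * num_anchors


if __name__ == "__main__":
    # One class (e.g., only cars or only LPs) versus two classes
    # (cars and motorcycles).
    for classes in (1, 2):
        print(classes, "class(es):", last_layer_filters(classes), "filters")
```

Adding a class therefore adds one filter per anchor box, which is why the one-class and two-class detectors need different last layers.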
While the entire frame and the vehicle coordinates are used as inputs to train the vehicle detection CNN, the vehicle patch (with a margin) and the coordinates of its LP are used to learn the LP detection network. The size of the margin is defined as follows. We evaluated, in the validation set, the required margin so that all LPs would be completely within the bounding boxes of the vehicles found by the vehicle detection CNN. This is done to avoid losing LPs in cases where the vehicle is not very well detected/segmented.
By default, YOLO only returns objects detected with a confidence of or higher. In the validation set, we evaluated the best threshold in order to detect all vehicles with the lowest false positive rate. A negative recognition result is given in cases where no vehicle is found. For LP detection we use a threshold equal to , as there might be cases where the LP is detected with very low confidence (e.g., ). We keep only the detection with the largest confidence in cases where more than one LP is detected, since each vehicle has only one LP.
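The margin procedure can be sketched as follows. The helper names are hypothetical, boxes are assumed to be (x_min, y_min, x_max, y_max) tuples in pixels, and the margin is assumed to be a fraction of the box size applied on every side; the text does not specify these details.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def expand_box(box: Box, margin: float, img_w: int, img_h: int) -> Box:
    """Grow a box by `margin` (fraction of its width/height) on every side,
    clipping the result to the image borders."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * margin
    dy = (y1 - y0) * margin
    return (max(0.0, x0 - dx), max(0.0, y0 - dy),
            min(float(img_w), x1 + dx), min(float(img_h), y1 + dy))


def required_margin(outer: Box, inner: Box) -> float:
    """Smallest margin for which the expanded `outer` box fully contains
    `inner` (image clipping is ignored)."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    w, h = ox1 - ox0, oy1 - oy0
    overflow = [(ox0 - ix0) / w, (ix1 - ox1) / w,
                (oy0 - iy0) / h, (iy1 - oy1) / h]
    return max(0.0, max(overflow))
```

Under these assumptions, the margin for a dataset would be the maximum of `required_margin(predicted_vehicle_box, annotated_lp_box)` over the validation set.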
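A minimal sketch of the two decisions described above, under stated assumptions: detections are represented as dictionaries with a "confidence" key (a hypothetical format, not Darknet's output structure), and the "best" threshold is taken to be the highest value that still keeps every correct validation detection, since raising the threshold can only remove false positives.

```python
from typing import Dict, List, Optional

Detection = Dict[str, float]  # hypothetical record with a "confidence" key


def highest_safe_threshold(true_positive_confidences: List[float]) -> float:
    """Largest threshold that still keeps every correct validation detection."""
    if not true_positive_confidences:
        raise ValueError("need at least one validation detection")
    return min(true_positive_confidences)


def keep_best(detections: List[Detection], threshold: float) -> Optional[Detection]:
    """Drop detections below `threshold` and return the most confident
    survivor, or None when nothing is left (a negative result)."""
    kept = [d for d in detections if d["confidence"] >= threshold]
    if not kept:
        return None
    return max(kept, key=lambda d: d["confidence"])
```

`keep_best` encodes the one-LP-per-vehicle rule: when several candidates survive the threshold, only the most confident one is passed to the next stage.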
IV-B Character Segmentation
Once the LP has been detected, we employ the CNN proposed by Montazzolli and Jung (CR-NET) for character segmentation and recognition. However, instead of performing both stages at the same time through an architecture with classes (0-9, A-Z, where the letter O is detected jointly with the digit 0), we chose to first use a network to segment the characters and then another two to recognize them. Knowing that all Brazilian LPs have the same format (three letters and four digits), we use classes for letters and classes for digits. As pointed out by Gonçalves et al., this reduces incorrect classifications.
The character segmentation CNN (architecture described in Table II) is trained using the LP patch (with a margin) and the character coordinates as inputs. As in the previous stage, this margin is defined based on the validation set to ensure that all characters are completely within the predicted LP.
The CNN input size ( ) was chosen based on the aspect ratio of Brazilian car LPs ( ); however, motorcycle LPs are nearly square ( ). Therefore, we horizontally enlarge all detected LPs (to ) before performing the character segmentation.
We also create a negative image of each LP, thereby doubling the number of training samples. Since the color of the characters in Brazilian LPs depends on the category of the vehicle (e.g., private or commercial), the negative images simulate characters from other categories.
In some cases, more than characters might be detected. If there are no overlaps (Intersection over Union (IoU) ), we discard the ones with the lowest confidence levels. Otherwise, we perform the union between the overlapping characters, turning them into a single character. As motorcycle LPs can be very tilted, we use a higher threshold (IoU ) to consider the overlap between their characters.
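The handling of surplus characters can be sketched as below. This is an illustration under assumptions, not the exact procedure used here: characters are (x_min, y_min, x_max, y_max, confidence) tuples, the function names are hypothetical, overlapping pairs are merged greedily, and `max_chars` and `iou_thr` are left as parameters (`max_chars` would be the number of characters on a plate, three letters plus four digits in the Brazilian layout).

```python
from typing import List, Optional, Tuple

Char = Tuple[float, float, float, float, float]  # x0, y0, x1, y1, confidence


def iou(a: Char, b: Char) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def union_box(a: Char, b: Char) -> Char:
    """Smallest box covering both characters; keeps the higher confidence."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]), max(a[4], b[4]))


def first_overlap(chars: List[Char], iou_thr: float) -> Optional[Tuple[int, int]]:
    """Indices of the first pair whose IoU reaches the threshold, if any."""
    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            if iou(chars[i], chars[j]) >= iou_thr:
                return i, j
    return None


def limit_characters(chars: List[Char], max_chars: int, iou_thr: float) -> List[Char]:
    """Reduce the detections to at most `max_chars` characters."""
    chars = list(chars)
    # While there are too many characters, merge overlapping pairs.
    while len(chars) > max_chars:
        pair = first_overlap(chars, iou_thr)
        if pair is None:
            break
        i, j = pair
        chars[i] = union_box(chars[i], chars[j])
        del chars[j]
    # No overlaps left: discard the least confident surplus detections.
    if len(chars) > max_chars:
        chars = sorted(chars, key=lambda c: c[4], reverse=True)[:max_chars]
    return sorted(chars, key=lambda c: c[0])  # left to right, for convenience
```

A higher `iou_thr` makes merging less likely, which matches the use of a stricter threshold for tilted motorcycle LPs, where neighbouring character boxes naturally overlap more.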
IV-C Character Recognition
Many characters might not be perfectly segmented and may have missing parts. Since each character is also relatively small, even a one-pixel difference between the ground truth and the prediction might impair the character’s recognition. Therefore, we evaluate different padding values (- pixels) in the segmented characters to achieve higher recognition rates. As Fig. 6 illustrates, the more padding pixels are used, the more noise is added (e.g., portions of other characters or the LP frame).
As previously mentioned, we use two networks for character recognition. For training these networks, the characters and their labels are passed as input. For digit recognition, we removed the first four layers of the character segmentation CNN, since in our tests the results were similar, but with a lower computational cost. However, for letter recognition (more classes and fewer examples) we still use the entire architecture of the character segmentation CNN. The networks for digit and letter recognition have and filters in the last convolutional layer, respectively (see Eq. 1).
The use of two networks allows the tuning of network parameters (e.g., input/output size) for each task. The best network sizes found in our experiments are and for digits and letters, respectively.
Knowing the LP layout of a specific country (e.g., the Brazilian layout), we know which characters are letters and which are digits by their position. We sort the segmented characters by their horizontal and vertical positions for cars and motorcycles, respectively. The first three characters correspond to the letters and the last four to the digits, even in cases where the LP is considerably tilted. It is worth noting that a country (e.g., USA) might have different LP layouts, so this approach would not be suitable in such cases.
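The position-based split into letters and digits can be sketched as follows. The function name and the (x_min, y_min, x_max, y_max) box format are hypothetical. For motorcycles, the additional left-to-right ordering inside each row is an assumption of this sketch; the text only states that characters are sorted vertically.

```python
from typing import List, Tuple

Char = Tuple[float, float, float, float]  # x0, y0, x1, y1


def split_letters_digits(chars: List[Char], vehicle: str) -> Tuple[List[Char], List[Char]]:
    """Return (letters, digits) for a Brazilian LP: 3 letters then 4 digits."""
    if len(chars) != 7:
        raise ValueError("expected exactly 7 segmented characters")
    if vehicle == "car":
        # Single-row plate: read from left to right.
        ordered = sorted(chars, key=lambda c: c[0])
        return ordered[:3], ordered[3:]
    if vehicle == "motorcycle":
        # Two-row plate: the three topmost characters are the letters.
        by_row = sorted(chars, key=lambda c: c[1])
        letters = sorted(by_row[:3], key=lambda c: c[0])
        digits = sorted(by_row[3:], key=lambda c: c[0])
        return letters, digits
    raise ValueError("vehicle must be 'car' or 'motorcycle'")
```

The letters would then be sent to the letter recognition network and the digits to the digit recognition network.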
In addition to performing the training with the characters available in the training set, we also perform data augmentation in two ways. First, we create negative images to simulate characters from other vehicle categories (as in the character segmentation stage) and then, we also check which characters can be flipped both horizontally and vertically to create new instances. Table III shows which characters can be flipped in each direction.
|Flip Direction||Characters|
|Vertical||0, 1, 3, 8, B, C, D, E, H, I, K, O, X|
|Horizontal||0, 1, 8, A, H, I, M, O, T, U, V, W, X, Y|
|Both||0, 1, 6(9), 8, 9(6), H, I, N, O, S, X, Z|
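The two augmentation strategies can be illustrated with plain Python lists. This is a sketch under assumptions: images are grayscale lists of rows with values in 0-255, "vertical" is read as a top-bottom flip and "horizontal" as a left-right mirror (consistent with the symmetries of the characters in Table III), and negatives are generated for the flipped copies as well, which the text does not state explicitly.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows with values in 0..255

# Character sets taken from Table III.
VERTICAL = set("0138BCDEHIKOX")
HORIZONTAL = set("018AHIMOTUVWXY")
BOTH = {c: c for c in "018HINOSXZ"}
BOTH.update({"6": "9", "9": "6"})  # flipping both ways turns 6 into 9


def negative(img: Image) -> Image:
    return [[255 - p for p in row] for row in img]


def flip_vertical(img: Image) -> Image:
    return [list(row) for row in img[::-1]]  # upside down


def flip_horizontal(img: Image) -> Image:
    return [row[::-1] for row in img]  # mirror


def augment(img: Image, label: str) -> List[Tuple[Image, str]]:
    """Return the original sample plus every valid flipped/negative copy."""
    samples = [(img, label)]
    if label in VERTICAL:
        samples.append((flip_vertical(img), label))
    if label in HORIZONTAL:
        samples.append((flip_horizontal(img), label))
    if label in BOTH:
        samples.append((flip_horizontal(flip_vertical(img)), BOTH[label]))
    # Negative images simulate characters from other vehicle categories.
    samples += [(negative(i), l) for i, l in list(samples)]
    return samples
```

Note that the "Both" case changes the label for 6 and 9, so a flipped 6 becomes an extra training example for the class 9 and vice versa.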
As in the LP detection step, we use confidence threshold = and consider only the detection with the largest confidence. Hence, we ensure that a class is predicted for every segmented character.
IV-D Temporal Redundancy
After performing the LP recognition on single frames, we explore the temporal redundancy information through the union of all frames belonging to the same vehicle. Thus, the final recognition is composed of the most frequently predicted character at each LP position (majority vote).
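The majority vote can be written in a few lines. This is a minimal sketch assuming that every frame yields a prediction string of the same length (e.g., seven characters); the function name is hypothetical and ties are broken in favour of the character seen first.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[str]) -> str:
    """Combine per-frame LP predictions of one vehicle, position by position."""
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    # zip(*predictions) groups the characters found at each LP position.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*predictions))
```

For example, if three frames of the same vehicle are read as "ABC1234", "ABC1284" and "A8C1234", each individual error is outvoted by the other two frames.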
V Experimental Results
In this section, we conduct experiments to verify the effectiveness of the proposed ALPR system. All the experiments were performed on an NVIDIA Titan XP GPU (, CUDA cores and GB of RAM) using the Darknet framework.
We consider as correct only the detections with IoU . This value was chosen based on previous works [20, 6, 18]. In addition, the following parameters were used for training the networks: iterations (max batches) and learning rate = [-, -, -] with steps at and iterations.
Experiments were conducted on two datasets: SSIG and UFPR-ALPR. We report the results obtained by the proposed system and compare them with previous work and two commercial systems: Sighthound and OpenALPR. Both have Cloud APIs, available at https://www.sighthound.com/products/cloud and https://www.openalpr.com/cloud-api.html, respectively; the results presented here were obtained in January 2018. Although OpenALPR has an open-source version, the commercial version uses different algorithms for OCR trained with larger datasets to improve accuracy. According to the authors, both are robust in the detection and recognition of Brazilian LPs.
It is important to emphasize that although the commercial systems were not tuned for these datasets, they use much larger private datasets, which is a great advantage, especially in DL approaches.
In the OpenALPR system we choose which LP style we want to detect (i.e., Brazilian) and we do not need to make any changes. On the other hand, Sighthound uses a single model for LPs from different countries. Therefore, we made some adjustments to its predictions so that they fit the Brazilian LP format, such as swapping by O and vice versa.
V-A Evaluation on the SSIG Dataset
The SSIG dataset is composed of , images of vehicles with the following annotations: the position of the vehicle’s LP, its identification (e.g., ABC-1234) and each character’s position.
The high resolution images (, , pixels) were acquired with a static digital camera and are available in the PNG format. A sample frame of the dataset is shown in Fig. 7.
The SSIG dataset uses the following evaluation protocol: of the dataset to training, to validation and to test. According to the authors, this protocol was adopted because many character segmentation approaches do not require model estimation and a larger test set allows the reported results to be more statistically significant.
We report only the results obtained with the Fast-YOLO model in the vehicle and LP detection subsections, since it achieved impressive recall and precision rates in both tasks.
V-A1 Vehicle Detection
Since the SSIG dataset does not have vehicle annotations, we manually labeled the vehicle’s bounding box on each image of the dataset. Another possible approach would be to train a vehicle detector using the large-scale CompCars dataset, but that way many vehicles (including those in the background) would also be detected.
To perform the vehicle detection, we first evaluate different confidence thresholds. We started with confidence of , however some vehicles were not detected. All vehicles in the validation set were successfully detected when the threshold was reduced to . Based on that, we decided to use half of this value (i.e., ) in the test set to increase the chance that all vehicles are detected. With this threshold, we achieved a recall of and precision above (only false positives).
V-A2 LP Detection
Every vehicle in the validation set was well segmented with its LP completely within the predicted bounding box. Therefore, we use the vehicle patches without any margin to train the LP detection network. As expected, all LPs were correctly detected in both validation and test sets (recall and precision = ).
V-A3 Character Segmentation
A margin of (of the bounding box size) is required so each detected LP contains all its characters fully. Therefore, we double this value (i.e., ) in the test set and in the training of the character segmentation CNN.
We evaluated, in the validation set, the following confidence thresholds: , and , but the recall achieved was , regardless. Therefore, we chose to use a lower threshold (i.e., ) in the test set to miss as few characters as possible. That way, we achieved (,/,) recall.
V-A4 Character Recognition
The padding values that yielded the best recognition rates in the validation set were pixels for letters and pixel for digits. In addition, data augmentation with flipped characters only improved letter recognition, hampering digit recognition. We believe that greater padding and data augmentation improve letter recognition because each letter class has far fewer training examples than each digit class.
We first analyzed the results without temporal redundancy information. The proposed system achieved recognition rate of , recognizing all three letters and all four digits in and of the time, respectively.
The results are greatly improved when taking advantage of temporal redundancy information. The final recognition rate is , since the digits are correctly recognized in all vehicles and the letters in of them. This result is computed over the number of frames correctly recognized, so vehicles with more frames have greater weight in the final result.
The recognition rates accomplished by the proposed system were considerably better than those obtained in previous works ( ), as shown in Table IV. As expected, the commercial systems have also achieved great recognition rates, but only the proposed system was able to correctly recognize at least of the characters in all LPs. This is particularly important since the LP’s identification can be combined with the vehicle’s manufacturer/model or its appearance to further enhance the recognition.
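The frame-based weighting mentioned above can be made concrete with a short sketch. The function name and the (number of frames, recognized correctly) input format are assumptions made for illustration, and the numbers in the usage note are placeholders, not results from the experiments.

```python
from typing import List, Tuple


def frame_weighted_rate(vehicles: List[Tuple[int, bool]]) -> float:
    """Recognition rate where each vehicle counts once per frame.

    `vehicles` holds (number_of_frames, recognized_correctly) pairs, where
    the flag refers to the final prediction after temporal redundancy.
    """
    total = sum(frames for frames, _ in vehicles)
    if total == 0:
        raise ValueError("no frames to evaluate")
    correct = sum(frames for frames, ok in vehicles if ok)
    return correct / total
```

With two hypothetical vehicles, one with 30 frames recognized correctly and one with 10 frames recognized incorrectly, the frame-weighted rate is 0.75, whereas counting each vehicle once would give 0.5.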
|ALPR||characters||All correct (vehicles)|
|Montazzolli and Jung |
|Gonçalves et al.  (with redundancy)||(/)|
|Sighthound (with redundancy)||(/)|
|OpenALPR (with redundancy)||(/)|
|Proposed (with redundancy)||100.00%||93.53% (37/40)|
According to our experiments, the great improvement in our system lies in separating letter and digit recognition into two networks, so each one is tuned specifically for its task. Moreover, data augmentation was essential for letter recognition, since some classes (e.g., C, V) have less than training examples.
In Table V, we report the recall/accuracy rate achieved in each ALPR stage separately, as well as the time required for the proposed system to perform each stage. The reported time is the average time spent processing all inputs in each stage, assuming that the network weights are already loaded.
|ALPR Stage||Recall/Accuracy||Time (ms)||FPS|
|License Plate Detection|
|ALPR (all correct)|
|ALPR (with redundancy)|
Since the same model is used for vehicle and LP detection, the time required for both stages is very similar. The same is true for character segmentation and recognition, but the latter is performed times (one time for each character). The average processing time for each frame was seconds, an average of FPS.
Our system had no difficulty recognizing red LPs, even with fewer training examples. According to our experiments, this is due to the negative images used in the training of the character segmentation and recognition CNNs. Due to the agreement terms of the SSIG dataset, we cannot show qualitative results. Only a few LPs (all from the training set) may be shown to illustrate publications.
V-B Evaluation on the UFPR-ALPR Dataset
V-B1 Vehicle Detection
We first evaluated the Fast-YOLO model, but the recognition rates achieved were not satisfactory. After evaluations with different confidence thresholds, the best recall rate achieved was . This was expected since this dataset has greater variability in vehicle types and positions.
We chose to use the YOLOv2 model for vehicle detection, despite its higher computational cost. We evaluated several confidence thresholds, being the best one, as in the SSIG dataset. The recall and precision rates achieved were and , respectively. Fig. 8 shows a motorcycle and a car detected with the YOLOv2 model.
V-B2 LP Detection
We note that in more challenging images (usually of motorcycles), the vehicle’s LP is not entirely within its predicted bounding box, requiring a small margin ( in the validation set) so that the entire LP is completely within the predicted vehicle’s bounding box. Therefore, we use a margin in the test set and in the training of the LP detection CNN.
The recognition rates obtained by both YOLO models were very similar (less than half a percent difference). Thus, we use the Fast-YOLO model for LP detection. The recall rate attained was (,/,). We were not able to detect the LP in just one vehicle (in its frames), because a false positive was predicted with greater confidence than the actual LP, as shown in Fig. 9.
We could use the character segmentation CNN to perform post-processing in cases where more than one LP is detected, for example: evaluate on each detected LP if there are characters or consider only the LP where the characters’ confidence is greater. However, since the actual LP can be detected with very low confidence levels (i.e., ), many false negatives would have to be analyzed, increasing the overall computational cost of the system.
V-B3 Character Segmentation
In the validation set, a margin of is required so each detected LP contains all its characters fully. We decided not to double the margin in the test set, as doing so would add a considerable amount of noise and background to the LP patches.
The recall obtained was when disregarding the LPs not detected in the previous stage and when considering the whole test set. We accomplished better results in the SSIG dataset, but it is worth noting that our dataset has different LP types and many of them are tilted. Fig. 10 depicts some LPs from different categories properly segmented, even when the LP is tilted or in the presence of shadows.
V-B4 Character Recognition
The best results were obtained with pixel of padding and data augmentation, for both letters and digits. The proposed system achieved a recognition rate of when processing frames individually and (/ vehicles) with temporal redundancy.
Despite the great results obtained in the previous dataset, neither commercial system achieved satisfactory results in the UFPR-ALPR dataset. Analyzing the results, we noticed that a substantial part of the errors occurred in motorcycle images, which suggests that those systems are not so well trained for motorcycles. OpenALPR performed better than Sighthound, attaining a recognition rate of when exploring temporal redundancy information. Table VI shows all results obtained in the UFPR-ALPR dataset.
|ALPR||characters||All correct (vehicles)|
|Sighthound (with redundancy)||(/)|
|OpenALPR (with redundancy)||(/)|
|Proposed (with redundancy)||88.33%||78.33% (47/60)|
We report the recall/accuracy rate achieved in each ALPR stage separately in Table VII, as well as the time required for the proposed system to perform each stage. The vehicle detection stage is more time-consuming in this dataset, as we use a larger CNN architecture (i.e., YOLOv2).
|ALPR Stage||Recall/Accuracy||Time (ms)||FPS|
|License Plate Detection|
|ALPR (all correct)|
|ALPR (with redundancy)|
It is worth noting that despite using a deeper CNN model in vehicle detection (i.e., YOLOv2), our system is still able to process images at FPS (against FPS using Fast-YOLO). This is sufficient for real-time usage, as commercial cameras generally record videos at FPS.
Fig. 11 illustrates some of the recognition results obtained by the proposed system in the UFPR-ALPR dataset. It is noteworthy that our system can generalize well and correctly recognize LPs under different lighting conditions.
In this paper, we have presented a robust real-time end-to-end *alpr system using the state-of-the-art YOLO object detection *cnn. We trained a network for each *alpr stage, except for the character recognition where letters and digits are recognized separately (with two distinct *cnn).
We also introduced a public dataset for *alpr that includes , fully annotated images (with over , *lp characters) from vehicles in real-world scenarios where both vehicle and camera (inside another vehicle) are moving. Compared to the largest Brazilian dataset (*ssig) for this task, our dataset has more than twice the images and contains a larger variety in different aspects.
At present, the bottleneck of *alpr systems is the character segmentation and recognition stages. In this sense, we performed several approaches to increase recognition rates in both stages, such as data augmentation to simulate *lp from other vehicle’s categories and to increase characters with few instances in the training set. Although simple, these strategies were essential to accomplish outstanding results.
Our system was capable to achieve a full recognition rate of ( without temporal redundancy) in the SSIG dataset, considerably outperforming previous results ( with temporal redundancy  and without ) and presenting a performance slightly better than commercial systems (). In addition, the proposed system was the only to correctly recognize at least characters in all *lp.
We also evaluated our proposed *alpr system and two commercial systems as baselines on the new dataset. The results demonstrated that the *dataset dataset is very challenging, since both commercial systems reached recognition rates below . Our system performed better, with a recognition rate of . However, this result is still not satisfactory for some real-world *alpr applications.
As future work, we intend to explore new *cnn architectures to further optimize (in terms of speed) the vehicle and *lp detection stages. We also intend to correct the alignment of inclined *lp and characters in order to improve character segmentation and recognition. Additionally, we plan to explore the vehicle's manufacturer and model in the *alpr pipeline, as our new dataset provides such information. Although our system was conceived and evaluated on two country-specific datasets from Brazil, we believe that the proposed *alpr system is robust enough to locate vehicles, *lp and alphanumeric characters from any other country. In this direction, to obtain a fully robust system, we only need to design a character recognition module that is independent of the *lp layout.
This work was supported by grants from the National Council for Scientific and Technological Development (CNPq) (# 428333/2016-8, # 311053/2016-5 and # 313423/2017-2), the Minas Gerais Research Foundation (FAPEMIG) (APQ-00567-14 and PPM-00540-17) and the Coordination for the Improvement of Higher Education Personnel (CAPES) (DeepEyes Project).
We thank the NVIDIA Corporation for the donation of the GeForce GTX Titan XP Pascal GPU used for this research.
-  S. Du, M. Ibrahim, M. Shehata, and W. Badawy, “Automatic license plate recognition (ALPR): A state-of-the-art review,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 2, pp. 311–325, Feb 2013.
-  C. Gou, K. Wang, Y. Yao, and Z. Li, “Vehicle license plate recognition based on extremal regions and restricted Boltzmann machines,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1096–1107, April 2016.
-  O. Bulan, V. Kozitsky, P. Ramesh, and M. Shreve, “Segmentation- and annotation-free license plate recognition with deep localization and failure identification,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 9, pp. 2351–2363, Sept 2017.
-  J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
-  H. Li and C. Shen, “Reading car license plates using deep convolutional neural networks and LSTMs,” CoRR, vol. abs/1601.05610, 2016. [Online]. Available: http://arxiv.org/abs/1601.05610
-  H. Li, P. Wang, and C. Shen, “Towards end-to-end car license plates detection and recognition with deep neural networks,” CoRR, vol. abs/1709.08828, 2017. [Online]. Available: http://arxiv.org/abs/1709.08828
-  S. Z. Masood, G. Shu, A. Dehghan, and E. G. Ortiz, “License plate detection and recognition using deeply learned convolutional neural networks,” CoRR, vol. abs/1703.07330, 2017. [Online]. Available: http://arxiv.org/abs/1703.07330
-  G. R. Gonçalves, S. P. G. da Silva, D. Menotti, and W. R. Schwartz, “Benchmark for license plate character segmentation,” Journal of Electronic Imaging, vol. 25, no. 5, pp. 053 034–053 034, 2016.
-  G. Ning, Z. Zhang, C. Huang, X. Ren, H. Wang, C. Cai, and Z. He, “Spatially supervised recurrent convolutional neural networks for visual object tracking,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
-  B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 446–454.
-  J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 6517–6525.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 779–788.
-  C. N. E. Anagnostopoulos, I. E. Anagnostopoulos, I. D. Psoroulas, V. Loumos, and E. Kayafas, “License plate recognition from still images and video sequences: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 3, pp. 377–391, Sept 2008.
-  G. S. Hsu, J. C. Chen, and Y. Z. Chung, “Application-oriented license plate recognition,” IEEE Transactions on Vehicular Technology, vol. 62, no. 2, pp. 552–561, Feb 2013.
-  A. H. Ashtari, M. J. Nordin, and M. Fathy, “An Iranian license plate recognition system based on color features,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 4, pp. 1690–1705, Aug 2014.
-  M. S. Sarfraz, A. Shahzad, M. A. Elahi, M. Fraz, I. Zafar, and E. A. Edirisinghe, “Real-time automatic license plate recognition for CCTV forensic applications,” Journal of Real-Time Image Processing, vol. 8, no. 3, pp. 285–295, Sep 2013.
-  R. Panahi and I. Gholampour, “Accurate detection and recognition of dirty vehicle plate numbers for high-speed applications,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 4, pp. 767–779, April 2017.
-  Y. Yuan, W. Zou, Y. Zhao, X. Wang, X. Hu, and N. Komodakis, “A robust and efficient approach to license plate detection,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1102–1114, March 2017.
-  S. Azam and M. M. Islam, “Automatic license plate detection in hazardous condition,” Journal of Visual Communication and Image Representation, vol. 36, pp. 172–186, 2016.
-  S. Montazzolli and C. R. Jung, “Real-time Brazilian license plate detection and recognition using deep convolutional neural networks,” in 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images, Oct 2017, pp. 55–62.
-  G. S. Hsu, A. Ambikapathi, S. L. Chung, and C. P. Su, “Robust license plate detection in the wild,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Aug 2017, pp. 1–6.
-  M. A. Rafique, W. Pedrycz, and M. Jeon, “Vehicle license plate detection using region-based convolutional neural networks,” Soft Computing, Jun 2017.
-  D. Menotti, G. Chiachia, A. X. Falcão, and V. J. O. Neto, “Vehicle license plate recognition with random convolutional networks,” in 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images, Aug 2014, pp. 298–303.
-  P. Svoboda, M. Hradiš, L. Maršík, and P. Zemcík, “CNN for license plate motion deblurring,” in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 3832–3836.
-  G. R. Gonçalves, D. Menotti, and W. R. Schwartz, “License plate recognition based on temporal redundancy,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Nov 2016, pp. 2577–2582.
-  M. Donoser, C. Arth, and H. Bischof, “Detecting, tracking and recognizing license plates,” in Computer Vision – ACCV 2007, Y. Yagi, S. B. Kang, I. S. Kweon, and H. Zha, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 447–456.
-  J. Redmon, “Darknet: Open source neural networks in C,” http://pjreddie.com/darknet/, 2013–2016.
-  OpenALPR Cloud API, http://www.openalpr.com/cloud-api.html.
-  L. Yang, P. Luo, C. C. Loy, and X. Tang, “A large-scale car dataset for fine-grained categorization and verification,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3973–3981.
-  L. Dlagnekov and S. J. Belongie, Recognizing cars. Department of Computer Science and Engineering, University of California, San Diego, 2005.