A Robust Real-Time Automatic License Plate Recognition based on the YOLO Detector

02/26/2018 · Rayson Laroca et al. · Universidade Federal do Paraná and Universidade Federal de Minas Gerais

Automatic License Plate Recognition (ALPR) has been a frequent topic of research due to many practical applications. However, many of the current solutions are still not robust in real-world situations, commonly depending on many constraints. This paper presents a robust and efficient ALPR system based on the state-of-the-art YOLO object detector. The Convolutional Neural Networks (CNNs) are trained and fine-tuned for each ALPR stage so that they are robust under different conditions (e.g., variations in camera, lighting, and background). Especially for character segmentation and recognition, we design a two-stage approach employing simple data augmentation tricks such as inverted License Plates (LPs) and flipped characters. The resulting ALPR approach achieved impressive results in two datasets. First, in the SSIG dataset, composed of 2,000 frames from 101 vehicle videos, our system achieved a recognition rate of 93.53%, better than both the Sighthound and OpenALPR commercial systems (89.80% and 93.03%, respectively) and considerably outperforming previous results (81.80%). Second, targeting a more realistic scenario, we introduce a larger public dataset, called the UFPR-ALPR dataset, designed for ALPR. This dataset contains 150 videos and 4,500 frames captured when both camera and vehicles are moving and also contains different types of vehicles (cars, motorcycles, buses and trucks). In our proposed dataset, the trial versions of the commercial systems achieved recognition rates below 70%, while our system performed better, with a recognition rate of 78.33%.



I Introduction

Automatic License Plate Recognition (ALPR) has been a frequent topic of research [1, 2, 3] due to many practical applications, such as automatic toll collection, traffic law enforcement, private space access control and road traffic monitoring.

ALPR systems typically have three stages: License Plate (LP) detection, character segmentation and character recognition. The earlier stages require high accuracy or near perfection, since failing to detect the LP would probably lead to a failure in the subsequent stages as well. Many approaches search first for the vehicle and then for its LP in order to reduce processing time and eliminate false positives.

Although ALPR has been frequently addressed in the literature, many studies and solutions are still not robust enough in real-world scenarios. These solutions commonly depend on certain constraints, such as specific cameras or viewing angles, simple backgrounds, good lighting conditions, a search in a fixed region, and certain types of vehicles (they would not detect LPs from vehicles such as motorcycles, trucks or buses).

Many computer vision tasks have recently achieved a great increase in performance mainly due to the availability of large-scale annotated datasets (e.g., ImageNet [4]) and hardware (GPUs) capable of handling large amounts of data. In this scenario, Deep Learning (DL) techniques arise. However, despite the remarkable progress of DL approaches in ALPR [5, 6, 7], there is still a great demand for ALPR datasets with vehicle and LP annotations. The amount of training data is decisive for the performance of DL techniques: larger amounts of data allow the use of more robust network architectures, with more parameters and layers. Hence, we propose a larger benchmark dataset, called UFPR-ALPR, focused on different real-world scenarios.

To the best of our knowledge, the SSIG dataset [8] is the largest public dataset of Brazilian LPs. This dataset contains relatively few training examples and has several constraints, such as: it uses a static camera always mounted in the same position, all images have very similar and relatively simple backgrounds, there are no motorcycles, and there are only a few cases in which the LPs are not well aligned.

When recording the UFPR-ALPR dataset, we sought to eliminate many of the constraints found in ALPR applications by using three different non-static cameras to capture 4,500 images from different types of vehicles (cars, motorcycles, buses, trucks, among others) with complex backgrounds and under different lighting conditions. The vehicles are in different positions and at different distances from the camera. Furthermore, in some cases, the vehicle is not fully visible in the image. To the best of our knowledge, there are no other public datasets for ALPR with annotations of cars, motorcycles, LPs and characters. Therefore, we can point out two main challenges in our dataset. First, car and motorcycle LPs usually have different aspect ratios, not allowing ALPR approaches to use this constraint to filter false positives. Second, car and motorcycle LPs have different layouts and positions.

As great advances in object detection have been achieved through YOLO-inspired models [9, 10], we decided to fine-tune it for ALPR. YOLOv2 [11] is a state-of-the-art real-time object detector whose model has 19 convolutional layers and 5 max-pooling layers. On the other hand, Fast-YOLO [12] is a model focused on a speed/accuracy trade-off that uses fewer convolutional layers (9 instead of 19) and fewer filters in those layers. Therefore, Fast-YOLO is much faster but less accurate than YOLOv2.

In this work, we propose a new robust real-time ALPR system based on the YOLO object detection CNNs. Since we are processing video frames, we also employ temporal redundancy: we process each frame independently and then combine the results to create a more robust prediction for each vehicle.

The proposed system outperforms previous results and two commercial systems on the SSIG dataset as well as on our proposed UFPR-ALPR dataset. The main contributions of this paper can be summarized as follows:

  • A new real-time end-to-end ALPR system using the state-of-the-art YOLO object detection CNNs. The entire ALPR system, i.e., the architectures and weights, is publicly available for academic purposes;

  • A robust two-stage approach for character segmentation and recognition, mainly due to simple data augmentation tricks for the training data, such as inverted LPs and flipped characters;

  • A public dataset for ALPR with 4,500 fully annotated images (over 30,000 LP characters) focused on usual and different real-world scenarios, showing that our proposed ALPR system yields outstanding results in both scenarios;

  • A comparative evaluation of the proposed approach, previous works in the literature and two commercial systems on the UFPR-ALPR dataset.

This paper is organized as follows. We briefly review related work in Section II. The UFPR-ALPR dataset is introduced in Section III. Section IV presents the proposed ALPR system using object detection CNNs. We report and discuss the results of our experiments in Section V. Conclusions and future work are given in Section VI.

II Related Work

In this section, we briefly review several recent works that use DL approaches in the context of ALPR. For relevant studies using conventional image processing techniques, please refer to [13, 14, 15, 16, 1, 17, 2, 18, 19]. More specifically, we discuss works related to each ALPR stage, and especially works that do not fit into the other subsections. This section concludes with final remarks.

LP Detection: Many authors have addressed the LP detection stage with object detection CNNs. Montazzolli and Jung [20] used a single CNN arranged in a cascaded manner to detect both car frontal views and their LPs, achieving high recall and precision rates. Hsu et al. [21] customized CNNs exclusively for LP detection and demonstrated that the modified versions perform better. Rafique et al. [22] applied SVMs and R-CNN for LP detection, noting that R-CNNs are best suited for real-time systems.

Li and Chen [5] trained a CNN based on characters cropped from general text to perform a character-based LP detection, achieving higher recall and precision rates than previous approaches. Bulan et al. [3] first extract a set of candidate LP regions using a weak SNoW classifier and then filter them using a strong CNN, significantly improving the baseline method.

Character Segmentation: ALPR systems based on DL techniques usually address character segmentation and recognition together. Montazzolli and Jung [20] propose a CNN to segment and recognize the characters within a cropped LP. They correctly segmented the vast majority of the characters, outperforming the baseline by a large margin.

Bulan et al. [3] achieved very high accuracy in LP recognition by jointly performing character segmentation and recognition using Hidden Markov Models (HMMs), where the most likely LP is determined by applying the Viterbi algorithm.

Character Recognition: Menotti et al. [23] proposed the use of random CNNs to extract features for character recognition, achieving significantly better performance than using image pixels or learning the filter weights with back-propagation. Li and Chen [5] proposed to perform character recognition as a sequence labelling problem: a Recurrent Neural Network (RNN) with Connectionist Temporal Classification (CTC) is employed to label the sequential data, recognizing the whole LP without character-level segmentation.

Although Svoboda et al. [24] did not perform the character recognition itself, they achieved high-quality LP deblurring reconstructions using a text deblurring CNN, which can be very useful for character recognition.

Miscellaneous: Masood et al. [7] presented an end-to-end ALPR system using a sequence of deep CNNs. As this is a commercial system, little information is given about the CNNs used. Li et al. [6] propose a unified CNN that can locate LPs and recognize them simultaneously in a single forward pass. In addition, the model size is greatly decreased by sharing many of its convolutional features.

Final Remarks: Many papers only address part of the ALPR pipeline (e.g., LP detection) or perform their experiments on datasets that do not represent real-world scenarios, making it difficult to accurately evaluate the presented methods. In addition, most of the approaches are not capable of recognizing LPs in real time, making them unsuitable for some applications. In this sense, we employ the YOLO object detection CNNs in each stage to create a robust and efficient end-to-end ALPR system. In addition, we perform data augmentation for character recognition, since this stage is the bottleneck in some ALPR systems.

III The UFPR-ALPR Dataset

Fig. 1: Sample images of the UFPR-ALPR dataset. The first three rows show the variety in backgrounds, lighting conditions, as well as vehicle/LP positions and types. The fourth row shows examples of vehicle and LP annotations. The LPs were blurred due to privacy constraints.

The dataset contains 4,500 images taken from inside a vehicle driving through regular traffic in an urban environment. These images were obtained from 150 videos with a duration of one second and a frame rate of 30 frames per second (FPS). Thus, the dataset is divided into 150 vehicles, each with 30 images in which only one LP is visible in the foreground. It is noteworthy that no stabilization method was used. Fig. 1 shows the diversity of the dataset.

The images were acquired with three different cameras and are available in the PNG format with a resolution of 1,920 × 1,080 pixels. The cameras used were: GoPro Hero4 Silver, Huawei P9 Lite and iPhone 7 Plus. Images obtained with different cameras do not necessarily have the same quality, although they have the same resolution and frame rate. This is due to different camera specifications, such as autofocus, bit rate, focal length and optical image stabilization.

There are minor variations in the camera position due to repeated mountings of the camera and also to simulate a real condition, where the camera is not always placed in exactly the same position.

We collected 1,500 images with each camera, comprising cars with gray LPs, cars with red LPs and motorcycles with gray LPs. In Brazil, LPs have size and color variations depending on the type of the vehicle and its category. Car LPs have a size of 40 cm × 13 cm, while motorcycle LPs measure 20 cm × 17 cm. Private vehicles have gray LPs, while buses, taxis and other transportation vehicles have red LPs. There are other color variations for specific categories, such as official or older cars. Fig. 2 shows some of the different LP types found in the dataset.

Fig. 2: Examples of the different LP types found in the UFPR-ALPR dataset. In Brazil, car LPs have 3 letters and 4 digits in the same row, and motorcycle LPs have 3 letters in one row and 4 digits in another.

The dataset is split as follows: 40% for training, 40% for testing and 20% for validation, using the same protocol division proposed by Gonçalves et al. [8] for the SSIG dataset. The split was made so that each subset has the same number of images obtained with each camera, taking into account the type and position of the vehicle, the color and the characters of the vehicle's LP, and the distance of the vehicle from the camera (based on the height of the LP in pixels), such that each subset is as representative as possible.

The heat maps of the distribution of the vehicles and LPs across the image frame in both the SSIG and UFPR-ALPR datasets are shown in Fig. 3. As can be seen, the vehicles and LPs are much better distributed in our dataset.

Fig. 3: Heat maps illustrating the distribution of vehicles and LPs in the SSIG and UFPR-ALPR datasets. The heat maps are log-normalized, meaning the distribution is even more concentrated than it appears.

In Brazil, each state uses particular starting letters for its LPs, which results in a specific range. In Paraná (where the dataset was collected), LPs range from AAA-0001 to BEZ-9999. Therefore, the letters A and B have many more examples than the others, as shown in Fig. 4.

Fig. 4: Letter distribution in the UFPR-ALPR dataset.

Every image has the following annotations available in a text file: the camera with which the image was taken; the vehicle's position and information such as type (car or motorcycle), manufacturer, model and year; the identification and position of the LP; and the position of its characters. Fig. 1 shows the bounding boxes of different types of vehicles and LPs.

IV Proposed ALPR Approach

This section describes the proposed approach, divided into four subsections: one for each of the ALPR stages (i.e., vehicle and LP detection, character segmentation and character recognition) and one for temporal redundancy. Fig. 5 illustrates the ALPR pipeline, explained throughout this section.

Fig. 5: A typical ALPR pipeline with temporal redundancy at the end.

We use a specific CNN for each ALPR stage, so we can tune the parameters separately in order to improve the performance of each task. The models used are Fast-YOLO, YOLOv2 and CR-NET [20], an architecture inspired by Fast-YOLO for character segmentation and recognition.

IV-A Vehicle and LP Detection

We train two CNNs in this stage: one for vehicle detection in the input image and another for LP detection in the detected vehicle patch. Recent works [20, 25] also performed vehicle detection first.

We evaluated both the Fast-YOLO and YOLOv2 models at this stage in order to handle both simpler (i.e., SSIG) and more realistic (i.e., UFPR-ALPR) data. For simpler scenarios, Fast-YOLO should be able to detect the vehicles and their LPs correctly in much less time. However, for more realistic scenarios it might not be deep enough to perform these tasks.

In order to use both YOLO models (for training YOLOv2 and Fast-YOLO we used convolutional weights pre-trained on ImageNet [4], available at https://pjreddie.com/darknet/yolo/), we need to change the number of filters in the last convolutional layer to match the number of classes. YOLO uses A anchor boxes to predict bounding boxes (we use A = 5), each with four coordinates (x, y, w, h), a confidence score and C class probabilities [11], so the number of filters is given by

filters = (C + 5) × A.   (1)

In a dataset such as the SSIG dataset, we intend to detect only one class in both vehicle and LP detection (first the car and then its LP), so the number of filters in each task is reduced to 30. On the other hand, the UFPR-ALPR dataset includes images of cars and motorcycles (two classes), so the number of filters in the vehicle detection task must be 35. In our tests, the results were better when using two classes (instead of a single class called 'vehicle'). The Fast-YOLO architecture used in both tasks is shown in Table I. The same changes were made to the YOLOv2 model architecture (not shown due to lack of space).
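As a worked example of Eq. 1, a minimal sketch in Python (assuming the A = 5 anchor boxes used here):

def last_layer_filters(num_classes, num_anchors=5):
    # Eq. 1: each anchor predicts 4 box coordinates, 1 confidence score
    # and one probability per class.
    return (num_classes + 5) * num_anchors

print(last_layer_filters(1))  # 30 filters for a single class (vehicle or LP)
print(last_layer_filters(2))  # 35 filters for two classes (car and motorcycle)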

Layer Filters Size Input Output
conv
max
conv
max
conv
max
conv
max
conv
max
conv
max
conv
conv
conv
detection
TABLE I: Fast-YOLO network used in both vehicle and LP detection. There are either 30 or 35 filters in the last convolutional layer to detect one or two classes, respectively.

While the entire frame and the vehicle coordinates are used as inputs to train the vehicle detection CNN, the vehicle patch (with a margin) and the coordinates of its LP are used to learn the LP detection network. The size of the margin is defined as follows: we evaluated, on the validation set, the margin required so that all LPs would be completely within the bounding boxes of the vehicles found by the vehicle detection CNN. This is done to avoid losing LPs in cases where the vehicle is not very well detected/segmented.

By default, YOLO only returns objects detected with a confidence of 0.25 or higher. On the validation set, we evaluated the best threshold in order to detect all vehicles with the lowest false positive rate. A negative recognition result is given in cases where no vehicle is found. For LP detection we use a threshold equal to 0, as there might be cases where the LP is detected with very low confidence. We keep only the detection with the largest confidence in cases where more than one LP is detected, since each vehicle has only one LP.
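A minimal sketch of this selection logic (hypothetical detection tuples, not the Darknet API):

def filter_vehicles(detections, threshold):
    # Keep vehicle detections above the confidence threshold chosen on the
    # validation set; an empty list is reported as a negative recognition.
    return [(box, conf) for box, conf in detections if conf >= threshold]

def select_lp(detections):
    # Threshold 0: any returned box is considered, but only the LP detection
    # with the largest confidence is kept, since each vehicle has one LP.
    return max(detections, key=lambda d: d[1]) if detections else None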

IV-B Character Segmentation

Once the LP has been detected, we employ the CNN proposed by Montazzolli and Jung [20] (CR-NET) for character segmentation and recognition. However, instead of performing both stages at the same time through an architecture with 35 classes (0-9, A-Z, where the letter O is detected jointly with the digit 0), we chose to first use a network to segment the characters and then two others to recognize them. Knowing that all Brazilian LPs have the same format (three letters and four digits), we use 26 classes for letters and 10 classes for digits. As pointed out by Gonçalves et al. [25], this reduces incorrect classifications.

The character segmentation CNN (architecture described in Table II) is trained using the LP patch (with a margin) and the character coordinates as inputs. As in the previous stage, this margin is defined based on the validation set to ensure that all characters are completely within the predicted LP.

Layer Filters Size Input Output
conv
max
conv
max
conv
conv
conv
max
conv
conv
conv
conv
conv
conv
conv
detection
TABLE II: Character segmentation CNN, proposed in [20]. We changed the number of filters in the last convolutional layer to match a single class, as we want to first segment the characters only.

The CNN input size was chosen based on the aspect ratio of Brazilian car LPs (roughly 3:1), whereas motorcycle LPs are nearly square. Therefore, we horizontally stretch all detected LPs to the input aspect ratio before performing the character segmentation.
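A minimal sketch of this stretching step, assuming OpenCV and an illustrative wide input size (the exact network dimensions are a parameter here, not the values used in the paper):

import cv2

def stretch_lp(lp_patch, net_width=240, net_height=80):
    # Resize every detected LP patch to the wide (roughly 3:1) network input;
    # nearly square motorcycle LPs are therefore stretched horizontally to the
    # same aspect ratio as car LPs before character segmentation.
    return cv2.resize(lp_patch, (net_width, net_height), interpolation=cv2.INTER_LINEAR)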

We also create a negative image of each LP, thereby doubling the number of training samples. Since the color of the characters in Brazilian LPs depends on the category of the vehicle (e.g., private or commercial), the negative images simulate characters from other categories.

In some cases, more than 7 characters might be detected. If there are no overlaps (i.e., the IoU between the boxes is below a threshold), we discard the detections with the lowest confidence levels. Otherwise, we merge the overlapping characters into a single character. As motorcycle LPs can be very tilted, we use a higher IoU threshold when deciding whether their characters overlap.
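A minimal sketch of this filtering, with the IoU threshold left as a parameter (the exact values are tuned on the validation set):

def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / float(union) if union else 0.0

def reduce_to_seven(chars, overlap_thr):
    # chars: list of (box, confidence). Overlapping boxes are merged into a
    # single character; otherwise the lowest-confidence detections are
    # discarded until only the 7 expected characters remain.
    kept = []
    for box, conf in sorted(chars, key=lambda c: -c[1]):
        hit = next((k for k in kept if iou(box, k[0]) > overlap_thr), None)
        if hit is None:
            kept.append((box, conf))
        else:  # union of the overlapping boxes, treated as one character
            merged = (min(box[0], hit[0][0]), min(box[1], hit[0][1]),
                      max(box[2], hit[0][2]), max(box[3], hit[0][3]))
            kept[kept.index(hit)] = (merged, max(conf, hit[1]))
    return kept[:7]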

IV-C Character Recognition

Since many characters might not be perfectly segmented, containing missing parts, and since each character is relatively small, even a one-pixel difference between the ground truth and the prediction might impair the character's recognition. Therefore, we evaluate different padding values (in pixels) for the segmented characters to achieve higher recognition rates. As Fig. 6 illustrates, the more padding pixels, the more noise is added (e.g., portions of other characters or of the LP frame).

Fig. 6: Comparison of different padding values.
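A minimal sketch of the padding step (the padding value is the parameter evaluated above):

def pad_character(lp_patch, box, padding):
    # Crop a segmented character from the LP patch with `padding` extra
    # pixels on every side, clipped to the patch borders.
    h, w = lp_patch.shape[:2]
    x1, y1, x2, y2 = box
    return lp_patch[max(0, y1 - padding):min(h, y2 + padding),
                    max(0, x1 - padding):min(w, x2 + padding)]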

As previously mentioned, we use two networks for character recognition. To train these networks, the characters and their labels are passed as input. For digit recognition, we removed the first four layers of the character segmentation CNN, since in our tests the results were similar but with a lower computational cost. However, for letter recognition (more classes and fewer examples) we still use the entire architecture of the character segmentation CNN. The number of filters in the last convolutional layer of the digit and letter networks is set according to Eq. 1 (10 and 26 classes, respectively).

The use of two networks allows the network parameters (e.g., input/output size) to be tuned for each task. The best network input sizes found in our experiments differ between digits and letters.

Knowing the country-specific LP layout (e.g., the Brazilian layout), we know which characters are letters and which are digits by their position. We sort the segmented characters by their horizontal positions for cars and by their vertical positions for motorcycles. The first three characters correspond to the letters and the last four to the digits, even in cases where the LP is considerably tilted. It is worth noting that a country (e.g., the USA) might have several different LP layouts, so this approach would not be suitable in such cases.
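A minimal sketch of this position-based split (boxes are hypothetical (x, y) top-left coordinates):

def split_letters_digits(char_boxes, is_motorcycle):
    # Car LPs have a single row, so characters are sorted horizontally;
    # motorcycle LPs have letters above digits, so they are sorted
    # vertically first and horizontally within each row.
    if is_motorcycle:
        ordered = sorted(char_boxes, key=lambda b: (b[1], b[0]))
    else:
        ordered = sorted(char_boxes, key=lambda b: b[0])
    return ordered[:3], ordered[3:]  # (letters, digits) in the Brazilian layout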

In addition to training with the characters available in the training set, we also perform data augmentation in two ways. First, we create negative images to simulate characters from other vehicle categories (as in the character segmentation stage); then, we also check which characters can be flipped horizontally and vertically to create new instances. Table III shows which characters can be flipped in each direction (see the sketch after the table).

Flip Direction Characters
Vertical 0, 1, 3, 8, B, C, D, E, H, I, K, O, X
Horizontal 0, 1, 8, A, H, I, M, O, T, U, V, W, X, Y
Both 0, 1, 6(9), 8, 9(6), H, I, N, O, S, X, Z
TABLE III: The characters that can be flipped in each direction to create new instances. We also use the digits 0 and 1 as training examples for the letters O and I, respectively.
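A minimal sketch of this augmentation following Table III, assuming OpenCV ('vertical' is interpreted as a flip about the horizontal axis, flip code 0; 'horizontal' as a mirror, flip code 1; a flip in both directions relabels 6 as 9 and vice versa):

import cv2

VERTICAL = set("0138BCDEHIKOX")
HORIZONTAL = set("018AHIMOTUVWXY")
BOTH = {"0": "0", "1": "1", "6": "9", "8": "8", "9": "6",
        "H": "H", "I": "I", "N": "N", "O": "O", "S": "S", "X": "X", "Z": "Z"}

def augment_character(image, label):
    # Return the extra (image, label) training pairs produced by flipping.
    extra = []
    if label in VERTICAL:
        extra.append((cv2.flip(image, 0), label))
    if label in HORIZONTAL:
        extra.append((cv2.flip(image, 1), label))
    if label in BOTH:
        extra.append((cv2.flip(image, -1), BOTH[label]))
    return extra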

As in the LP detection step, we use a confidence threshold of 0 and consider only the detection with the largest confidence. Hence, we ensure that a class is predicted for every segmented character.

IV-D Temporal Redundancy

After performing the LP recognition on single frames, we explore temporal redundancy by combining all frames belonging to the same vehicle. The final recognition is composed of the most frequently predicted character at each LP position (majority vote).

Temporal information has already been explored in ALPR [25, 26]. In both studies, the use of majority voting greatly increased recognition rates.
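A minimal sketch of the majority vote over the frames of one vehicle (predictions are 7-character strings, one per frame):

from collections import Counter

def majority_vote(frame_predictions):
    # For each of the 7 character positions, keep the character predicted
    # most frequently across all frames of the same vehicle.
    return "".join(
        Counter(pred[i] for pred in frame_predictions).most_common(1)[0][0]
        for i in range(7))

print(majority_vote(["ABC1234", "ABC1294", "A8C1234"]))  # ABC1234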

V Experimental Results

In this section, we conduct experiments to verify the effectiveness of the proposed ALPR system. All experiments were performed on an NVIDIA Titan Xp GPU (3,840 CUDA cores and 12 GB of RAM) using the Darknet framework [27].

We consider as correct only the detections with IoU ≥ 0.5. This value was chosen based on previous works [20, 6, 18]. In addition, the networks were trained with a fixed maximum number of iterations (max batches) and a step-decay learning rate schedule.

Experiments were conducted on two datasets: SSIG and UFPR-ALPR. We report the results obtained by the proposed system and compare them with previous work and two commercial systems: Sighthound [7] and OpenALPR [28]. Both have Cloud APIs, available at https://www.openalpr.com/cloud-api.html and https://www.sighthound.com/products/cloud, respectively; the results presented here were obtained in January 2018. Although OpenALPR has an open-source version, its commercial version uses different OCR algorithms trained with larger datasets to improve accuracy. According to the authors, both systems are robust in the detection and recognition of Brazilian LPs.

It is important to emphasize that although the commercial systems were not tuned for these datasets, they use much larger private datasets, which is a great advantage, especially for DL approaches.

In the OpenALPR system we can choose which LP style we want to detect (i.e., Brazilian), so we do not need to make any changes. On the other hand, Sighthound uses a single model for LPs from different countries. Therefore, we made some adjustments to its predictions so that they fit the Brazilian LP format, such as swapping 0 by O and vice versa.
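A minimal sketch of this adjustment (a hypothetical post-processing step; only the 0/O and 1/I swaps are taken from the paper, and the layout split follows the 3-letter/4-digit format):

LETTER_FOR_DIGIT = {"0": "O", "1": "I"}
DIGIT_FOR_LETTER = {"O": "0", "I": "1"}

def to_brazilian_format(prediction):
    # Force a 7-character prediction into the 3-letter / 4-digit layout,
    # swapping visually similar characters by position (e.g., 0 by O).
    letters = [LETTER_FOR_DIGIT.get(c, c) for c in prediction[:3]]
    digits = [DIGIT_FOR_LETTER.get(c, c) for c in prediction[3:7]]
    return "".join(letters + digits)

print(to_brazilian_format("0BC123O"))  # OBC1230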

V-A Evaluation on the SSIG Dataset

The SSIG dataset [8] is composed of 2,000 images of vehicles with the following annotations: the position of the vehicle's LP, its identification (e.g., ABC-1234) and the position of each character.

The high-resolution images (1,920 × 1,080 pixels) were acquired with a static digital camera and are available in the PNG format. A sample frame of the dataset is shown in Fig. 7.

Fig. 7: A sample frame of the SSIG dataset. It should be noted that there are vehicles in the background that do not have annotations. The LPs were blurred due to privacy constraints.

The SSIG dataset uses the following evaluation protocol: 40% of the dataset for training, 20% for validation and 40% for testing. According to the authors, this protocol was adopted because many character segmentation approaches do not require model estimation, and a larger test set allows the reported results to be more statistically significant.

We report only the results obtained with the Fast-YOLO model in the vehicle and LP detection subsections, since it achieved impressive recall and precision rates in both tasks.

V-A1 Vehicle Detection

Since the SSIG dataset does not have vehicle annotations, we manually labeled the vehicle's bounding box in each image of the dataset. Another possible approach would be to train a vehicle detector using the large-scale CompCars dataset [29], but then many vehicles (including those in the background) would also be detected.

To perform the vehicle detection, we first evaluated different confidence thresholds. We started with a high confidence value, but some vehicles were not detected. All vehicles in the validation set were successfully detected once the threshold was reduced. Based on that, we decided to use half of that value in the test set to increase the chance that all vehicles are detected. With this threshold, we achieved a near-perfect recall and very high precision (only a few false positives).

V-A2 LP Detection

Every vehicle in the validation set was well segmented, with its LP completely within the predicted bounding box. Therefore, we use the vehicle patches without any margin to train the LP detection network. As expected, all LPs were correctly detected in both the validation and test sets (recall and precision = 100%).

V-A3 Character Segmentation

A small margin (relative to the bounding box size) is required so that each detected LP contains all of its characters. Therefore, we double this margin in the test set and in the training of the character segmentation CNN.

On the validation set, we evaluated several confidence thresholds, and the recall achieved was the same regardless of the threshold. Therefore, we chose to use the lowest of these thresholds in the test set to miss as few characters as possible. In this way, we achieved a very high character recall.

V-A4 Character Recognition

The padding values that yielded the best recognition rates in the validation set were 2 pixels for letters and 1 pixel for digits. In addition, data augmentation with flipped characters only improved letter recognition, while hampering digit recognition. We believe that a larger padding and data augmentation improve letter recognition because each letter class has far fewer training examples than each digit class.

We first analyzed the results without temporal redundancy information. Even in this setting, the proposed system achieved a high recognition rate, correctly recognizing all three letters and all four digits in most of the frames.

The results are greatly improved when taking advantage of temporal redundancy information. The final recognition rate is 93.53%, since the digits are correctly recognized in all vehicles and the letters in 37 of the 40 test vehicles. This result is based on the number of frames correctly recognized, so vehicles with more frames have greater weight in the final result.

The recognition rates accomplished by the proposed system were considerably better than those obtained in previous works, as shown in Table IV. As expected, the commercial systems also achieved high recognition rates, but only the proposed system was able to correctly recognize at least 6 of the 7 characters in all LPs. This is particularly important, since the LP's identification can be combined with the vehicle's manufacturer/model [30] or its appearance [25] to further enhance the recognition.

ALPR    ≥ 6 characters    All correct (vehicles)
Montazzolli and Jung [20]
Sighthound [7]
Proposed
OpenALPR [28]
Gonçalves et al. [25] (with redundancy) (/)
Sighthound (with redundancy) (/)
OpenALPR (with redundancy) (/)
Proposed (with redundancy) 100.00% 93.53% (37/40)
TABLE IV: Recognition rates obtained by the proposed ALPR system, previous works and commercial systems on the SSIG dataset.

According to our experiments, the great improvement in our system lies in separating letter and digit recognition into two networks, so that each one is tuned specifically for its task. Moreover, data augmentation was essential for letter recognition, since some classes (e.g., C, V) have very few training examples.

In Table V, we report the recall/accuracy achieved in each ALPR stage separately, as well as the time required for the proposed system to perform each stage. The reported time is the average time spent processing all inputs in each stage, assuming that the network weights are already loaded.

ALPR Stage Recall/Accuracy Time (ms) FPS
Vehicle Detection
License Plate Detection
Character Segmentation
Character Recognition
ALPR (all correct)
ALPR (with redundancy)
TABLE V: Results obtained and the computational time required for each ALPR stage on the SSIG dataset. Recall stands for detection and segmentation, and accuracy stands for recognition.

Since the same model is used for vehicle and LP detection, the time required for both stages is very similar. The same is true for character segmentation and recognition, but the latter is performed 7 times (once for each character). The average processing time per frame corresponds to real-time operation (dozens of FPS).

Our system had no difficulty recognizing red LPs, even with fewer training examples. According to our experiments, this is due to the negative images used in the training of the character segmentation and recognition CNNs. Due to the agreement terms of the SSIG dataset, we cannot show qualitative results; only a few LPs (all from the training set) can be shown for illustration in publications.

V-B Evaluation on the UFPR-ALPR Dataset

V-B1 Vehicle Detection

We first evaluated the Fast-YOLO model, but the recognition rates achieved were not satisfactory: even after evaluating different confidence thresholds, the best recall rate attained was too low. This was expected, since this dataset has greater variability in vehicle types and positions.

We therefore chose the YOLOv2 model for vehicle detection, despite its higher computational cost. We evaluated several confidence thresholds, and the best one was the same as in the SSIG dataset. The recall and precision rates achieved were both very high. Fig. 8 shows a motorcycle and a car detected with the YOLOv2 model.

Fig. 8: Examples of the detection obtained with the YOLOv2 model.

V-B2 LP Detection

We note that in more challenging images (usually of motorcycles), the vehicle's LP is not entirely within its predicted bounding box, requiring a small margin (estimated on the validation set) so that the entire LP is completely within the predicted vehicle's bounding box. Therefore, we use a slightly larger margin in the test set and in the training of the LP detection CNN.

The recognition rates obtained by both YOLO models were very similar (less than half a percentage point of difference), so we use the Fast-YOLO model for LP detection. The recall rate attained was very high: we were unable to detect the LP of only one vehicle (in all of its frames), because a false positive was predicted with greater confidence than the actual LP, as shown in Fig. 9.

Fig. 9: A sample frame from the UFPR-ALPR dataset where the actual LP was not predicted with the highest confidence. The predicted position and the ground truth are outlined in red and green, respectively. The LP was blurred due to privacy constraints.

We could use the character segmentation CNN as a post-processing step in cases where more than one LP is detected, for example, by checking whether each detected LP actually contains characters or by keeping only the LP whose characters have the greatest confidence. However, since the actual LP can be detected with very low confidence, many false negatives would have to be analyzed, increasing the overall computational cost of the system.

V-B3 Character Segmentation

On the validation set, a small margin is required so that each detected LP contains all of its characters. We decided not to double this margin in the test set, as that would add a considerable amount of noise and background to the LP patches.

The recall obtained was high when disregarding the LPs not detected in the previous stage and slightly lower when considering the whole test set. We accomplished better results on the SSIG dataset, but it is worth noting that our dataset has different LP types and many of them are tilted. Fig. 10 depicts LPs from different categories properly segmented, even when the LP is tilted or in the presence of shadows.

Fig. 10: LPs from different categories properly segmented.

V-B4 Character Recognition

The best results were obtained with 1 pixel of padding and data augmentation, for both letters and digits. The proposed system achieved a recognition rate of 78.33% (47/60 vehicles) when exploring temporal redundancy, which is considerably higher than when processing frames individually.

Despite the great results obtained on the previous dataset, neither commercial system achieved satisfactory results on the UFPR-ALPR dataset. Analyzing the results, we noticed that a substantial part of the errors occurred on motorcycle images, suggesting that those systems are not as well trained for motorcycles. OpenALPR performed better than Sighthound, but still attained a recognition rate below 70% even when exploring temporal redundancy information. Table VI shows all results obtained on the UFPR-ALPR dataset.

ALPR    ≥ 6 characters    All correct (vehicles)
Sighthound [7]
OpenALPR [28]
Proposed
Sighthound (with redundancy) (/)
OpenALPR (with redundancy) (/)
Proposed (with redundancy) 88.33% 78.33% (47/60)
TABLE VI: Recognition rates obtained by the proposed ALPR system and the commercial systems on the UFPR-ALPR dataset.

We report the recall/accuracy achieved in each ALPR stage separately in Table VII, as well as the time required for the proposed system to perform each stage. The vehicle detection stage is more time-consuming in this dataset, as we use a larger CNN architecture (i.e., YOLOv2).

ALPR Stage Recall/Accuracy Time (ms) FPS
Vehicle Detection
License Plate Detection
Character Segmentation
Character Recognition
ALPR (all correct)
ALPR (with redundancy)
TABLE VII: Results obtained and the computational time required for each stage on the UFPR-ALPR dataset. Recall stands for detection and segmentation, and accuracy stands for recognition.

It is worth noting that despite using a deeper CNN model for vehicle detection (i.e., YOLOv2), our system is still able to process images in real time, only somewhat slower than with Fast-YOLO. This is sufficient for real-time usage, as commercial cameras generally record videos at 30 FPS.

Fig. 11 illustrates some of the recognition results obtained by the proposed system on the UFPR-ALPR dataset. It is noteworthy that our system is able to generalize well and correctly recognize LPs under different lighting conditions.

Fig. 11: Qualitative results obtained by the proposed ALPR system on the UFPR-ALPR dataset. The first two rows show examples of correctly detected but incorrectly recognized LPs, while the remaining rows show samples of LPs (from different categories) successfully recognized.

VI Conclusions

In this paper, we have presented a robust real-time end-to-end ALPR system using state-of-the-art YOLO object detection CNNs. We trained a network for each ALPR stage, except for character recognition, where letters and digits are recognized separately (with two distinct CNNs).

We also introduced a public dataset for ALPR that includes 4,500 fully annotated images (with over 30,000 LP characters) from vehicles in real-world scenarios where both the vehicle and the camera (inside another vehicle) are moving. Compared to the largest Brazilian dataset (SSIG) for this task, our dataset has more than twice the number of images and contains a larger variety in several aspects.

At present, the bottleneck of ALPR systems is the character segmentation and recognition stages. In this sense, we applied several strategies to increase recognition rates in both stages, such as data augmentation to simulate LPs from other vehicle categories and to increase the number of examples of characters with few instances in the training set. Although simple, these strategies were essential to accomplish outstanding results.

Our system achieved a full recognition rate of 93.53% in the SSIG dataset (lower without temporal redundancy), considerably outperforming previous results (81.80% with temporal redundancy [25]) and performing slightly better than the commercial systems (89.80% and 93.03%). In addition, the proposed system was the only one to correctly recognize at least 6 characters in all LPs.

We also evaluated our proposed ALPR system and two commercial systems as baselines on the new dataset. The results demonstrate that the UFPR-ALPR dataset is very challenging, since both commercial systems reached recognition rates below 70%. Our system performed better, with a recognition rate of 78.33%. However, this result is still not satisfactory for some real-world ALPR applications.

As future work, we intend to explore new CNN architectures to further optimize (in terms of speed) the vehicle and LP detection stages. We also intend to correct the alignment of inclined LPs and characters in order to improve character segmentation and recognition. Additionally, we plan to explore the vehicle's manufacturer and model in the ALPR pipeline, as our new dataset provides this information. Although our system was conceived and evaluated on two country-specific datasets from Brazil, we believe that the proposed ALPR system is robust enough to locate vehicles, LPs and alphanumeric characters from other countries as well. In this direction, to obtain a fully robust system, we only need to design a character recognition module that is independent of the LP layout.

Acknowledgments

This work was supported by grants from the National Council for Scientific and Technological Development (CNPq) (# 428333/2016-8, # 311053/2016-5 and # 313423/2017-2), the Minas Gerais Research Foundation (FAPEMIG) (APQ-00567-14 and PPM-00540-17) and the Coordination for the Improvement of Higher Education Personnel (CAPES) (DeepEyes Project).

We thank the NVIDIA Corporation for the donation of the GeForce GTX Titan XP Pascal GPU used for this research.

References

  • [1] S. Du, M. Ibrahim, M. Shehata, and W. Badawy, “Automatic license plate recognition (ALPR): A state-of-the-art review,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 2, pp. 311–325, Feb 2013.
  • [2] C. Gou, K. Wang, Y. Yao, and Z. Li, "Vehicle license plate recognition based on extremal regions and restricted boltzmann machines," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1096–1107, April 2016.
  • [3] O. Bulan, V. Kozitsky, P. Ramesh, and M. Shreve, “Segmentation- and annotation-free license plate recognition with deep localization and failure identification,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 9, pp. 2351–2363, Sept 2017.
  • [4] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
  • [5] H. Li and C. Shen, “Reading car license plates using deep convolutional neural networks and LSTMs,” CoRR, vol. abs/1601.05610, 2016. [Online]. Available: http://arxiv.org/abs/1601.05610
  • [6] H. Li, P. Wang, and C. Shen, “Towards end-to-end car license plates detection and recognition with deep neural networks,” CoRR, vol. abs/1709.08828, 2017. [Online]. Available: http://arxiv.org/abs/1709.08828
  • [7] S. Z. Masood, G. Shu, A. Dehghan, and E. G. Ortiz, “License plate detection and recognition using deeply learned convolutional neural networks,” CoRR, vol. abs/1703.07330, 2017. [Online]. Available: http://arxiv.org/abs/1703.07330
  • [8] G. R. Gonçalves, S. P. G. da Silva, D. Menotti, and W. R. Schwartz, "Benchmark for license plate character segmentation," Journal of Electronic Imaging, vol. 25, no. 5, p. 053034, 2016.
  • [9] G. Ning, Z. Zhang, C. Huang, X. Ren, H. Wang, C. Cai, and Z. He, “Spatially supervised recurrent convolutional neural networks for visual object tracking,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
  • [10] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 446–454.
  • [11] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 6517–6525.
  • [12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 779–788.
  • [13] C. N. E. Anagnostopoulos, I. E. Anagnostopoulos, I. D. Psoroulas, V. Loumos, and E. Kayafas, “License plate recognition from still images and video sequences: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 3, pp. 377–391, Sept 2008.
  • [14] G. S. Hsu, J. C. Chen, and Y. Z. Chung, “Application-oriented license plate recognition,” IEEE Transactions on Vehicular Technology, vol. 62, no. 2, pp. 552–561, Feb 2013.
  • [15] A. H. Ashtari, M. J. Nordin, and M. Fathy, “An iranian license plate recognition system based on color features,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 4, pp. 1690–1705, Aug 2014.
  • [16] M. S. Sarfraz, A. Shahzad, M. A. Elahi, M. Fraz, I. Zafar, and E. A. Edirisinghe, “Real-time automatic license plate recognition for CCTV forensic applications,” Journal of Real-Time Image Processing, vol. 8, no. 3, pp. 285–295, Sep 2013.
  • [17] R. Panahi and I. Gholampour, “Accurate detection and recognition of dirty vehicle plate numbers for high-speed applications,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 4, pp. 767–779, April 2017.
  • [18] Y. Yuan, W. Zou, Y. Zhao, X. Wang, X. Hu, and N. Komodakis, “A robust and efficient approach to license plate detection,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1102–1114, March 2017.
  • [19] S. Azam and M. M. Islam, “Automatic license plate detection in hazardous condition,” Journal of Visual Communication and Image Representation, vol. 36, pp. 172 – 186, 2016.
  • [20] S. Montazzolli and C. R. Jung, “Real-time brazilian license plate detection and recognition using deep convolutional neural networks,” in 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images, Oct 2017, pp. 55–62.
  • [21] G. S. Hsu, A. Ambikapathi, S. L. Chung, and C. P. Su, “Robust license plate detection in the wild,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Aug 2017, pp. 1–6.
  • [22] M. A. Rafique, W. Pedrycz, and M. Jeon, “Vehicle license plate detection using region-based convolutional neural networks,” Soft Computing, Jun 2017.
  • [23] D. Menotti, G. Chiachia, A. X. Falcão, and V. J. O. Neto, “Vehicle license plate recognition with random convolutional networks,” in 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images, Aug 2014, pp. 298–303.
  • [24] P. Svoboda, M. Hradiš, L. Maršík, and P. Zemcík, “CNN for license plate motion deblurring,” in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 3832–3836.
  • [25] G. R. Gonçalves, D. Menotti, and W. R. Schwartz, “License plate recognition based on temporal redundancy,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Nov 2016, pp. 2577–2582.
  • [26] M. Donoser, C. Arth, and H. Bischof, “Detecting, tracking and recognizing license plates,” in Computer Vision – ACCV 2007, Y. Yagi, S. B. Kang, I. S. Kweon, and H. Zha, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 447–456.
  • [27] J. Redmon, “Darknet: Open source neural networks in C,” http://pjreddie.com/darknet/, 2013–2016.
  • [28] OpenALPR Cloud API, http://www.openalpr.com/cloud-api.html.
  • [29] L. Yang, P. Luo, C. C. Loy, and X. Tang, “A large-scale car dataset for fine-grained categorization and verification,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3973–3981.
  • [30] L. Dlagnekov and S. J. Belongie, Recognizing cars.   Department of Computer Science and Engineering, University of California, San Diego, 2005.